
Data Science at the Command Line, 2nd Edition (2021)

https://jeroenjanssens.com/dsatcl/
This is a great book; there are a few tools I would add.

Datasette, clickhouse-local (the ClickHouse CLI), and DuckDB.
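
As a minimal sketch of what these look like in practice (events.csv and events.db are hypothetical files with a region column):

    # DuckDB: run SQL directly against a CSV and exit
    duckdb -c "SELECT region, count(*) FROM 'events.csv' GROUP BY region"

    # clickhouse-local: same idea using ClickHouse's engine
    clickhouse local -q "SELECT region, count(*) FROM file('events.csv') GROUP BY region"

    # Datasette: serve a SQLite database as a browsable web UI and JSON API
    datasette serve events.db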

I think ripgrep is a big omission: ripgrep | xargs jq and find -exec jq are two of my most common data science workflows, because you can get stuff done in a few minutes. One example of where I use this is quickly debugging Infrastructure as Code that is generated for many regions and AZs (a sketch below).
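
A hedged sketch of that workflow; the directory, search pattern, and jq filters here are made up for illustration:

    # list generated templates that mention a resource, then pull a field out of each
    rg -l 'aws_subnet' generated/ | xargs jq '.availability_zone'

    # the find -exec variant of the same idea
    find generated/ -name '*.json' -exec jq 'select(.region == "us-east-1")' {} +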

Another one I like in this space is Bioinformatics Data Skills. For some reason bioinformaticians use CLI workflows a lot, and this book covers a lot of good material for those who are just starting out: tmux, make, git, ssh, background processes.

Two other techniques I like. The first is git-scraping (tracking changes to data over time, or just saving snapshots of your data to git so you can diff them): https://simonwillison.net/2020/Oct/9/git-scraping/ I most recently used this technique to diff changes to build artifacts over time.
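
A minimal git-scraping sketch, assuming a repo checkout and a hypothetical JSON endpoint, run periodically from cron:

    #!/bin/sh
    # fetch the current data; commit a snapshot only when it actually changed
    cd /path/to/scrape-repo || exit 1
    curl -sSf https://example.com/data.json -o data.json
    git add data.json
    # `git diff --staged --quiet` exits non-zero when staged changes exist
    git diff --staged --quiet || git commit -m "snapshot: $(date -u +%Y-%m-%dT%H:%M:%SZ)"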

The second is not really CLI-related per se, but I really like the HTTP range query technique of hosting data: https://news.ycombinator.com/item?id=27016630 There are simple ways to use this idea (like cooking up a quick h2o web server config) to host data quickly.
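
For a feel of the mechanism, this is what a range request looks like from the client side (the URL is hypothetical); the linked post builds on-demand SQLite page reads on top of requests like these:

    # fetch only the first 4 KiB of a large remote file
    curl -s -H 'Range: bytes=0-4095' https://example.com/big.sqlite3 -o first-page.bin

    # equivalent using curl's --range shorthand
    curl -s -r 0-4095 https://example.com/big.sqlite3 -o first-page.bin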

I also like the Makefile data pipeline idea. I believe the technique is described in the book, but I first heard of it from this HN comment: https://news.ycombinator.com/item?id=18896204 The basic idea is that you use make to orchestrate the steps of your command-line data science workflow and let make figure out when your intermediate data needs to be regenerated. A good example is this map-reduce with make: https://www.benevolent.com/news-and-media/blog-and-videos/ho...
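
A minimal sketch of the idea (filenames and commands are placeholders). make compares file timestamps, so editing raw.csv regenerates everything downstream, while re-running make on unchanged inputs does nothing:

    # Makefile: raw.csv -> clean.csv -> report.txt
    # (recipe lines must start with a tab)
    report.txt: clean.csv
    	awk -F, '{ sum += $$2 } END { print "total:", sum }' clean.csv > report.txt

    clean.csv: raw.csv
    	grep -v '^#' raw.csv | sort -u > clean.csv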
