Hacker News Re-Imagined

One-liner for running queries against CSV files with SQLite

  • 747 points
  • 15 days ago
  • Posted by @jdblair


@throwaway892238 15 days

Replying to @jdblair 🎙

  sqlite3 :memory: -cmd '.mode csv' ...

It should be a war crime for programs in 2022 to use non-UNIX/non-GNU style command-line options. Add it to the Rome Statute's Article 7 list of crimes against humanity. Full-blown tribunal at The Hague, presided over by the International Criminal Court. Punishable by having to use Visual Basic 3.0 for all programming for the rest of their life.

Reply


@alana314 15 days

Replying to @jdblair 🎙

I like using q for querying CSVs on the command line: https://github.com/harelba/q
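
A minimal example, from my memory of q's flags (-H reads the header row, -d sets the delimiter), so treat the exact switches as an assumption:

  q -H -d ',' "SELECT passenger_count, COUNT(*) FROM ./taxi.csv GROUP BY passenger_count"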

Reply


@kbouck 14 days

Replying to @jdblair 🎙

> sqlite3 :memory: -cmd '.mode csv' ...

csv mode can also be set with a switch:

    sqlite3 :memory: -csv
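
So the article's full pipeline can be shortened to something like:

  sqlite3 :memory: -csv -cmd '.import taxi.csv taxi' \
    'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count'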

Reply


@valw 15 days

Replying to @jdblair 🎙

Btw, am I alone in thinking that DataFrame abstractions in OOP languages (like Pandas in Python) are oftentimes simply inferior to relational algebra? I'm not sure that many Data Scientists are aware of the expressive power of SQL.
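
For example, a grouped window aggregate is a single declarative statement. A sketch, reusing the taxi.csv columns from the article (so the column names are assumptions):

  sqlite3 :memory: -csv -cmd '.import taxi.csv taxi' \
    'SELECT passenger_count, total_amount,
            AVG(total_amount) OVER (PARTITION BY passenger_count) AS group_avg
     FROM taxi LIMIT 5'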

Reply


@practice9 15 days

Replying to @jdblair 🎙

In the past I've used https://github.com/BurntSushi/xsv to query some large CSVs
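
For instance, if I remember the subcommands right, building an index first speeds up everything that follows on multi-GB files:

  xsv index big.csv
  xsv stats big.csv | xsv table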

Reply


@ArchD 15 days

Replying to @jdblair 🎙

If you need to query against multiple CSVs, e.g. using joins, you could use QHS: https://github.com/itchyny/qhs
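
I believe the syntax follows q, with file names acting as table names, so a join looks something like this (treat the exact form as an assumption):

  qhs "SELECT a.id, b.name FROM a.csv a JOIN b.csv b ON a.id = b.id"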

Reply


@tgtweak 15 days

Replying to @jdblair 🎙

This is far more useful for SQL users than chaining several sed/awk/sort commands in a pipe (although it's a bit against *nix principles).
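
For comparison, the same aggregation with the classic tools is something like this (assuming passenger_count is column 1 and total_amount is column 11, and skipping the header row):

  awk -F, 'NR>1 {n[$1]++; s[$1]+=$11} END {for (k in n) print k, n[k], s[k]/n[k]}' taxi.csv | sort -n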

Reply


@mrfusion 15 days

Replying to @jdblair 🎙

How smart is SQLite at detecting column types from CSV data?

I once wrote a Python script to load csv files into SQLite. It had a whole hierarchy of rules to determine the data type of each column.
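
From what I've seen, the sqlite3 shell's .import does no detection at all: when it creates the table, every column comes in as TEXT. A quick way to check:

  sqlite3 :memory: -csv -cmd '.import taxi.csv taxi' \
    'SELECT typeof(passenger_count), typeof(total_amount) FROM taxi LIMIT 1'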

Reply


@rgovostes 15 days

Replying to @jdblair 🎙

SQLite's virtual table API (https://www.sqlite.org/vtab.html) makes it possible to access other data structures through the query engine. You don't need to know much, if anything, about how the database engine executes queries; you only need to implement the callbacks it needs to do its job. A few years ago I wrote an extension to let me search through serialized Protobufs which were stored as blobs in a regular database.

https://github.com/rgov/sqlite_protobuf
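
Once the extension is loaded, queries look like ordinary SQL. A sketch (the function name and arguments are my recollection of the repo's README, so treat them as assumptions):

  sqlite3 app.db -cmd '.load ./libsqlite_protobuf' \
    "SELECT protobuf_extract(data, 'Person', '$.name') FROM messages"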

Reply


@kitd 15 days

Replying to @jdblair 🎙

I had to do something very similar for analysing CVE information recently, but I don't remember having to use the :memory: option. I suspect it defaults to that if no .db file is specified.

Slightly tangentially, when doing aggregated queries, SQLite has a very useful group_concat(..., ',') function that will concatenate the expression in the first arg for each row in the group, separated by the separator in the 2nd arg.

In many situations SQLite is a suitable alternative to jq for simple tabular JSON.
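
For example, with the article's taxi data (payment_type is an assumed column name):

  sqlite3 -csv -cmd '.import taxi.csv taxi' \
    "SELECT passenger_count, group_concat(payment_type, ';') FROM taxi GROUP BY passenger_count"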

Reply


@ultrasounder 15 days

Replying to @jdblair 🎙

This comes at the most opportune time, as I'm cranking through selectstarsql, sqlbolt, schemaverse, and pgexercises to bone up on my SQL skills for my upcoming data engineering interviews. SQL IS the secret superpower that devs don't know they possess.

Reply


@gigatexal 15 days

Replying to @jdblair 🎙

I’ve been doing this. But I hate it. CSVs need to die. They’re a terrible data format. But here we are. And SQLite makes things amazing.

Reply


@swhalemo 14 days

Replying to @jdblair 🎙

I'm genuinely curious about the use case for this. Whenever I have CSVs, I either import them into R/Python if they're small, or into some DBMS first if they're large. Can somebody tell me what niche this is filling?

Reply


@a-dub 15 days

Replying to @jdblair 🎙

I used to daydream about adding an ALLOW DOING IT LIVE option to Cassandra's CQL client. In the event that your WHERE clause was incompatible with your table's key, it would just wholesale dump the table in question into a SQLite database while indexing the appropriate columns, run the query in question, actually return the result, and then properly clean up.

Reply


@adamgordonbell 15 days

Replying to @jdblair 🎙

I've become a fan of using sqlite-utils to work with CSV or JSON files.

It's a two-step process though: one command to create a DB and insert into it, and a second to select from it and return results.

https://sqlite-utils.datasette.io/en/stable/index.html
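
A minimal sketch of the two steps (file and table names are just examples):

  sqlite-utils insert data.db taxi taxi.csv --csv
  sqlite-utils query data.db "SELECT passenger_count, COUNT(*) FROM taxi GROUP BY passenger_count"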

Reply


@blacksqr 15 days

Replying to @jdblair 🎙

Sqawk: an Awk-like program that uses SQL. It can parse, format, filter, and combine data from multiple files. Powered by SQLite.

https://wiki.tcl-lang.org/page/Sqawk

Reply


@adren67 15 days

Replying to @jdblair 🎙

Miller is also a great tool for doing computations and transformations on CSV/TSV/JSON files...

Here is how to pretty-print the same result with mlr:

  mlr --icsv --opprint --barred stats1 -a count,mean -f total_amount -g passenger_count then sort -f passenger_count taxi.csv

Reply


@spapas82 15 days

Replying to @jdblair 🎙

Can somebody post the equivalent in Windows cmd?

Reply


@dotancohen 15 days

Replying to @jdblair 🎙

Lately I've been using VisiData for any text file that looks like a table or other squarish data source, including JSON.

https://www.visidata.org/

Reply


@keybored 15 days

Replying to @jdblair 🎙

   MODE is one of:
     ascii     Columns/rows delimited by 0x1F and 0x1E

Yes!

Reply


@chrisweekly 15 days

Replying to @jdblair 🎙

One of my all-time favorite (and somehow still-obscure / relatively unknown) tools is called `lnav` ^1. It's a mini-ETL power tool with embedded SQLite, perfect for wrangling log files or other semi-structured data (a few million rows are no problem), and it's intuitive and flexible...

1. https://lnav.org

Reply


@flusteredBias 15 days

Replying to @jdblair 🎙

I am a data scientist. I have used a lot of tools/libraries to interact with data. SQLite is my favorite. It is hard to beat the syntax/grammar.

Also, when I use SQLite I do not output using column mode. I pipe to `tv` (tidy-viewer) to get pretty output.
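
For example, following the article's one-liner:

  sqlite3 :memory: -csv -cmd '.import taxi.csv taxi' \
    'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count' | tv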

https://github.com/alexhallam/tv

transparency: I am the dev of this utility

Reply


@mytherin 15 days

Replying to @jdblair 🎙

Using DuckDB [1]:

  duckdb -c "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY ALL"

DuckDB will automatically infer that you are reading a CSV file from the extension, then infer the column names from the header, together with various other CSV properties (data types, delimiter, quote type, etc.). You don't even need to quote the table name, as long as the file is in your current directory and the file name contains no special characters.

DuckDB uses the SQLite shell, so all of the commands that are mentioned in the article with SQLite will also work for DuckDB.

[1] https://github.com/duckdb/duckdb

Disclaimer: Developer of DuckDB

Reply


@avogar 15 days

Replying to @jdblair 🎙

With ClickHouse you can also process local files in one line, using the clickhouse-local command tool. And it looks a lot simpler:

  clickhouse local -q "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM file(taxi.csv, 'CSVWithNames') GROUP BY passenger_count"

And ClickHouse supports a lot of different file formats, both for import and export (you can see all of them here: https://clickhouse.com/docs/en/interfaces/formats/).

There is an example of using clickhouse-local with the taxi dataset mentioned in the post: https://colab.research.google.com/drive/1tiOUCjTnwUIFRxovpRX...

Reply


@Uptrenda 15 days

Replying to @jdblair 🎙

I'm looking through this guy's website for 'today I learned' posts, and at first I'm impressed by how many of them there are. But then I start thinking: when you're trying to solve a problem, you search through a lot of material. None of his posts are attributed. He's getting all his information from somewhere, and then he goes and posts these articles, just ripping off other sources.

I can understand when it's based on your original work, but this website reads more like basic questions posted on Stack Overflow. E.g. "how to connect to a website with IPv6." Tell me he didn't just Google that and post the result. 0/10

Reply


@mattewong 15 days

Replying to @jdblair 🎙

Over all these CLIs, I prefer zsv (https://github.com/liquidaty/zsv) -- then again, I wrote it, so my preference should come as no surprise. On my Mac it was 340% faster than the OP's command, and it does a lot more than just SQL/sqlite3-related operations.

Reply


@neycmrtn 15 days

Replying to @jdblair 🎙

There is also our somewhat older sqlet.py at http://www.sqlet.com, with multiple input files, column index shortcuts... (and probably overdue for an update).

Reply


@cube2222 15 days

Replying to @jdblair 🎙

Since many people are sharing one-liners with various tools...

OctoSQL[0]:

  octosql 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi.csv GROUP BY passenger_count'

It also infers everything automatically and typechecks your query for errors. You can use it with CSV, JSON, and Parquet, but also Postgres, MySQL, etc. All in a single query!

[0]:https://github.com/cube2222/octosql

Disclaimer: author of OctoSQL

Reply


@eli 15 days

Replying to @jdblair 🎙

The .import command used for actually loading the CSV is kinda picky about your CSVs being well-formatted. I don't think it supports embedded newlines at all.

Reply


@ttyprintk 15 days

Replying to @jdblair 🎙

You can also prefix the filename in a SQLite .import command with |, in which case the rest is run as a shell command whose output (hopefully text) is imported. At the system level, some import scripts can live entirely in one SQL file.
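
A sketch of what that enables, if your sqlite3 is new enough (the URL is just a placeholder):

  sqlite3 :memory: -csv -cmd '.import "|curl -s https://example.com/data.csv" t' \
    'SELECT COUNT(*) FROM t'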

Reply


@wilsonfiifi 15 days

Replying to @jdblair 🎙

Another great tool written in Go is CSVQ [0][1] that can be used as a command line or a library.

  csvq 'select id, name from `user.csv`'

[0] https://github.com/mithrandie/csvq

[1] https://mithrandie.github.io/csvq/

Reply

