Hacker News Re-Imagined

Inserting One Billion Rows in SQLite Under a Minute

5 hours ago

296 points, posted by @todsacerdoti


@einpoklum 3 hours

Replying to @todsacerdoti

> Recently, I ran into a situation where I needed a test database with lots of rows and needed it fast. So I did what any programmer would do: wrote a Python script to generate the DB.

The author must have a rather low opinion of programmers... generating data with a script is good for 1,000 rows. Maybe 10,000 rows. 100,000 is probably stretching it. Beyond that the slowness will be meaningful.

Anyway, if the "test database" data is given in advance, you want an in-memory DBMS which can load data from files. If it isn't - use an in-memory DBMS which supports generated columns, or apply your random-number-generating function on a trivial column.

(MonetDB standalone/embedded would be a decent option; maybe the columnar storage mechanism in MariaDB? And of course there are commercial offerings.)

> Unfortunately, it was slow. Really slow. So I did what any programmer would do: went down the rabbit hole of learning more about SQLite...

That's the wrong approach. SQLite isn't called "lite" for nothing. It's not what you would use for bulk operations on large tables.

Read the SQLite FAQ: https://www.sqlite.org/whentouse.html

It oversells itself a bit, but even so, the only advantages it claims for data analysis are that it is "easier to install and use and the resulting database is a single file that can be written to a USB memory stick or emailed."


@siscia 4 hours

Creator of RediSQL / zeeSQL (https://zeesql.com/)

Insertion performance on a single table is very, very hard to optimize.

A single process looping it is your best bet.

I would just increase the batch size, which is the most influential factor.

Then another point... When you do batches, you loop over the rows:

    for row in batch:
        cursor.execute(insert_stmt, row)

You do not create one long, flattened list of parameters for a single statement.




@avinassh 4 hours

Hey all, author here. I didn't expect this to reach the front page of HN. I came here to submit it and it was here already! I am looking for more ideas to experiment with. Here's one idea someone on another forum gave me: exploring recursive queries and using SQLite's random functions to do the insertions.

Another crazy idea is to learn about SQLite file format and just write the pages to disk.


@mongol 2 hours

1 billion rows, 8GB RAM. I think there will be some limitation from available RAM as well.


@asicsp 4 hours

>The larger DB required more than 30 minutes to generate. So I spent 30+ hours to reduce the 30 mins of running time :p

Worth it.

>I could just run s the script again.

Found a typo.


@pksebben 16 minutes

I'm pretty new to this sort of optimization stuff, so forgive me if this is ridiculous for any reason, but I'm wondering about two things reading this:

1. Is it useful / feasible to set a sort of "lower bound" on these optimizations by profiling the raw write time on a set of a given size?

2. Assuming that writes are ultimately the limiting factor, could you gain performance by calculating the minimum batch size at which its write time intersects the in-memory calculation time, and, as soon as you hit that size, pop a thread to write that batch to disk, then return some async signal when the write is done to trigger the next batch of writes? Is it possible to "stitch together" these batches post-write?

edit: mobile formatting is le suck


@ThePadawan 4 hours

There are some cool ideas in there!

I wonder how much I could speed up my test suites (that don't rely on transaction rollbacks) by disabling journalling. Those milliseconds per query add up to seconds per tests and minutes per PR.


@jonnycomputer 4 hours

>The generated data would be random with following constraints: The area column would hold six digits area code (any six digits would do, no validation). The age would be any of 5, 10, or 15. The active column is either 0 or 1.

That is about 6 million combinations. This is not such a big space that it would be impractical to precompute all possible combinations. I wonder if hashing into such a precomputed table might help. It might not, if the overhead of maintaining the lookup table is too high.


@nzealand 1 hour

If you insert 4 rows into a table and join it on itself without any join criteria, you get a cartesian product of 16 rows (SELECT * FROM a, a).

Do that cartesian join three more times, and you have over 4 billion results.

Then you simply need to use the random function in conjunction with division, rounding and case statements to get the desired random numbers.


@orf 3 hours

How long does it take when using the native CSV import features?



@roman-holovin 5 minutes

INSERT INTO user (area, age, active) SELECT abs(random()) % 1000000, (abs(random()) % 3 + 1) * 5, abs(random()) % 2 FROM generate_series(1, 100000000, 1)

On my machine this is about 10% faster than the author's fastest implementation: 19 seconds versus 21 for 'threaded_batched'.


@arusahni 3 hours

[ @dang, please delete if/when you get the chance, thank you ]


@lmz 4 hours

I wonder if defining a virtual table (https://www.sqlite.org/vtab.html) and just doing

INSERT INTO t (...) SELECT ... from virtual_table

would be any faster.


@bob1029 4 hours

You can go even faster if you can organize your problem such that you simply start from a copy of a template SQLite database each time (it is just a file/byte sequence).

We do SQL evaluation for a lot of business logic throughout our product, and we have found that starting from a template database (i.e. one with the schema predefined and canonical values populated) can save a lot of time when working in tight loops.


@perlgeek 4 hours

I wonder if something like https://paulbradley.org/sqlite-test-data/ could be used to generate the test data in sqlite3 directly, and if it would be any faster.

(It would likely be faster than for other DB engines, because there is network overhead in the communication between the program and the DB; no such thing for SQLite.)


@kebman 2 hours

Mini summary: Do not use this in production. xD


@svdr 3 hours

I thought one billion was 1000M (and not 100M)?


@mpweiher 1 hour


I was actually just working on SQLite speed for Objective-S (http://objective.st), partly as a driver for getting some of the more glaring inefficiencies out.

Using a "to do list" schema, I currently get the 100M rows out in 56 seconds, which is around half the speed of the Rust example given here, but 3 times the speed of PyPy and almost 10x faster than Python.

This is from an interpreted script that not only does the inserts and creates the objects to insert in the first place, but also defines the actual class.

The lower-level code is written in Objective-C, like the rest of Objective-S.

Class definition:

    class Task {
      var <int> id.
      var <bool> done.
      var <NSString> title.
      -description { "<Task: title: {this:title} done: {this:done}>". }
      +sqlForCreate {
          '( [id] INTEGER PRIMARY KEY, [title] NVARCHAR(220) NOT NULL, [done] INTEGER );'.
      }
    }

Code to insert a computed array of tasks 10 times:

     1 to:10 do: {
         this:tasksTable insert:taskList.
     }


@mimir 4 hours

Database optimization posts are always interesting, but it's really hard to do any apples to apples comparison. Your performance is going to depend mostly on your hardware, internal database settings and tunings, and OS level tunings. I'm glad this one included some insight into the SQLite settings disabled, but there's always going to be too many factors to easily compare this to your own setup.

For most SQL systems, the fastest way to do inserts is always just going to be batched inserts. There may be some extra tricks to reduce network costs and optimize batches [0], but at its core you are still essentially inserting into the table through the normal insert path. Beyond that, you can basically only try to reduce the amount of work done on the DB side per insert, or tune your OS for your workload.

Some other DB systems (more common in NoSQL) let you actually do real bulk loads [1] where you are writing direct(ish) database files and actually bypassing much of the normal write path.

[0] https://dev.mysql.com/doc/refman/5.7/en/insert-optimization.... [1] https://blog.cloudera.com/how-to-use-hbase-bulk-loading-and-...


@MobiusHorizons 2 hours

I would recommend looking into dtrace or other observability tools to figure out what to optimize next. So far you have basically had to guess where the slowness is and optimize that.

Lastly, how long does it take to make a copy of an already prepared 1B-row SQLite file? That seems easier than generating a new one.


@mirekrusin 4 hours

If it's a single, in-memory table, is there really a need to use a database? Won't the language's built-in data structures suffice?


@pachico 1 hour

I've been dealing with lots of data in SQLite and my takeaways are:

- the language has very little to do with it, since the bottleneck will most likely be the way you insert the data

- prepared statements are indeed useful, but performance didn't change much once I used long transactions and committed every few thousand rows

- having lots of rows in your table is fine, but certain queries, like aggregations over many rows, are not what SQLite is great at

- ClickHouse can easily ingest that much and more on a laptop, without even needing a scripting language


@bilekas 2 hours

> Threads / async may not be faster always

This is actually a little bit mind blowing for me. I'm gonna go and play with this. But what a cool read!


@elephantum 49 minutes

I bet that this is mostly a random.randint benchmark and not a SQLite one.

It would also be interesting to see how a batched numpy version compares to Rust.


@IfOnlyYouKnew 1 hour

I ran this and got the following results:

      /fast-sqlite3-inserts (master)> time make busy-rust
      Sun Jul 18 17:04:59 UTC 2021 [RUST] busy.rs (100_000_000) iterations
      real 0m9.816s
      user 0m9.380s
      sys 0m0.433s
      Executed in    9.92 secs    fish           external
      usr time    9.43 secs    0.20 millis    9.43 secs
      sys time    0.47 secs    1.07 millis    0.47 secs
      fast-sqlite3-inserts (master)> time make busy-rust-thread
      Sun Jul 18 17:04:48 UTC 2021 [RUST] threaded_busy.rs (100_000_000) iterations
      real 0m2.104s
      user 0m13.640s
      sys 0m0.724s
      Executed in    2.33 secs    fish           external
      usr time   13.68 secs    0.20 millis   13.68 secs
      sys time    0.78 secs    1.18 millis    0.78 secs
I'm probably doing something wrong. Or am I actually getting the pace needed for the billion?

This is on a M1 MacBook Air.


@rmetzler 4 hours

Is PRAGMA temp_store = MEMORY the same as putting the file on a tmpfs?



site design / logo © 2021 Box Piper