His code, which previously accessed the files directly in the directory, now takes many times longer to run when it accesses 'files' through the SQLite database. It turned out that the SQLite database was being queried by filename (which was one of the columns), but that column was not indexed. Since there was no index, SQLite was doing a full table scan on every 'open file' query.
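(As an aside, the effect is easy to see with `EXPLAIN QUERY PLAN`; the table and column names below are made up for illustration:)

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE files (filename TEXT, contents BLOB)")

    # Without an index on filename, the plan is a full table scan
    print(con.execute("EXPLAIN QUERY PLAN "
                      "SELECT contents FROM files WHERE filename = ?",
                      ("a.txt",)).fetchall())    # ... 'SCAN files'

    # With an index, the same query becomes an index search
    con.execute("CREATE INDEX files_by_name ON files(filename)")
    print(con.execute("EXPLAIN QUERY PLAN "
                      "SELECT contents FROM files WHERE filename = ?",
                      ("a.txt",)).fetchall())    # ... 'SEARCH files USING INDEX files_by_name'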
I would like to respond to both points, since this is at least partially in my wheelhouse (thanks to my day job).
Mr. Erz, you are correct that there is no single drive that can hold 281 TB. But, even with macOS, it would be possible to directly connect enough storage to hold 281 TB. For example, starting with a Mac with Thunderbolt 3, you could…
• Connect via Thunderbolt to an ATTO ThunderLink SH 3128 SAS adapter
• Connect via two SAS cables to a Colfax CX42412s-JBOD 24-drive JBOD, containing 24 HGST 0F38357 16-TB SAS drives.
You can then use macOS RAID Assistant to format the drives as RAID 0, giving you ~384 TB (~349 TiB) of raw capacity, which should be enough to hold 281 TB.
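(For anyone who wants to double-check the decimal-TB versus binary-TiB arithmetic:)

    # drives are sold in decimal terabytes (10**12 bytes)
    drives, tb_per_drive = 24, 16
    raw_bytes = drives * tb_per_drive * 10**12
    print(raw_bytes / 10**12)   # 384.0 TB
    print(raw_bytes / 2**40)    # ~349.2 TiB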
I should say, it would be very unlikely that someone would actually do this. Instead, that JBOD would probably be connected to a server, which would then serve those ~349 TiB (raw) to your computer over a network protocol like SMB. There are also network filesystems like Lustre, which use clusters of servers, each with its own direct-attached storage. In such an environment, software running on the clients takes commands like `list directory`, connects to and queries the relevant servers, and returns the results. My colleague Stéphane runs a service (Oak, https://uit.stanford.edu/service/oak-storage) which uses Lustre, and has a capacity of many petabytes.
There may be other reasons why having a single 281 TB SQLite database is a bad idea, but that is out of scope in this case.
Next, a side comment on storing 600,000 files in a single directory. On the systems I run at work, such a directory would also take a long time to list. That is common for most environments where the `list contents of directory` operation is synchronous; the client (you) will have to wait for the OS to gather the information and organize it. Languages like Python also see this issue; it is the reason why `os.scandir` is often better to use than `os.listdir` (see https://stackoverflow.com/questions/59749854/how-does-os-lis...).
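(Roughly, the difference is that `os.scandir` yields `DirEntry` objects carrying file-type information already returned by the directory read, and it iterates lazily, so you usually avoid an extra `stat()` per entry:)

    import os

    # os.listdir: bare names; any type check costs an extra stat() per file
    names = [n for n in os.listdir(".") if os.path.isfile(n)]

    # os.scandir: DirEntry.is_file() can usually answer from cached information
    with os.scandir(".") as entries:
        files = [e.name for e in entries if e.is_file()]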
Finally, storing 'files' into a SQLite database. SQLite actually ships a tool for making "SQLite archive" files, which are what you describe: A SQLite database with a single table, containing each file's contents in a column. The `sqlar` schema is described in https://www.sqlite.org/sqlar.html, and it addresses the concern you raised in your article: The `sqlar` schema has the filename as a primary key. Doing so does not eliminate the integer row ID, but it does automatically create an index on the filename; it also ensures that filenames are unique.
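To sketch it (this is my rough rendering of the table described on that page, using Python's sqlite3 module; the comments are mine):

    import sqlite3

    con = sqlite3.connect("archive.sqlar")
    con.execute("""CREATE TABLE IF NOT EXISTS sqlar(
        name  TEXT PRIMARY KEY,  -- filename; the PRIMARY KEY is what gives you the index
        mode  INT,               -- access permissions
        mtime INT,               -- last modification time
        sz    INT,               -- original (uncompressed) size
        data  BLOB               -- file contents, compressed when that saves space
    )""")

    # A lookup by name goes through the primary-key index, not a table scan
    print(con.execute("EXPLAIN QUERY PLAN SELECT data FROM sqlar WHERE name = ?",
                      ("some/file.txt",)).fetchall())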
`sqlar` also has another benefit, something you wanted: The file contents are stored in compressed form. The database itself is not compressed, but each stored file's contents are.
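If I remember right, the reference `sqlar` tool compresses with zlib's deflate, and stores the data uncompressed only when compression would not make it smaller. A rough Python equivalent for adding one file (my own helper, not the official tool) could look like this:

    import os, zlib

    def add_file(con, path):
        # Store one file in the sqlar table created above, compressed when it helps
        raw = open(path, "rb").read()
        packed = zlib.compress(raw)
        st = os.stat(path)
        con.execute("REPLACE INTO sqlar(name, mode, mtime, sz, data) "
                    "VALUES (?, ?, ?, ?, ?)",
                    (path, st.st_mode, int(st.st_mtime), len(raw),
                     packed if len(packed) < len(raw) else raw))
        con.commit()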
In summary, you are correct that no single hard drive would store so much data, but even on macOS it is (theoretically) possible to have that much data directly connected, and it is practically possible with an SMB connection to a storage server, not to mention network file system platforms like Lustre. As for SQLite, the problem of not indexing is a vexing one, and a problem that database developers will always encounter (cf. https://news.ycombinator.com/item?id=31170370; for this context, sharding may be thought of as another form of indexing). But please do not "throw the baby out with the bath water". The SQLite team themselves document and implement a schema you could have used. I ask you to reëvaluate your dislike of SQLite, and if you have some time, maybe try again, using the `sqlar` schema this time :-)
Though in this case, I believe both DCA Tower and Approach were aware of this operation, so the pilots are OK; it's just that the Capitol Police didn't get the word. I'm certain they have radar (or at least a radar feed), and so would have been justifiably concerned.
The author is at the University of Copenhagen (per the Google Scholar link above), so it's entirely possible that at least some of the funding for his employment is coming from sources that use citation counts as an indicator that they are "getting their money's worth" by continuing to fund (at least part of) Mr. Tange's employment.
> == Is the citation notice compatible with GPLv3? ==
> Yes. The wording has been cleared by Richard M. Stallman to be compatible with GPLv3. This is because the citation notice is not part of the license, but part of academic tradition.
If you are of the view that clearance by RMS counts as clearance by the FSF/GNU, then that is how they find it acceptable. If you take a different view, then the next part of that section applies:
> If you disagree with Richard M. Stallman's interpretation and feel the citation notice does not adhere to GPLv3, you should treat the software as if it is not available under GPLv3. And since GPLv3 is the only thing that would give you the right to change it, you would not be allowed to change the software.
There's also an interesting comparison to be made:
> == How do I silence the citation notice? ==
> Run this once:
> parallel --citation
> It takes less than 10 seconds to do and is thus comparable to an 'OK. Do not show this again'-dialog box seen in LibreOffice, Firefox and similar programs.