Yes, I do. Really.
And if current mmap() implementations aren't up to the task, can we fix mmap()?
RavenDB's response to this paper: https://ayende.com/blog/196161-C/re-are-you-sure-you-want-to...
Why settle for errno when you can have a segfault?
I worked at a company that developed its own proprietary database for a financial application. The entire database, several gigabytes, not large by today's standards, was mmap'd and read at startup to warm the page cache. We also built in-memory "indexes" at startup.
This was back in the early 2000s, when having 4 gigabytes of RAM was considered large. The "database server" was single threaded, and all changes were logged to a WAL-ish file before updating the mmap'd database. It was fun stuff to work on. It worked well, but it wasn't a general purpose DB.
Check out the creative use of emoji in the running header!
I have a great deal of experience in running very large memory-mapped databases using LMDB.
The default Linux settings for memory-mapped files are pretty horrible. The observed poor performance is directly related to not configuring several very important kernel parameters.
Anyone have any clue why there's a 3-phase sine wave showing up in mmap performance? (Figures 2a/2b)
One possible advantage of using mmap over a buffer pool can be programmer ergonomics.
Reading data into a buffer pool in process RAM takes time to warm up, and the pool can only be accessed by a single process. In contrast, for an mmap-backed data structure, assuming that files are static once written (which can be the case for a multi-version concurrency control (MVCC) architecture), you can open a read-only mmap connection from any process and, so long as the data is already in the OS cache, get instant fast reads. This makes managing database connections much easier, since connections are cheap and the programmer can open as many as they want, whenever and wherever they want.
It is true that the cache eviction strategy used by the OS is likely to be suboptimal. So if you're in a position to only run a single database process, you might decide to make different tradeoffs.
The pragmatic consideration that usually influences the decision to use mmap() is the large discontinuity in skill and expertise required to replace it. Writing your own alternative to mmap() can be significantly superior in terms of performance and functionality, and often lends itself to a cleaner database architecture. However, this presumes a sufficiently sophisticated design for an mmap() replacement. The learning curve is steep and the critical nuances of sophisticated and practical designs are poorly explored in readily available literature, providing little in the way of "how-to" guides that you can lean on.
As a consequence, early attempts to replace mmap() are often quite poor. You don't know what you don't know, and details of the implementation that are often glossed over turn out to be critical in practice. For example, most people eventually figure out that LRU cache replacement is a bad idea, but many of the academic alternatives cause CPU cache thrashing in real systems, replacing one problem with another. There are clever and non-obvious design elements that can greatly mitigate this but they are treated as implementation details in most discussions of cache replacement and largely not discoverable if you are writing one for the first time.
While mmap() is a mediocre facility for a database, I think we also have to be cognizant that replacing it competently is not a trivial ask for most software engineers. If their learning curve is anything like mine, I went from mmap() to designing obvious alternatives with many poorly handled edge cases, and eventually figured out how to design non-obvious alternatives that could smoothly handle very diverse workloads. That period of "poor alternatives" in the middle doesn't produce great databases, but it almost feels necessary to properly grok the design problem. Most people would rather spend their time working on other parts of a database.
Choosing mmap() gets you something that works sooner rather than later.
But then you have a pile of blocking-style synchronous code, likely exploiting problematic assumptions, to rewrite when you realize you want something that doesn't just work, but works well.
Interesting parallels in this work to Tanenbaum's "RPC Considered Harmful"†; in both cases, you've got an abstraction that papers over a huge amount of complexity, and it ends up burning you because a lot of that complexity turns out to be pretty important and the abstraction has cost you control over it.
Most of the times I used mmap I wasn't happy in the end.
I went through a phase when I thought it was fun to do extreme random access on image files, archives, and things like that. At some point I'd think "I want to do this for a file I fetch over the network," and that needed a rewrite.
How do you implement lockless atomic updates for multiple writers across multiple threads & processes without mmap?
With mmap it is straightforward for processes to open persistent arrays of atomics as a file, and use compare-and-exchange operations to prevent data races when multiple threads or processes update the same page, without any file locks, advisory locks, or mutexes.
With manual read() and write() calls, the data may be overwritten by another writer before the update is committed.
I am convinced. Great video.
This is a great write-up!
Makes me wonder if there is an alternative universe in which there is a syscall with semantics similar to mmap that avoids these pitfalls. It's not like mmap's semantics are the only semantics that we could have for memory-mapped IO.
Do they mean: mmap() ?