Using Janet as a Database

You must clone this repo and read it alongside this article; otherwise the article won’t make much sense.

Two branches of the repo are of importance.

  • branch sqlite: the version that uses SQLite3 for persistence
  • branch janet: the version that uses Janet for persistence

A few days ago, I was trying out writing an HTTP application in Zig, with htmx. (Unrelated: htmx is very good.)

When I added persistence to the project, I chose SQLite at first. It worked. However, the SQLite-interfacing code felt too “loose and moving” within the project. It felt flaky.

In pursuit of mathematical soundness, I swapped out SQLite for Janet. The result pleases me, so I wrote about it here.

Here’s how I used Janet, the programming language, as the database of my toy application.

Query and Data Model

When using Janet as a database, the programming language itself is the query language.

Since the application is a simple counter, I used (def stuff @{:counter 0}). Note the @: a bare {} literal is an immutable struct in Janet, while @{} is a mutable table that can be updated in place.

To get the data, I used (stuff :counter). To set the data, I used (set (stuff :counter) 1).

If the data model of your application is more complex, then I highly recommend reading the official Janet documentation. I also wrote another article describing how to mix Zig data and Janet data.

Persistence

To save data, I simply used the POSIX file-system.

First, I used the Zig API janet.marshal to serialize stuff to a string. The string is then saved to a file with the following steps:

  • create the file database~ with O_EXCL (man 2 open), which acts as an exclusive lock
  • write the data
  • fsync (man 2 fsync)
  • rename the file to database (man 2 rename); this step is atomic

P.S. On Linux, fdatasync skips metadata such as timestamps, but it does flush a change to the file size, since that metadata is needed to read the data back (see man 2 fdatasync).

To read the data, I did the opposite with janet.unmarshal.

The most important difference between Janet and an SQL database is that you can serialize cyclic data and even functions! By using direct references to objects, we avoid SQL JOIN hell. In a sense, Janet is a network database.

In the Janet REPL, the database can be loaded from disk with (unmarshal (slurp "database")). This is useful for debugging, or for playing around with the data.

Why use SQLite then

Persistence performance. I haven’t benchmarked this yet, but I assume that SQLite is faster, since it doesn’t write the whole database to disk every time it needs to persist something.

Query performance. I don’t believe this is a problem. Many people write network-facing applications in Python, and Janet is about as fast as CPython, so it should be OK. If I hit a performance bottleneck here, I would rather use BQN or Polars than SQL, or rewrite that part in Zig.

Type checking. SQL has built-in type checks. Janet can do this too with built-in functions, and its data model is not restricted to relational algebra.

What can be improved

To save disk space, a key-value database could be used instead of storing many files smaller than 4 kB. Judging from the source code of Python’s shelve module, shelve uses dbm (typically gdbm) for backing storage, which is a key-value database.

To store cyclic data without pain, we need GC. The same applies to databases: a graph database that automatically deletes a node once its last edge is removed is effectively doing GC. With some effort, maybe I can hack Janet and dbm together so that they share the same GC.

With Zig’s explicit allocator design and compile-time code execution, it is possible to “relocate” nested structures into one contiguous place by replacing pointers with relative offsets. It works like a moving GC, but without the GC. With some effort, maybe I can add cyclic support to s2s. The downside, of course, is the potential footgun of moving memory around.

Finishing Thoughts

For the past few years, there has been a popular business practice of making a new database out of nowhere and calling it something new. Here are some examples:

  • create a new query language that is better than SQL but still bad, and implement a new database (surreal)
  • add a new algorithm to SQL, and implement a new database (one of those “machine learning” databases)
  • use Prolog syntax for relational algebra, and implement a new database (cozo)

Since I have an affinity for data, those “products” are very confusing to me. Here’s what I observed in practice:

  • the fastest way to store and process data is to keep it in memory and process it with compiled code (Rust, Zig, whatever)
  • using a key-value database like lmdb needs ~10x the time of the above
  • using a relational database needs ~10x the time of that
  • if the system needs to fetch data across the network, there is relatively no performance left to speak of

I don’t know where Janet fits on this performance scale, as I haven’t used it as a database before.

Again, I am in awe of how useful Janet is. It is the same feeling I got when I first learned that a Lua file can act as a configuration file – “code as data”.

sync_file_range(2) on Linux (Update: 2024-07-05)

A friend had trouble with a database, which prompted me to review the syscalls regarding file writes again.

The default behavior of file writes on POSIX systems is as follows:

  • If you call fdatasync(2), the data is written to disk immediately
  • Otherwise, the data is written to disk eventually

How late is “eventually”? I don’t know. The kernel provides a consistent view of files, so fdatasync(2) is not necessary for processes to understand each other. fdatasync(2) is for when the kernel unexpectedly exits (a crash or power loss).

Linux has io_uring, which lets you submit write(2) calls in parallel. fdatasync(2) and fsync(2), however, only work per file descriptor. If you want multiple write groups to proceed in parallel, you need multiple file descriptors.

With sync_file_range(2), you can have effectively infinitely many write groups with only one file descriptor. It’s only available on Linux, though.