part of: my Diploma Thesis

The Data engine

This chapter covers the implementation of the so-called “Data engine”, which forms the basis of the Mize platform. Other projects and products with the same or similar goals were examined in \ref{a-look-at-existing-data-management-systems}. My previous attempts at building a general data management system, and how the idea came to be, were covered in \ref{hist}. This chapter looks at the latest attempt, which will hopefully also be the last. It incorporates the lessons learned from the previous ones, as well as a lot more experience with computer systems and programming.

The Item

Each piece of data in this data engine is called an Item. An Item works much like a mixture of the well-established data storage concepts File and Folder. Files hold a list of bytes and Folders hold a list of links to other Files or Folders. An Item has both a list of bytes (also called the value of the Item) and a list of links to other Items.
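The combination of a value and links can be sketched as a small Rust type. This is only an illustration and not the struct used in the Mize source code: the names Item, with_value and link are hypothetical, and it assumes that every link has a name and points to the other Item by an id string.

```rust
use std::collections::BTreeMap;

/// Hypothetical sketch of an Item: a value (bytes) plus links to other Items.
/// Assumption: links are named and point to the target Item by an id string.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Item {
    /// The "file-like" part: the raw bytes of the value.
    pub value: Vec<u8>,
    /// The "folder-like" part: link name -> id of the linked Item.
    pub links: BTreeMap<String, String>,
}

impl Item {
    /// Create an Item that has a value but no links yet.
    pub fn with_value(value: &[u8]) -> Self {
        Item { value: value.to_vec(), links: BTreeMap::new() }
    }

    /// Add (or replace) a link to another Item.
    pub fn link(&mut self, name: &str, target_id: &str) {
        self.links.insert(name.to_string(), target_id.to_string());
    }
}
```

An Item created with Item::with_value(b"hello") behaves like a file, and after calling link("type", "22") it also behaves like a folder with one entry.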

The Type System

In an ordinary filesystem the type of the data in a file is indicated by the string after the last dot of the file name or by the first few bytes of the file being a “magic value” (\cite{Purohit2024Jan}). An Item of the Mize data engine, in addition to having a list of bytes and a list of links, also has a type. The type of an Item is stored as a string at the path “type”. This type string is a space-separated list of the names of one or more types. A type can specify how to interpret the bytes in the value of the Item and also which sub Items, at which paths and with which types, the Item needs to have. This applies across multiple levels, so a type also defines the sub Items of sub Items and so on. The type of a sub Item is then whatever the parent Item defines it to be, plus the type of the parent with a slash and the path of the sub Item appended to it. There can, however, be only one type that says how to interpret the value bytes. All other types can only add path definitions or further refine that interpretation.

This type system allows Items to have very precise types where an application needs to attach special data to them. At the same time, an application that only deals with a very generic type can simply ignore the more specific types in the type string.

Let's look at the type system using the example of a note in my Obsidian vault (\cite{BibObsidianVault}). It has the type string: “Note ObsidianNote MarkdownNote File MarkdownFile LinuxFile PosixFile NFSv4ACL”.

“Note”: defines that the bytes in the value should be interpreted as a UTF-8 string. Applications that just want to modify or display the text of the note can treat the Item as just that, a string. The type Note can therefore be seen as an alias for the type “String”.
“MarkdownNote”: further specifies that the string is actually Markdown source code. Markdown notes can also have YAML properties at the top of the source, so MarkdownNote also defines a path “properties”, into which all properties are mapped.
“ObsidianNote”: is the type whose paths hold all information that is specific to the Obsidian note-taking application (\cite{BibObsidian}). For example, in Obsidian every note is part of a so-called vault (\cite{BibObsidianVault}), so the type ObsidianNote includes a link at the path obsidian_vault to the Item that is the vault the note belongs to.
“File”: Obsidian stores every note as a plain Markdown file on your system, so our note can also be seen as a “File” stored on some filesystem (\cite{BibFilesystem}). This type means that the value of our note Item is the content of the file, and that there are sub Items at certain paths for the file metadata that files have on all platforms.
“PosixFile”: Some file metadata is unique to UNIX systems, such as the permission that determines whether a user may execute the file (\cite{BibUnixPerms}). Things like this are again stored in sub Items at certain paths, which give an application access to this UNIX-specific data.
“NFSv4ACL”: The file that stores our note could have special attributes called ACLs, or access control lists. There was an attempt to standardize ACLs in the POSIX standard, but it was withdrawn (\cite{BibPosixACL}). Many POSIX operating systems, including Linux, implement ACLs as defined in the NFSv4 standard, but because they are not part of the POSIX standard, they should not be part of the “PosixFile” type and instead get their own type.
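How an application could work with such a type string can be sketched in a few lines of Rust. The function names are hypothetical and not taken from the Mize source code. The helper sub_item_types reflects only my reading of the rule for sub Items (each parent type, a slash, then the path) and should be treated as an assumption.

```rust
/// Split a type string into its individual type names.
pub fn type_names(type_string: &str) -> Vec<&str> {
    type_string.split_whitespace().collect()
}

/// True if the type string contains exactly the given type name.
/// An application that only understands "File" can call this and
/// ignore all the more specific types.
pub fn has_type(type_string: &str, wanted: &str) -> bool {
    type_string.split_whitespace().any(|name| name == wanted)
}

/// Assumption: the type names a sub Item inherits from its parent are
/// formed as "<ParentType>/<path>" for every type of the parent.
pub fn sub_item_types(parent_type_string: &str, path: &str) -> Vec<String> {
    parent_type_string
        .split_whitespace()
        .map(|name| format!("{}/{}", name, path))
        .collect()
}
```

For the note above, has_type(type_string, "File") is true, so a generic file browser can handle the Item without knowing anything about Obsidian or Markdown.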

The Instance

The Instance is the main concept of the Mize data engine. Wherever code needs to access, store or update data, an Instance is present, and all interaction with data happens through methods of the Instance. A Rust struct called Instance, in the file ./src/core/instance/mod.rs of the Mize source code, holds all necessary state and implements those methods.

The Namespace

Every Instance has a Namespace associated with it, which makes it uniquely identifiable anywhere. For Instances that mostly consume data or are user-facing, the Namespace is a UUID. When you create an Instance without explicitly configuring a Namespace, a random UUID is generated. For Instances that are supposed to own data, a domain should be used. For example, I will have one Instance that is my home server, which is reachable from the internet under the domain “c2vi.dev” and also has the Namespace “c2vi.dev”.

In addition to having its own Namespace, an Instance can also be part of another Namespace. An Instance on my local laptop would be set up to be part of the Namespace “c2vi.dev”. By default, every Item would then be stored on and owned by my home server Instance. More importantly, I address the same Items across all my devices, because all my devices are part of the same Namespace.

The MizeId

Each Item needs to be identifiable somehow. This is what the MizeId is for. It is essentially a path, but with two extra concepts.

The first element of the path is usually the so-called “store part”. It is generated by the storage part of the Mize data engine, which can be swapped out for different implementations. Depending on which storage part is in use and on user configuration, it can be a simple incrementing number like GitHub issue numbers, a UUID, a random Base64 string like YouTube video IDs, or something like the Snowflake IDs (\cite{BibSnowFlake}) from X.com.
Before the store part comes the optional Namespace. It can be omitted when you want to address an Item in the Namespace your Instance is part of. To address Items from another Namespace, you add that Namespace to the front of the MizeId, separated by a colon.

A MizeId without a Namespace can, for example, look like 0/inst/config, and one with a Namespace like 462acca5-81aa-4da2-bddc-da00d126ba9a:22/type or c2vi.dev:0.
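Splitting such a string into its parts could look like the following sketch. The struct and function names are hypothetical and this is not the parser from the Mize source code. It assumes that a Namespace never contains a colon (true for UUIDs and plain domains), so everything before the first colon is the Namespace.

```rust
/// Hypothetical parsed form of a MizeId.
#[derive(Debug, PartialEq)]
pub struct MizeId {
    /// None means: the Namespace the local Instance is part of.
    pub namespace: Option<String>,
    /// Path elements; the first one is the store part.
    pub path: Vec<String>,
}

/// Parse "namespace:store/rest/of/path" or "store/rest/of/path".
/// Assumption: the Namespace itself contains no colon.
pub fn parse_mize_id(text: &str) -> MizeId {
    let (namespace, path_text) = match text.find(':') {
        Some(pos) => (Some(text[..pos].to_string()), &text[pos + 1..]),
        None => (None, text),
    };
    let path = path_text
        .split('/')
        .filter(|part| !part.is_empty())
        .map(|part| part.to_string())
        .collect();
    MizeId { namespace, path }
}
```

Parsing 0/inst/config yields no Namespace and the path elements 0, inst and config, while c2vi.dev:0 yields the Namespace c2vi.dev and the single path element 0.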

A MizeId can also be represented as a URL (\cite{BibURL}). Such a URL uses the scheme mize, the authority part is the Namespace, the store part is the first path element after the slash, and the rest of the path follows: mize://<namespace>/<store-part>/rest/of/path. Parameters will be allowed in URL-style MizeIds and even in non-URL MizeIds, but so far no use for them has been found in the Mize data engine.
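The mapping between the two notations is mechanical, as the following sketch shows. Both function names are hypothetical, and parameters are deliberately not handled because their use is not yet defined.

```rust
/// Build the URL form from a Namespace and the path elements
/// (store part first). Hypothetical helper, not part of Mize.
pub fn mize_id_to_url(namespace: &str, path: &[&str]) -> String {
    format!("mize://{}/{}", namespace, path.join("/"))
}

/// Convert the URL form back into the colon-separated form.
/// Returns None if the text does not start with the mize scheme.
pub fn url_to_mize_id(url: &str) -> Option<String> {
    let rest = url.strip_prefix("mize://")?;
    let (namespace, path) = match rest.find('/') {
        Some(pos) => (&rest[..pos], &rest[pos + 1..]),
        None => (rest, ""),
    };
    Some(format!("{}:{}", namespace, path))
}
```

With these helpers, the MizeId c2vi.dev:0/inst/config corresponds to the URL mize://c2vi.dev/0/inst/config.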

Network of Instances

It is important for Instances to be able to talk to one another, since one of the main goals of the Mize platform is that any data is usable on any device just as if it were local.

Topology

The Topology is peer-to-peer. An Instance can establish a connection to another Instance using one of many transport layers, such as TCP, QUIC, WebSockets, IPC sockets, Bluetooth, serial, shared memory, CAN bus and USB, through which messages about data are then exchanged.

Such an architecture, however, comes with a hard problem. How does an Instance know which other Instances need to learn about a change to some data? What if two Instances want to update the same data at the same time? The problem gets even harder when an Instance is offline for some time, which can happen if the hardware the Instance runs on is turned off, has no power or has no connection to the other Instances. Because of this, systems with a distributed peer-to-peer architecture can become quite complex. One approach is the concept of CRDTs (\cite{BibCRDT}), which stands for “conflict-free replicated data types”. CRDTs were explored for use in this project, but not adopted because of their complexity. The design leaves room to add CRDT functionality later.

Because a Server-Client architecture is so much simpler, Mize uses a similar architecture on top of the peer-to-peer connections. There is always exactly one Instance that has ownership of an Item: its master Instance. This way one Instance always knows the newest state of an Item, and all updates to an Item eventually have to go through this one Instance. In addition, an Instance can send special “maybe updates” directly to its peers. Such updates may later be reverted or changed by the master Instance of the Item. This is useful for showing updates to the user faster when the master Instance is far away in network terms or not reachable at all. The master Instance of an Item can also change dynamically if needed.
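The interplay between confirmed state and “maybe updates” can be illustrated with a small sketch of how a non-master Instance might track one Item. All names here are hypothetical and the real bookkeeping in Mize may differ. The sketch assumes that an update from the master Instance always wins and discards any pending “maybe update”.

```rust
/// Hypothetical local view of one Item on an Instance that is NOT its master.
#[derive(Debug, Default)]
pub struct LocalView {
    /// Last value confirmed by the master Instance.
    confirmed: Vec<u8>,
    /// Optimistic value from a "maybe update" that is not yet confirmed.
    maybe: Option<Vec<u8>>,
}

impl LocalView {
    /// The value to show to the user: the optimistic one if there is one.
    pub fn visible(&self) -> &[u8] {
        match &self.maybe {
            Some(value) => value.as_slice(),
            None => self.confirmed.as_slice(),
        }
    }

    /// A peer (or the local user) proposes a change without the master.
    pub fn apply_maybe_update(&mut self, value: Vec<u8>) {
        self.maybe = Some(value);
    }

    /// The master Instance announces the authoritative state.
    /// Assumption: this confirms, changes or reverts any pending maybe update.
    pub fn apply_master_update(&mut self, value: Vec<u8>) {
        self.confirmed = value;
        self.maybe = None;
    }
}
```

The user sees the optimistic value immediately, and once the master Instance answers, the view falls back to whatever the master decided.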

In stark contrast to the established Server-Client model, not all Items of a Namespace have to have the same master Instance. This is useful for larger deployments, where the traffic for all Items of the Namespace would be too much for one server to handle. In such a deployment the intended Topology is as follows. Frontend Instances are distributed across the globe and act as a sort of CDN. Users connect to the frontend Instance that is geographically closest to them, which is selected using the DNS system. Backend Instances are the actual master Instances of the hosted Items. Each frontend Instance has to take care of only some of the users of the service, and each backend Instance has to be master of only some of the Items, or even only one Item if necessary. No single Instance has to deal with all the traffic. With this setup, even services with very high demand should be possible.

CBOR

CBOR (= Concise Binary Object Representation) is a data encoding format similar to JSON, but binary instead of text-based, which makes it not human-readable. A binary encoding was chosen over the text-based JSON for two reasons. Firstly, the encoding itself adds less overhead, and secondly, it supports encoding arbitrary byte sequences. With JSON, such arbitrary byte sequences would first have to be encoded as Base64, which increases the size of the encoded data by about a third.
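The size difference can be checked with a short calculation. Padded Base64 turns every 3 bytes into 4 characters, whereas a byte string in this format consists of the raw bytes plus a header of 1 to 9 bytes. The two helper functions below are written only for this illustration, and the quotes that JSON additionally needs around the Base64 string are ignored.

```rust
/// Length of standard padded Base64 text for `n` raw bytes:
/// every started group of 3 bytes becomes 4 characters.
pub fn base64_len(n: u64) -> u64 {
    (n + 2) / 3 * 4
}

/// Length of a binary byte string holding `n` raw bytes:
/// a header of 1, 2, 3, 5 or 9 bytes followed by the payload itself.
pub fn cbor_bytes_len(n: u64) -> u64 {
    let header = if n < 24 {
        1
    } else if n <= 0xFF {
        2
    } else if n <= 0xFFFF {
        3
    } else if n <= 0xFFFF_FFFF {
        5
    } else {
        9
    };
    header + n
}
```

For a value of 3000 bytes, base64_len gives 4000 characters, while cbor_bytes_len gives 3003 bytes.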

Protocol

In order to exchange data and changes to data over the connections between Instances, a protocol is needed. The protocol of the Mize data engine works by exchanging different types of Messages between Instances. A Message consists of CBOR-encoded bytes that represent the CBOR type called “map”. A CBOR map maps keys to values and is comparable to an Object in JSON. The value at the key “1” is the Command field, i.e. the type of the message. Different types of messages then require different additional keys. The key “2” holds the MizeId that the message is about.
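To make the message layout concrete, the following sketch encodes such a map by hand, without any library. It rests on several assumptions that the text above does not settle: the keys 1 and 2 are encoded as unsigned integers, the command is an unsigned integer code, and the MizeId is sent as a text string. The function names are hypothetical, and the actual encoding in ./src/core/proto.rs may differ.

```rust
/// Write the head of an encoded value: major type (0..=7) plus an argument.
fn write_head(out: &mut Vec<u8>, major: u8, arg: u64) {
    let m = major << 5;
    if arg < 24 {
        out.push(m | arg as u8);
    } else if arg <= 0xFF {
        out.push(m | 24);
        out.push(arg as u8);
    } else if arg <= 0xFFFF {
        out.push(m | 25);
        out.extend_from_slice(&(arg as u16).to_be_bytes());
    } else if arg <= 0xFFFF_FFFF {
        out.push(m | 26);
        out.extend_from_slice(&(arg as u32).to_be_bytes());
    } else {
        out.push(m | 27);
        out.extend_from_slice(&arg.to_be_bytes());
    }
}

/// Assumed integer keys for the command and id fields.
pub const KEY_CMD: u64 = 1;
pub const KEY_ID: u64 = 2;

/// Encode the map {1: cmd, 2: id}. The numeric command code is hypothetical.
pub fn encode_message(cmd: u64, id: &str) -> Vec<u8> {
    let mut out = Vec::new();
    write_head(&mut out, 5, 2); // major type 5: map with two key/value pairs
    write_head(&mut out, 0, KEY_CMD); // major type 0: unsigned integer key 1
    write_head(&mut out, 0, cmd); // value: the command code
    write_head(&mut out, 0, KEY_ID); // unsigned integer key 2
    write_head(&mut out, 3, id.len() as u64); // major type 3: text string
    out.extend_from_slice(id.as_bytes());
    out
}
```

Under these assumptions, encode_message(1, "0/inst/config") produces 18 bytes: the five header bytes A2 01 01 02 6D followed by the 13 bytes of the id text.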

Types of Messages

In the current state, where simple data exchanges can take place, only eight types of messages are implemented, as seen in ./src/core/proto.rs:

CMD_GET … sent to get the data of the Item specified by the id field.
CMD_GIVE … sent in reply to a CMD_GET containing the data of an Item.
CMD_UPDATE … sent by the Instance with ownership of an Item, to tell other Instances about changes in that Item.
CMD_UPDATE_REQUEST … sent to request a change in an Item that the sending Instance does not own.
CMD_SUB … sent to the Instance with ownership, to let it know that you want to be notified of updates to an Item.
CMD_GET_SUB … the same as sending a CMD_SUB and a CMD_GET as separate messages, because getting the data and subscribing to updates are often needed at the same time.
CMD_CREATE … ask an Instance to create an Item, which it will then be the owner of.
CMD_CREATE_REPLY … sent in reply to a CMD_CREATE containing the MizeId of the newly created Item.
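The eight message types and the request/reply pairs named in the list can be written down as a Rust enum. This is a sketch and not the definition from ./src/core/proto.rs; in particular, no numeric command codes are shown because the list does not give them. That a CMD_GET_SUB is answered with a CMD_GIVE is my inference from its description as a combined CMD_SUB and CMD_GET.

```rust
/// The eight message types of the protocol (sketch, without numeric codes).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Command {
    Get,
    Give,
    Update,
    UpdateRequest,
    Sub,
    GetSub,
    Create,
    CreateReply,
}

impl Command {
    /// The constant name used in the list above.
    pub fn name(self) -> &'static str {
        match self {
            Command::Get => "CMD_GET",
            Command::Give => "CMD_GIVE",
            Command::Update => "CMD_UPDATE",
            Command::UpdateRequest => "CMD_UPDATE_REQUEST",
            Command::Sub => "CMD_SUB",
            Command::GetSub => "CMD_GET_SUB",
            Command::Create => "CMD_CREATE",
            Command::CreateReply => "CMD_CREATE_REPLY",
        }
    }

    /// The reply a request expects, where the list names one.
    /// Assumption: CMD_GET_SUB is answered like a CMD_GET.
    pub fn expected_reply(self) -> Option<Command> {
        match self {
            Command::Get | Command::GetSub => Some(Command::Give),
            Command::Create => Some(Command::CreateReply),
            _ => None,
        }
    }
}
```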

The fact that a message is just a CBOR map makes the protocol extremely extensible. This is by design and inspired by HTTP headers, a similar data structure that allowed HTTP to extend its capabilities by defining many new headers.

Portability

Before commit a16c2f92217c79445650ce1ce2e8ef6391e849c3, the implementation plan was to have one server, written in Rust, that would take care of storing all Items, handling updates and so on. There would then be client implementations in every language Mize can be used with. This plan can be seen in older commits of the Mize repository, where data was already being synced between the server and a JavaScript client implementation.

Around the above-mentioned commit it became clear that a lot of the logic was implemented multiple times: once in the server and once in each client. The client, for example, also has to store Items, even if only in RAM. It may even be desirable for it to store them on disk, so that they are kept across restarts. In addition, all the logic for updating the data of Items has to be reimplemented in every client language.

One library for all languages

The question then was whether what has since been called an Instance can be written in one language and then be used as a library from any other language, so that this Instance code becomes the server as well as the client. It turns out that almost any programming language can somehow interact with a C library. C is a very old language that is still widely used today. Most languages need to call functions from the C standard library whenever they want anything from the operating system, so almost all languages have a way to call functions from a shared C library. In Java, for example, you can use System.loadLibrary("libname") to load a shared library whose exported C functions follow the naming conventions Java expects. In Python, the CDLL class from the ctypes module loads a shared library, after which you can call any ordinary C function it exports. Every other language I looked at offers a similar mechanism.

JavaScript running under, for example, Node.js can use shared C libraries in the same way as any other language, but what about JavaScript that runs in the browser? There is a project called asm.js (\cite{BibASMJs}), which allows C to be compiled into JavaScript that uses only a subset of JavaScript’s operations, expressions and functions in order to run faster. Using this, the Mize Instance C library can be compiled to JavaScript and used in the browser. In recent years a standard called WebAssembly was developed, which allows C and most other compiled languages to run in the browser. WebAssembly is, as the name suggests, an assembly-like language: a list of instructions similar to what a CPU would execute. The instructions that WebAssembly defines are, however, designed to be run by a virtual machine, which also makes it similar to the bytecode that Java is compiled into and that is then run by the JVM (= Java Virtual Machine). Such WebAssembly virtual machines are now part of every modern web browser.

It might seem that the logical conclusion is that the Instance code has to be written in C, but this is not the case. Not only is it easy to use C shared libraries from many languages, many languages also make it easy to produce C-like shared libraries. Rust is one of them. In Rust, a function can be declared with extern "C" and the #[no_mangle] attribute, which makes the compiler export it with the C calling convention and an unmangled symbol name. You can then also tell the Rust compiler to create a C shared library in addition to the rlib (= Rust library) file.
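A minimal example of such an exported function is shown below. The function mize_example_add is made up for this illustration and is not part of the Mize library. The comment about Cargo.toml describes the standard Cargo setting for building a C shared library next to the rlib.

```rust
use std::os::raw::c_int;

// In Cargo.toml, the following setting builds a C shared library
// in addition to the Rust library:
//
//   [lib]
//   crate-type = ["cdylib", "rlib"]

/// Hypothetical exported function, callable from C as
/// `int mize_example_add(int a, int b);`.
/// Note: the Rust 2024 edition spells the attribute `#[unsafe(no_mangle)]`.
#[no_mangle]
pub extern "C" fn mize_example_add(a: c_int, b: c_int) -> c_int {
    // wrapping_add avoids a panic on overflow, which must not
    // unwind across the C boundary.
    a.wrapping_add(b)
}
```

From Rust itself the function can still be called like any other function, which keeps the same code usable as a normal Rust library.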

JVM languages

Languages that run on the JVM (= Java Virtual Machine, \cite{BibJVM}), like Java and Kotlin, are not compiled to machine code but to JVM bytecode, which is then executed by the JVM. This enables a compiled program to run on any system that has a JVM. JVM programs can use C-like libraries, which contain machine code, through the JNI (= Java Native Interface, \cite{BibJNI}). This, however, removes the portability of JVM applications. It is possible to include multiple builds of the C library and choose the right one for the system at runtime, but this is far from a clean solution. There is a project called “Asmble” (\cite{BibAsmble}), which claims it can transpile WebAssembly instructions to JVM bytecode. This would allow the Mize library for JVM languages to be distributed as JVM bytecode only, improving portability considerably.

Cleaner Codebase

The decision that the Rust code for the Instance should run anywhere and be usable as a library from any programming language made it necessary to separate the logic of the Instance itself (which is the same regardless of environment and language) from the logic that interacts with the platform the Instance code runs on. That platform could be an operating system, the browser, the bare hardware in the case of embedded systems without an operating system, the JVM and many more. The restriction that logic belonging to the Instance cannot interact directly with, for example, the operating system made the code a lot cleaner. All code that is part of the Instance logic is in the folder src/core, and all logic that has to do with a particular platform is confined to src/platform.
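One common way to express such a separation in Rust is a trait that the core logic depends on and that each platform implements. The sketch below only illustrates the idea; the trait Platform and the types MemoryPlatform and CoreLogic are hypothetical, and the actual interfaces between src/core and src/platform are not described in this chapter.

```rust
use std::collections::HashMap;

/// Hypothetical boundary between the core logic and the platform beneath.
pub trait Platform {
    fn load(&self, key: &str) -> Option<Vec<u8>>;
    fn store(&mut self, key: &str, value: &[u8]);
}

/// Platform side: an in-memory stand-in. A real platform would talk
/// to a filesystem, browser storage or flash memory instead.
#[derive(Default)]
pub struct MemoryPlatform {
    data: HashMap<String, Vec<u8>>,
}

impl Platform for MemoryPlatform {
    fn load(&self, key: &str) -> Option<Vec<u8>> {
        self.data.get(key).cloned()
    }

    fn store(&mut self, key: &str, value: &[u8]) {
        self.data.insert(key.to_string(), value.to_vec());
    }
}

/// Core side: knows only the trait, never the operating system.
pub struct CoreLogic<P: Platform> {
    platform: P,
}

impl<P: Platform> CoreLogic<P> {
    pub fn new(platform: P) -> Self {
        CoreLogic { platform }
    }

    pub fn set_value(&mut self, id: &str, value: &[u8]) {
        self.platform.store(id, value);
    }

    pub fn get_value(&self, id: &str) -> Option<Vec<u8>> {
        self.platform.load(id)
    }
}
```

Because CoreLogic is generic over the trait, the same core code can be compiled against a different Platform implementation for every environment without being changed itself.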
