Filesystems
OSv supports a variety of filesystems, which are described in the paragraphs below. Layer-wise, it comes with the VFS layer (see `fs/vfs/*`) and the individual filesystem implementations found under `fs/**/*`, except for ZFS, which is found under `bsd/sys/cddl/compat/opensolaris` and `bsd/sys/cddl/contrib/opensolaris`. The ext2/3/4 filesystem is implemented by the modules `libext` and `lwext4`.
During boot, OSv initially mounts the BootFS filesystem (see `vfs_init()` and `mount_rootfs()`) and then proceeds to mount and pivot to a 'real' filesystem like RoFS, ZFS or Virtio-FS (for details see this code in `loader.cc`), unless the `--nomount` kernel option was specified. The root filesystem can be explicitly selected using the `--rootfs` option; otherwise, the loader will try to discover it by probing RoFS, then Virtio-FS, ext, and finally ZFS.
Please note that OSv also supports the `/etc/fstab` file, where one can add extra filesystem mount points. In addition, one can mount an extra filesystem by prepending appropriate options to the command line like so:

```
./scripts/run.py --execute='--rootfs=rofs --mount-fs=zfs,/dev/vblk0.2,/data /hello'
```
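As a hypothetical illustration (the device, mount point, and filesystem choice below are made up for this example), an extra mount point in `/etc/fstab` could look like the familiar Linux-style entry, assuming the usual device/mount-point/type/options columns:

```
# device        mount point   fs type   options
/dev/vblk0.2    /data         zfs       defaults
```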
To build an image with a specific type of filesystem, you need to specify the `fs` option (which defaults to `zfs`) like so:

```
./scripts/build image=tests fs=rofs
```
In March 2024, ext2/3/4 filesystem support was added in the form of a shared pluggable module, libext, built on top of the lwext4 library. Since the initial commit, it has been improved to add thread safety and to fix various bugs identified when running unit tests on an ext image.
The `libext` module acts as an adapter between the VFS layer and the lower-level implementation of the ext2/3/4 filesystem driver provided by the `lwext4` module. The `lwext4` module is a fork of the original lwext4 library that fixes a csum bug and customizes the building of the shared library, among other things. The `libext` module provides a set of `ext_*()` functions that fill the `vfsops` and `vnops` tables and delegate to the `lwext4` layer.
The intention is to provide a lightweight read-write filesystem alternative to ZFS. It comes with the following benefits:
- familiar to Linux users, with tools available on most distributions
- small binary of ~100K, compared to the ~800K `libsolaris.so`; `libext.so` is 32K and `liblwext4` is 68K as of this writing
- faster mount and boot time, similar to RoFS
- smaller memory footprint
- no dedicated kernel threads overhead (see https://github.com/cloudius-systems/osv/issues/247)
The main drawback is the I/O handling speed - ZFS is more sophisticated and thus faster.
The ideal use cases for ext would involve almost-stateless (not completely ephemeral) applications or microservices that need to read AND write some data (logs, modifiable configuration, etc.) to disk. More serious data applications like databases, on the other hand, would greatly benefit from ZFS.
Please note that the ext support is also fairly minimal: it does not support xattr (extended attributes), journal recovery and transactions, or sparse files. As far as caching is concerned, lwext4 implements a simple RB-tree-based write-back cache for metadata blocks to efficiently read from and write to the i-node and block group tables. File data, on the other hand, is read from and written to the block device directly, without any page cache.
```
./scripts/build fs=ext image=native-example                      # Builds image with ext mounted at /
./scripts/build fs=rofs_with_ext image=native-example -j$(nproc) # Builds image with rofs at / and ext mounted at /data
```
One can also use the new `ext-disk-utils.sh` script to mount an OSv ext image in order to inspect and manipulate its contents:

```
./scripts/ext-disk-utils.sh mount build/last/usr.img
ll build/release/usr.img.image/ # The contents of usr.img are available to read and write on the host
./scripts/ext-disk-utils.sh unmount build/last/usr.img /dev/nbd0
```
For more details on how to mount a secondary disk with the ext filesystem, please read this readme.
- https://blogs.oracle.com/linux/post/understanding-ext4-disk-layout-part-1
- https://blogs.oracle.com/linux/post/understanding-ext4-disk-layout-part-2
The ZFS code is based on the FreeBSD implementation as of circa 2014 and has since been adapted to work in OSv. ZFS is a sophisticated filesystem that traces its roots to Solaris, and you can find some resources about it on this Wiki page. The majority of the ZFS code can be found under the subtree `bsd/sys/cddl/`. The ZFS filesystem driver has fairly recently been extracted from the kernel as a separate shared library, `libsolaris.so`, which is dynamically loaded at boot time from a different filesystem (most likely BootFS or RoFS) before the ZFS filesystem can be mounted.
There are three ways ZFS can be mounted on OSv:
- The first and original one mounts ZFS at the root (`/`) from the 1st partition of the 1st disk - `/dev/vblk0.1`.
- The second one mounts ZFS from the 2nd partition of the 1st disk - `/dev/vblk0.2` - at an arbitrary non-root mount point, for example `/data`.
- Similarly, the third way mounts ZFS from the 1st partition of the 2nd or higher disk - for example, `/dev/vblk1.1` - at an arbitrary non-root mount point as well.

Please note that both the second and third options assume that the root filesystem is non-ZFS - most likely RoFS or Virtio-FS.
The disadvantage of the 1st option is that code and data live in the same read-write filesystem, whereas the other two options allow one to isolate code from mutable data. Ideally, one would put all code and configuration on the RoFS partition, colocated on the same disk (2) or not (3), and mutable data on a separate partition on the same disk (2) or on a different one (3). It has been shown that booting and mounting ZFS from a separate disk is also slightly faster (by 30-40ms) than the original option 1.
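To make the trade-off concrete, the two non-root layouts can be sketched like this (mount points and partition assignments are illustrative only):

```
# Option 2: one disk, two partitions
/dev/vblk0.1  ->  /      (RoFS: code + configuration, read-only)
/dev/vblk0.2  ->  /data  (ZFS: mutable data)

# Option 3: two disks
/dev/vblk0.1  ->  /      (RoFS: code + configuration, read-only)
/dev/vblk1.1  ->  /data  (ZFS: mutable data, on its own disk)
```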
Below are examples of building and running OSv with ZFS.
This is the original and default method. Please note that `libsolaris.so` is part of `loader.elf` and loaded from BootFS, which makes the kernel larger by ~800K.
```
./scripts/build image=native-example fs=zfs # The fs option defaults to zfs
./scripts/run.py

OSv v0.56.0-152-gfd716a77
...
devfs: created device vblk0.1 for a partition at offset:4194304 with size:532676608
virtio-blk: Add blk device instances 0 as vblk0, devsize=536870912
...
zfs: driver has been initialized!
VFS: mounting zfs at /zfs
zfs: mounting osv/zfs from device /dev/vblk0.1
...
```
This is a fairly new method that allows mounting ZFS at a non-root mount point like `/data`, for example, mixed with another filesystem on the same disk. Please note that `libsolaris.so` is placed on the root filesystem (typically RoFS) under `/usr/lib/fs/` and loaded from it automatically. The `build` script will implicitly add the relevant mount point line to `/etc/fstab`.
```
./scripts/build image=native-example,zfs fs=rofs_with_zfs # Has to add the zfs module, which adds /usr/lib/fs/libsolaris.so to RoFS
./scripts/run.py

OSv v0.56.0-152-gfd716a77
...
devfs: created device vblk0.1 for a partition at offset:4194304 with size:191488
devfs: created device vblk0.2 for a partition at offset:4385792 with size:532676608
virtio-blk: Add blk device instances 0 as vblk0, devsize=537062400
...
VFS: mounting rofs at /rofs
zfs: driver has been initialized!
VFS: initialized filesystem library: /usr/lib/fs/libsolaris.so
VFS: mounting devfs at /dev
VFS: mounting procfs at /proc
VFS: mounting sysfs at /sys
VFS: mounting ramfs at /tmp
VFS: mounting zfs at /data
zfs: mounting osv/zfs from device /dev/vblk0.2
...
```
This fairly new method is similar to the above, in that it also allows ZFS to be mounted at a non-root mount point like `/data`, but this time from a different disk. Please note that `libsolaris.so` is placed on the root filesystem (typically RoFS) under `/usr/lib/fs/` and loaded from it automatically as well. Similar to the above, the `build` script will implicitly add the relevant mount point line to `/etc/fstab`.
```
./scripts/build image=native-example,zfs fs=rofs --create-zfs-disk # Creates empty disk at build/last/zfs_disk.img with ZFS filesystem
./scripts/run.py --second-disk-image build/last/zfs_disk.img

OSv v0.56.0-152-gfd716a77
...
devfs: created device vblk0.1 for a partition at offset:4194304 with size:1010688
virtio-blk: Add blk device instances 0 as vblk0, devsize=5204992
devfs: created device vblk1.1 for a partition at offset:512 with size:536870400
virtio-blk: Add blk device instances 1 as vblk1, devsize=536870912
...
VFS: mounting rofs at /rofs
zfs: driver has been initialized!
VFS: initialized filesystem library: /usr/lib/fs/libsolaris.so
VFS: mounting devfs at /dev
VFS: mounting procfs at /proc
VFS: mounting sysfs at /sys
VFS: mounting ramfs at /tmp
VFS: mounting zfs at /data
zfs: mounting osv/zfs from device /dev/vblk1.1
...
```
However, with a different disk setup, you can manually make OSv mount a different disk and partition by explicitly using the `--mount-fs` boot option like so:

```
# Build the ZFS disk some other way and make sure `build` does not append a ZFS mount point (inspect build/last/fstab)
./scripts/run.py --execute='--rootfs=rofs --mount-fs=zfs,/dev/vblk1.1,/data /hello' --second-disk-image <disk_path>
```
Please note that in the examples above, the ZFS pool and filesystem are created using the `zfs_loader.elf` version of OSv, which executes `zpool.so`, `zfs.so`, and `cpiod.so`, among others. This is actually quite fast and efficient, but recently we have enhanced the build mechanism to create ZFS disks using the `zpool` and `zfs` tools on a Linux host, provided you have OpenZFS installed (see more info here).
To that end, there is a fairly new script, `zfs-image-on-host.sh`, that can be used to either mount an existing OSv ZFS disk or create a new one. The latter can be orchestrated by the `build` script if one passes the `--use-openzfs` option like so:

```
./scripts/build image=native-example fs=zfs -j$(nproc) --use-openzfs
```
Some help output from `zfs-image-on-host.sh`:

```
Manipulate ZFS images on the host using OpenZFS - mount, unmount, and build.
Usage: zfs-image-on-host.sh mount <image_path> <partition> <pool_name> <filesystem> |
                            build <image_path> <partition> <pool_name> <filesystem> <populate_image> |
                            unmount <pool_name>
Where:
  image_path      path to a qcow2 or raw ZFS image; defaults to build/last/usr.img
  partition       partition of disk above; defaults to 1
  pool_name       name of ZFS pool; defaults to osv
  filesystem      name of ZFS filesystem; defaults to zfs
  populate_image  boolean value to indicate if the image should be populated with content
                  from build/last/usr.manifest; defaults to true, but only used with the 'build' command

Examples:
  zfs-image-on-host.sh mount                           # Mount OSv image from build/last/usr.img under /zfs
  zfs-image-on-host.sh mount build/last/zfs_disk.img 1 # Mount OSv image from build/last/zfs_disk.img 2nd partition under /zfs
  zfs-image-on-host.sh unmount                         # Unmount OSv image from /zfs
```
Using the same script, you can always mount any ZFS disk on the host, inspect and modify any files, and unmount it. OSv will then see all the changes if run with the same disk:

```
./scripts/zfs-image-on-host.sh mount build/last/zfs_disk.img
Connected device /dev/nbd0 to the image build/last/zfs_disk.img
Imported pool osv
Mounted osv/zfs at /zfs

[wkozaczuk@fedora-mbpro osv]$ find /zfs/
/zfs/
/zfs/seaweedfs
/zfs/seaweedfs/logs
/zfs/seaweedfs/logs/weed.osv.osv.log.WARNING.20220726-181118.2
/zfs/seaweedfs/logs/weed.osv.osv.log.INFO.20220726-180155.2
/zfs/seaweedfs/logs/weed.WARNING
/zfs/seaweedfs/logs/weed.INFO
/zfs/seaweedfs/master
/zfs/seaweedfs/master/snapshot
find: ‘/zfs/seaweedfs/master/snapshot’: Permission denied
/zfs/seaweedfs/master/log
/zfs/seaweedfs/master/conf

./scripts/zfs-image-on-host.sh unmount
```