Tux3 Versioning Filesystem

Overview

Background

Since everybody seems to be having fun building new filesystems these days, I thought I should join the party. Tux3 is the spiritual and moral successor of Tux2, the most famous filesystem that was never released.[1] In the ten years since Tux2 was prototyped on Linux 2.2.13 we have all learned a thing or two about filesystem design. Tux3 is a write anywhere, atomic commit, btree based versioning filesystem. As part of this work, the venerable HTree design used in Ext3 and Lustre is getting a rev to better support NFS and possibly be more efficient.

The main purpose of Tux3 is to embody my new ideas on storage data versioning. The secondary goal is to provide a more efficient snapshotting and replication method for the Zumastor NAS project, and a tertiary goal is to be better than ZFS.

Tux3 is big endian, as the great penguin intended.

General Description

In broad outline, Tux3 is a conventional inode/file/directory design with wrinkles. A Tux3 inode table is a btree with versioned attributes at the leaves. A file is an inode attribute that is a btree with versioned extents at the leaves. Directory indexes are mapped into directory file blocks as with HTree. Free space is mapped by a btree with extents at the leaves.

The interesting part is the way inode attributes and file extents are versioned. Unlike the currently fashionable recursive copy on write designs with one tree root per version[2], Tux3 stores all its versioning information in the leaves of btrees, using the versioned pointer algorithms described in detail here:

http://lwn.net/Articles/288896/

This method promises a significant shrinkage of metadata for heavily versioned filesystems as compared to ZFS and Btrfs. The distinction between Tux3 style versioning and the WAFL style versioning used by ZFS (and a similar method used by Btrfs) is analogous to the distinction between delta and weave encoding for version control systems. In fact, Tux3's pointer versioning algorithms were derived from a binary weave technique I worked on earlier for version control, back in the days when we were all racing for the glory of replacing the proprietary BitKeeper system for kernel version control.[3]

Filesystem Characteristics and Limits

Transactional Consistency

Atomic update in Tux3 is via a new method called forward logging that avoids the double writes of conventional journalling and the recursive copy behavior of copy on write.

Forward Logging

Atomic update is via a new technique called forward logging. Each update transaction is a series of data blocks followed by a commit block. Each commit block either stores the location of the next commit block, if that is already known, or else marks the end of a chain of commits. The start of each chain of commits is referenced from the superblock.

Commit data may be logical or physical. Logical updates are first applied to the target structure (usually an inode table block or file index block), then the changed blocks are logged physically before being either applied to their final destination or implicitly merged with the destination via logical logging. This multi-level logging sounds like a lot of extra writing, but it is not, because the logical updates are more compact than physical block updates and many of them can be logged before a periodic rollup pass performs the physical or higher level logical updates.

A typical write transaction therefore looks like a single data extent followed by one commit block, which can be written anywhere on the disk. This is about as efficient as an atomic update can be.
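
To make the commit chain concrete, here is a minimal sketch of what an on-disk commit block header might look like. The field names and sizes are illustrative assumptions, not the actual Tux3 format.

    /* Sketch only: field names and sizes are assumptions, not the
     * actual Tux3 on-disk format. */
    #include <stdint.h>

    struct commit_block {
        uint32_t magic;       /* identifies a commit block on replay */
        uint32_t checksum;    /* covers this block and its data extents */
        uint64_t sequence;    /* position of this commit in its chain */
        uint64_t next_commit; /* address of the next commit block, or 0
                               * if this is (so far) the end of the chain */
        uint32_t entry_count; /* number of log entries that follow */
        /* compact logical log entries and/or references to the data
         * extents written just before this block follow here */
    };

Replay would start from the chain head referenced by the superblock and follow next_commit until it runs off the end of the chain.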

Benefits of Forward Logging

Another very nice benefit of Tux3-style forward logging is that transaction size is limited only by available free space on the volume, because log commit blocks can be anywhere.

Shortly after reading Stephen Tweedie's horror stories about what you have to do if you cannot put an entire, consistent transaction onto stable media in one go, I dreamed up a log commit block variant in which the associated transaction data blocks can be physically discontiguous, which makes this assertion come true:

http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

Most of the nasties he mentions just vanish if transaction size is not limited by the fixed size of a journal.

He did remind me (via time travel from the year 2000) of some details I should write into the design explicitly, for example logging orphan inodes that are unlinked while open, so they can be deleted on replay after a crash. That is another nice application of forward logging, and it avoids the seek-happy linked list through the inode table that Ext3 uses.

Metadata Backpointers

While working on the Tux3 announcement I noticed yet another advantage of versioned pointers over the currently fashionable copy-on-write methods: metadata backpointers. Versioning is done in only two places: the leaves of the inode table btree (versioned attributes) and the leaves of file index btrees (versioned extents).

In both cases it is possible to store a single backpointer from the leaf block to the parent btree index node. This is impractical with a copy-on-write scheme where each version has its own tree root, because multiple backpointers would then be required. Btrfs instead stores a "hint" consisting of the btree level and logical offset of the leaf. Tux3 can simply use direct backpointers, with a corresponding increase in fsck robustness.

I should also note that versioned pointers require dramatically less metadata than copy-on-write with multiple roots. If a data block is altered in a file, the versioned pointer method only requires a single new (64 bit) pointer to be added to a terminal index node. Copy-on-write requires the entire 4K parent block to be copied, and its parent, and so on. Versioned pointer metadata is thus at least 512 times more space efficient than copy-on-write in this case.
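
To put a number on "a single new (64 bit) pointer": a versioned extent entry of the kind described in the LWN article can be packed into eight bytes, as in the sketch below, which borrows the 48/6/10 bit field widths used later in this document and is not the final Tux3 layout.

    /* Sketch of an 8 byte versioned extent entry, using assumed field
     * widths (48 bit block, 6 bit count, 10 bit version). */
    #include <stdint.h>

    struct versioned_extent {
        uint64_t block   : 48; /* physical address of the first block */
        uint64_t count   : 6;  /* extent length in blocks */
        uint64_t version : 10; /* version at which this extent was born */
    };

    /* Changing one block of one file in a new version adds one such
     * 8 byte entry to a leaf; recursive copy-on-write would instead
     * copy the 4K parent block and every ancestor up to a new root. */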

All Shutdowns are "Clean"

I'm just writing down points as they occur to me now. One small detail of Tux3 should have quite a big impact on robustness: Tux3 will always "initiate recovery" on mount, which is to say that it will read its forward logs and apply them to the in-memory image of the btrees etc., returning to just the state it was in when it stopped. It will not, for example, roll the logs up and write out "perfect" copies of the btrees.

This will not only save time on shutdown, but it will give constant exercise to the recovery logic, which all too often is the last part of a filesystem to be perfected, if it ever is perfected.

Addendum to the last note: that is a 512 times advantage for changing one block of a file. If many or all blocks of a file are changed (a common case) then recursive copy-on-write gets closer to the space efficiency of versioned pointers, but it never quite gets there.

In the case of indexed directories it is very common for a single block to be changed, and in that case the metadata efficiency advantage of versioned pointers will be large.

End of content for now...

Fragmentation Control

Versioning filesystems on rotating media are prone to fragmentation. With the write-anywhere strategy we have a great deal of latitude in choosing where to write, but translating this into minimizing seeks on read is far from easy. At least for metadata, we can fall back to update in place, using the forward logging method as a "write twice" journal. Because the metadata is very small (at least when the filesystem is not heavily versioned), sufficient space for the single commit block required to logically log a metadata update will normally be available near the original location, so the fallback will not often be needed.

When there is no choice but to write new data far away from the original location, a method called "write bouncing" is used: the allocation target is redirected, according to a repeatable generating function, to a new target zone some distance away from the original one. If that zone is too crowded, the next candidate zone further away is checked, and so on, until the entire volume has been checked for available space. (This is remotely analogous to a quadratic hash.) What this means is that even though seeking to changed blocks of a large file is not entirely avoided, it can at least be kept down to a few seeks in most cases. File readahead will help a lot here, since a number of outlying extents can be picked up in one pass. Physical readahead would help even more, by dealing with cross-file fragmentation within a directory.
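
As a rough illustration of the kind of repeatable generating function meant here, the sketch below probes candidate zones at quadratically increasing distances from the goal; the zone-checking helper and the exact probe sequence are assumptions, not a specification.

    /* Sketch of write bouncing: probe candidate zones at repeatable,
     * growing distances from the original location.  The helper and the
     * probe formula are assumptions; a real implementation would also
     * guarantee that every zone eventually gets probed. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t zone_t;

    bool zone_has_room(zone_t zone, unsigned blocks_needed); /* assumed */

    zone_t bounce_target(zone_t goal, zone_t nr_zones, unsigned blocks_needed)
    {
        for (zone_t i = 0; i < nr_zones; i++) {
            zone_t zone = (goal + i * i) % nr_zones; /* goal, +1, +4, +9, ... */
            if (zone_has_room(zone, blocks_needed))
                return zone;
        }
        return goal; /* no zone has room; the volume is effectively full */
    }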

Inode attributes

An inode is a variable sized item indexed by its inode number in the inode btree. It consists of a list of attribute blocks, where standard attributes are grouped together according to their frequency of updating, plus extended attributes. Each standard attribute block carries a version label recording the version at which the attribute was last changed. Extended attributes have the same structure as files; in fact, file data is just an extended attribute. Extended attributes are not versioned in the inode, but at the index leaf blocks. The atime attribute is handled separately from the inode table, to avoid polluting the inode table with versions generated by mere filesystem reads.
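
A sketch of what one such standard attribute block header could look like; the layout and field names are assumptions for illustration only.

    /* Sketch of a versioned attribute block header inside an inode
     * table leaf; layout and names are assumptions. */
    #include <stdint.h>

    struct attr_head {
        uint16_t kind;    /* which group of standard attributes this is */
        uint16_t version; /* version at which the group was last changed */
        uint16_t size;    /* bytes of attribute payload that follow */
    };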

Unversioned attributes:

Standard attribute block:

Write attribute block:

Data attribute:

Immediate data attributes:

Unversioned reference to versioned attributes:

Versioned link count:

None of the above:

Note: an inode is never reused unless it is free in all versions.

Atom table

Directory Index

Thus there are two "leaf" layers in a PHTree: 1) the terminal nodes of the index btree and 2) the directory data blocks containing dirents. This requires one extra lookup per operation versus HTree, which is regrettable, but it solves the physical stability problem that has caused so much grief in the past with NFS support. With PHTree, dirent blocks are never split and dirents are never moved, allowing the logical file offset to serve as a telldir/seekdir position just as it does for primitive filesystems like Ext2 and UFS, on which the Posix semantics of directory traversal are sadly based.

There are other advantages to physical stability of dirents besides supporting brain damaged NFS directory traversal semantics: because dirents and inodes are initially allocated in the same order, a traversal of the leaf blocks in physical order to perform deletes etc. will tend to access the inodes in ascending order, which reduces cache thrashing of the inode table, a problem that has been observed in practice with schemes like HTree that traverse directories in hash order.

Because leaf blocks in PHTree are typically full instead of 75% full as in HTree, the space usage ends up about the same. PHTree does btree node merging on delete (which HTree does not) so fragmentation of the hash key space is not a problem and a slightly less uniform but far more efficient hash function can be used, which should deliver noticeably better performance.

HTree always allocates new directory entries into btree leaf nodes, splitting them if necessary, so it does not have to worry about free space management at all. PHTree does, however, since gaps in the dirent blocks left by entry deletions have to be recycled. A linear scan for free space would be far too inefficient, so instead PHTree uses a lazy method of recording the maximum sized dirent available in each directory block. The actual largest free dirent may be smaller, and this will be detected when a search fails, causing the lazy max to be updated. We can safely skip searching for free space in any block for which the lazy max is less than the needed size. One byte is sufficient for the lazy max, so one 4K block is sufficient to keep track of 2^12 * 2^12 bytes worth of directory blocks, a 16 meg directory with about half a million entries. For larger directories this structure becomes a radix tree, with a lazy max recorded at each index pointer as well, for quick free space searching without having to examine every lazy max block.
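
A minimal sketch of the lazy max bookkeeping described above; the block search helper and the names are assumptions, not real code.

    /* Sketch of PHTree lazy free space tracking: one byte per dirent
     * block holds an upper bound on the largest free dirent in it. */
    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed helper: try to find room for a dirent of the given size in
     * dirent block "dblock"; on failure report the true largest free
     * dirent so the lazy bound can be tightened. */
    bool block_has_space(uint32_t dblock, unsigned size, uint8_t *actual_max);

    int find_dirent_block(uint8_t lazy_max[4096], uint32_t nr_dblocks,
                          unsigned size)
    {
        for (uint32_t i = 0; i < nr_dblocks; i++) {
            if (lazy_max[i] < size)
                continue;         /* lazy bound says no room, skip the search */
            uint8_t actual;
            if (block_has_space(i, size, &actual))
                return (int)i;    /* the new dirent fits in block i */
            lazy_max[i] = actual; /* failed search tightens the lazy bound */
        }
        return -1; /* no existing block has room; extend the directory */
    }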

Like HTree, a PHTree is a btree embedded in the logical blocks of a file. Just like a file, a directory is read into a page cache mapping as its blocks are accessed. Except for cache misses, the highly efficient page cache radix tree mechanism is used to resolve btree pointers, avoiding many file metadata accesses. A second consequence of storing directory indexes in files is that the same versioning mechanism that versions a file also versions a directory, so nothing special needs to be done to support namespace versioning in Tux3.

Scaling to large numbers of versions

The main worry is what happens as the number of versions becomes very large. Then a lot of metadata for unrelated versions may have to be loaded, searched and edited. A relatively balanced, symmetric version tree can be broken up into a number of subtrees. Sibling subtrees cannot possibly affect each other, so only O(log(subtrees)) subtrees need to be loaded and operated on for any given version.

What about scaling of a completely linear version chain? Then data is heavily inherited and thus compressed. What if data is heavily versioned and therefore not inherited much? Then we should store elements in stable sort order and binary search them, which works well in this case because not many parents have to be searched.

Filesystem expansion and shrinking

(What could possibly go wrong?)

Multiple Filesystems sharing the same Volume

This is just a matter of providing multiple inode btrees sharing the same free tree. Not much of a challenge, and somebody may have a need for it. Is there really any advantage if the volume manager and filesystem support on-demand expansion and shrinking?

Quotas

Linux quotas are derived from the "melbourne quotas" scheme from BSD. Userspace support and man pages are obtained by installing the "quota" package on Debian.

This includes documentation of the original BSD system, which is slightly different from the Linux incarnation, but the latter doesn't seem to be too well documented:

/usr/share/doc/quota/quotas.preformated.gz

The quota syscall interface is largely defined by include/linux/quota.h in the kernel source. The syscall is sys_quotactl:

http://lxr.linux.no/linux+v2.6.26/fs/quota.c#L365

The VFS handles the high level quota logic by methods that may be overridden by the filesystem, defined by struct quotactl_ops in quota.h. Ext3 only overrides the quota_on method, and then only to check that a valid quota file has been specified and issue dire warnings if the quota file is not in the root directory. Otherwise, Ext3 just specifies that the default VFS library quota methods are used, which call back to low level quota methods in the filesystem, specified by struct dquot_operations. These are:

http://lxr.linux.no/linux+v2.6.26/include/linux/quota.h#L286

initialize, drop, alloc_space, alloc_inode, free_space, free_inode,
transfer, write_dquot, acquire_dquot, release_dquot, mark_dirty, write_info

Sample implementations can be found here:

http://lxr.linux.no/linux+v2.6.26/fs/ext3/super.c#L2605

Confusingly, one finds "old" and "new" quota formats implemented there. I have not dug deeply into the history or implications of this distinction. The artist here appears to be Jan Kara of Suse.

The actual accounting is implemented in ext3/balloc.c. See pdquot_freed_blocks.

It looks like the vast majority of the quota implementation is already done for us, and we mainly need to worry about syncing the quota file efficiently. We can take advantage of logical forward logging here and record the allocation changes in the commit block, provided the quota is not too near a user or group hard limit, then roll the updates up into the quota file and sync it periodically.
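
The logical log record for such an allocation change could be as small as the sketch below; field names and sizes are assumptions.

    /* Sketch of a logical log record for a quota delta, to be written in
     * a commit block and rolled up into the quota file later.  Names and
     * sizes are assumptions. */
    #include <stdint.h>

    struct quota_delta {
        uint32_t id;     /* uid or gid, with a type flag kept elsewhere */
        int32_t  blocks; /* signed change in allocated blocks (or extent
                          * granules) for this id */
    };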

The quota file does not have to be versioned. Maybe. Hmm. Well, maybe users are going to get annoyed if they can't get back under quota because a snapshot is still hanging onto blocks they tried to free. OK, the quota file has to be versioned. Well, that makes it like any other file.

There is no notion of directory quotas anywhere to be seen, so we do not have to feel compelled to implement that. XFS apparently does implement directory quotas, so there is a way. It just has to be done without (much) help from the VFS and with no syscall interface.

The idea of inode quotas seems bogus to me because we have a 48 bit inode space. Who is going to use up 280 trillion inodes? I say we should just ignore inode quota processing and only account blocks. (With extents, "blocks" means "granularity of an extent".)

This should provide a reasonable orientation for somebody who wants to do the quota hookup once prototype filesystem code is available. Note: this would be a nice way to learn things about the VFS if you have never delved into this before, and not too challenging because most of the sweat has been expended at the VFS level (however rambling the implementation may be) and there are several existing implementations to use as models.

We do not have to wait for sample code in order to plan out the details of how the quota file is to be versioned, and how changes to the quota file sums should be logged efficiently.

New User Interfaces for Version Control

The standard method for accessing a particular version of a volume is via a version option on the mount command. But it is also possible to access file versions via several other methods, including a new variant of the open syscall with a version tag parameter.

"Version transport" allows the currently mounted version to be changed to some other. All open files will continue to access the version under which they were opened, but newly opened files will have the new version. This is the "Git Cache" feature.

Tux3 introduces the idea of a version link, much the same as a symlink, but carrying a version tag so that the file opened through it is some version other than the mounted one. Like symlinks, there is no requirement that the referenced object be valid. Version links do not introduce any inter-version consistency requirement, and are therefore robust. Unlike symlinks, version links are not followed by default. This makes it easy to implement a NetApp-like feature of having a hidden .snapshot subdirectory in each directory through which periodic snapshots can be accessed by a user.

Summary of data structures

Superblock

Metablocks

Inode table

Atime table

Free tree

Atom table

Forward log commit block

Directory Index

Implementation

Implementation work has begun. Much of the implementation consists of cut and pasted bits of code I have developed over the years, for example bits of HTree and ddsnap. The immediate goal is to produce a working prototype that cuts a lot of corners: block pointers instead of extents, an allocation bitmap instead of a free extent tree, linear search instead of indexed, and no atomic commit at all. Just enough to prove out the versioning algorithms and develop new user interfaces for version control.

The Tux3 project home is here:

http://tux3.org/

A mailing list is here:

http://tux3.org/cgi-bin/mailman/listinfo/tux3

All interested parties welcome. Hackers especially welcome.

Prototype code proving the versioning algorithms is here:

http://tux3.org/source/version.c

A Mercurial tree is coming soon.

[1] For the whole story: google "evil+patents+sighted"

[2] Copy on write versioning, which I had a hand in inventing.

[3] Linus won. A major design element of Git (the directory manifest) was due to me, and of course Graydon Hoare (google Quicksort) deserves more credit than anyone.

+++

From phillips at phunq.net  Thu Jul 24 13:26:27 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Thu, 24 Jul 2008 13:26:27 -0700
Subject: [Tux3] Comparison to Hammer fs design
Message-ID: <200807241326.28447.phillips@phunq.net>

I read Matt Dillon's Hammer filesystem design with interest:

   http://apollo.backplane.com/DFlyMisc/hammer01.pdf

Link kindly provided by pgquiles.  The big advantage Hammer has over 
Tux3 is, it is up and running and released in the Dragonfly distro.  
The biggest disadvantage is, it runs on BSD, not Linux, and it so 
heavily implements functionality that is provided by the VFS and block 
layer in Linux that a port would be far from trivial.  It will likely 
happen eventually, but probably in about the same timeframe that we can 
get Tux3 up and stable.

Tux3 is a simpler design than Hammer as far as I can see, and stays 
closer to our traditional ideas of how a filesystem behaves, for 
example there is no requirement for a background process to be 
continuously running through the filesystem reblocking it to recover 
space.  Though Tux3 does have the notion of followup metadata passes 
to "promote" logical forward log changes to physical changes to btree 
nodes etc.  However this does not have to be a daemon, it can just be 
something that happens every so many write transactions in the process 
context that did the write.  Avoiding daemons in filesystems is good - 
each one needs special attention to avoid deadlock, and they mess up 
the ps list, a minor but esthetic consideration.

Matt hit on a similar idea to versioned pointers, that is, his birth and 
death version numbers for disk records.  So we both saw independently 
that recursive copy on write as in WAFL, ZFS and Btrfs is suboptimal.

I found that only the birth version is actually required, simply because 
file data elements never actually die, they are only ever overwritten or 
truncated away.  Therefore, a subsequent birth always implies the death 
of the previous data element, and only the birth version has to be 
stored, which I simply call the "version".  Data element death by 
truncate is handled by the birth of a new (versioned) size attribute.

Eventually Matt should realize that too, and rev Hammer to improve its 
storage efficiency.  Oddly, Hammer only seems to support a linear chain 
of versions, whereas I have shown that with no increase in the size of 
metadata (except for the once-per-volume version tree) you can store 
writable versions with arbitrary parentage.  I think Matt should take 
note of that too and incorporate it in Hammer.

Some aspects of the Hammer design seem quite inefficient, so I wonder what he 
means when he says it performs really well.  In comparison to what?  
Well I don't have a lot more to say about that until Tux3 gets to the 
benchmark stage, and then we will be benchmarking mainly against Ext3, 
XFS and Btrfs.

Matt seems somewhat cavalier about running out of space on small 
volumes, whereas I think a filesystem should scale all the way from a 
handful of meg to at least terabytes and preferably petabytes.  The 
heavy use of a vacuum like reblocking process seems undesirable to me.  
I like my disk lights to go out as soon as the data is safely on the 
platter, not continue flashing for minutes or hours after every period 
of activity.  Admittedly, I did contemplate something similar for 
ddsnap, to improve write efficiency.  I now think that fragmentation 
can be held down to a dull roar without relying on a defragger, and 
that defragging should only be triggered at planned times by an 
administrator.  We will see what happens in practice.

Tux3 has a much better btree fanout than Hammer, 256 vs 64 for Hammer 
using the same size 4K btree index blocks.  Fanout is an important 
determinant of the K in O(log(N)) btree performance, which turns out to 
be very important when comparing different filesystems, all of which 
theoretically are Log(N), but some of which have an inconveniently 
large K (ZFS comes to mind).  I always try to make the fanout as high 
as possible in my btree, which for example is a major reason that the 
HTree index for Ext3 performs so well.

Actually, I think I can boost the Tux3 inode table btree fanout up to 
512 by having a slightly different format for the next-to-terminal 
inode table index blocks with a 16 bit inum and 48 bit leaf block, because 
at the near-terminal index nodes the inum space is already divided down 
to a small range.

More comments about Hammer later as I learn more about it.

Regards,

Daniel




From phillips at phunq.net  Fri Jul 25 02:39:10 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Fri, 25 Jul 2008 02:39:10 -0700
Subject: [Tux3] Five kinds of btrees
Message-ID: <200807250239.10215.phillips@phunq.net>

Me thinking out loud, which I tend to do better when I know people will 
be reading it...

Tux3 has five different kinds of btrees at the moment, all to be 
accessed by the same generic btree code.  I thought I would list them 
now and consider the differences, as a prelude to writing the various 
methods needed to specialize to the different variants.

I tend to be obsessive about packing things into even binary sized 
chunks if I can, thinking that processors like to operate on data that 
way.  It probably does not matter a whole lot, but it satisfies me 
esthetically and if it gains a tiny amount of speed, that is nice.
Well, maybe it does actually matter.  A whole lot of cpu time is spent 
binary searching btree index blocks, and if the index entries line up 
on nice boundaries then the L1 cache hit rate goes up.

A more important consideration is for the btree index node entries to be
compact and achieve a high fanout.  Except for directory btrees (which 
are mapped into the logical blocks of files and can therefore get away 
with 24 bit logical block pointers) Tux3 index nodes have 16 bytes per 
entry, generally consisting of a 48 bit key, 48 bit address of a lower 
level index node or leaf, and free space hints to round things out to 
even numbers.

In two cases (file index and atime table) I could not think of anything 
useful to round out the twelve bytes of essential payload, so I call 
that unused and probably will eventually relax my obsession with binary 
sized data and make those entries 12 bytes long to improve the fanout 
by 25%.

Kinds of btrees, in pidgin C:

inode table
  node: { inum:48 node:48 freehint:32 }[] // 16 bytes/entry
  node: { inum:10 node:48 freehint:6 }[] // 8 bytes/entry variant
  leaf: attributes

file index
  node: { offset:48 node:48 unused:32 }[] // 16 bytes/entry... maybe 12?
  leaf: { block:48 count:6 version:10 }[] // 8 bytes/entry

free extents
  node: { block:48 node:48 freehint:32 }[] // 16 bytes/entry
  leaf: { block:48 count:16 }[] // 8 bytes/entry

atime table
  node: { inum:48 node:48 unused:32 }[] // 16 bytes/entry... maybe 12?
  leaf: { version:10 ablock:48 unused:6 }[] // 8 bytes/entry
  ablock: { date:32 }[] // 4 bytes

directory
  node: { hash:31 cont:1 node:32 }[] // 8 bytes
  leaf: { hash:31 dblock:32 }[] // 8 bytes
  dblock: ext2 dirents

For those who have not lived and breathed btrees (most of us) I should 
state the definition of a btree node: N block pointers separated by N-1 
keys.  There is one less key than block pointers because there is 
always a higher level btree block that supplies the key that separates 
two btree nodes.  Anyway, a vector of [pointer, key] pairs makes a 
btree index, with one of the keys being unused.  I sometimes fiendishly 
coopt that unused field for something else, even though this saves 
only .2% of the space in a block.  I will try to resist that temptation 
this time, and maybe even always fill in that unused field with a copy 
of the higher level key and use it for an internal cross check.
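
In other words, searching an index node is a binary search over that vector; here is a sketch, using the 16 byte entry shape of the pidgin C above and assumed names, not the actual Tux3 code.

    /* Sketch: binary search of a btree index node laid out as a vector
     * of [key, pointer] entries per the definition above. */
    #include <stdint.h>

    struct index_entry {
        uint64_t key;   /* only the low 48 bits are meaningful */
        uint64_t block; /* low 48 bits: address of child node or leaf */
    };

    struct index_node {
        uint32_t count; /* entries in use, at least 1 */
        uint32_t level; /* 0 means the children are leaf blocks */
        struct index_entry entries[254]; /* fits in a 4K block */
    };

    /* Return the child covering "key": the last entry whose key is not
     * greater than key.  Entry zero plays the role of the unused key. */
    uint64_t descend(const struct index_node *node, uint64_t key)
    {
        uint32_t lo = 0, hi = node->count;
        while (hi - lo > 1) {
            uint32_t mid = (lo + hi) / 2;
            if (node->entries[mid].key <= key)
                lo = mid;
            else
                hi = mid;
        }
        return node->entries[lo].block;
    }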

Free hints give an idea of the density and maximum size of the free 
space on the volume in the case of the free tree, or density in case of 
the inode table.  This allows me to avoid having a secondary structure 
like Ext3's inode bitmap in order to locate free inodes.  Instead we 
look in a part of the inode table with a suitable density, and search 
through the leaf block for the free inode we know is there.

Free space in PHTree directories is handled by an entirely different 
method described in the original Tux3 post (which I should eventually 
repost here with improvements).  The free hint idea cannot be used here 
because there are potentially multiple terminal index blocks referring 
to any given dirent block, and no way to find the referrers other than 
the one we used to reach the block.  Fortunately, the non-sparse 
logical space of the directory file lends itself to mapping by a radix 
tree, which as I mentioned earlier can map freespace for about half a 
million dirents with a single 4K block, lazily updated.

There are two different kinds of index nodes for the inode btree: the 16 
byte kind and the 8 byte kind, the latter being useful deeper in a huge 
inode table where the btree indexes have cut the inode space up pretty 
small so that the low ten bits or so of the inum are a unique index.  
The free hint can also be smaller.  The thing is, there should end up 
being a lot more of the eight byte kind of inode index node because 
they are on the bushy end of the tree, so this should be a significant 
time and space saver and well worth the little bit of extra code to 
implement it.

The file index btree is almost as important to optimize as the inode 
table btree, because while there may not be many huge files, they are 
often things like databases and virtual machine root filesystems that 
require good performance.  So I try to keep the terminal index extents 
to 8 bytes including the version, which means that any gaps between 
extents for a particular version have to be filled in explicitly with 
extents of empty space.  This should be the rare case.  I don't know 
how well this is going to work out.  It would obviously be easier to 
put a logical offset in each versioned extent.  I will probably try to 
code it the 8 byte way and see if the problem yields easily.

The atime table is an unwanted stepchild of ancient unix.  It is hard to 
feel much love for it, but there it is, it has its own unique structure 
because the atimes have to be versioned.  There is not much point in 
packing atimes into the leaf blocks like attributes are packed into the 
inode table.  The leaves of the atime index tree are actually 
nonterminals because they reference blocks of atime dates.  Hmm, I 
suppose those could be versioned extents of atimes, which would make 
the atime index really small but then raises the tough question of how 
big to make the extent when somebody accesses an inode.  For now, no 
extents there.

I might as well specify the structures for the non-extent prototype 
while I am in here.

inode table
  node: { inum:48 node:48 }[]
  leaf: attributes

file index
  node: { offset:48 node:48 }[]
  leaf: { block:48 version:10 }[]

free extents
  use bitmaps instead

atime table
  forget it

directory
  straight up ext2 dirent blocks, linearly searched

Considerably simpler, no?  And yet it should prove the concept of this 
filesystem versioning method nicely.

The prototype will lift the btree leaf code for file indexes directly 
from ddsnap, including the leaf format that I have come to hate, but 
does the job.  I will probably improve the leaf format shortly after 
and backport that into ddsnap.

It seems kind of odd, but Ext2 dirent blocks may well end up as the 
final directory leaf format for Tux3 because the prime requirement is 
that dirents not move around (physical stability) which lets out all 
the fancy in-leaf indexing schemes I know of.  It would be kind of nice 
if that little piece of Ext2 survived, because Tux3 pretty much tosses 
out all the rest of it.

The inode table leaf format is new territory, somewhat described in the 
Tux3 project announcement.  The details are worth a post in their own 
right.

The generic btree will probably get coded over the weekend, including 
methods for the inode table and file index, all in userspace for the 
time being.

Regards,

Daniel

From phillips at phunq.net  Fri Jul 25 15:13:58 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Fri, 25 Jul 2008 15:13:58 -0700
Subject: [Tux3] Comparison to Hammer fs design
Message-ID: <200807251513.59912.phillips@phunq.net>

On Friday 25 July 2008 11:53, Matthew Dillon wrote:
> 
> :Hi;
> :
> :The announcement of yet another filesystem:
> :
> :http://lkml.org/lkml/2008/7/23/257
> :
> :led to some comments about hammer fs:
> :
> :http://tux3.org/pipermail/tux3/2008-July/000006.html
> :
> :enjoy,
> :
> :     Pedro.
> 
>     Those are interesting comments.   I think I found Daniel's email address
>     so I am adding him to the To:   Dan, feel free to post this on your Tux
>     groups if you want.

How about a cross-post?
 
>     I did consider multiple-parentage...  that is the ability to have a
>     writable snapshot that 'forks' the filesystem history.  It would be
>     an ultra cool feature to have but I couldn't fit it into the B-Tree
>     model I was using.  Explicit snapshotting would be needed to make it
>     work, and the snapshot id would have to be made part of the B-Tree key,
>     which is fine.  HAMMER is based around implicit snapshots (being able
>     to do an as-of historical lookup without having explicitly snapshotted
>     bits of the history).

Yes, that is the main difference indeed, essentially "log everything" vs
"commit" style versioning.  The main similarity is the lifespan oriented
version control at the btree leaves.

>     I would caution against using B-Tree iterations 
>     in historical lookups, B-Trees are only fast if you can pretty much
>     zero in on the exact element you want as part of the primary search
>     mechanic.  Once you have to iterate or chain performance goes out the
>     window.  I know this because there are still two places where HAMMER
>     sometimes has to iterate due to the inexact nature of an as-of lookup.
>     Multiple-parentage almost certainly require two inequality searches,
>     or one inequality search and an iteration.  A single timeline only
>     requires one inequality search.

Once I get down to the leaf level I binary search on logical address in
the case of a file index btree or on version in the case of an inode
table block, so this cost is still Log(N) with a small k.  For a
heavily versioned inode or file region this might sometimes result in
an overflow block or two that has to be linearly searched which is not
a big problem so long as it is a rare case, which it really ought to be
for the kinds of filesystem loads I have seen.  A common example of a
worst case is /var/log/messages, where the mtime and size are going to
change in pretty much every version, so if you have hourly snapshots
and hold them for three months it adds up to about 2200 16 byte inode
table attributes to record, about 8 inode table leaf blocks.  I really
do not see that as a problem.  If it goes up to 100 leaf blocks to
search then that could be a problem.

Usually, I will only be interested in the first and last of those
blocks, the first holding the rarely changing attributes including the
root of the file index btree and the last holding the most recent
incarnation of the mtime/size attribute.

The penultimate inode table index block tells me how many blocks a
given inode lives in because several blocks will have the same inum
key.  So the lookup algorithm for a massively versioned file becomes:
1) read the first inode table block holding that inum; 2) read the last
block with the same inum.  The latter operation only needs to consult
the immediate parent index block, which is locked in the page cache at
that point.

It will take a little care and attention to the inode format,
especially the inode header, to make sure that I can reliably do that
first/last probe optimization for the "head" version, but it does seem
worth the effort.  For starters I will just do a mindless linear walk
of an "overflow" inode and get fancy if that turns out to be a problem.

>     I couldn't get away from having a delete_tid (the 'death version
>     numbers').  I really tried :-)  There are many cases where one is only
>     deleting, rather then overwriting.

By far the most common case I would think.  But check out the versioned
pointer algorithms.  Surprisingly that just works, which is not exactly
obvious:

   Versioned pointers: a new method of representing snapshots
   http://lwn.net/Articles/288896/

I was originally planning to keep all versions of a truncate/rewrite
file in the same file index, but recently I realized that that is dumb
because there will never be any file data shared by a successor version
in that case.  So the thing to do is just create an entirely new
versioned file data attribute for each rewrite, bulking up the inode
table entry a little but greatly constraining the search for versions
to delete and reducing cache pressure by not loading unrelated version
data when traversing a file.

>     Both numbers are then needed to 
>     be able to properly present a historical view of the filesystem.
>     For example, when one is deleting a directory entry a snapshot after
>     that deletion must show the directory entry gone.

...which happens in Tux3 courtesy of the fact that the entire block
containing the dirent will have been versioned, with the new version
showing the entry gone.  Here is one of two places where I violate my
vow to avoid copying an entire block when only one data item in it
changes (the other being the atime table).  I rely on two things to
make this nice: 1) Most dirent changes will be logically logged and
only rolled up into the versioned file blocks when there are enough to
be reasonably sure that each changed directory block will be hit
numerous times in each rollup episode.  (Storing the directory blocks
in dirent-create order as in PHTree makes this very likely for mass
deletes.)  2) When we care about this is usually during a mass delete,
where most or all dirents in each directory file block are removed
before moving on to the next block.

>     Both numbers are 
>     also needed to be able to properly prune out unwanted historical data
>     from the filesystem.  HAMMER's pruning algorithm (cleaning out old  
>     historical data which is no longer desired) creates holes in the
>     sequence so once you start pruning out unwanted historical data
>     the delete_tid of a prior record will not match the create_tid of the
>     following one (historically speaking).

Again, check out the versioned pointer algorithms.  You can tell what
can be pruned just by consulting the version tree and the create_tids
(birth versions) for a particular logical address.  Maybe the hang is
that you do not organize the btrees by logical address (or inum in the
case of the inode table tree).  I thought you did but have not read
closely enough to be sure.

>     At one point I did collapse the holes, rematching the delete_tid with
>     the create_tid of the following historical record, but I had to remove
>     that code because it seriously complicated the mirroring implementation.
>     I wanted each mirror to be able to have its own historical retention
>     policy independant of the master.  e.g. so you could mirror to a backup
>     system which would retain a longer and more coarse-grained history then
>     the production system.

Fair enough.  I have an entirely different approach to what you call
mirroring and what I call delta replication.  (I reserve the term
mirroring to mean mindless duplication of physical media writes.)  This
method proved out well in ddsnap:

   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149257;hb=fc5cb496fff10a2b03034fcf95122f5828149257

(Sorry about the massive URL, you can blame Linus for that;-)

What I do in ddsnap is compute all the blocks that differ between two
versions and apply those to a remote volume already holding the first
of the two versions, yielding a replica of the second version that is
logically but not physically identical.  The same idea works for a
versioned filesystem: compute all the leaf data that differs between
two versions, per inum, and apply the resulting delta to the
corresponding inums in the remote replica.  The main difference vs a
ddsnap volume delta is that not all of the changes are physical blocks,
there are also changed inode attributes, so the delta stream format
has to be elaborated accordingly.
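
In pidgin C, the ddsnap delta idea reduces to something like this; every helper here is an assumed name for illustration, not the real ddsnap or Tux3 interface.

    /* Conceptual sketch of delta replication: send every chunk that
     * differs between versions "a" and "b" to a replica that already
     * holds "a".  All helpers are assumed names. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t chunk_t;

    bool chunk_differs(chunk_t chunk, unsigned a, unsigned b); /* assumed */
    void send_chunk(chunk_t chunk, unsigned b);                /* assumed */

    void send_delta(chunk_t nr_chunks, unsigned a, unsigned b)
    {
        for (chunk_t chunk = 0; chunk < nr_chunks; chunk++)
            if (chunk_differs(chunk, a, b))
                send_chunk(chunk, b); /* replica applies this to its copy of a */
    }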

>     I also considered increasing the B-Tree fan-out to 256 but decided
>     against it because insertions and deletions really bloated up the
>     UNDO FIFO.  Frankly I'm still undecided as to whether that was a good 
>     idea, I would have prefered 256.  I can actually compile in 256 by
>     changing a #define, but I released with 64 because I hit up against
>     a number of performance issues:  bcopy() overheads for insertions
>     and deletions in certain tests became very apparent.  Locking
>     conflicts became a much more serious issue because I am using whole-node
>     locks rather then element locks.  And, finally, the UNDO records got
>     really bloated.  I would need to adjust the node locking and UNDO
>     generation code significantly to remove the bottlenecks before I could
>     go to a 256-element B-Tree node.

I intend to log insertions and deletions logically, which keeps each
down to a few bytes until a btree rollup episode comes along to perform
updating of the btree nodes in bulk.  I am pretty sure this will work
for you as well, and you might want to check out that forward logging
trick.

That reminds me, I was concerned about the idea of UNDO records vs
REDO.  I hope I have this right: you delay acknowledging any write
transaction to the submitter until log commit has progressed beyond the
associated UNDO records.  Otherwise, if you acknowledge, crash, and
prune all UNDO changes, then you would end up with applications
believing that they had got things onto stable storage and be wrong
about that.  I have no doubt you did the right thing there, but it is
not obvious from your design documentation.

>     HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
>     another limiting factor for the fan-out I can use.  My elements
>     are 64 bytes each.

Yes, I mostly have 16 byte elements and am working on getting most of
them down to 12 or 8.

>     64x64 = 4K per B-Tree node.  I decided to go 
>     with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
>     64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
>     other auxillary info.  That's instead of using a radix-compressed key.
>     Radix compressed keys would have doubled the complexity of the B-Tree
>     code, particularly with the middle-of-tree pointer caching that
>     HAMMER does.

I use two-stage lookup, or four stage if you count searches within
btree blocks.  This makes the search depth smaller in the case of small
files, and in the case of really huge files it adds depth exactly as
appropriate.  The index blocks end up cached in the page cache (the
buffer cache is just a flavor of page cache in recent Linux) so there
is little repeated descending through the btree indices.  Instead, most
of the probing is through the scarily fast radix tree code, which has
been lovingly optimized over the years to avoid cache line misses and
SMP lock contention.

I also received a proposal for a "fat" btree index from a collaborator
(Maciej) that included the file offset but I really did not like the...
fattening.  A two level btree yields less metadata overall which in my
estimation is more important than saving some btree probes.  The way 
things work these days, falling down from level 1 cache to level 2 or
from level 2 to level 3 costs much more than executing some extra CPU
instructions.  So optimization strategy inexorably shifts away from
minimizing instructions towards minimizing cache misses.

>     The historical access mechanic alone added major complexity to the
>     B-Tree algorithms.  I will note here that B-Tree searches have a very
>     high cpu overhead no matter how you twist it, and being able to cache
>     cursors within the B-Tree is absolutely required if you want the
>     filesystem to perform well.  If you always start searches at the root
>     your cpu overhead will be horrendous... so plan on caching cursors
>     from the get-go.

Fortunately, we get that for free on Linux, courtesy of the page cache
radix trees :-)

I might eventually add some explicit cursor caching, but various
artists over the years have noticed that it does not make as much
difference as you might think.

>     If I were to do radix compression I would also want to go with a
>     fully dynamic element size and fully dynamic fan-out in order to
>     best-compress each B-Tree node.  Definitely food for thought.

Indeed.  But Linux is braindamaged about large block size, so there is
very strong motivation to stay within physical page size for the
immediate future.  Perhaps if I get around to a certain hack that has
been perenially delayed, that situation will improve:

   "Variable sized page objects"
   http://lwn.net/Articles/37795/

>     I'd love to do something like that.  I think radix compression would
>     remove much of the topological bloat the B-Tree creates verses using
>     blockmaps, generally speaking.

Topological bloat?

>     Space management is currently HAMMER's greatest weakness, but it only
>     applies to small storage systems.  Several things in HAMMER's design
>     are simply not condusive to a small storage.  The storage model is not
>     fine-grained and (unless you are deleting a lot of stuff) needs
>     reblocking to actually recover the freed space.  The flushing algorithms
>     need around 100MB of UNDO FIFO space on-media to handle worst-case
>     dependancies (mainly directory/file visibility issues on crash recovery),
>     and the front-end caching of modifying operations, since they don't
>     know exactly how much actual storage will be needed, need ~16MB of
>     wiggle room to be able to estimate the on-media storage required to
>     back the operations cached by the front-end.  Plus, on top of that,
>     any sort of reblocking also needs some wiggle room to be able to clean
>     out partially empty big-blocks and fill in new ones.


>     On the flip side, the reblocker doesn't just de-fragment the filesystem,
>     it is also intended to be used for expanding and contracting partitions,
>     and adding removing volumes in the next release.  Definitely a
>     multi-purpose utility.

Good point.  I expect Tux3 will eventually have a reblocker (aka
defragger).  There are some really nice things you can do, like:

  1) Set a new version so logical changes cease for the parent
     version.

  2) We want to bud off a given directory tree into its own volume,
     so start by deleting the subtree in the current version.  If
     any link counts in the directory tree remain nonzero in the
     current version then there are hard links into the subtree, so
     fail now and drop the new version.

  3) Reblock a given region of the inode table tree and all the files
     in it into one physically contiguous region of blocks

  4) Add a free tree and other once-per volume structures to the new
     home region.

  5) The new region is now entirely self contained and even keeps its
     version history.  At the volume manager level, map it to a new,
     sparse volume that just has a superblock in the zeroth extent and
     the new, mini filesystem at some higher logical address.  Remap
     the region in the original volume to empty space and add the
     empty space to the free tree.

  6) Optionally reblock the newly budded filesystem to the base of the
     new volume so utilities that do not play well with sparse
     volumes do not do silly things.

>     So I'm not actually being cavalier about it, its just that I had to
>     make some major decisions on the algorithm design and I decided to
>     weight the design more towards performance and large-storage, and
>     small-storage suffered a bit.

Cavalier was a poor choice of words, the post was full of typos as well
so I am not proud of it.  You are solving a somewhat different problem
and you have code out now, which is a huge achievement.  Still I think
you can iteratively improve your design using some of the techniques I
have stumbled upon.  There are probably some nice tricks I can get from
your code base too once I delve into it.

>     In anycase, it sounds like Tux3 is using many similar ideas.  I think
>     you are on the right track.

Thank you very much.  I think you are on the right track too, which you 
have a rather concrete way of proving.

>     I will add one big note of caution, drawing 
>     from my experience implementing HAMMER, because I think you are going
>     to hit a lot of the same issues.
> 
>     I spent 9 months designing HAMMER and 9 months implementing it.  During
>     the course of implementing it I wound up throwing away probably 80% of
>     the original design outright.  Amoung the major components I had to
>     rewrite were the LOG mechanic (which ultimately became the meta-data
>     UNDO FIFO), and the fine-grained storage mechanic (which ultimately
>     became coarse-grained).  The UNDO FIFO was actually the saving grace,
>     once that was implemented most of the writer-ordering dependancies went
>     away (devolved into just flushing meta-data buffers after syncing the
>     UNDO FIFO)... I suddenly had much, much more freedom in designing
>     the other on-disk structures and algorithms.
>
>     What I found while implementing HAMMER was that the on-disk topological
>     design essentially dictated the extent of HAMMER's feature set AND
>     most of its deficiencies (such as having to reblock to recover space),
>     and the algorithms I chose often dictated other restrictions.  But the
>     major restrictions came from the on-disk structures.
> 
>     Because of the necessarily tight integration between subsystems I found
>     myself having to do major redesigns during the implementation phase.
>     Fixing one subsystem created a cascade effect that required tweaking other
>     subsystems.  Even adding new features, such as the mirroring, required
>     significant changes in the B-Tree deadlock recovery code.  I couldn't get
>     away from it.  Ultimately I chose to simplify some of the algorithms
>     rather then have to go through another 4 months of rewriting.  All
>     major designs are an exercise in making trade-offs in topology, feature
>     set, algorithmic complexity, debuggability, robustness, etc.  The list
>     goes on forever.
> 

The big ahas! that eliminated much of the complexity in the Tux3 design
were:

  * Forward logging - just say no to incomplete transactions
  * Version weaving - just say no to recursive copy on write

Essentially I have been designing Tux3 for ten years now and working
seriously on the simplifying elements for the last three years or so,
either entirely on paper or in related work like ddsnap and LVM3.

One of the nice things about moving on from design to implementation of
Tux3 is that I can now background the LVM3 design process and see what
Tux3 really wants from it.  I am determined to match every checkbox
volume management feature of ZFS as efficiently or more efficiently,
without violating the traditional layering between filesystem and
block device, and without making LVM3 a Tux3-private invention.

>     Laters!

Hopefully not too much later.  I find this dialog very fruitful.  I just
wish such dialog would occur more often at the design/development stage
in Linux and other open source work instead of each group obsessively
ignoring all "competing" designs and putting huge energy into chatting
away about the numerous bugs that arise from rushing their design or
ignoring the teachings of history.

Regards,

Daniel

From phillips at phunq.net  Fri Jul 25 15:54:05 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Fri, 25 Jul 2008 15:54:05 -0700
Subject: [Tux3] Comparison to Hammer fs design
In-Reply-To: <200807251853.m6PIrZY1015569@apollo.backplane.com>
References: <4889fdb4$0$849$415eb37d@crater_reader.dragonflybsd.org>
        <200807251853.m6PIrZY1015569@apollo.backplane.com>
Message-ID: <200807251554.05902.phillips@phunq.net>

(resent after subscribing to the the kernel at crater)

On Friday 25 July 2008 11:53, Matthew Dillon wrote:
> 
> :Hi;
> :
> :The announcement of yet another filesystem:
> :
> :http://lkml.org/lkml/2008/7/23/257
> :
> :led to some comments about hammer fs:
> :
> :http://tux3.org/pipermail/tux3/2008-July/000006.html
> :
> :enjoy,
> :
> :     Pedro.
> 
>     Those are interesting comments.   I think I found Daniel's email address
>     so I am adding him to the To:   Dan, feel free to post this on your Tux
>     groups if you want.

How about a cross post?
 
>     I did consider multiple-parentage...  that is the ability to have a
>     writable snapshot that 'forks' the filesystem history.  It would be
>     an ultra cool feature to have but I couldn't fit it into the B-Tree
>     model I was using.  Explicit snapshotting would be needed to make it
>     work, and the snapshot id would have to be made part of the B-Tree key,
>     which is fine.  HAMMER is based around implicit snapshots (being able
>     to do an as-of historical lookup without having explicitly snapshotted
>     bits of the history).

Yes, that is the main difference indeed, essentially "log everything" vs
"commit" style versioning.  The main similarity is the lifespan oriented
version control at the btree leaves.

>     I would caution against using B-Tree iterations 
>     in historical lookups, B-Trees are only fast if you can pretty much
>     zero in on the exact element you want as part of the primary search
>     mechanic.  Once you have to iterate or chain performance goes out the
>     window.  I know this because there are still two places where HAMMER
>     sometimes has to iterate due to the inexact nature of an as-of lookup.
>     Multiple-parentage almost certainly require two inequality searches,
>     or one inequality search and an iteration.  A single timeline only
>     requires one inequality search.

Once I get down to the leaf level I binary search on logical address in
the case of a file index btree or on version in the case of an inode
table block, so this cost is still Log(N) with a small k.  For a
heavily versioned inode or file region this might sometimes result in
an overflow block or two that has to be linearly searched which is not
a big problem so long as it is a rare case, which it really ought to be
for the kinds of filesystem loads I have seen.  A common example of a
worst case is /var/log/messages, where the mtime and size are going to
change in pretty much every version, so if you have hourly snapshots
and hold them for three months it adds up to about 2200 16 byte inode
table attributes to record, about 8 inode table leaf blocks.  I really
do not see that as a problem.  If it goes up to 100 leaf blocks to
search then that could be a problem.

Usually, I will only be interested in the first and last of those
blocks, the first holding the rarely changing attributes including the
root of the file index btree and the last holding the most recent
incarnation of the mtime/size attribute.

The penultimate inode table index block tells me how many blocks a
given inode lives in because several blocks will have the same inum
key.  So the lookup algorithm for a massively versioned file becomes:
1) read the first inode table block holding that inum; 2) read the last
block with the same inum.  The latter operation only needs to consult
the immediate parent index block, which is locked in the page cache at
that point.

It will take a little care and attention to the inode format,
especially the inode header, to make sure that I can reliably do that
first/last probe optimization for the "head" version, but it does seem
worth the effort.  For starters I will just do a mindless linear walk
of an "overflow" inode and get fancy if that turns out to be a problem.

>     I couldn't get away from having a delete_tid (the 'death version
>     numbers').  I really tried :-)  There are many cases where one is only
>     deleting, rather then overwriting.

By far the most common case I would think.  But check out the versioned
pointer algorithms.  Surprisingly that just works, which is not exactly
obvious:

   Versioned pointers: a new method of representing snapshots
   http://lwn.net/Articles/288896/

I was originally planning to keep all versions of a truncate/rewrite
file in the same file index, but recently I realized that that is dumb
because there will never be any file data shared by a successor version
in that case.  So the thing to do is just create an entirely new
versioned file data attribute for each rewrite, bulking up the inode
table entry a little but greatly constraining the search for versions
to delete and reducing cache pressure by not loading unrelated version
data when traversing a file.

>     Both numbers are then needed to 
>     be able to properly present a historical view of the filesystem.
>     For example, when one is deleting a directory entry a snapshot after
>     that deletion must show the directory entry gone.

...which happens in Tux3 courtesy of the fact that the entire block
containing the dirent will have been versioned, with the new version
showing the entry gone.  Here is one of two places where I violate my
vow to avoid copying an entire block when only one data item in it
changes (the other being the atime table).  I rely on two things to
make this nice: 1) Most dirent changes will be logically logged and
only rolled up into the versioned file blocks when there are enough to
be reasonably sure that each changed directory block will be hit
numerous times in each rollup episode.  (Storing the directory blocks
in dirent-create order as in PHTree makes this very likely for mass
deletes.)  2) When we care about this is usually during a mass delete,
where most or all dirents in each directory file block are removed
before moving on to the next block.
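
To put a number on "logically logged": a dirent change can be recorded
in something like the following sketch (made-up layout, not the Tux3
on-disk format), a few dozen bytes versus a 4K block copy, and rollup
replays many such records against each dirty directory block in one
pass:

    #include <stdint.h>

    enum { LOG_DIRENT_ADD = 1, LOG_DIRENT_DEL = 2 };

    struct dirent_logrec {
            uint8_t  type;          /* add or delete */
            uint8_t  namelen;
            uint16_t pad;
            uint32_t version;       /* version the change belongs to */
            uint64_t dir_inum;      /* directory being changed */
            uint64_t target_inum;   /* inode the entry points at (add) */
            char     name[];        /* namelen bytes, not null terminated */
    };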

>     Both numbers are 
>     also needed to be able to properly prune out unwanted historical data
>     from the filesystem.  HAMMER's pruning algorithm (cleaning out old  
>     historical data which is no longer desired) creates holes in the
>     sequence so once you start pruning out unwanted historical data
>     the delete_tid of a prior record will not match the create_tid of the
>     following one (historically speaking).

Again, check out the versioned pointer algorithms.  You can tell what
can be pruned just by consulting the version tree and the create_tids
(birth versions) for a particular logical address.  Maybe the hang-up is
that you do not organize the btrees by logical address (or inum in the
case of the inode table tree).  I thought you did but have not read
closely enough to be sure.

>     At one point I did collapse the holes, rematching the delete_tid with
>     the create_tid of the following historical record, but I had to remove
>     that code because it seriously complicated the mirroring implementation.
>     I wanted each mirror to be able to have its own historical retention
>     policy independant of the master.  e.g. so you could mirror to a backup
>     system which would retain a longer and more coarse-grained history then
>     the production system.

Fair enough.  I have an entirely different approach to what you call
mirroring and what I call delta replication.  (I reserve the term
mirroring to mean mindless duplication of physical media writes.)  This
method proved out well in ddsnap:

   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149257;hb=fc5cb496fff10a2b03034fcf95122f5828149257

(Sorry about the massive URL, you can blame Linus for that;-)

What I do in ddsnap is compute all the blocks that differ between two
versions and apply those to a remote volume already holding the first
of the two versions, yielding a replica of the second version that is
logically but not physically identical.  The same idea works for a
versioned filesystem: compute all the leaf data that differs between
two versions, per inum, and apply the resulting delta to the
corresponding inums in the remote replica.  The main difference vs a
ddsnap volume delta is that not all of the changes are physical blocks;
there are also changed inode attributes, so the delta stream format
has to be elaborated accordingly.
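
A sketch of what that elaborated delta stream might carry (record
layout and names are illustrative, not a ddsnap or Tux3 wire format):

    #include <stdint.h>

    enum delta_kind { DELTA_ATTRS = 1, DELTA_EXTENT = 2 };

    struct delta_rec {
            uint32_t kind;       /* changed attributes or changed data */
            uint32_t length;     /* payload bytes following this header */
            uint64_t inum;       /* inode the change belongs to */
            uint64_t offset;     /* file offset, for DELTA_EXTENT */
            /* payload: packed attributes, or literal extent data */
    };

    /* Producer: for each inum whose leaf data differs between version
     * A and version B, emit one DELTA_ATTRS record if any inode
     * attributes changed and one DELTA_EXTENT per changed extent.
     * Consumer: apply the records to the same inums in a replica that
     * already holds A, yielding a logically identical copy of B. */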

>     I also considered increasing the B-Tree fan-out to 256 but decided
>     against it because insertions and deletions really bloated up the
>     UNDO FIFO.  Frankly I'm still undecided as to whether that was a good 
>     idea, I would have prefered 256.  I can actually compile in 256 by
>     changing a #define, but I released with 64 because I hit up against
>     a number of performance issues:  bcopy() overheads for insertions
>     and deletions in certain tests became very apparent.  Locking
>     conflicts became a much more serious issue because I am using whole-node
>     locks rather then element locks.  And, finally, the UNDO records got
>     really bloated.  I would need to adjust the node locking and UNDO
>     generation code significantly to remove the bottlenecks before I could
>     go to a 256-element B-Tree node.

I intend to log insertions and deletions logically, which keeps each
down to a few bytes until a btree rollup episode comes along to perform
updating of the btree nodes in bulk.  I am pretty sure this will work
for you as well, and you might want to check out that forward logging
trick.
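
The rollup itself can be as simple as this sketch (helper names are
made up): sort the accumulated logical records by target btree block so
each dirty block is rewritten once, no matter how many individual
insertions and deletions landed on it:

    #include <stdint.h>

    struct block;
    struct logrec {
            uint64_t target;     /* btree block this record applies to */
            /* packed key and element payload follow */
    };

    /* Assumed helpers. */
    void sort_by_target(struct logrec *rec, int count);
    struct block *read_block(uint64_t target);
    uint64_t block_index(struct block *b);
    void apply_record(struct block *b, struct logrec *rec);
    void mark_dirty(struct block *b);   /* written once with the next commit */

    void rollup(struct logrec *rec, int count)
    {
            sort_by_target(rec, count);
            for (int i = 0; i < count; ) {
                    struct block *b = read_block(rec[i].target);
                    while (i < count && rec[i].target == block_index(b))
                            apply_record(b, &rec[i++]);
                    mark_dirty(b);
            }
    }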

That reminds me, I was concerned about the idea of UNDO records vs
REDO.  I hope I have this right: you delay acknowledging any write
transaction to the submitter until log commit has progressed beyond the
associated UNDO records.  Otherwise, if you acknowledge, crash, and
prune all UNDO changes, then you would end up with applications
believing that they had got things onto stable storage and be wrong
about that.  I have no doubt you did the right thing there, but it is
not obvious from your design documentation.

>     HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
>     another limiting factor for the fan-out I can use.  My elements
>     are 64 bytes each.

Yes, I mostly have 16 byte elements and am working on getting most of
them down to 12 or 8.

>     64x64 = 4K per B-Tree node.  I decided to go 
>     with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
>     64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
>     other auxillary info.  That's instead of using a radix-compressed key.
>     Radix compressed keys would have doubled the complexity of the B-Tree
>     code, particularly with the middle-of-tree pointer caching that
>     HAMMER does.

I use two-stage lookup, or four stage if you count searches within
btree blocks.  This makes the search depth smaller in the case of small
files, and in the case of really huge files it adds depth exactly as
appropriate.  The index blocks end up cached in the page cache (the
buffer cache is just a flavor of page cache in recent Linux) so there
is little repeated descending through the btree indices.  Instead, most
of the probing is through the scarily fast radix tree code, which has
been lovingly optimized over the years to avoid cache line misses and
SMP lock contention.
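
In sketch form (btree_probe and the attribute accessors are assumed
helpers with made-up names, and the index blocks they read are normally
already sitting in the page cache):

    #include <stdint.h>

    struct btree;                            /* itable or file index tree */
    struct vextent;                          /* versioned extent, as above */
    struct cursor { void *leaf; int slot; }; /* position in a btree leaf */

    /* Assumed helpers. */
    int btree_probe(struct btree *tree, uint64_t key, struct cursor *at);
    struct btree *get_file_btree(struct cursor *ileaf, uint32_t version);
    int get_extent(struct cursor *dleaf, uint64_t offset, struct vextent *out);

    int map_file_block(struct btree *itable, uint64_t inum, uint32_t version,
                       uint64_t offset, struct vextent *out)
    {
            struct cursor ileaf, dleaf;

            /* Stage one: inode table btree, keyed by inum, finds the
             * versioned attributes, among them the file index root. */
            if (btree_probe(itable, inum, &ileaf))
                    return -1;
            struct btree *fileindex = get_file_btree(&ileaf, version);

            /* Stage two: file index btree, keyed by logical offset,
             * finds the versioned extent covering this block. */
            if (btree_probe(fileindex, offset, &dleaf))
                    return -1;
            return get_extent(&dleaf, offset, out);
    }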

I also received a proposal for a "fat" btree index from a collaborator
(Maciej) that included the file offset but I really did not like the...
fattening.  A two level btree yields less metadata overall which in my
estimation is more important than saving some btree probes.  The way
things work these days, falling down from level 1 cache to level 2 or
from level 2 to level 3 costs much more than executing some extra CPU
instructions.  So optimization strategy inexorably shifts away from
minimizing instructions towards minimizing cache misses.

>     The historical access mechanic alone added major complexity to the
>     B-Tree algorithms.  I will note here that B-Tree searches have a very
>     high cpu overhead no matter how you twist it, and being able to cache
>     cursors within the B-Tree is absolutely required if you want the
>     filesystem to perform well.  If you always start searches at the root
>     your cpu overhead will be horrendous... so plan on caching cursors
>     from the get-go.

Fortunately, we get that for free on Linux, courtesy of the page cache
radix trees :-)

I might eventually add some explicit cursor caching, but various
artists over the years have noticed that it does not make as much
difference as you might think.

>     If I were to do radix compression I would also want to go with a
>     fully dynamic element size and fully dynamic fan-out in order to
>     best-compress each B-Tree node.  Definitely food for thought.

Indeed.  But Linux is braindamaged about large block size, so there is
very strong motivation to stay within physical page size for the
immediate future.  Perhaps if I get around to a certain hack that has
been perennially delayed, that situation will improve:

   "Variable sized page objects"
   http://lwn.net/Articles/37795/

>     I'd love to do something like that.  I think radix compression would
>     remove much of the topological bloat the B-Tree creates verses using
>     blockmaps, generally speaking.

Topological bloat?

>     Space management is currently HAMMER's greatest weakness, but it only
>     applies to small storage systems.  Several things in HAMMER's design
>     are simply not condusive to a small storage.  The storage model is not
>     fine-grained and (unless you are deleting a lot of stuff) needs
>     reblocking to actually recover the freed space.  The flushing algorithms
>     need around 100MB of UNDO FIFO space on-media to handle worst-case
>     dependancies (mainly directory/file visibility issues on crash recovery),
>     and the front-end caching of modifying operations, since they don't
>     know exactly how much actual storage will be needed, need ~16MB of
>     wiggle room to be able to estimate the on-media storage required to
>     back the operations cached by the front-end.  Plus, on top of that,
>     any sort of reblocking also needs some wiggle room to be able to clean
>     out partially empty big-blocks and fill in new ones.

Ah, that is a very nice benefit of Tux3-style forward logging I forgot
to mention in the original post: transaction size is limited only by
available free space on the volume, because log commit blocks can be
anywhere.  Last night I dreamed up a log commit block variant where the
associated transaction data blocks can be physically discontiguous, to
make this assertion come true, shortly after reading Stephen Tweedie's
horror stories about what you have to do if you must work around not
being able to put an entire, consistent transaction onto stable media
in one go:

   http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

Most of the nasties he mentions just vanish if:

  a) File data is always committed atomically
  b) All commit transactions are full transactions

He did remind me (via time travel from year 2000) of some details I
should write into the design explicitly, for example, logging orphan
inodes that are unlinked while open, so they can be deleted on replay
after a crash.  Another nice application of forward logging, which
avoids the seek-happy linked list through the inode table that Ext3
does.

>     On the flip side, the reblocker doesn't just de-fragment the filesystem,
>     it is also intended to be used for expanding and contracting partitions,
>     and adding removing volumes in the next release.  Definitely a
>     multi-purpose utility.

Good point.  I expect Tux3 will eventually have a reblocker (aka
defragger).  There are some really nice things you can do, like:

  1) Set a new version so logical changes cease for the parent
     version.

  2) We want to bud off a given directory tree into its own volume,
     so start by deleting the subtree in the current version.  If
     any link counts in the directory tree remain nonzero in the
     current version then there are hard links into the subtree, so
     fail now and drop the new version.

  3) Reblock a given region of the inode table tree and all the files
     in it into one physically contiguous region of blocks.

  4) Add a free tree and other once-per volume structures to the new
     home region.

  5) The new region is now entirely self contained and even keeps its
     version history.  At the volume manager level, map it to a new,
     sparse volume that just has a superblock in the zeroth extent and
     the new, mini filesystem at some higher logical address.  Remap
     the region in the original volume to empty space and add the
     empty space to the free tree.

  6) Optionally reblock the newly budded filesystem to the base of the
     new volume so utilities that do not play well with sparse
     volumes do not do silly things.

>     So I'm not actually being cavalier about it, its just that I had to
>     make some major decisions on the algorithm design and I decided to
>     weight the design more towards performance and large-storage, and
>     small-storage suffered a bit.

Cavalier was a poor choice of words, the post was full of typos as well
so I am not proud of it.  You are solving a somewhat different problem
and you have code out now, which is a huge achievement.  Still I think
you can iteratively improve your design using some of the techniques I
have stumbled upon.  There are probably some nice tricks I can get from
your code base too once I delve into it.

>     In anycase, it sounds like Tux3 is using many similar ideas.  I think
>     you are on the right track.

Thank you very much.  I think you are on the right track too, which you
have a rather concrete way of proving.

>     I will add one big note of caution, drawing 
>     from my experience implementing HAMMER, because I think you are going
>     to hit a lot of the same issues.
> 
>     I spent 9 months designing HAMMER and 9 months implementing it.  During
>     the course of implementing it I wound up throwing away probably 80% of
>     the original design outright.  Amoung the major components I had to
>     rewrite were the LOG mechanic (which ultimately became the meta-data
>     UNDO FIFO), and the fine-grained storage mechanic (which ultimately
>     became coarse-grained).  The UNDO FIFO was actually the saving grace,
>     once that was implemented most of the writer-ordering dependancies went
>     away (devolved into just flushing meta-data buffers after syncing the
>     UNDO FIFO)... I suddenly had much, much more freedom in designing
>     the other on-disk structures and algorithms.
>
>     What I found while implementing HAMMER was that the on-disk topological
>     design essentially dictated the extent of HAMMER's feature set AND
>     most of its deficiencies (such as having to reblock to recover space),
>     and the algorithms I chose often dictated other restrictions.  But the
>     major restrictions came from the on-disk structures.
> 
>     Because of the necessarily tight integration between subsystems I found
>     myself having to do major redesigns during the implementation phase.
>     Fixing one subsystem created a cascade effect that required tweaking other
>     subsystems.  Even adding new features, such as the mirroring, required
>     significant changes in the B-Tree deadlock recovery code.  I couldn't get
>     away from it.  Ultimately I chose to simplify some of the algorithms
>     rather then have to go through another 4 months of rewriting.  All
>     major designs are an exercise in making trade-offs in topology, feature
>     set, algorithmic complexity, debuggability, robustness, etc.  The list
>     goes on forever.
> 

The big ahas! that eliminated much of the complexity in the Tux3 design
were:

  * Forward logging - just say no to incomplete transactions
  * Version weaving - just say no to recursive copy on write

Essentially I have been designing Tux3 for ten years now and working
seriously on the simplifying elements for the last three years or so,
either entirely on paper or in related work like ddsnap and LVM3.

One of the nice things about moving on from design to implementation of
Tux3 is that I can now background the LVM3 design process and see what
Tux3 really wants from it.  I am determined to match every checkbox
volume management feature of ZFS as efficiently or more efficiently,
without violating the traditional layering between filesystem and
block device, and without making LVM3 a Tux3-private invention.

>     Laters!

Hopefully not too much later.  I find this dialog very fruitful.  I just
wish such dialog would occur more often at the design/development stage
in Linux and other open source work instead of each group obsessively
ignoring all "competing" designs and putting huge energy into chatting
away about the numerous bugs that arise from rushing their design or
ignoring the teachings of history.

Regards,

Daniel

From dillon at apollo.backplane.com  Fri Jul 25 19:02:02 2008
From: dillon at apollo.backplane.com (Matthew Dillon)
Date: Fri, 25 Jul 2008 19:02:02 -0700 (PDT)
Subject: [Tux3] Comparison to Hammer fs design
References: <4889fdb4$0$849$415eb37d@crater_reader.dragonflybsd.org>
        <200807251853.m6PIrZY1015569@apollo.backplane.com>
        <200807251513.59912.phillips@phunq.net>
Message-ID: <200807260202.m6Q222Ma018102@apollo.backplane.com>

:>     so I am adding him to the To:   Dan, feel free to post this on your Tux
:>     groups if you want.
:
:How about a cross-post?

    I don't think it will work, only subscribers can post to the DFly groups,
    but we'll muddle through it :-)  I will include the whole of the previous
    posting so the DFly groups see the whole thing, if you continue to get
    bounces.

    I believe I have successfully added you as an 'alias address' to the
    DragonFly kernel list so you shouldn't get bounced if you Cc it now.

:Yes, that is the main difference indeed, essentially "log everything" vs
:"commit" style versioning.  The main similarity is the lifespan oriented
:version control at the btree leaves.

    Reading this and a little more that you describe later let me make
    sure I understand the forward-logging methodology you are using.
    You would have multiple individually-tracked transactions in
    progress due to parallelism in operations initiated by userland and each
    would be considered committed when the forward-log logs the completion
    of that particular operation?

    If the forward log entries are not (all) cached in-memory that would mean
    that accesses to the filesystem would have to be run against the log
    first (scanning backwards), and then through to the B-Tree?  You
    would solve the need for having an atomic commit ('flush groups' in
    HAMMER), but it sounds like the algorithmic complexity would be
    very high for accessing the log.

    And even though you wouldn't have to group transactions into larger
    commits the crash recovery code would still have to implement those
    algorithms to resolve directory and file visibility issues.  The problem
    with namespace visibility is that it is possible to create a virtually
    unending chain of separate but inter-dependent transactions which either
    all must go, or none of them.  e.g. creating a, a/b, a/b/c, a/b/x, a/b/c/d,
    etc etc.  At some point you have to be able to commit so the whole mess
    does not get undone by a crash, and many completed mini-transactions
    (file or directory creates) actually cannot be considered complete until
    their governing parent directories (when creating) or children (when
    deleting) have been committed.  The problem can become very complex.

    Last question here:  You are forward-logging high level operations.
    You are also going to have to log meta-data (actual B-Tree manipulation)
    commits in order to recover from a crash while making B-Tree
    modifications.  Correct?  So your crash recovery code will have to handle
    both meta-data undo and completed and partially completed transactions.
    And there will have to be a tie-in between the meta-data commits and
    the transactions so you know which ones have to be replayed.  That
    sounds fairly hairy.  Have you figured out how you are going to do that?

:Once I get down to the leaf level I binary search on logical address in
:the case of a file index btree or on version in the case of an inode
:table block, so this cost is still Log(N) with a small k.  For a
:heavily versioned inode or file region this might sometimes result in
:an overflow block or two that has to be linearly searched which is not
:a big problem so long as it is a rare case, which it really ought to be
:for the kinds of filesystem loads I have seen.  A common example of a
:worst case is /var/log/messages, where the mtime and size are going to
:change in pretty much every version, so if you have hourly snapshots
:and hold them for three months it adds up to about 2200 16 byte inode
:table attributes to record, about 8 inode table leaf blocks.  I really
:do not see that as a problem.  If it goes up to 100 leaf blocks to
:search then that could be a problem.
    
    I think you can get away with this as long as you don't have too many
    snapshots, and even if you do I noticed with HAMMER that only a small
    percentage of inodes have a large number of versions associated with
    them from normal production operation.  /var/log/messages
    is an excellent example of that.  Log files were affected the most, though
    I also noticed that very large files also wind up with multiple versions
    of the inode, such as when writing out a terabyte-sized file.

    Even with a direct bypass for data blocks (but not their meta-data,
    clearly), HAMMER could only cache so much meta-data in memory before
    it had to finalize the topology and flush the inode out.  A
    terabyte-sized file wound up with about 1000 copies of the inode
    prior to pruning (one had to be written out about every gigabyte or so).

:The penultimate inode table index block tells me how many blocks a
:given inode lives in because several blocks will have the same inum
:key.  So the lookup algorithm for a massively versioned file becomes:
:1) read the first inode table block holding that inum; 2) read the last
:block with the same inum.  The latter operation only needs to consult
:the immediate parent index block, which is locked in the page cache at
:that point.

    How are you dealing with expansion of the logical inode block(s) as
    new versions are added?  I'm assuming you are intending to pack the
    inodes on the media so e.g. a 128-byte inode would only take up
    128 bytes of media space in the best case.  Multiple inodes would be
    laid out next to each other logically (I assume), but since the physical
    blocks are larger they would also have to be laid out next to each
    other physically within any given backing block.  Now what happens
    when one has to be expanded?

    I'm sure this ties into the forward-log but even with the best algorithms
    you are going to hit limited cases where you have to expand the inode.
    Are you just copying the inode block(s) into a new physical allocation
    then?

:It will take a little care and attention to the inode format,
:especially the inode header, to make sure that I can reliably do that
:first/last probe optimization for the "head" version, but it does seem
:worth the effort.  For starters I will just do a mindless linear walk
:of an "overflow" inode and get fancy if that turns out to be a problem.
:

    Makes sense.  I was doing mindless linear walks of the B-Tree element
    arrays for most of HAMMER's implementation until it was stabilized well
    enough that I could switch to a binary search.  And I might change it
    again to start out with a heuristic index estimation.

:>     I couldn't get away from having a delete_tid (the 'death version
:>     numbers').  I really tried :-)  There are many cases where one is only
:>     deleting, rather then overwriting.
:
:By far the most common case I would think.  But check out the versioned
:pointer algorithms.  Surprisingly that just works, which is not exactly
:obvious:
:
:   Versioned pointers: a new method of representing snapshots
:   http://lwn.net/Articles/288896/

    Yes, it makes sense.  If the snapshot is explicitly taken then you
    can store direct references and chain, and you wouldn't need a delete
    id in that case.  From that article though the chain looks fairly
    linear.  Historical access could wind up being rather costly.

:I was originally planning to keep all versions of a truncate/rewrite
:file in the same file index, but recently I realized that that is dumb
:because there will never be any file data shared by a successor version
:in that case.  So the thing to do is just create an entirely new
:versioned file data attribute for each rewrite, bulking up the inode
:table entry a little but greatly constraining the search for versions
:to delete and reducing cache pressure by not loading unrelated version
:data when traversing a file.

    When truncating to 0 I would agree with your assessment.  For the 
    topology you are using you would definitely want to use different
    file data sets.  You also have to deal with truncations that are not
    to 0, which might be to the middle of a file.  Certainly not a
    common case, but still a case that has to be coded for.  If you treat
    truncation to 0 as a special case you will be adding considerable
    complexity to the algorithm.

    With HAMMER I chose to keep everything in one B-Tree, whether historical
    or current.  That way one set of algorithms handles both cases and code
    complexity is greatly reduced.  It isn't optimal... large amounts of
    built up history still slow things down (though in a bounded fashion).
    In that regard creating a separate topology for snapshots is a good
    idea.

:>     Both numbers are then needed to 
:>     be able to properly present a historical view of the filesystem.
:>     For example, when one is deleting a directory entry a snapshot after
:>     that deletion must show the directory entry gone.
:
:...which happens in Tux3 courtesy of the fact that the entire block
:containing the dirent will have been versioned, with the new version
:showing the entry gone.  Here is one of two places where I violate my
:vow to avoid copying an entire block when only one data item in it
:changes (the other being the atime table).  I rely on two things to
:make this nice: 1) Most dirent changes will be logically logged and
:only rolled up into the versioned file blocks when there are enough to
:be reasonably sure that each changed directory block will be hit
:numerous times in each rollup episode.  (Storing the directory blocks
:in dirent-create order as in PHTree makes this very likely for mass
:deletes.)  2) When we care about this is usually during a mass delete,
:where most or all dirents in each directory file block are removed
:before moving on to the next block.

    This could wind up being a sticky issue for your implementation.
    I like the concept of using the forward-log but if you actually have
    to do a version copy of the directory at all you will have to update the
    link (or some sort of) count for all the related inodes to keep track
    of inode visibility, and to determine when the inode can be freed and
    its storage space recovered.

    Directories in HAMMER are just B-Tree elements.  One element per
    directory-entry.  There are no directory blocks.   You may want to
    consider using a similar mechanism.  For one thing, it makes lookups
    utterly trivial... the file name is hashed and a lookup is performed
    based on the hash key, then B-Tree elements with the same hash key
    are iterated until a match is found (usually the first element is the
    match).  Also, when adding or removing directory entries only the
    directory inode's mtime field needs to be updated.  Its size does not.
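
    Roughly (a simplified sketch of the idea, not HAMMER's actual
    on-disk structures):

        #include <stdint.h>

        struct dirent_elm {
                uint64_t dir_objid;      /* directory inode number */
                uint64_t hash;           /* name hash; low bits hold an
                                          * iterator to resolve collisions */
                uint64_t target_objid;   /* inode the entry refers to */
                /* the name itself lives in the element's data reference */
        };

        /* Lookup: seek to (dir_objid, hash(name)) and iterate over
         * elements with the same hash until the stored name matches;
         * usually the first element is the match.  Creating or
         * deleting an entry inserts or deletes one element and only
         * touches the directory inode's mtime, never its size. */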

    Your current directory block method could also represent a problem for
    your directory inodes... adding and removing directory entries causing
    size expansion or contraction could require rolling new versions
    of the directory inode to update the size field.  You can't hold too
    much in the forward-log without some serious indexing.  Another
    case to consider along with terabyte-sized files and log files.
    
:>     Both numbers are 
:>     also needed to be able to properly prune out unwanted historical data
:>     from the filesystem.  HAMMER's pruning algorithm (cleaning out old  
:>     historical data which is no longer desired) creates holes in the
:>     sequence so once you start pruning out unwanted historical data
:>     the delete_tid of a prior record will not match the create_tid of the
:>     following one (historically speaking).
:
:Again, check out the versioned pointer algorithms.  You can tell what
:can be pruned just by consulting the version tree and the create_tids
:(birth versions) for a particular logical address.  Maybe the hang is
:that you do not organize the btrees by logical address (or inum in the
:case of the inode table tree).  I thought you did but have not read
:closely enough to be sure.

    Yes, I see.  Can you explain the versioned pointer algorithm a bit more?
    It looks almost like a linear chain (say when someone is doing a daily
    snapshot).  It looks great for optimal access to the HEAD but it doesn't
    look very optimal if you want to dive into an old snapshot. 

    For informational purposes: HAMMER has one B-Tree, organized using
    a strict key comparison.  The key is made up of several fields which
    are compared in priority order:

        localization    - used to localize certain data types together and
                          to separate pseudo filesystems created within
                          the filesystem.
        obj_id          - the object id the record is associated with.
                          Basically the inode number the record is 
                          associated with.

        rec_type        - the type of record, e.g. INODE, DATA, SYMLINK-DATA,
                          DIRECTORY-ENTRY, etc.

        key             - e.g. file offset

        create_tid      - the creation transaction id

    Inodes are grouped together using the localization field so if you
    have a million inodes and are just stat()ing files, the stat
    information is localized relative to other inodes and doesn't have to
    skip file contents or data, resulting in highly localized accesses
    on the storage media.

    Beyond that the B-Tree is organized by inode number and file offset.
    In the case of a directory inode, the 'offset' is the hash key, so
    directory entries are organized by hash key (part of the hash key is
    an iterator to deal with namespace collisions).

    The structure seems to work well for both large and small files, for
    ls -lR (stat()ing everything in sight) style traversals as well as 
    tar-like traversals where the file contents for each file is read.

    The create_tid is the creation transaction id.  Historical accesses are
    always 'ASOF' a particular TID, and will access the highest create_tid
    that is still <= the ASOF TID.  The 'current' version of the filesystem
    uses an asof TID of 0xFFFFFFFFFFFFFFFF and hence accesses the highest
    create_tid.

    There is also a delete_tid which is used to filter out elements that
    were deleted prior to the ASOF TID.  There is currently an issue with
    HAMMER where the fact that the delete_tid is *not* part of the B-Tree
    compare can lead to iterations for strictly deleted elements, versus
    replaced elements which will have a new element with a higher create_tid
    that the B-Tree can seek to directly.  In fact, at one point I tried
    indexing on delete_tid instead of create_tid, but created one hell of a
    mess in that I had to physically *move* B-Tree elements being deleted
    instead of simply marking them as deleted.
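
    In rough outline (field names follow the description above, not
    HAMMER's actual headers):

        #include <stdint.h>

        struct hkey {                   /* compared in this priority order */
                uint64_t localization;  /* data-type / pseudo-fs grouping */
                uint64_t obj_id;        /* inode number */
                uint32_t rec_type;      /* INODE, DATA, DIRECTORY-ENTRY, ... */
                uint64_t key;           /* e.g. file offset or name hash */
                uint64_t create_tid;    /* creation transaction id */
        };

        /* An element is visible as-of a given TID if it was created at
         * or before that TID and not deleted by then; delete_tid is
         * carried in the element but is not part of the compare, which
         * is what forces the extra iteration for strictly deleted
         * records.  The live filesystem uses asof = 0xFFFFFFFFFFFFFFFF,
         * and delete_tid == 0 is taken here to mean "not deleted". */
        int visible_asof(uint64_t create_tid, uint64_t delete_tid,
                         uint64_t asof)
        {
                return create_tid <= asof &&
                       (delete_tid == 0 || delete_tid > asof);
        }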

:Fair enough.  I have an entirely different approach to what you call
:mirroring and what I call delta replication.  (I reserve the term
:mirroring to mean mindless duplication of physical media writes.)  This
:method proved out well in ddsnap:
:
:   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149257;hb=fc5cb496fff10a2b03034fcf95122f5828149257
:
:(Sorry about the massive URL, you can blame Linus for that;-)
:
:What I do in ddsnap is compute all the blocks that differ between two
:versions and apply those to a remote volume already holding the first
:of the two versions, yielding a replica of the second version that is
:logically but not physically identical.  The same idea works for a
:versioned filesystem: compute all the leaf data that differs between
:two versions, per inum, and apply the resulting delta to the
:corresponding inums in the remote replica.  The main difference vs a
:ddsnap volume delta is that not all of the changes are physical blocks,
:there are also changed inode attributes, so the delta stream format
:has to be elaborated accordingly.

    Are you scanning the entire B-Tree to locate the differences?  It
    sounds like you would have to as a fall-back, but that you could use the
    forward-log to quickly locate differences if the first version is
    fairly recent.

    HAMMER's mirroring basically works like this:  The master has synchronized
    up to transaction id C, the slave has synchronized up to transaction id A.
    The mirroring code does an optimized scan of the B-Tree to supply all
    B-Tree elements that have been modified between transaction A and C.
    Any number of mirroring targets can be synchronized to varying degrees,
    and can be out of date by varying amounts (even years, frankly).

    I decided to use the B-Tree to optimize the scan.  The B-Tree is
    scanned and any element with either a creation or deletion transaction
    id >= the slave's last synchronization point is then serialized and
    piped to the slave.

    To avoid having to scan the entire B-Tree I perform an optimization
    whereby the highest transaction id laid down at a leaf is propagated
    up the B-Tree all the way to the root.  This also occurs if a B-Tree
    node is physically deleted (due to pruning), even if no elements are
    actually present at the leaf within the transaction range.

    Thus the mirroring scan is able to skip any internal node (and its
    entire sub-tree) which has not been modified after the synchronization
    point, and is able to identify any leaf for which real, physical deletions
    have occurred (in addition to logical deletions which simply set the
    delete_tid field in the B-Tree element) and pass along the key range
    and any remaining elements in that leaf for the target to do a
    comparative scan with.

    This allows incremental mirroring, multiple slaves, and also allows
    a mirror to go offline for months and then pop back online again
    and optimally pick up where it left off.  The incremental mirroring
    is important, the last thing I want to do is have to scan 2 billion
    B-Tree elements to do an incremental mirroring batch.
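
    The skip rule itself is tiny; a sketch (structure and names are
    illustrative, not HAMMER's actual code):

        #include <stdint.h>

        struct bnode {
                uint64_t highest_tid;    /* max transaction id propagated
                                          * up from everything beneath */
                int nchildren;           /* 0 at a leaf */
                struct bnode **child;
                /* elements live here at a leaf */
        };

        /* Assumed helper: serialize elements created or deleted since
         * since_tid, plus the leaf's key range for comparative scans. */
        void emit_changed_elements(struct bnode *leaf, uint64_t since_tid);

        void mirror_scan(struct bnode *node, uint64_t since_tid)
        {
                if (node->highest_tid < since_tid)
                        return;          /* whole subtree unchanged: skip it */
                if (node->nchildren == 0) {
                        emit_changed_elements(node, since_tid);
                        return;
                }
                for (int i = 0; i < node->nchildren; i++)
                        mirror_scan(node->child[i], since_tid);
        }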

    The advantage is all the cool features I got by doing things that way.
    The disadvantage is that the highest transaction id must be propagated
    up the tree (though it isn't quite that bad because in HAMMER an entire
    flush group uses the same transaction id, so we aren't constantly
    repropagating new transaction id's up the same B-Tree nodes when
    flushing a particular flush group).

    You may want to consider something similar.  I think using the
    forward-log to optimize incremental mirroring operations is also fine
    as long as you are willing to take the 'hit' of having to scan (though
    not have to transfer) the entire B-Tree if the mirror is too far
    out of date.

:I intend to log insertions and deletions logically, which keeps each
:down to a few bytes until a btree rollup episode comes along to perform
:updating of the btree nodes in bulk.  I am pretty sure this will work
:for you as well, and you might want to check out that forward logging
:trick.

    Yes.  The reason I don't is because while it is really easy to lay down
    a forward-log, integrating it into lookup operations (if you don't
    keep all the references cached in-memory) is really complex code.
    I mean, you can always scan it backwards linearly and that certainly
    is easy to do, but the filesystem's performance would be terrible.
    So you have to index the log somehow to allow lookups on it in
    reverse to occur reasonably optimally.  Have you figured out how you
    are going to do that?

    With HAMMER I have an in-memory cache and the on-disk B-Tree.  Just
    writing the merged lookup code (merging the in-memory cache with the
    on-disk B-Tree for the purposes of doing a lookup) was fairly complex.
    I would hate to have to do a three-way merge.... in-memory cache, on-disk
    log, AND on-disk B-Tree.  Yowzer!

:That reminds me, I was concerned about the idea of UNDO records vs
:REDO.  I hope I have this right: you delay acknowledging any write
:transaction to the submitter until log commit has progressed beyond the
:associated UNDO records.  Otherwise, if you acknowledge, crash, and
:prune all UNDO changes, then you would end up with applications
:believing that they had got things onto stable storage and be wrong
:about that.  I have no doubt you did the right thing there, but it is
:not obvious from your design documentation.

    The way it works is that HAMMER's frontend is almost entirely
    disconnected from the backend.  All meta-data operations are cached
    in-memory.  create, delete, append, truncate, rename, write...
    you name it.  Nothing the frontend does modifies any meta-data
    or meta-data buffers.

    The only synchronization point is fsync(), the filesystem syncer, and
    of course if too much in-memory cache is built up.  To improve
    performance, raw data blocks are not included... space for raw data
    writes is reserved by the frontend (without modifying the storage
    allocation layer) and those data buffers are written to disk by the
    frontend directly, just without any meta-data on-disk to reference
    them so who cares if you crash then!  It would be as if those blocks
    were never allocated in the first place.

    When the backend decides to flush the cached meta-data ops it breaks
    the meta-data ops into flush-groups whose dirty meta-data fits in the
    system's buffer cache.  The meta-data ops are executed, building the
    UNDO, updating the B-Tree, allocating or finalizing the storage, and
    modifying the meta-data buffers in the buffer cache.  BUT the dirty
    meta-data buffers are locked into memory and NOT yet flushed to
    the media.  The UNDO *is* flushed to the media, so the flush groups
    can build a lot of UNDO up and flush it as they go if necessary.

    When the flush group has completed any remaining UNDO is flushed to the
    media, we wait for I/O to complete, the volume header's UNDO FIFO index
    is updated and written out, we wait for THAT I/O to complete, *then*
    the dirty meta-data buffers are flushed to the media.  The flushes at
    the end are done asynchronously (unless fsync()ing) and can overlap with
    the flushes done at the beginning of the next flush group.  So there
    are exactly two physical synchronization points for each flush.

    If a crash occurs at any point, upon remounting after a reboot
    HAMMER only needs to run the UNDOs to undo any partially committed
    meta-data.

    That's it.  It is fairly straightforward.
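
    Restated as a sketch, with all of the helpers assumed, the ordering
    per flush group is:

        struct flush_group;

        /* Assumed helpers standing in for the real machinery. */
        void run_metadata_ops(struct flush_group *fg);    /* build UNDO, update
                                                           * B-Tree, dirty locked
                                                           * meta-data buffers */
        void write_remaining_undo(struct flush_group *fg);
        void wait_for_io(void);
        void write_volume_header(struct flush_group *fg); /* publish UNDO index */
        void flush_metadata_buffers(struct flush_group *fg);

        void flush_one_group(struct flush_group *fg)
        {
                run_metadata_ops(fg);
                write_remaining_undo(fg);
                wait_for_io();                  /* sync point 1 */
                write_volume_header(fg);
                wait_for_io();                  /* sync point 2 */
                flush_metadata_buffers(fg);     /* async; a crash before these
                                                 * land is undone by UNDO replay */
        }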

    For your forward-log approach you would have to log the operations 
    as they occur, which is fine since that can be cached in-memory. 
    However, you will still need to synchronize the log entries with
    a volume header update to update the log's index, so the recovery
    code knows how far the log extends.  Plus you would also have to log
    UNDO records when making actual changes to the permanent media
    structures (your B-Trees), which is independent of the forward-log
    entries you made representing high level filesystem operations, and
    would also have to lock the related meta-data in memory until the
    related log entries can be synchronized.  Then you would be able
    to flush the meta-data buffers.

    The forward-log approach is definitely more fine-grained, particularly
    for fsync() operations... those would go much faster using a forward
    log then the mechanism I use because only the forward-log would have
    to be synchronized (not the meta-data changes) to 'commit' the work.
    I like that, it would be a definite advantage for database operations.

:>     HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
:>     another limiting factor for the fan-out I can use.  My elements
:>     are 64 bytes each.
:
:Yes, I mostly have 16 byte elements and am working on getting most of
:them down to 12 or 8.

    I don't know how you can make them that small.  I spent months just
    getting my elements down to 64 bytes.  The data reference alone for
    data blocks is 12 bytes (64 bit media storage offset and 32 bit length).

:>     64x64 = 4K per B-Tree node.  I decided to go 
:>     with fully expanded keys: 64 bit object id, 64 bit file-offset/db-key,
:>     64 bit create_tid, 64 bit delete_tid), plus a 64-bit storage offset and
:>     other auxillary info.  That's instead of using a radix-compressed key.
:>     Radix compressed keys would have doubled the complexity of the B-Tree
:>     code, particularly with the middle-of-tree pointer caching that
:>     HAMMER does.
:
:I use two-stage lookup, or four stage if you count searches within
:btree blocks.  This makes the search depth smaller in the case of small
:files, and in the case of really huge files it adds depth exactly as
:appropriate.  The index blocks end up cached in the page cache (the
:buffer cache is just a flavor of page cache in recent Linux) so there
:is little repeated descending through the btree indices.  Instead, most
:of the probing is through the scarily fast radix tree code, which has
:been lovingly optimized over the years to avoid cache line misses and
:SMP lock contention.
:
:I also received a proposal for a "fat" btree index from a collaborator
:(Maciej) that included the file offset but I really did not like the...
:fattening.  A two level btree yields less metadata overall which in my
:estimation is more important than saving some bree probes.  The way
:things work these days, falling down from level 1 cache to level 2 or
:from level 2 to level 3 costs much more than executing some extra CPU
:instructions.  So optimization strategy inexorably shifts away from
:minimizing instructions towards minimizing cache misses.

    The depth and complexity of your master B-Tree will definitely
    be smaller.  You are offloading both the file contents and snapshots.
    HAMMER incorporates both into a single global B-Tree.  This has been
    somewhat of a mixed bag for HAMMER.  On the plus side performance is
    very good for production operations (where the same filename is not
    being created over and over and over and over again and where the same
    file is not truncated over and over and over again).... locality of
    reference is extremely good in a single global B-Tree when accessing
    lots of small files or when accessing fewer larger files.  On the 
    negative side, performance drops noticeably if a lot of historical
    information (aka snapshots) has built up in the data set being
    accessed.  Even though the lookups themselves are extremely efficient,
    the access footprint (the size of the data set the system must cache)
    of the B-Tree becomes larger, sometimes much larger.

    I think the cpu overhead is going to be a bit worse than you are
    contemplating, but your access footprint (with regards to system memory
    use caching the meta-data) for non-historical accesses will be better.

    Are you going to use a B-Tree for the per-file layer or a blockmap?  And
    how are you going to optimize the storage for small files?  Are you
    just going to leave them in the log and not push a B-Tree for them?

:>     The historical access mechanic alone added major complexity to the
:>     B-Tree algorithms.  I will note here that B-Tree searches have a very
:>     high cpu overhead no matter how you twist it, and being able to cache
:>     cursors within the B-Tree is absolutely required if you want the
:>     filesystem to perform well.  If you always start searches at the root
:>     your cpu overhead will be horrendous... so plan on caching cursors
:>     from the get-go.
:
:Fortunately, we get that for free on Linux, courtesy of the page cache
:radix trees :-)
:
:I might eventually add some explicit cursor caching, but various
:artists over the years have noticed that it does not make as much
:difference as you might think.

     For uncacheable data sets the cpu overhead is almost irrelevant.

     But for cached data sets, watch out!  The cpu overhead of your B-Tree
     lookups is going to be 50% of your cpu time, with the other 50% being
     the memory copy or memory mapping operation and buffer cache operations.
     It is really horrendous.  When someone is read()ing or write()ing a
     large file the last thing you want to do is traverse the same 4 nodes
     and do four binary searches in each of those nodes for every read().

     For large fully cached data sets not caching B-Tree cursors will
     strip away 20-30% of your performance once your B-Tree depth exceeds
     4 or 5.  Also, the fan-out does not necessarily help there because
     the search within the B-Tree node costs almost as much as moving
     between B-Tree nodes.

     I found this out when I started comparing HAMMER performance with
     UFS.  For the fully cached case UFS was 30% faster until I started
     caching B-Tree cursors.  It was definitely noticeable once my B-Tree
     grew past a million elements or so.  It disappeared completely when
     I started caching cursors into the B-Tree.

:>     If I were to do radix compression I would also want to go with a
:>     fully dynamic element size and fully dynamic fan-out in order to
:>     best-compress each B-Tree node.  Definitely food for thought.
:
:Indeed.  But Linux is braindamaged about large block size, so there is
:very strong motivation to stay within physical page size for the
:immediate future.  Perhaps if I get around to a certain hack that has
:been perenially delayed, that situation will improve:
:
:   "Variable sized page objects"
:   http://lwn.net/Articles/37795/

    I think it will be an issue for people trying to port HAMMER.  I'm trying
    to think of ways to deal with it.  Anyone doing an initial port can 
    just drop all the blocks down to 16K, removing the problem but
    increasing the overhead when working with large files.

:>     I'd love to do something like that.  I think radix compression would
:>     remove much of the topological bloat the B-Tree creates verses using
:>     blockmaps, generally speaking.
:
:Topological bloat?

    Bytes per record.  e.g. the cost of creating a small file in HAMMER
    is 3 B-Tree records (directory entry + inode record + one data record),
    plus the inode data, plus the file data.  For HAMMER that is 64*3 +
    128 + 112 (say the file is 100 bytes long, round up to a 16 byte
    boundary)... so that is 432 bytes.

    The bigger cost is when creating and managing a large file.  A 1 gigabyte
    file in HAMMER requires 1G/65536 = 16384 B-Tree elements, which comes
    to 1 megabyte of meta-data.  If I were to radix-compress those 
    elements the meta-data overhead would probably be cut to 300KB,
    possibly even less.

    Where this matters is that it directly affects the meta-data footprint
    in the system caches which in turn directly affects the filesystem's
    ability to cache information without having to go to disk.  It can
    be a big deal.

:Ah, that is a very nice benefit of Tux3-style forward logging I forgot
:to mention in the original post: transaction size is limited only by
:available free space on the volume, because log commit blocks can be
:anywhere.  Last night I dreamed up a log commit block variant where the
:associated transaction data blocks can be physically discontiguous, to
:make this assertion come true, shortly after reading Stephen Tweedie's
:horror stories about what you have to do if you must work around not
:being able to put an entire, consistent transaction onto stable media
:in one go:

    Yes, that's an advantage, and one of the reasons why HAMMER is designed
    for large filesystems and not small ones.  Without a forward-log
    HAMMER has to reserve more media space for sequencing its flushes.

    Adding a forward log to HAMMER is possible, I might do it just for
    quick write()/fsync() style operations.  I am still very wary of the
    added code complexity.

:   http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
:
:Most of the nasties he mentions just vanish if:
:
:  a) File data is always committed atomically
:  b) All commit transactions are full transactions

    Yup, which you get for free with your forward-log.

:He did remind me (via time travel from year 2000) of some details I
:should write into the design explicitly, for example, logging orphan
:inodes that are unlinked while open, so they can be deleted on replay
:after a crash.  Another nice application of forward logging, which
:avoids the seek-happy linked list through the inode table that Ext3
:does.

    Orphan inodes in HAMMER will always be committed to disk with a
    0 link count.  The pruner deals with them after a crash.  Orphan
    inodes can also be committed due to directory entry dependencies.

:>     On the flip side, the reblocker doesn't just de-fragment the filesystem,
:>     it is also intended to be used for expanding and contracting partitions,
:>     and adding removing volumes in the next release.  Definitely a
:>     multi-purpose utility.
:
:Good point.  I expect Tux3 will eventually have a reblocker (aka
:defragger).  There are some really nice things you can do, like:
:
:  1) Set a new version so logical changes cease for the parent
:     version.
:
:  2) We want to bud off a given directory tree into its own volume,
:     so start by deleting the subtree in the current version.  If
:     any link counts in the directory tree remain nonzero in the
:     current version then there are hard links into the subtree, so
:     fail now and drop the new version.
:
:  3) Reblock a given region of the inode table tree and all the files
:     in it into one physically contiguous region of blocks
:
:  4) Add a free tree and other once-per volume structures to the new
:     home region.
:
:  5) The new region is now entirely self contained and even keeps its
:     version history.  At the volume manager level, map it to a new,
:     sparse volume that just has a superblock in the zeroth extent and
:     the new, mini filesystem at some higher logical address.  Remap
:     the region in the original volume to empty space and add the
:     empty space to the free tree.
:
:  6) Optionally reblock the newly budded filesystem to the base of the
:     new volume so utilties that do not not play well with sparse
:     volumes do not do silly things.

    Yes, I see.  The budding operation is very interesting to me...
    well, the ability to fork the filesystem and effectively write to
    a snapshot.  I'd love to be able to do that, it is a far superior
    mechanism to taking a snapshot and then performing a rollback later.

    Hmm.  I could definitely manage more than one B-Tree if I wanted to.
    That might be the ticket for HAMMER... use the existing snapshot
    mechanic as backing store and a separate B-Tree to hold all changes
    made to the snapshot, then do a merged lookup between the new B-Tree
    and the old B-Tree.  That would indeed work.

:>     So I'm not actually being cavalier about it, its just that I had to
:>     make some major decisions on the algorithm design and I decided to
:>     weight the design more towards performance and large-storage, and
:>     small-storage suffered a bit.
:
:Cavalier was a poor choice of words, the post was full of typos as well
:so I am not proud of it.  You are solving a somewhat different problem
:and you have code out now, which is a huge achievement.  Still I think
:you can iteratively improve your design using some of the techniques I
:have stumbled upon.  There are probably some nice tricks I can get from
:your code base too once I delve into it.

    Meh, you should see my documents when I post them without taking 10
    editorial passes.  Characters reversed, typos, sentences which make
    no sense, etc.

:>     In anycase, it sounds like Tux3 is using many similar ideas.  I think
:>     you are on the right track.
:
:Thankyou very much.  I think you are on the right track too, which you
:have a rather concrete way of proving.
:
:....
:>     rather then have to go through another 4 months of rewriting.  All
:>     major designs are an exercise in making trade-offs in topology, feature
:>     set, algorithmic complexity, debuggability, robustness, etc.  The list
:>     goes on forever.
:> 
:
:The big ahas! that eliminated much of the complexity in the Tux3 design
:were:
:
:  * Forward logging - just say no to incomplete transactions
:  * Version weaving - just say no to recursive copy on write
:
:Essentially I have been designing Tux3 for ten years now and working
:seriously on the simplifying elements for the last three years or so,
:either entirely on paper or in related work like ddsnap and LVM3.
:
:One of the nice things about moving on from design to implementation of
:Tux3 is that I can now background the LVM3 design process and see what
:Tux3 really wants from it.  I am determined to match every checkbox
:volume management feature of ZFS as efficiently or more efficiently,
:without violating the traditional layering between filesystem and
:block device, and without making LVM3 a Tux3-private invention.
:
:>     Laters!

    I can tell you've been thinking about Tux for a long time.  If I
    had one worry about your proposed implementation it would be in the
    area of algorithmic complexity.  You have to deal with the in-memory 
    cache, the log, the B-Tree, plus secondary indexing for snapshotted
    elements and a ton of special cases all over the place.  Your general
    lookup code is going to be very, very complex.

    My original design for HAMMER was a lot more complex (if you can
    believe it!) than the end result.  A good chunk of what I had to do
    going from concept to reality was deflate a lot of that complexity.
    When it got down to brass tacks I couldn't stick with using the 
    delete_tid as a B-Tree search key field, I had to use create_tid.  I
    couldn't use my fine-grained storage model concept because the
    performance was terrible (too many random writes interfering with
    streaming I/O).  I couldn't use a hybrid B-Tree, where a B-Tree element
    could hold the base of an entirely new B-Tree (it complicated the pruning
    and reblocking code so much that I just gave up trying to do it in
    frustration).  I couldn't implement file extents other than 16K and 64K
    blocks (it really complicated historical lookups and the buffer cache
    couldn't handle it) <--- that one really annoyed me.  I had overly
    optimized the storage model to try to get block pointers down to 32 bits
    by localizing B-Tree elements in the same 'super clusters' as the data
    they referenced.  It was a disaster and I ripped it out.  The list goes
    on :-)

    I do wish we had something like LVM on BSD systems.  You guys are very
    lucky in that regard.  LVM is really nice.

    BTW it took all day to write this!

                                        -Matt
                                        Matthew Dillon 
                                        <dillon at backplane.com>

:Hopefully not too much later.  I find this dialog very fruitful.  I just
:wish such dialog would occur more often at the design/development stage
:in Linux and other open source work instead of each group obsessively
:ignoring all "competing" designs and putting huge energy into chatting
:away about the numerous bugs that arise from rushing their design or
:ignoring the teachings of history.
:
:Regards,
:
:Daniel

From phillips at phunq.net  Sun Jul 27 04:51:34 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Sun, 27 Jul 2008 04:51:34 -0700
Subject: [Tux3] Comparison to Hammer fs design
In-Reply-To: <200807260202.m6Q222Ma018102@apollo.backplane.com>
References: <4889fdb4$0$849$415eb37d@crater_reader.dragonflybsd.org>
        <200807251513.59912.phillips@phunq.net>
        <200807260202.m6Q222Ma018102@apollo.backplane.com>
Message-ID: <200807270451.35458.phillips@phunq.net>

Subscribed now, everything should be OK.

On Friday 25 July 2008 19:02, Matthew Dillon wrote:
> :Yes, that is the main difference indeed, essentially "log everything" vs
> :"commit" style versioning.  The main similarity is the lifespan oriented
> :version control at the btree leaves.
> 
>     Reading this and a little more that you describe later let me make
>     sure I understand the forward-logging methodology you are using.
>     You would have multiple individually-tracked transactions in
>     progress due to parallelism in operations initiated by userland and each
>     would be considered committed when the forward-log logs the completion
>     of that particular operation?

Yes.  Writes tend to be highly parallel in Linux because they are
mainly driven by the VMM attempting to clean cache dirtied by active
writers, who generally do not wait for syncing.  So this will work
really well for buffered IO, which is most of what goes on in Linux.
I have not thought much about how well this works for O_SYNC or
O_DIRECT from a single process.  I might have to do it slightly
differently to avoid performance artifacts there, for example, guess
where the next few direct writes are going to land based on where the
most recent ones did and commit a block that says "the next few commit
blocks will be found here, and here, and here...".

When a forward commit block is actually written it contains a sequence
number and a hash of its transaction in order to know whether the
commit block write ever completed.  This introduces a risk that data
overwritten by the commit block might contain the same hash and same
sequence number in the same position, causing corruption on replay.
The chance of this happening is inversely related to the size of the
hash times the chance of colliding with the same sequence number in
random data times the chance of rebooting randomly.  So the risk can
be set arbitrarily small by selecting the size of the hash, and using
a good hash.  (Incidentally, TEA was not very good when I tested it in
the course of developing dx_hack_hash for HTree.)
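
To make that concrete, here is a bare sketch of the check replay would
do.  The field names are invented and the hash is left abstract, so
take it as an illustration rather than the actual commit block layout:

   #include <stdint.h>

   struct commit {
       uint64_t seq;    /* position in the forward chain */
       uint64_t hash;   /* hash of the transaction body */
       uint64_t next;   /* where the next commit block will land, if known */
   };

   /* Replay accepts a candidate commit block only if both the expected
      sequence number and the body hash match; stale data left behind at
      that location will almost certainly fail one of the two tests. */
   int commit_is_valid(struct commit *commit, uint64_t expect_seq,
                       const void *body, unsigned len,
                       uint64_t (*hash)(const void *data, unsigned len))
   {
       return commit->seq == expect_seq && commit->hash == hash(body, len);
   }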

Note: I am well aware that a debate will ensue about whether there is
any such thing as "acceptable risk" in relying on a hash to know if a
commit has completed.  This occurred in the case of Graydon Hoare's
Monotone version control system and continues to this day, but the fact
is, the cool modern version control systems such as Git and Mercurial
now rely very successfully on such hashes.  Nonetheless, the debate
will keep going, possibly as FUD from parties who just plain want to
use some other filesystem for their own reasons.  To quell that
definitively I need a mount option that avoids all such commit risk,
perhaps by providing modest sized journal areas salted throughout the
volume whose sole purpose is to record log commit blocks, which are
then no longer forward.  That is only slightly less efficient than
forward logging and still better than journalling, which has to seek
far away to the journal and has to reserve journal space for the
biggest possible journal transaction, as opposed to the maximum number
of commit blocks needed for the largest possible VFS transaction
(probably one).

>     If the forward log entries are not (all) cached in-memory that would mean
>     that accesses to the filesystem would have to be run against the log
>     first (scanning backwards), and then through to the B-Tree?  You
>     would solve the need for having an atomic commit ('flush groups' in
>     HAMMER), but it sounds like the algorithmic complexity would be
>     very high for accessing the log.

Actually, the btree node images are kept fully up to date in the page
cache which is the only way the high level filesystem code accesses
them.  They do not reflect exactly what is on the disk, but they do
reflect exactly what would be on disk if all the logs were fully
rolled up ("replay").

The only operations on forward logs are:

  1) Write them
  2) Roll them up into the target objects
  3) Wait on rollup completion

The rollup operation is also used for replay after a crash.

A forward log that carries the edits to some dirty cache block pins
that dirty block in memory and must be rolled up into a physical log
before the cache block can be flushed to disk.  Fortunately, such a
rollup requires only a predictable amount of memory: space to load
enough of the free tree to allocate space for the rollup log, enough
available cache blocks to probe the btrees involved, and a few
cache blocks to set up the physical log transaction.  It is the
responsibility of the transaction manager to ensure that sufficient
memory to complete the transaction is available before initiating it,
otherwise deadlock may occur in the block writeout path.  (This
requirement is the same as for any other transaction scheme).

One traditional nasty case that becomes really nice with logical
forward logging is truncate of a gigantic file.  We just need to
commit a logical update like ['resize', inum, 0] then the inode data
truncate can proceed as convenient.  Another is orphan inode handling
where an open file has been completely unlinked, in which case we
log the logical change ['free', inum] then proceed with the actual
delete when the file is closed or when the log is replayed after a
surprise reboot.
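
For illustration only, those two logical records could be about this
small.  The opcodes and field widths here are invented, not the real
on-disk format:

   #include <stdint.h>

   enum { LOG_RESIZE, LOG_FREE };   /* invented opcodes */

   struct log_resize { uint8_t op; uint64_t inum, size; };   /* ['resize', inum, 0] */
   struct log_free   { uint8_t op; uint64_t inum; };         /* ['free', inum] */

A gigantic truncate costs one log_resize record in a commit block; the
block reclaim happens later, and replaying the record after a crash
simply restarts the same deferred work.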

Logical log replay is not idempotent, so special care has to be taken
on replay to ensure that specified changes have not already been
applied to the target object.  I choose not to go the traditional route
of providing special case tests for "already applied" which get really
ugly or unreliable when there are a lot of stacked changes.  Instead I
introduce the rule that a logical change can only be applied to a known
good version of the target object, a promise that is fulfilled by the
physical logging layer.

For example, on replay, if we have some btree index on disk for which
some logical changes are outstanding then first we find the most recent
physically logged version of the index block and read it into cache,
then apply the logical changes to it there.  Where interdependencies
exist between updates, for example the free tree should be updated to
reflect a block freed by merging two btree nodes, the entire collection
of logical and physical changes has to be replayed in topologically
sorted order, the details of which I have not thought much about other
than to notice it is always possible.

When replay is completed, we have a number of dirty cache blocks which
are identical to the unflushed cache blocks at the time of a crash,
and we have not yet flushed any of those to disk.  (I suppose this gets
interesting and probably does need some paranoid flushing logic in
replay to handle the bizarre case where a user replays on a smaller
memory configuration than they crashed on.)  The thing is, replay
returns the filesystem to the logical state it was in when the crash
happened.  This is a detail that journalling filesystem authors tend
to overlook: actually flushing out the result of the replay is
pointless and only obscures the essential logic.  Think about a crash
during replay, what good has the flush done?

>     And even though you wouldn't have to group transactions into larger
>     commits the crash recovery code would still have to implement those
>     algorithms to resolve directory and file visibility issues.  The problem
>     with namespace visibility is that it is possible to create a virtually
>     unending chain of separate but inter-dependent transactions which either
>     all must go, or none of them.  e.g. creating a, a/b, a/b/c, a/b/x, a/b/c/d,
>     etc etc.

I do not see why this example cannot be logically logged in pieces:

   ['new', inum_a, mode etc] ['link', inum_parent, inum_a, "a"]
   ['new', inum_b, mode etc] ['link' inum_a, inum_b, "b"]
   ['new', inum_c, mode etc] ['link' inum_a_b, inum_c, "c"]
   ['new', inum_x, mode etc] ['link' inum_a_b, inum_x, "x"]
   ['new', inum_d, mode etc] ['link' inum_a_b_c, inum_d, "d"]

Logical updates on one line are in the same logical commit.  Logical
allocations of blocks to record the possibly split btree leaves and new
allocations omitted for clarity.  The omitted logical updates are
bounded by the depth of the btrees.  To keep things simple, the logical
log format should be such that it is impossible to overflow one commit
block with the updates required to represent a single vfs level
transaction.

I suspect there may be some terminology skew re the term "transaction".
Tux3 understands this as "VFS transaction", which does not include
trying to make an entire write(2) call atomic for example, but only
such things as allocate+write for a single page cache page or
allocate+link for a sys_link call.  Fsync(2) is not a transaction but
a barrier that Tux3 is free to realize via multiple VFS transactions,
which the Linux VFS now takes care of pretty well after many years
of patching up the logic.

>     At some point you have to be able to commit so the whole mess 
>     does not get undone by a crash, and many completed mini-transactions
>     (file or directory creates) actually cannot be considered complete until
>     their governing parent directories (when creating) or children (when
>     deleting) have been committed.  The problem can become very complex.

Indeed.  I think I have identified a number of techniques for stripping
away much of that complexity.  For example, the governing parent update
can be considered complete as soon as the logical 'link' log commit has
completed.

>     Last question here:  You are forward-logging high level operations.
>     You are also going to have to log meta-data (actual B-Tree manipulation)
>     commits in order to recover from a crash while making B-Tree
>     modifications.  Correct?

Correct!  That is why there are two levels of logging: logical and
physical.  The physical logging level takes care of updating the cache
images of disk blocks to match what the logical logging level expects
before it can apply its changes.  These two logging levels are
interleaved: where a logical change requires splitting a btree block,
the resulting blocks are logged physically, but linked into the parent
btree block using a logical update stored in the commit block of the
physical log transaction.  How cool is that?  The physical log
transactions are not just throwaway things, they are the actual new
data.  Only the commit block is discarded, which I suppose will leave
a lot of one block holes around the volume, but then I do not have to
require that the commit block be immediately adjacent to the body of
the transaction, which will allow me to get good value out of such
holes.  On modern rotating media, strictly linear transfers are not
that much more efficient than several discontiguous transfers that all
land relatively close to each other.

>     So your crash recovery code will have to handle 
>     both meta-data undo and completed and partially completed transactions.

Tux3 does not use undo logging, only redo, so a transaction is complete
as soon as there is enough information on durable media to replay the
redo records.

>     And there will have to be a tie-in between the meta-data commits and
>     the transactions so you know which ones have to be replayed.  That
>     sounds fairly hairy.  Have you figured out how you are doing to do that?

Yes.  Each new commit records the sequence number of the oldest commit
that should be replayed.  So the train lays down track in front of
itself and pulls it up again when the caboose passes.  For now, there
is just one linear sequence of commits, though I could see elaborating
that to support efficient clustering.
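
In code terms the replay window might look roughly like this, with all
of the names invented:

   #include <stdint.h>

   struct commit_head { uint64_t seq, oldest; /* redo records follow */ };

   struct commit_head *read_commit(uint64_t seq);   /* assumed helper */
   void apply_redo(struct commit_head *commit);     /* roll redo into cached btrees */

   void replay(struct commit_head *newest)
   {
       /* Commits older than 'oldest' were already rolled up, so their
          track has been pulled up behind the train. */
       for (uint64_t seq = newest->oldest; seq <= newest->seq; seq++)
           apply_redo(read_commit(seq));
   }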

One messy detail: each forward log transaction is written into free
space wherever physically convenient, but we need to be sure that that
free space is not allocated for data until log rollup has proceeded
past that transaction.  One way to do this is to make a special check
against the list of log transactions in flight at the point where
extent allocation thinks it has discovered a suitable free block, which
is the way ddsnap currently implements the idea.  I am not sure whether
I am going to stick with that method for Tux3 or just update the disk
image of the free tree to include the log transaction blocks and
somehow avoid logging those particular free tree changes to disk.  Hmm,
a choice of two ugly but workable methods, but thankfully neither
affects the disk image.

> :Once I get down to the leaf level I binary search on logical address in
> :the case of a file index btree or on version in the case of an inode
> :table block, so this cost is still Log(N) with a small k.  For a
> :heavily versioned inode or file region this might sometimes result in
> :an overflow block or two that has to be linearly searched which is not
> :a big problem so long as it is a rare case, which it really ought to be
> :for the kinds of filesystem loads I have seen.  A common example of a
> :worst case is /var/log/messages, where the mtime and size are going to
> :change in pretty much every version, so if you have hourly snapshots
> :and hold them for three months it adds up to about 2200 16 byte inode
> :table attributes to record, about 8 inode table leaf blocks.  I really
> :do not see that as a problem.  If it goes up to 100 leaf blocks to
> :search then that could be a problem.
>     
>     I think you can get away with this as long as you don't have too many
>     snapshots, and even if you do I noticed with HAMMER that only a small
>     percentage of inodes have a large number of versions associated with
>     them from normal production operation.

Yes, that is my expectation.  I think everything will perform fine up
to a few hundred snapshots without special optimization, which is way
beyond current expectations.  Most users only recently upgraded to
journalling filesystems let alone having any exposure to snapshots
at all.

>     /var/log/messages 
>     is an excellent example of that.  Log files were affected the most though
>     I also noticed that very large files also wind up with multiple versions
>     of the inode, such as when writing out a terabyte-sized file. 

Right, when writing the file takes longer than the snapshot interval.

>     Even with a direct bypass for data blocks (but not their meta-data,
>     clearly), HAMMER could only cache so much meta-data in memory before
>     it had to finalize the topology and flush the inode out.  A
>     terabyte-sized file wound up with about 1000 copies of the inode
>     prior to pruning (one had to be written out about every gigabyte or so).

For Tux3 with an hourly snapshot schedule that will only be four or
five versions of the [mtime, size] attribute to reflect the 4.7 hours
write time or so, and just one version of each block pointer.

> :The penultimate inode table index block tells me how many blocks a
> :given inode lives in because several blocks will have the same inum
> :key.  So the lookup algorithm for a massively versioned file becomes:
> :1) read the first inode table block holding that inum; 2) read the last
> :block with the same inum.  The latter operation only needs to consult
> :the immediate parent index block, which is locked in the page cache at
> :that point.
> 
>     How are you dealing with expansion of the logical inode block(s) as
>     new versions are added?  I'm assuming you are intending to pack the
>     inodes on the media so e.g. a 128-byte inode would only take up
>     128 bytes of media space in the best case.  Multiple inodes would be
>     laid out next to each other logically (I assume), but since the physical
>     blocks are larger they would also have to be laid out next to each
>     other physically within any given backing block.  Now what happens
>     when one has to be expanded?

The inode table block is split at a boundary between inodes.

An "inode" is broken up into attribute groups arranged so that it makes
sense to update or version all the members of an attribute group
together.  The "standard" attribute group consists of mode, uid, gid,
ctime and version, which adds up to 16 bytes.  The smallest empty inode
(touch foo) is 40 bytes and a file with "foo" in it as immediate data
is 64 bytes.  This has a standard attribute, a link count attribute, a
size/mtime attribute, and an immediate data attribute, all versioned.
An inode with heavily versioned attributes might overflow into the next
btree leaf as described elsewhere.

Free inodes lying between active inodes in the same leaf use two bytes
each, a consequence of the inode leaf directory, which has a table
at the top of the leaf of two byte pointers giving the offset of each
inode in the leaf, stored backwards and growing down towards the inode
attributes.  (This is a slight evolution of the ddsnap leaf format.)
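
In rough C, one plausible reading of that leaf directory looks like
this, with attributes packed from the front of the block and two byte
offsets counting back from the end; the real layout may differ in
detail:

   #include <stdint.h>

   enum { BLOCKSIZE = 4096 };

   /* Offset table entry for inode slot i, counting back from the end
      of the leaf. */
   static inline uint16_t *ileaf_entry(void *leaf, unsigned i)
   {
       return (uint16_t *)((char *)leaf + BLOCKSIZE) - 1 - i;
   }

   /* Attribute bytes of slot i occupy [entry[i], entry[i + 1]) from the
      leaf base, so a free inode between two active ones costs only its
      two byte table entry. */
   static inline unsigned ileaf_attr_size(void *leaf, unsigned i)
   {
       return *ileaf_entry(leaf, i + 1) - *ileaf_entry(leaf, i);
   }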

>     I'm sure this ties into the forward-log but even with the best algorithms
>     you are going to hit limited cases where you have to expand the inode.
>     Are you just copying the inode block(s) into a new physical allocation
>     then?

Split the inode in memory, allocating a new buffer; forward log the two
new pieces physically out to disk with a logical record in the commit
block recording where the two new pointers are to be inserted without
actually inserting them until a logical rollup episode is triggered.

> :>     I couldn't get away from having a delete_tid (the 'death version
> :>     numbers').  I really tried :-)  There are many cases where one is only
> :>     deleting, rather then overwriting.
> :
> :By far the most common case I would think.  But check out the versioned
> :pointer algorithms.  Surprisingly that just works, which is not exactly
> :obvious:
> :
> :   Versioned pointers: a new method of representing snapshots
> :   http://lwn.net/Articles/288896/
> 
>     Yes, it makes sense.  If the snapshot is explicitly taken then you
>     can store direct references and chain, and you wouldn't need a delete
>     id in that case.  From that article though the chain looks fairly
>     linear.  Historical access could wind up being rather costly.

I can think of three common cases of files that get a lot of historical
modifications:

  * Append log
      - The first iteration in each new version generates a new size
        attribute

  * Truncate/rewrite
     - The first iteration in each new version generates a new size
       attribute and either a new file index root or a new immediate
       data attribute

  * Database
     - The first file change in each new version generates a new
       versioned pointer at the corresponding logical address in the
       file btree and a new size attribute (just because mtime is
       bundled together with the size in one attribute group).

So the only real proliferation is the size/mtime attributes, which gets
back to what I was thinking about: providing quick access to the
"current" version (whatever that means).

> :I was originally planning to keep all versions of a truncate/rewrite
> :file in the same file index, but recently I realized that that is dumb
> :because there will never be any file data shared by a successor version
> :in that case.  So the thing to do is just create an entirely new
> :versioned file data attribute for each rewrite, bulking up the inode
> :table entry a little but greatly constraining the search for versions
> :to delete and reducing cache pressure by not loading unrelated version
> :data when traversing a file.
> 
>     When truncating to 0 I would agree with your assessment.  For the 
>     topology you are using you would definitely want to use different
>     file data sets.  You also have to deal with truncations that are not
>     to 0, that might be to the middle of a file.  Certainly not a
>     common case, but still a case that has to be coded for.  If you treat
>     truncation to 0 as a special case you will be adding considerable
>     complexity to the algorithm.

Yes, the proposal is to treat truncation to exactly zero as a special
case.  It is a tiny amount of extra code for what is arguably the most
common file rewrite case.

>     With HAMMER I chose to keep everything in one B-Tree, whether historical
>     or current.  That way one set of algorithms handles both cases and code
>     complexity is greatly reduced.  It isn't optimal... large amounts of
>     built up history still slow things down (though in a bounded fashion).
>     In that regard creating a separate topology for snapshots is a good
>     idea.

Essentially I chose the same strategy, except that I have the file
trees descending from the inode table instead of stretching out to the
side.  I think this gives a more compact tree overall, and since I am
using just one generic set of btree operations to handle these two
variant btrees, additional code complexity is minimal.

> :>     Both numbers are then needed to 
> :>     be able to properly present a historical view of the filesystem.
> :>     For example, when one is deleting a directory entry a snapshot after
> :>     that deletion must show the directory entry gone.
> :
> :...which happens in Tux3 courtesy of the fact that the entire block
> :containing the dirent will have been versioned, with the new version
> :showing the entry gone.  Here is one of two places where I violate my
> :vow to avoid copying an entire block when only one data item in it
> :changes (the other being the atime table).  I rely on two things to
> :make this nice: 1) Most dirent changes will be logically logged and
> :only rolled up into the versioned file blocks when there are enough to
> :be reasonably sure that each changed directory block will be hit
> :numerous times in each rollup episode.  (Storing the directory blocks
> :in dirent-create order as in PHTree makes this very likely for mass
> :deletes.)  2) When we care about this is usually during a mass delete,
> :where most or all dirents in each directory file block are removed
> :before moving on to the next block.
> 
>     This could wind up being a sticky issue for your implementation.
>     I like the concept of using the forward-log but if you actually have
>     to do a version copy of the directory at all you will have to update the
>     link (or some sort of) count for all the related inodes to keep track
>     of inode visibility, and to determine when the inode can be freed and
>     its storage space recovered.

There is a versioned link count attribute in the inode.  When a link
attribute goes to zero for a particular version it is removed from the
inode.  When there are no more link attributes left, that means no
directory block references the inode for any version and the inode may
be reused.

Note: one slightly bizarre property of this scheme is that an inode
can be reused in any version in which its link count is zero, and the
data blocks referenced by different versions in the same file index
can be completely unrelated.  I doubt there is a use for that.

>     Directories in HAMMER are just B-Tree elements.  One element per
>     directory-entry.  There are no directory blocks.   You may want to
>     consider using a similar mechanism.  For one thing, it makes lookups
>     utterly trivial... the file name is hashed and a lookup is performed
>     based on the hash key, then B-Tree elements with the same hash key
>     are iterated until a match is found (usually the first element is the
>     match).  Also, when adding or removing directory entries only the
>     directory inode's mtime field needs to be updated.  Its size does not.

Ext2 dirents are 8 bytes + name + round up to 4 bytes, very tough to
beat that compactness.  We have learned through bitter experience that
anything other than an Ext2/UFS style physically stable block of
dirents makes it difficult to support NFS telldir cookies accurately
because NFS gives us only a 31 bit cookie to work with, and that is
not enough to store a cursor for, say, a hash order directory
traversal.  This is the main reason that I have decided to go back to
basics for the Tux3 directory format, PHTree, and make it physically
stable.

In the PHTree directory format lookups are also trivial: the directory
btree is keyed by a hash of the name, then each dirent block (typically
one) that has a name with that hash is searched linearly.  Dirent block
pointer/hash pairs are at the btree leaves.  A one million entry
directory has about 5,000 dirent blocks referenced by about 1000 btree
leaf blocks, in turn referenced by three btree index blocks (branching
factor of 511 and 75% fullness).  These blocks all tend to end up in
the page cache for the directory file, so searching seldom references
the file index btree.
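
A hedged sketch of that lookup path, where every type and helper name
is a stand-in for whatever the real code will use:

   struct dir; struct dblock; struct dirent;   /* opaque stand-ins */

   unsigned name_hash(const char *name, int len);
   int btree_probe(struct dir *dir, unsigned hash);          /* first leaf entry for hash */
   int btree_next(struct dir *dir, int at, unsigned hash);   /* next match, -1 when done */
   struct dblock *read_dirent_block(struct dir *dir, int at);
   struct dirent *scan_names(struct dblock *block, const char *name, int len);

   struct dirent *phtree_lookup(struct dir *dir, const char *name, int len)
   {
       unsigned hash = name_hash(name, len);   /* key for the directory btree */

       for (int at = btree_probe(dir, hash); at >= 0; at = btree_next(dir, at, hash)) {
           struct dirent *found = scan_names(read_dirent_block(dir, at), name, len);
           if (found)
               return found;   /* typically found in the first block */
       }
       return 0;   /* no such name */
   }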

I did some back of the envelope calculations of the number of cache
lines that have to be hit for a lookup by Hammer with its fat btree
keys and lower branching factor, vs Tux3 with its high branching
factor, small keys, small pointers, extra btree level for dirent
offset stability and stupidly wasteful linear dirent search.  I will
not bore the list with the details, but it is obvious that Tux3/PHTree
will pass Hammer in lookup speed for some size of directory because of
the higher btree fanout.  PHTree starts from way behind courtesy of the
32 cache lines that have to be hit on average for the linear search,
amounting to more than half the CPU cost of performing a lookup in a
million element directory, so the crossover point is somewhere up in
the millions of entries.

Thus annoyed, I cast about for a better dirent leaf format than the
traditional Ext2 directory block, and found one after not too much
head scratching.  I reuse the same leaf directory format as for inode
leaf blocks, but this time the internal offsets are sorted by lexical
name order instead of inode number order.  The dirents become Ext2
dirents minus the record number format, and minus the padding to
four byte alignment, which does not do anything useful.  Dirent inum
increases to 6 bytes, balanced by saving 1.5 pad bytes on average,
so the resulting structure stores about 2% fewer dirents than the
Ext2 format against a probable 50% reduction in CPU latency per lookup.
A fine trade indeed.

The new format is physically stable and suitable for binary searching.
When we need to manage space within the leaf for creating or deleting
an entry, a version of the leaf directory ordered by offset can be
created rapidly from the lexically sorted directory, which occupies at
most about six cache lines.  The resulting small structure can be
cached to take advantage of the fact that most mass creates and deletes
keep hitting the same dirent block repeatedly, because Tux3 hands the
dirents to getdents in physical dirent order.

I concluded that both Hammer and PHTree are exceptionally fast at name
resolution.  When I have the new dirent block format in place I expect
to really hammer Hammer.  But then I do not expect Hammer to stand
still either ;-)

>     Your current directory block method could also represent a problem for
>     your directory inodes... adding and removing directory entries causing
>     size expansion or contraction could require rolling new versions
>     of the directory inode to update the size field.  You can't hold too
>     much in the forward-log without some serious indexing.  Another
>     case to consider along with terabyte-sized files and log files.

A PHTree directory grows like an append-only file, a block at a time,
though every time a new entry is added, mtime has to change, so mtime
changes more often than size.  They are grouped together in the same
attribute, so the distinction is moot.  If the directory is modified at
least once after every snapshot (quite likely) then the snapshot
retention policy governs how many mtime/size attributes are stored in
the inode.

Say the limit is 1,024 snapshots, then 16K worth of those attributes
have to be stored, which once again encourages me to store the "current"
mtime specially where it can be retrieved quickly.  It would also be a
good idea to store the file data attribute in front of the mtime/size
attributes so only stat has to worry about the bulky version info, not
lookup.

Note that there is no such thing as revving an entire inode, only bits
and pieces of it.

For truly massive number of versions, the plan is to split up the
version tree into subtrees, each having a thousand versions or so.  For
each version subtree (hmm, subversion tree...) there is a separate
inode btree carrying only attributes and file data owned by members of
that subtree.  At most log(subtrees) of those tables need to be
accessed by any versioned entity algorithm.  (Note the terminology
shift from versioned pointers to versioned entities to reflect the fact
that the same algorithms work for both pointers and attributes.)  I
think this approach scales roughly to infinity, or at least to the
point where log(versions) goes vertical which is essentially never.  It
requires a bounded amount of implementation effort, which will probably
be deferred for months or years.

Incidentally, PHTree directories are pretty much expand-only, which
is required in order to maintain physical dirent stability.  Compaction
is not hard, but it has to be an admin-triggered operation so it will
not collide with traversing the directory.

> :>     Both numbers are 
> :>     also needed to be able to properly prune out unwanted historical data
> :>     from the filesystem.  HAMMER's pruning algorithm (cleaning out old  
> :>     historical data which is no longer desired) creates holes in the
> :>     sequence so once you start pruning out unwanted historical data
> :>     the delete_tid of a prior record will not match the create_tid of the
> :>     following one (historically speaking).
> :
> :Again, check out the versioned pointer algorithms.  You can tell what
> :can be pruned just by consulting the version tree and the create_tids
> :(birth versions) for a particular logical address.  Maybe the hang is
> :that you do not organize the btrees by logical address (or inum in the
> :case of the inode table tree).  I thought you did but have not read
> :closely enough to be sure.
> 
>     Yes, I see.  Can you explain the versioned pointer algorithm a bit more,
>     it looks almost like a linear chain (say when someone is doing a daily
>     snapshot).  It looks great for optimal access to the HEAD but it doesn't
>     look very optimal if you want to dive into an old snapshot. 

It is fine for diving into old snapshots.  Lookup is always O(elements)
where an element is a versioned pointer or (later) extent, which are
very compact at 8 bytes each.  Because the truncate/rewrite file update
case is so common, you will rarely see more than one version at any
given logical address in a file.  A directory growing very slowly could
end up with about 200 different versions of each of its final few
blocks, one for each added entry.  It takes on the order of a
microsecond to scan through a few hundred versioned pointers to find
the one that points at the version in which we are interested, and that
only happens the first time the directory block is read into the page
cache.  After that, each reference to the block costs a couple of
lookups in a page cache radix tree.

The version lookup algorithm is not completely obvious: we scan through
the list of pointers looking for the version label that is nearest the
version being accessed on the path from that version to the root of the
version tree.  This potentially expensive search is accelerated using a
bitmap table to know when a version is "on the path" and a pre-computed
ord value for each version, to know which of the versions present and
"on the path" for a given logical address is furthest from the root
version.  This notion of furthest from the root on the path from a given
version to the root implements data inheritance, which is what yields
the compact representation of versioned data, and is also what
eliminates the need to explicitly represent the death version.  You have
discovered this too, in a simpler form.  I believe that once you get the
aha you will want to add versions of versions to your model.
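
To pin that down, here is a sketch of the selection step, assuming the
on-path bitmap and the ord table have been precomputed from the version
tree as described (element layout simplified):

   #include <stdint.h>

   struct element { uint32_t version; uint32_t block; };

   /* Return the index of the element that governs the target version:
      the one whose label lies on the path from the target to the root
      and is farthest from the root, or -1 if no version applies. */
   int pick_element(struct element *element, int count,
                    const uint8_t *on_path, const unsigned *ord)
   {
       int best = -1;
       for (int i = 0; i < count; i++) {
           unsigned v = element[i].version;
           if (!(on_path[v >> 3] & (1 << (v & 7))))
               continue;   /* not an ancestor of the target version */
           if (best < 0 || ord[v] > ord[element[best].version])
               best = i;   /* farther from the root wins (data inheritance) */
       }
       return best;
   }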

>     For informational purposes: HAMMER has one B-Tree, organized using
>     a strict key comparison.  The key is made up of several fields which
>     are compared in priority order:
> 
>       localization    - used to localize certain data types together and
>                         to separate pseudo filesystems created within
>                         the filesystem.
>       obj_id          - the object id the record is associated with.
>                         Basically the inode number the record is 
>                         associated with.
> 
>       rec_type        - the type of record, e.g. INODE, DATA, SYMLINK-DATA,
>                         DIRECTORY-ENTRY, etc.
> 
>       key             - e.g. file offset
> 
>       create_tid      - the creation transaction id
> 
>     Inodes are grouped together using the localization field so if you
>     have a million inodes and are just stat()ing files, the stat
>     information is localized relative to other inodes and doesn't have to
>     skip file contents or data, resulting in highly localized accesses
>     on the storage media.
> 
>     Beyond that the B-Tree is organized by inode number and file offset.
>     In the case of a directory inode, the 'offset' is the hash key, so
>     directory entries are organized by hash key (part of the hash key is
>     an iterator to deal with namespace collisions).
> 
>     The structure seems to work well for both large and small files, for
>     ls -lR (stat()ing everything in sight) style traversals as well as 
>     tar-like traversals where the file contents for each file is read.

That is really sweet, provided that you are ok with the big keys.  It
was smart to go with that regular structure, giving you tons of control
over everything and getting the system up and running early, thus beating the
competition.  I think you can iterate from there to compress things and
be even faster.  Though if I were to presume to choose your priorities
for you, it would be to make the reblocking optional.

>     The create_tid is the creation transaction id.  Historical accesses are
>     always 'ASOF' a particular TID, and will access the highest create_tid
>     that is still <= the ASOF TID.  The 'current' version of the filesystem
>     uses an asof TID of 0xFFFFFFFFFFFFFFFF and hence accesses the highest
>     create_tid.

So you have to search through a range of TIDs to find the operative one
for a particular version, just as I have to do with versioned pointers.
Now just throw in a bitmap like I use to take the version inheritance
into account and you have versioned pointers, which means snapshots of
snapshots and other nice things.  You're welcome :-)

I think that versioned pointer lookup can be implemented in O(log(E))
where E is the number of versioned pointers (entities) at the same
logical address, which would put it on a par with btree indexing, but
more compact.  I have sketched out an algorithm for this but I have not
yet tried to implement it.  

>     There is also a delete_tid which is used to filter out elements that
>     were deleted prior to the ASOF TID.  There is currently an issue with
>     HAMMER where the fact that the delete_tid is *not* part of the B-Tree
>     compare can lead to iterations for strictly deleted elements, versus
>     replaced elements which will have a new element with a higher create_tid
>     that the B-Tree can seek to directly.  In fact, at one point I tried
>     indexing on delete_tid instead of create_tid, but created one hell of a
>     mess in that I had to physically *move* B-Tree elements being deleted
>     instead of simply marking them as deleted.

This is where I do not quite follow you.  File contents are never
really deleted in subsequent versions, they are just replaced.  To
truncate a file, just add or update the size attribute for the target
version and recover any data blocks past the truncate point that
belong only to the target version.

Extended attributes and dirents can be truly deleted.  To delete an
extended attribute that is shared by some other version I will insert
a new versioned attribute that says "this attribute is not here" for
the current version, which will be inherited by its child versions.

Dirent deletion is handled by updating the dirent block, possibly
versioning the entire block.  Should the entire block ever be deleted
by a directory repacking operation (never actually happens on the vast
majority of Linux systems) that will be handled as a file truncate.

I see that in Hammer, the name is deleted from the btree instead, so
fair enough, that is actual deletion of an object.  But it could be
handled alternatively by adding a "this object is not here" object,
which might be enough to get rid of delete_tid entirely and just have
create_tid.

> :Fair enough.  I have an entirely different approach to what you call
> :mirroring and what I call delta replication.  (I reserve the term
> :mirroring to mean mindless duplication of physical media writes.)  This
> :method proved out well in ddsnap:
> :
> :   http://phunq.net/ddtree?p=zumastor/.git;a=tree;h=fc5cb496fff10a2b03034fcf95122f5828149257;hb=fc5cb496fff10a2b03034fcf95122f5828149257
> :
> :(Sorry about the massive URL, you can blame Linus for that;-)
> :
> :What I do in ddsnap is compute all the blocks that differ between two
> :versions and apply those to a remote volume already holding the first
> :of the two versions, yielding a replica of the second version that is
> :logically but not physically identical.  The same idea works for a
> :versioned filesystem: compute all the leaf data that differs between
> :two versions, per inum, and apply the resulting delta to the
> :corresponding inums in the remote replica.  The main difference vs a
> :ddsnap volume delta is that not all of the changes are physical blocks,
> :there are also changed inode attributes, so the delta stream format
> :has to be elaborated accordingly.
> 
>     Are you scanning the entire B-Tree to locate the differences?  It
>     sounds you would have to as a fall-back, but that you could use the
>     sounds like you would have to as a fall-back, but that you could use the
>     fairly recent.

I plan to scan the whole btree initially, which is what we do in ddsnap
and it works fine.  But by looking at the mtime attributes in the inode
I can see in which versions the file data was changed and thus not have
to scan most files.  Eventually, building in some accelerator to be able
to skip most inode leaves as well would be a nice thing to do.  I will
think about that in the background.

>     HAMMER's mirroring basically works like this:  The master has synchronized
>     up to transaction id C, the slave has synchronized up to transaction id A.
>     The mirroring code does an optimized scan of the B-Tree to supply all
>     B-Tree elements that have been modified between transaction A and C.
>     Any number of mirroring targets can be synchronized to varying degrees,
>     and can be out of date by varying amounts (even years, frankly).
> 
>     I decided to use the B-Tree to optimize the scan.  The B-Tree is
>     scanned and any element with either a creation or deletion transaction
>     id >= the slave's last synchronization point is then serialized and
>     piped to the slave.

That is nearly identical to the ddsnap replication algorithm, and we
also send the list over a pipe.  Ddsnap does not deal with attributes,
only volume blocks, otherwise I think we arrived at the same thing.

>     To avoid having to scan the entire B-Tree I perform an optimization
>     whereby the highest transaction id laid down at a leaf is propagated
>     up the B-Tree all the way to the root.  This also occurs if a B-Tree
>     node is physically deleted (due to pruning), even if no elements are
>     actually present at the leaf within the transaction range.
>     Thus the mirroring scan is able to skip any internal node (and its
>     entire sub-tree) which has not been modified after the synchronization
>     point, and is able to identify any leaf for which real, physical deletions
>     have occurred (in addition to logical deletions which simply set the
>     delete_tid field in the B-Tree element) and pass along the key range
>     and any remaining elements in that leaf for the target to do a
>     comparative scan with.

Hmm.  I have quite a few bits available in the inode table btree node
pointers, which are currently only going to be free inode density
hints.  Something to think about.

>     This allows incremental mirroring, multiple slaves, and also allows
>     a mirror to go offline for months and then pop back online again
>     and optimally pick up where it left off.  The incremental mirroring
>     is important, the last thing I want to do is have to scan 2 billion
>     B-Tree elements to do an incremental mirroring batch.

Very, very nice.

>     The advantage is all the cool features I got by doing things that way.
>     The disadvantage is that the highest transaction id must be propagated
>     up the tree (though it isn't quite that bad because in HAMMER an entire
>     flush group uses the same transaction id, so we aren't constantly
>     repropagating new transaction id's up the same B-Tree nodes when
>     flushing a particular flush group).

This is a job for forward logging :-)

>     You may want to consider something similar.  I think using the
>     forward-log to optimize incremental mirroring operations is also fine
>     as long as you are willing to take the 'hit' of having to scan (though
>     not have to transfer) the entire B-Tree if the mirror is too far
>     out of date.

There is a slight skew in your perception of the function of forward
logging here, I think.  Forward log transactions are not long-lived
disk objects, they just serve to batch together lots of little updates
into a few full block "physical" updates that can be written to the
disk in seek-optimized order.  It still works well for this application.
In short, we propagate dirty bits up the btree while updating the btree
disk blocks only rarely.  Instead, the node dirty bits are tucked into
the commit block of the write transaction or namespace edit that caused
the change, and are in turn propagated into the commit block of an
upcoming physical rollup, eventually working their way up the on-disk
btree image.  But they are present in the cached btree image as soon as
they are set, which is the structure consulted by the replication
process.

> :I intend to log insertions and deletions logically, which keeps each
> :down to a few bytes until a btree rollup episode comes along to perform
> :updating of the btree nodes in bulk.  I am pretty sure this will work
> :for you as well, and you might want to check out that forward logging
> :trick.
> 
>     Yes.  The reason I don't is because while it is really easy to lay down
>     a forward-log, integrating it into lookup operations (if you don't
>     keep all the references cached in-memory) is really complex code.
>     I mean, you can always scan it backwards linearly and that certainly
>     is easy to do, but the filesystem's performance would be terrible.
>     So you have to index the log somehow to allow lookups on it in
>     reverse to occur reasonably optimally.  Have you figured out how you
>     are going to do that?

I hope that question is cleared up now.

>     With HAMMER I have an in-memory cache and the on-disk B-Tree.  Just
>     writing the merged lookup code (merging the in-memory cache with the
>     on-disk B-Tree for the purposes of doing a lookup) was fairly complex.
>     I would hate to have to do a three-way merge.... in-memory cache, on-disk
>     log, AND on-disk B-Tree.  Yowzer!

Tux3 also has an in-memory cache and an on-disk btree, and in addition,
the in-memory cache is identical to some future version of the on-disk
btree, which ought to be good for losing a lot of corner cases.

> :That reminds me, I was concerned about the idea of UNDO records vs
> :REDO.  I hope I have this right: you delay acknowledging any write
> :transaction to the submitter until log commit has progressed beyond the
> :associated UNDO records.  Otherwise, if you acknowledge, crash, and
> :prune all UNDO changes, then you would end up with applications
> :believing that they had got things onto stable storage and be wrong
> :about that.  I have no doubt you did the right thing there, but it is
> :not obvious from your design documentation.
> 
>     The way it works is that HAMMER's frontend is almost entirely
>     disconnected from the backend.  All meta-data operations are cached
>     in-memory.  create, delete, append, truncate, rename, write...
>     you name it.  Nothing the frontend does modifies any meta-data
>     or meta-data buffers.
> 
>     The only synchronization point is fsync(), the filesystem syncer, and
>     of course if too much in-memory cache is built up.  To improve
>     performance, raw data blocks are not included... space for raw data
>     writes is reserved by the frontend (without modifying the storage
>     allocation layer) and those data buffers are written to disk by the
>     frontend directly, just without any meta-data on-disk to reference
>     them so who cares if you crash then!  It would be as if those blocks
>     were never allocated in the first place.

Things are a little different in Linux.  The VFS takes care of most of
what you describe as your filesystem front end, and in fact, the VFS is
capable of running as a filesystem entirely on its own, just by
supplying a few stub methods to bypass the backing store (see ramfs).
I think this is pretty much what you call your front end, though I
probably missed some major functionality you provide there that the
VFS does not directly provide.

A filesystem backend on Linux would be the thing that implements the
prepare_write/commit_write calls that come in from the one-size-fits-all
generic_file_buffered_write function that most filesystems use to
implement write(2).  These functions generally just call back into the
block library, passing a filesystem-specific get_block callback which
the library uses to assign physical disk sector addresses to buffers
attached to the page.  Ext3 opens a journal transaction in prepare_write
and finishes it in commit_write.  Not much of a transaction if you ask
me.  Tux3 wants to do things in a more direct way and in bigger chunks.

Actually, all Tux3 needs to do with the prepare_write/commit_write
calls that generic_file_buffered_write throws at it is remember which
inodes were involved, because the inode keeps a list of dirty pages,
the same ones that prepare_write/commit_write is trying to tell us
about.  When there are enough of them, on one of the commit_write calls
Tux3 can go delving into its file index btree to find or create a place
to store some.  This will generate changes to the btree that must be
logged logically in the commit block of the write transaction that is
being set up, which now gets its sequence number so that the logical
changes can be applied in the same order if replay is needed.

At the point Tux3 is asked to generate some new logical changes, it may
decide that it has already sent enough logical changes onto disk and
now would be a good time to roll some of them up.  The rolled up block
images are sitting in the buffer cache where Tux3 has prudently locked
them by incrementing the page use count.  Tux3 now sets up some physical
log transactions to write out the images.  Before doing so, it copies
each image to a new location in the buffer cache and modifies pointers
in the parent images to point at the new images (possibly having to read
in the parents first).  This generates new logical changes (the changes
to the parent nodes and changes to the free tree to allocate the new
buffer cache positions) which will be logically logged in the commit
blocks of the physical transaction about to be committed.
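
Schematically, with every helper below a stand-in rather than real Tux3
or kernel interfaces:

   struct block;   /* a dirty cached btree block image */

   struct block *copy_to_new_position(struct block *dirty);   /* allocate + copy */
   void point_parent_at(struct block *parent, struct block *child);
   void log_reparent(struct block *parent, struct block *child);   /* logical change */
   void log_alloc(struct block *child);                            /* free tree change */
   void submit_physical_log(struct block **blocks, int count);

   void rollup_one(struct block *dirty, struct block *parent)
   {
       struct block *fresh = copy_to_new_position(dirty);
       point_parent_at(parent, fresh);    /* generates new logical changes... */
       log_reparent(parent, fresh);       /* ...logged in the commit block */
       log_alloc(fresh);                  /* along with the free tree update */
       submit_physical_log(&fresh, 1);    /* the block image is the transaction body */
   }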

When the rollup transactions have been sent to disk, Tux3 can continue
allocating disk space etc by updating the new images of the disk blocks
and setting up new transactions, but no transactions that depend on the
rolled up state of the disk blocks can be allowed to complete until the
rollup transactions have completed, which Tux3 learns of by counting the
interrupt mode ->endio calls that come back from the bio transfers.

Note: no other filesystem on Linux currently works this way.  They all
pretty much rely on the block IO library which implements the fairly
goofy one-block-at-a-time transfer regimen.  The idea of setting up bio
transfers directly from the filesystem is unfortunately something new.
We should have been doing this for a long time, because the interface
is quite elegant, and actually, that is just what the block IO library
is doing many levels down below.  It is time to strip away some levels
and just do what needs to be done in the way it ought to be done.

Notice how recursive the whole idea is.  Logical transactions cause
physical transactions which cause logical transactions etc.  I am
taking it on faith that the recursion terminates properly and does not
get into cycles.  Over time I will give concrete reasons why I think it
all just works.

Digression done, back to the Hammer algorithms...

>     When the backend decides to flush the cached meta-data ops it breaks
>     the meta-data ops into flush-groups whose dirty meta-data fits in the
>     system's buffer cache.  The meta-data ops are executed, building the
>     UNDO, updating the B-Tree, allocating or finalizing the storage, and
>     modifying the meta-data buffers in the buffer cache.  BUT the dirty
>     meta-data buffers are locked into memory and NOT yet flushed to
>     the media.  The UNDO *is* flushed to the media, so the flush groups
>     can build a lot of UNDO up and flush it as they go if necessary.

Hmm, and I swear I did not write the above before reading this paragraph.
Many identical ideas there.

>     When the flush group has completed any remaining UNDO is flushed to the
>     media, we wait for I/O to complete, the volume header's UNDO FIFO index
>     is updated and written out, we wait for THAT I/O to complete, *then*
>     the dirty meta-data buffers are flushed to the media.  The flushes at
>     the end are done asynchronously (unless fsync()ing) and can overlap with
>     the flushes done at the beginning of the next flush group.  So there
>     are exactly two physical synchronization points for each flush.
> 
>     If a crash occurs at any point, upon remounting after a reboot
>     HAMMER only needs to run the UNDOs to undo any partially committed
>     meta-data.
> 
>     That's it.  It is fairly straightforward.
>
>     For your forward-log approach you would have to log the operations 
>     as they occur, which is fine since that can be cached in-memory. 
>     However, you will still need to synchronize the log entries with
>     a volume header update to update the log's index, so the recovery
>     code knows how far the log extends.  Plus you would also have to log
>     UNDO records when making actual changes to the permanent media
>     structures (your B-Trees), which is independent of the forward-log
>     entries you made representing high level filesystem operations, and
>     would also have to lock the related meta-data in memory until the
>     related log entries can be synchronized.  Then you would be able
>     to flush the meta-data buffers.

That is pretty close to the mark, except there are no UNDO records in
Tux3, only logical and physical REDO records.

The forward log is supposed to try to avoid syncing the beginning of
the chain to a known location as much as possible, which it does by
having at least one transaction in its pipeline that has not been
committed to disk yet and is part of an existing forward chain.  The
chain is extended by allocating a location for the commit block of a
new transaction being formed, storing that location in the commit
block of the existing transaction, then committing the existing
transaction to disk.
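
In outline, with invented field and helper names:

   #include <stdint.h>

   struct commit { uint64_t seq, hash, next; /* redo records follow */ };

   uint64_t alloc_commit_block(void);   /* pick a physically convenient location */
   void write_transaction(uint64_t where, struct commit *head,
                          const void *body, unsigned len);

   /* Commit the transaction being formed at 'where' and return the
      location reserved for the next one, so the chain rarely needs to
      touch a fixed superblock location. */
   uint64_t chain_commit(struct commit *head, uint64_t where,
                         const void *body, unsigned len)
   {
       uint64_t next = alloc_commit_block();
       head->next = next;   /* replay follows this pointer forward */
       write_transaction(where, head, body, len);
       return next;
   }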

>     The forward-log approach is definitely more fine-grained, particularly
>     for fsync() operations... those would go much faster using a forward
>     log than the mechanism I use because only the forward-log would have
>     to be synchronized (not the meta-data changes) to 'commit' the work.
>     I like that, it would be a definite advantage for database operations.

I think you will like it even more when you realize that updating the
filesystem header is not required for most transactions.

> :>     HAMMER's B-Tree elements are probably huge compared to Tux3, and that's
> :>     another limiting factor for the fan-out I can use.  My elements
> :>     are 64 bytes each.
> :
> :Yes, I mostly have 16 byte elements and am working on getting most of
> :them down to 12 or 8.
> 
>     I don't know how you can make them that small.  I spent months just
>     getting my elements down to 64 bytes.  The data reference alone for
>     data blocks is 12 bytes (64 bit media storage offset and 32 bit length).

A versioned extent:

   struct extent { unsigned version:10, count:6, block:24; };

Directory format to index the extent within a leaf:

   struct entry { unsigned loglo:24, offset:8; };
   struct group { unsigned loghi:24, count:8; };

Leaf layout (abridged; the three arrays are packed consecutively in the block):

   struct leaf { unsigned groups; struct group group[]; struct entry entry[]; struct extent extent[]; };

Extents are divided into groups, all of which have the same upper 24 bits
of logical address.  The 8 bit offset gives the position of the first
extent with that logical offset, measured in 8 byte units from the base
of the extent group.  The difference between two successive offset
bytes is the number of extents with the same logical offset.  The first
offset byte gives the total size of the group, since it would otherwise
always be zero, which would be a waste.  The base of the ith group is
determined by totalling the zeroth offset bytes of the preceding groups.
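
As a sketch of that last rule, reusing the struct group and struct
entry declarations above (this is my reading, not necessarily the final
format):

   /* Base of group i, in 8 byte units from the start of the extent
      area: total the zeroth offset byte (the group size) of every
      earlier group. */
   static unsigned group_base(struct group *group, struct entry *entry, int i)
   {
       unsigned base = 0;

       for (int g = 0; g < i; g++) {
           base += entry[0].offset;   /* zeroth offset byte = size of group g */
           entry += group[g].count;   /* step past group g's dictionary entries */
       }
       return base;
   }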

The 48 bit logical address is split between the 32 bit dictionary entry
and the 32 bit group descriptor.  This format gives the ability to binsearch on
both entries and groups.  The number of groups per leaf should be just 
one or two at deep leaves where considerable splitting of the logical
address has already occurred, so the overhead of the dictionary is just
32 bits per extent roughly, giving a total overhead of 12 bytes per
extent.  Shallow btrees will approach 16 bytes per extent, but the tree
is shallow anyway so it does not matter.  The shallowest tree consists
of just a few extents recorded as an attribute in the inode btree.  I
introduce the rule that for such an extent attribute, the extents
belonging to each version are presumed to be logically contiguous, so
no directory is needed and the extent overhead drops to 8 bytes in this
common case.  Actually, most files should just consist of a single
extent.  That could be a new attribute form, a size/extent attribute,
with the mtime assumed to be the same as the ctime, which I think would
cover a majority of files on the system I am using right now.

See, I am really obsessive when it comes to saving bits.  None of these
compression hacks costs a lot of code or has a big chance of hiding a
bug, because the extra boundary conditions are strictly local.

>     I think the cpu overhead is going to be a bit worse then you are
>     contemplating, but your access footprint (with regards to system memory
>     use caching the meta-data) for non-historical accesses will be better.

We shall see.  Factors working in my favor are:

  * Snapshots are explicit, so you have to create lots of them plus
    heavily update the data in order to make the versioned data bushy.

  * The linux page cache not only caches data but provides an efficient
    access path to logical blocks that avoids having to consult the
    filesystem metadata repeatedly under heavy traffic.

  * Compact metadata entities including versioned attributes and extents
    (pointers for now)

  * The kind of load that people put on a Netapp is a trivially small
    number of versions for Tux3.

>     Are you going to use a B-Tree for the per-file layer or a blockmap?

Btree, which I call the file index btree.

>     And how are you going to optimize the storage for small files?

Immediate data is allowed in an inode as a versioned attribute.

>     Are you just going to leave them in the log and not push a B-Tree
>     for them? 

It did not occur to me to leave them in the logical log, but yes that
is a nice optimization.  They will only stay there for a while, until a
rollup is triggered, moving them into the inode table proper.  It gets
a little tricky to figure out whether the amortization due to the
logical logging step is worth the extra write.  There is a crossover
somewhere, maybe around a hundred bytes or so, ideal for symlinks and
acls.

> :I might eventually add some explicit cursor caching, but various
> :artists over the years have noticed that it does not make as much
> :difference as you might think.
> 
>      For uncacheable data sets the cpu overhead is almost irrelevant.

For rotating media, true.  There is also flash, and Ramback...

>      But for cached data sets, watch out!  The cpu overhead of your B-Tree
>      lookups is going to be 50% of your cpu time, with the other 50% being
>      the memory copy or memory mapping operation and buffer cache operations.
>      It is really horrendous.  When someone is read()ing or write()ing a
>      large file the last thing you want to do is traverse the same 4 nodes
>      and do four binary searches in each of those nodes for every read().

I found that out from my HTree experience: pretty much the entire cost
of the bad old linear dirops was CPU, and I was able to show a
ridiculous speedup by cutting the cost down to O(log511(ops)) from
O(ops^2).

>      For large fully cached data sets not caching B-Tree cursors will
>      strip away 20-30% of your performance once your B-Tree depth exceeds
>      4 or 5.  Also, the fan-out does not necessarily help there because
>      the search within the B-Tree node costs almost as much as moving
>      between B-Tree nodes.
>  
>      I found this out when I started comparing HAMMER performance with
>      UFS.  For the fully cached case UFS was 30% faster until I started
>      caching B-Tree cursors.  It was definitely noticable once my B-Tree
>      grew past a million elements or so.  It disappeared completely when
>      I started caching cursors into the B-Tree.

OK, you convinced me.  I already have btree cursors (see ddsnap path[])
that are so far only used for per operation btree access and editing.
So these will eventually become cache objects.
 
> :>     If I were to do radix compression I would also want to go with a
> :>     fully dynamic element size and fully dynamic fan-out in order to
> :>     best-compress each B-Tree node.  Definitely food for thought.
> :
> :Indeed.  But Linux is braindamaged about large block size, so there is
> :very strong motivation to stay within physical page size for the
> :immediate future.  Perhaps if I get around to a certain hack that has
> :been perenially delayed, that situation will improve:
> :
> :   "Variable sized page objects"
> :   http://lwn.net/Articles/37795/
> 
>     I think it will be an issue for people trying to port HAMMER.  I'm trying
>     to think of ways to deal with it.  Anyone doing an initial port can 
>     just drop all the blocks down to 16K, removing the problem but
>     increasing the overhead when working with large files.

Do the variable sized page hack :-]

Or alternatively look at the XFS buffer cache shim layer, which
Christoph manhandled into kernel after years of XFS not being accepted
into mainline because of it.

> :>     I'd love to do something like that.  I think radix compression would
> :>     remove much of the topological bloat the B-Tree creates verses using
> :>     blockmaps, generally speaking.
> :
> :Topological bloat?
> 
>     Bytes per record.  e.g. the cost of creating a small file in HAMMER
>     is 3 B-Tree records (directory entry + inode record + one data record),
>     plus the inode data, plus the file data.  For HAMMER that is 64*3 +
>     128 + 112 (say the file is 100 bytes long, round up to a 16 byte
>     boundary)... so that is 432 bytes.

Let me see, for Tux3 that is 64 bytes for the inode part including a
tiny amount of data, plus 32 bytes or so for the dirent part, say 100
bytes.  This is my reward for being an obsessive bit miser.

>     The bigger cost is when creating and managing a large file.  A 1 gigabyte
>     file in HAMMER requires 1G/65536 = 16384 B-Tree elements, which comes
>     to 1 megabyte of meta-data.  If I were to radix-compress those 
>     elements the meta-data overhead would probably be cut to 300KB,
>     possibly even less.
>
>     Where this matters is that it directly effects the meta-data footprint 
>     in the system caches which in turn directly effects the filesystem's
>     ability to cache information without having to go to disk.  It can
>     be a big deal. 

With pointers it is about a megabyte per gigabyte for Tux3 too (and
Ext2 and Ext3) so going to extents is not optional.  I plan a limit of
64 blocks per extent, cutting the metadata overhead down to 16K per
gigabyte, which is hard to complain about.  Larger extents in the file
index would not buy much and would mess up the nice, 8 byte
one-size-fits-all versioned extent design.

>     Adding a forward log to HAMMER is possible, I might do it just for
>     quick write()/fsync() style operations.  I am still very wary of the
>     added code complexity.

Waiting for me to prove the idea seems reasonable.

> :He did remind me (via time travel from year 2000) of some details I
> :should write into the design explicitly, for example, logging orphan
> :inodes that are unlinked while open, so they can be deleted on replay
> :after a crash.  Another nice application of forward logging, which
> :avoids the seek-happy linked list through the inode table that Ext3
> :does.
> 
>     Orphan inodes in HAMMER will always be committed to disk with a
>     0 link count.  The pruner deals with them after a crash.  Orphan
>     inodes can also be commited due to directory entry dependancies.

That is nice, but since I want to run without a pruner I had better
stick with the logical logging plan.  The link count of the inode will
also be committed as zero, or rather, the link attribute will be
entirely missing at that point, which allows for a cross check.

>     The budding operations is very interesting to me...
>     well, the ability to fork the filesystem and effectively write to
>     a snapshot.  I'd love to be able to do that, it is a far superior
>     mechanism to taking a snapshot and then performing a rollback later.

I was thinking more about the operation of taking somebody's home
directory and turning it into a separate volume, for whatever reason.
Tux3 snapshots (versions) have that write/rollback ability, and also
have it for snapshots of snapshots, just in case that was not clear.

>     Hmm.  I could definitely manage more then one B-Tree if I wanted to.
>     That might be the ticket for HAMMER... use the existing snapshot
>     mechanic as backing store and a separate B-Tree to hold all changes
>     made to the snapshot, then do a merged lookup between the new B-Tree
>     and the old B-Tree.  That would indeed work.

Yes, it sounds nice.

>     I can tell you've been thinking about Tux for a long time.  If I
>     had one worry about your proposed implementation it would be in the
>     area of algorithmic complexity.  You have to deal with the in-memory 
>     cache, the log, the B-Tree, plus secondary indexing for snapshotted
>     elements and a ton of special cases all over the place.  Your general
>     lookup code is going to be very, very complex.

The lookup path is actually quite simple, I am far enough in to know
that.  The versioning algorithms are the most complex piece from an
algorithmic point of view, and as you can see from the sample
implementation, they have already been boiled down to nice readable
code.

The details of logging are also less complex than they appear at first
blush, due to a number of simplifications I have discovered.  You
should see the mess that the Ext3 guys had to deal with as a result of
journalling limitations.

I do not plan to provide secondary indexing for any versioned elements,
except as described in the btree leaf formats, which have obligingly
arranged themselves into a nice pattern: fixed sized elements at the
bottom of the leaf grouped by a little directory vector growing down
from the top of the leaf.  I am pleased with the way those details get
more regular as implementation proceeds, opposite to the usual trend.

>     My original design for HAMMER was a lot more complex (if you can
>     believe it!) then the end result.

My design for Tux2 was gross in retrospect.

>     A good chunk of what I had to do 
>     going from concept to reality was deflate a lot of that complexity.
>     When it got down to brass tacks I couldn't stick with using the 
>     delete_tid as a B-Tree search key field, I had to use create_tid.  I
>     couldn't use my fine-grained storage model concept because the
>     performance was terrible (too many random writes interfering with
>     streaming I/O).  I couldn't use a hybrid B-Tree, where a B-Tree element
>     could hold the base of an entirely new B-Tree (it complicated the pruning
>     and reblocking code so much that I just gave up trying to do it in
>     frustration).  I couldn't implement file extents other then 16K and 64K
>     blocks (it really complicated historical lookups and the buffer cache
>     couldn't handle it) <--- that one really annoyed me.  I had overly
>     optimized the storage model to try to get block pointers down to 32 bits
>     by localizing B-Tree elements in the same 'super clusters' as the data
>     they referenced.  It was a disaster and I ripped it out.  The list goes
>     on :-)

I do not rely on reblocking, and pruning of discarded versions is done
by the same somewhat complex but already in production btree traversal
for both the inode table and file btree levels, so I thankfully did not
run into the reasons you discarded the hybrid btree.

File extents are indeed a nasty issue because of the complexity they
impose on the versioned entity algorithms.  But this code will be
implemented and debugged in isolation along the same lines as the
versioned pointer fuzz tester.  I am actually thinking of a list member
who I know is entirely capable of doing that elegantly, and I hope will
raise his hand soon, as otherwise I will have to spend a significant
number of weeks on that.  For the time being I have chosen the path of
wisdom and will stick with versioned pointers until the filesystem is
basically up, versioning, atomically committing, and not deadlocking.

>     I do wish we had something like LVM on BSD systems.  You guys are very
>     lucky in that regard.  LVM is really nice.

But I thought you BSD guys have GEOM.  I think GEOM is really nice, and
I think that Linux LVM has so many problems that I have started work on
a replacement, LVM3.

>     BTW it took all day to write this!

It took two days to write this ;-)

I hope the fanout effect does not mean the next post will take four days.

Regards,

Daniel

From phillips at phunq.net  Sun Jul 27 14:29:06 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Sun, 27 Jul 2008 14:29:06 -0700
Subject: [Tux3] forward logging and NVRAM
In-Reply-To: <567D357E-03E1-45E5-9EE9-37B36CC29C7A@mac.com>
References: <567D357E-03E1-45E5-9EE9-37B36CC29C7A@mac.com>
Message-ID: <200807271429.06849.phillips@phunq.net>

Hi Timothy,

On Sunday 27 July 2008 14:10, timothy norman huber wrote:
> Daniel,
> 
> Several weeks ago, during a discussion over coffee, you mentioned a  
> rather intriguing optimization- by writing the start of the forward  
> log chain to NVRAM rather than disk, you avoid a disk seek and write  
> transfer for each forward log chain.  i believe there was another  
> motivation that had to do with avoiding traversing the entire volume  
> in case of a crash.  Was this design feature to overcome lame fsck  
> performance?

The same motivation, actually.  Writing the start of a forward log
chain to nvram instead of to some known location on disk means that I
do not have to worry about doing unnatural things to optimize such
loads as O_SYNC writing, where each transaction has to complete
before the next one can begin, which means that the forward log chain
never gets more than one element long, requiring a seek to the known
location for every transaction.  Not worse than journalling to be
sure, and usually better, but not as fast as avoiding a seek and disk
write per transaction entirely.

So if somebody could give me 16 bytes of NVRAM or so per volume I
would be most pleased and could use that to generate some nice O_SYNC
benchmarks :-)

By the way, I wonder why your posts do not show up in the Mailman
(Pipermail) archive, while mine and Matt's do?

Daniel

From phillips at phunq.net  Sun Jul 27 22:46:45 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Sun, 27 Jul 2008 22:46:45 -0700
Subject: [Tux3] Two kinds of atomic commit
Message-ID: <200807272246.46581.phillips@phunq.net>

Here I describe two slightly different methods that Tux3 will use to
implement atomic commit of data and metadata.  Both methods combine 
logical and physical forward logging to perform atomic commits
efficiently.  The discussion below assumes we are updating a btree leaf,
for example to make room for more inode data or to add pointers to a
file index.  The same techniques apply equally well to all structures
in the filesystem.

1) The Update a Clone Method

Instead of directly modifying a disk block corresponding to a btree
leaf, we allocate a new block for the leaf and copy the contents of
the original block to the new block, only in the buffer cache (no copy
is performed on disk).  We can now alter the new block at will and
flush it out to disk without affecting the on-disk btree at all.  But
we have not yet linked the new block into the btree.  We could accomplish
that by performing a similar clone update recursively up to the root of
the tree, which creates a new tree root.  The whole chain of new blocks
would then be flushed out to disk and a pointer to the new root stored
at some predictable location so it can be found later.  This is the
"phase tree" method that I invented for Tux2, and is commonly called
"copy on write" these days.  It could also be called the "sue me" method
because Netapp likes to sue those such as Sun who implement it.

Fortunately, there is a better way to do it that I invented recently and
which I have never heard of anyone using before.  We modify only the
leaf node of the btree by cloning and record the pointer to the clone
only in the cached btree index block, not on disk.  To be able to
reconstruct the cached version of the index node after a crash, we log a
logical change record to the disk that says "write this leaf into that
btree index node".

We make whatever changes we need to the clone of the leaf node, then
construct a two block transaction consisting of the clone and a commit
block.  The commit block points at the new leaf node and also carries
the logical update record that describes how to modify the parent index
node recorded on disk to point at the new leaf.  If the commit block can
be allocated immediately adjacent to the clone then we can construct a
single bio to transfer the clone and the commit block to disk in one
operation.  Otherwise we construct two bios.  If there is some previous
commit transaction that has been staged but not yet committed, then we
modify its commit block to point at the location where our new commit
block will be stored.  Otherwise we have to seek to some known location
on the disk to store the location of the new commit.

To be able to tell after a crash whether the commit block and its
associated data made it to disk, we store a sequence number to
distinguish this commit block from a possible similar commit that might
have landed accidentally at the same location earlier, and store a hash
of the two blocks in the commit block.

Now the transaction is ready to go out to disk.  But if the disk is busy
with previous traffic (we hope it is) then we just stage the transaction
for commit, and commit the predecessor transaction instead.  This gives
a chance for any following transaction to extend the forward log chain
instead of having to start a new chain, which would need an extra seek
and transfer.

The original version of the leaf node on disk is not yet freeable,
because it may be needed after a crash (together with the logical update
record) to reconstruct the cached btree node image.  Only after we "roll
up" the logical update log entry by atomically updating the parent node
may we free the original leaf block and the commit block that recorded
the logical update.  Such a rollup happens periodically, otherwise we
would end up with an unreasonably long chain of logical updates
needing to be replayed on remount to reconstruct the btree image in
cache.
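
A rough sketch of what such a commit block might carry, purely to make
the bookkeeping above concrete.  None of these field names come from
actual Tux3 code; they are guesses at one possible layout:

   #include <stdint.h>

   /* One logical update: "write this leaf into that btree index node".
    * Replayed after a crash to rebuild the cached image of the parent. */
   struct logical_update {
       uint64_t parent;     /* block holding the btree index node */
       uint64_t old_leaf;   /* pointer value being replaced */
       uint64_t new_leaf;   /* where the cloned leaf was written */
   };

   struct commit_block {
       uint64_t sequence;      /* guards against a stale commit at this spot */
       uint64_t next_commit;   /* next commit in the forward chain, or 0 */
       uint64_t data_block;    /* location of the cloned leaf */
       uint64_t check;         /* hash over the data block and this block */
       uint32_t update_count;  /* logical updates carried below */
       struct logical_update updates[];
   };

On replay, a commit is trusted only if the sequence number and hash
check out; its logical updates are then applied to rebuild the cached
btree blocks.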

2) The Update in Place Method

An alternate method of atomically updating our btree leaf node is to
write it to disk twice: the first time along with a commit block that
says we intend to overwrite the leaf node with the data part of the
transaction, and the second write does the actual overwrite.  If we
crash before completing the second write, replay will load the image
of the btree block from the commit transaction into cache and we are
ready to continue where we left off.

Like the clone method, the update in place method constructs a
transaction, tries to link it into an existing forward log chain, or
falls back to writing the commit block location to a known location if
it has to, then stages the transaction for commit, committing the
predecessor transaction in the forward chain if there is one.  The
overwrite transfer can be submitted as soon as the commit transaction
has completed, which might consist of one or two bio transfers depending
on whether it was possible to allocate the transaction contiguously or
not.

It is possible that the updated leaf block may be needed as the base for
applying logical updates to reconstruct the cached version of the leaf
as a consequence of some future transaction, but the future transaction
must be able to rely on a known state of the leaf if it wants to log
logical changes against it.  Therefore the future transaction may have
to wait on the update in place transaction to complete as well, to
guarantee that the desired leaf state can be reconstructed.

Any future update in place of the same block has to wait for the
overwrite transfer to complete before submitting its own overwrite
transfer, otherwise the second overwrite may race ahead of the first
creating a stale disk block.

The whole update in place transaction can be freed as soon as the
overwrite transfer completes.

Comparison of the methods

Both methods are very efficient ways to achieve atomic commit.  The
update in place method has the advantage of not perturbing the original
disk layout, possibly reducing fragmentation.  The update clone method
has the advantage of one less write operation.  Tux3 will provide a
mount option to specify which method to use.

A related technique is atomic append.  A whole range of data blocks can
be written out along with a commit block in a single bio transfer, or a
small number of bio transfers in one transaction.  The commit block
carries a logical record saying which file the blocks are to be appended
to; replaying it eventually results in the addition of one or more
extents to the file index, an update of the size and mtime, and removal
of the new data blocks from the free map.  Thus, a file write of
considerable size can be atomically committed with a single bio
transfer.
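
As a sketch only, the logical record for such an atomic append might
look something like the following; the names and exact fields are
invented here for illustration:

   #include <stdint.h>

   /* Logical append record carried in the commit block.  Replay adds the
    * extent to the file index, updates size and mtime, and removes the
    * new blocks from the free map. */
   struct append_update {
       uint64_t inum;       /* file the blocks are appended to */
       uint64_t offset;     /* logical position of the new extent */
       uint64_t block;      /* physical start of the data just written */
       uint32_t count;      /* number of blocks in the extent */
       uint64_t size;       /* file size after the append */
       uint64_t mtime;      /* new modification time */
   };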

Regards,

Daniel

From phillips at phunq.net  Sun Jul 27 23:39:56 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Sun, 27 Jul 2008 23:39:56 -0700
Subject: [Tux3] Comparison to Hammer fs design
In-Reply-To: <200807272131.m6RLVou8036132@apollo.backplane.com>
References: <4889fdb4$0$849$415eb37d@crater_reader.dragonflybsd.org>
        <200807270451.35458.phillips@phunq.net>
        <200807272131.m6RLVou8036132@apollo.backplane.com>
Message-ID: <200807272339.57117.phillips@phunq.net>

On Sunday 27 July 2008 14:31, Matthew Dillon wrote:
> :A versioned extent:
> :
> :   struct extent { unsigned version:10, count:6, block:24; };
> :
> :Directory format to index the extent within a leaf:
> :
> :   struct entry { unsigned loglo:24, offset:8 };
> :   struct group { unsigned loghi:24, count:8 };
> :
> :Leaf layout (abridged):
> :
> :   struct leaf { unsigned groups; struct group[]; struct entry[]; struct extent[] };
> :...
> :See, I am really obsessive when it comes to saving bits.  None of these
> :compression hacks costs a lot of code or has a big chance of hiding a
> :bug, because the extra boundary conditions are strictly local.
>      
>     Man, those are insanely small structures.  Watch out for the cpu
>     and in-kernel-memory tracking overhead.

And there is a mistake, it should be:

-   struct extent { unsigned version:10, count:6, block:24; };
+   struct extent { unsigned version:10, count:6, block:48; };

The compression effort was about finding a way to represent 48 bits of
logical address for each extent and to avoid repeating logical addresses
for successive extents starting at the same logical address, using 32
bits instead of the 64 bits currently used by the ddsnap leaf format,
for a 33% overall improvement in leaf capacity.
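
With the block field widened to 48 bits the three fields total exactly
64 bits, so a versioned extent still packs into a single word.  A
minimal sketch of doing the packing with explicit shifts rather than
bitfields (the field order and helper names are illustrative only):

   #include <stdint.h>

   /* version:10 | count:6 | block:48 packed into one 64 bit word */
   static inline uint64_t make_extent(unsigned version, unsigned count, uint64_t block)
   {
       return (uint64_t)(version & 0x3ff) << 54 |
              (uint64_t)(count & 0x3f) << 48 |
              (block & 0xffffffffffffULL);
   }

   static inline unsigned extent_version(uint64_t x) { return x >> 54; }
   static inline unsigned extent_count(uint64_t x)   { return x >> 48 & 0x3f; }
   static inline uint64_t extent_block(uint64_t x)   { return x & 0xffffffffffffULL; }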

Regards,

Daniel

From phillips at phunq.net  Mon Jul 28 00:18:57 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Mon, 28 Jul 2008 00:18:57 -0700
Subject: [Tux3] File index leaf format
Message-ID: <200807280018.57961.phillips@phunq.net>

In an earlier post I was seen fretting about how to reduce the 
relatively expensive linear search in a PHTree directory file dirent 
block to O(log(dirents)), while keeping the compactness and physical 
stability of the good old Ext dirent format.  It did not take too long 
to find one.

   struct tux3_dirent { unsigned inum:48, type:8, len:8; char name[]; };

Versus an Ext2 dirent, the two byte record size is gone and the inum
is increased from 32 bits to 48.  The fields are big-endian and the 
structure is byte-aligned, and therefore needs to be declared with 
attribute packed so that gcc will generate code that does not throw 
exceptions on alignment-challenged architectures when trying to pick up 
the 48 bit inum field.

A simple directory grows down from the top of the dirent block towards
the dirents:

   be_16 entries[];

Each big endian 16 bit entry is the offset of a dirent from the base of 
the block.  The entries are in lexically sorted order to support 
binsearch.  To maintain physical stability for telldir cookies, an 
entry is never moved once created.  Too bad about directory compaction, 
that is an offline operation.

To search for space within the dirent block we make a copy of the directory 
and sort it.  In the base of the directory we record the size of the 
biggest contiguous free space in the dirent block in order to avoid 
most fruitless searches and the associated sort.  There is also a 
highwater mark to optimize the common case of dirent blocks that just 
fill up from bottom to top, with free space seldom created in the 
middle, and to know how far down the directory can grow before it
meets the dirents.

To delete an entry, just delete it from the directory and reduce the
count of entries.

The block header looks like:

   struct tux3_dirent_block { char magic[4]; unsigned count, high; };

The inum of the directory file may also be stored in the header to help
with fsck.
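
To show how the sorted entry vector pays off, here is a minimal sketch
of the name lookup.  Big-endian conversion and the exact placement of
the entry vector growing down from the top of the block are glossed
over, and the function and parameter names are illustrative, not a
proposed interface (the 48 bit bitfield and packed layout need gcc):

   #include <stdint.h>
   #include <string.h>

   struct tux3_dirent {
       uint64_t inum:48;
       uint8_t type;
       uint8_t len;
       char name[];
   } __attribute__((packed));

   /*
    * entries[]: the 16 bit dirent offsets, already in lexical order of
    * the names they point at (byte order conversion omitted here).
    * Returns the matching dirent or NULL.
    */
   struct tux3_dirent *dirent_lookup(char *block, uint16_t *entries, unsigned count,
                                     const char *name, unsigned namelen)
   {
       int lo = 0, hi = (int)count - 1;
       while (lo <= hi) {
           int mid = (lo + hi) / 2;
           struct tux3_dirent *d = (struct tux3_dirent *)(block + entries[mid]);
           int cmp = memcmp(name, d->name, namelen < d->len ? namelen : d->len);
           if (!cmp)
               cmp = (int)namelen - (int)d->len;
           if (!cmp)
               return d;
           if (cmp < 0)
               hi = mid - 1;
           else
               lo = mid + 1;
       }
       return NULL;
   }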

I am not in a big hurry to implement this, because the good old Ext2
dirent block format will serve quite nicely for now, and I can just cut 
and paste the code to implement that.  But when dirop speed rises to 
the top of this list of things that people are benchmarking, this 
improvement is ready and will measurably increase the speed of creates 
and random deletes.  Also, Tux3 is supposed to have 48 bit inode 
numbers, so the directory format has to support that.

Regards,

Daniel

From dillon at apollo.backplane.com  Mon Jul 28 09:58:04 2008
From: dillon at apollo.backplane.com (Matthew Dillon)
Date: Mon, 28 Jul 2008 09:58:04 -0700 (PDT)
Subject: [Tux3] Two kinds of atomic commit
References: <200807272246.46581.phillips@phunq.net>
Message-ID: <200807281658.m6SGw44L044915@apollo.backplane.com>

:1) The Update a Clone Method
:
:2) The Update in Place Method
:
:Daniel

    Hmm.  Those seem fairly complex.  How would you deal with incidental
    operations on the B-Tree such as doing a split?  A single insert
    could wind up requiring a split of every internal node on the way
    down to the leaf.  I guess you could clone the blocks, but there are
    a ton of parent and child linkages that have to be changed when you
    split a node.  It sounds pretty expensive.

    Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
    Incidental meaning those operations (such as splits) which lead up to
    an insertion or deletion but do not actually perform the insertion
    or deletion.   On crash recovery you would undo partially completed
    B-Tree operations and then REDO the related logical operations.

    Having to allocate new blocks for B-Tree deletions could create a big
    issue when the filesystem becomes full or near-full.

                                        -Matt
                                        Matthew Dillon 
                                        <dillon at backplane.com>

From phillips at phunq.net  Mon Jul 28 12:52:09 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Mon, 28 Jul 2008 12:52:09 -0700
Subject: [Tux3] Two kinds of atomic commit
In-Reply-To: <200807281658.m6SGw44L044915@apollo.backplane.com>
References: <200807272246.46581.phillips@phunq.net>
        <200807281658.m6SGw44L044915@apollo.backplane.com>
Message-ID: <200807281252.09179.phillips@phunq.net>

On Monday 28 July 2008 09:58, Matthew Dillon wrote:
> :1) The Update a Clone Method
> :
> :2) The Update in Place Method
> :
> :Daniel
> 
>     Hmm.  Those seem fairly complex.

Then I need to rewrite the post so it seems as simple as it is :-)

>     How would you deal with incidental 
>     operations on the B-Tree such as doing a split?  A single insert
>     could wind up requiring a split of every internal node on the way
>     down to the leaf.  I guess you could clone the blocks, but there are
>     a ton of parent and child linkages that have to be changed when you
>     split a node.  It sounds pretty expensive.

In this case a log transaction is created containing all of the split
nodes as physical updates and possibly some merged nodes from the free
tree.  Btree splitting can always be committed atomically and
independently of any other activity on the filesystem.  It is nicely
bounded by the btree depth.  Allocations that do not require splitting
the free tree are logged logically, leaving the affected free tree
blocks dirty in cache.  So the allocation updates come "for free",
being tucked into the same commit block that governs the split btree
leaves.

>     Maybe implementing UNDOs for incidental B-Tree operations is the ticket.
>     Incidental meaning those operations (such as splits) which lead up to
>     an insertion or deletion but do not actually perform the insertion
>     or deletion.   On crash recovery you would undo partially completed
>     B-Tree operations and then REDO the related logical operations.

I now see how UNDO works, thank you for your patient explanations.  The
point I overlooked is, fsync (and friends) is indeed the only barrier
you have to worry about.

Still, I think it is good to try to get as much as possible of what was
going on in a big burst of activity with no fsync durably onto disk.
Redo will clearly leave the filesystem in a consistent state less far
back in time than undo.

So my general strategy is to log big changes like splits as "physical"
full block updates and small ones like allocating an extent as logical
edit records in the commit blocks of the physical updates.

The complexity you noted above is due to making sure that the on-disk
image of a block is always what the logical edit record expects it
to be at the time the logical edit is replayed.

>     Having to allocate new blocks for B-Tree deletions could create a big
>     issue when the filesystem becomes full or near-full.

Yes, allocate-in-free is usually nasty.  I need to ensure that there is
always a reserve of blocks available on disk that is more than the
maximum possible transaction that can be accepted.  This simple idea
can get really complex, I know.  One nice trick to simplify things a
little is to have a really generous limit on the maximum number of
dirty blocks that are allowed, when the filesystem has lots of free
space, and shrink that limit progressively as free space approaches
zero.
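
As a toy illustration of that progressive limit (the constants and the
halving rule here are made up for the example, not a design decision):

   #include <stdint.h>

   /*
    * Allow a generous number of dirty blocks while space is plentiful,
    * shrink the allowance as free space falls, and always keep a reserve
    * big enough to write out the largest possible transaction.
    */
   unsigned dirty_limit(uint64_t free_blocks)
   {
       enum { generous = 1 << 16, reserve = 1 << 8 };  /* made-up tuning values */

       if (free_blocks <= reserve)
           return 0;               /* force a rollup before dirtying more */
       uint64_t limit = (free_blocks - reserve) / 2;
       return limit < generous ? limit : generous;
   }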

I now see what you were driving at with your point about interlocking
namespace transactions, etc.  While each VFS transaction can indeed be
committed on its own, atomically and independently, to do so would be
death for throughput.  So it is necessary to batch up a lot of these
transactions, and be mindful of the interdependencies between the VFS
transactions and the underlying data structures.  The rule is,
underlying data changes required by any given VFS transaction must
never lie in more than one commit.  This can be accomplished without
actually tracking the physical representation interdependencies between
VFS transactions.  Instead, just count the dirty cache blocks and send
out an atomic commit for the entire set of transactions when some
threshold is passed.  Now the challenge is to figure out how to avoid
stalling during the commit.

It is thanks to you, your searching questions and the example of Hammer
that I was forced to understand these issues clearly. :-)

Note!  In Linux, VFS means "Virtual Filesystem Switch", that is, the
security, synchronization and abstraction layer that exists above
filesystems in Linux.  I think it means "Virtual File System" to you,
which is just "filesystem" to Linux hackers.

Regards,

Daniel

From phillips at phunq.net  Mon Jul 28 18:20:48 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Mon, 28 Jul 2008 18:20:48 -0700
Subject: [Tux3] Aha on the way to versioned extents
Message-ID: <200807281820.48972.phillips@phunq.net>

Hi all,

There are now over fifty of you here on the Tux3 list, I am most 
pleased.  A special shoutout to all the exnominate people who joined 
last night.  Greetings!

I mentioned that the biggest piece of coding work remaining in Tux3 is  
generalizing from versioned pointers to versioned extents.  Just now, a 
way to turn that big project into a small project popped into my head.

First a couple of terminology adjustments: from now on there will be no 
more exceptions (legacy volume management terminology) and there will 
be elements instead.  A versioned element can be a pointer, an extent 
or an attribute.  Also, there will no longer be chunks.  Chunks are now 
just blocks in accord with standard filesystem/database terminology.

The aha is this: there is already nice code for the versioned pointer 
case, working on a single element list that is the set of versioned 
pointers for some logical block of a file or volume.  To generalize it 
to extents, we take a region of extents on which we want to perform 
some operation and convert it to a set of element lists, one for each 
logical block in the region.  Then we apply an edit operation (say 
version delete, the nastiest) to each of the lists separately.  Then we 
walk through the lists of single block pointers in parallel, converting 
them back to extents.  Then insert those emitted extents back into the 
region we started with, merging with extents on the boundaries of the 
region where possible.  Easy enough to see how to do that, right?
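
A minimal sketch of that expand/edit/recombine shape, using unpacked
structures for readability.  The edit callback stands in for the
existing versioned pointer editing code, the merge with extents at the
region boundaries is left out, and every name here is made up:

   #include <stdint.h>

   struct rextent { uint64_t logical; unsigned version, count; uint64_t block; };
   struct velem { unsigned version; uint64_t block; };   /* one block in one version */

   typedef int (*edit_fn)(struct velem *list, int count);  /* returns new list length */

   int edit_region(struct rextent *in, int nin, uint64_t start, uint64_t len,
                   edit_fn edit, struct rextent *out, int maxout)
   {
       int nout = 0;
       for (uint64_t logical = start; logical < start + len; logical++) {
           struct velem list[64];  /* assumed bound on versions per block */
           int n = 0;

           /* expand: one element per extent covering this block */
           for (int i = 0; i < nin && n < 64; i++)
               if (in[i].logical <= logical && logical < in[i].logical + in[i].count)
                   list[n++] = (struct velem){ in[i].version,
                       in[i].block + (logical - in[i].logical) };

           n = edit(list, n);      /* e.g. delete a version */

           /* recombine: extend a matching run, else emit a new extent */
           for (int j = 0; j < n; j++) {
               int k;
               for (k = 0; k < nout; k++) {
                   if (out[k].version == list[j].version &&
                       out[k].logical + out[k].count == logical &&
                       out[k].block + out[k].count == list[j].block) {
                       out[k].count++;
                       break;
                   }
               }
               if (k == nout && nout < maxout)
                   out[nout++] = (struct rextent){ logical,
                       list[j].version, 1, list[j].block };
           }
       }
       return nout;
   }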

Issue: this brute force approach may suck for performance.  But it does 
not suck for verifying correctness, and that is the dominating issue at 
the moment, the other dominating issue being that without extents we 
cannot compare Tux3's metadata compactness in any meaningful way to 
filesystems that do have extents.

The thing to do then, is iteratively improve the above simple idea to be 
more efficient.  We do not have to generate a versioned pointer list 
for each logical address, we only need one list and we will modify that 
one as we walk across the extent region.  Actually, make that two 
lists: a before-edit list and an after-edit list.  Actually, make that 
a source list of extents and an edited list of extents and also have an 
output list of extents.

First, load the source list of extents by adding in each extent that 
overlaps the logical block at the start of the region.  Now apply the 
edit operation to that list as if it were a list of single block 
versioned pointers, yielding an edited extent list.  Any extents in the 
source list that failed to appear in the output list are truncated to 
end at the current location, and if now zero length, deleted entirely.  
Copy the edited list to the output list.

Now advance to the beginning of the next extent in the region.  For any 
extents that end during that advance, set the extent count, remove them 
from the output extent list and emit them as final results.  Add the 
extents that begin at the new position to the source extent list.  
Apply the edit operation to the list of source extents as if they were 
block pointers, generating a new edited list.  Compare the edited list 
to the output list to see which output extents have disappeared from 
the edited list and therefore should have their length set and be 
removed from the output list and emitted as final results.  Repeat 
from "advance" above.

This is not meant to be a completely accurate algorithm, only to 
communicate the idea and to show that the messy problem can be 
unravelled into a nice simple problem with a simple solution.

Even this improved, iterative algorithm is far from optimal, but it is 
not too bad either.  It will be simple minded code (the way I like it) 
and it builds strictly on the existing versioned pointer editing code 
without making the underlying code more complicated.  This approach 
will likely serve well for some time.

I now think that coding up versioned extents is not a big challenge, and 
that the biggest outstanding issue is designing a good, efficient 
transaction commit model, which is proceeding well in another thread.

Regards,

Daniel

From phillips at phunq.net  Mon Jul 28 19:19:51 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Mon, 28 Jul 2008 19:19:51 -0700
Subject: [Tux3] Mercurial repository for tux3
Message-ID: <200807281919.51304.phillips@phunq.net>

I have set up a Mercurial repository for Tux3 here:

   http://phunq.net/tux3

Please forgive the sparse web interface, I am just running the sample
cgi code there at the moment.  No kernel code checked in, only userspace
prototypes for the next little while.  Only one file checked in, the
version fuzz tester:

   http://phunq.net/tux3?f=9b760c6a9f62;file=test/version.c

To run: c99 version.c && ./a.out

Expected result: 10,000 random versioning operations ending with the
version tree:

(8) ~3
   (7) ~2
      (4) '1991' [119]
      (17) '1984' [105]
   (14) ~2
      (1) '1989' [111]
      (2) '1977' [118]
   (25) '1970' [115]
8 versions

and element list:

5 elements: [17, 105] [1, 111] [25, 115] [2, 118] [4, 119]

"But it's a start."

Daniel

From lars.segerlund at gmail.com  Mon Jul 28 23:53:12 2008
From: lars.segerlund at gmail.com (Lars Segerlund)
Date: Tue, 29 Jul 2008 08:53:12 +0200
Subject: [Tux3] Where can I get the code ?
Message-ID: <80d224fa0807282353x4011773cj48c79fe483f0e6d4@mail.gmail.com>

 I am 'dying' to get a look at the code and play around a bit.
 So where is it ? Is there a tarball or git somewhere ?

 I looked at the mercurial stuff, but there were only a userspace program ...

 / regards, Lars Segerlund.

From zenczykowski at gmail.com  Tue Jul 29 11:06:07 2008
From: zenczykowski at gmail.com (=?UTF-8?Q?Maciej_=C5=BBenczykowski?=)
Date: Tue, 29 Jul 2008 11:06:07 -0700
Subject: [Tux3] forward logging and NVRAM
In-Reply-To: <200807271429.06849.phillips@phunq.net>
References: <567D357E-03E1-45E5-9EE9-37B36CC29C7A@mac.com>
        <200807271429.06849.phillips@phunq.net>
Message-ID: <55a4f86e0807291106i373b14b4l4634281041b49692@mail.gmail.com>

The real problem here is that this is going to be a constantly written
16 bytes of nvram.
This will _very_ very quickly kill any type of flash that I'm aware
of...  as such it would have to be something
along the lines of a battery backed few KB of ram, with write out to
flash on failure.
Not sure what the lifetime in write-cycles of the standard cmos nvram
is, but even if (being battery backed)
it would work, that's a real hard place to find any space...?

Maciej

On Sun, Jul 27, 2008 at 14:29, Daniel Phillips <phillips at phunq.net> wrote:
> Hi Timothy,
>
> On Sunday 27 July 2008 14:10, timothy norman huber wrote:
>> Daniel,
>>
>> Several weeks ago, during a discussion over coffee, you mentioned a
>> rather intriguing optimization- by writing the start of the forward
>> log chain to NVRAM rather than disk, you avoid a disk seek and write
>> transfer for each forward log chain.  i believe there was another
>> motivation that had to do with avoiding traversing the entire volume
>> in case of a crash.  Was this design feature to overcome lame fsck
>> performance?
>
> The same motivation, actually.  Writing the start of a forward log
> chain to nvram instead of to some known location on disk means that I
> do not have to worry about doing unnatural things to optimize such
> loads as O_SYNC writing, where each transaction has to complete
> before the next one can begin, which means that the forward log chain
> never gets more than one element log, requiring a seek to the known
> location for every transaction.  Not worse than journalling to be
> sure, and usually better, but not as fast as avoiding a seek and disk
> write per transaction entirely.
>
> So if somebody could give me 16 bytes of NVRAM or so per volume I
> would be most pleased and could use that to generate some nice O_SYNC
> benchmarks :-)
>
> By the way, I wonder why your posts do not show up in the Mailman
> (Pipermail) archive, while mine and Matt's do?
>
> Daniel
>
> _______________________________________________
> Tux3 mailing list
> Tux3 at tux3.org
> http://tux3.org/cgi-bin/mailman/listinfo/tux3
>

From phillips at phunq.net  Tue Jul 29 12:20:28 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Tue, 29 Jul 2008 12:20:28 -0700
Subject: [Tux3] Where can I get the code ?
In-Reply-To: <80d224fa0807282353x4011773cj48c79fe483f0e6d4@mail.gmail.com>
References: <80d224fa0807282353x4011773cj48c79fe483f0e6d4@mail.gmail.com>
Message-ID: <200807291220.29541.phillips@phunq.net>

On Monday 28 July 2008 23:53, Lars Segerlund wrote:
>  I am 'dying' to get a look at the code and play around a bit.
>  So where is it ? Is there a tarball or git somewhere ?
> 
>  I looked at the mercurial stuff, but there were only a userspace program ...
> 
>  / regards, Lars Segerlund.

Hi Lars,

The code will arrive a bit at a time.  The next piece to land will be
a fuzz test for the file index leaf operations, which implements and
tests the index leaf directory structures described on the list not
too long ago.

Next after that will be generic btree operations including a simple
block allocator.

Kernel bits will start landing when both the inode table and file
index btrees are up and thoroughly tested in userspace, including all
the leaf operations, which includes integrating the versioning code
that now lives in version.c.

Regards,

Daniel

From phillips at phunq.net  Tue Jul 29 18:26:04 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Tue, 29 Jul 2008 18:26:04 -0700
Subject: [Tux3] A modest code drop
Message-ID: <200807291826.04503.phillips@phunq.net>

Hi,

A new file has landed in the test directory: file index btree leaf
operations along the lines of my earlier description, implementing a
compressed two level indexing scheme within a leaf in order to hold the
overhead of indexing down to around 4 bytes per distinct 48 bit logical
address.  For an unversioned file where there is just one extent per
logical address this requires about 12 bytes per extent including
indexing.  For a heavily versioned file with many different versioned
extents per logical address, the leaf space per extent approaches 8
bytes.  Each extent can address 2**6 blocks or 256K of storage, giving
a metadata to data ratio of 2**15, or 32K to one.  So I am beginning to
deliver on my promise of highly compressed metadata.

The indexing scheme is slightly complex with its chopped up fields,
boundary conditions due to not storing sentinels, two level structure
and reversed table format to optimize for the linear writing case,
however the code is not bulky and is very efficient.  The lookup
algorithm in particular is not much more code than a simple linear
search.

The split and merge operations are not done yet, nor is the delete.  The
test harness falls short of a proper fuzz test.  This code really does
require a random test methodology to harden it up, like the version.c
code, which also has a lot of subtle boundary conditions.  There is
a simple "hand made" test in leaf.c that sets up some basic leaf
contents, runs a few insert operations, then does some lookups.  This
demonstrates that the format is not so complex that a simple human
being cannot deal with it.

I am sure there are still a couple of bugs lurking.

  http://tux3.org/tux3?fd=25972ee30667;file=test/leaf.c
  c99 leaf.c && ./a.out

Regards,

Daniel

From phillips at phunq.net  Wed Jul 30 21:46:20 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Wed, 30 Jul 2008 21:46:20 -0700
Subject: [Tux3] Another modest code drop
Message-ID: <200807302146.21928.phillips@phunq.net>

The leaf.c code is now generally better, and actually runs the test
case correctly, which the previous version did not.

Turning the groups vector upside down like the entries vector made
things considerably more regular.  The leaf_insert function is now
pretty readable, which is something of an achievement given all the
corner cases in it.

   http://tux3.org/tux3?f=35d77cb24c1f;file=test/leaf.c

One neat thing about this insert is, there is only one memmove on
the fast path, and that one usually moves zero bytes.  Otherwise as
can be seen, the fast path is very fast.

Now on to the leaf split.  Then merge.  Then delete.  Then on to the
inode table leaf functions, which are considerably simpler.

Regards,

Daniel

From timothyhuber at mac.com  Thu Jul 31 07:21:09 2008
From: timothyhuber at mac.com (timothy norman huber)
Date: Thu, 31 Jul 2008 07:21:09 -0700
Subject: [Tux3] forward logging and NVRAM
Message-ID: <439735E2-D8B2-41D5-9097-3E4BDB877925@mac.com>

Maciej,
MRAM would be suitable for this.  Freescale Semi spun off their MRAM  
unit recently, but they're shipping small (but expensive) MRAM chips.
-Timothy

> The real problem here is that this is going to be a constantly written
> 16 bytes of nvram.
> This will _very_ very quickly kill any type of flash that I'm aware
> of...  as such it would have to be something
> along the lines of a battery backed few KB of ram, with write out to
> flash on failure.
> Not sure what the lifetime in write-cycles of the standard cmos nvram
> is, but even if (being battery backed)
> it would work, that's a real hard place to find any space...?
>
> Maciej
>
> On Sun, Jul 27, 2008 at 14:29, Daniel Phillips <phillips at  
> phunq.net> wrote:
> > Hi Timothy,
> >
> > On Sunday 27 July 2008 14:10, timothy norman huber wrote:
> >> Daniel,
> >>
> >> Several weeks ago, during a discussion over coffee, you mentioned a
> >> rather intriguing optimization- by writing the start of the forward
> >> log chain to NVRAM rather than disk, you avoid a disk seek and  
> write
> >> transfer for each forward log chain.  i believe there was another
> >> motivation that had to do with avoiding traversing the entire  
> volume
> >> in case of a crash.  Was this design feature to overcome lame fsck
> >> performance?
> >
> > The same motivation, actually.  Writing the start of a forward log
> > chain to nvram instead of to some known location on disk means  
> that I
> > do not have to worry about doing unnatural things to optimize such
> > loads as O_SYNC writing, where each transaction has to complete
> > before the next one can begin, which means that the forward log  
> chain
> > never gets more than one element log, requiring a seek to the known
> > location for every transaction.  Not worse than journalling to be
> > sure, and usually better, but not as fast as avoiding a seek and  
> disk
> > write per transaction entirely.
> >
> > So if somebody could give me 16 bytes of NVRAM or so per volume I
> > would be most pleased and could use that to generate some nice  
> O_SYNC
> > benchmarks :-)
> >
> > By the way, I wonder why your posts do not show up in the Mailman
> > (Pipermail) archive, while mine and Matt's do?
> >
> > Daniel



Timothy Huber
Strategic Account Development

tim.huber at metaram.com
cell 310 795.6599

MetaRAM Inc.
181 Metro Drive, Suite 400
San Jose, CA 95110

http://www.linkedin.com/in/timhuber


From phillips at phunq.net  Thu Jul 31 21:58:08 2008
From: phillips at phunq.net (Daniel Phillips)
Date: Thu, 31 Jul 2008 21:58:08 -0700
Subject: [Tux3] leaf_split has landed
Message-ID: <200807312158.09392.phillips@phunq.net>

Another messy little function for the file index btree set of btree
leaf functions has landed, leaf_split:

   http://tux3.org/tux3?f=0fed5998fbec;file=test/leaf.c

This appears to work.  It was able to split the sample leaf at any
entry position, include zero (leaving an empty leaf in the original
btree block) or num_entries (leaving an empty leaf in the split block).

This was not a whole lot easier to write than leaf_insert.  In fact,
this whole two level design makes the coding quite hard, but the
benefit is clearly worth the effort I think.  The functions so far are
very fast, and the structure is really, really compact, which will
help offset the fact that Tux3 will often be loading a lot of version
data that it doesn't actually need in order to get at a particular
version, including the current version.  I do not think this will be a
problem in practice because heavily versioned files are actually pretty
rare, and cache effects most likely will not be noticeable until up in
the several hundred version range.  At that level, probably every
versioning filesystem is going to exhibit some cache effects, not just
Tux3.

And we have a long term plan for keeping the cache footprint of
versioned extents to a manageable level, even for thousands of
versions, described in the original lkml posting.

Another possibility is to keep around a "digest" metadata btree that
contains only the pointers for a particular version, and make that be
consistent somehow with the "full" weave-style metadata.  Since this
would be maintained only for the current version, total metadata still
stays small vs schemes with distinct tree roots for each version (ZFS
and Btrfs).

Anyway, whatever cache pressure comes up with the current scheme, this
compressed leaf indexing scheme will help with it, to the tune of
fitting 33% more versioned extents into the same memory.

The next function to implement is leaf_merge, slightly less complex than
leaf_split I think.  Then the delete, which I am playing with ideas for
now, and finally a proper fuzzing test.  There will be 11 methods for
file index leaf nodes in total when done:

  1) show - done
  2) check - started
  3) lookup - done
  4) insert - done
  5) split - done
  6) merge
  7) delete
  8) create - done
  9) destroy - done
  10) freespace
  11) needspace

Then all 11 methods have to be implemented for the inode table btree,
which is actually quite a lot easier than the file index because
instead of having the two level directory, the directory is just a
simple table indexed by inum.
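
Purely as a sketch of the shape this might take (not an actual
interface, all names invented here), the eleven methods could be
gathered into a per-btree operations table:

   #include <stdint.h>

   struct leaf_ops {
       void     (*show)(void *leaf);
       int      (*check)(void *leaf);
       void    *(*lookup)(void *leaf, uint64_t key, unsigned *count);
       int      (*insert)(void *leaf, uint64_t key, void *element, unsigned size);
       int      (*split)(void *leaf, void *into, uint64_t *splitkey);
       int      (*merge)(void *leaf, void *from);
       int      (*delete)(void *leaf, uint64_t key);
       void     (*create)(void *leaf, unsigned blocksize);
       void     (*destroy)(void *leaf);
       unsigned (*freespace)(void *leaf);
       unsigned (*needspace)(void *leaf, unsigned size);
   };

with one such table each for the inode table, file index, free extent,
atime table and PHTree directory btrees.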

Then there are the index node methods, then there are three more btrees
to do, to implement all five I described earlier:

   inode table
   file index
   free extents*
   atime table*
   PHTree directory*

That all adds up to 110 methods to implement, quite a lot of work by
any measure.  Fortunately, most of them are much simpler than the file
index leaves.  Another nice time saver on the way to a functional
prototype is, none of the btrees marked with * are needed in the
functional prototype.  So after implementing two complete sets of leaf
functions (file index and inode table) I will move on to the generic
btree code, which will get the system to the point where we can test
the path all the way from inode lookup to file data block access.

Then the thing to do is hook up the versioned pointer code (not extents
for now) and try accessing some versioned data.  All still in userspace.

That is about a dozen code drops away.

Regards,

Daniel