Btrfs Sparse File Tests

Btrfs - Interestingly, when I disabled copy-on-write for some files that are updated often and partially, like databases, the data fragmentation on disk got way worse. I didn't really expect that to happen; I assumed fragmentation would decrease when CoW is disabled. Well, I guess it's highly circumstantial. Also very interestingly, I still see allocations where a single block ends up as its own extent: the previous extent is 1023 blocks long, the next extent is just 1 block, and the one after that is 1023 again. Then there are also seemingly random 1025-block extents. That is messed up on some level. Maybe the root cause is bad (?) allocation at some layer, plus the fact that I'm using sparse files, especially because those are also spread between different areas of the disk. I'm pretty sure this hasn't been properly thought out, because the allocation is just extremely bad, to be honest. When I read such a file, it's constant seeking to get it read, and performance is way worse than it would be with a contiguous file. - Facts after tests -
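
For reference, nocow is typically set up roughly like this. The C attribute has to be in place before any data is written (setting it on a file that already has data is not reliable), so the easy way is to flag the directory and let new files inherit it:

$ mkdir nocow
$ chattr +C nocow              # new files created inside inherit the No_COW attribute
$ touch nocow/sparse_file      # created empty, so the flag actually takes effect
$ lsattr nocow/sparse_file     # should show the C attribute, as in the listing below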

To be honest, this is the second run. The first run used a 1 GB file, one file at a time. Most interestingly, in both cases it produced a perfectly contiguous file once the sparse file was filled. I assume this happened because all of the data was held in RAM (page cache) before getting written out to disk.

This run is the same, but with 4 GB files, and with both the cow and nocow files being written to disk in parallel from two different processes. Now it's very slow; let's see what kind of results come out.

Interesting observation: of the two parallel writers, the nocow file progressed much faster. Dunno why. Also, this is probably the slowest 8 GB I've ever written to disk.
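
For context, the whole procedure was roughly this: preallocate the full 4 GB, then write the 4 KiB blocks of a pre-generated reference file into the target in random order, one block per write. This is only a sketch of the idea, not the exact script; the same kind of loop was run in parallel against cow/ and nocow/:

$ fallocate -l 4G cow/sparse_file            # preallocation shows up as the "unwritten" extents below
$ shuf -i 0-1048575 | while read blk; do     # 4 GiB / 4 KiB = 1048576 blocks, visited in random order
      dd if=reference-file of=cow/sparse_file bs=4096 count=1 \
         skip=$blk seek=$blk conv=notrunc status=none
  done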

State before random filling:

$ lsattr cow

---------------------- cow/sparse_file

$ lsattr nocow

---------------C------ nocow/sparse_file

$ filefrag -v cow/sparse_file 

Filesystem type is: 9123683e

File size of sparse_file is 4294967296 (1048576 blocks of 4096 bytes)

 ext:     logical_offset:        physical_offset: length:   expected: flags:

   0:        0..  196607:  540286208.. 540482815: 196608:             unwritten

   1:   196608..  327679:  575683840.. 575814911: 131072:  540482816: unwritten

   2:   327680.. 1048575:  577780992.. 578501887: 720896:  575814912: last,unwritten,eof

sparse_file: 3 extents found

$ filefrag -v nocow/sparse_file 

Filesystem type is: 9123683e

File size of sparse_file is 4294967296 (1048576 blocks of 4096 bytes)

 ext:     logical_offset:        physical_offset: length:   expected: flags:

   0:        0..   65535:   70556928..  70622463:  65536:             unwritten

   1:    65536..  131071:  436518522.. 436584057:  65536:   70622464: unwritten

   2:   131072..  196607:  437001472.. 437067007:  65536:  436584058: unwritten

   3:   196608..  327679:  486087965.. 486219036: 131072:  437067008: unwritten

   4:   327680..  458751:  492903716.. 493034787: 131072:  486219037: unwritten

   5:   458752..  524287:  506207488.. 506273023:  65536:  493034788: unwritten

   6:   524288..  655359:  518790400.. 518921471: 131072:  506273024: unwritten

   7:   655360..  851967:  533011712.. 533208319: 196608:  518921472: unwritten

   8:   851968.. 1048575:  535567852.. 535764459: 196608:  533208320: last,unwritten,eof

sparse_file: 9 extents found

I took some intermediate states, and the full listings were enormous, so here are just the summary counts plus short samples:

$ filefrag -v cow/sparse_file

cow/sparse_file: 849354 extents found

Short sample:

 110:      115..     115:  575927108.. 575927108:      1:  575923036:

 111:      116..     116:   58859811..  58859811:      1:  575927109:

 112:      117..     117:  540286325.. 540286325:      1:   58859812: unwritten

 113:      118..     118:  540286326.. 540286326:      1:            

 114:      119..     119:  540501281.. 540501281:      1:  540286327:

 115:      120..     121:  540286328.. 540286329:      2:  540501282: unwritten

 116:      122..     125:  540286330.. 540286333:      4:            

 117:      126..     126:  540534146.. 540534146:      1:  540286334:

 118:      127..     127:  540493154.. 540493154:      1:  540534147:

$ filefrag -v nocow/sparse_file

nocow/sparse_file: 9 extents found

Short sample:

  19:      351..     351:   70557279..  70557279:      1:             unwritten

  20:      352..     356:   70557280..  70557284:      5:            

  21:      357..     357:   70557285..  70557285:      1:             unwritten

  22:      358..     366:   70557286..  70557294:      9:            

  23:      367..     367:   70557295..  70557295:      1:             unwritten

  24:      368..     374:   70557296..  70557302:      7:            

  25:      375..     375:   70557303..  70557303:      1:             unwritten

The nocow file is only partially written at this point, but it still counts as just 9 extents; within them some ranges are already written and some are still unwritten (preallocated).

Verified that the file contents are as expected after using the random filling method:

$ b3sum reference-file

5620ca55bbbcc193dc5051623adaa484c84f9457d16790ce011b5b3b24860403

$ b3sum nocow/sparse_file

5620ca55bbbcc193dc5051623adaa484c84f9457d16790ce011b5b3b24860403

$ b3sum cow/sparse_file

5620ca55bbbcc193dc5051623adaa484c84f9457d16790ce011b5b3b24860403

$ filefrag -v nocow/sparse_file 

File size of sparse_file is 4294967296 (1048576 blocks of 4096 bytes)

 ext:     logical_offset:        physical_offset: length:   expected: flags:

   0:        0..   65535:   70556928..  70622463:  65536:            

   1:    65536..  131071:  436518522.. 436584057:  65536:   70622464:

   2:   131072..  196607:  437001472.. 437067007:  65536:  436584058:

   3:   196608..  327679:  486087965.. 486219036: 131072:  437067008:

   4:   327680..  458751:  492903716.. 493034787: 131072:  486219037:

   5:   458752..  524287:  506207488.. 506273023:  65536:  493034788:

   6:   524288..  655359:  518790400.. 518921471: 131072:  506273024:

   7:   655360..  851967:  533011712.. 533208319: 196608:  518921472:

   8:   851968.. 1048575:  535567852.. 535764459: 196608:  533208320: last,eof

sparse_file: 9 extents found

Seems that the nocow file is exactly as expected. The intermediate slowness was just due to heavy metadata operations during the intermediate stages; it was NOT caused by fragmentation.

So I just have to admit it: I was completely wrong.

Then the most interesting result: what's the state of the cow file?

I did include the full extent list for the nocow file, because it was, well, now very compact. I won't include the full list for the cow file; I'll just take a snippet and the final count. Why? I believe the final count alone makes you understand it.

$ filefrag -v cow/sparse_file

File size of sparse_file is 4294967296 (1048576 blocks of 4096 bytes)

 ext:     logical_offset:        physical_offset: length:   expected: flags:

913456:   915378..  915378:  357562574.. 357562574:      1:   58803094:

913457:   915379..  915379:  580091596.. 580091596:      1:  357562575:

913458:   915380..  915380:  578501932.. 578501932:      1:  580091597:

913459:   915381..  915381:   58829751..  58829751:      1:  578501933:

913460:   915382..  915382:  373022317.. 373022317:      1:   58829752:

913461:   915383..  915383:  436730783.. 436730783:      1:  373022318:

cow/sparse_file: 1046354 extents found

As kind of expected (?), the file is extremely fragmented (uh, it's basically as fragmented as a file can be). Technically the fragment locations could be even more spread out, if the allocation were totally random across the partition space. Reading the file back runs at about 250 read operations per second and roughly 1 MB/s, so it's around ~250 IOPS per megabyte, which matches one 4 KiB read per extent. After thinking about the number for a while: wow, that's just 2222 extents short of "perfect fragmentation", which would have resulted in 1048576 extents.
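
Quick sanity check on those numbers:

$ echo $((4294967296 / 4096))    # total 4 KiB blocks = worst possible extent count
1048576
$ echo $((1048576 - 1046354))    # how far the cow file ended up from that worst case
2222
$ echo $((1024 * 1024 / 4096))   # 4 KiB reads needed per 1 MiB, matching the observed ~250 IOPS per MB
256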

Well, at least this corrected some of my expectations and clarified some other things. Making this test was worth doing.

Conclusions:

1) Pre-allocation really does work great with sparse files and is really worth doing (see the short recipe after this list)

2) Copy-on-Write (CoW) is very bad for files which are updated in small chunks, for example databases and other files written in a random-access style.
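
Putting 1) and 2) together, a minimal recipe for a database-style file on Btrfs would look roughly like this (directory name, file name and size are just placeholders):

$ mkdir db && chattr +C db       # CoW disabled for everything created inside
$ fallocate -l 4G db/data.db     # preallocate the full size before any random writes happen
$ lsattr db/data.db              # verify the inherited C attribute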

After seeing the amount of fragmentation, I no longer wonder about the reports where people complain about an "extreme amount of metadata". If this 4 gigabyte test file got ~100 megabytes of metadata, well, it kind of explains it. And no, I didn't measure the actual amount of metadata; it's pure estimation. The textual list of the allocations is ~72 megabytes, but that's a verbose representation. A compact representation, even doubled, could be in around the same range, though probably a bit less. We're still quite likely talking about tens of megabytes.

In this sense ext4, with its extent-based allocation, seems much better at keeping data in somewhat sanely sized segments: typically at least 2048 blocks per extent, usually way more.

I've been using SQLite3 databases with many random writes on Btrfs on an older SSD. What a nightmare! I'm really curious what the Write Amplification Factor (WAF) actually is when I increment a counter. I could test that just for fun; I'm sure it's something horrible. CREATE TABLE tab (count); UPDATE tab SET count = count + 1; just a table with a single counter column and a single row, then call the update many times in individual transactions, and see how many bytes get erased from the SSD and how many IOPS the OS I/O counters show.
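
A rough sketch of how that test could be run using only OS-level counters. The device name sda, the file name counter.db and the 10000 iterations are placeholders; column 10 of /proc/diskstats is sectors written (512 bytes each), and the actual NAND erase counts would need vendor-specific SMART data on top of this:

$ sqlite3 counter.db "CREATE TABLE tab (count INTEGER); INSERT INTO tab VALUES (0);"
$ awk '$3 == "sda" { print $10 * 512 }' /proc/diskstats    # bytes written to the device so far
$ for i in $(seq 1 10000); do
      sqlite3 counter.db "UPDATE tab SET count = count + 1;"    # each update is its own transaction
  done
$ awk '$3 == "sda" { print $10 * 512 }' /proc/diskstats    # delta divided by the useful payload gives the OS-level WAF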

After digging around the net with these observations in mind, I can also say that:

> With Btrfs there's nothing surprising in this post. All of my findings are totally expected, and actually well known and documented at least a decade ago.

Also, running the Ubuntu 22.04 LTS -> 24.04 LTS upgrade on btrfs took almost 4 times longer than running the same upgrade on the same hardware using ext4. The high number of separate I/O operations was the primary performance killer. With ext4 the I/O scheduler usually merged a very high percentage of the writes, almost completely masking that load from the underlying storage device.
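
That merging is easy to watch while such an upgrade runs; with sysstat's iostat the wrqm/s column shows how many write requests get merged per second before they reach the device (device name is just an example):

$ iostat -x sda 1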

Full detailed data is available here for some time, until I clean it up:

https://s.sami-lehtinen.net/pub/231018_btrfs_sparse_file_test.7z (no HTML linking on purpose)

I guess this is the reason why we need storage devices which can handle 500k 4 kB random IOPS.

2024-09-08