TAR BZIP2 with BFS attributes

Hi people,
I’ve managed to produce a simple app in Python, using Haiku-PyAPI, that creates and extracts tar.bz2 archives. There are two special features here:

  1. The bzip2 part is parallelized, taking advantage of multi-core/multi-processor systems (a minimal sketch of the idea follows this list)

  2. The app archives the extended attributes stored in the BFS filesystem and restores them at extraction time.
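
For the curious, here is a minimal sketch of the general block-compression technique (pbzip2-style), not the app’s actual code: bzip2 streams can simply be concatenated, and both bunzip2 and Python’s bz2 module decompress such multi-stream files, so independent chunks can be compressed on separate cores and written out in order.

import bz2
from multiprocessing import Pool

CHUNK = 8 * 1024 * 1024  # 8 MiB per work unit

def compress_chunk(chunk):
    # Each chunk becomes an independent bzip2 stream.
    return bz2.compress(chunk, compresslevel=9)

def parallel_bz2(src_path, dst_path, workers=4):
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        with Pool(processes=workers) as pool:
            chunks = iter(lambda: src.read(CHUNK), b"")
            # imap() preserves input order, so the streams land in sequence;
            # concatenated bzip2 streams form a valid multi-stream file.
            for compressed in pool.imap(compress_chunk, chunks):
                dst.write(compressed)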

To all intents and purposes it creates tar.bz2 archives, so you can in fact extract them with the usual tools. But if you do it that way, alongside your original files you’ll also get the extra files containing the attribute data.
This made me wonder whether to use a different file type or keep using the tar.bz2 extension. It would be nice to open a vote to let you choose what to do.

In the meantime, I’m posting the compression size/time benchmarks that everyone loves.
Here’s the compression of a single big file using:

  • the original tar command with bzip2 compression (which doesn’t store the BFS attributes)
  • my app
  • the zip command

The test file is a randomly chosen file from the net, a SoundFont2 (.sf2).

[Chart: compression size, single file (cs-sf)]
As you can see, the output size is slightly greater than with the original tar/bzip2 compression; this is due partly to the small overhead introduced by the bzip2 parallelization and partly to the stored attribute data.
To be precise, the numbers are these:

  • tar/bzip2 = 343106781 bytes
  • HTPBZ2 = 344757164 bytes
  • zip = 405221672 bytes

As expected, zip compression is weaker in this benchmark.

On the other hand, zip excels in processing speed, as you can see in this graph:
[Chart: compression time, single file (ct-sf)]

Here you can see that zip is extremely fast compared to the original tar/bzip2, and my app is only slightly slower than zip; the difference between the two is not very relevant, but both are hugely faster than the original tar/bzip2 compression tools.
The raw data:

  • tar cjf → 40.82s
  • HTPBZ2 → 18.97s
  • zip → 17.74s

Now it’s time to compare an archive of multiple files and directories. I used the Haiku source folder for this test (not so many attributes to store, though).

[Chart: compression size, multiple files (cs-mf)]
Not much difference from the previous chart.

[Chart: compression time, multiple files (ct-mf)]

Here we can see a slowdown in my app, placing it between the original compression tools and the fastest tool, zip. The likely reason is that while bzip2 compression is easily parallelizable, tar storage is not. The attributes are added to the archive as extra files at tar-creation time (but maybe in a future version I can think of a solution). A minimal sketch of this attributes-as-files idea follows.
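
To illustrate the idea, here is a rough sketch; read_bfs_attrs() is a hypothetical stand-in for the Haiku-PyAPI attribute reading, and the “.bfsattr” sidecar name is made up for the example:

import io
import pickle
import tarfile

def read_bfs_attrs(path):
    # Hypothetical helper: return {attr_name: raw_bytes} for the node
    # at 'path', e.g. via Haiku-PyAPI attribute calls.
    raise NotImplementedError

def add_with_attrs(tar, path, arcname):
    tar.add(path, arcname=arcname)  # the regular file itself
    attrs = read_bfs_attrs(path)
    if not attrs:
        return
    blob = pickle.dumps(attrs)  # serialize the attribute dict
    info = tarfile.TarInfo(arcname + ".bfsattr")  # invented naming scheme
    info.size = len(blob)
    tar.addfile(info, io.BytesIO(blob))  # stored as a sidecar member

At extraction time the sidecar members are read back and the attributes re-applied to the freshly written files; a plain tar tool simply extracts them as ordinary files, which is exactly the behavior described above.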

For now, HTPBZ2 is not optimized for decompression, so there’s no point producing charts comparing its timings against the tar/bzip2 tools. That could change if I manage to improve the code.

If you’re interested, the code is on GitHub.
The application is still in an alpha state, for testing reasons and maybe because I’m missing some code here and there :smiley:

Enjoy!

Edit: Requirements: the latest Haiku-PyAPI from the git repo.


Here I am again, with some updates:
I rewrote part of the code to make the hierarchical structure more consistent. Now there is a notion of a “common root”.
Given that the app can be started from the GUI once it is properly installed, it can be called from the command line as well. Here’s an example:

HTPBZ2.py -c /boot/home/file1.bmp file2.txt ../../otherdir subdir2/file3.pdf

As you can see, there are four objects that will be added to the compressed file:

  • the first is written down with its absolute path
  • the second has only its basename (because it is in the current working directory)
  • the third is a directory located outside the current working directory
  • the fourth is a file with a relative path

In this case the program will try to find the common root path:
if ../../otherdir and the current working directory live in the same tree under /boot/home, the program will find the lowest common parent (in this case the /boot/home/ of /boot/home/file1.bmp), so the /boot/home/ part becomes the root of our archive, preserving the original structure without storing the absolute path.

And in this case:

HTPBZ2.py -c /boot/system/blabla/file /HaikuR14B/testfile /boot/home/inflatedfile.txt

The common root is /, so when extracting, the entire path will be recreated (within the directory chosen for the extraction). A sketch of how such a common root can be computed follows.
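
I assume something equivalent to os.path.commonpath() can do the job; a minimal sketch of the idea (not necessarily the app’s exact code):

import os

def common_root(inputs):
    absolute = [os.path.abspath(p) for p in inputs]  # normalize relative paths
    return os.path.commonpath(absolute)  # lowest common parent

paths = ["/boot/home/file1.bmp", "file2.txt", "../../otherdir", "subdir2/file3.pdf"]
root = common_root(paths)
# Each member is then stored relative to that root:
arcnames = [os.path.relpath(os.path.abspath(p), root) for p in paths]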

Tricks
There are some more tricks available through the command line:

  • The parameter -t will measure the duration of the compression/decompression process
  • The parameter -g will automatically start the compression or decompression procedure and close the app at the end of the process.

First steps towards parallel decompression:
AFAIK, contrary to what happens for compression, bzip2 decompression is not (easily?) parallelizable. A shame, since this is the computationally heaviest part. So I must focus my efforts on the tar part of the decompression. I managed a first attempt at parallelizing the tar extraction, but so far the parallelized process is slower than the single-threaded one. I still have to investigate where it loses ground; my first thoughts are that:

  • maybe the thread handling introduces overhead
  • there is simultaneous I/O access
  • there is locking

For these three points I’ll try to batch more files into a single parallelized extraction, which would reduce the number of threads. Moreover, I’ll try splitting the work across multiple processes instead of threads, mitigating the impact of Python’s global interpreter lock.
Two more thoughts:

  • maybe it’s something related to the CPU architecture: recent CPUs have boost capabilities that don’t activate during tar decompression, since it isn’t heavily CPU-bound.
  • or maybe the reason is simpler: the parallel tar approach is flawed from the ground up.

Tricks part 2
The FileType association:
I’ve created a simple C++ launcher for decompressing tar.bz2 files with my app… Set it up with FileTypes and the trick is done: double-clicking tar.bz2 files will launch my app. The executable is in my HTPBZ2 GitHub repo.

Other “to do” tasks:
There are some more tricks and refinements I’m working on; some of them are:

  • make it work exclusively in RAM to speed up decompression
  • more checks on the files passed at launch
  • test and verify the MD5 checksum functionality (sketched below) … stay tuned.
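
The checksum part can be as simple as streaming the file through hashlib; a sketch of the verification step (where and how the app stores the checksums is a separate matter):

import hashlib

def md5_of(path, bufsize=1 << 20):
    # Stream the file in 1 MiB blocks so large files don't fill the RAM.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            digest.update(block)
    return digest.hexdigest()

# After extraction: compare md5_of(extracted_path) with the stored checksum.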

For now just a simple screenshot of the app:


First optimization results:

Here are some results of my first attempt to reduce the lag introduced by the tar parallelization. Before this change, the code extracted every tarfile “member” (a file stored inside the tarfile) in its own thread. Now I batch members and process them together, so the number of batches more or less matches the number of CPUs, creating a hybrid between pure parallelization and single-threaded execution. A minimal sketch of this batching idea follows.
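
Roughly, the batching looks like this (a simplified sketch, assuming the bz2 layer has already been decompressed to a plain tar file, and with each worker opening its own TarFile handle):

import os
import tarfile
from concurrent.futures import ThreadPoolExecutor

def extract_batch(tar_path, names, dest):
    # One open per batch (not per member), and a private handle per worker.
    with tarfile.open(tar_path, "r:") as tar:
        for name in names:
            tar.extract(name, path=dest)

def extract_batched(tar_path, dest):
    with tarfile.open(tar_path, "r:") as tar:
        names = tar.getnames()
    ncpu = os.cpu_count() or 1
    size = max(1, len(names) // ncpu)  # roughly one batch per CPU
    batches = [names[i:i + size] for i in range(0, len(names), size)]
    with ThreadPoolExecutor(max_workers=ncpu) as pool:
        futures = [pool.submit(extract_batch, tar_path, b, dest) for b in batches]
        for f in futures:
            f.result()  # propagate any extraction errors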

(The results were measured under the same conditions: same temperature, freshly stored files (no overwriting), and the same CPU frequencies and governor (no energy-saving mode).)

Time in seconds with pure serial decompression:
max: 23.02s
min: 20.58s

Time in seconds with pure parallelized decompression:
max: 35.72s
min: 30.26s

Time in seconds with batched-parallelized decompression:
max: 30.25s
min: 28.18s

[Chart: decompression optimization timings]

Results:
The batched parallelization compensates for some of the losses, but not enough to put it on an equal footing with single-threaded decompression.

Looking at ActivityMonitor, this is the CPUs’ behavior in single-threaded mode:

As expected, the bzip2 task is intensive and spins a CPU at 100%, and the tar task peaks a single CPU at nearly 90% (the spin-downs happen when the task switches between CPUs, but we can consider them nonexistent).

While this is what happens in parallelized mode:

As you can see, the tar decompression is shared among all the CPUs, but each of them runs at about half its potential.

Any advice will be appreciated. The code for the parallelized decompression modes will be shared soon.

Thank you


Good News

I found one of my coding errors: the lag was introduced by the excessive and useless opening and closing of the tar file. Now the extraction times are on the same level for all three methods of extraction:

  • from 17 to 26 seconds for parallel extraction
  • from 17 to 28 seconds for batch-parallel extraction
  • from 20 to 23 seconds for serial extraction

Despite this increase in performance, when processing in parallel the CPU-usage graph remains the same (all CPUs at 40-50%).

Maybe I can still do something better…

For now, the code is on the parallel-decomp branch on GitHub.


In the end, I noticed some faults (premature end of data, broken stored attributes, …) during parallel extractions, which happen only sometimes. The more the extraction is parallelized, the more errors happen. This is mostly due to the tar handling not being thread-safe, so:

  • considering the lack of serious advantages over single thread tar extraction
  • considering the problems introduced with parallelized tar extraction

I decided to abandon the tar parallel-decompression modes (at least until I dream up some kind of magic to bypass the problem).
Anyway, the attribute writing remains parallelized in parallelized mode.

Apart from this hitch, I introduced in-RAM bzip2 extraction. This is nice, as it saves our SSDs from useless writes (a minimal sketch of the idea follows). However, I see that Haiku gets temperamental when I occupy a little more than 2 GB of RAM (32 GB here, so no, it’s not a lack of RAM); I need to do some more testing on this.
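
The in-RAM path is essentially this sketch: decompress the bz2 layer into a memory buffer, then run the tar extraction against that buffer instead of a temporary file on disk (the whole plain tar lives in RAM, which is where the 2 GB issue comes in):

import bz2
import io
import tarfile

def extract_in_ram(archive, dest):
    with open(archive, "rb") as f:
        raw = bz2.decompress(f.read())  # whole decompressed tar held in RAM
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tar:
        tar.extractall(path=dest)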

Last post about this topic:

Program completed
I announce that the program has been completed.

The program was renamed, and I decided to create the TMZ format to distinguish it from TAR.BZ2, as there are additional files and data in TMZ archives. In any case, it is possible (if necessary) to extract TMZ files with the tar command.

I added an installer script that helps in managing the filetypes, and a launcher so that TMZ files can be extracted with a double-click (though I’m not sure if it should be assigned inside FileTypes).

The app can be launched from the Deskbar menu or from the command line.
Command-line usage looks like this:
HTMZ.py -c path/to/file_or_dir
for compression mode, or
HTMZ.py -d path/to/file_or_dir
for decompression mode.

You can pass multiple input files,
like HTMZ.py -c file1 file2 dir1 file3 dir2

It also has these options:
-t shows the time used for the operation
-g starts the operation and closes the app automatically (useful in scripts)
-e enables experimental and incomplete features (mostly with bugs)

Suggestion: until HaikuPorts provides a new version of Haiku-PyAPI with the recent fixes, you’ll need to recompile Haiku-PyAPI with the installation script. Just launch installer.sh.

For any bugs, please contact me.

I hope it will be useful!


Being able to use it from Expander would be nice. It could also make adoption easier, as people will look there first when they want to work with archives.

I don’t know what you mean. This is a separate app, written in another language (not C++ like Expander).
I provide a FileType (x-tmz), which will be installed to ~/config/settings/mime_db/application, and a launcher which will be called on double-clicking TMZ files.
Maybe I can plan to write an add-on for compression requests.
That’s the best I can do.

Expander uses a rules file that describes how to handle specific file extensions by passing commands (list, expand) to external handlers.
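
For reference, entries in that rules file look roughly like this (format quoted from memory, so treat the exact fields as an assumption: MIME type, extension, list command, extract command, with %s replaced by the archive path). Note that the -l listing flag below is hypothetical, since HTMZ only documents -c/-d so far:

"application/x-tmz"  ".tmz"  "HTMZ.py -l %s"  "HTMZ.py -d %s"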

Expander is just a front-end for decompression tools. I don’t know if it handles tools that are not binaries yet, but this shouldn’t be a problem. Another native front-end popular on Haiku is Beezer; it could be interesting to see if you can interact with it as well.

I accidentally used Bezip to compress a 7zip file with a password!

I intended to open it with Expander instead!

But too late, I clicked too fast, and now I cannot open the password-protected 7zip file anymore!

The password is wrong; what to do?

Something to do with BFS attributes?

No. If you don’t remember the password you used, you’ll need to figure out how to use a zip-password-cracker type of app (I don’t see any in the depot, so you might need to use something on Linux or Windows).

Edit: I’m not sure if you got a 7z file with password protection, or if you just created a new zip (password protected) from an existing .7z file. In any case… nothing to do with BFS attributes.