Interesting Usage of File Attributes

https://vectorvfs.readthedocs.io/en/latest/index.html

I was thinking about something similar for BFS, but there are some challenges.
First of all, in the case of text embeddings, the text itself needs to be split into chunks after tokenisation. I haven't thought about how to deal with that yet.
The second challenge is that the multi-dimensional vector representing the information can have hundreds, if not thousands, of dimensions depending on the model.
For example, MiniLM-L6-v2 returns 384 dimensions, whilst GPT-4 uses 16,000.
I'm not sure whether Haiku can store such a large piece of information as an attribute, and, by the way, I don't know how to store it efficiently.
I did not know about this project; I'll have a look. Thanks for sharing!
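
To get a feel for the sizes involved, here is a rough Python sketch using the sentence-transformers library (the model name, file name, and chunk size are example choices, not a recommendation):

```python
# Rough sketch: chunk a text, embed each chunk with MiniLM, and check
# how many bytes a single embedding would occupy as a file attribute.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

text = open("some_document.txt").read()
chunk_size = 500  # characters per chunk; a naive split, real chunking is token-aware
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

embeddings = model.encode(chunks)        # shape: (num_chunks, 384)
one = np.asarray(embeddings[0])

print("dimensions:", one.shape[0])                      # 384 for MiniLM-L6-v2
print("float32 bytes:", one.astype(np.float32).nbytes)  # 1536 bytes
print("float16 bytes:", one.astype(np.float16).nbytes)  # 768 bytes
```

So a MiniLM-sized vector is well under a few kilobytes per chunk; the real pressure comes from models with much larger embedding dimensions.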

For VectorVFS, our goal is to store embeddings directly in the inode when possible to minimize lookup overhead and maintain the embedding’s tight association with the file metadata. When embeddings are too large to fit, we can:

Compress or quantize the embedding to shrink its size.

Split the embedding into multiple xattr entries (see the sketch after this list).

Allow the filesystem to automatically spill to an external xattr block if necessary.
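
These strategies are easy to prototype from user space. Below is a minimal Python sketch, not the actual VectorVFS code, showing half-precision quantization plus splitting across Linux user xattrs; the `user.vec.N` attribute names and the 4096-byte budget are illustrative assumptions (Haiku's attribute API would look different):

```python
# Minimal sketch (not the VectorVFS implementation): quantize an embedding
# to float16 and split it across numbered user xattrs when it exceeds a
# per-attribute budget. Uses Linux xattrs; names and budget are made up.
import os
import numpy as np

XATTR_BUDGET = 4096  # assumed per-attribute size budget, in bytes

def store_embedding(path: str, embedding: np.ndarray) -> None:
    data = embedding.astype(np.float16).tobytes()  # 2 bytes per dimension
    if len(data) <= XATTR_BUDGET:
        os.setxattr(path, "user.vec.0", data)      # fits in a single xattr
    else:
        # Spill into numbered chunks, each within the budget
        for i in range(0, len(data), XATTR_BUDGET):
            os.setxattr(path, f"user.vec.{i // XATTR_BUDGET}",
                        data[i:i + XATTR_BUDGET])

def load_embedding(path: str, dims: int) -> np.ndarray:
    parts, i = [], 0
    while f"user.vec.{i}" in os.listxattr(path):
        parts.append(os.getxattr(path, f"user.vec.{i}"))
        i += 1
    return np.frombuffer(b"".join(parts), dtype=np.float16)[:dims]
```

Numbering the chunks keeps reassembly trivial and lets small embeddings stay in a single attribute.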

By carefully managing the size and format of the embeddings, VectorVFS achieves seamless integration of vector search capabilities into the file system layer itself. In the current implementation, VectorVFS stores 1024-dimensional embeddings in half precision to fit the 4 KB budget.
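
(For reference: 1,024 dimensions × 2 bytes per half-precision value = 2,048 bytes, which fits comfortably within a 4,096-byte attribute budget; the same vector in single precision would need exactly 4,096 bytes and leave no headroom.)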