Find Duplicate Files
Go!
Disclaimer: I don’t even know if it’s possible using Find/Query,
but it would be Sooo useful!
The problem is: how do you define what a “duplicate” is? The name? The name plus the size? The name plus the size and timestamp? None of them is 100% bulletproof. The only way to be sure that a file is a duplicate is applying some hash calculation over the files, like an MD5 or SHA checksum.
md5sum, for sure.
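As a quick Terminal sketch of that idea (the helper name is my own, and it assumes GNU coreutils, where an MD5 hex digest is 32 characters wide):

```shell
# find_dupes DIR: hash every file under DIR, sort so equal digests end up
# next to each other, then keep only lines whose first 32 characters (the
# MD5 hex digest) repeat -- i.e. the files with identical content.
# Assumes GNU coreutils (md5sum, uniq --all-repeated).
find_dupes() {
    find "$1" -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate
}
```

Something like `find_dupes ~/Documents` then prints the duplicate groups separated by blank lines.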
It’d be kinda cool to have the filesystem stash the md5sum of the content of every file in one of those handy BFS attributes and update it whenever the content changed.
It’d make software installs and updates take a hell of a long time, though.
Software installs/updates only write a single file per package to the packages directory; the contents of the package are only mounted into the file system, not written to disk.
That means you only need the md5sum of the package file, which should be pretty fast.
Anyway, I’m not sure there’s a real need to have the hash of every file stored as an attribute.
Even if it’s only very little compute and time per change, most users will probably never need it.
I think the original post was aimed more at the user’s personal files than at packages. But again, calculating an MD5 checksum for each file is really slow, and doing it automatically would severely penalize performance. I think this is the scope of a dedicated tool, not part of the standard “Find” functionality (which already has its own bugs).
I hadn’t thought about how that must be working under the hood; that makes sense.
Neither am I, given the small install base. But with my IT hat on, I can say it’d make a hell of an intrusion-detection bonus, and BFS is probably the only filesystem that could support it internally. It’s the kind of feature Tripwire and its ilk have to bolt onto the side of a Linux disk.
But it’d be a nifty thing to have available (off by default, of course); I certainly wouldn’t urge any developers to spend time on it now.
Name && Size would be great, but even just file size or name would be useful if it only displayed files where the count is >= 2. (Is there even a regular expression that can do “count >= 2”?)
There are tools like fdupes that can do it in Terminal, but having the output in a query window is more useful (sorting by name, date, etc.).
or xxHash?
Name, file length, last modified date … because one could change an ‘x’ to a ‘y’, so name + size would not catch that.
Is there a way to redirect the output of a Terminal command to a Tracker window?
Even something simple like redirect “ls” to display in a Tracker query window?
If this is possible then redirecting fdupes command to Tracker would get the job done.
How does one create “virtual-directories”?
That should be relatively easy to implement. ZFS and btrfs do checksums, but not as visible attributes; they do it in the background at block level.
Here is a Linux shell script that creates a sha256 checksum of a file and stores it as the extended attribute user.checksum.sha256:
checksumfile:
#!/bin/bash
file="$1"
hash=$(sha256sum "$file" | awk '{print $1}')
setfattr -n user.checksum.sha256 -v "$hash" "$file"
echo "Hash stored for $file"
Here is the script that checks the checksum against the file:
#!/bin/bash
file="$1"
stored_hash=$(getfattr --absolute-names -n user.checksum.sha256 --only-values "$file" 2>/dev/null)
if [ -z "$stored_hash" ]; then
    echo "No hash attribute found."
    exit 1
fi
current_hash=$(sha256sum "$file" | awk '{print $1}')
if [ "$stored_hash" == "$current_hash" ]; then
    echo "OK: hash matches."
else
    echo "ERROR: hash mismatch! The file may have been corrupted."
    exit 1
fi
For Haiku you need to replace setfattr with addattr and getfattr with catattr; I think that should be all.
(Sorry, no Haiku at work (yet).)
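A rough sketch of the Haiku version (untested, since I don’t have Haiku in front of me; the attribute name checksum.sha256 is my own pick, and the echo fallback is only there so the hashing part can be tried on other systems):

```shell
# Store a file's sha256 digest in a file attribute, Haiku-style (sketch).
# On Haiku, attributes are written with addattr and read back with catattr.
store_checksum() {
    local file="$1"
    local hash
    hash=$(sha256sum "$file" | awk '{print $1}')
    if command -v addattr >/dev/null 2>&1; then
        addattr -t string checksum.sha256 "$hash" "$file"
    else
        # Not on Haiku: just show the command that would be run.
        echo "would run: addattr -t string checksum.sha256 $hash $file"
    fi
}
```

Verifying would then read the attribute back with catattr and compare it against a fresh sha256sum, mirroring the Linux script above.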
Perhaps a script to search for dupes: it adds an indexed attribute META:dupe, containing the first matching file’s path+name, to the following matching files; then you open a query META:dupe==** to manage them with a standard Tracker interface?
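A sketch of that idea (untested on Haiku): read fdupes-style output — groups of paths separated by blank lines — from stdin and print the addattr commands that would tag every extra copy with the path of the first file in its group. Piping the result into sh would actually apply it; the META:dupe name is just the suggestion above, not anything standard.

```shell
# Turn fdupes output (groups of duplicate paths, blank-line separated)
# into addattr commands tagging each extra copy with the first file's path.
emit_dupe_tags() {
    local first="" line
    while IFS= read -r line; do
        if [ -z "$line" ]; then
            first=""                 # blank line: the group is over
        elif [ -z "$first" ]; then
            first="$line"            # first path in a group: the "original"
        else
            printf 'addattr -t string META:dupe "%s" "%s"\n' "$first" "$line"
        fi
    done
}
```

On Haiku this would be something like `fdupes -r ~/Documents | emit_dupe_tags | sh` (after an `mkindex META:dupe` once, so the query stays fast).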
There is a tool named fdupes that can do this on Linux. I think it could be ported to Haiku without much trouble.
We already have fdupes.
Yes, this will work, and I thought about using fdupes to get the duplicate files and then adding an attribute to what it finds. But once you delete one of the duplicates (which is why I want to use the tool), you are left with a single file that has an attribute declaring it a duplicate, so you would have to remove the attribute after deleting the copy.
I’m working on a solution right now that takes the output of fdupes and turns it into a query file, so each result becomes a file that Find will specifically look for: (name=="FILE1" || name=="FILE2" || name=="FILE3") etc.
This way you are not modifying the files in any way.
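A sketch of that conversion step (assuming fdupes’ default output format: one path per line, blank line between groups; the helper name is mine):

```shell
# Build a Tracker-style name query from fdupes output read on stdin,
# e.g. (name=="FILE1")||(name=="FILE2")||(name=="FILE3").
build_query() {
    local query="" line name
    while IFS= read -r line; do
        [ -z "$line" ] && continue   # skip the blank group separators
        name=$(basename "$line")
        if [ -z "$query" ]; then
            query="(name==\"$name\")"
        else
            query="$query||(name==\"$name\")"
        fi
    done
    printf '%s\n' "$query"
}
```

So `fdupes -r ~/Documents | build_query > FD.txt` would give the predicate as a text file, ready to be attached to a query file.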
Which leads me to the question: how do you add the contents of a text file to an attribute? Specifically to a query file (i.e. the stuff in the “Recent Queries” folder).
I see addattr can do it, but I’m having trouble formatting it properly.
The help file says "addattr [-f value-from-file] [-t type] attr file1".
So let’s say I have a query file named FindDups that I want to add values to from a text file named FD.txt. I guess it would be something like:
addattr -f FD.txt [-t type] attr FindDups
But I’m not sure of the "[-t type] attr" part.
You should set an attribute named _trk/qrystr of string type.
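Something like this in Haiku’s Terminal, I believe (untested sketch; application/x-vnd.Be-query is the file type Tracker uses for query files, and the predicate here is just a placeholder):

```
touch FindDups
addattr -t mime BEOS:TYPE application/x-vnd.Be-query FindDups
addattr -t string _trk/qrystr '(name=="FILE1")||(name=="FILE2")' FindDups
```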
But what if a duplicated file has a common name, like readme.md or index.htm? You are going to get false matches.
Yes, that’s true. That’s where I have to do things like sort by file size to check the match. I expect that.
You can check fdupes out in HaikuDepot or here: GitHub - adrianlopezroche/fdupes: FDUPES is a program for identifying or deleting duplicate files residing within specified directories.
The more I think about it, the more I think your solution is better. What I’m going to do is write a script that uses fdupes to get the duplicate files, mark the results using attributes, open them in Find, and then, when the Find window closes, have the script remove the attribute.