Haiku considers OpenXML files (docx, pptx, xlsx) to be zip files

OK, I know that docx, xlsx and .pptx have a zip container, which is why Haiku is technically correct in assigning the application/zip mimetype to them. But this is highly impractical. How can I fix this?

Opening the FileTypes app and adding the file extensions to the correct MIME type, e.g. application/vnd.openxmlformats-officedocument.wordprocessingml.document, has no effect. That is: even after a reboot, Tracker shows all files as Zip archives, regardless of their extension.

Also, right clicking a file and selecting “file types” from the context menu seems to affect only the current file and it involves a two-step process where I have to first manually change the mime type and second, if I am very lucky, I may have the chance to set the desired handler. But even this manual assignment doesn’t seem to survive reboots.

I’m almost certain that there must be another more practical way, or isn’t there? How does MacOS handle this? There, mime types are detected by content sniffing as well, so it should have the same problem.


File extensions mean very little on Haiku since we use MIME types in extended attributes.

This sounds like a bug.

What filesystem type are you working on? This manual assignment probably only works on BFS, as other filesystems don’t support typed extended attributes. (Or no extended attributes at all, e.g. FAT.) But it should work smoothly on BFS at least.

We need MIME sniffing rules for these types. We have them for other zip-based formats like ODTs to distinguish them from .zips, so we just need them for DOCX and the like too.


Then why does the FileTypes app allow adding extensions?

BeFS. I’m not aware that the installer accepts anything else. Does it?

I wasn’t aware that OpenDocument also uses a zip container. In that case, adding something similar for OpenXML should not be too difficult, and, maybe I’m naive, but I’m surprised that it hasn’t been done.

The zip sniff rule has a 40% confidence level when it finds the PK\003\004 pattern at the start of the file:

Any sniffing rule that adds an extra pattern and has a higher confidence, 50% for instance, will win at MIME type detection time.

That’s how the current OASIS OpenDocument file type sniff rules win over the default zip MIME type:

It looks in the first 512 bytes for an extra pattern.

I guess something similar could be done for openxml files.
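To make the "higher confidence wins" idea concrete, here is a minimal Python sketch of how such a rule table could be evaluated. It is not Haiku's sniffer code. The rule layout, the function names, and the OpenDocument pattern used below are assumptions made for illustration; only the 0.40 zip rule and the "extra pattern in the first 512 bytes" idea come from the posts above.

```python
# Each rule: (mime_type, priority, [(start, end, pattern), ...]).
# Every pattern of a rule must match; start..end is the range of offsets
# at which the pattern may begin (assumed semantics for this sketch).
RULES = [
    ("application/zip", 0.40, [(0, 0, b"PK\x03\x04")]),
    # Hypothetical rule: priority and extra pattern are illustrative only.
    ("application/vnd.oasis.opendocument.text", 0.50,
     [(0, 0, b"PK\x03\x04"),
      (0, 512, b"application/vnd.oasis.opendocument.text")]),
]


def pattern_matches(data, start, end, pattern):
    """Return True if pattern begins at some offset in start..end (inclusive)."""
    return any(data.startswith(pattern, off) for off in range(start, end + 1))


def sniff(data, rules=RULES):
    """Return the type of the highest-priority rule that matches, or None."""
    best = None
    for mime, priority, patterns in rules:
        if all(pattern_matches(data, s, e, p) for s, e, p in patterns):
            if best is None or priority > best[1]:
                best = (mime, priority)
    return best[0] if best else None
```

With this table, a plain archive only satisfies the 0.40 rule and stays application/zip, while data that also carries the extra pattern near the start satisfies both rules and the 0.50 one wins.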

This works for compressed containers that have their identifying stuff as the first file. :slight_smile: This doesn’t always have to be the case.

Some formats also use “just” the filename to try and differentiate. Maybe we should allow some file extensions to be the deciding factor for some files.


This issue keeps popping up, and every time, can only be solved with developer intervention. File extensions should really be the deciding factor by default. Files that have the same file extension but need to be treated differently, are the exception, not the other way around. Is Unix purity so important?


I’m undecided about that. However, it is frustrating that as a user I don’t seem to be able to affect the process. The FileTypes application lets you add extensions, but doing so doesn’t have any visible effect.

It has, when there is no sniff rule matching.

Unfortunately, here it’s a type of file which is also a zip file.

What’s missing, maybe, is a confidence value for extensions: a 0.5 confidence on the .docx extension could then win over the 0.4 confidence of the zip sniffing rule.
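A rough Python sketch of that proposal, to show how the two confidences would compete. This is not how Haiku works today: the extension table, its 0.50 value, and the function name are hypothetical, and only the 0.40 zip confidence is taken from the thread.

```python
import os

ZIP_SNIFF_CONFIDENCE = 0.40  # confidence of the zip rule mentioned in the thread

# Hypothetical per-extension confidences (the proposal, not an existing feature).
EXTENSION_CONFIDENCE = {
    ".docx": ("application/vnd.openxmlformats-officedocument"
              ".wordprocessingml.document", 0.50),
}


def guess_type(filename, data):
    """Pick the candidate type with the highest confidence, or None."""
    candidates = []
    if data.startswith(b"PK\x03\x04"):
        candidates.append(("application/zip", ZIP_SNIFF_CONFIDENCE))
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_CONFIDENCE:
        candidates.append(EXTENSION_CONFIDENCE[ext])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]
```

A file named report.docx with zip content would resolve to the wordprocessingml type, while archive.zip with the same content would stay application/zip.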


Is there a sniff rule for docx but with a low priority?

I would rather have something like a mimetype with a dependency.
Something like: such a file also has to be a zip file, that is, without considering dependent types it must resolve to a zip file, and then the zip dependents can battle about which one is right, i.e. which file extension or magic is included (the first file in the archive at a fixed offset, for example).

Of course you can, you can set the mime type attribute manually. Ideally this should be just as easy as changing a file extension, but the ui for editing attributes isn’t quite there yet.

You can also edit the sniffing rules from the filetypes preferences, if you know how to identify files from their content (currently not by their extensions)

This has nothing to do with “UNIX purity” and is more a legacy from classic MacOS (where file extensions weren’t a thing either)

Can you give me a hint what I need to do? BTW, are you talking about changing the attributes for individual files? Because, having to set the mime type for each file individually would also be kind of tedious.

Can you tell me where the files are that I need to edit?

Not the solution you are asking for, but you can leave that at just changing the MIME type and setting the handler for all files of that type from the FileTypes general preferences.

You can also set the type from the command line with settype -t 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' file.docx, do it to several files in one go or compose it with find to set it for all the files in your system if you are brave enough.
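If you prefer a script over a find one-liner, the following Python sketch builds the same settype invocations for all three OpenXML extensions. The settype -t form is the one quoted above; the helper names are made up for this example, and the .xlsx and .pptx types are the standard registered ones rather than something taken from Haiku's database. It only builds the argument lists, so you can inspect them before running anything.

```python
import os

OOXML_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument"
             ".spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument"
             ".presentationml.presentation",
}


def settype_command(path):
    """Return the settype argument list for path, or None if not OpenXML."""
    mime = OOXML_TYPES.get(os.path.splitext(path)[1].lower())
    if mime is None:
        return None
    return ["settype", "-t", mime, path]


def settype_commands(root):
    """Yield one settype argument list per OpenXML file found under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            cmd = settype_command(os.path.join(dirpath, name))
            if cmd is not None:
                yield cmd
```

Printing each list from settype_commands("/boot/home") gives a dry run; passing each list to subprocess.run would actually apply the types.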

Even if better than clicking, that’s still manual.

OpenDocument requires the first file in the zip to be ‘mimetype’, uncompressed and with the mimetype as content, so it’s easy to detect. Microsoft Office formats aren’t that nice. You can check, on top of the zip signature, for the text ‘[Content_Types].xml’ that seems to always be the first file in packages created by Microsoft tools, and for ‘word/document.xml’, ‘ppt/presentation.xml’ and ‘xl/workbook.xml’ depending on the kind of file. Though I think the latter are not really normative, and nothing says they come first, so you may not find the strings in the first chunk of the file.
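Here is a small Python sketch of that first-chunk check, mainly to show where it works and where it falls short. The 512-byte window, the function names, and the in-memory test archives are assumptions for illustration; real documents are compressed and larger, so treat this as a toy model of the idea rather than a sniffing rule.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"
OOXML_PARTS = {
    b"word/document.xml": "application/vnd.openxmlformats-officedocument"
                          ".wordprocessingml.document",
    b"xl/workbook.xml": "application/vnd.openxmlformats-officedocument"
                        ".spreadsheetml.sheet",
    b"ppt/presentation.xml": "application/vnd.openxmlformats-officedocument"
                             ".presentationml.presentation",
}


def sniff_first_chunk(data, chunk_size=512):
    """Guess a type from the first chunk_size bytes only."""
    chunk = data[:chunk_size]
    if not chunk.startswith(ZIP_MAGIC):
        return None
    if b"[Content_Types].xml" in chunk:
        for part, mime in OOXML_PARTS.items():
            if part in chunk:
                return mime
    return "application/zip"  # a zip, but no OpenXML markers in the chunk


def make_zip(members):
    """Build an uncompressed zip in memory from (name, bytes) pairs (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in members:
            zf.writestr(name, payload)
    return buf.getvalue()
```

An archive whose first two entries are [Content_Types].xml and word/document.xml is recognized, but put a 1000-byte entry in front of them and the same check falls back to application/zip, which is exactly the weakness described above.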

You would do it from the FileTypes preferences. Select the type (add it if you don’t already have one) and edit the Rule field in the File recognition box, just below the extensions. If you don’t see it, check Show recognition rule in the Settings menu.


I think LibreOffice ships the sniffing rules and filetype definitions, but they are not in the base system? Should we include them in the base Haiku install?

I vote Yes on that. Sure, we all have feelings about Microsoft, but there are a lot of these files floating around.


OpenDocument types are already defined in Haiku’s mime_db. For instance, for .odt files:

I think something similar can be done for OpenXML documents, using, as pointed out by @madmax, a sniff rule that checks that it’s a zip file but also contains the mandatory [Content_Types].xml name in the zip catalogue.

Something like “0.50 ("PK") [0:512] ("[Content_Types].xml")”

Issues are:

  • there is no guarantee that the [Content_Types].xml entry will be at the top of the file; I have at least one sample XLSX file, a small one at that, where it in fact appears at the bottom of the file
  • you still need a second way to detect the type of OpenXML document (spreadsheet, text or presentation), which will also depend on the presence or absence of some named XML entries within the zip file.
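Both issues go away if the detection reads the zip catalogue instead of a fixed byte range, though that is beyond what a pattern-based sniff rule can express. As a sketch of what such a check would look like, here it is in Python with the standard zipfile module. The function names and the fallback to application/zip are choices made for this example, and the part names are the ones mentioned earlier in the thread, which may not be normative.

```python
import io
import zipfile

OOXML_PARTS = {
    "word/document.xml": "application/vnd.openxmlformats-officedocument"
                         ".wordprocessingml.document",
    "xl/workbook.xml": "application/vnd.openxmlformats-officedocument"
                       ".spreadsheetml.sheet",
    "ppt/presentation.xml": "application/vnd.openxmlformats-officedocument"
                            ".presentationml.presentation",
}


def classify_by_catalogue(data):
    """Classify zip data by its entry names, wherever they sit in the file."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return None  # not a zip at all
    if "[Content_Types].xml" in names:
        for part, mime in OOXML_PARTS.items():
            if part in names:
                return mime
    return "application/zip"


def build_zip(members):
    """Build a zip in memory from (name, bytes) pairs (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in members:
            zf.writestr(name, payload)
    return buf.getvalue()
```

Because the entry list is read from the catalogue at the end of the archive, a workbook whose [Content_Types].xml is the last entry is classified the same as one where it is first.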