
Feature request: encoding detection

Posted: Fri May 24, 2013 3:42 pm
by ltn
Hi, Keka developer(s):

I'm a Chinese user, and I just started using Keka to unarchive zip files. A zip file usually carries no information about the encoding of its file names, so when I unzip an archive whose file names are encoded in GBK, the extracted names come out as unrecognizable characters.

Suppose I have a file 简体&英文.srt whose filename is in GBK (which is quite typical on a Windows computer), and I compress it into a zip file on Windows. If I then unzip it on a Mac using Keka, the filename becomes ¼òÌå&Ó¢ÎÄ.srt. This is because the default encoding on the Mac is UTF-8, so both the system unzipper and Keka treat the name as UTF-8, when it is actually in GBK, the default filename encoding on Chinese-language Windows.
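
For anyone who wants to reproduce the effect, here is a minimal Python sketch. It assumes the original name was 简体&英文.srt (that is what the garbled string decodes back to) and that the extractor ended up reading the bytes one per character, as Latin-1 does. Which fallback a given unzipper really uses is an assumption here, not something stated in this thread.

```python
# The name as stored in the archive: GBK bytes, with nothing marking them as GBK.
original = "简体&英文.srt"
raw = original.encode("gbk")

# Assumption: the extractor maps each byte to one character (Latin-1).
garbled = raw.decode("latin-1")
print(garbled)    # ¼òÌå&Ó¢ÎÄ.srt

# Undoing the wrong step and applying the right one recovers the name.
recovered = garbled.encode("latin-1").decode("gbk")
print(recovered)  # 简体&英文.srt
```

The point of the round trip is that no information is lost by the wrong decode, so the name can still be repaired after extraction as long as the real encoding is known.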

So, can you add encoding auto-detection to Keka for non-English-speaking users?

Thank you very much.

Re: Feature request: encoding detection

Posted: Sat May 25, 2013 10:05 am
by Lotusbrod
Hello ltn

Obviously it is aone who can say whether the functionality you are looking for can be provided, but as I have some experience of automatic encoding detection I thought I would comment. Unfortunately, discovering the encoding of a particular filename is not as straightforward as you might think, especially as a filename does not need to make sense in the language you are using. However, there is a way to solve your problem: provide manual encoding selection. I am not saying this is a trivial task, but it is certainly less difficult and would be a lot more reliable than an automated system.

I would think of it working something like this:
1. Add an option to the extraction settings along the lines of "Allow filename encoding to be set".
2. When this flag is set and you drag and drop an archive into the extraction area, it opens a dialog to set the encoding. Think of what happens when you open a CSV file in OpenOffice: you get a dialog box which shows the encoding and language and lets you preview the contents of the file, so you can make sure the encoding you choose gives readable results. The language would be picked up from the system environment and the encoding would be UTF-8 by default, as that is the system default, but you could change both using dropdown lists. The preview section could then display the structure of the archive so you can check whether the chosen encoding results in readable filenames after extraction.
3. After finding the right encoding, you confirm the extraction with the chosen encoding and the filenames are converted to UTF-8.
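
The preview step could be sketched roughly as follows. This is only an illustration in Python, not Keka code: `preview_name` is a hypothetical helper, the candidate list is arbitrary, and the raw bytes are a stand-in for a name read from an archive.

```python
from typing import Dict, List, Optional


def preview_name(raw: bytes, encodings: List[str]) -> Dict[str, Optional[str]]:
    """Decode the raw name bytes under each candidate; None marks a failed decode."""
    previews: Dict[str, Optional[str]] = {}
    for enc in encodings:
        try:
            previews[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            previews[enc] = None
    return previews


# Stand-in for the undecoded name bytes of one archive entry.
raw_name = "简体&英文.srt".encode("gbk")

# One preview row per encoding in the dropdown list.
for enc, text in preview_name(raw_name, ["utf-8", "gbk", "big5", "shift_jis"]).items():
    shown = text if text is not None else "(not valid in this encoding)"
    print(f"{enc:10} {shown}")
```

A dialog would show these rows and let the user pick the one that reads correctly. Note that more than one encoding may decode without error, which is exactly why a human looking at the preview is more reliable than the program deciding on its own.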

Of course, code for converting the filenames would still need to be written (as far as I am aware, Apple do not provide off-the-shelf conversion in Objective-C), and I don't know how big a job that might be.

Re: Feature request: encoding detection

Posted: Sat May 25, 2013 5:36 pm
by ltn
Hi Lotusbrod,

Thanks for replying.

As far as I know, it is impossible to detect the encoding of a string with 100% accuracy. But there are plenty of apps, such as TextMate, that can "guess" the encoding of text with rather high accuracy; if the guess fails, they ask the user to choose the encoding, with a preview window showing the decoded result. I don't really know how they do it, and I don't think it's a trivial task. But perhaps there's a library that does this?
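
Libraries for statistical guessing do exist (chardet for Python and the charset detector in ICU are two examples). The sketch below is much cruder and is not how TextMate or any particular app works: it tries strict UTF-8 first, then a list of candidates in order, and returns None when nothing fits so the caller can fall back to asking the user. `guess_name` and the candidate order are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def guess_name(raw: bytes, candidates: List[str]) -> Optional[Tuple[str, str]]:
    """Return (encoding, decoded name) for the first clean decode, or None."""
    # Strict UTF-8 goes first: legacy-encoded bytes rarely happen to be valid UTF-8.
    for enc in ["utf-8"] + candidates:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None  # nothing decoded cleanly: time to ask the user


print(guess_name("简体&英文.srt".encode("gbk"), ["gbk", "big5"]))
print(guess_name("résumé.txt".encode("utf-8"), ["gbk", "big5"]))
print(guess_name(b"\xff", ["ascii"]))
```

A clean decode does not prove the guess is right, since the same bytes can be valid in several legacy encodings. Ordering the candidates by the user's system language is one plausible way to improve the odds.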

Your solution is a very good idea if encoding "guessing" is too hard to implement. Thank you very much.

Re: Feature request: encoding detection

Posted: Mon May 27, 2013 9:35 am
by aone
I'll have to see if p7zip supports any encoding flag first. If not, this will have to wait for a binary-free Keka build.

Re: Feature request: encoding detection

Posted: Tue Jun 04, 2013 2:57 pm
by ltn
Hi, if p7zip doesn't support encoding, you can simply use a shell script to convert the file names after unzipping.
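
The same idea, sketched in Python rather than shell so it is easy to test. It assumes the names were garbled by decoding GBK bytes as Latin-1 (an assumption; the right pair depends on the extractor), and it only handles the entries directly inside one folder. `plan_renames` and `fix_directory` are hypothetical names.

```python
import os
import unicodedata
from typing import List, Tuple


def plan_renames(names: List[str], wrong: str = "latin-1",
                 actual: str = "gbk") -> List[Tuple[str, str]]:
    """Pair each garbled name with its repaired form; skip names that do not fit."""
    plan = []
    for name in names:
        # The file system may hand back decomposed characters; recompose them
        # so that each accented letter maps back to a single byte.
        composed = unicodedata.normalize("NFC", name)
        try:
            fixed = composed.encode(wrong).decode(actual)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # not produced by this wrong/actual pair; leave it alone
        if fixed != composed:
            plan.append((name, fixed))
    return plan


def fix_directory(path: str, wrong: str = "latin-1", actual: str = "gbk") -> None:
    """Rename the entries directly inside `path` according to plan_renames."""
    for old, new in plan_renames(os.listdir(path), wrong, actual):
        os.rename(os.path.join(path, old), os.path.join(path, new))


print(plan_renames(["¼òÌå&Ó¢ÎÄ.srt", "readme.txt"]))
```

Splitting the plan from the rename makes it possible to show the user what would change before touching any files. Plain ASCII names and names that are already correct are left out of the plan.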

Thanks aone, great work!

Re: Feature request: encoding detection

Posted: Tue Feb 25, 2014 11:09 am
by SergeyGomanyuk
Keka is a very good archiver, but it could be the best one if filename encodings were supported for zip archives. I'd like to bump this topic and describe my situation:
I often deal with zip archives created on Windows, and I create zip archives for people who use Windows. Everything works fine as long as filenames use only Latin letters. Once national letters, such as Cyrillic, appear in an archive, the trouble starts: the names of the extracted files are unreadable. So I currently use The Unarchiver for extracting zip archives, because its zip settings let me choose the filename encoding. For creating zip archives I use CleanArchiver, which likewise lets me set the filename encoding of the zip archive. If such settings were available in Keka, it would be a solid and, to my mind, the best archiver for the Mac, one I wouldn't even mind paying for.