Practical Use of Apple’s Dictionary (2)

I have written a couple of articles about converting StarDict dictionaries into Apple’s Dictionary format with DictUnifier.app, but it seems that not every StarDict dictionary can be converted into Apple’s format so easily. Quite a few errors reported on the bulletin board of the project “mac-dictionary-kit” remain unsolved.

I ran into this issue myself yesterday, while converting Klaus Mylius’ Sanskrit-Deutsch Dictionary:

https://skalldan.files.wordpress.com/2011/06/wpid-tr_error.png

An error occurred during the tr process:

tr: Illegal byte sequence

To run the conversion script from the Mac OS X terminal, I installed sdconv (the command-line tool behind DictUnifier.app, as I mentioned in yesterday’s post) and then tried the same process again.

$ cd ~/tmp
$ wget http://mac-dictionary-kit.googlecode.com/files/sdconv-0.3.tar.bz2
$ bunzip2 -c sdconv-0.3.tar.bz2 | tar xvf -
$ cp -auv sdconv /usr/local/
$ /usr/local/sdconv/convert stardict-mylius-sanskrit-deutsch.tar.bz2
  ...
- Building mylius.dictionary.
- Cleaning objects directory.
- Preparing dictionary template.
- Preprocessing dictionary sources.
tr: Illegal byte sequence
Error.
 ...

OK, so I confirmed the same error. Next, I tried again with LC_ALL=C (following a hint I found here).

$ LC_ALL=C /usr/local/sdconv/convert stardict-mylius-sanskrit-deutsch.tar.bz2 
  ...
- Building mylius.dictionary.
- Cleaning objects directory.
- Preparing dictionary template.
- Preprocessing dictionary sources.
utf8 "xB9" does not map to Unicode at /usr/local/sdconv/bin/make_line.pl line 51, <> chunk 5045.
utf8 "xC4" does not map to Unicode at /usr/local/sdconv/bin/make_line.pl line 51, <> chunk 9403.
utf8 "xB9" does not map to Unicode at /usr/local/sdconv/bin/make_line.pl line 51, <> chunk 25687.
utf8 "xB9" does not map to Unicode at /usr/local/sdconv/bin/make_line.pl line 51, <> chunk 27566.
utf8 "xB9" does not map to Unicode at /usr/local/sdconv/bin/make_line.pl line 51, <> chunk 36932.
utf8 "xDC" does not map to Unicode at /usr/local/sdconv/bin/make_line.pl line 51, <> chunk 50059.
utf8 "xC4" does not map to Unicode at /usr/local/sdconv/bin/make_line.pl line 51, <> chunk 52864.
- Extracting index data.
- Preparing dictionary bundle.
- Adding body data.
- Preparing index data.
- Building key_text index.
- Building reference index.
- Fixing dictionary property.
- Copying CSS.
- Finished building objects/mylius.dictionary.
Done.
  ...

This time the conversion completed, albeit with some warnings. Let’s check how it works in Dictionary.app:

https://skalldan.files.wordpress.com/2011/06/wpid-mylius.png

I cannot fully vouch for its continued stability, but for the moment it seems fine.

DictUnifier.app is a very useful tool, but it appears that we should not put too much confidence in it.
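Why does LC_ALL=C make the tr error go away? As far as I understand, BSD tr on Mac OS X interprets its input according to the current locale, so in a UTF-8 locale it aborts on bytes that are not valid UTF-8, whereas in the C locale it simply processes raw bytes. A minimal sketch of the difference (the 0xB9 byte is just an example echoing the warnings above, and this assumes a UTF-8 terminal):

$ printf '\xB9' | tr -d 'a'            # in a UTF-8 locale, tr rejects the invalid byte
tr: Illegal byte sequence
$ printf '\xB9' | LC_ALL=C tr -d 'a'   # in the C locale, the byte passes through untouched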

Convert Babylon Dictionaries?!

While browsing the mac-dictionary-kit Issues, I found an interesting comment posted on Issue 4, “Stardict-babylon format not supported” (comment no. 27), in which the poster explained “HOW CONVERT/ADD BABYLON DICTIONARIES TO MAC DICTIONARY”. If this can be made to work on Mac OS X, we could make use of the thousands of free Babylon dictionaries found here:

http://www.babylon.com/free-dictionaries/

Having followed these procedures myself, let me give the conclusion first: they are a little complicated. Furthermore, the conversion has to be done on Linux (such as Ubuntu), so this topic will not interest every Mac user.

Nevertheless, I will note the procedure concisely below before I forget it. It may be useful, but it comes without any warranty. I am NOT experienced with Linux, so please let me know if I have made any fatal mistakes.

The idea is to convert an existing Babylon dictionary to StarDict format (on Linux), and then from StarDict to Apple’s Dictionary (on Mac OS X). The dictionary I chose is Jeffrey Hopkins’ Tibetan-Sanskrit-English Dictionary, which is provided here only in *.bgl format.

For the next steps, I will work on Ubuntu 10.10 (Maverick Meerkat) running under VMware Fusion (3.1.3) on Mac OS X.

In a terminal (Ubuntu):

$ cd ubuntu-work   # <-- shared with osx
$ mkdir hopkins
$ cd hopkins
$ wget http://buddhistinformatics.ddbc.edu.tw/glossaries/files/babylon-hopkins.ddbc.bgl
 ...
$ ls
babylon-hopkins.ddbc.bgl
$ sudo apt-get install stardict stardict-tools dictconv
 ...
$ dictconv babylon-hopkins.ddbc.bgl -o babylon-hopkins.ddbc.ifo # <-- note the *.ifo output extension
...

Results
File: babylon-hopkins.ddbc.ifo
Title: Jeffrey Hopkins' Tibetan-Sanskrit-English Dictionary
Author: Jeffrey Hopkins
Email: da@ddbc.edu.tw
Version: 
License: This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Description: 
Original Language: Other
Destination Language: English
Headwords: 18441
Words: 18382

$ ls
babylon-hopkins.ddbc.idx babylon-hopkins.ddbc.bgl   # <-- confirm 3 files (*.ifo, *.dict, *.idx)
babylon-hopkins.ddbc.ifo babylon-hopkins.ddbc.dict
$ mkdir hopkins-stardict
$ mv babylon-hopkins.ddbc.idx babylon-hopkins.ddbc.ifo babylon-hopkins.ddbc.dict hopkins-stardict/
$ sudo cp -auv hopkins-stardict /usr/share/stardict/dic/hopkins
$ stardict &  # <-- launch stardict

At this point, you should see a window like the following:

https://skalldan.files.wordpress.com/2011/06/wpid-hopkins_fault.png

At first sight, it seems OK, but some characters are broken and the HTML tags (<p> </p>, etc.) remain displayed.

https://skalldan.files.wordpress.com/2011/06/wpid-hopkins_fault_2.png

Bizarre… I have to check the contents.

EDIT: 2011/06/18
Thanks to a knowledgeable adviser who left a helpful comment, this has been solved… First see his comment and the ones that follow it; the solution to this puzzle is there. Then, if you are interested in my labor, come back here and continue reading.

$ cd hopkins-stardict/
$ stardict2txt babylon-hopkins.ddbc.ifo   # <-- convert to text file
Write to file: babylon-hopkins.ddbc.txt
$ emacs babylon-hopkins.ddbc.txt

https://skalldan.files.wordpress.com/2011/06/wpid-hopkins_mojibake.png

Some characters were not converted correctly (ā, ī, ū, etc.). So I MANUALLY picked out the mojibake, checking the correct characters against GoldenDict’s display1, and converted the HTML tags into linefeed codes (\n), like this:

# hopkins_conv.sed

# for unicode characters 
s/Ä/ā/g
s/âˆš/√/g
s/á¹­/ṭ/g
s/á¹ƒ/ṃ/g
s/Å›/ś/g
s/á¹‡/ṇ/g
s/Ã±/ñ/g
s/á¹›/ṛ/g
s/á¸/ḍ/g
s/Å«/ū/g
s/á¹£/ṣ/g
s/Ä€/Ā/g
s/Ä«/ī/g
s/á¸¥/ḥ/g
s/á¹…/ṅ/g
s/á¹/ṝ/g
s/Åš/Ś/g
s/â€”/---/g

# for HTML tags
s/<p><b>/\\n/g
s/<b>/\\n/g
s/<\/b>/\\n/g
s/<\/p>//g
s/<ul><li>//g
s/<\/li><li>/\\n/g
s/<\/li><\/ul>//g
s/\\n\\n/\\n/g

These manual procedures were quite tedious, and some oversights may remain… Does anyone know a better way?
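Incidentally, the reason I turned the tags into the two-character sequence \n (or deleted them) rather than inserting real line breaks is that, as far as I understand, tabfile (used below) expects one entry per line: the headword, a literal tab character, then the definition, with \n inside the definition standing for a line break. A made-up entry in that format would look like this (<TAB> stands for a real tab; the headword and definition are invented for illustration):

dharma<TAB>doctrine; teaching\nSee also: dharmatā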

EDIT: 2011/06/18
Don’t follow this manual approach. For the mojibake, just type this in a Ubuntu shell:
iconv -f UTF-8 -t ISO-8859-1 babylon-hopkins.ddbc.txt -o babylon-hopkins.utf8.ddbc.txt
To remove the HTML tags, you can use html2text.
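To see why this reverse conversion restores the text: the file was apparently double-encoded, i.e. correct UTF-8 bytes were at some point re-read as Latin-1 characters and encoded as UTF-8 once more, and running iconv from UTF-8 back to ISO-8859-1 undoes exactly that last step. A minimal round-trip sketch (the file names here are just examples, not files from this project):

$ printf 'ā\n' > good.txt                            # correct UTF-8 for ā (bytes c4 81)
$ iconv -f ISO-8859-1 -t UTF-8 good.txt -o bad.txt   # re-read those bytes as Latin-1: the mojibake “Ä” (plus an invisible control) appears
$ iconv -f UTF-8 -t ISO-8859-1 bad.txt -o fixed.txt  # the reverse trip restores the original bytes
$ cmp good.txt fixed.txt && echo identical
identical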

Returning to my original manual procedure: apply the sed script to the original file (babylon-hopkins.ddbc.txt).

$ sed -f hopkins_conv.sed babylon-hopkins.ddbc.txt > babylon-hopkins-rev.txt
$ /usr/lib/stardict-tools/tabfile babylon-hopkins-rev.txt
Convert over.
babylon-hopkins-rev wordcount: 18382
$ mkdir hopkins-rev
$ mv babylon-hopkins-rev.ifo babylon-hopkins-rev.idx babylon-hopkins-rev.dict.dz hopkins-rev/
$ sudo cp -auv hopkins-rev /usr/share/stardict/dic/hopkins
$ stardict &

This time it looks good.

https://skalldan.files.wordpress.com/2011/06/wpid-hopkins_suc.png https://skalldan.files.wordpress.com/2011/06/wpid-hopkins_suc_2.png

OK, now we come back to Mac OS X.

In a terminal (Mac OS X):

$ cd ~/ubuntu-work/hopkins/hopkins-stardict/  # <-- shared with ubuntu
$ tar -jcvf babylon-stardict-hopkins.tar.bz2 hopkins-rev
$ /usr/local/sdconv/convert babylon-stardict-hopkins.tar.bz2 -n Hopkins_Tibetan_Dictionary -i hopkins
  ...
Done.
To test the new dictionary, try Dictionary.app.
$ open -a Dictionary.app

https://skalldan.files.wordpress.com/2011/06/wpid-hopkins_mac.png https://skalldan.files.wordpress.com/2011/06/wpid-hopkins_mac_22.png

Voilà, it works!

Footnotes:

1 GoldenDict is a multifunctional dictionary search program that supports multiple dictionary formats, such as Babylon, StarDict, Dictd, ABBYY Lingvo, and so on. For details, see jalasthāna’s earlier article.

19 thoughts on “Practical Use of Apple’s Dictionary (2)”

  1. It seems to me that the mojibake results from the fact that at some step of the conversion process the UTF-8 encoded characters were read as if the encoding were Windows-1252 (CP-1252). You can see that from the misinterpretation of ‘ṇ’ as á¹‡. This requires ‡ to have the code point 0x87, which it has in Windows-1252, but not, e.g., in Latin-1. (See here for more details: http://jalasthana.de/wiki/Fonts_and_Encodings)

    Maybe you are lucky and the misinterpretation can be corrected by misinterpreting it again. Try the following in a Ubuntu shell:

    iconv -f UTF-8 -t Windows-1252 babylon-hopkins.ddbc.txt -o babylon-hopkins.utf8.ddbc.txt

    In general I guess the problem should be reported to the author of dictconv, Raul Fernandes (rgfbr@yahoo.com.br), or to anybody else who knows C++, as the tool is open source: http://ktranslator.sourceforge.net/download.html

    To remove HTML tags from files you may try the tool html2text (available from the Ubuntu repositories). I find it quite useful.

    • Thank you for the useful information.
      I tried your suggestion in a Ubuntu shell:

      iconv -f UTF-8 -t Windows-1252 babylon-hopkins.ddbc.txt -o babylon-hopkins.utf8.ddbc.txt

      Unfortunately, the result was that the garbled characters were simply replaced by other garbled characters…

      I am always confused by mojibake problems, but I will try again after reading the page you pointed to (it is awesome).

      • Manuel san,

        I found the answer:

        iconv -f UTF-8 -t ISO-8859-1 babylon-hopkins.ddbc.txt -o babylon-hopkins.utf8.ddbc.txt

        As you said, I was so lucky. Thanks again for a big hint.

      • You are right. It works if one assumes a misinterpretation as Latin-1, so my Windows-1252 hypothesis was wrong… This is puzzling. I should read my own wiki article again, I guess :-D
        But I am happy that it works now.

      • Without your suggestion, I would NEVER have found it.

        Like my handle (skal ldan, “having luck”), I am lucky to have had the opportunity to study these puzzling encodings and to come across your instruction and your clear, detailed wiki page ;-)

  2. Pingback: 辞書.app を活用する | Amrta

  3. Pingback: Practical Use of Apple’s Dictionary | Amrta

  4. Pingback: 辞書.app を活用する (3) | Amrta

  5. Pingback: フランス語辞書備忘録 | Amrta

    • As you said, I confirmed the same message “Format not supported”.

      The simple answer is that the format of this dictionary cannot be converted correctly by DictUnifier.app, but I guess that if you remove the HTML tags from the dictionary’s contents, DictUnifier.app may be able to convert it into Apple’s format. However, in order to check the contents of stardict-collins5, you have to work on Linux. The procedure is a little complicated, as I mentioned in this article…

      • Have you already managed it? I tried removing the HTML tags from the dictionary on Ubuntu. If you can accept that all the useful links will be erased from the dictionary, you can just type the following lines in a shell.

        $ stardict2txt Collins5.ifo
        $ sed -e "s/<br>/\\\n/g" Collins5.txt > Collins5_br.txt
        $ sed -e "s/<[^>]*>//g" Collins5_br.txt > Collins5_rev.txt
        $ tabfile Collins5_rev.txt
        
  6. Pingback: docXter » DictUnifier – StarDict

  7. I can’t get the Thai-English StarDict dictionary to convert at all. DictUnifier (both the old and the newer GUI versions) and this method all give me the same error, with no .dictionary file created: Error: Parse failure [์ยาเสพติด 10758034 0 ์ยาเสพติด ]. normalize_key_text aborted. Error. ditto: can’t get real path for source “/Applications/DictUnifier.app/Contents/Resources/sdconv-0.2/bin/build_dict.sh” lexitron-te-2.0 Dictionary.xml Dictionary.css DictInfo.plist

    Any ideas?

  8. I tried this process but when I use tabfile the program gives me an error that says: “No tab! Skipping line”. But I have \t after every word entry. Would you happen to have a solution to this?

    • Also, I used textutil to convert the HTML to txt, by first changing the extension from txt to html and then converting with UTF-8 encoding:

      textutil -inputencoding utf8 -convert txt -encoding utf8 ~/Desktop/dict.txt

      and also by breaking the txt file up into smaller chunks with split (so that textutil could convert them; it only converted 40 MB when I tried the whole thing), renaming the chunks to the html extension, and then concatenating them with cat after the conversion.

      • actually that’s:
        textutil -inputencoding utf8 -convert txt -encoding utf8 ~/Desktop/*.html

  9. Pingback: Sanskrit .dictionary files for Mac OS - lalitaalaalitah
