Cyrillic encodings | Maria Gouskova

It seems to be increasingly uncommon for programs to guess encodings correctly or to offer an easy way to switch between them. Or maybe this is just a Linux problem. In any case, I’ve found myself looking this up repeatedly after failing to get the right results in various text editors (Vim, Gedit, xed). This used to be so easy in TextWrangler on Mac OS: there’s an actual menu item, “Reopen using encoding…”, with a drop-down list. You just pecked around the Cyrillic options until you saw something other than alphabet salad.

Anyway, here is the way to view and change encoding. One common source of problems is that a plain text file does not record its own encoding, so tools have to guess it from the bytes, and the guess is often wrong. For example, Sharoff’s frequency lists are encoded in CP-1251, which he says himself here. But the file utility reports them as ISO-8859:


$ file lemma_al.txt

lemma_al.txt: ISO-8859 text, with CRLF line terminators

This mismatch is what causes Vim and its ilk to display trash instead of Cyrillic. Since nothing in the file records the original encoding, you have to do some guessing. In this case, the following command worked on the first try:

$ iconv -f cp1251 -t utf8 lemma_al.txt -o lemma_al_utf8.txt

This converts from CP-1251 (the -f argument) to UTF-8 (the -t argument), reading lemma_al.txt as the input and writing the result to the file named after -o. Open the output in a text editor to see if it did the trick.
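If your iconv lacks the -o option (POSIX does not require it), plain shell redirection does the same job. Here is a minimal sketch that wraps the conversion in a small function. The name to_utf8 is made up for illustration, and refusing to overwrite an existing output file is just a cautious design choice, not something iconv does for you.

```shell
#!/bin/sh
# to_utf8 is a hypothetical helper name, not part of iconv.
# Usage: to_utf8 SOURCE_ENCODING INPUT_FILE OUTPUT_FILE
to_utf8() {
    # Refuse to clobber an existing file, so a wrong guess never destroys data.
    if [ -e "$3" ]; then
        echo "to_utf8: $3 already exists" >&2
        return 1
    fi
    # Use redirection instead of -o; remove the partial output if iconv fails.
    if ! iconv -f "$1" -t UTF-8 < "$2" > "$3"; then
        rm -f "$3"
        return 1
    fi
}

# Example with the file from above (commented out so the sketch has no side effects):
# to_utf8 CP1251 lemma_al.txt lemma_al_utf8.txt
```

The original file is never modified, so if the result is still alphabet salad you can delete the output and try another source encoding.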

Now, CP-1251 is just one encoding. What others might you have to try? There is a good review here. The usual legacy encodings are koi8r, koi8u, cp866, ruscii, cp1251, iso8859-5. They are known under different names sometimes, so you might have to do some digging to get it right; running iconv -l lists the names your system knows.
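Since the guessing is mechanical, it can be scripted. The sketch below prints the first line of a file as decoded under each candidate encoding, assuming your iconv accepts these particular spellings of the names. Both try_encodings and sample_cp1251.txt are hypothetical names. The sample holds the CP-1251 bytes of the word “привет”, so only the CP1251 line of the output should spell a real word.

```shell
#!/bin/sh
# try_encodings is a hypothetical helper name. It prints the first line of FILE
# as decoded under each candidate legacy encoding, so you can spot the readable one.
try_encodings() {
    for enc in KOI8-R KOI8-U CP866 CP1251 ISO-8859-5; do
        # A failed conversion yields an empty preview rather than aborting the loop.
        preview=$(iconv -f "$enc" -t UTF-8 < "$1" 2>/dev/null | head -n 1)
        printf '%s: %s\n' "$enc" "$preview"
    done
}

# Hypothetical sample file: the CP-1251 bytes of the word "привет".
printf '\357\360\350\342\345\362\n' > sample_cp1251.txt
try_encodings sample_cp1251.txt
```

Once one line looks right, run the full conversion with that encoding as the -f argument.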