It seems to be increasingly uncommon for programs to have an easy way to guess encodings correctly and to manipulate them gracefully. Or maybe this is just a Linux problem. In any case, I’ve found myself looking this up repeatedly, after failing to get the right results in various text editors (VIM, Gedit, xed). This used to be so easy in Mac OS’s TextWrangler: there’s an actual menu item, “Reopen using encoding…”, with a drop-down list. You just pecked around the Cyrillic options until you saw something other than alphabet salad.
Anyway, here is the way to view and change encoding. One common source of problems is that the encoding reported for a file is often wrong: a plain text file carries no encoding label, so tools such as file have to guess from the bytes. For example, Sharoff’s frequency lists are encoded in CP-1251, which he says himself here. But file reports them as ISO-8859.
$ file lemma_al.txt
lemma_al.txt: ISO-8859 text, with CRLF line terminators
This mismatch is what causes VIM and its ilk to display trash instead of Cyrillic. Since the original information is lost, you have to do some guessing. In this case, the following command worked on the first try:
Here -f names the encoding to convert from (CP-1251) and -t the encoding to convert to (UTF-8); the next argument is the input file, and the -o argument is the output file. Open the output in a text editor to see if it did the trick.
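The -f, -t, and -o options described above match those of the iconv utility. For anyone who would rather script the same step, here is a minimal sketch of the conversion in Python, assuming the input really is CP-1251. The function name convert_encoding and the output filename in the usage comment are hypothetical, not taken from the original command.

```python
from pathlib import Path


def convert_encoding(src, dst, from_enc="cp1251", to_enc="utf-8"):
    """Re-encode a text file. from_enc is only a guess and must be checked by eye."""
    raw = Path(src).read_bytes()
    # Raises UnicodeDecodeError if some byte is not defined in from_enc.
    text = raw.decode(from_enc)
    # Writing bytes (not text) leaves the CRLF line terminators untouched.
    Path(dst).write_bytes(text.encode(to_enc))
    return text


# Usage (output name is made up for illustration):
# convert_encoding("lemma_al.txt", "lemma_al.utf8.txt")
```

Reading and writing in binary mode is a deliberate choice here: it changes only the character encoding and leaves line endings as they were.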
Now, CP-1251 is just one encoding. What others might you have to try? There is a good review here. The usual legacy encodings are koi8r, koi8u, cp866, ruscii, cp1251, iso8859. They are known under different names sometimes, so you might have to do some digging to get it right.
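The pecking-around approach can be roughly automated: decode the same bytes under each candidate and look at the previews until one is not alphabet salad. The sketch below is only that, a sketch. The codec names are Python's spellings of the encodings listed above, and two of them are my assumptions: cp1125 for RUSCII, and iso8859_5 for the Cyrillic part of ISO-8859. The function name preview_encodings is hypothetical.

```python
# Python codec names for the legacy Cyrillic encodings listed above.
# Assumptions: "ruscii" maps to cp1125, "iso8859" means ISO-8859-5.
CANDIDATES = ["koi8_r", "koi8_u", "cp866", "cp1125", "cp1251", "iso8859_5"]


def preview_encodings(raw, candidates=CANDIDATES, width=60):
    """Decode the same bytes under each candidate and return short previews."""
    previews = {}
    for name in candidates:
        # errors="replace" marks undecodable bytes with U+FFFD instead of failing.
        text = raw.decode(name, errors="replace")
        previews[name] = text[:width]
    return previews


# Usage: read the file as bytes, then eyeball the candidates.
# raw = open("lemma_al.txt", "rb").read()
# for name, sample in preview_encodings(raw).items():
#     print(f"{name:10} {sample!r}")
```

There is no automatic scoring here on purpose. Most of these encodings map the same high bytes to some Cyrillic letter, so a wrong guess still produces Cyrillic-looking text, and a human reader is the simplest judge of which preview contains real words.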
Gouskova, Maria. 2021. Phonological asymmetries between roots and affixes. Submitted to the Blackwell Companion to Morphology, Eds. Peter Ackema, Sabrina Bendjaballah, Eulàlia Bonet, and Antonio Fábregas.
This review surveys the phonological asymmetries between roots and non-roots (affixes, clitics). It starts with an extraphonological, structural definition of roots, and considers those non-phonological properties that are phonologically relevant: they are easily borrowed, and they are most deeply embedded. The empirical portion of the review concentrates on templaticism and size restrictions, asymmetries in segmental contrast/inventories, the properties of multi-root words (compounds), and accentual characteristics that differ between roots and affixes. The theoretical section surveys theories that account for these properties: Prosodic Morphology, Positional Faithfulness, the cycle and its analogs, and Anti-Faithfulness. I then critically review several recent and not-so-recent proposals that blur the line between affixes and roots, using the ‘root’ designation diacritically or recasting diacritic distinctions as structural distinctions. The concluding section discusses the role of roots in phonological learnability.
Gouskova, Maria and Jonathan David Bobaljik. 2021/to appear. The lexical core of a complex functional affix: Russian baby-diminutive -onok. Natural Language and Linguistic Theory. [pdf]
Like other syntactic elements, affixes are sometimes said to be heads or modifiers. In Russian, one suffix, -onok, can be either: as a head, it is a size diminutive denoting baby animals, and as a modifier, it is an evaluative with a dismissive/affectionate flavor. Various grammatical properties of this suffix differ between the two uses: gender, declension class, and interaction with suppletive alternations, both as target and trigger. We explore a reductionist account of these differences: the baby diminutive comprises a lexical morpheme plus a functional nominalizing head, while the evaluative affix is the lexical morpheme alone. We contend that our account is superior to two conceivable alternatives: first, the view that these are homophonous but unrelated affixes, and second, a cartographic alternative, whereby diminutives attach at different levels in a universal structure.
These are some resources for phonetics students who want to know what languages have certain sounds, how these sounds are produced, and where in the world the languages are spoken.
World Atlas of Language Structures: this is a resource on linguistic typology, the classification of languages according to various characteristics. There is a page listing features of interest, and the atlas can be searched for specific language names, as well. There is a page on the velar nasal, for example.
As with any typological resource, it is a good starting point, but you should always look at primary sources for further research.
The UCLA Phonetics Vowels and Consonants page: A classic resource that goes with Peter Ladefoged’s books A Course in Phonetics and Vowels and Consonants. For many languages, there are audio files of minimal pairs illustrating unusual contrasts. The audio was often recorded in the field so the quality is sometimes fuzzy. A newer version of the same materials can be accessed on Keith Johnson’s website.
International Dialects of English Archive: this has recordings of English speakers reading the same two texts. For American dialects, there are multiple speakers from each state, and their age and some other demographic information are given.
Articulatory IPA: A great collection of short MRI, ultrasound, and schematic videos illustrating various sounds.
Illustrations of the IPA: From the Cambridge University Press Journal of the International Phonetic Association, a series of articles that sketch individual languages’ sound systems. Search the journal contents by language name or sound type. Many of the articles are open access, and they come with high-quality audio files that go with the transcriptions in the article. To see the audio files for an article, click on its “Supplementary Materials” tab.
UPSID: the UCLA Phonological Segment Inventory Database. This is one of the older databases, with just 451 languages, but it is supposed to be balanced geographically and genetically (that is, related languages are not overrepresented). It’s a good starting point for researching the typology of sound inventories.
PhoNE (Phonology in the NorthEast) is the current incarnation of a series of annual workshops, mostly on phonology, which have been meeting on the East Coast for over two decades.
Historically, the names were acronyms based on the participating schools:
RUMMIT was the Rutgers-UMass-MIT phase of the meeting. This name was used from 2009 until 2014 or so.
UMMM was the UMass-MIT Meeting on phonology, a.k.a. MUMM. These names were used in 2008-9.
HUMDRUM stood for “Hopkins, U of Maryland, Rutgers, UMass”. This name was used in 2000-2009.
RUMJCLaM was the “Rutgers-UMass Joint Class Meeting”. Before that, RUMD. These names were used in the 1990s.
Here are the locations and dates of previous meetings. Corrections are welcome, and thanks to Juliet Stanton for help in tracking these down!
2019: Yale, April 13
2018: MIT, March 31
2017: UMass, April 8
2016: NYU, April 9
2015: Yale, April 2
2014: MIT, April 26
2013: UMass, April 6
2011: Rutgers, May 16
2010: MIT, December 4
2009: UMass, November 1
2009: MIT, May 9
2008: UMass, November 22
2008: MIT, March 29
2008: Rutgers, April 26
1998: UMass (RUMJCLaM)
1997: MIT (as Bay and Berkshires Phonology)
This is a 30-page overview of phonological features, which I wrote for the phonology classes I teach at NYU. It is intended to be accessible to both undergraduates and graduates; I usually ask the undergrads to read sections 1-4 and 9, and the grads to read the whole thing. If you would like to cite this review in your work, refer to it as follows:
Gouskova, Maria. 2016. Features in Phonology. [pdf] Ms., New York University.