A first step (!) in developing an inductive general grammar is to obtain complete digital representations of a sufficiently large number of languages. There are publicly available annotated corpora (e.g., treebanks) for roughly 70 languages. That represents about 1% of the world's languages, and most of those 70 come from a single language family (Indo-European). One avenue toward obtaining digitizations of more languages (and, by definition, of low-resource languages) is to collaborate with speakers. I have been developing software called CLD that is intended to support such collaboration.
- Seal download - which includes CLD.
Quick start:
- Untar it, call the resulting directory $seal
- Put $seal/bin on your PATH, $seal/python on your PYTHONPATH
- $ cld foo.cld create_test
  creates two subdirectories in the working directory, 'foo.cld' and 'media'. They contain a tiny fake corpus.
- $ cld foo.cld
  starts a Python web server on port 8000, which writes logging information to stdout, and opens a browser window on localhost:8000.
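The environment setup in the quick start can be sketched as follows. This is only an illustration: the location $HOME/seal is an assumption (use whatever directory the archive actually unpacked into), and the lines would typically go in a shell startup file.

```shell
# Assumption: the Seal archive was unpacked into $HOME/seal.
# If $seal is already set, that value is kept instead.
seal="${seal:-$HOME/seal}"

# Make the cld command findable.
export PATH="$seal/bin:$PATH"

# Make the Python modules importable; append any existing
# PYTHONPATH only if it is non-empty, to avoid a stray colon.
export PYTHONPATH="$seal/python${PYTHONPATH:+:$PYTHONPATH}"
```

With these variables set, the two commands above (cld foo.cld create_test, then cld foo.cld) should be found on the PATH.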
- [80] CLD: Software for Computational Language Description - a draft paper, likely to undergo further revision.