Chula to create 80m-word Thai 'corpus'

Chulalongkorn University academics are working on a three-year effort to produce an 80-million-word corpus of modern Thai language to honour His Majesty the King's 80th birthday and National Thai Language Day next year.
A corpus is a computer-stored collection of written texts or recorded speech compiled for linguistic analysis. Its size is measured by the amount of sample text it contains, not by the number of lexical entries as in a dictionary. Many institutions have created corpora, including the 100-million-word British National Corpus (BNC), the 450-million-word Bank of English, the 100-million-word American National Corpus and the 34-million-word Hellenic National Corpus. They are used for various purposes, including dictionary-making: the UK publisher Collins' Cobuild dictionaries, Longman's Dictionary of Contemporary English and the Oxford dictionaries all draw on them.
Many Thai organisations have created corpora for internal use, but copyright issues have prevented them from allowing public access, said Asst Prof Wirote Aroonmanakun of the Department of Linguistics. Only the National Electronics and Computer Technology Centre's 400,000-word corpus is open to the public. Its data, however, is derived mostly from academic papers and some news articles, so it cannot represent the full range of the Thai language, said project manager Kachain Tansiri.
Seeing the need for a thorough and standardised corpus, the academics started the Thai National Corpus (TNC) in October last year, for completion in September 2008, said Wirote, who designed, standardised and created its programmes. The project was taken under the patronage of Her Royal Highness Princess Maha Chakri Sirindhorn, which brought the organisers Bt3 million in funds. The organisers, from the university's Department of Linguistics and Department of Electrical Engineering, will present the completed TNC to the Princess. Wirote said they would then consult with the Princess about its distribution: whether to use the Internet or a CD-ROM format.
The TNC would serve as a reference for Thai studies researchers, students and teachers, while boosting the country's image as one that values its national language and supports research on it with proper methodology and technology. The corpus saves time for researchers, who need not make up word samples themselves or mark samples in books, Kachain said. Its coverage gives them access to the language's real usage in all media, making their work more credible, he added.
Following the BNC model, the TNC contains 75 per cent informative texts and 25 per cent imaginative texts. No sample would exceed 40,000 words. Wirote gave more weight to informative texts because they were used more frequently in people's daily lives. The text samples were also classed by "Medium": 60 per cent books, 25 per cent periodicals, and the rest from miscellaneous sources (both published and unpublished). No text older than 1957 could be included, to ensure the items were not outdated. Kachain said the TNC would for now focus on written language, because recording speech and transcribing it into computer-readable text would require more time, money and manpower.
As for concerns over copyright, Kachain said viewers could not access the full text of a sample, which would be shown only as a continuous stretch. All words in the text samples would be marked by a computer programme with information such as type, the writers' gender, their ages and so forth. One obstacle to the work was the slowness of obtaining copyright permission to include selected text from books and publications that date back 10 years, said Kachain. This process began only a few months ago, and many lecturers have helped by contacting publishers and writers. Although some publishers, including Aksornsobhon Co Ltd and Viriyah Co Ltd, had agreed to participate in the project, much more text was needed, said Wirote.
He invited local writers and people with written texts to submit them at http://www.arts.chula.ac.th/~ling/TNC/
Premyuda Boonroj The Nation