Title: A two-level structure for compressing aligned bitexts
Authors: Adiego Rodríguez, Joaquín
Rodríguez Brisaboa, Nieves
Martínez Prieto, Miguel Ángel
Sánchez Martínez, Felipe
Date: 2010-05-10
2009-08
Publisher: RUA Docencia
Source:
Type: info:eu-repo/semantics/bookPart
Subject: Boosting compression
Parallel texts
Bitexts
Biwords
ETDC compressor
Computer Science and Artificial Intelligence
Description: A bitext, or bilingual parallel corpus, consists of two texts, each in a different language, that are mutual translations. Bitexts are very useful in linguistic engineering because they serve as a source of knowledge for different purposes. In this paper we propose a strategy to efficiently compress and use bitexts, saving not only space but also processing time when exploiting them. Our strategy is based on a two-level structure for the vocabularies and on the use of biwords (pairs of associated words, one from each language) as the basic symbols to be encoded with an ETDC compressor. The resulting compressed bitext needs around 20% of the original space and allows more efficient implementations of the different types of searches and operations that linguistic engineering needs to perform on bitexts. We discuss and provide results for compression, decompression, different types of searches, and bilingual snippet extraction.
Spanish projects TIN2006-15071-C03-01, TIN2006-15071-C03-02 and TIN2006-15071-C03-03. Regional Government of Castilla y León and the European Social Fund.
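The description's core idea, treating each biword as one symbol and giving frequent symbols short byte codes, can be illustrated with a minimal sketch. It assumes that ETDC stands for End-Tagged Dense Code, that the word alignment is already available as a list of one-to-one pairs, and that a single flat vocabulary is good enough for illustration. It does not reproduce the paper's two-level vocabulary structure, and all function names are hypothetical.

```python
from collections import Counter


def etdc_codeword(rank):
    """Return the End-Tagged Dense Code codeword for a 0-based frequency rank.

    The last byte of every codeword has its high bit set (value >= 128);
    all earlier bytes are below 128, so codeword boundaries are self-evident.
    """
    out = [128 + rank % 128]      # final (tagged) byte
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense code: no byte combination is wasted
        out.append(rank % 128)
        rank //= 128
    return bytes(reversed(out))


def compress(biwords):
    """Rank biwords by descending frequency and encode the sequence.

    Assumption: one flat vocabulary of (source, target) pairs,
    not the two-level structure described in the paper.
    """
    counts = Counter(biwords)
    vocab = sorted(counts, key=lambda bw: -counts[bw])   # most frequent first
    rank_of = {bw: i for i, bw in enumerate(vocab)}
    data = b"".join(etdc_codeword(rank_of[bw]) for bw in biwords)
    return vocab, data


def decompress(vocab, data):
    """Invert compress(): rebuild the biword sequence from the byte stream."""
    result, rank = [], 0
    for byte in data:
        if byte >= 128:           # tagged byte closes the current codeword
            result.append(vocab[rank * 128 + (byte - 128)])
            rank = 0
        else:                     # continuation byte
            rank = rank * 128 + byte + 1
    return result


# Hypothetical aligned English-Spanish biwords, for illustration only.
sample = [("the", "la"), ("house", "casa"), ("the", "la"),
          ("green", "verde"), ("house", "casa"), ("the", "la")]
vocab, data = compress(sample)
restored = decompress(vocab, data)
```

Because every codeword ends in a tagged byte, a search for a given biword can scan the compressed bytes directly without decompressing them first, which is in the spirit of the search operations the description mentions.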
Language: English

Similar articles:

Choosing the correct paradigm for unknown words in rule-based machine translation systems, by Sánchez Cartagena, Víctor Manuel; Esplà Gomis, Miquel; Sánchez Martínez, Felipe; Pérez Ortiz, Juan Antonio
Using external sources of bilingual information for on-the-fly word alignment, by Esplà Gomis, Miquel; Sánchez Martínez, Felipe; Forcada Zubizarreta, Mikel L.