United Nations Parallel Corpus
Introduction
The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations that are in the public domain. Most of these documents are available in all six official languages of the United Nations. The current version of the corpus contains content that was produced and manually translated between 1990 and 2014, including sentence-level alignments.
The corpus was created as part of the United Nations commitment to multilingualism and as a reaction to the growing importance of statistical machine translation (SMT) within the translation services and the United Nations SMT system, Tapta4UN.
The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as language-specific bi-texts and as a six-language parallel corpus subset.
When using the United Nations Parallel Corpus, the user must acknowledge the United Nations as the source of the information. When making reference to the United Nations Parallel Corpus, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.
For further enquiries, please contact unovgtextsupport@un.org.
Download
Corpus statistics
Statistics for pair-wise aligned documents:
| | ar | en | es | fr | ru | zh |
---|---|---|---|---|---|---|
ar | – | 111,241 | 113,065 | 112,605 | 111,896 | 91,345 |
en | 456,552,223 | – | 123,844 | 149,741 | 133,089 | 91,028 |
es | 459,383,823 | 590,672,799 | – | 125,098 | 115,921 | 91,704 |
fr | 452,833,187 | 668,518,779 | 674,477,239 | – | 133,510 | 91,613 |
ru | 462,021,954 | 601,002,317 | 623,230,646 | 691,062,370 | – | 92,337 |
zh | 387,968,412 | 425,562,909 | 493,338,256 | 498,007,502 | 417,366,738 | – |
The cells above the diagonal contain the number of documents and lines per language pair. The cells below the diagonal contain the number of tokens in a language pair. The upper number refers to the language in the column title, the lower number to the language in the row title. Tokens were counted after processing with the Moses tokenizer. For Chinese, Jieba was used before applying the Moses tokenizer with default settings.
Total documents | Aligned document pairs |
---|---|
799,276 | 1,727,539 |
Statistics for the fully aligned subcorpus:
Documents | Lines | English tokens |
---|---|---|
86,307 | 11,365,709 | 334,953,817 |
Disclaimer and terms of use
The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):
- The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
- Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user's sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user's sole and exclusive remedy is to discontinue using the United Nations Corpus.
- When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.
- Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.
File organization and format
All documents are organized into folders by language, publication year, and publication symbol. Corresponding documents are placed in parallel folder structures, and a document's translation into any of the official languages (if it exists) can be found by inspecting the same file path in the required language subfolder.
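The parallel folder layout described above means a translation can be located by swapping the language component of a path. The following is a minimal sketch of that idea, not part of the corpus tooling. The example path `corpus/en/2010/a/65/1.xml` and the function name `counterpart_path` are hypothetical, and the sketch assumes that the first path component matching a language code is the language folder.

```python
from pathlib import PurePosixPath

# The six official languages, as used in the statistics table.
LANGUAGES = ("ar", "en", "es", "fr", "ru", "zh")


def counterpart_path(doc_path: str, target_lang: str) -> str:
    """Return the path of the same document under another language folder.

    Assumption: the first path component that is a language code
    identifies the language subfolder; everything else is shared.
    """
    if target_lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {target_lang}")
    parts = PurePosixPath(doc_path).parts
    for i, part in enumerate(parts):
        if part in LANGUAGES:
            # Swap only the language component, keep year and symbol path.
            return str(PurePosixPath(*parts[:i], target_lang, *parts[i + 1:]))
    raise ValueError(f"no language folder found in: {doc_path}")
```

Because a translation may not exist for every language, a caller would still need to check that the returned path is actually present on disk.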
For individual documents, it was decided to follow the TEI-based format of the JRC-Acquis parallel corpus. Documents retain the original paragraph structure, and sentence splits have been added automatically. Documents for which multiple language versions exist have corresponding linked files for each of the language pairs, of which there are 15 at most.
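The figure of 15 follows from choosing 2 languages out of 6 without regard to order. As a small illustration (the helper name `language_pairs` is made up for this sketch), the pairs for which link files could exist can be enumerated from the set of language versions a document actually has:

```python
from itertools import combinations

LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]


def language_pairs(available):
    """Return all unordered language pairs among the available versions."""
    # Keep a fixed language order so each pair is listed only once.
    ordered = [lang for lang in LANGUAGES if lang in available]
    return list(combinations(ordered, 2))


all_pairs = language_pairs(LANGUAGES)
print(len(all_pairs))  # 15
```

A document that exists in only three languages would have three such pairs rather than 15.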
In addition to the one-file-per-document type of distribution, we also make available plain-text bi-texts that span all documents for a specific language pair and can be used more readily with SMT training pipelines.
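A plain-text bi-text is commonly stored as two files in which line N of one file is the translation of line N of the other. Assuming that layout holds for these bi-texts, a minimal reader could look like the sketch below. The function name `read_bitext` and any file names passed to it are hypothetical.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_bitext(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip_longest(src, tgt):
            # A missing line on either side means the files are not aligned.
            if src_line is None or tgt_line is None:
                raise ValueError("bi-text files have different line counts")
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")
```

Reading lazily in this way avoids loading a full language pair into memory, which matters given the line counts reported in the statistics above.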
For further details about the preparation process of the Corpus, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.
Test and development sets
Data from documents released in 2015 were set aside, and official development and test sets were created across all language pairs. Of these documents, 100 were randomly selected, 50 each for the development set and the test set. As in the case of the fully aligned subcorpus, all development and test set sentences are available in all official languages, so any translation direction can be evaluated.
For machine translation baselines, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.
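The selection described above can be illustrated with a short sketch. This is not the procedure the corpus authors used; the function name `split_dev_test`, the seed, and the document identifiers are all assumptions made for illustration.

```python
import random


def split_dev_test(doc_ids, n_total=100, seed=0):
    """Randomly pick n_total documents and split them evenly into dev and test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = rng.sample(list(doc_ids), n_total)  # sampling without replacement
    half = n_total // 2
    return chosen[:half], chosen[half:]
```

Splitting at the document level, rather than the sentence level, keeps all sentences of a document in the same set.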
Document metadata
Every document in XML file format has embedded meta-information:
- Symbol
- Each United Nations document has a unique symbol. All language versions of a document have the same symbol. Symbols include both letters and numbers. Some elements of the symbol have meaning, while others do not. In general, the symbol does not necessarily indicate the topic of the document.
- Translation job number
- This is a unique, language-specific document identifier.
- Publication date
- This is the original publication date for a document by symbol, which applies to all language versions. This date does not necessarily correspond to the release date of each individual document.
- Processing place
- Possible locations are New York, Geneva and Vienna.
- Keywords
- These include any number of subjects covered by the document, according to the ODS subject lexicon, which is based on the