United Nations Parallel Corpus

Introduction

The United Nations Parallel Corpus v1.0 is composed of official records and other parliamentary documents of the United Nations that are in the public domain. These documents are mostly available in the six official languages of the United Nations. The current version of the corpus contains content that was produced and manually translated between 1990 and 2014, including sentence-level alignments.

The corpus was created as part of the United Nations commitment to multilingualism and as a reaction to the growing importance of statistical machine translation (SMT) within the translation services and the United Nations SMT system, Tapta4UN.

The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as language-specific bi-texts and as a six-language parallel corpus subset.

When using the United Nations Parallel Corpus, the user must acknowledge the United Nations as the source of the information. When making reference to the United Nations Parallel Corpus, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.

For further enquiries, please contact unovgtextsupport@un.org.

Download

Corpus statistics

Statistics for pair-wise aligned documents:

    ar           en           es           fr           ru           zh
ar  -            111,241      113,065      112,605      111,896      91,345
                 18,539,207   18,578,118   18,281,635   18,863,363   15,595,948
en  456,552,223  -            123,844      149,741      133,089      91,028
    512,087,009               21,911,121   25,805,088   23,239,280   15,886,041
es  459,383,823  590,672,799  -            125,098      115,921      91,704
    593,671,507  678,778,068               21,915,504   19,993,922   15,428,381
fr  452,833,187  668,518,779  674,477,239  -            133,510      91,613
    597,651,233  782,912,487  688,418,806               22,381,416   15,206,689
ru  462,021,954  601,002,317  623,230,646  691,062,370  -            92,337
    491,166,055  569,888,234  513,100,827  557,143,420               16,038,721
zh  387,968,412  425,562,909  493,338,256  498,007,502  417,366,738  -
    387,931,939  381,371,583  382,052,741  377,884,885  392,372,764

The cells above the diagonal contain the number of documents and lines per language pair. The cells below the diagonal contain the number of tokens in a language pair. The upper number refers to the language in the column title, the lower number to the language in the row title. Tokens were counted after processing with the Moses tokenizer. For Chinese, Jieba was used before applying the Moses tokenizer with default settings.

Document statistics:

Total documents   Aligned document pairs
799,276           1,727,539

Fully aligned subcorpus statistics:

Documents   Lines        English tokens
86,307      11,365,709   334,953,817

Disclaimer and terms of use

The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):

  • The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
  • Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user's sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user's sole and exclusive remedy is to discontinue using the United Nations Corpus.
  • When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.
  • Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.

File organization and format

All documents are organized into folders by language, publication year, and publication symbol. Corresponding documents are placed in parallel folder structures, and a document's translation into any of the official languages (if it exists) can be found by inspecting the same file path in the required language subfolder.
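The parallel folder layout means that locating a translation is a matter of swapping the language component of a path. The following is a minimal sketch of that idea, assuming the language folder is the first path component and is named with the two-letter codes used in the statistics table. The example path and the function name translation_path are hypothetical, not taken from the corpus distribution.

```python
from pathlib import PurePosixPath

# Two-letter codes for the six official languages, as used in the statistics table.
LANGUAGES = {"ar", "en", "es", "fr", "ru", "zh"}


def translation_path(path: str, target_lang: str) -> str:
    """Return the parallel path for target_lang.

    Assumes (hypothetically) that the first path component is the
    language folder and that the rest of the path is identical
    across languages.
    """
    parts = PurePosixPath(path).parts
    if not parts or parts[0] not in LANGUAGES:
        raise ValueError(f"path does not start with a language folder: {path!r}")
    if target_lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {target_lang!r}")
    return str(PurePosixPath(target_lang, *parts[1:]))


# Hypothetical path; the real folder and file names may differ.
print(translation_path("en/2010/a/64/100.xml", "fr"))
```

Because a translation exists only for some documents, a caller would still need to check that the returned path is actually present on disk.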

For individual documents, it was decided to follow the TEI-based format of the JRC-Acquis parallel corpus. Documents retain the original paragraph structure, and sentence splits have been added automatically. Documents for which multiple language versions exist have corresponding linked files for each of the language pairs, of which there are 15 at most.
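The figure of 15 language pairs follows from choosing 2 languages out of 6 without regard to order. A short check, using the language codes from the statistics table:

```python
from itertools import combinations

# The six official languages, using the codes from the statistics table.
LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]

# Every unordered pair of distinct languages.
pairs = list(combinations(LANGUAGES, 2))

print(len(pairs))
for first, second in pairs:
    print(f"{first}-{second}")
```

A document available in fewer than six languages has correspondingly fewer linked files; for example, four language versions give 6 pairs.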

In addition to the one-file-per-document type of distribution, we also make available plain-text bi-texts that span all documents for a specific language pair and can be used more readily with SMT training pipelines.
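As an illustration of how such bi-texts are typically consumed, the sketch below pairs up two plain-text files line by line. It assumes one segment per line, with line i of the source file aligned to line i of the target file; this layout and the file names in the usage comment are assumptions, not details confirmed by this page.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def pair_lines(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) segments, failing if the line counts differ."""
    for number, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"line count mismatch at line {number}")
        yield src.rstrip("\n"), tgt.rstrip("\n")


def read_bitext(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Stream aligned segment pairs from two UTF-8 text files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        yield from pair_lines(src, tgt)


# Hypothetical usage; the actual file names in the distribution may differ:
# for en, fr in read_bitext("corpus.en-fr.en", "corpus.en-fr.fr"):
#     ...
```

Streaming the files rather than loading them whole keeps memory use flat, which matters for pairs with more than 20 million lines.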

For further details about the preparation process of the Corpus, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.

Test and development sets

Data from documents released in 2015 were set aside, and official development and test sets were created across all language pairs. Of these documents, 100 were randomly selected: 50 for the development set and 50 for the test set. As in the case of the fully aligned subcorpus, all development and test set sentences are available for all official languages, and any translation direction can be evaluated.
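For readers who want to build a comparable held-out split on their own data, the sketch below draws 100 documents at random and divides them evenly. It is an illustration only, not the procedure used to create the official sets; the seed, the document identifiers, and the function name split_dev_test are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dev_test(doc_ids: Sequence[str], sample_size: int = 100, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly pick sample_size documents and split them evenly into dev and test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = rng.sample(list(doc_ids), sample_size)  # sampling without replacement
    half = sample_size // 2
    return chosen[:half], chosen[half:]


# Hypothetical identifiers standing in for held-out documents.
docs = [f"doc-{i:04d}" for i in range(1000)]
dev, test = split_dev_test(docs)
print(len(dev), len(test))
```

Sampling whole documents, rather than individual sentences, keeps all sentences of a document on the same side of the split.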

For machine translation baselines, please see Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.

Document metadata

Every document in XML file format has embedded meta-information:

Symbol
Each United Nations document has a unique symbol. All language versions of a document have the same symbol. Symbols include both letters and numbers. Some elements of the symbol have meaning, while others do not. In general, the symbol does not necessarily indicate the topic of the document.
Translation job number
This is a unique, language-specific document identifier.
Publication date
This is the original publication date for a document by symbol, which applies to all language versions. This date does not necessarily correspond to the release date of each individual document.
Processing place
Possible locations are New York, Geneva and Vienna.
Keywords
These include any number of subjects covered by the document, according to the ODS subject lexicon, which is based on the