
Evaluation data for epidemic surveillance purposes

The BECorpus used for some evaluations can be downloaded at this URL: https://code.google.com/p/becorpus/. Information on the files still available online (102 as of June 15th, 2014) is available in JSON format.

The reference corpus built for evaluating DAnIEL can be downloaded here: dataniel. This archive contains the documents in HTML format as well as annotations in JSON format.

The corpus contains 2089 documents in 5 languages (Chinese, English, Greek, Polish and Russian). Each document has been manually cleaned in order to keep the text and the paragraph marks. Each file is encoded in UTF-8.
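As an illustration of how the cleaned documents might be read, here is a minimal Python sketch that extracts the paragraphs from one HTML document. It assumes that the paragraph marks are plain <p> elements, which is not stated on this page; the names ParagraphCollector, read_paragraphs and load_document are hypothetical and are not part of DAnIEL.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element (assumed to be the paragraph mark)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()       # an unclosed <p> ends where the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def flush(self):
        """Store the paragraph being read, if any, and leave paragraph mode."""
        if self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def read_paragraphs(html_text):
    """Return the list of non-empty paragraph texts found in an HTML string."""
    collector = ParagraphCollector()
    collector.feed(html_text)
    collector.close()
    collector.flush()  # keep a trailing paragraph that was never closed
    return collector.paragraphs


def load_document(path):
    """Read one corpus file as UTF-8 and return its paragraphs."""
    with open(path, encoding="utf-8") as handle:
        return read_paragraphs(handle.read())


sample = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
print(len(read_paragraphs(sample)))
```

If the archive marks paragraphs differently, only the tag name tested in handle_starttag and handle_endtag would need to change.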

This corpus has been annotated by native speakers not involved in DAnIEL's development. The guidelines given to our annotators can be found here.

Some statistics:

                                Chinese       English       Greek         Polish        Russian       Cumulated corpus
Number of documents (relevant)  446 (16)      475 (31)      390 (26)      352 (30)      426 (41)      2089 (144)
Length in paragraphs            4428          6791          3543          3512          2891          21165
  average ± standard deviation  9.9 ± 10.5    14.29 ± 7.23  9.08 ± 7.78   9.97 ± 6.95   6.78 ± 6.11   10.13 ± 8.3
Length in characters            1.14 × 10^6   1.35 × 10^6   2.05 × 10^6   1.04 × 10^6   1.56 × 10^6   7.17 × 10^6
  average ± standard deviation  2568 ± 2796   2858 ± 1611   5264 ± 5489   2971 ± 2188   3680 ± 5895   3432 ± 4085
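
The totals and per-document averages in the statistics table can be rechecked with a few lines of Python. The sketch below only reuses the document and paragraph counts printed in the table; the variable and function names are illustrative. The averages it prints may differ from the table in the last digit, presumably because of how the published values were rounded.

```python
# Counts copied from the statistics table (documents, relevant documents, paragraphs).
documents = {"Chinese": 446, "English": 475, "Greek": 390, "Polish": 352, "Russian": 426}
relevant = {"Chinese": 16, "English": 31, "Greek": 26, "Polish": 30, "Russian": 41}
paragraphs = {"Chinese": 4428, "English": 6791, "Greek": 3543, "Polish": 3512, "Russian": 2891}

# Whole-corpus totals, to compare with the "Cumulated corpus" column.
total_documents = sum(documents.values())
total_relevant = sum(relevant.values())
total_paragraphs = sum(paragraphs.values())


def mean_paragraphs(language):
    """Average number of paragraphs per document for one language."""
    return paragraphs[language] / documents[language]


for language in documents:
    print(f"{language}: {mean_paragraphs(language):.2f} paragraphs per document")

print(f"Whole corpus: {total_documents} documents, {total_relevant} relevant, "
      f"{total_paragraphs / total_documents:.2f} paragraphs per document")
```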