The BECorpus used for some evaluations can be downloaded at this URL: https://code.google.com/p/becorpus/. Information on the files still available online (102 as of June 15th, 2014) is accessible in JSON format.
The reference corpus built for evaluating DAnIEL can be downloaded here: dataniel. This archive contains the documents in HTML format as well as annotations in JSON format.
The corpus contains 2089 documents in 5 languages (Chinese, English, Greek, Polish and Russian). Each document has been manually cleaned in order to keep the text and the paragraph marks. Each file is encoded in UTF-8.
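As a rough illustration of how a document from the archive might be measured, the sketch below reads a UTF-8 HTML file and counts its paragraphs and characters using only the Python standard library. It assumes that the paragraph marks are `<p>` tags and that length in characters means the stripped text between tags; neither is stated on this page, and the names `ParagraphCounter`, `document_stats` and `stats_for_file` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class ParagraphCounter(HTMLParser):
    """Counts <p> start tags and text characters (assumed conventions)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        # Assumption: each paragraph mark is a <p> tag.
        if tag == "p":
            self.paragraphs += 1

    def handle_data(self, data):
        # Assumption: surrounding whitespace is not counted.
        self.chars += len(data.strip())


def document_stats(html_text):
    """Return (paragraph count, character count) for an HTML string."""
    parser = ParagraphCounter()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.paragraphs, parser.chars


def stats_for_file(path):
    """Read a corpus file as UTF-8 and compute its statistics."""
    return document_stats(Path(path).read_text(encoding="utf-8"))
```

Calling `stats_for_file` on each extracted file and aggregating per language would give figures comparable to the statistics on this page, provided the counting conventions assumed here match those actually used.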
This corpus has been annotated by native speakers not involved in DAnIEL's development. The guidelines given to our annotators can be found here.
Some statistics:
Statistic | Chinese | English | Greek | Polish | Russian | Cumulated corpus |
Number of documents (relevant) | 446 (16) | 475 (31) | 390 (26) | 352 (30) | 426 (41) | 2089 (144) |
Length in paragraphs | 4428 | 6791 | 3543 | 3512 | 2891 | 21165 |
Paragraphs per document (average ± standard deviation) | 9.9 ± 10.5 | 14.29 ± 7.23 | 9.08 ± 7.78 | 9.97 ± 6.95 | 6.78 ± 6.11 | 10.13 ± 8.3 |
Length in characters | 1.14 × 10^6 | 1.35 × 10^6 | 2.05 × 10^6 | 1.04 × 10^6 | 1.56 × 10^6 | 7.17 × 10^6 |
Characters per document (average ± standard deviation) | 2568 ± 2796 | 2858 ± 1611 | 5264 ± 5489 | 2971 ± 2188 | 3680 ± 5895 | 3432 ± 4085 |
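The per-language figures can be cross-checked against the cumulated column with a few lines of arithmetic. The sketch below copies the document and paragraph counts from the table and recomputes the totals and the average number of paragraphs per document; the variable and function names are illustrative only, and the standard deviations cannot be recomputed this way because they require the per-document lengths.

```python
# Counts copied from the statistics table.
documents = {"Chinese": 446, "English": 475, "Greek": 390,
             "Polish": 352, "Russian": 426}
paragraphs = {"Chinese": 4428, "English": 6791, "Greek": 3543,
              "Polish": 3512, "Russian": 2891}
# Averages as reported in the table, for comparison.
reported_mean = {"Chinese": 9.9, "English": 14.29, "Greek": 9.08,
                 "Polish": 9.97, "Russian": 6.78}


def mean_paragraphs(language):
    """Average number of paragraphs per document for one language."""
    return paragraphs[language] / documents[language]


total_documents = sum(documents.values())    # 2089
total_paragraphs = sum(paragraphs.values())  # 21165
overall_mean = total_paragraphs / total_documents

if __name__ == "__main__":
    for language in documents:
        print(language, round(mean_paragraphs(language), 2),
              "reported:", reported_mean[language])
    print("Cumulated:", total_documents, total_paragraphs,
          round(overall_mean, 2))
```

The recomputed averages agree with the reported ones to within rounding, and the overall average comes out at 10.13 paragraphs per document, matching the cumulated column.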