The BECorpus used for some evaluations can be downloaded at this URL: https://code.google.com/p/becorpus/. Information on the files still available online (102 as of June 15th, 2014) is accessible in JSON format.
The reference corpus built for evaluating DAnIEL can be downloaded here: dataniel. This archive contains the documents in HTML format as well as annotations in JSON format.
The corpus contains 2089 documents in 5 languages (Chinese, English, Greek, Polish and Russian). Each document has been manually cleaned in order to keep the text and the paragraph marks. Each file is encoded in UTF-8.
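As a minimal sketch of how the paragraph and character counts of a cleaned document could be measured, the Python snippet below uses only the standard library. It assumes that paragraph marks are stored as `<p>` elements, which the corpus description does not confirm; `ParagraphCounter`, `document_stats` and `file_stats` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts <p> start tags and sums the length of the text nodes."""

    def __init__(self):
        super().__init__()
        self.paragraphs = 0
        self.characters = 0

    def handle_starttag(self, tag, attrs):
        # Assumption: one <p> element per paragraph mark.
        if tag == "p":
            self.paragraphs += 1

    def handle_data(self, data):
        # Leading and trailing whitespace of each text node is ignored.
        self.characters += len(data.strip())


def document_stats(html_text):
    """Return (paragraph count, character count) for one HTML string."""
    parser = ParagraphCounter()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs, parser.characters


def file_stats(path):
    """Same as document_stats, reading the file as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return document_stats(handle.read())


if __name__ == "__main__":
    sample = "<html><body><p>First paragraph.</p><p>Second one.</p></body></html>"
    print(document_stats(sample))
```

Counts obtained this way may differ from the published figures, since the exact counting rules (whitespace, markup, titles) are not specified here.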
This corpus has been annotated by native speakers not involved in DAnIEL's development. The guidelines given to our annotators can be found here.
Some statistics:
|Language||Chinese||English||Greek||Polish||Russian||Total|
|Number of documents (relevant)||446 (16)||475 (31)||390 (26)||352 (30)||426 (41)||2089 (144)|
|Length in paragraphs||4428||6791||3543||3512||2891||21165|
|Paragraphs per document (mean ± standard deviation)||9.9 ± 10.5||14.29 ± 7.23||9.08 ± 7.78||9.97 ± 6.95||6.78 ± 6.11||10.13 ± 8.3|
|Length in characters||1.14 × 10^6||1.35 × 10^6||2.05 × 10^6||1.04 × 10^6||1.56 × 10^6||7.17 × 10^6|
|Characters per document (mean ± standard deviation)||2568 ± 2796||2858 ± 1611||5264 ± 5489||2971 ± 2188||3680 ± 5895||3432 ± 4085|
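The totals and per-language ratios in the table can be rechecked with a few lines of Python. This is only a sketch: it assumes that the five data columns follow the language order given earlier (Chinese, English, Greek, Polish, Russian) and that the last column is the total. The figures are copied from the table; the variable names are hypothetical.

```python
# Per-language counts copied from the statistics table.
# Assumption: columns are in the order Chinese, English, Greek, Polish, Russian.
documents = {"Chinese": 446, "English": 475, "Greek": 390, "Polish": 352, "Russian": 426}
relevant = {"Chinese": 16, "English": 31, "Greek": 26, "Polish": 30, "Russian": 41}
paragraphs = {"Chinese": 4428, "English": 6791, "Greek": 3543, "Polish": 3512, "Russian": 2891}

total_docs = sum(documents.values())
total_relevant = sum(relevant.values())
total_paragraphs = sum(paragraphs.values())

for lang in documents:
    share = 100 * relevant[lang] / documents[lang]
    mean_paragraphs = paragraphs[lang] / documents[lang]
    print(f"{lang}: {share:.1f}% relevant, {mean_paragraphs:.2f} paragraphs per document")

print(f"Total: {total_docs} documents, {total_relevant} relevant, "
      f"{total_paragraphs / total_docs:.2f} paragraphs per document")
```

The sums reproduce the totals in the last column (2089 documents, 144 relevant, 21165 paragraphs), and relevant documents make up roughly 7% of the corpus.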