About DAnIEL (Data Analysis for Information Extraction in any Language)


System description

DAnIEL is a fully automated press article analyzer. Its main purpose is multilingual epidemic surveillance. It uses character n-grams as the local grain of analysis and text structure as the global grain of decision. DAnIEL relies on the detection of maximal repeated strings (rstr-max), which can be computed in linear time. The Python implementation used in DAnIEL was mainly developed by Romain Brixtel; it is freely available on Google Code. Relying on character strings makes it possible to avoid language-specific grammars and analyzers (stemmers, POS taggers...). The text-level decision is based on stylistic properties of press articles. The two main properties are the 5W rule and journalists' collective style (see, for instance, Nadine Lucas, in French or in English).
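
To make the notion of a maximal repeated string concrete, here is a minimal sketch in Python. It is a naive, brute-force illustration and not the linear-time algorithm of py-rstr-max; the function name `maximal_repeats` and the `min_len` parameter are assumptions introduced for this example. A substring is kept when it occurs at least twice and cannot be extended by one character to the left or to the right without losing an occurrence.

```python
from collections import defaultdict


def maximal_repeats(text, min_len=2):
    """Naive maximal repeated strings: {substring: sorted start positions}.

    Illustrative only: this enumerates every substring, so it is far slower
    than a linear-time, suffix-structure-based implementation.
    """
    n = len(text)
    positions = defaultdict(list)
    # Record the start position of every substring of length >= min_len.
    for i in range(n):
        for j in range(i + min_len, n + 1):
            positions[text[i:j]].append(i)

    result = {}
    for sub, starts in positions.items():
        if len(starts) < 2:
            continue  # not repeated
        # Characters just before / just after each occurrence
        # (None stands for the beginning or the end of the text).
        left = {text[s - 1] if s > 0 else None for s in starts}
        right = {
            text[s + len(sub)] if s + len(sub) < n else None for s in starts
        }
        # If every occurrence has the same neighbour on one side, the repeat
        # can be extended on that side without losing occurrences, so it is
        # not maximal.
        if len(left) > 1 and len(right) > 1:
            result[sub] = starts
    return result
```

With this sketch, `maximal_repeats("banana")` returns `{"ana": [1, 3]}`: "an" and "na" are repeated too, but each can be extended to "ana" without losing an occurrence.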

These genre properties improve the quality of information filtering by giving clues to separate primary from secondary pieces of information. Because these principles are generic, specific to the genre rather than to individual languages, they reduce the cost of the description. The process used by DAnIEL can be compared to the human "skim-over" reading strategy. Checking repetitions at key positions, rather than analyzing each and every sentence of the text independently, allows quick and efficient data analysis.
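
The idea of checking repetitions at key positions can be sketched as follows. This is a simplified illustration under stated assumptions, not DAnIEL's actual decision procedure: it assumes that the key positions are the head of the article (its first paragraph) and the rest of the text, that a short list of disease names is available, and that a name counts as present when a large enough share of its characters is found as one contiguous string. The names `longest_common_substring`, `detect_repeated_names` and `min_ratio` are hypothetical.

```python
from difflib import SequenceMatcher


def longest_common_substring(a, b):
    """Longest contiguous character string shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def detect_repeated_names(paragraphs, disease_names, min_ratio=0.8):
    """Return the names found both in the head and in the body of an article.

    paragraphs    -- list of paragraph strings, head first (assumed layout)
    disease_names -- small vocabulary of names to look for (assumed resource)
    min_ratio     -- share of a name's characters that must match (assumed)
    """
    if len(paragraphs) < 2:
        return []
    head = paragraphs[0].lower()
    body = " ".join(paragraphs[1:]).lower()
    hits = []
    for name in disease_names:
        key = name.lower()
        if not key:
            continue  # ignore empty entries
        in_head = len(longest_common_substring(key, head)) / len(key)
        in_body = len(longest_common_substring(key, body)) / len(key)
        # Keep the name only if it recurs at both positions.
        if in_head >= min_ratio and in_body >= min_ratio:
            hits.append(name)
    return hits


article = [
    "Cholera outbreak reported in the capital",
    "Health officials confirmed 40 cases of cholera this week.",
    "Schools remain open.",
]
print(detect_repeated_names(article, ["cholera", "dengue"]))
```

Because the comparison works on character strings with a ratio below 1.0, a form that shares most of its characters with a listed name can still match, without a stemmer or any other language-specific analyzer.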

This approach is parsimonious because it limits the language-dependent resources involved. The marginal cost of covering a new language is small compared to existing approaches. Little vocabulary is needed to achieve results comparable to the state of the art.

Website

This website serves several purposes:
  • Corpus distribution: the documents and annotations used for the experiments can be downloaded here; more details are given on the corpus page
  • Annotation: collect annotators' opinions on documents (e.g. does this document describe an epidemic event?) in order to test and evaluate DAnIEL. For more information, see the guidelines
  • Publication: show DAnIEL results on different corpora. See the links "True positives", "False negatives"... above
  • Demonstration: allow the user to test DAnIEL
  • Real-time surveillance: deploy DAnIEL to cover web news in real time (to come)

DAnIEL-related publications

  • [AIIM_2015] Multilingual Event Extraction for Early Epidemic Detection, with Romain Brixtel, Antoine Doucet and Nadine Lucas, Artificial Intelligence in Medicine, p. 131-143, CORE "A" journal, Impact Factor 2.019 - Bib Pdf
  • [TALN_2015] Évaluation intrinsèque et extrinsèque du nettoyage de pages Web, with Romain Brixtel and Charlotte Lecluze, Traitement Automatique des Langues Naturelles (TALN) 2015, p. 411-417, CORE "C" conference - Bib Pdf
  • [ICHI_2013] Any Language Early Detection of Epidemic Diseases from Web News Streams, with Romain Brixtel, Antoine Doucet and Nadine Lucas, International Conference on Healthcare Informatics (ICHI) 2013, acceptance rate < 20% - Bib Pdf
  • [AIME_2013] Added-value of automatic multilingual text analysis for epidemic surveillance, Gaël Lejeune, Romain Brixtel, Charlotte Lecluze, Antoine Doucet and Nadine Lucas, in Artificial Intelligence in Medicine (AIME) 2013, p. 284-294, CORE "A" conference, acceptance rate (long articles): 27% - Bib Pdf
  • [TALN_2013] DAnIEL : Veille épidémiologique multilingue parcimonieuse, Gaël Lejeune, Romain Brixtel, Charlotte Lecluze, Antoine Doucet and Nadine Lucas, demonstration at the French NLP conference (TALN) 2013, p. 77-78, CORE "C" conference - Bib Pdf
  • [JapTAL_2012] DAnIEL: Language Independent Character-Based News Surveillance, with Romain Brixtel, Antoine Doucet and Nadine Lucas, Springer LNCS 2012, IX, 334. Lecture Notes in Artificial Intelligence, Vol 7614 - Bib Pdf
  • [Rhet-Trad_2012] Pour une approche cibliste en TAL: le cas de l'analyse automatique de la presse, with Christine Durieux, communication at the international symposium Rhetorics and Translation, Orléans, January 2012
  • [CLIA-COLING_2010] Filtering news for epidemic surveillance: towards processing more languages with fewer resources, with Antoine Doucet, Roman Yangarber and Nadine Lucas, The Fourth International Workshop on Cross Lingual Information Access, Coling 2010, pp 3-10, workshop of a CORE "A" conference - Bib Pdf
  • [JADT_2010] Tentative d'approche multilingue en extraction d'information, with Antoine Doucet and Nadine Lucas, JADT 2010, Rome, pp 1259-1268, CORE "C" conference - Bib Pdf
  • [MINUCS_2009] A proposal for a multilingual epidemic surveillance system, with Mohamed Hatmi, Antoine Doucet, Silja Huttunen and Nadine Lucas, Springer, LNCS 2010, Volume 40 Part 17, pp 343-348 - Bib Pdf
  • [AMICT_2009] Structure patterns in Information Extraction: a multilingual solution?, Advances in Methods of Information and Communication Technology (AMICT09), Volume 11, pp 105-111, Petrozavodsk, Russia, May 2009 - Bib Pdf

Other applications of the character-based approach:

  • [DEFT_2012] Détection de mots-clés par approches au grain caractère et au grain mot, with Gaëlle Doualan, Mathieu Boucher, Romain Brixtel and Gaël Dias, DEfi Fouille de Textes, TALN 2012, workshop of a CORE "C" conference
  • [DEFT_2011] Deft 2011: appariements de résumés et d'articles scientifiques fondés sur des distributions de chaînes de caractères, with Romain Brixtel, Emmanuel Giguet and Nadine Lucas, DEfi Fouille de Textes, TALN 2011, pp 53-64, workshop of a CORE "C" conference - Bib Pdf


Please report any problem to gael DOT lejeune AT unicaen DOT fr.

Known, and sometimes corrected, problems:

  • Slowness of the website: the SQL queries have been improved, so it should be better now
  • Display problems: these seem to be related to Internet Explorer and still have to be fixed; in the meantime you can try Firefox or Chrome instead
  • Strange characters: the database contained numerous encodings, which caused strange display; it is now fixed
  • Translation tool disappearing: you may have disabled or uninstalled Flash, which the Google Translate tool needs

Main versions:

Version | Analysis grain           | Number of languages | Added languages                               | Publications
1.0     | Words                    | 1                   | French                                        | AMICT_2009
2.0     | Characters (LCS)         | 3                   | English, Spanish                              | MINUCS_2009, JADT_2010
2.3     | Characters (LCS)         | 4                   | Chinese                                       | CLIA-COLING_2010
3.0     | Characters (py-rstr-max) | 7                   | Greek, Polish, Russian                        | JapTAL_2012, Rhet-Trad_2012
3.5     | Characters (py-rstr-max) | 17                  | Arabic, Czech, Finnish, German, ...           | ICHI_2013, AIME_2013, TALN_2013
4       | Characters (py-rstr-max) | 53                  | Hungarian, Japanese, Portuguese, Romanian ... | AIIM_2015, TALN_2015