Resources

Machine Translation

  • Compare different Machine Translation paradigms. You can download the latest data from its GitHub repository. More details in our EACL17 paper.
  • Tuning Statistical Machine Translation on character sequences. An extension to SMT toolkits to tune on character sequences (rather than the default word sequences as used by e.g. BLEU) by adding the chrF1 evaluation metric. Available as part of the Joshua SMT toolkit. More details in our WMT16 paper.
  • DELiC4MT. A software tool for diagnostic evaluation of Machine Translation systems over linguistic checkpoints. Visit its website.
  • Webservices for aligners. Source code of webservices for several sentential and subsentential aligners and workflows. Details can be found in my paper “Towards a User-Friendly Webservice Architecture for Statistical Machine Translation in the PANACEA project (EAMT 2011)”. [tgz].
  • Catalan<->Italian Machine Translation system. Collaboration with the Apertium project to create a translator for the language pair Catalan-Italian by exploiting the existing translators for the pairs Spanish<->Catalan and Spanish<->Italian. You can download the latest data from the SVN repository.
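
As a rough illustration of the character-level metric mentioned in the tuning entry above, here is a minimal sketch of a chrF-style score with beta = 1. It is not the Joshua implementation: the function names, the default maximum n-gram order of 6, and the choice to ignore spaces are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n (assumption: spaces are ignored)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=1.0):
    """Character n-gram F-score; beta=1.0 gives a chrF1-style score.

    Hypothetical helper for illustration only. Precision and recall are
    averaged over the n-gram orders that both strings can supply.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Calling `chrf("the cat", "the cats")` gives a score strictly between 0 and 1. Because matching is done on characters, a near-miss such as a missing plural still earns partial credit, which a word-level metric would not give.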

Older stuff (warning: not actively maintained!)

  • Linking Wikipedia categories to WordNet synsets. A set of polysemous nouns from WordNet 2.1 which are mapped to Wikipedia categories. The disambiguation task should then identify, for each noun, which of its senses, if any, corresponds to the mapped category or categories. Download the evaluation data (20090714).
  • tfidfwrap provides a TF-IDF C++ class by wrapping tfidf (TF-IDF library in Python, http://code.google.com/p/tfidf/) which: “constructs an IDF corpus and stopword list either from documents specified by the client, or by reading from input files. It computes IDF for a specified term based on the corpus, or generates keywords ordered by tf-idf for a specified document”. Download the source [tgz] and the corpus and stopword TF-IDF files generated from the English
    Wikipedia (dump from January 2008) [tgz], the Italian Wikipedia [tgz] and the Spanish Wikipedia [tgz].
  • wiki_db_access (C++ Wikipedia API). A free software (GPL licensed) package that includes a C++ API (tested under GNU/Linux and Win32) to access Wikipedia in database format and utilities to download and import the required data. Download the source [tgz] (version 20100324).
  • Manually disambiguated mappings between WordNet 1.5 and WordNet 3.0: a set of manually disambiguated mappings as a result of the upgrading of Inter-Lingual connections of the Italian WordNet from version 1.5 to 3.0. Download the mappings [tgz].
  • DRAMNERI. A free software (GPL licensed) application for Named Entity Recognition (NER). This is a knowledge-based and customizable tool to perform NER. It is fully documented and has been successfully tested under GNU/Linux and Win32, although it should work on any platform where a C++ compiler with STL support is available.
    Download the source with documentation and examples [tgz], win32 binaries [tgz] or linux binaries [tgz] (version 0.2.1).
  • WinDRAMNERI. A freeware Spanish frontend for DRAMNERI (v.0.2.x) which runs under Win32 platforms. It may need additional DLL and OCX files (the application will print a message asking for them if so). Developed and contributed by Carlos Leonel Chinchilla Calvo (clchinchilla(–at–)gmail.com). Download the executable [exe].
  • Tagged entries of the Simple English Wikipedia. 3517 randomly selected entries manually tagged with NER categories (NONE, LOC, ORG, PER). Download as a compressed plain text file [tgz].
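
For readers unfamiliar with what the TF-IDF entry above computes, the following is a minimal sketch of the idea in the quoted description: build document frequencies from a corpus, compute the IDF of a term, and rank the keywords of a document by tf-idf. It is not the API of tfidf or tfidfwrap. The class name `TinyTfIdf`, its methods, the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for this example.

```python
import math
from collections import Counter


class TinyTfIdf:
    """Hypothetical, simplified TF-IDF model (not the tfidf library's API)."""

    def __init__(self, documents, stopwords=()):
        self.stopwords = set(stopwords)
        self.num_docs = 0
        self.doc_freq = Counter()  # term -> number of documents containing it
        for doc in documents:
            self.add_document(doc)

    def tokenize(self, text):
        # Assumption: lowercase whitespace tokenization with stopwords removed.
        return [t for t in text.lower().split() if t not in self.stopwords]

    def add_document(self, text):
        self.num_docs += 1
        self.doc_freq.update(set(self.tokenize(text)))

    def idf(self, term):
        # Smoothed IDF (an assumption), defined even for unseen terms.
        return math.log((1 + self.num_docs) / (1 + self.doc_freq[term.lower()]))

    def keywords(self, text):
        """Return (term, score) pairs ordered by descending tf-idf."""
        tokens = self.tokenize(text)
        if not tokens:
            return []
        counts = Counter(tokens)
        scores = {t: (c / len(tokens)) * self.idf(t) for t, c in counts.items()}
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


model = TinyTfIdf(["the cat sat", "the dog sat", "the bird flew"], stopwords={"the"})
top_term = model.keywords("the bird sat")[0][0]
```

In this toy corpus "bird" appears in one document and "sat" in two, so "bird" receives the higher IDF and ranks first among the keywords of "the bird sat".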