9th May 2013, Adrian Scoica (adrian.scoica@gmail.com)

This corpus was compiled from the Romanian Wikipedia dump of the 8th of March 2013. During processing of the dump, only articles whose content was 10 KB or larger were kept. This leaves a total of 9668 articles, stored in two separate files of 5000 and 4668 articles, respectively.

Each file (holding at most 5000 articles) was processed in three stages:

(A) plain text (*.txt)

The plain text format contains the raw text content of the articles, after stripping away the textile markup. The textile markup was stripped using the Nokogiri Ruby library, by first creating an HTML representation of the document, then removing all