6th June 2013, Adrian Scoica (adrian.scoica@gmail.com)

This corpus was compiled from the German Wikipedia dump of the 9th of March 2013. During the dump processing, only articles whose content was at least 10 KB in size were kept. This leaves a total of 105820 articles, which are stored in separate files in groups of 5000 articles per file (with the exception of one group, which contains only 820 articles).

Each of the files, representing at most 5000 articles, was processed in three stages:

(A) plain text (*.txt)

The plain text format contains the raw text content of the articles after the wiki markup has been stripped away. The markup was stripped using the Nokogiri ruby library, by first creating an HTML representation of the document, then removing all
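
For illustration, the sketch below reconstructs the rough shape of this pipeline in Ruby. It is not the code that built the corpus: the helper names (render_to_html, strip_article_text, process) are hypothetical, render_to_html is a placeholder for whatever step produced the HTML representation of each article, and the exact set of elements removed before text extraction is not specified in this README.

    require 'nokogiri'

    MIN_BYTES  = 10 * 1024   # keep only articles of at least 10 KB
    GROUP_SIZE = 5000        # articles per output file

    # Placeholder for the (unspecified) markup-to-HTML conversion step.
    def render_to_html(wikitext)
      "<html><body><p>#{wikitext}</p></body></html>"
    end

    # Parse the HTML rendering of an article with Nokogiri and keep
    # only its visible text; the element list here is illustrative.
    def strip_article_text(html)
      doc = Nokogiri::HTML(html)
      doc.css('script, style').each(&:remove)  # drop non-content nodes
      doc.text.strip
    end

    # Filter by size, then write groups of GROUP_SIZE articles per file.
    def process(articles)
      kept = articles.select { |a| a.bytesize >= MIN_BYTES }
      kept.each_slice(GROUP_SIZE).with_index do |group, i|
        File.open(format('articles_%02d.txt', i + 1), 'w') do |f|
          group.each { |a| f.puts strip_article_text(render_to_html(a)) }
        end
      end
    end

    # Example: a single article large enough to pass the 10 KB filter.
    process(['Beispieltext. ' * 1000])

With 105820 articles and GROUP_SIZE = 5000, this grouping yields 21 full files plus one final file of 820 articles, matching the counts given above.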