9th May 2013, Adrian Scoica (adrian.scoica@gmail.com)

This corpus was compiled from the Spanish Wikipedia dump of the 8th of March 2013. During processing of the dump, only articles with content 10 KB or larger in size were kept. This leaves a total of 72764 articles, which are stored in separate files in groups of 5000 articles per file (with the exception of one group, which has only 2764 articles). Each file, representing at most 5000 articles, was processed in three stages:

(A) plain text (*.txt)

The plain text format contains the raw text content of the articles, after stripping away the textile markup. The textile markup was stripped using the Nokogiri Ruby library, by first creating an HTML representation of the document, then removing all