A Kotlin project which extracts ngram counts from Wikipedia data dumps.
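For a rough idea of what the tool computes, here is a minimal, illustrative Kotlin sketch of extracting ngram counts (bigrams here) from a single sentence. It is not the project's actual implementation, and the whitespace tokenization is deliberately naive:

fun ngramCounts(sentence: String, n: Int): Map<String, Int> {
    // Naive whitespace tokenization; the real tool does considerably more work.
    val tokens = sentence.lowercase().split(Regex("\\s+")).filter { it.isNotBlank() }
    // Slide a window of size n over the tokens and count each distinct ngram.
    return tokens.windowed(n) { it.joinToString(" ") }
        .groupingBy { it }
        .eachCount()
}

fun main() {
    println(ngramCounts("the quick brown fox jumps over the lazy dog the quick fox", 2))
}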
Download the latest jar from releases.
You can also clone the repository and build with Maven:
$ git clone https://github.com/TomerAberbach/wikipedia-ngrams.git
$ cd wikipedia-ngrams
$ mvn package
A fat jar called wikipedia-ngrams-VERSION-jar-with-dependencies.jar will be in a newly created target directory.
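The commands below refer to the jar as wikipedia-ngrams.jar, so you may want to copy the built fat jar to that name (the exact versioned filename is whatever mvn produced):
$ cp target/wikipedia-ngrams-*-jar-with-dependencies.jar wikipedia-ngrams.jar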
DISCLAIMER: Many of these commands will take a very long time to run.
Download the latest Wikipedia data dump using wget:
$ wget -np -nd -c -A bz2 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Or using axel:
$ axel --num-connections=3 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
To speed up the download, replace https://dumps.wikimedia.org with the mirror closest to you.
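For example, with MIRROR_HOST as a placeholder for a mirror chosen from Wikimedia's mirror list (https://dumps.wikimedia.org/mirrors.html), the wget command would look roughly like this (path layouts can differ slightly between mirrors):
$ wget -np -nd -c -A bz2 https://MIRROR_HOST/enwiki/latest/enwiki-latest-pages-articles.xml.bz2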
Once downloaded, decompress the archive using a tool like lbzip2.
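For example, with lbzip2 (the -d flag decompresses; add -k if you want to keep the original archive):
$ lbzip2 -d enwiki-latest-pages-articles.xml.bz2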
Then feed the resulting enwiki-latest-pages-articles.xml file into WikiExtractor:
$ python3 WikiExtractor.py --no_templates --json enwiki-latest-pages-articles.xml
This will output a large directory structure with root directory text.
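With the --json flag, each extracted file should contain one JSON object per article, roughly of the following form (the exact fields depend on the WikiExtractor version):
{"id": "...", "url": "...", "title": "...", "text": "..."}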
Finally, run wikipedia-ngrams.jar with the desired ngram size "n" (2 in this example) and the path to the directory output by WikiExtractor:
$ java -jar wikipedia-ngrams.jar 2 text
Note that you may need to increase the maximum heap size and/or disable the GC overhead limit.
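For example (the 12g heap size below is only a guess, so pick one that fits your machine; -Xmx sets the maximum heap and -XX:-UseGCOverheadLimit disables the GC overhead limit check):
$ java -Xmx12g -XX:-UseGCOverheadLimit -jar wikipedia-ngrams.jar 2 text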
The contexts.txt and 2-grams.txt files will be written to an out directory. contexts.txt caches the "sentences" in the Wikipedia data dump. To use this cache in your next run (with n = 3 for example), run the following command:
$ java -jar wikipedia-ngrams.jar 3 out/contexts.txt
The output files will not be sorted. Use a command-line tool like sort to do so.
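For example, a plain lexicographic sort (this makes no assumption about the column layout of 2-grams.txt):
$ sort out/2-grams.txt > out/2-grams-sorted.txt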
Note that an OutOfMemoryError is not a bug in this tool. The burden is on the user to allocate enough heap space and have enough RAM (consider allocating a larger swap file).
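If you do need more swap, one common way to add a swap file on Linux looks roughly like this (the 16G size is only an example):
$ sudo fallocate -l 16G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile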