A Kotlin project which extracts ngram counts from Wikipedia data dumps.
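For a rough idea of what the tool computes, here is a minimal, illustrative Kotlin sketch of extracting ngram counts (bigrams here) from a single sentence. It is not the project's actual implementation, and the whitespace tokenization is deliberately naive:

fun ngramCounts(sentence: String, n: Int): Map<String, Int> {
    // Naive whitespace tokenization; the real tool does considerably more work.
    val tokens = sentence.lowercase().split(Regex("\\s+")).filter { it.isNotBlank() }
    // Slide a window of size n over the tokens and count each distinct ngram.
    return tokens.windowed(n) { it.joinToString(" ") }
        .groupingBy { it }
        .eachCount()
}

fun main() {
    println(ngramCounts("the quick brown fox jumps over the lazy dog the quick fox", 2))
}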
Download the latest jar from releases.
You can also clone the repository and build with Maven:
$ git clone https://github.com/TomerAberbach/wikipedia-ngrams.git
$ cd wikipedia-ngrams
$ mvn package
A fat jar called wikipedia-ngrams-VERSION-jar-with-dependencies.jar will be in a newly created target directory.
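The commands below refer to the jar as wikipedia-ngrams.jar, so you may want to copy the built fat jar to that name (the exact versioned filename is whatever mvn produced):
$ cp target/wikipedia-ngrams-*-jar-with-dependencies.jar wikipedia-ngrams.jar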
DISCLAIMER: Many of these commands will take a very long time to run.
Download the latest Wikipedia data dump using wget:
$ wget -np -nd -c -A bz2 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Or using axel:
$ axel --num-connections=3 https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
To speed up the download, replace https://dumps.wikimedia.org with the mirror closest to you.
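For example, with MIRROR_HOST as a placeholder for a mirror chosen from Wikimedia's mirror list (https://dumps.wikimedia.org/mirrors.html), the wget command would look roughly like this (path layouts can differ slightly between mirrors):
$ wget -np -nd -c -A bz2 https://MIRROR_HOST/enwiki/latest/enwiki-latest-pages-articles.xml.bz2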
Once downloaded, decompress the archive using a tool like lbzip2.
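For example, with lbzip2 (the -d flag decompresses; add -k if you want to keep the original archive):
$ lbzip2 -d enwiki-latest-pages-articles.xml.bz2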
Then feed the resulting enwiki-latest-pages-articles.xml file into WikiExtractor:
$ python3 WikiExtractor.py --no_templates --json enwiki-latest-pages-articles.xml
This will output a large directory structure with root directory text.
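With the --json flag, each extracted file should contain one JSON object per article, roughly of the following form (the exact fields depend on the WikiExtractor version):
{"id": "...", "url": "...", "title": "...", "text": "..."}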
Finally, run wikipedia-ngrams.jar with the desired ngram size "n" (2 in this example) and the path to the directory output by WikiExtractor:
$ java -jar wikipedia-ngrams.jar 2 text
Note that you may need to increase the maximum heap size and/or disable the GC overhead limit.
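For example (the 12g heap size below is only a guess, so pick one that fits your machine; -Xmx sets the maximum heap and -XX:-UseGCOverheadLimit disables the GC overhead limit check):
$ java -Xmx12g -XX:-UseGCOverheadLimit -jar wikipedia-ngrams.jar 2 text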
The contexts.txt and 2-grams.txt files will be written to an out directory. contexts.txt caches the "sentences" in the Wikipedia data dump. To use this cache in your next run (with n = 3 for example), run the following command:
$ java -jar wikipedia-ngrams.jar 3 out/contexts.txt
The output files will not be sorted. Use a command-line tool like sort to do so.
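For example, a plain lexicographic sort (this makes no assumption about the column layout of 2-grams.txt):
$ sort out/2-grams.txt > out/2-grams-sorted.txt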
Note that an OutOfMemoryError is not a bug in this tool. The burden is on the user to allocate enough heap space and have enough RAM (consider allocating a larger swap file).
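If you do need more swap, one common way to add a swap file on Linux looks roughly like this (the 16G size is only an example):
$ sudo fallocate -l 16G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile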