sh0dan // VoxPod

Sunday, July 20, 2008

Parser for Wikipedia to mDict

I promised to release the Wikipedia parser I mentioned below. I have finally added customization through command-line parameters, so here it is:

Download: WikiParser 0.1.

Basic usage is

java -jar "WikiParser.jar" [parameters] "InputFile.xml.bz2" "OutputFile.txt"

The output file should then be compatible with mdxbuilder. Its format is "Compact HTML" and its encoding is "UTF8". As with all these small projects, there are thousands of possible improvements, but honestly it is good enough for me, so I don't plan to implement them. The source code is included, though, so you are free to modify it. You are also free to release your modifications; the only condition is that you credit me as the original author.

Possible improvements:
  • Create database with titles, and resolve redirects, so links to redirected pages point to the correct page.
  • Select pages to be included based on the number of links to the page.
  • Better table parsing.
  • GUI.
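For anyone who wants to pick up the first two items, here is a rough sketch of how they could fit together. It is not part of WikiParser 0.1. The class name WikiLinkIndex and both method names are made up for illustration. It assumes the titles, redirects, and outgoing links have already been collected into in-memory maps in a first pass over the dump, and that titles are already normalized.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of WikiParser 0.1.
final class WikiLinkIndex {

    // Follows a redirect chain until it reaches a title that is not a redirect.
    // On a redirect cycle, stops at the first title that is seen twice.
    static String resolve(Map<String, String> redirects, String title) {
        Set<String> seen = new HashSet<>();
        String current = title;
        while (redirects.containsKey(current) && seen.add(current)) {
            current = redirects.get(current);
        }
        return current;
    }

    // Counts inbound links per page, crediting each link to its resolved target.
    // linksByPage maps a page title to the titles it links to (assumed input shape).
    static Map<String, Integer> countInboundLinks(Map<String, List<String>> linksByPage,
                                                  Map<String, String> redirects) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> targets : linksByPage.values()) {
            for (String target : targets) {
                counts.merge(resolve(redirects, target), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("US", "USA");
        redirects.put("USA", "United States");

        Map<String, List<String>> linksByPage = new HashMap<>();
        linksByPage.put("Canada", Arrays.asList("US", "Ottawa"));
        linksByPage.put("Mexico", Arrays.asList("USA"));

        System.out.println(resolve(redirects, "US")); // prints "United States"
        System.out.println(countInboundLinks(linksByPage, redirects));
    }
}
```

With counts like these, a page could be kept only if its inbound count is above some threshold. For the full Wikipedia dump the maps would probably need to live in a real database rather than in memory.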
Anyway, hope it is useful for a few of you.
