Parser for Wikipedia to mDict
I promised to release the Wikipedia parser I mentioned below. I have finally added customization through command-line parameters, so here it is:
Download: WikiParser 0.1.
Basic usage is
java -jar "WikiParser.jar" [parameters] "InputFile.xml.bz2" "OutputFile.txt"
The output file should then be compatible with mdxbuilder. The format is "Compact HTML" and the encoding is "UTF8". As with all these small projects, there are thousands of possible improvements, but honestly it is good enough for me, so I don't plan to implement them. The source code is supplied, though, so you are free to make modifications. You are also free to release them; the only condition is that you credit me as the original author.
Possible improvements:
- Create database with titles, and resolve redirects, so links to redirected pages point to the correct page.
- Select pages to be included based on the number of links to the page.
- Better table parsing.
- GUI.
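For anyone who wants to try the first two improvements, here is a rough sketch of how redirect resolution and inbound link counting could work. It is not part of WikiParser 0.1: the class RedirectResolver and its methods are made-up names for illustration, and it assumes the page titles, redirect targets and link targets have already been extracted from the dump in an earlier pass.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of WikiParser 0.1.
class RedirectResolver {
    // Maps the title of a redirect page to the title it points at.
    private final Map<String, String> redirects = new HashMap<>();

    void addRedirect(String from, String to) {
        redirects.put(from, to);
    }

    // Follows redirects until a title that is not a redirect is reached.
    // The visited set stops the loop if redirects point at each other.
    String resolve(String title) {
        Set<String> visited = new HashSet<>();
        String current = title;
        while (redirects.containsKey(current) && visited.add(current)) {
            current = redirects.get(current);
        }
        return current;
    }

    // Counts inbound links per page, after resolving each link target,
    // so links to a redirect are credited to the page it redirects to.
    Map<String, Integer> countInboundLinks(List<String> linkTargets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String target : linkTargets) {
            counts.merge(resolve(target), 1, Integer::sum);
        }
        return counts;
    }
}
```

With the counts in hand, a page could be kept only if its count is above some threshold. Holding everything in memory like this is an assumption that may not hold for a full Wikipedia dump, in which case a real database would be the better choice.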