sh0dan // VoxPod

Thursday, October 16, 2008

MDX Wikiparser 1.0

Here is version 1.0 of my wiki parser. It is still rather rough, but since you can download the most popular Wikipedias below, it is only here for reference.

Requires JRE or JDK 1.5 or later.
Requires MySQL 5.0 or later.
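
If you are not sure which Java you have, a rough check along these lines may help. It is only a sketch for a Unix-like shell (the sample bat file in the download targets Windows). It assumes `java -version` prints the version in quotes on its first line, which is the usual behaviour but not guaranteed. The function names are made up for this example.

```shell
#!/bin/sh
# Turn a version string such as 1.5.0_22 or 17.0.2 into its major number (5 or 17).
java_major() {
    case "$1" in
        1.*) major=${1#1.} ;;   # old scheme: 1.5.0_22 -> 5.0_22
        *)   major=$1 ;;        # new scheme: 17.0.2 stays as is
    esac
    # Keep only the leading digits.
    echo "$major" | sed 's/[^0-9].*$//'
}

# Read the quoted version from the first line of `java -version` (printed on stderr).
detect_java_version() {
    java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p'
}

if command -v java >/dev/null 2>&1; then
    ver=$(detect_java_version)
    major=$(java_major "$ver")
    if [ -n "$major" ] && [ "$major" -ge 5 ]; then
        echo "Java $ver should be new enough (1.5 or later)"
    else
        echo "Could not confirm Java 1.5 or later (found: '$ver')"
    fi
else
    echo "java not found on PATH"
fi
```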

Create a new schema in your database called wikindex
(the parser appears to require this exact name).
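
As a rough sketch of that step, the script below prints the SQL for creating the schema and granting a user full access to it. The account name `wikiuser`, the `localhost` host and the `build_schema_sql` helper are placeholders for this example and are not something the parser requires. It also assumes the account already exists and that the `mysql` command-line client is installed if you want to apply the output.

```shell
#!/bin/sh
# Placeholder values: adjust to your own setup.
SCHEMA="wikindex"
DB_USER="wikiuser"   # hypothetical account name

# Print the SQL that creates the schema and gives the user full access to it.
build_schema_sql() {
    printf 'CREATE SCHEMA IF NOT EXISTS `%s`;\n' "$1"
    printf "GRANT ALL PRIVILEGES ON \`%s\`.* TO '%s'@'localhost';\n" "$1" "$2"
}

build_schema_sql "$SCHEMA" "$DB_USER"

# To apply it, pipe the output into the MySQL client as an administrator,
# for example (prompts for the password):
#   build_schema_sql "$SCHEMA" "$DB_USER" | mysql -u root -p
```

This only creates an empty schema with access rights; it does not create any tables.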

To run the project from the command line, go to the dist folder and
type the following:

java -jar "WikiParser.jar" [parameters] "InputFile.xml.bz2" "OutputFile.txt"

I have also included a sample .bat file that you can use as a basis for your own conversions. Be sure to adjust --databaseurl, --user: and --password if needed.

For fast indexing, you should put your database on a RAM disk, or on an SSD if you have one. You can find a good free RAM disk for Windows 2000, XP and Vista here. 500 MB should be enough for the English Wikipedia.

Download WikiParser v1.0.

7 Comments:

  • Your new version gives me an error:
    "Cannot open DB Could not open connection to database:Communications link failure"

    I got the latest dump from Wikipedia, and with your version 0.1 it seems to work. Earlier conversions ended up with garbled articles, so I thought I'd give the promising version number 1.0 a chance. So far no luck. :(

    By Blogger Corrodan, at 8:11 pm  

  • You must set up MySQL and create a schema called wikindex that your user has full access to. Be sure to adjust --databaseurl, --user: and --password if needed.

    By Blogger Klaus Post, at 10:20 pm  

  • Duh. Thanks, I managed to overlook the part about the database.

    Will set up the database tomorrow. Again, thanks for your tool!

    By Blogger Corrodan, at 10:24 pm  

  • Hi Klaus. Are you going to develop this project further? It now works almost perfectly. In the latest Polish wiki, which I created myself with your scripts, some internal links to other articles don't work. Anyway, GREAT WORK. Thank you.

    By Anonymous Anonymous, at 4:10 pm  

  • Hello, Klaus. I'm not familiar with MySQL and am looking for help. I set up MySQL and entered "create schema wikindex" to create the schema, but your program told me that it can't find the table wikindex.links. Can you tell me in more detail how to create the schema? Thanks.

    By Anonymous Anonymous, at 10:54 am  

  • Hi Klaus,
    Thanks - I used wikiparser 0.1, which worked mostly fine. There were a few minor formatting issues:
    1) Image anchor text included some junk like: thumb|150px|Anchortext
    2) Categories were included (mostly junk)
    3) No newline before the first category.

    I wanted to try your new 1.0, but I don't want to install MySQL.
    Is it possible to run it without a database? I tried the --noindex option, but it still looks for the DB.

    By Blogger Unknown, at 3:35 pm  

  • @MartinS2: v1.0 doesn't run without a DB, so you have to set one up.

    v0.1 has serious formatting problems, most of which are fixed in v1.0.

    @Jim: Seems like it requires your schema to be called "wikindex". Sorry for the slight inconvenience.

    @paraw: I had to convert the Polish version to UTF-16. It seems like there is a bug in mdxbuilder.

    By Blogger Klaus Post, at 10:46 pm  
