sh0dan // VoxPod

Sunday, May 31, 2009

MDX WikiParser v1.1

Here is a revised version of the Wikipedia to mDict converter.

Download: WikiParser v1.1.

Changes from v1.0 to 1.1:
  • New DB design. Faster, takes a bit more space.
  • Better CPU scaling.
  • Fixed hard-coded schema name - now you can truly call your DB schema whatever you like.
  • Fixed table formatting bug.
  • The result is 200% faster indexing and 50% faster processing on a quad-core machine.

21 Comments:

  • Which file should I convert: "pages-meta-current.xml.bz2" or "pages-articles.xml.bz2"?

    By Anonymous pawaw, at 3:06 pm  

  • @pawaw: You should download "pages-articles.xml.bz2".

    By Blogger Klaus Post, at 3:09 pm  
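
For reference, the English dump can be fetched directly from the Wikimedia dump server. The mirror URL and filename below follow the current naming scheme and are an assumption - substitute the language code and dump date you need:

    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2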

  • Is there any significant difference between "MDict(compact HTML)" and "MDict(html)" in MdxBuilder settings? What should be set?

    By Anonymous pawaw, at 12:24 pm  

  • @pawaw: I think it is only a matter of compression. I use "compact HTML".

    By Blogger Klaus Post, at 12:39 pm  

  • Thanks for the great work! Are you thinking about adding image address parsing and automatic downloading for specific areas, such as math formulas? That would be really appreciated. :)

    By Blogger Unknown, at 5:35 am  

  • Hi Klaus - Please could you let me know what WikiParser parameters you used for the May '09 English Jumbo extract? Thanks in advance.

    By Anonymous Steve, at 11:12 pm  

  • @Steve: java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 2 --maxlength 100000 --maxlengthtrim=70000 --noredirects --noexternal --noindex --norecount e:\current.en.xml.bz2 f:\en-wiki-jumbo.txt
    These are the settings used for the other versions:
    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 2 --noredirects --noexternal e:\current.da.xml.bz2 n:\da-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables --databaseurl="127.0.0.1/wiki" e:\current.pl.xml.bz2 n:\pl-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables e:\current.es.xml.bz2 n:\es-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 7000 --maxlengthtrim=6000 --noredirects --noexternal --skiptables e:\current.zh.xml.bz2 n:\zh-wiki.txt


    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 9 --maxlength 4500 --maxlengthtrim=3500 --noredirects --noexternal --skiptables e:\current.de.xml.bz2 n:\de-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 9 --maxlength 5500 --maxlengthtrim=4500 --noredirects --noexternal --skiptables e:\current.fr.xml.bz2 n:\fr-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 7000 --maxlengthtrim=6000 --noredirects --noexternal --skiptables e:\current.pl.xml.bz2 n:\pl-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 5 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables e:\current.ru.xml.bz2 n:\ru-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 5 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables e:\current.ja.xml.bz2 n:\ja-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 7000 --maxlengthtrim=6000 --noredirects --noexternal --skiptables e:\current.it.xml.bz2 n:\it-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 3 --noredirects --noexternal --skiptables e:\current.pt.xml.bz2 n:\pt-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 15 --maxlength 5000 --maxlengthtrim=4200 --noredirects --noexternal --simple --skiptables --noindex --norecount e:\current.en.xml.bz2 n:\en-wiki.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 2 --maxlength 100000 --maxlengthtrim=70000 --noredirects --noexternal --noindex --norecount e:\current.en.xml.bz2 f:\en-wiki-jumbo.txt

    java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 50 --maxlength 500 --maxlengthtrim=300 --noredirects --noexternal --simple --skiptables --noindex --norecount e:\current.en.xml.bz2 n:\en-wiki-mini.txt

    By Blogger Klaus Post, at 10:08 am  

  • Hi Klaus - thank you very much for the information. It is appreciated. One parameter I just don't seem to understand is [maxlengthtrim]. If [maxlength] is 100k and [maxlengthtrim] is 70k, what effect does this have on the extraction of an article that matches these criteria?

    By Anonymous Steve, at 12:23 pm  

  • @Steve: OK, this is actually also quite complex; I'll try to explain.

    First of all, these limits are imposed after the text has been processed, which you might have guessed.

    The reason there are two limits is that I want the cutoff to be at a natural text break, such as a section ending or something similar.

    If the output content is shorter than "maxlength", obviously nothing is done to the text.

    If it is longer than "maxlength", the program goes back to the "maxlengthtrim" position in the text and starts looking for a natural break from there.

    If a break is found within "maxlength" x 2, all text after the break is cut off. If not, the text is simply cut at "maxlengthtrim".

    By Blogger Klaus Post, at 12:56 pm  
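
The trimming behaviour described above, as a rough Java sketch. This is not the actual WikiParser source; in particular, what counts as a "natural break" (here a blank line or a section heading) is an assumption:

    class TrimSketch {
        // Two-limit trimming: if the processed article exceeds maxLength, search
        // forward from maxLengthTrim for a natural break; cut there if one is
        // found within maxLength * 2, otherwise cut hard at maxLengthTrim.
        static String trimArticle(String text, int maxLength, int maxLengthTrim) {
            if (text.length() <= maxLength) {
                return text;                             // short enough, left untouched
            }
            int searchEnd = Math.min(text.length(), maxLength * 2);
            for (int i = maxLengthTrim; i < searchEnd; i++) {
                if (text.startsWith("\n\n", i) || text.startsWith("\n==", i)) {
                    return text.substring(0, i);         // cut at the natural break
                }
            }
            return text.substring(0, maxLengthTrim);     // no break found: hard cut
        }
    }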

  • Hi Klaus - excellent, now I understand! Thanks for taking the time to reply.

    By Anonymous Steve, at 1:18 pm  

  • Hi Klaus - is WikiParser v1.1 excluding all articles with a colon ':' in the article title? In the English Jumbo edition from May, not one of the 2.4M articles has a colon in its title, so articles such as 'Star Trek: Enterprise' are missing. The article that is included, 'Star Trek Enterprise', is blank. I have the same problem with the latest dump I've created as well.

    By Anonymous Steve, at 6:55 pm  

  • Hey there Klaus, and thank you for creating this raw but powerful tool.
    As always happens "in the tubes of the internets", I ended up here after a couple of chained links, and am now eager to have a pocket version of the latest enwiki and itwiki dumps on the 16 GB microSD of my brand new HD2.
    Yet, rather than leaving this power-hungry Windows 7 system turned on overnight in the room I sleep in, I'd prefer to use my low-drain Ubuntu home server to do the job. Do you think this tool will work with a Linux MySQL/Java installation as well, or should I go straight to installing XP in a VirtualBox and adding MySQL and Java to it?
    Also, +1 on Steve's question: does your parser skip page titles containing a colon?
    Anyway, props for this small but powerful thing ;)

    By Blogger ephestione, at 10:23 am  

  • @Ephestione
    I have done some tests with this WikiParser (v1.1). I don't think it is best to use a virtual machine; it is much better to use a dedicated Windows-based setup. You would also want a RAM drive (software) to make the process faster, and you might even want to tweak the source code to tackle the memory problem: the Java process will sometimes exit with an "out-of-memory" error (I have only 2 GB of memory). The latest dump I tested was the Wikipedia dump from March 12, 2010; when it was successfully extracted (about 3.5 days later), it produced an 11.8 GB text file, which was then converted into a 3.9 GB MDX file.
    I plan to share it, but I don't know how I can manage to upload it to the internet. It is way too big! :(

    As for the colon: unfortunately, such articles are skipped by the parser, so they won't show up.

    Cheers,

    Andre
    http://invictium.wordpress.com

    By Anonymous Anonymous, at 9:28 am  

  • Hi, I tried to run your program, but after the "create database wikindex" command it showed an error: "SQL STMTA Error Table 'wikindex.links' doesn't exists".
    Could you please provide a script for creating the database with all privileges, tables, etc.?

    thank you

    By Anonymous CoSpi, at 5:18 pm  

  • @cospi

    Actually, the program does what you asked for: it will create the "links" table itself. I have also seen the same message; however, once I manually created the wikindex database and set the root (MySQL) password to null, there was no problem afterwards.

    Andre

    By Anonymous Anonymous, at 12:14 pm  
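
A minimal manual setup along the lines Andre describes could look like the commands below. The dedicated user name and password are placeholders (Andre simply used root with an empty password), and the tables themselves are created by WikiParser on its first run:

    mysql -u root -p -e "CREATE DATABASE wikindex;"
    mysql -u root -p -e "CREATE USER 'wikiparser'@'localhost' IDENTIFIED BY 'changeme';"
    mysql -u root -p -e "GRANT ALL PRIVILEGES ON wikindex.* TO 'wikiparser'@'localhost';"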

  • Hello. I don't know if this topic is still active, but I'd like to know: is there any way to use WikiParser without using a database?

    By Anonymous Anonymous, at 1:42 pm  

  • No, there is not. Is anyone still using mdict?

    By Blogger Klaus Post, at 1:47 pm  

  • I am using mdict on my mobile.

    By Anonymous Anonymous, at 4:33 pm  

  • Hi Klaus, or anyone. I have a question: why are short articles (fewer than about 170 characters) skipped when I create the dump? I use the command: java -jar "WikiParser.jar" current.bz2 pl-wiki.txt

    By Anonymous Micha$, at 7:45 pm  

    If you use the default settings, only "--minlinks" is active. Set it to '0' to keep all articles: "--minlinks 0".

    "--minlength" does what you describe, but is 0 by default.

    By Blogger Klaus Post, at 8:01 pm  
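
Applied to the command from the question above, that would be, for example:

    java -jar "WikiParser.jar" --minlinks 0 current.bz2 pl-wiki.txt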

  • It's working! Thank you VERY much!

    By Anonymous Micha$, at 8:52 pm  
