sh0dan // VoxPod: MDX WikiParser v1.1

Sunday, May 31, 2009

MDX WikiParser v1.1

Here is a revised version of the Wikipedia to mDict converter.

Download: Wikiparser v1.1.

Changes from v1.0 to 1.1:

New DB design. Faster, takes a bit more space.
Better CPU scaling.
Fixed locked schema name - now you truly can call your DB schema what you like.
Fixed table formatting bug.
Result is 200% faster indexing and 50% faster processing on Quad-Core.

21 Comments:

Which file should I convert: "pages-meta-current.xml.bz2" or "pages-articles.xml.bz2"?

By pawaw, at 3:06 pm
@pawaw: You should download "pages-articles.xml.bz2".

By Klaus Post, at 3:09 pm
Is there any significant difference between "MDict(compact HTML)" and "MDict(html)" in MdxBuilder settings? What should be set?

By pawaw, at 12:24 pm
@pawaw: I think it is only a matter of compression. I use "compact html".

By Klaus Post, at 12:39 pm
Thanks for the great work! Are you thinking about adding image address parsing and automatic downloading for some specific areas, such as math formulas? That would be really appreciated. :)

By Unknown, at 5:35 am
Hi Klaus - Please could you let me know what WikiParser parameters you used for the May '09 English Jumbo extract? Thanks in advance.

By Steve, at 11:12 pm
@Steve: java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 2 --maxlength 100000 --maxlengthtrim=70000 --noredirects --noexternal --noindex --norecount e:\current.en.xml.bz2 f:\en-wiki-jumbo.txt
These are the settings used for the other versions:
java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 2 --noredirects --noexternal e:\current.da.xml.bz2 n:\da-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables --databaseurl="127.0.0.1/wiki" e:\current.pl.xml.bz2 n:\pl-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables e:\current.es.xml.bz2 n:\es-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 7000 --maxlengthtrim=6000 --noredirects --noexternal --skiptables e:\current.zh.xml.bz2 n:\zh-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 9 --maxlength 4500 --maxlengthtrim=3500 --noredirects --noexternal --skiptables e:\current.de.xml.bz2 n:\de-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 9 --maxlength 5500 --maxlengthtrim=4500 --noredirects --noexternal --skiptables e:\current.fr.xml.bz2 n:\fr-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 7000 --maxlengthtrim=6000 --noredirects --noexternal --skiptables e:\current.pl.xml.bz2 n:\pl-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 5 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables e:\current.ru.xml.bz2 n:\ru-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 5 --maxlength 5000 --maxlengthtrim=4000 --noredirects --noexternal --skiptables e:\current.ja.xml.bz2 n:\ja-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 4 --maxlength 7000 --maxlengthtrim=6000 --noredirects --noexternal --skiptables e:\current.it.xml.bz2 n:\it-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 3 --noredirects --noexternal --skiptables e:\current.pt.xml.bz2 n:\pt-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 15 --maxlength 5000 --maxlengthtrim=4200 --noredirects --noexternal --simple --skiptables --noindex --norecount e:\current.en.xml.bz2 n:\en-wiki.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 2 --maxlength 100000 --maxlengthtrim=70000 --noredirects --noexternal --noindex --norecount e:\current.en.xml.bz2 f:\en-wiki-jumbo.txt

java -jar "WikiParser.jar" --databaseurl="127.0.0.1/wiki" --minlinks 50 --maxlength 500 --maxlengthtrim=300 --noredirects --noexternal --simple --skiptables --noindex --norecount e:\current.en.xml.bz2 n:\en-wiki-mini.txt

By Klaus Post, at 10:08 am
Hi Klaus - thank you very much for the information. It is appreciated. One parameter I just don't seem to understand is [maxlengthtrim]. If [maxlength] is 100k and [maxlengthtrim] is 70k, what effect does this have on the extraction of an article that matches these criteria?

By Steve, at 12:23 pm
@Steve: ok, this is actually also quite complex, I'll try to explain.

First of all, these limits are imposed after the text has been processed, which you might have guessed.

The reason there are two limits is that I want the cutoff to be at a natural text break, such as a section ending or something similar.

If the output content is smaller than "maxlength", obviously nothing is done to the text.

If it is longer than "maxlength", the program goes back to the "maxlengthtrim" position in the text, and starts looking for a natural break in the text from there.

If the break is within "maxlength" x 2, all text after the break is cut off. If not, the text is simply cut at "maxlengthtrim".

By Klaus Post, at 12:56 pm
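The trimming rule Klaus describes above can be pictured with a small sketch. This is not taken from the WikiParser source; it is a minimal illustration in Python under a few assumptions: the function name trim_article and its parameter names are made up, a blank line stands in for whatever WikiParser treats as a "natural break", and "within maxlength x 2" is read as "the break starts before character position 2 x maxlength".

```python
def trim_article(text: str, max_length: int, max_length_trim: int,
                 break_marker: str = "\n\n") -> str:
    """Trim already-processed article text, following the description above.

    break_marker is a hypothetical stand-in for a "natural break";
    the real tool may look for section endings or other markers.
    """
    # Text that fits within max_length is left untouched.
    if len(text) <= max_length:
        return text

    # Go back to the max_length_trim position and search forward
    # from there for the next break.
    break_pos = text.find(break_marker, max_length_trim)

    # Assumed reading of "within maxlength x 2": the break must start
    # before position 2 * max_length. If so, drop everything after it.
    if break_pos != -1 and break_pos < 2 * max_length:
        return text[:break_pos]

    # No usable break: cut hard at max_length_trim.
    return text[:max_length_trim]
```

Under these assumptions, the Jumbo settings (--maxlength 100000 --maxlengthtrim=70000) would leave articles of up to 100000 characters alone, and cut longer ones at the first break found at or after position 70000, falling back to a hard cut at 70000 if no break turns up before position 200000.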
Hi Klaus - excellent, now I understand! Thanks for taking the time to reply.

By Steve, at 1:18 pm
Hi Klaus - is WikiParser v1.1 excluding all articles with a colon ':' in the article title? In the English Jumbo edition from May, not one of the 2.4M articles has a colon in the article title. Therefore, articles such as 'Star Trek: Enterprise' are missing. The article that is included: 'Star Trek Enterprise' is blank. I have the same problem with the latest dump I've created as well.

By Steve, at 6:55 pm
Hey there Klaus, and thank you for creating this raw but powerful tool.
As always happens "in the tubes of the internets" I bumped in here after a couple of chained links, and am now eager to have a pocket version of the latest enwiki and itwiki dump on the 16GB microsd of my brand new hd2.
Yet, before leaving this power consuming Seven system turned on during the night in the same room that I sleep in, I'd rather use my low-drain ubuntu home server to do the job. Do you think this script will work with a linux mysql/java installation as well, or should I go straight installing XP in a virtualbox and adding mysql and java to it?
Also, +1 on Steve's request, does your parser skip the colon in the page titles?
Anyway, props for this little but powerful thing ;)

By ephestione, at 10:23 am
@Ephestione
I have done some tests with this WikiParser (v1.1). I don't think a virtual machine is the best option; a dedicated Windows installation works much better. You would also need a software RAM drive to make the process faster, and you might even want to tweak the source code to tackle the memory problem: Java randomly exits with an "out-of-memory" error (I only have 2 GB of memory). The latest dump I tested is the Wikipedia dump from March 12, 2010. When it was successfully extracted (about 3.5 days later), the result was a text file of about 11.8 GB, which is then converted into 3.9 GB of MDX files.
I plan to share it, but I don't know how I can manage to upload it to the internet. It is way too big!!! :(

As for the colon: unfortunately, such articles are skipped by the parser and won't show up.

Cheers,

Andre
http://invictium.wordpress.com

By Anonymous, at 9:28 am
Hi, I tried to run your program, but after the "create database wikindex" command it showed the error "SQL STMTA Error Table 'wikindex.links' doesn't exists".
Could you please provide some scripts for creating the database with all privileges, tables, etc.?

thank you

By CoSpi, at 5:18 pm
@cospi

Actually, the program already does what you are asking for: it creates the "links" table itself. I got the same message as well; however, once I manually created the wikindex database and set the MySQL root password to empty (null), there were no further problems.

Andre

By Anonymous, at 12:14 pm
Hello. I don't know if this topic is still active, but I'd like to know if there is any way to use WikiParser without a database.

By Anonymous, at 1:42 pm
No, there is not. Is anyone still using mdict?

By Klaus Post, at 1:47 pm
I am using mdict on my mobile.

By Anonymous, at 4:33 pm
Hi Klaus, or anyone. I have a question. Why are short articles (less than about 170 letters) skipped when I create the dump? I use the command: java -jar "WikiParser.jar" current.bz2 pl-wiki.txt

By Micha$, at 7:45 pm
If you use default settings, only "--minlinks" is active. Pass "--minlinks 0" to keep all articles.

"--minlength" does what you describe, but is 0 by default.

By Klaus Post, at 8:01 pm
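To make the answer above concrete, here is a minimal sketch of how the two thresholds could combine. It is an illustration only, not WikiParser code: the function keep_article and its parameter names are hypothetical, the thread does not say how links are counted or what the default "--minlinks" value is (so it is a required argument here), and the use of ">=" for both comparisons is an assumption. The only detail taken from the thread is that "--minlength" defaults to 0.

```python
def keep_article(link_count: int, text_length: int,
                 min_links: int, min_length: int = 0) -> bool:
    """Decide whether an article survives the filters (assumed semantics).

    link_count:  number of links counted for the article; how WikiParser
                 counts them is not stated in the thread.
    text_length: length of the article text.
    min_links:   value of --minlinks (default unknown, so required here).
    min_length:  value of --minlength (0 by default, per the thread).
    """
    return link_count >= min_links and text_length >= min_length
```

With min_links set to 0 and min_length left at its default, every article passes, which matches the advice to run with "--minlinks 0". Short articles tend to have few links, which would explain why a link threshold ends up dropping them even though no length limit is set.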
It's working! Thank you very much!

By Micha$, at 8:52 pm
