sh0dan // VoxPod

Sunday, May 31, 2009

MDX WikiParser v1.1

Here is a revised version of the Wikipedia to mDict converter.

Download: Wikiparser v1.1.

Changes from v1.0 to 1.1:
  • New DB design. Faster, takes a bit more space.
  • Better CPU scaling.
  • Fixed locked schema name - now you truly can call your DB schema what you like.
  • Fixed table formatting bug.
  • The result is 200% faster indexing and 50% faster processing on a quad-core CPU.

Monday, May 25, 2009

Wikipedia for mDict - May 2009

Here are the updated Wikipedias for mDict.

Languages:
  • English
  • German
  • Spanish
  • Portuguese
  • Russian
  • Chinese
  • Japanese
  • Polish
  • Italian
  • Danish (direct download below)
Download from LegalTorrents.

Changes:
  • Title added to all pages.
  • Spanish version added.
  • Chinese version added.
  • English "Jumbo" edition added, with 2.3 million full articles.

Since the Danish Wikipedia is so small, here is a direct download link for it.

If you cannot download via BitTorrent, here is a mini version of the English Wikipedia (155,000 articles, very abbreviated).

If you are interested in helping with further updates, please contact me.

Thursday, February 12, 2009

Introducing RawSpeed

I have spent quite a lot of my spare time in the last two months creating a file loader for RAW files. My primary motivation, besides general interest, was my frustration that there weren't any fast open source loaders out there.

So my main objectives were to make a very fast loader that worked for 75% of the cameras out there and could decode a RAW file at close to the optimal speed. The remaining 25% of cameras could be served by a more generic loader, or their images could be converted to DNG, which as a side note usually compresses better than your camera does.

So far, I have support for the latest cameras from Canon, Nikon, Sony, Pentax and Olympus, in addition to DNG support. I plan to add Panasonic cameras to the mix, but that's about it. RawSpeed has been integrated into the upcoming version of Rawstudio, which is maintained by my friend Anders Brander.

My natural choice of language was C++, and while I considered adding assembler, I haven't seen any obvious spots where it would help more than a few percent. Staying with plain C++ also makes the code fairly easy to port, as a side bonus. My C++ style is "object-oriented C", with carefully chosen STL use, inspired by AviSynth, so I knew it wouldn't hinder performance much.

Unlike other raw decoders out there, I decided from the start to completely forgo file streams. The image must be fully loaded into memory before decoding can start. First of all, this avoids many system calls and non-serial IO, and furthermore it makes it possible to run IO and decoding in parallel by having a separate IO thread.
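To make the "no file streams" idea a bit more concrete, here is a minimal sketch of how it could look. This is not RawSpeed's actual code: the names loadWholeFile, ByteBuffer, isLittleEndianTiff and prefetch are made up for illustration. The only format detail it relies on is the standard TIFF header (the byte order mark "II" followed by the number 42), which many RAW formats share.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <future>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read the whole file with a single sequential read.
std::vector<uint8_t> loadWholeFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) throw std::runtime_error("cannot open " + path);
    const std::streamoff size = in.tellg();
    if (size < 0) throw std::runtime_error("cannot determine size of " + path);
    in.seekg(0, std::ios::beg);
    std::vector<uint8_t> data(static_cast<size_t>(size));
    if (size > 0 && !in.read(reinterpret_cast<char*>(data.data()), size))
        throw std::runtime_error("short read on " + path);
    assert(data.size() == static_cast<size_t>(size));
    return data;
}

// Hypothetical bounds-checked view of the in-memory file.
// Decoders read from this buffer and never touch the file again.
class ByteBuffer {
public:
    explicit ByteBuffer(std::vector<uint8_t> data) : data_(std::move(data)) {}

    size_t size() const { return data_.size(); }

    // Read a little-endian 16-bit value at any offset; no seeking involved.
    uint16_t u16le(size_t offset) const {
        if (offset > data_.size() || data_.size() - offset < 2)
            throw std::out_of_range("read past end of buffer");
        return static_cast<uint16_t>(data_[offset] | (data_[offset + 1] << 8));
    }

private:
    std::vector<uint8_t> data_;
};

// A little-endian TIFF file starts with "II" (0x4949) followed by 42.
bool isLittleEndianTiff(const ByteBuffer& buf) {
    return buf.size() >= 4 && buf.u16le(0) == 0x4949 && buf.u16le(2) == 42;
}

// Sketch of the IO thread idea: start loading the next file in the
// background while the caller is still decoding the current one.
std::future<std::vector<uint8_t>> prefetch(const std::string& path) {
    return std::async(std::launch::async, loadWholeFile, path);
}
```

A caller would decode the current ByteBuffer, then call get() on the future returned by prefetch to pick up the next file, so the disk and the CPU are kept busy at the same time.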

I feel like I must mention the inspiration. Obviously the closest relative is libopenraw. It seems like a nice project. I considered contributing to the project, but since I had already made a TIFF parser, and making it "streamless" would make it almost a complete rewrite, I decided I might as well write my own. At least then I cannot blame bugs on other people.

The other well-known project out there is Dave Coffin's dcraw. Dave has put an incredible amount of work into reverse-engineering the strangest formats out there and creating a solid application that works for just about any camera. While his coding style is radically different from most of what you find out there, the amount of work and dedication put into this project is very admirable.

So far the only thing you can see is the Subversion repository, but I will make a formal release when the time is right. If you want to follow the development, you can subscribe to the rawstudio commit mailing list, where my updates are also posted.

That's it for now. Next time I hope to be able to tell you a bit about what the library actually contains.

Thursday, October 16, 2008

MDX Wikiparser 1.0

Here is version 1.0 of my wiki parser. It is still rather rough, but since you can download the most popular wikipedias below, it is only here for reference.

Requires JRE or JDK 1.5 or later to be installed.
Requires MySQL 5.0 or later installed.

Create a new schema in your database called wikindex or something similar.

To run the project from the command line, go to the dist folder and
type the following:

java -jar "WikiParser.jar" [parameters] "InputFile.xml.bz2" "OutputFile.txt"

I have also included a sample bat file you can use as a basis for your own conversions. Be sure to adjust --databaseurl, --user and --password, if needed.

For fast indexing, you should have your database on a ramdisk, or on an SSD if you have one. You can find a good free ramdisk for Windows 2000, XP and Vista here. 500MB should be enough for the English Wikipedia.

Download WikiParser v1.0.

Wednesday, September 10, 2008

Wikipedia mDict for Windows Mobile

I'm still having some fun playing with the Wikipedia parser I posted below. The main reason I think this kind of exercise is fun is the immense amount of data Wikipedia contains.

I took the time to implement the first two points on my "possible improvements" list below. I knew that a database would be the logical next step, so I fired up MySQL and implemented a simple first pass that indexes all links in Wikipedia. The second pass then uses these link statistics to select the most relevant articles.

That way I was able to get a much more consistent subset of Wikipedia, with redirects resolved inline.
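For the curious, the two-pass idea can be sketched roughly like this. Note that this is only an illustration: the real parser is written in Java and keeps its index in MySQL, while this sketch uses plain in-memory maps, and the Page struct, the function names and the hop limit of 8 are all invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical page record; the real parser reads pages from the XML dump.
struct Page {
    std::string title;
    std::string redirectTo;          // empty unless the page is a redirect
    std::vector<std::string> links;  // titles linked from the article body
};

typedef std::unordered_map<std::string, std::string> RedirectMap;
typedef std::unordered_map<std::string, int> LinkCounts;

// Collect every redirect page as "title -> target".
RedirectMap buildRedirects(const std::vector<Page>& pages) {
    RedirectMap redirects;
    for (const Page& p : pages)
        if (!p.redirectTo.empty()) redirects[p.title] = p.redirectTo;
    return redirects;
}

// Follow redirects to the final title. The hop limit (an arbitrary choice)
// guards against redirect loops.
std::string resolve(const RedirectMap& redirects, std::string title) {
    for (int hops = 0; hops < 8; ++hops) {
        RedirectMap::const_iterator it = redirects.find(title);
        if (it == redirects.end()) break;
        title = it->second;
    }
    return title;
}

// Pass 1: count inbound links per article, crediting the redirect target
// rather than the redirect page itself.
LinkCounts countInboundLinks(const std::vector<Page>& pages,
                             const RedirectMap& redirects) {
    LinkCounts inbound;
    for (const Page& p : pages) {
        if (!p.redirectTo.empty()) continue;  // redirects have no body links
        for (const std::string& link : p.links)
            ++inbound[resolve(redirects, link)];
    }
    return inbound;
}

// Pass 2: keep the most linked-to articles. Ties are broken by title so
// the selection is deterministic.
std::vector<std::string> selectTopArticles(const std::vector<Page>& pages,
                                           const LinkCounts& inbound,
                                           size_t limit) {
    std::vector<std::pair<int, std::string> > ranked;
    for (const Page& p : pages) {
        if (!p.redirectTo.empty()) continue;
        LinkCounts::const_iterator it = inbound.find(p.title);
        ranked.push_back(std::make_pair(it == inbound.end() ? 0 : it->second,
                                        p.title));
    }
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<int, std::string>& a,
                 const std::pair<int, std::string>& b) {
                  if (a.first != b.first) return a.first > b.first;
                  return a.second < b.second;
              });
    if (ranked.size() > limit) ranked.resize(limit);
    std::vector<std::string> titles;
    for (const std::pair<int, std::string>& r : ranked)
        titles.push_back(r.second);
    return titles;
}
```

When the selected articles are written out, each link would be passed through resolve once more, so a link to a redirect page points directly at the real article.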

So here is a dump of the most popular Wikipedias for mDict, a free dictionary reader that can be used on Windows Mobile and Windows Smartphones:

(Updated May 2009)
Download from LegalTorrents™

Installation:
Copy the MDX file to your SD card or internal storage, then select Library/Search All. This should add Wikipedia to your library.


The latest version is for mDict 3.0.

Hosting kindly provided by LegalTorrents™. Any donations on the page will be split 15/85 between LegalTorrents and the Wikimedia Foundation.

Feel free to comment, if you have specific requests.