Friday, January 04, 2008

Open Source licensed scientific software



Software is becoming more and more important in science, as in other areas of life. Scientists have a tradition of publishing their work very openly, but that often does not include the source code of the software developed to carry out simulations. This has some obvious problems:
- other scientists cannot check the software for errors,
- other scientists cannot fix bugs or easily reproduce the results,
- other scientists cannot base their new research on already existing software and have to write it completely from scratch again and again,
- software packages from different authors cannot be combined easily.

But things are getting better. One field of scientific research where we can see some improvement is machine learning (http://en.wikipedia.org/wiki/Machine_learning), a broad subfield of artificial intelligence concerned with the design and development of algorithms and techniques that allow computers to "learn". Sören Sonnenburg et al. wrote a paper, "The Need for Open Source Software in Machine Learning", which is available at http://jmlr.csail.mit.edu/papers/v8/sonnenburg07a.html. They even created a portal, http://mloss.org, with the goal of supporting a community that creates a comprehensive open source machine learning environment.

An increasing number of software packages are available in Debian, such as:
- some simple-to-use utilities to apply compression techniques to the process of discovering and learning patterns: http://packages.debian.org/sid/complearn-gui
- a Python package for convex optimization: http://packages.debian.org/sid/python-cvxopt
- a library for support vector machines: http://packages.debian.org/sid/libsvm2
- a machine-learning library: http://packages.debian.org/sid/libtorch3-dev
- an object-oriented programming language designed for researchers, experimenters, and engineers interested in large-scale numerical and graphic applications: http://packages.debian.org/sid/lush
- a large scale machine learning toolbox: http://packages.debian.org/sid/shogun-python-modular
- data mining software in Java: http://packages.debian.org/sid/weka

I'd like to know if you are using some of these packages or some other scientific software in Debian. Feel free to leave a comment. Or maybe you are missing something in Debian?

If you are an author or user of free software related to machine learning, please consider registering it at http://mloss.org.

Comments:

Jonas said…

As a linguist, I miss good tools to aid in thorough analysis of corpus linguistics data. Or maybe I just haven't found them.

Still, there are SOME decent tools for small subsets of my field but nothing as thorough as I would like.

Kinda odd in a way...it's basically only software that deals with huge amounts of text. No *ix based OS with grep, awk, sed, bash, perl and so on should have any problems whatsoever in providing good tools for this kind of research. But I am still stuck with using XP + a proprietary app in VirtualBox for my research.

Unfortunately, my programming experience is limited and so is my time so I can't help much in rectifying this problem.

Kumar Appaiah said…

Dear Torsten,

While CS and machine learning aren't my area, I am really interested in the numerical tools, such as GNU Octave, python-scipy and python-numpy. I am also the maintainer of a communication and signal processing library (libitpp), and, though not regularly, a user of symbolic packages such as maxima and python-sympy.

HTH.

Kumar

Torsten Werner said…

Dear Kumar,

It might interest you that Scilab will be relicensed this year under a license that is compatible with the Debian Free Software Guidelines.

Cheers,
Torsten

Kumar Appaiah said…

Dear Torsten,

The news about Scilab is very good, and I wasn't aware of it. Thanks!

Kumar

Google user said…

grass, gromacs and abinit are other high quality examples.

Ken B said…

We're missing any implementation of Maximum Entropy Learning. My preference would be to include Hal Daumé III's software for this (http://hal3.name/software.html) if he's willing to suitably license it, because he's got the most flexible versions of Support Vector Machines and Maximum Entropy available.

You should also include OpenNLP (http://opennlp.sf.net/ and http://maxent.sf.net/) for us computational linguistics types.
The Stanford NLP group has lots of natural language processing software available under the GPL as well: http://nlp.stanford.edu/software/index.shtml
I make use of this software daily in my work.

You also need conditional random fields. Ask Hannah Wallach what software is good, as she's one of the inventors of these nifty probabilistic models.

novakyu said…

Kinda off-topic, but there's a very common reason that most scientists do not publish their tools (unless asked for): the "program" is an ugly bunch of hacks that others can't possibly be expected to make use of.

Well, or that's what I overheard two CS graduate students talking about.

In my own experience, I have written a few analysis tools for my experiments, but there is a whole lot more work to be done before they're fit for publishing, and they kinda tend to be specific to the experiment, lacking general applicability.

Torsten Werner said…

I have heard that this is a problem not only of scientific software but even of so-called 'professional' software. ;-)