mloss08 Program
by Cheng Soon Ong on November 6, 2008 (1 comment)
Just in case you haven't checked our workshop page recently, we have finalised our program. We had a surprisingly large number of submissions, ranging from quite mature projects to small radical ideas. In the end, we decided that we should try to squeeze in as many projects as possible, and at the same time try to keep some diversity in the program; i.e. we didn't want to have all slots taken up by large mature machine learning frameworks.
Our theme this year is "interoperability, interoperability, interoperability". The dream is to have some way for machine learning software to talk to each other. We are still a long way from being able to plug and play different tools for machine learning, and we hope to make a start by discussing this at the workshop. Of course, machine learning research is not only about software, but it is also about the data. Our afternoon discussion session will be about "UCI 2.0", and how we should go about it. There was a recent editorial in Nature Cell Biology about the need for standardizing bioinformatics data, and this blog post highlights three properties of scientific data.
Hope to see you at NIPS!
Reviewing software
by Cheng Soon Ong on October 13, 2008 (0 comments)
The review process for the current NIPS workshop mloss08 is now underway. There are a couple of interesting thoughts that I had while discussing this process with Soeren and Mikio, as well as some of the program committee. The two issues are:
- Who should review a project?
- What are the review criteria?
Reviewer Choice
Unlike in a standard machine learning review, the reviewer of an mloss project has to be comfortable with three different aspects of the system, namely:
- The machine learning problem (e.g. Graphical models, kernel methods, or reinforcement learning)
- The programming language, or at least the paradigm (e.g. object oriented programming)
- The operating environment (which may be a particular species of make on a version of Linux)
There are also projects about a particular application area of machine learning, such as brain-computer interfaces, which put an additional requirement on the reviewer's understanding.
However, if one looks at the set of people who satisfy all those criteria for a particular project, one usually ends up with only a handful of potential researchers, most of whom would have a conflict of interest with the submitted project. So, often I would choose a reviewer who is an expert in one of the three areas and hope that he or she would be able to figure out the rest. Is there a better solution?
Review Criteria
The JMLR review criteria are:
- The quality of the four page description.
- The novelty and breadth of the contribution.
- The clarity of design.
- The freedom of the code (lack of dependence on proprietary software).
- The breadth of platforms it can be used on (should include an open-source operating system).
- The quality of the user documentation (should enable new users to quickly apply the software to other problems, including a tutorial and several non-trivial examples of how the software can be used).
- The quality of the developer documentation (should enable easy modification and extension of the software, provide an API reference, provide unit testing routines).
- The quality of comparison to previous (if any) related implementations, w.r.t. run-time, memory requirements, features, to explain that significant progress has been made.
This year's workshop has the theme of interoperability and cooperation, so that is also a review criterion. The important question is how to weight the different aspects, and the answer is not at all clear. There is a basic level of adherence which is necessary for each of the criteria, above which it is difficult to trade off the different aspects quantitatively. For example, does very good user documentation excuse very poor code design? Does being able to run on many different operating systems excuse very poor run time memory and computational performance?
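To make the "basic level plus trade-off" idea concrete, here is a minimal sketch. It is not how the workshop submissions are actually scored: the 0 to 5 scale, the minimum of 2, the criterion names and the weights are all assumptions made up for illustration.

```python
# Hypothetical scoring scheme: every number and name here is an assumption.
MINIMUM = 2  # assumed basic level that every criterion must reach (scale 0-5)


def overall_score(scores, weights):
    """Return None if any criterion is below MINIMUM, else the weighted mean."""
    if any(value < MINIMUM for value in scores.values()):
        return None  # failing one criterion cannot be bought back by another
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Arbitrary example weights; choosing them is exactly the open question.
weights = {"user_docs": 1.0, "code_design": 2.0, "platforms": 1.0}

good_docs_poor_design = {"user_docs": 5, "code_design": 1, "platforms": 4}
balanced = {"user_docs": 3, "code_design": 3, "platforms": 4}

print(overall_score(good_docs_poor_design, weights))
print(overall_score(balanced, weights))
```

Under these assumptions the first project is rejected outright, however good its documentation, while the second gets a weighted mean. The hard part, picking the weights above the threshold, is left as arbitrary here as it is in the discussion above.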
Put your comments below or come to this year's workshop and discuss this!
GNU Octave on Free Software Foundation's High Priority List
by Soeren Sonnenburg on October 6, 2008 (1 comment)
The Free Software Foundation (FSF) maintains a high priority list of software projects, which can be found here.
Quoting the FSF:
The FSF high-priority projects list serves to foster the development of projects that are important for increasing the adoption and use of free software and free software operating systems. [...] Some of the most important projects on our list are replacement projects. These projects are important because they address areas where users are continually being seduced into using non-free software by the lack of an adequate free replacement.
Ranked eighth among the top ten prioritized software projects is GNU Octave, a free software Matlab replacement.
As this is very relevant to our community, which is strongly dominated by Matlab, I would like to encourage everyone to try out Octave 3.0. If you tried Octave 2.x or any earlier version at some point, you will find that it has really matured a lot. It supports all the data types you know from Matlab, like cell arrays and dense or sparse arrays, and yes, it has all these plotting functions like plot, surf3d etc. too. And if you ever tried to extend Matlab using C code, support is really much better from the Octave side, not to mention the killer feature: Octave is fully supported by SWIG! Still not convinced? We will have John W. Eaton to introduce Octave to us at the NIPS'08 MLOSS Workshop. So what are you waiting for? Give Octave a try and see how you can help!
Differences between paid and volunteer FOSS contributors
by Soeren Sonnenburg on October 3, 2008 (0 comments)
I just stumbled across a very interesting article titled Differences between paid and volunteer FOSS contributors that I am going to almost fully quote below. The original article was written by Martin Michlmayr and can be found here. Almost full quote follows:
There's a lot of debate these days about the impact of the increasing number of paid developers in FOSS communities that started as volunteer efforts and still have significant numbers of volunteers. Evangelia Berdou's PhD thesis "Managing the Bazaar: Commercialization and peripheral participation in mature, community-led Free/Open source software projects" contains a wealth of information and insights about this topic.
Berdou conducted interviews with members of the GNOME and KDE projects. She found that paid developers are often identified with the core developer group which is responsible for key infrastructure and often make a large number of commits. Furthermore, she suggested that the groups may have different priorities: "whereas [paid] developers focus on technical excellence, peripheral contributors are more interested in access and practical use".
Based on these interviews, she formulated the following hypotheses which she subsequently analyzed in more detail:
- Paid developers are more likely to contribute to critical parts of the code base.
- Paid developers are more likely to maintain critical parts of the code base.
- Volunteer contributors are more likely to participate in aspects of the project that are geared towards the end-user.
- Programmers and peripheral contributors are not likely to participate equally in major community events.
Berdou found all hypotheses to be true for GNOME but only hypotheses two and four were confirmed for KDE.
In the case of GNOME, Berdou found that hired developers contribute to the most critical parts of the project, that they maintained most modules in core areas and that they maintained a larger number of modules than volunteers. Two important differences were found in KDE: paid developers attend more conferences and they maintain more modules.
Berdou's research contains a number of important insights:
- Corporate contributions are important because paid developers contribute a lot of changes, and they maintain core modules and code.
- While it's clear that the involvement of paid contributors is influenced by the strategy of their company, Berdou wonders whether another reason why they often contribute to core code is because they "develop their technical skills and their understanding of the code base to a greater extent than volunteers who usually contribute in their free time". It's therefore important that projects provide good documentation and other help so volunteers can get up to speed quickly.
- Since many volunteers cannot afford to attend community events, projects should provide travel funds. This is something I see more and more: for example, Debian funds some developers to attend Debian conference and the Linux Foundation has a grant program to allow developers to attend events.
- Paid developers often maintain modules they are not paid to directly contribute to. A reason for this is that they continue to maintain modules in their spare time when their company tells them to work on other parts of the code.
The rest of the article can be found here.
Deadline extension mloss 08
by Cheng Soon Ong on September 30, 2008 (0 comments)
Murphy's law has struck us. After happily running for more than a year, the hardware that is running mloss.org is facing some strange difficulties the day before our deadline for mloss 08. So, if you cannot submit, don't panic.
To be fair, we've decided to extend the deadline to next Monday.
http://mloss.org/workshop/nips08/
Final Call for Contributions: NIPS*08 MLOSS Workshop
by Soeren Sonnenburg on September 25, 2008 (1 comment)
This is the final call for contributions for the NIPS*08 MLOSS workshop to be held on Friday, December 12th, 2008 in Whistler, British Columbia, Canada.
The deadline for the submissions is approaching quickly: just one week remains until October 1, 2008. We accept all kinds of machine learning (related) software submissions for the workshop. If accepted, you will be given a chance to present your software at the workshop, which is a great opportunity to make your piece of software better known to the NIPS audience and to receive valuable feedback.
We have decided to use mloss.org for managing the submissions. You basically just have to register your project with mloss.org and add the tag nips2008 to it. For more information, have a look at the workshop page.
Data sources
by Cheng Soon Ong on September 19, 2008 (1 comment)
Those of us who are interested in algorithm development are often faced with the "have a hammer, looking for a nail" problem. Once we have confirmed that the standard machine learning datasets (for example at UCI) do not offer a useful application area, where does one go? Below, I look at four websites which list data and also software associated with data. The information is not collected with machine learning in mind, and so a user would probably need to write preprocessing scripts to convert stuff into something useful.
A common theme is that just providing blobs of data isn't enough; one has to provide data as well as interfaces or processing tools for it. The other common theme is that these are just listings of data, and not an archival copy.
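As a rough idea of what such a preprocessing script might look like, here is a minimal sketch that turns a small tabular listing into feature vectors and labels. The column names and values are made up for illustration and do not come from any of the sites below; dropping incomplete rows is just one possible choice.

```python
import csv
import io

# Hypothetical raw listing; column names and values are invented for this sketch.
RAW = """state,population,internet_users,region
A,1000,250,north
B,2000,,south
C,1500,900,north
"""


def load_examples(text, feature_columns, label_column):
    """Turn CSV text into (features, labels), skipping rows with missing values."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        values = [row[name] for name in feature_columns]
        if any(value == "" for value in values) or row[label_column] == "":
            continue  # drop incomplete rows rather than guess a value
        features.append([float(value) for value in values])
        labels.append(row[label_column])
    return features, labels


X, y = load_examples(RAW, ["population", "internet_users"], "region")
print(X)
print(y)
```

Real listings will of course need more than this (unit conversion, encoding of categorical fields, and so on), which is exactly why shared processing tools next to the data are so useful.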
theinfo
This is a site for large data sets and the people who love them:
the scrapers and crawlers who collect them,
the academics and geeks who process them,
the designers and artists who visualize them.
It's a place where they can exchange tips and tricks,
develop and share tools together, and begin to integrate their particular projects.
theinfo.org classifies the activities that people want to do to data into three different ones: get, process, view. In the get section, they provide a list of links to sources of data, which includes things from US congressional district boundaries to stock ticker data which requires a (free) registration. Unfortunately, the list of datasets is a static list, and does not provide useful slicing capabilities. In the view section, there is a nice list of different visualizations of datasets, for example a visualization of trends in twitter or worldmapper which morphs the area of a country to correspond to the size of a certain variable of interest, such as the number of internet users.
However, the really nice thing about this site is that for each section, it lists tools of the trade and tips and tricks, which are bits of software related to collecting, processing and visualizing data. These are the kinds of things which simplify our data analysis tasks. There doesn't seem to be a tool for each of the data sources listed yet, which means that a machine learner may still need to write their own scraping tool to get data.
infochimps
There are many sources to find out something about everything.
Until now, there’s been no good place for you to find out everything about something.
This site is still in beta, and currently only provides a list of datasets. They promise to allow uploading of your datasets in the full version. What's nice about the design is that you can slice the list of datasets according to a list of predefined fields or tags. So, in a sense, the design is very much like mloss.org, depending on community involvement to keep the repository fresh and up to date. Most of the data seems to be in tabular format (csv, xls), but they support yaml, which means that in principle more complex structures can exist.
They provide the infinite monkeywrench which is a scripting language to process data.
(the site seems to be having some problems recently, possibly due to the imminent v1.0)
datamob
Datamob highlights the connection between public data sources
and the interfaces people are building for them
They list hot new datasets and hot new interfaces, which are the latest listings. They have a short list of machine learning data which includes the venerable UCI and also Netflix. There is a simple submit form which allows one to add a link to the source of data or interface. They don't aim to be comprehensive, but rather to be the best place to see how public data is being put to use online. However, it is a pity that the two lists seem to be independent. It would be nice to see which datasets use which interfaces.
One of the visualizations (under interfaces), showing the 2008 presidential donations, pointed out something interesting: often when visualizing data, there are not enough pixels on a screen to represent what you want.
ckan
Those familiar with freshmeat, CPAN or PyPI
can think of CKAN as providing an analogous service for open knowledge.
They package data in a predefined format, which allows them to design an API. In particular, they encourage open data, that is, material that people are free to use, reuse and redistribute without restriction. The predefined package allows them to attach much more meta-data to each submission, and in the long run would allow more automated processing. For example, they allow the download of the meta-data of CiteSeer, which is Dublin Core compliant with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses.
The REST API essentially defines how client software can upload and download data, and allows querying of what resources are available.
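To illustrate what querying, downloading and uploading through a REST interface means for client software, here is a minimal sketch. The server address, the paths and the metadata fields are all hypothetical and are not taken from CKAN's actual API. The requests are only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "https://example.org/api"  # hypothetical server, not CKAN's address


def list_packages_request():
    """Build a request asking which packages are available (hypothetical path)."""
    return urllib.request.Request(f"{BASE_URL}/package", method="GET")


def download_request(name):
    """Build a request for the metadata of one package (hypothetical path)."""
    return urllib.request.Request(f"{BASE_URL}/package/{name}", method="GET")


def upload_request(metadata):
    """Build a request that registers a new package described by a dict."""
    body = json.dumps(metadata).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/package",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The field names below are invented for illustration.
request = upload_request({"name": "my-dataset", "license": "open"})
print(request.get_method(), request.full_url)
```

Passing one of these objects to urllib.request.urlopen would actually send it. The point is only that each operation maps onto an HTTP method and a resource address, which is what makes automated clients possible.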
NIPS Workshop 2008 accepted
by Mikio Braun on September 8, 2008 (10 comments)
We are glad to announce that our workshop at this year's NIPS conference has been accepted! We are tentatively scheduled for Friday, December 12th, 2008. The workshop will be held at Whistler, British Columbia, Canada.
We accept software submissions for the workshop. The deadline for the submissions is October 1, 2008. If accepted, you can present your software to the workshop audience, which is a great opportunity to make your piece of software better known to the NIPS audience.
We have decided to use mloss.org for managing the submissions. You basically just have to register your project with mloss.org and add the tag nips2008 to it. For more information, have a look at the workshop page.
New JMLR-MLOSS publication and progress updates for September 2008
by Soeren Sonnenburg on September 3, 2008 (0 comments)
Again, almost two months have passed since the last progress report. Well, as Cheng already posted, we finally took the time and made a slightly polished version of the mloss.org source code available.
And the usual statistics follow: mloss.org now has 235 registered users and 129 software projects.
Finally, the mloss project liblinear, a library to train linear SVMs in very little time, got accepted in JMLR, and we again highlight the software, interlinking it with the JMLR publication.
Software Freedom Law Center on GPL compliance
by Mikio Braun on August 22, 2008 (0 comments)
The Software Freedom Law Center has posted a guide on how to ensure that you do not violate the GNU General Public License when using GPL'd software in your project. ArsTechnica also has a few comments.
The guide might also come in very handy if your legal department is eager to learn more about the implications of using open source software.