by Cheng Soon Ong on October 30, 2014 (0 comments)
How many of the papers that are in the top 100 most cited about software?
21, with an additional 12 papers which are not specifically about software itself, but about methods or statistics that were implemented later in software. When you take a step back and think about the myriad areas of research and the stratospheric numbers of citations the top 100 get, it is quite remarkable that one fifth of the papers are actually about software. I mean really about software, not software as an afterthought. Some examples:
- The CLUSTAL_X Windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools.
- PROCHECK: a program to check the stereochemical quality of protein structures.
- MrBayes 3: Bayesian phylogenetic inference under mixed models
To put in perspective how rarified the air is in the top 100 citations, the if we combined all citations received by all JMLR papers in the last five years (according to SCImago), this one gigantic paper would not even make it into the top 100.
Yes, yes, citations do not directly measure the quality of the paper, and there are size of community effects and all that. To be frank, being highly cited seems to be mostly luck.
In the spirit of open science, here is a bar plot showing these numbers, and here is my annotated table which I updated from the original table. For a more mainstream view of the data, look at the Nature article.
by Cheng Soon Ong on July 28, 2014 (1 comment)
Just in case there are people who follow this blog but not John Langford's, there is going to be an open machine learning workshop on 22 August 2014 in MSR, New York, organised by Alekh Agarwal, Alina Beygelzimer, and John Langford.
As it says on John's blog: If you are interested, please email msrnycrsvp at microsoft.com and say “I want to come” so we can get a count of attendees for refreshments.
by Cheng Soon Ong on July 22, 2014 (1 comment)
What would you include a linux distribution to customise it for machine learning researchers and developers? Which are the tools that would cover the needs of 90% of PhD students who aim to do a PhD related to machine learning? How would you customise a mainstream linux distribution to (by default) include packages that would allow the user to quickly be able to do machine learning on their laptop?
There are several communities which have their own custom distribution:
- Scientific Linux which is based on Red Hat Enterprise Linux is focused making it easy for system administrators of larger organisations. The two big users are FermiLab and CERN who each have their own custom "spin". Because of its experimental physics roots, it does not have a large collection of pre-installed scientific software, but makes it easy for users to install their own.
- Bio-Linux is at the other end of the spectrum. Based on Ubuntu, it aims to provide a easy to use bioinformatics workstation by including more than 500 bioinformatics programs, including graphical menus to them and sample data for testing them. It is targeted at the end user, with simple instructions for running it Live from DVD or USB, to install it, and to dual boot it.
- Fedora Scientific is the latest entrant, providing a nice list of numerical tools, visualisation packages and also LaTeX packages. Its documentation lists packages for C, C++, Octave, Python, R and Java. Version control is also not forgotten. A recent summary of Fedora Scientific was written as part of Open Source Week.
It would seem that Fedora Scientific would satisfy the majority of machine learning researchers, since it provides packages for most things already. Some additional tools that may be useful include:
- tools for managing experiments and collecting results, to make our papers replicable
- GPU packages for CUDA and OpenCL
- Something for managing papers for reading, similar to Mendeley
- Something for keeping track of ideas and to do lists, similar to Evernote
There's definitely tons of stuff that I've forgotten!
Perhaps a good way to start is to have the list of package names useful for the machine learning researcher in some popular package managers such as yum, apt-get, dpkg. Please post your favourite packages in the comments.
by Cheng Soon Ong on June 3, 2014 (0 comments)
GSoC 2014 is between 19 May and 18 August this year. The students should now be just sinking their teeth into the code, and hopefully having a lot of fun while gaining invaluable experience. This amazing program is in its 10th year now, and it is worth repeating how it benefits everyone:
students - You learn how to write code in a team, and work on projects that are long term. Suddenly, all the software engineering lectures make sense! Having GSoC in your CV really differentiates you from all the other job candidates out there. Best of all, you actually have something to show your future employer that cannot be made up.
mentors - You get help for your favourite feature in a project that you care about. For many, it is a good introduction to project management and supervision.
organisation - You recruit new users and, if you are lucky, new core contributors. GSoC experience also tends to push projects to be more beginner friendly, and to make it easier for new developers to get involved.
I was curious about how many machine learning projects were in GSoC this year and wrote a small ipython notebook to try to find out.
Looking at the organisations with the most students, I noticed that the Technical University Vienna has come together and joined as a mentoring organisation. This is an interesting development, as it allows different smaller projects (the titles seem disparate) to come together and benefit from a more sustainable open source project.
On to machine learning... Using a bunch of heuristics, I tried to identify machine learning projects from the organisation name and project titles. I found more than 20 projects with variations of "learn" in them. This obviously misses out projects from R some of which are clearly machine learning related, but I could not find a rule to capture them. I am pretty sure I am missing others too. I played around with some topic modelling, but this is hampered by the fact that I could not figure out a way to scrape the project descriptions from the dynamically generated list of project titles on the GSoC page.
Please update the source with your suggestions!
by Cheng Soon Ong on March 30, 2014 (1 comment)
There has been a flurry of articles recently outlining 10 simple rules for X, where X has something to do with data science, computational research and reproducibility. Some examples are:
- 10 Simple Rules for the Care and Feeding of Scientific Data (kudos for using a open collaborative writing tool!)
- Ten Simple Rules for Reproducible Computational Research
- Ten Simple Rules for Effective Computational Research
These articles provide a great resource to get started on the long road to doing "proper science". Some common suggestions which are relevant to practical machine learning include:
Use version control
Start now. No, not after your next paper, do it right away! Learn one of the modern distributed version control systems, git or mercurial currently being the most popular, and get an account on github or bitbucket to start sharing. Even if you don't share your code, it is a convenient offsite backup. Github is the most popular for open source projects, but bitbucket has the advantage of free private accounts. If you have an email address from an educational institution, you get the premium features for free too.
Distributed version control systems can be conceptually daunting, but it is well worth the trouble to understand the concepts instead of just robotically type in commands. There are numerous tutorials out there, and here are some which I personally found entertaining, git foundations and hginit. For those who don't like the command line, have a look at GUIs such as sourcetree, tortoisegit, tortoisehg, and gitk. If you work with other people, it is worth learning the fork and pull request model, and use the gitflow convention.
Please add your favourite tips and tricks in the comments below!
Open source your code and scripts
Publish everything. Even the two lines of Matlab that you used to plot your results. The readers of your NIPS and ICML papers are technical people, and it is often much simpler for them to look at your Matlab plot command than to parse the paragraph that describes the x and y axes, the meaning of the colours and line types, and the specifics of the displayed error bars. Tools such as ipython notebooks and knitr are examples of easy to implement literate programming frameworks that allow you to make your supplement a live document.
It is often useful to try to conceptually split your computational code into "programs" and "scripts". There is no hard and fast rule for where to draw the line, but one useful way to think about it is to contrast code that can be reused (something to be installed), and code that runs an experiment (something that describes your protocol). An example of the former is your fancy new low memory logistic regression training and testing code. An example of the latter is code to generate your plots. Make both types of code open, document and test them well.
Make your data a resource
Your result is also data. When open data is mentioned, most people immediately conjure images of the inputs to prediction machines. But intermediate stages of your workflow are often left out of making things available. For example, if in addition to providing the two lines of code for plotting, you also provided your multidimensional array containing your results, your paper now becomes a resource for future benchmarking efforts. If you made your precomputed kernel matrices available, other people can easily try out new kernel methods without having to go through the effort of computing the kernel.
Efforts such as mldata.org and mlcomp.org provide useful resources to host machine learning oriented datasets. If you do create a dataset, it is useful to get an identifier for it so that people can give you credit.
Challenges to open science
While the articles call these rules "simple", they are by no means easy to implement. While easy to state, there are many practical hurdles to making every step of your research reproducible .
Unlike publishing a paper, where you do all your work before publication, publishing a piece of software often means that you have to support it in future. It is remarkably difficult to keep software available in the long term, since most junior researchers move around a lot and often leave academia altogether. It is also challenging to find contributors that can help out in stressful periods, and to keep software up to date and useful. Open source software suffers from the tragedy of the commons, and it quickly becomes difficult to maintain.
While it is generally good for science that everything is open and mistakes are found and corrected, the current incentive structure in academia does not reward support for ongoing projects. Funding is focused on novel ideas, publications are used as metrics for promotion and tenure, and software gets left out.
The secret branch
When developing a new idea, it is often tempting to do so without making it open to public scrutiny. This is similar to the idea of a development branch, but you may wish to keep it secret until publication. The same argument applies for data and results, where there may be a moratorium. I am currently unaware of any tools that allow easy conversion between public and private branches. Github allows forks of repositories, which you may be able to make private.
Once a researcher gets fully involved in an application area, it is inevitable that he starts working on the latest data generated by his collaborators. This could be the real time stream from Twitter or the latest double blind drug study. Such datasets are often embargoed from being made publicly available due to concerns about privacy. In the area of biomedical research there are efforts to allow bona fide researchers access to data, such as dbGaP. It seamlessly provides a resource for public and private data. Instead of a hurdle, a convenient mechanism to facilitate the transition from private to open science would encourage many new participants.
What is the right access control model for open science?
Data is valuable
It is a natural human tendency to protect a scarce resource which gives them a competitive advantage. For researchers, these resources include source code and data. While it is understandable that authors of software or architects of datasets would like to be the first to benefit from their investment, it often happens that these resources are not made publicly available even after publication.
by Cheng Soon Ong on November 14, 2013 (0 comments)
We were very lucky this year to have an amazing set of keynote speakers at ACML 2013 who have made key contributions to getting machine learning into the real world. Here are some links to the open source software projects that they mentioned during their talks. The videos of the talks should be available at some point on the ACML website
We started off with Geoff Holmes, who spoke at MLOSS 06. He told us about how WEKA has been used in industry (satisfying Kiri Wagstaff's Challenge #2), and the new project for streaming data MOA. Later in the day, Chih-Jen Lin told us how important it was to understand both machine learning and optimisation, such that you can exploit the special structure for fast training of SVMs. This is how he obtained amazing speedups in LIBLINEAR. On the second day, Ralf Herbrich (who also gave a tutorial) gave us a behind the scenes tour of TrueSkill, the player matching algorithm used on XBox Live. Source code in F# is available here and the version generalised to track skill over time is available here.
Thanks to Geoff, Chih-Jen and Ralf for sharing their enthusiasm!
by Mark Reid on September 1, 2013 (1 comment)
I was recently asked to become an Action Editor for the Machine Learning and Open Source Software (MLOSS) track of Journal of Machine Learning Research. Of course, I gladly accepted since the aim of the JMLR MLOSS track (as well as the broader MLOSS project) -- to encourage the creation and use of open source software within machine learning -- is well aligned with my own interests and attitude towards scientific software.
Shortly after I joined, one of the other editors raised a question about how we are to interpret an item in the review criteria that states that reviewers should consider the "freedom of the code (lack of dependence on proprietary software)" when assessing submissions. What followed was an engaging email discussion amongst the Action Editors about the how to clarify our position.
After some discussion (summarised below), we settled on the following guideline which tries to ensure MLOSS projects are as open as possible while recognising the fact that MATLAB, although "closed", is nonetheless widely used within the machine learning community and has an open "work-alike" in the form of GNU Octave:
Dependency on Closed Source Software
We strongly encourage submissions that do not depend on closed source and proprietary software. Exceptions can be made for software that is widely used in a relevant part of the machine learning community and accessible to most active researchers; this should be clearly justified in the submission.
The most common case here is the question whether we will accept software written for Matlab. Given its wide use in the community, there is no strict reject policy for MATLAB submissions, but we strongly encourage submissions to strive for compatibility with Octave unless absolutely impossible.
There were a number of interesting arguments raised during the discussion, so I offered to write them up in this post for posterity and to solicit feedback from the machine learning community at large.
Reviewing and decision making
A couple of arguments were put forward in favour of a strict "no proprietary dependencies" policy.
Firstly, allowing proprietary dependencies may limit our ability to find reviewers for submissions -- an already difficult job. Secondly, stricter policies have the benefit of being unambiguous, which would avoid future discussions about the acceptability of future submission.
Promoting open ports
An argument made in favour of accepting projects with proprietary dependencies was that doing so may actually increase the chances of its code being forked to produce a version with no such dependencies.
Where do we draw the line?
Some of us had concerns about what exactly constitutes a proprietary dependency and came up with a number of examples that possibly fall into a grey area.
For example, how do operating systems fit into the picture? What if the software in question only compiles on Windows or OS X? These are both widely used but proprietary. Should we ensure MLOSS projects also work on Linux?
Taking a step up the development chain, what if the code base is most easily built using proprietary development tools such as Visual Studio or XCode? What if libraries such as MATLAB's Statistics Toolbox or Intel's MKL library are needed for performance reasons?
Things get even more subtle when we note that certain data formats (e.g., for medical imaging) are proprietary. Should such software be excluded even though the algorithms might work on other data?
These sorts of considerations suggested that a very strict policy may be difficult to enforce in practice.
What is our focus?
It is pretty clear what position Richard Stallman or other fierce free software advocates would take on the above questions: reject all of them! It is not clear that such an extreme position would necessarily suit the goals of the MLOSS track of JMLR.
Put another way, is the focus of MLOSS the "ML" or the "OSS"? The consensus seemed to be that we want to promote open source software to benefit machine learning, not the other way around.
Looking At The Data
Towards the end of the discussion, I made the argument that if we cannot be coherent we should at least be consistent and presented some data on all the accepted MLOSS submissions. The list below shows the breakdown of languages used by the 50 projects that have been accepted to the JMLR track to date. I'll note that some projects use and/or target multiple languages and that, because I only spent half an hour surveying the projects, I may have inadvertently misrepresented some (if I've done so, let me know).
C++: 15; Java: 13; MATLAB:11; Octave: 10; Python:9; C: 5; R: 4.
From this we can see that MATLAB is fairly well-represented amongst the accepted MLOSS projects. I took a closer look and found that of the 11 projects that are written in (or provide bindings for) MATLAB, all but one of them provide support for GNU Octave compatibility as well.
I think the position we've adopted is realistic, consistent, and suitably aspirational. We want to encourage and promote projects that strive for openness and the positive effects it enables (e.g., reproducibility and reuse) but do not want to strictly rule out submissions that require a widely used, proprietary platform such as MATLAB.
Of course, a project like MLOSS is only as strong as the community it serves so we are keen to get feedback about this decision from people who use and create machine learning software so feel free to leave a comment or contact one of us by email.
Note: This is a cross-post from Mark's blog at Inductio ex Machina.
by Cheng Soon Ong on August 14, 2013 (0 comments)
How good is the software associated with scientific papers? There seems to be a general impression that the quality of scientific software is not that great. How do we check for software quality? Well, by doing code review.
In an interesting experiment between the Mozilla Science Lab and PLoS Computational Biology, a selected number of papers with snippets of code from the latter will be reviewed by engineers from the former.
For more details see the blog post by Kaitlin Thaney.
by Cheng Soon Ong on April 9, 2013 (0 comments)
GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development which is bigger than the standard "one semester" programming project that they are usually exposed to at university.
- 177 of 417 projects were accepted, which is a success rate of 42%.
- 40 of the 177 project are accepted for the first time, which is a 23% proportion of new blood.
These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some type of psychological "mean" which reviewers gravitate to when they are evaluating submissions. For example, consider that out of the 4258 students that applied for projects in 2012, 1212 students got accepted, a rate of 28%.
To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!
by Cheng Soon Ong on March 18, 2013 (1 comment)
Mikio and I are writing a book chapter about "Open Science in Machine Learning", which will appear in a collection titled "Implementing Computational Reproducible Research". Among many things, we mentioned that machine learning is about inventing new methods for solving problems. Luis Ibanez from Kitware pounced on this statement, and proceeded to give a wonderful argument that we are confusing our roles as scientists with the pressure of being an inventor. The rest of this post is an exact reproduction of Luis' response to our statement.
“... machine learning is concerned with creating new learning methods to perform well on certain application problems.”.
The authors discuss the purpose of machine learning, but under the untold context of “research on machine learning”, and the current landscape of funding research. To clarify, the authors imply that novelty is the purpose of machine learning research. More explicitly, that “developing new methods” is the goal of research.
This is a view (not limited to machine learning) that is commonly widespread, and that in practice is confirmed by the requirements of publishing and pursuit of grant funding. I beg to differ with this view, in the sense that “novelty” is not part of the scientific process at all. Novelty is an artificial condition that has been imposed on scientific workers over the years, due to the need to evaluate performance for the purpose of managing scarce funding resources. The goal of scientific research is to attempt to understand the world by direct observation, crafting of hypothesis and evaluation of hypothesis via reproducible experiments.
The pursuit of novelty (real or apparent) is actually a distraction, and it is one of the major obstacles to the practice of reproducible research. By definition, repeating an experiment, implies, requires and demands to do something that is not new. This distracted overrating of novelty is one of the reasons why scientific workers, and their institutions have come to consider repeatability of experiments as a “waste of time”, since it takes resources away from doing “new things” that could be published or could lead to new streams of funding. This confusion with “novelty” is also behind the lack of interest in reproducing experiments that have been performed by third parties. Since, such actions are “just repeating” what someone else did, and are not adding anything “new”. All, statements that are detrimental to the true practice of the scientific method.
The confusion is evident when one look at calls for proposals for papers in journal, conferences, or for funding programs. All of them call for “novelty”, none of them (with a handful of exceptions) call for reproducibility. The net effect is that we have confused two very different professions: (a) scientific researcher, with (b) inventor. Scientific researchers should be committed to the application of the scientific method, and in it, there is no requirement for novelty. The main commitment is to craft reproducible experiments, since we are after the truth, not after the new. Inventors on the other hand are in the business of coming up with new devices, and are not committed to understanding the world around us.
Most conference, journals, and even funding agencies have confused their role of supporting the understanding the world around us, and have become surrogates for the Patent Office.
In order to make progress in the pursuit of reproducible research, we need to put “novelty” back in its rightful place of being a nice extra secondary or tertiary feature of scientific research, but not a requirement, nor a driving force at all.