Open Thoughts

Reproducibility is not simple

by Cheng Soon Ong on March 30, 2014 (1 comment)

There has been a flurry of articles recently outlining 10 simple rules for X, where X has something to do with data science, computational research and reproducibility. Some examples are:

Best practices

These articles provide a great resource to get started on the long road to doing "proper science". Some common suggestions which are relevant to practical machine learning include:

Use version control

Start now. No, not after your next paper, do it right away! Learn one of the modern distributed version control systems, with git and mercurial currently the most popular, and get an account on GitHub or Bitbucket to start sharing. Even if you don't share your code, it is a convenient offsite backup. GitHub is the most popular for open source projects, but Bitbucket has the advantage of free private accounts. If you have an email address from an educational institution, you also get the premium features for free.

Distributed version control systems can be conceptually daunting, but it is well worth the trouble to understand the concepts instead of just robotically typing in commands. There are numerous tutorials out there; two that I personally found entertaining are git foundations and hginit. For those who don't like the command line, have a look at GUIs such as sourcetree, tortoisegit, tortoisehg, and gitk. If you work with other people, it is worth learning the fork and pull request model and adopting the gitflow convention.
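One concrete way version control pays off for reproducibility is to record, next to every result, the exact commit of the code that produced it. Below is a minimal sketch in Python; the helper and output file names are hypothetical, and it assumes the script is run inside a git working copy.

    # log_commit.py -- hypothetical helper, assumes we are inside a git working copy
    import subprocess

    def current_commit():
        """Return the hash of the commit the working copy is currently at."""
        return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("ascii").strip()

    if __name__ == "__main__":
        # Store the hash next to your results so every figure and table can be
        # traced back to the exact version of the code that produced it.
        with open("results_commit.txt", "w") as f:
            f.write(current_commit() + "\n")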

Please add your favourite tips and tricks in the comments below!

Open source your code and scripts

Publish everything. Even the two lines of Matlab that you used to plot your results. The readers of your NIPS and ICML papers are technical people, and it is often much simpler for them to look at your Matlab plot command than to parse the paragraph that describes the x and y axes, the meaning of the colours and line types, and the specifics of the displayed error bars. Tools such as IPython notebooks and knitr are easy-to-use literate programming frameworks that let you turn your supplement into a live document.
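To make this concrete, here is a sketch of the kind of small plotting script that is worth publishing verbatim. The file name, column layout, and figure name are hypothetical stand-ins for whatever your experiment produces.

    # plot_results.py -- hypothetical plotting script published with the paper
    import numpy as np
    import matplotlib.pyplot as plt

    # columns: training set size, mean accuracy, standard deviation over splits
    results = np.loadtxt("accuracy_vs_training_size.csv", delimiter=",")

    plt.errorbar(results[:, 0], results[:, 1], yerr=results[:, 2], fmt="o-")
    plt.xlabel("number of training examples")
    plt.ylabel("test accuracy (mean and std over 10 splits)")
    plt.savefig("figure2.pdf")

A handful of lines like these answer most of the questions a careful reader would otherwise have to reconstruct from the figure caption.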

It is often useful to conceptually split your computational code into "programs" and "scripts". There is no hard and fast rule for where to draw the line, but one useful way to think about it is to contrast code that can be reused (something to be installed) with code that runs an experiment (something that describes your protocol). An example of the former is your fancy new low-memory logistic regression training and testing code. An example of the latter is the code that generates your plots. Make both types of code open, and document and test them well.
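As a toy illustration of the split, here is a hypothetical sketch in Python: a small reusable function that belongs in the installable "program", followed by the experiment "script" that records the protocol. All names and the toy data are made up.

    # mylib.py -- the "program": reusable, installable, documented and tested
    import numpy as np

    def nearest_mean_classifier(X_train, y_train, X_test):
        """Predict the label of each test point using the closest class mean."""
        classes = np.unique(y_train)
        means = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])
        dists = ((X_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return classes[dists.argmin(axis=1)]

    # experiment.py -- the "script": describes the protocol behind one figure
    # (in a real project it would simply import nearest_mean_classifier from mylib)
    np.random.seed(0)                           # the fixed seed is part of the protocol
    X = np.random.randn(200, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)     # toy labels
    pred = nearest_mean_classifier(X[:100], y[:100], X[100:])
    print("toy accuracy:", (pred == y[100:]).mean())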

Make your data a resource

Your result is also data. When open data is mentioned, most people immediately conjure up images of the inputs to prediction machines. But the intermediate stages of your workflow are often left out when material is made available. For example, if in addition to the two lines of plotting code you also provided the multidimensional array containing your results, your paper becomes a resource for future benchmarking efforts. If you made your precomputed kernel matrices available, other people could easily try out new kernel methods without having to go through the effort of computing the kernel.
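A sketch of what this could look like in practice, with made-up shapes and file names; the point is only that intermediate arrays are cheap to share.

    # save_artifacts.py -- hypothetical example of publishing intermediate results
    import numpy as np

    np.random.seed(0)
    X = np.random.randn(100, 5)

    # the precomputed kernel matrix: expensive to recompute, cheap to share
    K = X.dot(X.T)                  # a linear kernel stands in for your expensive kernel

    # the multidimensional results array, e.g. accuracy for
    # (methods x datasets x cross-validation folds)
    results = np.random.rand(3, 4, 10)

    # a single compressed archive that future benchmarking efforts can load directly
    np.savez_compressed("paper_artifacts.npz", kernel=K, results=results)

    # later, in someone else's code:
    artifacts = np.load("paper_artifacts.npz")
    print(artifacts["results"].shape)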

Efforts such as mldata.org and mlcomp.org provide useful resources to host machine learning oriented datasets. If you do create a dataset, it is useful to get an identifier for it so that people can give you credit.

Challenges to open science

While the articles call these rules "simple", they are by no means easy to implement. Though easy to state, each step towards making your research reproducible comes with many practical hurdles.

Social coding

Unlike publishing a paper, where you do all your work before publication, publishing a piece of software often means that you have to support it in the future. It is remarkably difficult to keep software available in the long term, since most junior researchers move around a lot and often leave academia altogether. It is also challenging to find contributors who can help out in stressful periods, and to keep software up to date and useful. Open source software suffers from the tragedy of the commons, and it quickly becomes difficult to maintain.

While it is generally good for science that everything is open and mistakes are found and corrected, the current incentive structure in academia does not reward support for ongoing projects. Funding is focused on novel ideas, publications are used as metrics for promotion and tenure, and software gets left out.

The secret branch

When developing a new idea, it is often tempting to do so without making it open to public scrutiny. This is similar to the idea of a development branch, but you may wish to keep it secret until publication. The same argument applies to data and results, where there may be a moratorium. I am currently unaware of any tools that allow easy conversion between public and private branches. GitHub allows forks of repositories, which you may be able to make private.

Once researchers get fully involved in an application area, it is inevitable that they start working on the latest data generated by their collaborators. This could be the real-time stream from Twitter or the latest double-blind drug study. Such datasets are often embargoed from being made publicly available due to concerns about privacy. In the area of biomedical research there are efforts, such as dbGaP, to allow bona fide researchers access to data; it provides a resource for both public and private data. A convenient mechanism to facilitate the transition from private to open science, rather than a hurdle, would encourage many new participants.

What is the right access control model for open science?

Data is valuable

It is a natural human tendency to protect a scarce resource that gives one a competitive advantage. For researchers, these resources include source code and data. While it is understandable that the authors of software or the architects of datasets would like to be the first to benefit from their investment, it often happens that these resources are not made publicly available even after publication.


Keynotes at ACML 2013

by Cheng Soon Ong on November 14, 2013 (0 comments)

We were very lucky this year to have an amazing set of keynote speakers at ACML 2013, all of whom have made key contributions to getting machine learning into the real world. Here are some links to the open source software projects that they mentioned during their talks. The videos of the talks should be available at some point on the ACML website.

We started off with Geoff Holmes, who spoke at MLOSS 06. He told us about how WEKA has been used in industry (satisfying Kiri Wagstaff's Challenge #2), and the new project for streaming data MOA. Later in the day, Chih-Jen Lin told us how important it was to understand both machine learning and optimisation, such that you can exploit the special structure for fast training of SVMs. This is how he obtained amazing speedups in LIBLINEAR. On the second day, Ralf Herbrich (who also gave a tutorial) gave us a behind the scenes tour of TrueSkill, the player matching algorithm used on XBox Live. Source code in F# is available here and the version generalised to track skill over time is available here.

Thanks to Geoff, Chih-Jen and Ralf for sharing their enthusiasm!


What does the “OSS” in MLOSS mean?

by Mark Reid on September 1, 2013 (1 comment)

I was recently asked to become an Action Editor for the Machine Learning and Open Source Software (MLOSS) track of Journal of Machine Learning Research. Of course, I gladly accepted since the aim of the JMLR MLOSS track (as well as the broader MLOSS project) -- to encourage the creation and use of open source software within machine learning -- is well aligned with my own interests and attitude towards scientific software.

Shortly after I joined, one of the other editors raised a question about how we are to interpret an item in the review criteria that states that reviewers should consider the "freedom of the code (lack of dependence on proprietary software)" when assessing submissions. What followed was an engaging email discussion amongst the Action Editors about how to clarify our position.

After some discussion (summarised below), we settled on the following guideline which tries to ensure MLOSS projects are as open as possible while recognising the fact that MATLAB, although "closed", is nonetheless widely used within the machine learning community and has an open "work-alike" in the form of GNU Octave:

Dependency on Closed Source Software

We strongly encourage submissions that do not depend on closed source and proprietary software. Exceptions can be made for software that is widely used in a relevant part of the machine learning community and accessible to most active researchers; this should be clearly justified in the submission.

The most common case here is the question of whether we will accept software written for MATLAB. Given its wide use in the community, there is no strict reject policy for MATLAB submissions, but we strongly encourage submissions to strive for compatibility with Octave unless absolutely impossible.

The Discussion

There were a number of interesting arguments raised during the discussion, so I offered to write them up in this post for posterity and to solicit feedback from the machine learning community at large.

Reviewing and decision making

A couple of arguments were put forward in favour of a strict "no proprietary dependencies" policy.

Firstly, allowing proprietary dependencies may limit our ability to find reviewers for submissions -- an already difficult job. Secondly, stricter policies have the benefit of being unambiguous, which would avoid repeated discussions about the acceptability of particular submissions.

Promoting open ports

An argument made in favour of accepting projects with proprietary dependencies was that doing so may actually increase the chances of their code being forked to produce versions without such dependencies.

Mikio Braun explored this idea further along with some broader concerns in a blog post about the role of curation and how it potentially limits collaboration.

Where do we draw the line?

Some of us had concerns about what exactly constitutes a proprietary dependency and came up with a number of examples that possibly fall into a grey area.

For example, how do operating systems fit into the picture? What if the software in question only compiles on Windows or OS X? These are both widely used but proprietary. Should we ensure MLOSS projects also work on Linux?

Taking a step up the development chain, what if the code base is most easily built using proprietary development tools such as Visual Studio or XCode? What if libraries such as MATLAB's Statistics Toolbox or Intel's MKL library are needed for performance reasons?

Things get even more subtle when we note that certain data formats (e.g., for medical imaging) are proprietary. Should such software be excluded even though the algorithms might work on other data?

These sorts of considerations suggested that a very strict policy may be difficult to enforce in practice.

What is our focus?

It is pretty clear what position Richard Stallman or other fierce free software advocates would take on the above questions: reject all of them! It is not clear that such an extreme position would necessarily suit the goals of the MLOSS track of JMLR.

Put another way, is the focus of MLOSS the "ML" or the "OSS"? The consensus seemed to be that we want to promote open source software to benefit machine learning, not the other way around.

Looking At The Data

Towards the end of the discussion, I made the argument that if we cannot be coherent we should at least be consistent and presented some data on all the accepted MLOSS submissions. The list below shows the breakdown of languages used by the 50 projects that have been accepted to the JMLR track to date. I'll note that some projects use and/or target multiple languages and that, because I only spent half an hour surveying the projects, I may have inadvertently misrepresented some (if I've done so, let me know).

  • C++: 15
  • Java: 13
  • MATLAB: 11
  • Octave: 10
  • Python: 9
  • C: 5
  • R: 4

From this we can see that MATLAB is fairly well represented amongst the accepted MLOSS projects. I took a closer look and found that of the 11 projects that are written in (or provide bindings for) MATLAB, all but one also support GNU Octave.

Closing Thoughts

I think the position we've adopted is realistic, consistent, and suitably aspirational. We want to encourage and promote projects that strive for openness and the positive effects it enables (e.g., reproducibility and reuse) but do not want to strictly rule out submissions that require a widely used, proprietary platform such as MATLAB.

Of course, a project like MLOSS is only as strong as the community it serves, so we are keen to get feedback about this decision from people who use and create machine learning software. Feel free to leave a comment or contact one of us by email.

Note: This is a cross-post from Mark's blog at Inductio ex Machina.


Code review for science

by Cheng Soon Ong on August 14, 2013 (0 comments)

How good is the software associated with scientific papers? There seems to be a general impression that the quality of scientific software is not that great. How do we check for software quality? Well, by doing code review.

In an interesting experiment between the Mozilla Science Lab and PLoS Computational Biology, code snippets from a selected number of the journal's papers will be reviewed by Mozilla engineers.

For more details see the blog post by Kaitlin Thaney.


GSoC 2013

by Cheng Soon Ong on April 9, 2013 (0 comments)

GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development on a larger scale than the standard "one semester" programming project they are usually exposed to at university.

Some statistics:

  • 177 of 417 projects were accepted, which is a success rate of 42%.
  • 40 of the 177 projects were accepted for the first time, which is a 23% proportion of new blood.

These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some kind of psychological "mean" that reviewers gravitate towards when evaluating submissions. For example, of the 4258 students who applied for projects in 2012, 1212 were accepted, a rate of 28%.

To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!


Scientist vs Inventor

by Cheng Soon Ong on March 18, 2013 (1 comment)

Mikio and I are writing a book chapter about "Open Science in Machine Learning", which will appear in a collection titled "Implementing Computational Reproducible Research". Among many things, we mentioned that machine learning is about inventing new methods for solving problems. Luis Ibanez from Kitware pounced on this statement, and proceeded to give a wonderful argument that we are confusing our roles as scientists with the pressure of being an inventor. The rest of this post is an exact reproduction of Luis' response to our statement.


“... machine learning is concerned with creating new learning methods to perform well on certain application problems.”.

The authors discuss the purpose of machine learning, but under the untold context of “research on machine learning”, and the current landscape of funding research. To clarify, the authors imply that novelty is the purpose of machine learning research. More explicitly, that “developing new methods” is the goal of research.

This is a widespread view (not limited to machine learning), and one that in practice is reinforced by the requirements of publishing and the pursuit of grant funding. I beg to differ with this view, in the sense that "novelty" is not part of the scientific process at all. Novelty is an artificial condition that has been imposed on scientific workers over the years, due to the need to evaluate performance for the purpose of managing scarce funding resources. The goal of scientific research is to attempt to understand the world by direct observation, the crafting of hypotheses, and the evaluation of those hypotheses via reproducible experiments.

The pursuit of novelty (real or apparent) is actually a distraction, and it is one of the major obstacles to the practice of reproducible research. By definition, repeating an experiment implies, requires, and demands doing something that is not new. This distracted overrating of novelty is one of the reasons why scientific workers and their institutions have come to consider the repeatability of experiments a "waste of time", since it takes resources away from doing "new things" that could be published or could lead to new streams of funding. This confusion with "novelty" is also behind the lack of interest in reproducing experiments performed by third parties, since such actions are "just repeating" what someone else did and do not add anything "new". All of these are attitudes detrimental to the true practice of the scientific method.

The confusion is evident when one looks at calls for papers in journals and conferences, or at calls for funding programs. All of them call for "novelty"; none of them (with a handful of exceptions) call for reproducibility. The net effect is that we have confused two very different professions: (a) scientific researcher, and (b) inventor. Scientific researchers should be committed to the application of the scientific method, and in it there is no requirement for novelty. The main commitment is to craft reproducible experiments, since we are after the truth, not after the new. Inventors, on the other hand, are in the business of coming up with new devices, and are not committed to understanding the world around us.

Most conferences, journals, and even funding agencies have confused their role of supporting the understanding of the world around us, and have become surrogates for the Patent Office.

In order to make progress in the pursuit of reproducible research, we need to put "novelty" back in its rightful place: a nice secondary or tertiary feature of scientific research, but not a requirement, nor a driving force at all.


Software Licensing

by Cheng Soon Ong on February 6, 2013 (1 comment)

One of the tricky decisions software authors have to make is "What license should I use for my software?" A recent article in PLoS Computational Biology discusses the different possible avenues open to authors. It gives a balanced view of software licensing, carefully describing the various dimensions authors of software should consider before coming to a decision.

It recommends the following guidelines:

  • For the widest possible distribution consider a permissive FOSS license such as the BSD/MIT, Apache, or ECL.
  • For assuring that derivatives also benefit FOSS, choose a copyleft FOSS license like the GPL, LGPL, or MPL.
  • For those on the fence, there are hybrid and multi-licensing approaches that can achieve the benefits of both open source and proprietary software licenses.
  • For protecting the confidentiality of your code, there is the proprietary license.

Naturally, this being an open source venue, I strongly encourage people to consider the first two options. We also discuss the distinction between FOSS licences in our position paper from 2007.
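Whichever license you choose, it helps to declare it in a machine-readable place in addition to shipping a LICENSE file. Here is a hypothetical Python packaging sketch; the package name and version are placeholders, and the license string should match whatever you actually chose.

    # setup.py -- hypothetical packaging metadata declaring the chosen license
    from setuptools import setup, find_packages

    setup(
        name="mytool",                  # placeholder package name
        version="0.1.0",
        packages=find_packages(),
        license="BSD-3-Clause",         # keep consistent with the LICENSE file you ship
        classifiers=[
            "License :: OSI Approved :: BSD License",
        ],
    )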


Chemical compound and drug name recognition task.

by Martin Krallinger on January 2, 2013 (2 comments)

CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task.

( http://www.biocreative.org/tasks/biocreative-iv/chemdner )

TASK GOAL AND MOTIVATION

Machine learning methods have been especially useful for the automatic recognition of entity mentions in text, a crucial step for further natural language processing tasks. This task aims to promote the development of open source software for indexing documents with compounds and recognizing compound mentions in text.

The goal of this task is to promote the implementation of systems that are able to detect mentions in text of chemical compounds and drugs. The recognition of chemical entities is also crucial for other subsequent text processing strategies, such as detection of drug-protein interactions, adverse effects of chemical compounds or the extraction of pathway and metabolic reaction relations. A range of different methods have been explored for the recognition of chemical compound mentions including machine learning based approaches, rule-based systems and different types of dictionary-lookup strategies.

As has been the case in previous BioCreative efforts (resulting in high impact papers in the field), we expect that successful participants will have the opportunity to publish their system descriptions in a journal article.

CHEMDNER DESCRIPTION

The CHEMDNER task is one of the tracks of the BioCreative IV community challenge (http://www.biocreative.org).

We invite participants to submit results for the CHEMDNER task providing predictions for one or both of the following subtasks:

a) Given a set of documents, return for each of them a ranked list of chemical entities described within each of these documents [Chemical document indexing sub-task]

b) Provide for a given document the start and end indices corresponding to all the chemical entities mentioned in this document [Chemical entity mention recognition sub-task].
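To make sub-task (b) concrete, here is a toy dictionary-lookup baseline in Python that returns character offsets for matched mentions. The dictionary and the example document are made up, and the official corpus and submission formats are of course defined by the organizers.

    # toy_chemdner_baseline.py -- hypothetical dictionary-lookup baseline for sub-task (b)
    import re

    CHEMICALS = ["aspirin", "ibuprofen", "caffeine", "ethanol"]   # toy dictionary
    PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, CHEMICALS)) + r")\b",
                         re.IGNORECASE)

    def find_mentions(text):
        """Return (start, end, mention) character offsets for dictionary hits."""
        return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(text)]

    if __name__ == "__main__":
        doc = "Caffeine and aspirin are often combined in analgesics."
        for start, end, mention in find_mentions(doc):
            print(start, end, mention)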

For these two sub-tasks the organizers will release training and test data collections. The organizers will also provide details of the annotation guidelines used, define the criteria for the relevant chemical compound entity types, and describe how the documents for annotation were selected.

REGISTRATION

Teams can participate in the CHEMDNER task by registering for track 2 of BioCreative IV. You can additionally register for other tracks too. To register your team, go to the following page, which provides more detailed instructions: http://www.biocreative.org/news/biocreative-iv/team/

MAILING LIST AND CONTACT INFORMATION

You can post questions related to the CHEMDNER task to the BioCreative mailing list; to register for the mailing list, please visit http://biocreative.sourceforge.net/mailing.html. You can also send questions directly to the organizers by e-mail: mkrallinger@cnio.es

WORKSHOP

CHEMDNER is part of the BioCreative evaluation effort. The BioCreative Organizing Committee will host the BioCreative IV Challenge evaluation workshop (http://www.biocreative.org/events/biocreative-iv/CFP/) at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013.

CHEMDNER TASK ORGANIZERS

  • Martin Krallinger, Spanish National Cancer Research Center (CNIO)
  • Obdulia Rabal, University of Navarra, Spain
  • Julen Oyarzabal, University of Navarra, Spain
  • Alfonso Valencia, Spanish National Cancer Research Center (CNIO)

REFERENCES

  • Vazquez, M., Krallinger, M., Leitner, F., & Valencia, A. (2011). Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Molecular Informatics, 30(6-7), 506-519.
  • Corbett, P., Batchelor, C., & Teufel, S. (2007). Annotation of chemical named entities. BioNLP 2007: Biological, translational, and clinical language processing, 57-64.
  • Klinger, R., Kolářik, C., Fluck, J., Hofmann-Apitius, M., & Friedrich, C. M. (2008). Detection of IUPAC and IUPAC-like chemical names. Bioinformatics, 24(13), i268-i276.
  • Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Mulligen, E. M. V., ... & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22), 2983-2991.
  • Yeh, A., Morgan, A., Colosimo, M., & Hirschman, L. (2005). BioCreAtIvE task 1A: gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1), S2.
  • Smith, L., Tanabe, L. K., Ando, R. J., Kuo, C. J., Chung, I. F., Hsu, C. N., ... & Wilbur, W. J. (2008). Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2), S2.


Paper "Ten Simple Rules for the Open Development of Scientific Software" by Prlic and Procter

by Mikio Braun on December 11, 2012 (0 comments)

PLOS Computational Biology has an interesting Editorial on 10 rules for open development of scientific software. The ten rules are:

  1. Don't Reinvent the Wheel
  2. Code Well
  3. Be Your Own User
  4. Be Transparent
  5. Be Simple
  6. Don't Be a Perfectionist
  7. Nurture and Grow Your Community
  8. Promote Your Project
  9. Find Sponsors
  10. Science Counts.

The full article can be found here.


Best Practices for Scientific Computing

by Cheng Soon Ong on November 28, 2012 (0 comments)

I've been following the progress of Software Carpentry for some years now, and have been very impressed by their message that software is the new telescope, and we should invest time and effort to build up skills to ensure that our software is the best quality possible. Otherwise, how can we be sure that our new discoveries are not due to some instrument error?

They wrote a nice short paper titled "Best Practices for Scientific Computing" that highlights practices that would improve the quality of the software, and hence improve research productivity. Here are the 10 recommendations (along with the sub-recommendations).

1. Write programs for people, not computers.
   1.1 a program should not require its readers to hold more than a handful of facts in memory at once
   1.2 names should be consistent, distinctive, and meaningful
   1.3 code style and formatting should be consistent
   1.4 all aspects of software development should be broken down into tasks roughly an hour long
2. Automate repetitive tasks.
   2.1 rely on the computer to repeat tasks
   2.2 save recent commands in a file for re-use
   2.3 use a build tool to automate their scientific workflows
3. Use the computer to record history.
   3.1 software tools should be used to track computational work automatically
4. Make incremental changes.
   4.1 work in small steps with frequent feedback and course correction
5. Use version control.
   5.1 use a version control system
   5.2 everything that has been created manually should be put in version control
6. Don't repeat yourself (or others).
   6.1 every piece of data must have a single authoritative representation in the system
   6.2 code should be modularized rather than copied and pasted
   6.3 re-use code instead of rewriting it
7. Plan for mistakes. (see the sketch after this list)
   7.1 add assertions to programs to check their operation
   7.2 use an off-the-shelf unit testing library
   7.3 turn bugs into test cases
   7.4 use a symbolic debugger
8. Optimize software only after it works correctly.
   8.1 use a profiler to identify bottlenecks
   8.2 write code in the highest-level language possible
9. Document the design and purpose of code rather than its mechanics.
   9.1 document interfaces and reasons, not implementations
   9.2 refactor code instead of explaining how it works
   9.3 embed the documentation for a piece of software in that software
10. Conduct code reviews.
   10.1 use code review and pair programming when bringing someone new up to speed and when tackling particularly tricky design, coding, and debugging problems
   10.2 use an issue tracking tool
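Rule 7 is perhaps the easiest one to act on today. Here is a minimal sketch (the function and test names are hypothetical) combining an assertion that checks operation (7.1), an off-the-shelf unit testing library (7.2), and a former bug turned into a test case (7.3).

    # test_normalize.py -- hypothetical sketch of recommendations 7.1-7.3
    import unittest
    import numpy as np

    def normalize(x):
        """Scale a vector to unit length (7.1: assert the precondition)."""
        x = np.asarray(x, dtype=float)
        norm = np.linalg.norm(x)
        assert norm > 0, "cannot normalize the zero vector"
        return x / norm

    class TestNormalize(unittest.TestCase):          # 7.2: off-the-shelf unit testing library
        def test_unit_length(self):
            self.assertAlmostEqual(np.linalg.norm(normalize([3.0, 4.0])), 1.0)

        def test_zero_vector_rejected(self):         # 7.3: a former bug, now a test case
            with self.assertRaises(AssertionError):
                normalize([0.0, 0.0])

    if __name__ == "__main__":
        unittest.main()

Running the file executes both tests, so the old bug can never silently reappear.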