by Cheng Soon Ong on November 14, 2013 (0 comments)
We were very lucky this year to have an amazing set of keynote speakers at ACML 2013 who have made key contributions to getting machine learning into the real world. Here are some links to the open source software projects that they mentioned during their talks. The videos of the talks should be available at some point on the ACML website.
We started off with Geoff Holmes, who spoke at MLOSS 06. He told us about how WEKA has been used in industry (satisfying Kiri Wagstaff's Challenge #2), and the new project for streaming data, MOA. Later in the day, Chih-Jen Lin told us how important it was to understand both machine learning and optimisation, such that you can exploit the special structure for fast training of SVMs. This is how he obtained amazing speedups in LIBLINEAR. On the second day, Ralf Herbrich (who also gave a tutorial) gave us a behind-the-scenes tour of TrueSkill, the player matching algorithm used on Xbox Live. Source code in F# is available here and the version generalised to track skill over time is available here.
Thanks to Geoff, Chih-Jen and Ralf for sharing their enthusiasm!
by Mark Reid on September 1, 2013 (1 comment)
I was recently asked to become an Action Editor for the Machine Learning and Open Source Software (MLOSS) track of Journal of Machine Learning Research. Of course, I gladly accepted since the aim of the JMLR MLOSS track (as well as the broader MLOSS project) -- to encourage the creation and use of open source software within machine learning -- is well aligned with my own interests and attitude towards scientific software.
Shortly after I joined, one of the other editors raised a question about how we are to interpret an item in the review criteria that states that reviewers should consider the "freedom of the code (lack of dependence on proprietary software)" when assessing submissions. What followed was an engaging email discussion amongst the Action Editors about how to clarify our position.
After some discussion (summarised below), we settled on the following guideline which tries to ensure MLOSS projects are as open as possible while recognising the fact that MATLAB, although "closed", is nonetheless widely used within the machine learning community and has an open "work-alike" in the form of GNU Octave:
Dependency on Closed Source Software
We strongly encourage submissions that do not depend on closed source and proprietary software. Exceptions can be made for software that is widely used in a relevant part of the machine learning community and accessible to most active researchers; this should be clearly justified in the submission.
The most common case here is the question whether we will accept software written for Matlab. Given its wide use in the community, there is no strict reject policy for MATLAB submissions, but we strongly encourage submissions to strive for compatibility with Octave unless absolutely impossible.
There were a number of interesting arguments raised during the discussion, so I offered to write them up in this post for posterity and to solicit feedback from the machine learning community at large.
Reviewing and decision making
A couple of arguments were put forward in favour of a strict "no proprietary dependencies" policy.
Firstly, allowing proprietary dependencies may limit our ability to find reviewers for submissions -- an already difficult job. Secondly, stricter policies have the benefit of being unambiguous, which would avoid further discussions about the acceptability of future submissions.
Promoting open ports
An argument made in favour of accepting projects with proprietary dependencies was that doing so may actually increase the chances of its code being forked to produce a version with no such dependencies.
Where do we draw the line?
Some of us had concerns about what exactly constitutes a proprietary dependency and came up with a number of examples that possibly fall into a grey area.
For example, how do operating systems fit into the picture? What if the software in question only compiles on Windows or OS X? These are both widely used but proprietary. Should we ensure MLOSS projects also work on Linux?
Taking a step up the development chain, what if the code base is most easily built using proprietary development tools such as Visual Studio or Xcode? What if libraries such as MATLAB's Statistics Toolbox or Intel's MKL are needed for performance reasons?
Things get even more subtle when we note that certain data formats (e.g., for medical imaging) are proprietary. Should such software be excluded even though the algorithms might work on other data?
These sorts of considerations suggested that a very strict policy may be difficult to enforce in practice.
What is our focus?
It is pretty clear what position Richard Stallman or other fierce free software advocates would take on the above questions: reject all of them! It is not clear that such an extreme position would necessarily suit the goals of the MLOSS track of JMLR.
Put another way, is the focus of MLOSS the "ML" or the "OSS"? The consensus seemed to be that we want to promote open source software to benefit machine learning, not the other way around.
Looking at the data
Towards the end of the discussion, I made the argument that if we cannot be coherent we should at least be consistent and presented some data on all the accepted MLOSS submissions. The list below shows the breakdown of languages used by the 50 projects that have been accepted to the JMLR track to date. I'll note that some projects use and/or target multiple languages and that, because I only spent half an hour surveying the projects, I may have inadvertently misrepresented some (if I've done so, let me know).
C++: 15; Java: 13; MATLAB: 11; Octave: 10; Python: 9; C: 5; R: 4.
From this we can see that MATLAB is fairly well-represented amongst the accepted MLOSS projects. I took a closer look and found that of the 11 projects that are written in (or provide bindings for) MATLAB, all but one are compatible with GNU Octave as well.
I think the position we've adopted is realistic, consistent, and suitably aspirational. We want to encourage and promote projects that strive for openness and the positive effects it enables (e.g., reproducibility and reuse) but do not want to strictly rule out submissions that require a widely used, proprietary platform such as MATLAB.
Of course, a project like MLOSS is only as strong as the community it serves, so we are keen to get feedback about this decision from people who use and create machine learning software. Feel free to leave a comment or contact one of us by email.
Note: This is a cross-post from Mark's blog at Inductio ex Machina.
by Cheng Soon Ong on August 14, 2013 (0 comments)
How good is the software associated with scientific papers? There seems to be a general impression that the quality of scientific software is not that great. How do we check for software quality? Well, by doing code review.
In an interesting experiment between the Mozilla Science Lab and PLoS Computational Biology, a selected number of papers with snippets of code from the latter will be reviewed by engineers from the former.
For more details see the blog post by Kaitlin Thaney.
by Cheng Soon Ong on April 9, 2013 (0 comments)
GSoC has just announced the list of participating organisations. This is a great opportunity for students to get involved in projects that matter, and to learn about code development which is bigger than the standard "one semester" programming project that they are usually exposed to at university.
- 177 of 417 projects were accepted, which is a success rate of 42%.
- 40 of the 177 projects were accepted for the first time, so 23% are new blood.
These seem to be in the same ballpark as most other competitive schemes for obtaining funding. Perhaps there is some type of psychological "mean" which reviewers gravitate to when they are evaluating submissions. For example, consider that out of the 4258 students that applied for projects in 2012, 1212 students got accepted, a rate of 28%.
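The percentages quoted above are easy to check. Here is a minimal sketch that uses only the counts given in this post; the helper name `rate` is made up for illustration.

```python
def rate(accepted: int, total: int) -> int:
    """Return accepted/total as a percentage rounded to a whole number."""
    return round(100 * accepted / total)


# Counts taken from the post above.
project_rate = rate(177, 417)     # accepted out of all that applied
first_time_share = rate(40, 177)  # first-time entries among those accepted
student_rate = rate(1212, 4258)   # students accepted in 2012

print(project_rate, first_time_share, student_rate)
```

All three figures come out as quoted: 42%, 23% and 28%.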
To the students out there, please get in touch with potential mentors before putting in your applications. You'd be surprised at how much it could improve your application!
by Cheng Soon Ong on March 18, 2013 (1 comment)
Mikio and I are writing a book chapter about "Open Science in Machine Learning", which will appear in a collection titled "Implementing Computational Reproducible Research". Among many things, we mentioned that machine learning is about inventing new methods for solving problems. Luis Ibanez from Kitware pounced on this statement, and proceeded to give a wonderful argument that we are confusing our roles as scientists with the pressure of being an inventor. The rest of this post is an exact reproduction of Luis' response to our statement.
“... machine learning is concerned with creating new learning methods to perform well on certain application problems.”
The authors discuss the purpose of machine learning, but under the untold context of “research on machine learning”, and the current landscape of funding research. To clarify, the authors imply that novelty is the purpose of machine learning research. More explicitly, that “developing new methods” is the goal of research.
This is a view (not limited to machine learning) that is commonly widespread, and that in practice is confirmed by the requirements of publishing and pursuit of grant funding. I beg to differ with this view, in the sense that “novelty” is not part of the scientific process at all. Novelty is an artificial condition that has been imposed on scientific workers over the years, due to the need to evaluate performance for the purpose of managing scarce funding resources. The goal of scientific research is to attempt to understand the world by direct observation, crafting of hypothesis and evaluation of hypothesis via reproducible experiments.
The pursuit of novelty (real or apparent) is actually a distraction, and it is one of the major obstacles to the practice of reproducible research. By definition, repeating an experiment, implies, requires and demands to do something that is not new. This distracted overrating of novelty is one of the reasons why scientific workers, and their institutions have come to consider repeatability of experiments as a “waste of time”, since it takes resources away from doing “new things” that could be published or could lead to new streams of funding. This confusion with “novelty” is also behind the lack of interest in reproducing experiments that have been performed by third parties. Since, such actions are “just repeating” what someone else did, and are not adding anything “new”. All, statements that are detrimental to the true practice of the scientific method.
The confusion is evident when one looks at calls for proposals for papers in journals, conferences, or for funding programs. All of them call for “novelty”, none of them (with a handful of exceptions) call for reproducibility. The net effect is that we have confused two very different professions: (a) scientific researcher, with (b) inventor. Scientific researchers should be committed to the application of the scientific method, and in it, there is no requirement for novelty. The main commitment is to craft reproducible experiments, since we are after the truth, not after the new. Inventors on the other hand are in the business of coming up with new devices, and are not committed to understanding the world around us.
Most conferences, journals, and even funding agencies have confused their role of supporting the understanding of the world around us, and have become surrogates for the Patent Office.
In order to make progress in the pursuit of reproducible research, we need to put “novelty” back in its rightful place of being a nice extra secondary or tertiary feature of scientific research, but not a requirement, nor a driving force at all.
by Cheng Soon Ong on February 6, 2013 (1 comment)
One of the tricky decisions software authors have to make is "What license should I use for my software?" A recent article in PLoS Computational Biology discusses the different possible avenues open to authors. It gives a balanced view of software licensing, carefully describing the various dimensions authors of software should consider before coming to a decision.
It recommends the following guidelines:
- For the widest possible distribution consider a permissive FOSS license such as the BSD/MIT, Apache, or ECL.
- For assuring that derivatives also benefit FOSS, choose a copyleft FOSS license like the GPL, LGPL, or MPL.
- For those on the fence, there is hybrid or multi-licensing, which can achieve the benefits of both open source and proprietary software licenses.
- For protecting the confidentiality of your code, there is the proprietary license.
Naturally, since this is an open source venue, I strongly encourage people to consider the first two options. We also discuss the distinction between FOSS licences in our position paper from 2007.
by Martin Krallinger on January 2, 2013 (2 comments)
CALL FOR PARTICIPATION: CHEMDNER task: Chemical compound and drug name recognition task.
( http://www.biocreative.org/tasks/biocreative-iv/chemdner )
TASK GOAL AND MOTIVATION
Machine learning methods have been especially useful for the automatic recognition of entity mentions in text, a crucial step for further natural language processing tasks. This task aims to promote the development of open source software for indexing documents with compounds and recognizing compound mentions in text.
The goal of this task is to promote the implementation of systems that are able to detect mentions in text of chemical compounds and drugs. The recognition of chemical entities is also crucial for other subsequent text processing strategies, such as detection of drug-protein interactions, adverse effects of chemical compounds or the extraction of pathway and metabolic reaction relations. A range of different methods have been explored for the recognition of chemical compound mentions including machine learning based approaches, rule-based systems and different types of dictionary-lookup strategies.
As has been the case in previous BioCreative efforts (resulting in high impact papers in the field), we expect that successful participants will have the opportunity to publish their system descriptions in a journal article.
CHEMDNER DESCRIPTION
CHEMDNER is one of the tracks posed at the BioCreative IV community challenge (http://www.biocreative.org).
We invite participants to submit results for the CHEMDNER task providing predictions for one or both of the following subtasks:
a) Given a set of documents, return for each of them a ranked list of chemical entities described within each of these documents [Chemical document indexing sub-task]
b) Provide for a given document the start and end indices corresponding to all the chemical entities mentioned in this document [Chemical entity mention recognition sub-task].
For these two tasks the organizers will release training and test data collections. The task organizers will provide details on the used annotation guidelines; define a list of criteria for relevant chemical compound entity types as well as selection of documents for annotation.
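To make the mention recognition sub-task concrete, here is a minimal sketch of the dictionary-lookup strategy mentioned earlier. It is not the organizers' method or the official output format: the function name, the toy lexicon, the example sentence, and the offset convention (0-based, exclusive end) are all assumptions made for illustration.

```python
import re
from typing import List, Tuple


def find_mentions(text: str, lexicon: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, surface form) for each lexicon term found in text.

    Offsets are 0-based character positions with an exclusive end. This
    convention is an assumption, not the official CHEMDNER format.
    """
    if not lexicon:
        return []
    # Try longer names first so "sodium chloride" wins over "sodium".
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Toy lexicon and sentence, made up purely for illustration.
toy_lexicon = ["aspirin", "sodium chloride", "sodium"]
sentence = "Aspirin was dissolved in a sodium chloride solution."
mentions = find_mentions(sentence, toy_lexicon)
print(mentions)
```

A real system would need a far larger lexicon and would have to cope with systematic names, abbreviations and formulas, which is where the rule-based and machine learning approaches come in.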
REGISTRATION
Teams can participate in the CHEMDNER task by registering for track 2 of BioCreative IV. You can additionally register for other tracks. To register your team, go to the following page, which provides more detailed instructions: http://www.biocreative.org/news/biocreative-iv/team/
MAILING LIST AND CONTACT INFORMATION
You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html You can also send questions directly to the organizers by e-mail: mkrallinger@cnio.es
WORKSHOP
CHEMDNER is part of the BioCreative evaluation effort. The BioCreative Organizing Committee will host the BioCreative IV Challenge evaluation workshop (http://www.biocreative.org/events/biocreative-iv/CFP/) at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013.
CHEMDNER TASK ORGANIZERS
- Martin Krallinger, Spanish National Cancer Research Center (CNIO)
- Obdulia Rabal, University of Navarra, Spain
- Julen Oyarzabal, University of Navarra, Spain
- Alfonso Valencia, Spanish National Cancer Research Center (CNIO)
by Mikio Braun on December 11, 2012 (0 comments)
PLOS Computational Biology has an interesting Editorial on 10 rules for open development of scientific software. The ten rules are:
- Don't Reinvent the Wheel
- Code Well
- Be Your Own User
- Be Transparent
- Be Simple
- Don't Be a Perfectionist
- Nurture and Grow Your Community
- Promote Your Project
- Find Sponsors
- Science Counts.
The full article can be found here.
by Cheng Soon Ong on November 28, 2012 (0 comments)
I've been following the progress of Software Carpentry for some years now, and have been very impressed by their message that software is the new telescope, and we should invest time and effort to build up skills to ensure that our software is the best quality possible. Otherwise, how can we be sure that our new discoveries are not due to some instrument error?
They wrote a nice short paper titled "Best Practices for Scientific Computing" that highlights practices that would improve the quality of the software, and hence improve research productivity. Here are the 10 recommendations (along with the sub-recommendations).
1. Write programs for people, not computers.
1.1 a program should not require its readers to hold more than a handful of facts in memory at once
1.2 names should be consistent, distinctive, and meaningful
1.3 code style and formatting should be consistent
1.4 all aspects of software development should be broken down into tasks roughly an hour long
2. Automate repetitive tasks.
2.1 rely on the computer to repeat tasks
2.2 save recent commands in a file for re-use
2.3 use a build tool to automate their scientific workflows
3. Use the computer to record history.
3.1 software tools should be used to track computational work automatically
4. Make incremental changes.
4.1 work in small steps with frequent feedback and course correction
5. Use version control.
5.1 use a version control system
5.2 everything that has been created manually should be put in version control
6. Don’t repeat yourself (or others).
6.1 every piece of data must have a single authoritative representation in the system
6.2 code should be modularized rather than copied and pasted
6.3 re-use code instead of rewriting it
7. Plan for mistakes.
7.1 add assertions to programs to check their operation
7.2 use an off-the-shelf unit testing library
7.3 turn bugs into test cases
7.4 use a symbolic debugger
8. Optimize software only after it works correctly.
8.1 use a profiler to identify bottlenecks
8.2 write code in the highest-level language possible
9. Document the design and purpose of code rather than its mechanics.
9.1 document interfaces and reasons, not implementations
9.2 refactor code instead of explaining how it works
9.3 embed the documentation for a piece of software in that software
10. Conduct code reviews.
10.1 use code review and pair programming when bringing someone new up to speed and when tackling particularly tricky design, coding, and debugging problems
10.2 use an issue tracking tool
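As a small illustration of recommendation 7 (plan for mistakes), here is a sketch that combines an assertion (7.1), an off-the-shelf unit testing library (7.2, here Python's standard `unittest`), and a bug turned into a test case (7.3). The `mean` function and the bug story are hypothetical and are not taken from the paper.

```python
import unittest


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    # 7.1: an assertion documents an assumption and checks it at run time.
    assert len(values) > 0, "mean() needs at least one value"
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    def test_typical_input(self):
        self.assertAlmostEqual(mean([1.0, 2.0, 3.0]), 2.0)

    def test_empty_input_is_rejected(self):
        # 7.3: suppose an empty list once caused a confusing
        # ZeroDivisionError deep inside an analysis; keeping that case
        # as a test stops the bug from quietly coming back.
        with self.assertRaises(AssertionError):
            mean([])


# 7.2: load and run the tests with the standard library's test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMean)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that Python skips `assert` statements when run with the `-O` flag, so checks that must always happen are better written as explicit exceptions.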
by Cheng Soon Ong on October 12, 2012 (0 comments)
Under the rather self-deprecating title "I wanted to Predict Elections with Twitter and all I got was this Lousy Paper", Daniel Gayo-Avello takes us on a tour of how hard it is to do reproducible research, and how often authors take shortcuts. From the abstract:
"Predicting X from Twitter is a popular fad within the Twitter research subculture. It seems both appealing and relatively easy. Among such kind of studies, electoral prediction is maybe the most attractive, and at this moment there is a growing body of literature on such a topic. This is not only an interesting research problem but, above all, it is extremely difficult. However, most of the authors seem to be more interested in claiming positive results than in providing sound and reproducible methods."
It is an interesting survey of papers that use Twitter data.
He lists some flaws in current research on electoral predictions, but they are generally applicable to any machine learning paper (my comments in brackets):
- It's not prediction at all! I have not found a single paper predicting a future result. (Neither bootstrap nor cross-validation is prediction.)
- Chance is not a valid baseline...
- There is not a commonly accepted way of "counting votes" in Twitter
- There is not a commonly accepted way of interpreting reality! (In supervised learning, we tend to ignore the fact that there is no ground truth in reality.)
- Sentiment analysis is applied as a black-box... (As machine learning algorithms get more complex, more people will tend to use machine learning software as a black box)
- All the tweets are assumed to be trustworthy. (I don't know if anybody is doing adversarial election prediction)
- Demographics are neglected. (The biased sample problem)
- Self-selection bias.
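One way to see why chance is a weak baseline: when outcomes are lopsided, a trivial rule that always predicts the most common outcome already does far better than a coin flip. The sketch below uses made-up labels, not real election data, and the choice of a majority-class baseline is my own illustration rather than something taken from the paper.

```python
from collections import Counter


def accuracy(predictions, truth):
    """Fraction of positions where the prediction equals the true label."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Made-up outcomes for ten hypothetical races (not real election data).
outcomes = ["incumbent"] * 8 + ["challenger"] * 2

# Baseline: always predict the most common outcome.
majority_label, _ = Counter(outcomes).most_common(1)[0]
baseline_accuracy = accuracy([majority_label] * len(outcomes), outcomes)

# A fair coin between two labels is right half the time in expectation.
chance_accuracy = 0.5

print(baseline_accuracy, chance_accuracy)
```

On these toy labels a method has to beat 80% accuracy, not 50%, before it can claim to add anything.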
The window is closing on those who want to predict the upcoming US elections from X.