Purpose and Capabilities
FACTORIE aims to provide a full-featured framework for probabilistic graphical models (both directed and undirected) that is both flexible for rapid prototyping and efficient at large scale for deployment in substantial applications.
It supplies infrastructure for representing random variables, creating dependencies among them with factors, running a variety of inference procedures on simple or complex dependency structures, and estimating parameters by various state-of-the-art methods.
FACTORIE's key features include the following:
- It is object-oriented, enabling encapsulation, abstraction and inheritance in the definition of random variables, factors, inference and learning methods.
- It is scalable, with demonstrated success on problems with billions of variables and factors, and on models that have changing structure, such as case factor diagrams. It has also been plugged into a database back-end, representing a new approach to probabilistic databases capable of handling many billions of variables.
- It is flexible, supporting multiple modeling and inference paradigms. Its original emphasis was on conditional random fields, undirected graphical models, MCMC inference, online training, and discriminative parameter estimation. However, it now also supports directed generative models (such as latent Dirichlet allocation), and has support for variational inference, including belief propagation and mean-field methods, as well as dual-decomposition.
- It supports a smooth rise in learning effort as users descend through layers of abstraction: from command-line tools, to succinct scripting, to customized models and inference procedures. FACTORIE is designed to avoid "impenetrable black-boxes."
- It is embedded into a general purpose programming language, providing model authors with familiar and extensive resources for implementing the procedural aspects of their solution, including the ability to beneficially mix data pre-processing, diagnostics, evaluation, and other book-keeping code in the same files as the probabilistic model specification.
- It allows the use of imperative (procedural) constructs to define the factor graph. This unusual and powerful facet enables significant efficiencies and also supports the injection of both declarative and procedural domain knowledge into model design.
The structure of generative models can be expressed as a program that describes the generative storyline by creating variables and specifying their parents.
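As a rough illustration of a generative storyline written as a program, the sketch below uses plain Scala rather than the FACTORIE API. Every name in it (`Coin`, `Choice`, `Flip`, `generate`, `logJoint`) is hypothetical and chosen only for this example. Each variable is created by ordinary code and receives its parent as a constructor argument.

```scala
import scala.util.Random

// Plain-Scala sketch; all names are hypothetical and NOT part of the FACTORIE API.
object GenerativeStorySketch {
  final case class Coin(name: String, pHeads: Double)

  // Root variable: which coin is picked, uniformly at random (no parents).
  final class Choice(val coins: Vector[Coin], rng: Random) {
    val value: Coin = coins(rng.nextInt(coins.size))
  }

  // Child variable: a flip whose distribution depends on its parent Choice.
  final class Flip(val parent: Choice, rng: Random) {
    val heads: Boolean = rng.nextDouble() < parent.value.pHeads
  }

  // The storyline: pick a coin, then flip that coin n times.
  def generate(n: Int, seed: Long): (Choice, Vector[Flip]) = {
    val rng = new Random(seed)
    val choice = new Choice(Vector(Coin("fair", 0.5), Coin("biased", 0.9)), rng)
    val flips = Vector.fill(n)(new Flip(choice, rng))
    (choice, flips)
  }

  // Joint log-probability: one term per variable, conditioned on its parent.
  def logJoint(choice: Choice, flips: Seq[Flip]): Double = {
    val p = choice.value.pHeads
    math.log(1.0 / choice.coins.size) +
      flips.map(f => math.log(if (f.heads) p else 1.0 - p)).sum
  }
}
```

Calling `GenerativeStorySketch.generate(10, 42L)` returns the chosen coin together with ten flips that all share it as their parent. The structure of the model is simply whatever the program builds.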
The structure of undirected graphical models can be specified similarly by explicitly creating factors and specifying their neighboring variables. However, most commonly the creation of factors for relational data is defined in templates which contain functions that create the necessary factors in a Turing-complete imperative style.
This use of imperative programming to define various aspects of factor graph construction and operation is an innovation that originated in FACTORIE; we term this approach imperatively-defined factor graphs. The three methods above for specifying relational factor graph structure can be mixed in the same model.
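The template idea can be sketched in a few lines of plain Scala. This is an illustration under stated assumptions, not FACTORIE's actual classes: `Label`, `Factor`, `Template`, `AgreementTemplate`, and `localScore` are hypothetical names. The template's `unroll` method is ordinary imperative code that, given one variable, creates the factors touching it.

```scala
import scala.collection.mutable.ArrayBuffer

// Plain-Scala sketch; all names are hypothetical and NOT part of the FACTORIE API.
object TemplateSketch {
  // A discrete variable at a position in a chain, with a mutable value.
  final class Label(val position: Int, var value: Int)

  // A factor scores the current joint assignment of its neighboring variables.
  final case class Factor(neighbors: Seq[Label], score: () => Double)

  // A template holds imperative code that creates the factors touching a variable.
  trait Template {
    def unroll(v: Label, chain: IndexedSeq[Label]): Seq[Factor]
  }

  // One factor per adjacent pair: +1.0 if the two labels agree, -1.0 otherwise.
  object AgreementTemplate extends Template {
    def unroll(v: Label, chain: IndexedSeq[Label]): Seq[Factor] = {
      val factors = ArrayBuffer.empty[Factor]
      val i = v.position
      if (i > 0) {
        val left = chain(i - 1)
        factors += Factor(Seq(left, v), () => if (left.value == v.value) 1.0 else -1.0)
      }
      if (i < chain.size - 1) {
        val right = chain(i + 1)
        factors += Factor(Seq(v, right), () => if (v.value == right.value) 1.0 else -1.0)
      }
      factors.toSeq
    }
  }

  // Sum the scores of only those factors that touch `v`.
  def localScore(v: Label, chain: IndexedSeq[Label], templates: Seq[Template]): Double =
    templates.flatMap(_.unroll(v, chain)).map(_.score()).sum
}
```

In this sketch, factors are created on demand for a single variable, so a proposed change to that variable can be scored without building the whole graph. The sketch does not show how FACTORIE itself implements templates or achieves its efficiency.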
FACTORIE's limitations include the following:
- It does not yet have extensive support for inference and learning with continuous random variables. Support for discrete random variables has been our main emphasis thus far.
- It has only minimal support for automatically selecting an appropriate inference method given a particular graphical model. Our users instead specify which family of inference method they wish to use.
- It does not yet have convenient infrastructure for defining simple non-relational graphical models (such as the Sprinkler/Rain/Grass example). Our emphasis thus far has been on large relational data.
- It does not yet have a simple declarative syntax for specifying model structure. It has an extremely flexible mechanism for defining model structure directly in Scala; a simpler front-end syntax for beginners will be added in the future.
- It is not yet connected to a tool for directly producing graphs or other visualizations.
For further discussion of how FACTORIE compares to related tools, see [Tutorial010SimilarTools.scala.html].
FACTORIE comes with pre-built model structures and command-line tools for:
- classification (including document classification, with MaxEnt, NaiveBayes, SVMs and DecisionTrees)
- linear regression
- linear-chain conditional random fields
- topic modeling (including latent Dirichlet allocation, as well as several variants)
- natural language processing, including many standard NLP pipeline components, such as tokenization, sentence segmentation, part-of-speech tagging, named entity recognition, and dependency parsing.
FACTORIE has been successfully applied to many tasks, including:
- cross-document entity resolution on 100 million mentions, parallelized and distributed (Wick, Singh, McCallum, ACL, 2012)
- within-document co-reference, supervised (Zheng, Vilnis, Singh, Choi, McCallum, CoNLL, 2013)
- parallel/distributed belief propagation
- transition-based and graph-based dependency parsing
- relation extraction, distantly supervised
- schema matching
- ontology alignment
- parallelized latent Dirichlet allocation.
Changes to previous version:
Initial Announcement on mloss.org.