Bilingual Text Classification 0.1
http://mloss.org/software/view/247/
Jorge Civera, Alfons Juan (Fri, 09 Apr 2010)

<p>The proliferation of multilingual documentation is a common phenomenon
in many official institutions and private companies. In many cases,
this textual information needs to be categorised by hand, entailing a
time-consuming and arduous task.
</p>
<p>This software package implements a series of statistical models for
bilingual text classification trained with the EM algorithm. In this
context, data samples must be provided in the format cxy, where c is
the class label, x is a text in a source language and y is its
corresponding translation in a target language. Classification is
performed according to Bayes' rule:
</p>
<pre><code> c* = argmax_c p(c|x,y) = argmax_c p(x,y|c)·p(c)
</code></pre><p>where p(x,y|c) is a class-conditional bilingual probability, assumed
to be generated by a t-component mixture model that combines a
translation model and a language model:
</p>
<pre><code> p(x,y|c) = sum_t p(x,y,t|c) = sum_t p(x|y,t,c)·p(y|t,c)·p(t|c)
</code></pre><p>Depending on the assumptions made about the translation model
p(x|y,t,c) and the language model p(y|t,c), we obtain different
instantiations of bilingual text classifiers. Two general approaches
can be distinguished.
</p>
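<p>For illustration only, the decision rule and the mixture evaluation
can be sketched in a few lines of Python. All function names here are
hypothetical stand-ins for the package's model-specific routines, and
log-space is used to avoid numerical underflow:
</p>
<pre><code>import math

def log_sum_exp(values):
    # numerically stable log(sum(exp(v) for v in values))
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_p_xy_given_c(x, y, c, components,
                     log_p_x_given_ytc, log_p_y_given_tc, log_p_t_given_c):
    # log p(x,y|c) = logsumexp_t [log p(x|y,t,c) + log p(y|t,c) + log p(t|c)]
    return log_sum_exp([log_p_x_given_ytc(x, y, t, c)
                        + log_p_y_given_tc(y, t, c)
                        + log_p_t_given_c(t, c) for t in components])

def classify(x, y, classes, log_p_xy, log_p_c):
    # Bayes decision rule: c* = argmax_c log p(x,y|c) + log p(c)
    return max(classes, key=lambda c: log_p_xy(x, y, c) + log_p_c(c))
</code></pre>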
<p>The first approach models each language independently, under a naive
cross-lingual independence assumption. The corresponding
implementation, "1g1gmc", represents each language as a unigram model:
</p>
<pre><code> p(x|y,t,c) := p(x|t,c) := prod_i p(x_i|t,c)
 p(y|t,c) := prod_j p(y_j|t,c)
</code></pre>
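<p>A minimal sketch of this component score, assuming hypothetical
unigram tables that map (word, t, c) to a probability (smoothing of
unseen words is omitted for brevity):
</p>
<pre><code>import math

def log_unigram(words, t, c, table):
    # log prod_i p(w_i|t,c)
    return sum(math.log(table[(w, t, c)]) for w in words)

def log_1g1gmc_component(x, y, t, c, src_table, trg_table):
    # naive cross-lingual independence: p(x|y,t,c) := p(x|t,c)
    return log_unigram(x, t, c, src_table) + log_unigram(y, t, c, trg_table)
</code></pre>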
<p>The second approach is a natural evolution of the first, taking into
account word correlations across languages. Here we combine the
well-known IBM translation models with an n-gram language model. The
following models were implemented in this software package:
</p>
<p>1gM1mc) Unigram-M1 Mixture Model (M1 also known as IBM Model 1)
</p>
<pre><code> p(x|y,t,c) := prod_j sum_i p(x_j,a_j|y_i,t,c)
 p(y|t,c) := prod_j p(y_j|t,c)
</code></pre>
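<p>For intuition, the Model 1 translation factor can be sketched as
follows, with a hypothetical lexical table lex that maps (x_word,
y_word, t, c) to a probability; the NULL word and the usual length
normalisation are left out:
</p>
<pre><code>import math

def log_m1(x, y, t, c, lex):
    # prod_j sum_i p(x_j|y_i,t,c), in log-space
    return sum(math.log(sum(lex[(xj, yi, t, c)] for yi in y)) for xj in x)
</code></pre>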
<p>1gM2mc) Unigram-M2 Mixture Model (M2 also known as IBM Model 2)
</p>
<pre><code> p(x|y,t,c) := prod_j sum_i p(i|j,|y|,t,c)·p(x_j,a_j|y_i,t,c)
 p(y|t,c) := prod_j p(y_j|t,c)
</code></pre>
<p>where |y| denotes the number of words in y.
</p>
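<p>Model 2 only adds a position-dependent alignment term; a sketch under
the same assumptions, with a hypothetical alignment table align that
maps (i, j, |y|, t, c) to a probability:
</p>
<pre><code>import math

def log_m2(x, y, t, c, lex, align):
    # prod_j sum_i p(i|j,|y|,t,c)·p(x_j|y_i,t,c), in log-space
    ly = len(y)
    return sum(math.log(sum(align[(i, j, ly, t, c)] * lex[(xj, yi, t, c)]
                            for i, yi in enumerate(y)))
               for j, xj in enumerate(x))
</code></pre>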
<p>2gM2mc) Bigram-M2 Mixture Model
</p>
<pre><code> p(x|y,t,c) := prod_j sum_i p(i|j,|y|,t,c)·p(x_j,a_j|y_i,t,c)
 p(y|t,c) := prod_j p(y_j|y_{j-1},t,c)
</code></pre>
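<p>Here only the language model changes; a sketch of the bigram factor,
with a hypothetical table that maps (word, previous word, t, c) to a
probability and an artificial begin-of-text marker:
</p>
<pre><code>import math

def log_bigram(y, t, c, table, bos="BOS"):
    # log prod_j p(y_j|y_{j-1},t,c)
    log_p, prev = 0.0, bos
    for w in y:
        log_p += math.log(table[(w, prev, t, c)])
        prev = w
    return log_p
</code></pre>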
<p>The implementations m1g1gmc and m1gM1mc are straightforward extensions
to multi-label bilingual text classification. In this case, given a
fixed number k of class labels to be assigned to each text, the k most
probable classes according to p(c|x,y) are returned.
</p>
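<p>The multi-label extension then reduces to a k-best selection over the
same scores; a sketch (log_joint is any of the class-conditional
scorers above plus the log class prior):
</p>
<pre><code>import heapq

def k_best_classes(x, y, classes, k, log_joint):
    # p(x,y) is constant in c, so ranking by log p(x,y|c) + log p(c)
    # ranks classes by the posterior p(c|x,y)
    return heapq.nlargest(k, classes, key=lambda c: log_joint(x, y, c))
</code></pre>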
<p>At each iteration of the EM algorithm, single-label classifiers
output, from left to right (see the sketch after this list):
</p>
<ul>
<li><p>Information about the number of mixture components and iteration.
</p>
</li>
<li><p>Values of the parameters, depending on the model.
</p>
</li>
<li><p>Log-likelihood, variation in log-likelihood, error rate (as a
percentage) and variation in error rate for the training, validation
(when implemented) and test sets.
</p>
</li>
</ul>
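<p>Schematically, this per-iteration report corresponds to a training
loop of the following shape; e_step, m_step and log_likelihood are
hypothetical stand-ins for the package's model-specific routines, and
the error-rate columns are omitted:
</p>
<pre><code>def train_em(model, data, n_iters, e_step, m_step, log_likelihood):
    prev = None
    for it in range(1, n_iters + 1):
        m_step(model, e_step(model, data))  # one EM re-estimation
        ll = log_likelihood(model, data)
        # iteration, log-likelihood, variation in log-likelihood
        print(it, ll, 0.0 if prev is None else ll - prev)
        prev = ll
</code></pre>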
<p>Multi-label classifiers, in addition to the information regarding the
number of mixture components, iteration and parameter values, output
from left to right:
</p>
<ul>
<li><p>Log-likelihood and variation in log-likelihood for the training set.
</p>
</li>
<li><p>Precision and recall for the test set (a computation sketch
follows below), where the number of class labels requested from the
classifier is:
</p>
</li>
</ul>
<p> 1) Only one class.
</p>
<p> 2) The average number of class labels per text in the training set.
</p>
<p> 3) Twice the average number of class labels.
</p>
<p> 4) The exact number of class labels for that text (oracle mode).
</p>
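<p>For reference, a sketch of how such precision and recall figures can
be computed from gold and predicted label sets; micro-averaging over
all test texts is an assumption made here, not necessarily what the
package does:
</p>
<pre><code>def precision_recall(gold_sets, pred_sets):
    # correct/predicted and correct/gold, micro-averaged (assumption)
    correct = sum(len(set(g) & set(p)) for g, p in zip(gold_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_gold = sum(len(g) for g in gold_sets)
    return correct / n_pred, correct / n_gold
</code></pre>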