Sally is a small tool for mapping a set of strings to a set of vectors. This mapping is referred to as embedding and makes it possible to apply machine learning and data mining techniques to string data. Sally can be applied to several types of string data, such as text documents, DNA sequences or log files, and it reads common input formats such as directories, archives and text files of string data.
Sally implements a standard technique for mapping strings to a vector space, often referred to as the vector space model or bag-of-words model. Each string is characterized by a set of features, and each feature is associated with one dimension of the vector space. Sally supports the following types of features: bytes, words, n-grams of bytes and n-grams of words.
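To make the feature types concrete, here is a minimal Python sketch of how byte n-grams and word n-grams can be extracted from a string. It is only an illustration of the general idea, not Sally's implementation; the function names `byte_ngrams` and `word_ngrams` are hypothetical, and words are assumed to be separated by whitespace.

```python
def byte_ngrams(data, n):
    """Return all overlapping n-grams of bytes in `data` (n=1 yields single bytes)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping n-grams of words in `text`.

    Assumption: words are delimited by whitespace.
    """
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(byte_ngrams(b"abcd", 2))
print(word_ngrams("the quick brown fox", 2))
```

Each distinct n-gram returned by these functions would correspond to one dimension of the vector space.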
Sally proceeds by counting the occurrences of the specified features in each string and generating a sparse vector of count values. Alternatively, binary or TF-IDF values can be computed and stored in the vectors. Sally then normalizes the vector, for example using the L1 or L2 norm, and outputs it in a specified format, such as plain text, LibSVM or Matlab format.
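The counting and normalization steps can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Sally's code: the helper names (`char_ngrams`, `embed`, `to_libsvm_line`) are hypothetical, features are character n-grams, TF-IDF weighting is left out, and dimensions are assigned through an explicit dictionary chosen for the example. The output line follows the usual LibSVM layout of a label followed by ascending `index:value` pairs.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def embed(features, mode="count", norm="l2"):
    """Map a list of features to a sparse vector (dict: feature -> value)."""
    counts = Counter(features)
    if mode == "binary":
        # Binary embedding: 1.0 if the feature occurs at all.
        vec = {f: 1.0 for f in counts}
    else:
        # Count embedding: number of occurrences of each feature.
        vec = {f: float(c) for f, c in counts.items()}

    if norm == "l1":
        scale = sum(abs(v) for v in vec.values())
    elif norm == "l2":
        scale = math.sqrt(sum(v * v for v in vec.values()))
    else:
        scale = 1.0  # any other value: no normalization
    if scale > 0:
        vec = {f: v / scale for f, v in vec.items()}
    return vec


def to_libsvm_line(vec, dim_index, label=0):
    """Format a sparse vector as one LibSVM-style line.

    `dim_index` maps each feature to a 1-based dimension (an assumption
    made for this example).
    """
    pairs = sorted((dim_index[f], v) for f, v in vec.items())
    return " ".join([str(label)] + [f"{i}:{v:.6f}" for i, v in pairs])


vec = embed(char_ngrams("abab", 2), norm="l1")
dim_index = {"ab": 1, "ba": 2}
print(to_libsvm_line(vec, dim_index))  # prints "0 1:0.666667 2:0.333333"
```

In the example, the string "abab" contains the 2-gram "ab" twice and "ba" once, so the L1-normalized vector has the values 2/3 and 1/3.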
- Changes to previous version:
Fixed bug in FASTA module
Other available revisions
Version | Changelog | Date
1.0.0 | Support for explicit selection of granularity added. Several minor bug fixes. We have reached 1.0. | March 26, 2015, 17:01:35
0.9.2 | Fixed severe bug in concurrent computation of blended n-grams. | November 19, 2014, 20:28:35
0.9.1 | Several minor bug fixes. | November 19, 2014, 20:25:09
0.9.0 | Support for hash-based dimension reduction: simhash, minhash and Bloom filter. Support for several n-gram variants: regular, sorted, positional and blended n-grams. Simplified configuration. | July 1, 2014, 22:43:51
0.8.2 | Support for new version of libarchive. Several major and minor bug fixes. | December 25, 2013, 13:38:59
0.8.1 | Support for positional n-grams with shift (similar to weighted-degree kernel with shift) has been added. Several minor bugs have been fixed. | December 27, 2012, 14:34:31
0.8.0 | Support for stop words and frequency thresholding has been added. The configuration has been simplified and is more transparent. Several bugs have been fixed. | August 29, 2012, 09:46:55
0.7.1 | Several minor bugs have been fixed in the configuration. | May 18, 2012, 16:19:07
0.7 | Fixed several minor bugs. Support for signed embedding has been added. | May 14, 2012, 13:17:02
0.6.4 | Support for positional and sorted n-grams (n-perms), that is, n-grams bound to a position in strings and n-grams whose symbols are sorted. | February 6, 2012, 17:44:22
0.6.3 | Fixed bug in FASTA module. | August 22, 2011, 10:54:23
0.6.2 | Support for the clustering software CLUTO has been added as a new output module. The documentation has been extended. | July 9, 2011, 18:49:02
0.6.1 | Added missing configuration file to package. sigh | April 1, 2011, 09:30:01
0.6.0 | Support for a system-wide configuration has been added. Additionally, all configuration parameters can be specified on the command line. The manual page and documentation have been updated and extended. | February 22, 2011, 10:27:08
0.5.2 | Bug fixes on Linux. Improved support for Matlab export. | October 8, 2010, 06:50:03
0.5.0 | Initial Announcement on mloss.org. | September 29, 2010, 18:02:46