Project details for DCABags

DCABags 0.63

by wbuntine - November 8, 2013, 13:31:04 CET


Description:

This is a suite of Perl scripts, with man pages, for preprocessing text collections into dictionaries and bag/list files for use by topic modelling software. Output is available in various sparse vector formats (ldac, Matlab Topic Toolbox, libsvm, ...). Scripts are also provided for extracting data from WEX files (Wikipedia dumps), PubMed XML and Reuters RCV1 files. The input format is a simple text file, which can be created with XSL from standard XML formats, and has fielded entries for things like text, links and categories. The system uses buffering, so input files can be as large as 30 GB.

Example uses:

(1) Generate PMI matrices from Wikipedia for coherence evaluation à la Newman, Lau, Grieser and Baldwin, NAACL 2010.
(2) Process RCV1, PubMed, NewsML or WEX files into sparse matrix "bags".
(3) Generate bags in LDAC or sparse matrix format.
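To make the "bag" output concrete, here is a minimal sketch of how tokenized documents map onto the LDA-C sparse format, where each line holds the number of distinct terms followed by index:count pairs. It is written in Python for illustration only and is not DCABags code: the function names, the min_df parameter and the toy documents are all assumptions, and the index offset is left configurable because the convention used can differ between tools and versions.

```python
from collections import Counter


def build_dictionary(docs, min_df=1):
    """Assign an integer id to every token that occurs in at least
    min_df documents (hypothetical helper, not part of DCABags)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each token once per document
    kept = sorted(tok for tok, n in df.items() if n >= min_df)
    return {tok: i for i, tok in enumerate(kept)}


def to_ldac_line(tokens, dictionary, offset=0):
    """Render one document as an LDA-C line:
    '<distinct terms> <index>:<count> <index>:<count> ...'."""
    counts = Counter(dictionary[t] for t in tokens if t in dictionary)
    pairs = " ".join(f"{idx + offset}:{n}" for idx, n in sorted(counts.items()))
    return f"{len(counts)} {pairs}".rstrip()


# Toy corpus, invented for this example.
docs = [
    ["topic", "model", "topic"],
    ["model", "bag", "word"],
    ["topic", "word"],
]
dictionary = build_dictionary(docs, min_df=2)  # drops "bag" (seen in one document)
ldac_lines = [to_ldac_line(d, dictionary) for d in docs]

for line in ldac_lines:
    print(line)
```

Passing offset=1 to to_ldac_line gives 1-based word indices instead, which is worth checking against whatever the downstream topic modelling software expects.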

User Guide (read this first) and other information at: http://www.nicta.com.au/people/buntinew/softwareanddata#dcabags

Sample prepared input data sets and XSL files available at http://www.nicta.com.au/people/buntinew/softwareanddata#data

Help and discussion forums at https://forge.nicta.com.au/forum/?group_id=129

Changes to previous version:

Cleaned up man pages and created user guide.

Supported Operating Systems: Agnostic
Data Formats: Txt, Xml
Tags: Topic Modeling, Data Sets, Data Cleaning

Other available revisions

Version: 0.7
Date: June 5, 2014, 05:34:44
Changelog: Moved distribution and code across to GitHub. Changed "ldac" format to have 0 offset for word indices. Added "document frequency" (df) filtering on selection of tokens for linkTables. Experimenting with linkParse, but it is still generally unusable.

Version: 0.63
Date: November 8, 2013, 13:31:04
Changelog: Cleaned up man pages and created user guide.

Version: 0.62
Date: November 2, 2013, 08:11:36
Changelog: Initial Announcement on mloss.org.
