Project details for ELKI

Logo ELKI 0.6.0

by erich - January 10, 2014, 18:32:28 CET [ Project Homepage BibTeX BibTeX for corresponding Paper Download ]

view (31 today), download ( 2 today ), 4 subscriptions

Description:

ELKI: "Environment for Developing KDD-Applications Supported by Index-Structures" is a development framework for data mining algorithms written in Java. It includes a large variety of popular data mining algorithms, distance functions and index structures.

Its focus is particularly on clustering and outlier detection methods, in contrast to many other data mining toolkits that focus on classification. Additionally, it includes support for index structures to improve algorithm performance such as R*-Tree and M-Tree.

The modular architecture is meant to allow adding custom components such as distance functions or algorithms, while being able to reuse the other parts for evaluation.

This package also includes the source code, since this software is meant for the rapid development of such algorithms, not so much for end users.

Changes to previous version:

Additions and Improvements from ELKI 0.5.5:

Algorithms

Clustering:

  • Hierarchical Clustering - the slower naive variants were added, and the code was refactored
  • Partition extraction from hierarchical clusterings - different linkage strategies (e.g. Ward)
  • Canopy pre-Clustering
  • Naive Mean-Shift Clustering
  • Affinity propagation clustering (both with distances and similarities / kernel functions)
  • K-means variations: Best-of-multiple-runs, bisecting k-means
  • New k-means initialization: farthest points, sample initialization
  • Cheng and Church Biclustering
  • P3C Subspace Clustering
  • One-dimensional clustering algorithm based on kernel density estimation

Outlier detection

  • COP - correlation outlier probabilities
  • LDF - a kernel density based LOF variant
  • Simplified LOF - a simpler version of LOF (not using reachability distance)
  • Simple Kernel Density LOF - a simple LOF using kernel density (more consistent than LDF)
  • Simple outlier ensemble algorithm
  • PINN - projection indexed nearest neighbors, via projected indexes.
  • ODIN - kNN graph based outlier detection
  • DWOF - Dynamic-Window Outlier Factor (contributed by Omar Yousry)
  • ABOD refactored, into ABOD, FastABOD and LBABOD

Distances

  • Geodetic distances now support different world models (WGS84 etc.) and are subtantially faster.
  • Levenshtein distances for processing strings, e.g. for analyzing phonemes (contributed code, see "Word segmentation through cross-lingual word-to-phoneme alignment", SLT2013, Stahlberg et al.)
  • Bray-Curtis, Clark, Kulczynski1 and Lorentzian distances with R-tree indexing support
  • Histogram matching distances
  • Probabilistic divergence distances (Jeffrey, Jensen-Shannon, Chi2, Kullback-Leibler)
  • Kulczynski2 similarity
  • Kernel similarity code has been refactored, and additional kernel functions have been added

Database Layer and Data Types

Projection layer * Parser for simple textual data (for use with Levenshtein distance) Various random projection families (including Feature Bagging, Achlioptas, and p-stable) Latitude+Longitude to ECEF Sparse vector improvements and bug fixes New filter: remove NaN values and missing values New filter: add histogram-based jitter New filter: normalize using statistical distributions New filter: robust standardization using Median and MAD New filter: Linear discriminant analysis (LDA)

Index Layer

  • Another speed up in R-trees
  • Refactoring of M- and R-trees: Support for different strategies in M-tree New strategies for M-tree splits Speedups in M-tree
  • New index structure: in-memory k-d-tree
  • New index structure: in-memory Locality Sensitive Hashing (LSH)
  • New index structure: approximate projected indexes, such as PINN
  • Index support for geodetic data - (Details: Geodetic Distance Queries on R-Trees for Indexing Geographic Data, SSTD13)
  • Sampled k nearest neighbors: reference KDD13 "Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles"
  • Cached (precomputed) k-nearest neighbors to share across multiple runs
  • Benchmarking "algorithms" for indexes

Mathematics and Statistics

  • Many new distributions have been added, now 28 different distributions are supported
  • Additional estimation methods (using advanced statistics such as L-Moments), now 44 estimators are available
  • Trimming and Winsorizing
  • Automatic best-fit distribution estimation
  • Preprocessor using these distributions for rescaling data sets
  • API changes related to the new distributions support
  • More kernel density functions
  • RANSAC covariance matrix builder (unfortunately rather slow)

Visualization

  • 3D projected coordinates (Details: Interactive Data Mining with 3D-Parallel-Coordinate-Trees, SIGMOD2013)
  • Convex hulls now also include nested hierarchical clusters

Other

  • Parser speedups
  • Sparse vector bug fixes and improvements
  • Various bug fixes
  • PCA, MDS and LDA filters
  • Text output was slightly improved (but still needs to be redesigned from scratch - please contribute!)
  • Refactoring of hierarchy classes
  • New heap classes and infrastructure enhancements
  • Classes can have aliases, e.g. "l2" for euclidean distance.
  • Some error messages were made more informative.
  • Benchmarking classes, also for approximate nearest neighbor search.
BibTeX Entry: Download
Corresponding Paper BibTeX Entry: Download
URL: Project Homepage
Supported Operating Systems: Platform Independent
Data Formats: Arff, Other, Csv, Parser Extension Api
Tags: Clustering, Visualization, Algorithms, Evaluation, Anomaly Detection, Outlier Detection, Index Structures
Archive: download here

Other available revisons

Version Changelog Date
0.7.1

Additions and improvements from ELKI 0.7.0 to 0.7.1:

Algorithm additions:

  • GriDBSCAN: DBSCAN using grid partitioning (Minkowski distances only)

  • Compare-Means and Sort-Means k-means variations (much faster than traditional k-means)

  • Visualization of dendrograms.

Important bug fixes:

  • Classes with no package ("default package") would cause errors.

  • The fast power function implementation was sometimes returning incorrect results.

  • Random sampling was sometimes not sampling from the full data set.

UI improvements:

  • The file input source will now automatically choose the Arff parser for .arff files.

  • MiniGUI now allows choosing other applications.

  • MiniGUI now displays the command line in a separate field.

  • MiniGUI displays an error message, if an incorrect classpath or JAyatana (on Ubuntu) is detected.

  • Export to png now works, we added a work-around for an open Batik bug.

Smaller changes:

  • Many smaller bug fixes.

  • C-Index for cluster evaluation now can process larger data sets.

  • OPTICS output of undefined reachability fixed.

  • External distance matrixes are easier to use and perform additional checks.

  • Precomputed distance matrixes can answer range and kNN queries.

  • Voronoi visualization can be switched in the menu now.

  • Improved backwards command line compatibility with additional aliases.

  • Added generated @since annotations in JavaDoc.

  • Many new unit tests, renamed to the Java conventions.

  • Low-level reading of service files, to have faster startup.

March 14, 2016, 13:44:02
0.7.0

Additions and Improvements from ELKI 0.6.0:

ELKI is now available on Maven: https://search.maven.org/#artifactdetails|de.lmu.ifi.dbs.elki|elki|0.7.0|jar

Please clone https://github.com/elki-project/example-elki-project for a minimal project example.

Uncertain data types, and clustering algorithms for uncertain data.

Major refactoring of distances - removal of Distance values and removed support for non-double-valued distance functions (in particular DoubleDistance was removed). While this reduces the generality of ELKI, we could remove about 2.5% of the codebase by not having to have optimized codepaths for double-distance anymore. Generics for distances were present in almost any distance-based algorithm, and we were also happy to reduce the use of generics this way. Support for non-double-valued distances can trivially be added again, e.g. by adding the specialization one level higher: at the query instead of the distance level, for example. In this process, we also removed the Generics from NumberVector. The object-based get was deprecated for a good reason long ago, and e.g. doubleValue are more efficient (even for non-DoubleVectors).

Dropped some long-deprecated classes.

K-means:

  • speedups for some initialization heuristics.

  • K-means++ initialization no longer squares distances (again).

  • farthest-point heuristics now uses minimum instead of sum (renamed).

  • additional evaluation criteria.

  • Elkan's and Hamerly's faster k-means variants.

CLARA clustering.

X-means.

Hierarchical clustering:

  • Renamed naive algorithm to AGNES.

  • Anderbergs algorithm (faster than AGNES, slower than SLINK).

  • CLINK for complete linkage clustering in O(n²) time, O(n) memory.

  • Simple extraction from HDBSCAN.

  • "Optimal" extraction from HDBSCAN.

  • HDBSCAN, in two variants.

LSDBC clustering.

EM clustering was refactored and moved into its own package. The new version is much more extensible.

OPTICS clustering:

  • Added a list-based variant of OPTICS to our heap-based.

  • FastOPTICS (contributed by Johannes Schneider).

  • Improved OPTICS Xi cluster extraction.

Outlier detection:

  • KDEOS outlier detection (SDM14).

  • k-means based outlier detection (distance to centroid) and Silhouette coefficient based approach (which does not work too well on the toy data sets - the lowest silhouette are usually where two clusters touch).

  • bug fix in kNN weight, when distances are tied and kNN yields more than k results.

  • kNN and kNN weight outlier have their k parameter changed: old 2NN outlier is now 1NN outlier, as commonly understood in classification literature (1 nearest neighbor other than the query object; whereas in database literature the 1NN is usually the query object itself). You can get the old result back by decreasing k by one easily.

  • LOCI implementation is now only O(n^3 log n) instead of O(n^4).

  • Local Isolation Coefficient (LIC).

  • IDOS outlier detection with intrinsic dimensionality.

  • Baseline intrinsic dimensionality outlier detection.

  • Variance-of-Volumes outlier detection (VOV).

Parallel computation framework, and some parallelized algorithms

  • Parallel k-means.

  • Parallel LOF and variants.

LibSVM format parser.

kNN classification (with index acceleration).

Internal cluster evaluation:

  • Silhouette index.

  • Simplified Silhouette index (faster).

  • Davis-Bouldin index.

  • PBM index.

  • Variance-Ratio-Criteria.

  • Sum of squared errors.

  • C-Index.

  • Concordant pair indexes (Gamma, Tau).

  • Different noise handling strategies for internal indexes.

Statistical dependence measures:

  • Distance correlation dCor.

  • Hoeffings D.

  • Some divergence / mutual information measures.

Distance functions:

  • Big refactoring.

  • Time series distances refactored, allow variable length series now.

  • Hellinger distance and kernel function.

Preprocessing:

  • Faster MDS implementation using power iterations.

Indexing improvements:

  • Precomputed distance matrix "index".

  • iDistance index (static only).

  • Inverted-list index for sparse data and cosine/arccosine distance.

  • Cover tree index (static only).

  • Additional LSH hash functions.

Frequent Itemset Mining:

  • Improved APRIORI implementation.

  • FP-Growth added.

  • Eclat (basic version only) added.

Uncertain clustering:

  • Discrete and continuous data models.

  • FDBSCAN clustering.

  • UKMeans clustering.

  • CKMeans clustering.

  • Representative Uncertain Clustering (Meta-algorithm).

  • Center-of-mass meta Clustering (allows using other clustering algorithms on uncertain objects).

Mathematics:

  • Several estimators for intrinsic dimensionality.

MiniGUI has two "secret" new options: -minigui.last -minigui.autorun to load the last saved configuration and run it, for convenience.

Logging API has been extended, to make logging more convenient in a number of places (saving some lines for progress logging and timing).

November 27, 2015, 18:23:16
0.7.0-20150828

Additions and Improvements from ELKI 0.6.0:

  • Uncertain data types, and clustering algorithms for uncertain data.

  • Major refactoring of distances - removal of Distance values and removed support for non-double-valued distance functions. While this reduces the generality of ELKI, we could remove about 2.5% of the codebase by not having to have optimized codepaths for double-distance anymore. Generics for distances were present in almost any distance-based algorithm, and we were also happy to reduce the use of generics this way. Support for non-double-valued distances can trivially be added again, e.g. by adding the specialization one level higher: at the query instead of the distance level, for example.

  • In this process, we also removed the Generics from NumberVector. The object-based get was deprecated for a good reason long ago, and e.g. doubleValue are more efficient (even for non-DoubleVectors).

  • Dropped some long-deprecated classes

Clustering algorithms:

K-means

  • speedups for some initialization heuristics
  • K-means++ initialization no longer squares distances (again)
  • farthest-point heuristics now uses minimum instead of sum (renamed)
  • additional evaluation criteria
  • Elkan's and Hamerly's faster k-means variants

CLARA clustering

X-means

Hierarchical clustering

  • Renamed naive algorithm to AGNES
  • Anderbergs algorithm (faster than AGNES, slower than SLINK)
  • CLINK for complete linkage clustering in O(n²) time, O(n) memory
  • Simple extraction from HDBSCAN
  • "Optimal" extraction from HDBSCAN
  • HDBSCAN, in two variants

LSDBC clustering

EM clustering was refactored and moved into its own package. The new version is much more extensible.

Parallel computation framework, and some parallelized algorithms

  • Parallel k-means
  • Parallel LOF and variants

Input:

  • LibSVM format parser

Classification:

  • kNN classification (with index acceleration)

Evaluation: Internal cluster evaluation:

  • Silhouette index
  • Simplified Silhouette index (faster)
  • Davis-Bouldin index
  • PBM index
  • Variance-Ratio-Criteria
  • Sum of squared errors
  • C-Index
  • Concordant pair indexes (Gamma, Tau)
  • Different noise handling strategies for internal indexes

Statistical dependence measures:

  • Distance correlation dCor.
  • Hoeffings D.
  • Some divergence / mutual information measures.

Distance functions:

  • Big refactoring.
  • Time series distances refactored, allow variable length series now.
  • Hellinger distance and kernel function.

Preprocessing:

  • Faster MDS implementation using power iterations.

Indexing improvements:

  • Precomputed distance matrix "index".
  • iDistance index (static only).
  • Inverted-list index for sparse data and cosine/arccosine distance.
  • cover tree index (static only).

Frequent Itemset Mining:

  • Improved APRIORI implementation.
  • FP-Growth added.
  • Eclat (basic version only) added.

Uncertain clustering:

  • Discrete and continuous data models
  • FDBSCAN clustering
  • UKMeans clustering
  • CKMeans clustering
  • Representative Uncertain Clustering (Meta-algorithm)
  • Center-of-mass meta Clustering (allows using other clustering algorithms on uncertain objects) (KDD'14)

Outlier detection changes / smaller improvements:

  • KDEOS outlier detection (SDM14)
  • k-means based outlier detection (distance to centroid) and Silhouette coefficient based approach (which does not work too well on the toy data sets - the lowest silhouette are usually where two clusters touch).
  • bug fix in kNN weight, when distances are tied and kNN yields more than k results.
  • kNN and kNN weight outlier have their k parameter changed: old 2NN outlier is now 1NN outlier, as commonly understood in classification literature (1 nearest neighbor ''other than the query object''; whereas in database literature the 1NN is usually the query object itself). You can get the old result back by decreasing k by one easily.
  • LOCI implementation is now only O(n^3 log n) instead of O(n^4).

Various:

  • MiniGUI has two "secret" new options: -minigui.last -minigui.autorun to load the last saved configuration and run it, for convenience.

  • Logging API has been extended, to make logging more convenient in a number of places (saving some lines for progress logging and timing).

September 17, 2015, 10:20:30
0.6.0

Additions and Improvements from ELKI 0.5.5:

Algorithms

Clustering:

  • Hierarchical Clustering - the slower naive variants were added, and the code was refactored
  • Partition extraction from hierarchical clusterings - different linkage strategies (e.g. Ward)
  • Canopy pre-Clustering
  • Naive Mean-Shift Clustering
  • Affinity propagation clustering (both with distances and similarities / kernel functions)
  • K-means variations: Best-of-multiple-runs, bisecting k-means
  • New k-means initialization: farthest points, sample initialization
  • Cheng and Church Biclustering
  • P3C Subspace Clustering
  • One-dimensional clustering algorithm based on kernel density estimation

Outlier detection

  • COP - correlation outlier probabilities
  • LDF - a kernel density based LOF variant
  • Simplified LOF - a simpler version of LOF (not using reachability distance)
  • Simple Kernel Density LOF - a simple LOF using kernel density (more consistent than LDF)
  • Simple outlier ensemble algorithm
  • PINN - projection indexed nearest neighbors, via projected indexes.
  • ODIN - kNN graph based outlier detection
  • DWOF - Dynamic-Window Outlier Factor (contributed by Omar Yousry)
  • ABOD refactored, into ABOD, FastABOD and LBABOD

Distances

  • Geodetic distances now support different world models (WGS84 etc.) and are subtantially faster.
  • Levenshtein distances for processing strings, e.g. for analyzing phonemes (contributed code, see "Word segmentation through cross-lingual word-to-phoneme alignment", SLT2013, Stahlberg et al.)
  • Bray-Curtis, Clark, Kulczynski1 and Lorentzian distances with R-tree indexing support
  • Histogram matching distances
  • Probabilistic divergence distances (Jeffrey, Jensen-Shannon, Chi2, Kullback-Leibler)
  • Kulczynski2 similarity
  • Kernel similarity code has been refactored, and additional kernel functions have been added

Database Layer and Data Types

Projection layer * Parser for simple textual data (for use with Levenshtein distance) Various random projection families (including Feature Bagging, Achlioptas, and p-stable) Latitude+Longitude to ECEF Sparse vector improvements and bug fixes New filter: remove NaN values and missing values New filter: add histogram-based jitter New filter: normalize using statistical distributions New filter: robust standardization using Median and MAD New filter: Linear discriminant analysis (LDA)

Index Layer

  • Another speed up in R-trees
  • Refactoring of M- and R-trees: Support for different strategies in M-tree New strategies for M-tree splits Speedups in M-tree
  • New index structure: in-memory k-d-tree
  • New index structure: in-memory Locality Sensitive Hashing (LSH)
  • New index structure: approximate projected indexes, such as PINN
  • Index support for geodetic data - (Details: Geodetic Distance Queries on R-Trees for Indexing Geographic Data, SSTD13)
  • Sampled k nearest neighbors: reference KDD13 "Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles"
  • Cached (precomputed) k-nearest neighbors to share across multiple runs
  • Benchmarking "algorithms" for indexes

Mathematics and Statistics

  • Many new distributions have been added, now 28 different distributions are supported
  • Additional estimation methods (using advanced statistics such as L-Moments), now 44 estimators are available
  • Trimming and Winsorizing
  • Automatic best-fit distribution estimation
  • Preprocessor using these distributions for rescaling data sets
  • API changes related to the new distributions support
  • More kernel density functions
  • RANSAC covariance matrix builder (unfortunately rather slow)

Visualization

  • 3D projected coordinates (Details: Interactive Data Mining with 3D-Parallel-Coordinate-Trees, SIGMOD2013)
  • Convex hulls now also include nested hierarchical clusters

Other

  • Parser speedups
  • Sparse vector bug fixes and improvements
  • Various bug fixes
  • PCA, MDS and LDA filters
  • Text output was slightly improved (but still needs to be redesigned from scratch - please contribute!)
  • Refactoring of hierarchy classes
  • New heap classes and infrastructure enhancements
  • Classes can have aliases, e.g. "l2" for euclidean distance.
  • Some error messages were made more informative.
  • Benchmarking classes, also for approximate nearest neighbor search.
January 10, 2014, 18:32:28
0.6.0-beta1

New beta release, including some new algorithms (ODIN, PINN, full O(n^3) Hierarchical Clustering, new cluster extraction methods from hierarchies), new index structures (in-memory k-d tree, LSH, projected indexes, PINN), new visualizations and much more.

This release requires Java 7, for the new visualizations also JOGL will be needed.

June 23, 2013, 21:28:33
0.5.5

This is mostly a bug fix release. A lot of small issues have been fixed that improve performance, make error reporting a lot better, ease the use of sparse vectors and external precomputed distances, for example.

This will be the last ELKI release to support Java 6. The next ELKI release will require Java 7.

Algorithms

  • Some new LOF variants (LDF, SimpleLOF, SimpleKernelDensityLOF)
  • Correlation Outlier Probabilities (ICDM 2012)
  • A naive mean-shift clustering
  • Single-link clustering (SLINK algorithm) should be significantly faster due to optimized data structures
  • "Benchmarking" algorithms for measuring the performance of index structures

Index layer

  • Bulk loading R-Trees should be faster - in particular Sort Tile Recursive can work very well.
  • M-Trees have been refactored and optimized for double distances

Database layer

  • Bundle format (work in progress): low-level binary format for fast data exchange
  • DBID and DataStore layer received some additional classes for further performance improvements
  • KNN heap structures were revisited. The code is less clean now, but performs better in benchmarks.

Visualizations

  • General clean up and API simplifications
  • Some additional modules and improvements

Various

  • There is a new parameter class, RandomParameter
  • Some new distributions were added, also to the data set generator.

Tutorials

  • The website has new tutorials, including one on a k-means variation that produces equal sized clusters.
December 14, 2012, 18:49:58
0.5.0

Primary release goals:

  • Cluster evaluation: metrics and circle-segment-visualization (ICDE 2012)

  • Outlier detection ensembles (SDM 2011, 2012)

  • Usability improvements, for example by adding an automatic evaluation helper

  • Performance improvements by reducing boxing of primitive types

  • Parallel coordinates visualizations added for high-dimensional data

  • Tons of new algorithms, distance functions, index structures, visualizations, evaluators, ...

http://elki.dbs.ifi.lmu.de/wiki/Releases/ReleaseNotes0.5.0

July 1, 2012, 20:58:25
0.5.0 beta2

The full changelog is not yet up. Here is an excerpt of the new functions in 0.5.0 - further speed improvements - R-Tree flexibility: multiple new split strategies, bulk loaders, insertion strategies, so that ELKI can now do many R-Tree variations, including the original Guttman R-Tree, not only the R*-Tree. - K-Means flexibility: MacQueen and Lloyd style iterations along with various seeding strategies, including K-Means++ - VA-File (static only, not dynamic databases) - Many popular cluster evaluation measures - Alpha shapes, Voronoi cells, Delaunay triangulations in the visualization layer (in the projected space, so 2D!) - Parallel coordinates - Outlier ensemble code, presented at SDM 2012 - Some new algorithms, such as OUTRES

For the final 0.5.0 release we hope to have some approximate outlier detection methods for you (aLOCI, HilOut) as well as some subspace outlier detection methods including HiCS (ICDE 2012, to be presented tomorrow).

June 1, 2012, 21:32:08
0.5.0 beta1

The full changelog is not yet up. Here is an excerpt of the new functions in 0.5.0 - further speed improvements - R-Tree flexibility: multiple new split strategies, bulk loaders, insertion strategies, so that ELKI can now do many R-Tree variations, including the original Guttman R-Tree, not only the R*-Tree. - K-Means flexibility: MacQueen and Lloyd style iterations along with various seeding strategies, including K-Means++ - VA-File (static only, not dynamic databases); partial-VA to come for 0.5.0 final? - Many popular cluster evaluation measures - Alpha shapes, Voronoi cells, Delaunay triangulations in the visualization layer (in the projected space, so 2D!) - Parallel coordinates (only halfway reviewed in beta1, more to come!) - Outlier ensemble code, to be presented at SDM 2012 end of april

For the final 0.5.0 release we hope to have some approximate outlier detection methods for you (aLOCI, HilOut) as well as some subspace outlier detection methods including HiCS (ICDE 2012, to be presented tomorrow).

May 9, 2012, 20:46:08
0.4.1

Bug fix release with a number of minor issues affecting single algorithms, that have accumulated over the previous months. Existing applications should not be affected by this upgrade.

A larger 0.5.0 release is scheduled for early april with new algorithms, but also with API changes.

February 13, 2012, 16:51:35
0.4.0

Initial Announcement on mloss.org.

January 16, 2012, 22:12:23

Comments

No one has posted any comments yet. Perhaps you'd like to be the first?

Leave a comment

You must be logged in to post comments.