Large-Scale Exemplar Clustering (Subset Selection / Facility Location)
Description
Lowrank-ALBCD is a scalable convex exemplar clustering (or subset selection / facility location) solver based on Augmented Lagrangian + Block Coordinate Descent (AL-BCD) that uses column-generation technique to avoid computing the pairwise sample similarity matrix.
In particular, LowrankALBCD solves problem of the form:
by convex relaxation from {0,1} to [0,1], where
is an N by M dissimilarity matrix,
denotes element-wise inner product, N is the number of samples, and M is the number of candidate exemplars. In the default setting, we have M=N and normalized Euclidian distance square is used. We provide a tightening procedure for the relaxation that guarantees to yeild integer solution. (The solver by default uses normalized square Euclidean distance. To use other dissimilarity measure, send an email to eyan@cs.cmu.edu.)
Download and Compile
You can download the code here.
Compile it by "make" in a unix system with g++.
Usage
Usage: LowrankALBCD [data]
output: "clusters"
-l lambda: parameter controling number of clusters (default 0.02*#samples)
-t tightening: incrementally rounding to integer solution (0/1, default 1)
-e eta: Augmented Lagrangian parameter (default 0.01)
-m max_iter: maximum number of iterations (default 1000)
-p precision: primal infeasibility tolerence (default 1e-2)
-c cache_size: maximum number of columns of dissimilarity matrix stored (default 10000)
(p.s. Normalized square Euclidean distance d(xi,xj)=2-\frac{}{\|xi\|_2\|xj\|_2} is used by default.)
Data Format:
LIBSVM (svm-light) format:
Each row
[label] [feature_ID_1]:[value1] [feature_ID_2]:[value2] ......
corresponds to a sample, where [label] is the ground-truth cluster ID (can be filled in with 0 if ground truth is unknown) and features of 0 value are not listed. One can find some examples in the directory "example_data/", and more data of this format can be found at, for example LIBSVM repository.
Output Format:
File "clusters":
Each row
[exemplar_ID] [sample_ID1]:[assign_value1] [sample_ID2]:[assign_value2] ......
corresponds to a cluster, where [exemplar_ID] is the ID of sample selected as exemplar, which is then followed by a sequence of [sample_ID] with non-zero weight on this exemplar.
Example Usages:
- The default behavior of the solver uses "tightening" technique to enforce integer solution. For example, type:
$ ./LowrankALBCD example_data/real-sim.3k
It will solve several convex optimization problems. Each problem is obtained by rounding one more exemplar (a column of W) to integers and solve for the others.
-
For some data, for example:
./LowrankALBCD example_data/german.numer.scale
The solver obtains integer solution without tightening. In this case, the solution is guaranteed to be the optimal solution to the combinatorial (exemplar clustering) problem.
-
If one would like to run convex relaxation solver without tighteing, add "-t 0" as:
./LowrankALBCD -t 0 example_data/real-sim.3k
-
To find different number of clusters, one can tune the parameter "lambda" by "-l" option. For example:
./LowrankALBCD -l 20 example_data/german.numer.scale
./LowrankALBCD -l 10 example_data/german.numer.scale
./LowrankALBCD -l 5 example_data/german.numer.scale
./LowrankALBCD -l 1 example_data/german.numer.scale
will gives 4, 8, 12, 53 clusters respectively.
Citation
You are welcome to use the code. For citation, please use:
Scalable Exemplar Clustering and Facility Location via Augmented Block Coordinate Descent with Column Generation.
Ian E.H. Yen, Dmitry Malioutov, and Abhishek Kumar. ,
In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
[PDF]
Bug reports and comments are
always appreciated.