WSCT : The WEIGHTED SEQUENCE CLUSTERING TOOLKIT v0.1
WSCT is a sequence clustering toolkit. Given a set of sequences S and a number of clusters k, WSCT partitions S into k disjoint clusters according to a minimum-cut criterion that minimizes the similarity between partitions of the similarity graph. The set S consists of discrete sequences of alphabets (possibly of different lengths) drawn from a finite alphabet set.
The sequences may be either weighted or un-weighted. An un-weighted sequence is simply a sequence of alphabets from the given alphabet set. In the weighted version, a weight is associated with each of the alphabets, so a weighted sequence is essentially a sequence of 2-tuples. For example, let A = (a, b, c) be an alphabet set. Then s1 = <a,a,b,c,a,b> is an unweighted sequence, whereas s2 = <(a,2),(a,4),(b,1),(c,7),(a,2),(b,3)> is a weighted sequence.
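In code, the two representations can be modeled like this (a minimal Python sketch; the variable names are illustrative, not part of WSCT):

```python
# An unweighted sequence is a list of symbols; a weighted sequence
# is a list of (symbol, weight) pairs drawn from the same alphabet.
A = ("a", "b", "c")                                  # the alphabet set

s1 = ["a", "a", "b", "c", "a", "b"]                  # unweighted
s2 = [("a", 2), ("a", 4), ("b", 1),
      ("c", 7), ("a", 2), ("b", 3)]                  # weighted

assert all(sym in A for sym in s1)
assert all(sym in A for sym, _ in s2)
# Dropping the weights from s2 recovers an unweighted sequence:
assert [sym for sym, _ in s2] == s1
```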
WSCT expects the following inputs:
1. A project name - files corresponding to a particular project/dataset share a project name. All files, including results as well as temporary files, carry this project name as their middle name. In the examples below, we assume the project name "foo".
2. An alphabet file - this contains the `names' of the finite alphabets. If there are p alphabets, each line of this file contains one entry, with the line number serving as the integer id of the alphabet on that line. The naming convention for this file is "alphabet.<project name>.txt". Thus, for the project "foo", this file will be called "alphabet.foo.txt".
3. A data file - this contains the basic sequence data. The file should contain fully weighted or fully un-weighted data. Each line of the file contains one sequence, written in terms of the integer ids of the corresponding alphabets (and the weights, if applicable). For example, the line
corresponding to s1 = <a,a,b,c,a,b> will be as follows:
1 1 2 3 1 2
and that corresponding to s2 = <(a,2),(a,4),(b,1),(c,7),(a,2),(b,3)> will be
1 2 1 4 2 1 3 7 1 2 2 3
corresponding to the alphabet file having a, b, and c on line numbers 1, 2 and 3 respectively.
4. A "tmp" directory under the output directory for storing the temporary files.
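The encoding described in items 2 and 3 can be sketched as follows (a minimal Python illustration; the alphabet, helper names, and example sequences are assumptions, not part of WSCT itself):

```python
# Sketch: encode symbolic sequences into WSCT's integer-id data-file format.
alphabet = ["a", "b", "c"]                        # line i of the alphabet file holds alphabet id i
ids = {sym: i + 1 for i, sym in enumerate(alphabet)}

def encode_unweighted(seq):
    """<a,a,b,c,a,b> becomes "1 1 2 3 1 2"."""
    return " ".join(str(ids[s]) for s in seq)

def encode_weighted(seq):
    """<(a,2),(a,4),...> becomes "1 2 1 4 ..." (each id followed by its weight)."""
    return " ".join(f"{ids[s]} {w}" for s, w in seq)

print(encode_unweighted(["a", "a", "b", "c", "a", "b"]))   # 1 1 2 3 1 2
print(encode_weighted([("a", 2), ("a", 4), ("b", 1),
                       ("c", 7), ("a", 2), ("b", 3)]))     # 1 2 1 4 2 1 3 7 1 2 2 3
```

Writing one such line per sequence yields a file in the sequence.foo.data format.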
Two example projects can be found in the directory "data". One is for the unweighted case: project name "unweighted", alphabet file "alphabet.unweighted.txt", data file "sequence.unweighted.data". The other is for the weighted case: project name "weighted", alphabet file "alphabet.weighted.txt", data file "sequence.weighted.data".
Just type wsct.pl on the command prompt to get the usage.
wsct.pl -d <data dir> -s <project name> [-o <output dir>] [-c <# of clusters>] [-t <basic threshold>] [-p <similarity power>] [-u] [-h <theta threshold>]
-d <data dir> : directory in which the data files are present
(the directory in which alphabet.foo.txt and sequence.foo.data will be found)
-s <project name> : basic project/dataset name
(the project name)
-o <output dir> : output directory [default <data dir>]
(output directory where results will be written)
-c <# of clusters> : number of clusters [default 25]
-t <basic threshold> : basic cutoff threshold for similarity graph [default 0.1]
-p <similarity power>: importance component exponent [default 1]
-u : treat the sequences as unweighted [default no]
(if the data is un-weighted, just give -u as an option on the command line)
-h <theta threshold> : the outlier cutoff threshold for similarity graph [default 0.3]
For the un-weighted data -
$ wsct.pl -d data -s unweighted -c 2 -u
For the weighted data -
$ wsct.pl -d data -s weighted -c 2
For some large unweighted dataset -
$ wsct.pl -d /home/bar/dataset/large/ -s foo -o /home/bar/output/large/ -c 10 -t 0.3 -p 3 -u -h 0.5
The code is far from stable and does not incorporate error/exception handling. So, if the program exits without completing for certain parameter values, do not be surprised. However, for typical parameter values, it should run without problems. If the `pmetis' binary included
does not run properly, please download it from http://www-users.cs.umn.edu/~karypis/metis/metis/download.html
1. Keep the basic threshold (-t option) high (close to 0.5, say) when dealing with large datasets.
2. For large datasets, keep the theta threshold (-h option) high as well.
3. Give a reasonable value for the number of clusters (-c option). In fact, a slightly higher value than strictly required gives purer clusters, some of which can then be merged by a criterion that depends on the problem.
For suggestions/feedback please write to: email@example.com