
Project guidelines

You are encouraged to work in groups of three to five for the term project. This project is a critical part of the course and a significant factor in determining your grade. First, each group should submit a hardcopy of a one- to two-page proposal summarizing the proposed project, including the plan of attack, an indication of where you will get the data, and at least two key references, on or before March 5. Please feel free to discuss your project with me before then. Also, if you have difficulty finding a project partner, you can contact the TA with a list of your interest area(s) so that they can try to match you up with other students before the March 5 deadline. Each group will give a 15-20 minute presentation on their project around late April.

The deadline for the softcopy term paper submission is midnight, May 5. This copy should be 20-30 pages (1.5 spacing), including figures, tables, and/or references, submitted as a single PDF file. You may want to refer to the guidelines posted by Professor Elkan at the University of California, San Diego for writing your paper. If you want to submit supplementary materials (code, referenced papers), put them in a folder and give me access via a pointer to the URL/Dropbox location. One submission per group.

In some cases I may provide feedback on the final term paper by May 8th, and give you the chance to submit an improved version by May 12th.


Project topics

The project should be centered around some problem, with associated data sets that you can mine to provide useful and actionable answers. At the very least, this should be an exercise in analyzing a reasonably large dataset. If, in the process, you invent new techniques, algorithms, or processes, or draw useful inferences that have not been made before, that is of course an added bonus. Two types of projects are suggested below.

Group A (Based on a Competition or other Real-World Large Datasets)

Data Mining Competitions

There have been several data mining competitions, such as the KDD Cup (http://www.kdnuggets.com/competitions/kddcup); for some of these the data is still available, and you can also find papers on how others have fared on these data sets. There are also several ongoing competitions (e.g., see http://www.kdnuggets.com/competitions).

Warning: these can be quite addictive, but also quite fun and a learning experience.

Yahoo! Webscope Data Sets

I can get some very large data sets from Yahoo! through its Webscope program.

A few papers have already been written based on work using these data sets, but many opportunities exist to define your own problem and solution. There are several other such large repositories as well.

Group B (Based on Type of Analysis or Application Domain)

Five such topics I am interested in are:

Data Mining for Health Care

List of some Health Care Data Sets. A highlighted area under health care is Structured Learning for Bioinformatics. Many bioinformatics problems involve a very large number of variables, often requiring a predictive model built from fewer data points than variables. A series of approaches based on LASSO and its extensions has recently been developed, with very promising results. See this talk and recent publications from Professor Ye's group. This group also has a nice software package that you can use to apply such methods to suitable data sets. Some related papers include those on ADNI and the Dartmouth Health Atlas. Also, for the Texas discharge data, papers can be found by doing a Google search for "Texas Hospital Inpatient Discharge Public". I may also be able to get data from local hospitals.
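To make the "fewer data points than variables" setting concrete, here is a minimal sketch (my own illustration using scikit-learn on synthetic data, not the package from Professor Ye's group) of fitting a LASSO model when the number of variables far exceeds the number of samples; the L1 penalty selects a small subset of the variables.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Synthetic p >> n setting: 100 samples, 5000 variables, only 10 of which matter.
    rng = np.random.default_rng(0)
    n_samples, n_features = 100, 5000
    X = rng.standard_normal((n_samples, n_features))
    true_coef = np.zeros(n_features)
    true_coef[:10] = 3.0
    y = X @ true_coef + 0.1 * rng.standard_normal(n_samples)

    # The L1 penalty (alpha) drives most coefficients exactly to zero.
    model = Lasso(alpha=0.1)
    model.fit(X, y)
    print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))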

"Affinity" Data Sets

List of some "Affinity" Data Sets.These problems involve finding the affinities among two (or more) sets of entities, such as users and movies, users and web pages/advertisements, etc. Often "side information" in the form of additional attributes of these entities (e.g. demographic information for the users, a social network etc.) is also available to improve predictions. To get an idea of the variety of approaches used for such data, look at Sections 1 and 2 of this tech report, which will also give you many pointers to recent literature, specially references 3, 4, 5, 6, 8, 15, 16, 19, 24 and 26. (PS: Don't get psyched by the math in the subsequent sections. Since I won't be covering variational methods, you need not deal with such approaches).

Mining Structured Data

Here the data has some special structure, e.g. graphs or XML documents. See this link for more information.
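For the graph case, one common starting point is to turn each node's structural properties into features that standard mining methods can then use. A minimal sketch (my own illustration, using networkx and its built-in karate club graph):

    import networkx as nx

    G = nx.karate_club_graph()            # small built-in social network

    # Per-node structural features that a downstream classifier or clusterer could use.
    pagerank = nx.pagerank(G)
    features = {
        node: {
            "degree": G.degree(node),
            "clustering": nx.clustering(G, node),
            "pagerank": pagerank[node],
        }
        for node in G.nodes
    }
    print(features[0])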

Scalable/distributed Implementations

You could look at developing specialized algorithms that work better on large data; e.g., see the SGD alternative to training SVMs in the notes on Scalability. I could likely provide access to the TACC supercomputers, but you will need an adequate background in high-performance computing.
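As a quick illustration of that SGD alternative (a minimal sketch assuming scikit-learn, not the exact formulation in the Scalability notes), a linear SVM can be trained by stochastic gradient descent on the hinge loss, which scales to data sizes that batch solvers handle poorly:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    # Synthetic stand-in for a large data set.
    X, y = make_classification(n_samples=100000, n_features=50, random_state=0)

    # loss="hinge" makes SGDClassifier a linear SVM trained by stochastic gradient descent.
    clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))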

Presto

I'd like to see some projects that use or enhance Presto (http://www.hpl.hp.com/research/presto.htm), a parallel and distributed version of R created by HP, which will be provided to us. You will get the opportunity to work with the lead developer of Presto.

If you want to deviate from the suggestions above, you need to check with me first. At the very least it should involve large ("rows" times "columns" > 1 million) data sets.