ESE: Data Mining


Project guidelines

You are encouraged to work in groups of three to five for the term project, in which you will analyze real business-related data and develop a predictive modeling solution. Individual projects are not allowed. This project is a critical part of the course and a significant factor in determining your grade. Teams are required to hand in a brief report and give a short class presentation of their work. By default, all team members will receive the same score for the project. If a team feels this is unfair, perhaps due to HIGHLY imbalanced contributions, then every team member must provide feedback on the contribution of each of the other team members via email to the instructor by the last day of class. I will then meet with all the members together to mediate.

The report is due as an electronic submission on Canvas by midnight on 5/2. It should be 20-30 pages (1.5 spacing), including figures, tables, and/or references, in the form of a single PDF file. You may want to refer to the guidelines for writing your paper posted by Professor Elkan at the University of California, San Diego. If you want to submit supplementary materials (code, referenced papers), put them in a folder and give me access via a pointer to the URL/Dropbox location.

Dates:

  1. Project outline due March 20th: 2-3 pages describing the problem, the data available, some possible approaches you will consider to address the problem, and a short list of references. On Sat, Apr 11, each group will give a 3-4 minute in-class presentation on what they plan to do.
  2. In-class presentation of project results, in May, approximately 15-20 minutes per group.
  3. Written project report due May 2nd. One submission per group.

Project presentation schedule

Project groups, title and schedule are available on Canvas.


Project topics

The project should be centered on a problem with associated data sets that you can mine to provide useful and actionable answers. At the very least, this should be an exercise in analyzing a reasonably large dataset. If, in the process, you invent new techniques, algorithms, or processes, or make useful inferences that have not been made before, that is of course an added bonus. Three types of projects are suggested below.

Type I: Some projects I am interested in

These will tend to be a bit more advanced, but you can get help from one of my graduate students. See this list for details.

Type II: Based on a Competition or other Real-World Large Datasets

Data Mining Competitions

There have been several data mining competitions, such as those hosted by Kaggle (www.kaggle.com). For several of these competitions, as well as those from the KDD Cup (http://www.kdnuggets.com/competitions/kddcup), the data is still available, and you can also find papers on how others have fared on these data sets. There are also several other ongoing competitions (e.g., see http://www.kdnuggets.com/competitions).

Warning: these can be quite addictive, but they are also quite fun and a good learning experience.

Other Public-Domain Datasets

There is an astonishing amount and variety of public domain datasets on the web. KDNuggets (http://www.kdnuggets.com/datasets/index.html) provides a long list.

You can even be selective about the topic: for example, if you Google "multilabel classification dataset", the first hit is a collection of datasets associated with the Mulan software.

The US Government's Open Data policy has also resulted in a treasure trove of data; see http://www.data.gov.

Type III: Based on Type of Analysis or Application Domain

You can formulate and address a suitable predictive modeling problem based on data from industry or government. It will be your job to acquire and manage the data. The project should be doable within a couple of months, but also non-trivial: at the very least it should involve a large data set (say, "rows" times "columns" > 1 million). Remember that your class presentation is public, but your class report is not, and I (and the TA) can sign NDAs if need be in order to work with you on such a project and to evaluate it. You can choose any topic you want; for example, you could look at healthcare data or data related to recommendation systems. Some pointers to these two example topics are given below:
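As a rough sanity check on the size requirement, you can count cells (rows times columns) before committing to a data set. The sketch below does this for a CSV file using only the Python standard library; the file contents here are a made-up toy example standing in for a real dataset.

```python
import csv
import io

def cell_count(csv_file):
    """Return (data rows) * (columns) for a CSV file object, excluding the header row."""
    reader = csv.reader(csv_file)
    header = next(reader)            # first row holds column names
    n_cols = len(header)
    n_rows = sum(1 for _ in reader)  # remaining rows are data
    return n_rows * n_cols

# Tiny in-memory example; for a real project, open your own file instead,
# e.g.  with open("mydata.csv") as f: print(cell_count(f))
tiny = io.StringIO("user,item,rating\n1,10,4\n2,11,5\n")
print(cell_count(tiny), "cells; the project guideline asks for more than 1,000,000")
```

Counting cells this way streams the file once, so it works even on datasets too large to load into memory at once.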

Data Mining for Health Care

List of some Health Care Data Sets. CMS, for example, has recently released data about Medicare Provider Utilization and Payments (http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/), which has lots of possibilities. Another highlighted area under healthcare is structured learning for bioinformatics. Many bioinformatics problems involve a very large number of variables, often requiring a predictive model with fewer data points than variables. A series of approaches based on the LASSO and its extensions have recently been developed, with very promising results. See this talk and recent publications from Professor Ye's group. The group also has a nice software package that you can use to apply such methods to suitable data sets. Some related papers include those on ADNI and the Dartmouth Health Atlas. Also, for the Texas discharge data, papers can be found by doing a Google search for "Texas Hospital Inpatient Discharge Public".
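To illustrate the fewer-data-points-than-variables setting, here is a minimal LASSO fitted by proximal gradient descent (ISTA) in NumPy. This is a from-scratch sketch, not the package from Professor Ye's group; the data, penalty level, and iteration count are arbitrary choices for a toy example.

```python
import numpy as np

def lasso_ista(X, y, lam=0.05, n_iter=1000):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by iterative soft-thresholding (ISTA)."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the smooth part
    w = np.zeros(p)
    for _ in range(n_iter):
        w = w - step * (X.T @ (X @ w - y)) / n                     # gradient step on squared loss
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)   # soft-threshold (l1 prox)
    return w

# Toy p >> n problem: 20 samples, 50 variables, only 3 truly relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
true_w = np.zeros(50)
true_w[[3, 17, 40]] = [2.0, -1.5, 1.0]
y = X @ true_w + 0.01 * rng.normal(size=20)
w = lasso_ista(X, y)
print("largest coefficients at indices:", np.sort(np.argsort(np.abs(w))[-3:]))
```

The l1 penalty drives most coefficients exactly to zero, which is what makes sparse recovery possible even with fewer observations than variables.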

"Affinity" Data Sets

List of some "Affinity" Data Sets. These problems involve finding the affinities between two (or more) sets of entities, such as users and movies, or users and web pages/advertisements. Often, "side information" in the form of additional attributes of these entities (e.g., demographic information for the users, a social network, etc.) is also available to improve predictions. A related problem is learning to rank, for which the LETOR datasets/benchmarks are in the public domain.
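As a concrete illustration of affinity prediction (without side information), here is a minimal user-movie matrix factorization fitted by stochastic gradient descent in NumPy. The ratings, latent dimension, and learning rate are made up for this toy example; real recommendation datasets are far larger.

```python
import numpy as np

def factorize(triples, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=1000, seed=0):
    """Fit latent factors U, V so that U[u] . V[i] approximates rating r for each (u, i, r)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_users, k))
    V = 0.1 * rng.normal(size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in triples:
            err = r - U[u] @ V[i]                     # prediction error on this observation
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])    # gradient steps with l2 regularization
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

# Hypothetical observed ratings (user, movie, rating on a 1-5 scale).
triples = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 2, 1), (2, 1, 5), (2, 2, 4)]
U, V = factorize(triples, n_users=3, n_items=3)
pred = U @ V.T   # predicted affinity for every user-movie pair, observed or not
print(np.round(pred, 1))
```

The payoff is the unobserved entries of `pred`: the same factorization that fits the observed ratings also produces predictions for user-movie pairs that were never rated.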