I. Course description
II. Course textbooks
III. Course outline
IV. Course expectations
V. Grading information

Course description

The information explosion of the past few years has us drowning in data but often starved of knowledge. Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract pieces of information useful for making smart business decisions. Effective data mining, as opposed to data dredging, requires an understanding of concepts from exploratory data analysis, pattern recognition, machine learning, heterogeneous databases, parallel processing and data visualization, in addition to knowledge of the problem domain.

Given the rich set of topics in this area, I’ll be concentrating on only some core topics. The tentative schedule of classes can be found here.

The course is mostly a set of lectures by me, setting up basic concepts. There will, however, be a couple of classes reserved for lectures by visiting experts from industry/academia or used as “flipped classes”.

The last 4-5 classes will consist of student term-project presentations, followed by active discussion.

Term project guidelines are posted.

Information about the course instructor and TA(s) is available on the contact page.


Textbooks

There is no mandatory textbook. My notes will be available via Blackboard, supplemented by some papers. However, it will be helpful to have access to the following books:

Authors: Trevor Hastie, Robert Tibshirani, and Jerome Friedman (HTF)
Title: The Elements of Statistical Learning
Publisher: Springer (2nd edition)
ISBN: 0387848576
Notes: Available from Amazon for about $70 (well worth it), or download the PDF from http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Author: Christopher M. Bishop (B)
Title: Pattern Recognition and Machine Learning
Publisher: Springer
ISBN: 0387310738
Notes: http://research.microsoft.com/en-us/um/people/cmbishop/prml/

Authors: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)
Title: Introduction to Data Mining
Publisher: Addison-Wesley (2005)
ISBN: 0-321-32136-7
Notes: Some chapters are downloadable from this website

Tentative course outline

1. Introduction and Overview (2 lectures)

The data mining process; model fitting and overfitting; decision theory; probability review; data warehousing (B, Ch 1, 2.1-2.3; HTF Ch 1, 2.1-2.6; TSK, Ch 1)

2. Predictive Modeling/Regression (2-3 lectures)

Common issues; linear, non-linear and online methods (B 3.1, 3.2; HTF Ch 2.7, 2.8, 3.1-3.4, 7.1-7.3, 11.1-11.8; TSK, Appendix)

3. Classification (6 lectures)

Generative vs. Discriminative approaches; Decision Trees, Bayesian Belief networks, Evaluation, Kernel methods and SVMs (B 4.1-4.3.4, 6.1, 6.2, 7.1, 14.4; HTF Ch 4, 7.10, 9.2, 12, 13.3; TSK, Ch 4, 5.2, 5.3, 5.5)

4. Clustering and Co-clustering (4 lectures)

k-means; hierarchical methods, graph partitioning; co-clustering, semi-supervised learning; Market Basket applications (B 9.1, 9.2; HTF Ch 13.1, 13.2, 14.3, 14.4; TSK, Ch 8, parts of 9)

5. Data Pre-Processing, Cleaning, Reduction, Feature Extraction and Visualization (3 lectures)

Data quality; Curse of dimensionality; PCA, Kernel PCA, manifolds (B 12.1; HTF Ch 14.5, 14.8; TSK, Ch 2, 3.1-3.3; Appendix B)

6. Combining Multiple Models (1 lecture)

Ensemble learning; bagging and boosting (B 14.2, 14.3; HTF Ch 8.7, 8.8, 10.1-10.7, 16; TSK, Ch 5.6)

7. Intro to Web Mining and Cloud Computing (2 lectures; time permitting)

Google's PageRank; Hubs and authorities; social networks; Hadoop/MapReduce

8. Special Topics (distributed data mining, topic models, etc.) (time permitting)

Term Paper Presentations (about 4-5 classes)


Course expectations


PREREQUISITES:

For Grad students: (Graduate standing in Engineering, CS, Maths or Physics) OR (consent of the instructor). You are expected to know the basics (undergraduate level) of probability/statistics. Knowledge of basic linear algebra and algorithms will be assumed.

For Undergrads: You must have taken EE351K (Probability/stats) or equivalent. I will also assume knowledge of basic concepts in linear algebra (vector space, eigenvector/value, linear independence) and algorithms (computational complexity, correctness).

Notes, reading lists, scores, etc. will be communicated via Blackboard.


Grading information

5+10+30%  Project (groups of 3-5): project outline + 20-25 minute presentation + term paper due May 5
25%       Homeworks, including paper/topic critiques
5%        Pop-quiz (mid Feb)
20%       Written Exam (Th, March 21, in class)
5%        Pop-quiz (April)

There will be no final exam.
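As a rough illustration of how the weights above combine, here is a minimal sketch in Python. It rests on assumptions not stated in this syllabus: each component is scored on a 0-100 scale, the components are combined as a plain weighted average, and the "5+10+30%" project split maps, in order, to the outline, the presentation, and the term paper. The names `WEIGHTS` and `course_score` are hypothetical and used only for this example.

```python
# Component weights in percent, taken from the grading list above.
# Assumption: the "5+10+30%" project split maps, in order, to the
# project outline, the presentation, and the term paper.
WEIGHTS = {
    "project_outline": 5,
    "project_presentation": 10,
    "term_paper": 30,
    "homeworks": 25,
    "pop_quiz_feb": 5,
    "written_exam": 20,
    "pop_quiz_april": 5,
}


def course_score(scores):
    """Return the weighted average of per-component scores.

    Assumption: every component score is on a 0-100 scale and the
    components are combined linearly using the weights above.
    """
    # Sanity check: the listed weights must account for the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
```

For example, under these assumptions a student who scores 100 on every component except a 50 on the written exam would finish with 90, since the exam carries 20% of the grade.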

Disabilities statement: "The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. For more information, contact the Office of the Dean of Students at 471-6259, 471-4641 TTY."