I. Course description | II. Course textbooks | III. Course outline |
IV. Course expectations | V. Grading information |
The information explosion of the past few years has us drowning in data but often starved of knowledge. Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract pieces of information useful for making smart business decisions. Effective data mining, as opposed to data dredging, requires an understanding of concepts from exploratory data analysis, pattern recognition, machine learning, heterogeneous databases, parallel processing, and data visualization, in addition to knowledge of the problem domain.
Given the rich set of topics in this area, I’ll be concentrating on only some core topics. The tentative schedule of classes can be found here.
The course is mostly a set of lectures by me, setting up basic concepts. There will, however, be a couple of classes reserved for lectures by visiting experts from industry/academia or used as “flipped classes”.
The last 4-5 classes will consist of student term-project presentations, followed by active discussion.
Term project guidelines are posted.
Information about the course instructor and TA(s) is available on the contact page.
There is no mandatory textbook. My notes will be available via Blackboard, supplemented by some papers. However, it will be helpful if you have access to the following books:
1. Introduction and Overview (2 lectures)
The data mining process; model fitting and overfitting; decision theory; probability review; data warehousing (B, Ch 1, 2.1-2.3; HTF Ch 1, 2.1-2.6; TSK, Ch 1)
2. Predictive Modeling/Regression (2-3 lectures)
Common issues; linear, non-linear and online methods (B 3.1, 3.2; HTF Ch 2.7, 2.8, 3.1-3.4, 7.1-7.3, 11.1-11.8; TSK, Appendix)
3. Classification (6 lectures)
Generative vs. discriminative approaches; decision trees, Bayesian belief networks, evaluation, kernel methods and SVMs (B 4.1-4.3.4, 6.1, 6.2, 7.1, 14.4; HTF Ch 4, 7.10, 9.2, 12, 13.3; TSK, Ch 4, 5.2, 5.3, 5.5)
4. Clustering and Co-clustering (4 lectures)
k-means; hierarchical methods, graph partitioning; co-clustering, semi-supervised learning; market basket applications (B 9.1, 9.2; HTF Ch 13.1, 13.2, 14.3, 14.4; TSK, Ch 8, parts of 9)
5. Data Pre-Processing, Cleaning, Reduction, Feature Extraction and Visualization (3 lectures)
Data quality; curse of dimensionality; PCA, kernel PCA, manifolds (B 12.1; HTF Ch 14.5, 14.8; TSK, Ch 2, 3.1-3.3; Appendix B)
6. Combining Multiple Models (1 lecture)
Ensemble learning; bagging and boosting (B 14.2, 14.3; HTF Ch 8.7, 8.8, 10.1-10.7, 16; TSK, Ch 5.6)
7. Intro to Web Mining and Cloud Computing (2 lectures; time permitting)
Google's PageRank; hubs and authorities; social networks; Hadoop/MapReduce
8. Special Topics (distributed data mining, topic models, etc.) (time permitting)
Term Paper Presentations (about 4-5 classes)
For Grad students: (Graduate standing in Engineering, CS, Maths or Physics) OR (consent of the instructor). You are expected to know the basics (undergraduate level) of probability/statistics. Knowledge of basic linear algebra and algorithms will be assumed.
For Undergrads: You must have taken EE351K (Probability/stats) or equivalent. I will also assume knowledge of basic concepts in linear algebra (vector space, eigenvector/value, linear independence) and algorithms (computational complexity, correctness).
Notes, reading lists, scores, etc. will be communicated via Blackboard.
5+10+30% | Project (groups of 3-5): project outline + 20-25 minute presentation + term paper due May 5 |
25% | Homeworks, including paper/topic critiques |
5% | Pop-quiz (mid Feb) |
20% | Written Exam (Th, March 21, in class) |
5% | Pop-quiz (April) |
There will be no final exam.
Disabilities statement: "The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. For more information, contact the Office of the Dean of Students at 471-6259, 471-4641 TTY."