Course description

The information explosion of the past few years has us drowning in data but often starved of knowledge. Many companies that gather huge amounts of electronic data have now begun applying data mining techniques to their data warehouses to discover and extract pieces of information useful for making smart business decisions. Effective data mining, as opposed to data dredging, requires an understanding of concepts from exploratory data analysis, pattern recognition, machine learning, heterogeneous data bases, parallel processing and data visualization, in addition to knowing the problem domain.

Given the rich set of topics in this area, I'll concentrate my lectures on a few core topics. Particular emphasis will be given to techniques for predictive analytics, especially those that scale to very large data sets and/or remain relatively robust when faced with a large number of predictors, as well as algorithms for heterogeneous or streaming data. Many of these capabilities are essential for handling big data. We will mostly use the R language for statistical modeling, but we will also refer to some Python examples, mostly based on scikit-learn.

The central goal of this course is to convey an understanding of the pros and cons of different predictive modeling techniques, so that you can (i) make an informed decision about which approaches to consider when faced with real-life problems requiring predictive modeling, and (ii) apply models properly to real datasets so as to draw valid conclusions. This goal will be reinforced through both theory and hands-on experience.
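
To give a flavor of what "applying a model properly" means, here is a minimal illustrative R sketch of a train/test workflow. The dataset (the built-in iris data), the caret package, and the k-nearest-neighbor model below are placeholder choices for this example only, not the specific tools or data we will use in class.

library(caret)                       # modeling package by Max Kuhn, co-author of the KJ text

data(iris)                           # small built-in dataset, used purely as a stand-in
set.seed(1)                          # make the random split reproducible

# Hold out 25% of the rows; fit and tune only on the remaining 75%.
in_train  <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
train_set <- iris[in_train, ]
test_set  <- iris[-in_train, ]

# Fit a k-nearest-neighbor classifier, tuning k with 10-fold cross-validation
# on the training split only.
fit <- train(Species ~ ., data = train_set, method = "knn",
             trControl = trainControl(method = "cv", number = 10))

# Judge the fitted model on data it has never seen.
pred <- predict(fit, newdata = test_set)
confusionMatrix(pred, test_set$Species)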

Information about machine learning/data mining resources (including data sources) is available on the links page.

Information about the course instructor and TA(s) is available on the contact page.

Textbooks

The material for the lectures is taken from a wide variety of sources. My notes will be available via Canvas. The textbooks for the course are:

Author: Max Kuhn and Kjell Johnson (KJ)
Title: Applied Predictive Modeling
Publisher: Springer
ISBN: 1461468485
Year: 2013

Author: Gareth James et al. (JW)
Title: An Introduction to Statistical Learning: with Applications in R
Publisher: Springer
ISBN: 978-1-4614-7138-7
Year: 2013

Course outline

KJ and JW refer to the texts above; Pi refers to paper set i, provided through Canvas as needed. (TSK readings are listed for convenience; TSK is no longer a required text.)

January Reading Assignment: TSK Ch 1-3, Appendices A-D; JW Ch 1-3, 6; KJ Ch 1-3, 6.1, 6.2, 19

Area of study: introduction; R/scikit-learn demos; data quality and pre-processing; intro to regression

February Reading Assignment: TSK Ch 4, 5, and parts of Ch 6; JW Ch 4, 5, 8.1; KJ Ch 4, 5, rest of Ch 6, 7.1, 7.4, 11

Area of study: predictive modeling; introduction to classification methods

March Reading Assignment: TSK Ch 8, 9; JW Ch 9, 10; KJ Ch 12, 13, 14.1, 14.2, 8.1, 16

Area of study: more classification; clustering/segmentation; recommender systems

April Reading Assignment: TSK Ch 5.6; notes/papers; JW Ch 8.2

Area of study: association rules; market basket analysis; combining multiple models

May Reading Assignment: notes/papers

Area of study: intro to web analytics: analyzing hyperlink structure, content, and usage of web sites (time permitting); project presentations; course wrap-up.

Course expectations

This course requires students to have an undergraduate-level understanding of basic concepts from probability/statistics, data analysis, and linear algebra. This is a graduate course, so expect a moderate workload.

While studying techniques for database representation/modeling, clustering, classification, finding associations, and sequence processing, emphasis will be placed on algorithm scalability, performance, interpretability, and the ability to deal with noisy or low-quality data. Student talks of 10-15 minutes will be interwoven with the lectures, depending on class size. The last two classes will largely consist of student term-project presentations, followed by active discussion.

Grading information

15%  Written homework assignments (3: Feb, March, April)
15%  In-class quizzes (Saturdays in Feb, March, May)
20%  Mid-term exam (Sat, Apr)
10%  Brief presentation of a research paper/topic (groups of 2)
35%  Final project (groups of 3-4)
 5%  Class participation

At the end of the course, you will receive a score out of 100 based on the percentages stated above, and your final grade will be based solely on this score. Grading is primarily on a curve, i.e., relative to how the whole class performs; however, the entire curve may shift up or down slightly depending on how the class performs relative to past classes. Grading is NOT based on absolute thresholds, e.g., 90+ = A.
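
For concreteness, the overall score is just a weighted sum of the component scores. The short R snippet below shows the arithmetic with made-up component scores; the numbers are hypothetical, only the weights come from the table above.

# Hypothetical component scores (each out of 100); weights match the table above.
weights <- c(homework = 0.15, quizzes = 0.15, midterm = 0.20,
             paper_talk = 0.10, project = 0.35, participation = 0.05)
scores  <- c(homework = 85, quizzes = 78, midterm = 82,
             paper_talk = 90, project = 88, participation = 100)
sum(weights * scores)    # weighted course score out of 100; here 85.65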

Supplementary Texts

Author: Trevor Hastie, Robert Tibshirani, and Jerome Friedman (HTF)
Title: The Elements of Statistical Learning
Publisher: Springer (2nd edition)
ISBN: 0387848576
Notes: Available from Amazon for about $70 (well worth it), or as a free PDF download from http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Author: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)
Title: Introduction to Data Mining
Publisher: Addison-Wesley (2005)
ISBN: 0-321-32136-7
Notes: Some chapters are downloadable from this website

Author: Christopher M. Bishop (B)
Title: Pattern Recognition and Machine Learning
Publisher: Springer
ISBN: 0387310738
Notes: http://research.microsoft.com/en-us/um/people/cmbishop/prml/

Author: Kevin Murphy
Title: Machine Learning: A Probabilistic Perspective
Publisher: MIT Press
ISBN: 0262018020
Notes: Covers a very wide range of topics. Lots of examples in Matlab, with source code access.

Author: Wes McKinney
Title: Python for Data Analysis
Publisher: O'Reilly

MOOC: Andrew Ng's Coursera course on machine learning has some very introductory material on linear algebra, e.g., multiplying a matrix by a vector.
Notes: https://class.coursera.org/ml-003/lecture

 

Disabilities statement: "The University of Texas at Austin provides upon request appropriate academic accommodations for qualified students with disabilities. For more information, contact the Office of the Dean of Students at 471-6259, 471-4641 TTY."
