The information explosion of the past few years has us
drowning in data but often starved of knowledge. Many companies that gather
huge amounts of electronic data have now begun applying data mining techniques
to their data warehouses to discover and extract pieces of information useful
for making smart business decisions. Effective data mining, as opposed to data
dredging, requires an understanding of concepts from exploratory data analysis,
pattern recognition, machine learning, heterogeneous data bases, parallel
processing and data visualization, in addition to knowing the problem domain.
Given the rich set of topics in this area,
I’ll concentrate the lectures on some core topics. Particular
emphasis will be given to techniques for predictive analytics, especially
those that are scalable to very large data sets and/or those that are
relatively robust when faced with a large number of predictors, and algorithms
for heterogeneous or streaming data. Many of these capabilities are essential
for handling BIG DATA. We will mostly be using the R language for statistical
modeling, but some examples of Python code, mostly from Scikit-Learn, will
also be referred to.
The central goal of this course is to convey an understanding of the pros
and cons of different predictive modeling techniques, so that you can (i) make an informed decision on
what approaches to consider when faced with real-life problems requiring
predictive modeling, and (ii) apply models properly on real datasets so as to draw
valid conclusions. This goal will be reinforced through both theory and hands-on
experience.
Information about machine learning/data mining resources (including data sources) is available on the links page.
Information about the course instructor and TA(s) is available on the contact page.
The material for the lectures is taken from a wide variety of sources. My
notes will be available via Canvas. The textbooks for the course are:
Author: Max Kuhn and Kjell Johnson (KJ)
Title: Applied Predictive Modeling
Publisher: Springer
ISBN: 1461468485
Year: 2013
Author: Gareth James et al. (JW)
Title: An Introduction to Statistical Learning: with Applications in R
Publisher: Springer (2013)
ISBN: 978-1-4614-7138-7
KJ and JW refer to the textbooks above; Pi refers to paper set i, provided through Canvas as needed.
(TSK readings are given for convenience; TSK is no longer a course text.)
January
Reading Assignment: TSK Ch 1-3, Appendices A, B, C, D; JW Ch 1-3, 6; KJ Ch 1, 2, 3, 6.1, 6.2, 19
Area of study: introduction; R/scikit-learn demos; data quality and pre-processing; intro to regression
February
Reading Assignment: TSK Ch 4, 5, parts of Ch 6; JW Ch 4, 5, 8.1; KJ Ch 4, 5, rest of Ch 6, 7.1, 7.4, 11
Area of study: predictive modeling; introduction to classification methods
March
Reading Assignment: TSK Ch 8, 9; JW Ch 9, 10; KJ Ch 12, 13, 14.1, 14.2, 8.1, 16
Area of study: more classification; clustering/segmentation; recommender systems
April
Reading Assignment: TSK Ch 5.6; notes/papers; JW 8.2
Area of study: association rules; market basket analysis; combining multiple models
May
Reading Assignment: notes/papers
Area of study: intro to web analytics: analyzing hyperlink structure, content and usage of web
sites (time permitting); project presentations; course wrap-up.
This course requires
students to have an undergraduate-level understanding of some basic concepts
from probability/statistics, data analysis and linear algebra. This is a
graduate course, so the workload will be medium.
While studying
techniques for database representation/modeling, clustering, classification,
finding associations and sequence processing, emphasis will be placed on the
issues of algorithm scalability, performance, interpretability and the ability
to deal with garbage data. 10-15 minute student talks will be interwoven with
the lectures, depending on class size. The last two classes will largely
consist of student term-project presentations, followed by active discussion.
15% | Written homeworks (3: Feb, March, April) |
15% | In-class quiz (Saturdays in Feb, March, May) |
20% | Mid-term (Sat, Apr) |
10% | Brief presentation of research paper/topic (groups of 2) |
35% | Final project (groups of 3-4) |
5% | Class participation |
At the end of the course, you will get a score out of 100 based on the
percentages stated above. Your final grade will be solely based on this score.
The grade is primarily based on the curve, i.e. it is relative to how the whole
class performs; however, the entire curve may shift up or down a bit depending on
how the class as a whole performs relative to past classes. Grading
is NOT based on absolute thresholds, e.g. 90+ = A etc.
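As an illustration of how the percentages above combine into a score out of 100, here is a minimal Python sketch. The weights come from the grading breakdown; the component names, the assumption that each component is itself scored out of 100, and the example student are hypothetical and for illustration only. The sketch computes the weighted score only, not the curved letter grade.

```python
# Weights taken from the grading breakdown above (they sum to 100%).
# The dictionary keys are illustrative names, not official labels.
WEIGHTS = {
    "homeworks": 0.15,
    "quizzes": 0.15,
    "midterm": 0.20,
    "paper_presentation": 0.10,
    "final_project": 0.35,
    "participation": 0.05,
}


def course_score(component_scores):
    """Return the weighted score out of 100.

    Assumes every component in component_scores is on a 0-100 scale.
    """
    missing = set(WEIGHTS) - set(component_scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Hypothetical student, for illustration only.
example_scores = {
    "homeworks": 80,
    "quizzes": 70,
    "midterm": 90,
    "paper_presentation": 100,
    "final_project": 85,
    "participation": 100,
}
example_total = course_score(example_scores)  # approximately 85.25
```

Because the final project carries 35% of the weight, a 10-point change in the project score moves the total by 3.5 points, more than any other single component.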
Author: Trevor Hastie, Robert Tibshirani, and Jerome Friedman (HTF)
Title: The Elements of Statistical Learning
Publisher: Springer (2nd edition)
ISBN: 0387848576
Notes: Can get it from Amazon, about $70 but well worth it, or download
pdf from http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Author: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (TSK)
Title: Introduction to Data Mining
Publisher: Addison-Wesley (2005)
ISBN: 0-321-32136-7
Notes: Some chapters are downloadable from this website
Author: Christopher M. Bishop (B)
Title: Pattern Recognition and Machine Learning
Publisher: Springer
ISBN: 0387310738
Notes: http://research.microsoft.com/en-us/um/people/cmbishop/prml/
Author: Kevin Murphy
Title: Machine Learning: A Probabilistic Perspective
Publisher: MIT Press
ISBN: 0262018020
Notes: Covers a very wide range of topics. Lots of examples in Matlab, with source code access.
Author: Wes McKinney
Title: Python for Data Analysis
Publisher: O'Reilly
MOOC: Coursera course by Andrew Ng has some very
introductory material on linear algebra, e.g. multiplying a matrix with a
vector.
Notes: https://class.coursera.org/ml-003/lecture
Disabilities
statement: "The University of Texas at Austin provides upon request
appropriate academic accommodations for qualified students with disabilities.
For more information, contact the Office of the Dean of Students at 471-6259,
471-4641 TTY."
NOTICES: