# Big Data (BUS 41201)

## Course Description

BUS 41201 is a course about data mining: the analysis, exploration, and simplification of large high-dimensional datasets. Students will learn how to model and interpret complicated "Big Data" and become adept at building powerful models for prediction and classification.
Techniques covered include an advanced overview of linear and logistic regression, model choice and false discovery rates, multinomial and binary regression, classification, decision trees, factor models, clustering, the bootstrap and cross-validation. We learn both basic underlying concepts and practical computational skills, including techniques for analysis of distributed data.
Heavy emphasis is placed on analysis of actual datasets, and on development of application-specific methodology. Among other examples, we will consider consumer database mining, internet and social media tracking, network analysis, and text mining.

Syllabus
## Teaching Assistants:

Ken McAlinn (kenmcalinn@gmail.com)

Wenxi Li (wenxi.li@chicagobooth.edu)

Jianfei Cao (jcao0@chicagobooth.edu)

## Office Hours:

By appointment

## Review Sessions:

Saturday at Gleacher.

## Instructor:

Kenichiro (Ken) McAlinn (Senior Research Professional in Econometrics and Statistics)

## R Resources:

Download R, R Project Site, R Studio

**Tutorials:** Google developer, Princeton, TryR code school, Quick R

**Books:** R in a nutshell, Art of R programming, Library E-Books, Introductory Statistics with R

## Piazza link

piazza.com/uchicago/spring2017/busn412010185bigdata/home

## First Class Assignment:

Make yourself familiar with R! The course is a fast-paced introduction to a wide variety of statistical learning methods. Knowing the basics of R before you start will make your life much easier and allow you to concentrate your effort on learning data science tools and concepts.
As a start, I recommend that people who are new to R go through R tutorials, such as the TryR tutorial at http://tryr.codeschool.com.

## Week 1: **Inference at scale**

Slides

### Datasets:

**Trucks:** pickup.R, pickup.csv

**Diabetes:** dm2_pvals.R, dm2_fdr.R, diabetes.csv

**Cholesterol:** lipids.R, jointGwasMc_LDL.txt

**Extra Code:** fdr.R

## Week 2: **Regression**

Slides

### Datasets:

**Orange juice:** oj.R, oj.csv

**Spam:** spam.R, spam.csv

**Extra Code:** deviance.R

## Week 3: **Model Selection**

Slides

### Datasets:

**Comscore:** comscore.R, CS2006demographics.csv, CS2006domains.csv.csv, CS2006sites.txt, CS2006totalspend.csv

**Semiconductor:** semiconductor.R, semiconductor.csv

**Extra Code:** naref.R

## Week 4: **Treatment Effects**

Slides

### Datasets:

**Abortion:** abortion.dat, abortion.R, us_cellphone.csv

**Paidsearch:** paidsearch.csv, paidsearch.R

**Extra Code:** mab.R

## Week 5: **Classification**

Slides

### Datasets:

**Credit:** credit.csv, credit.R, data_description

**Glass:** glass.R

**Extra Code:** roc.R

## Week 6: **Networks**

Slides

### Datasets:

**Marriage:** firenze.R, firenze.txt

**Karate:** karate.R

**Lastfm:** lastfm.R, lastfm.csv

**Websearch:** CaliforniaEdges.csv, CaliforniaNodes.txt, websearch.R

## Week 7: **Clustering**

Slides

### Datasets:

**Protein:** protein.R, protein.csv

**Wine:** wine.R, wine.csv

**We8there:** we8there.R

**Extra Code:** kIC.R

## Week 8: **Factor Models**

Slides

### Datasets:

**Protein:** protein.R, protein.csv

**Rollcall:** rollcall_votes.R, rollcall.csv, rollcall-members.csv

**NBC:** nbc_demographics.csv, nbc_pilotsurvey.csv, nbc_showdetails.csv, nbc.R

**Gas:** gas.R, gasoline.csv

## Week 9: **Trees**

Slides

### Datasets:

**Prostate:** prostate_cancer.R, prostate.csv

**Mcycle:** mcycle.R

**Calhomes:** CAhousing.csv, calhomes.R