CS 194-16 Introduction to Data Science - UC Berkeley, Spring 2014

Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.

Logistics

Pre-requisites

Pre-requisites for this course include 61A, 61B, 61C and basic programming skills. Knowledge of Python will be useful for the assignments. Students will also be expected to run VirtualBox on their laptops for the assignments.

Please take the class survey here.

Please set up your machine according to these instructions.

Grading

Schedule

Class Date Lecture Material Reading Assignments
M 1/27 L1: Introduction/Data Science Process [PPTX] [PDF]
M 2/3 L2: Data Preparation (w/Unix Shell Lab) [PPTX] [PDF] Enterprise Data Analysis and Visualization: An Interview Study Bunny 1 by 5pm on 2/3
Lab 1
M 2/10 L3: Tabular Data (w/Pandas Lab) [PPTX] [PDF] From Databases to Dataspaces: A New Abstraction for Information Management
Schemaless SQL and Schema on Write vs. Schema on Read
Bunny 2 by 5pm on 2/10
Lab 2
F 2/14 Homework 1 out. Due by 2/28
M 2/17 No class - President's Day
M 2/24 L4: Data Cleaning (w/Open Refine Lab) [PPTX] [PDF] Lab 3
F 2/28 Homework 1 Due! Submit using glookup
M 3/3 L5: Part 1 - Guest Lecture: Josh Wills, Director of Data Science, Cloudera; followed by:
Part 2 - Data Integration (w/Pandas) [PPTX] [PDF]
WebTables: Exploring the Power of Tables on the Web (Sections 1,2 and 4; others optional)
and OpenRefine Data Augmentation (video)
Bunny 3 by 5pm;
Lab 4
Final Project Group Lists Due Midnight
M 3/10 L6: Exploratory Data Analysis (with Python lab) [PPTX] [PDF] Statistical Thinking in the Age of Big Data
Exploratory Data Analysis
From the O'Reilly Book "Doing Data Science" - available on campus or via the library VPN.

Introduction to Hypothesis Testing
Bunny 4 by 5pm;
Lab 5
Final Project Proposals due Tues 3/11 Midnight.
T 3/11 Homework 2 out. Due by 4/1
M 3/17 L7: Regression, Classification, intro to Supervised Learning (with R Lab)
Part 1: [PPTX] [PDF] Part 2: [PPTX] [PDF]
Homework Tips
Three Basic Algorithms From the O'Reilly Book "Doing Data Science" - available on campus or via the library VPN.
Bunny 5 by 5pm
Lab 6
M 3/24 No Lecture: Spring Break
M 3/31 L8: Part 1 - Guest Lecture: Peter Skomoroch; Slides: [PDF](29MB); followed by:
Part 2 - Unsupervised Learning and K-Means Clustering (in Python)
K-Nearest Neighbors and K-Means clustering from Three Basic Algorithms. Part of the O'Reilly Book "Doing Data Science" - available on campus or via the library VPN.
No Bunny!
Lab 7
T 4/1 Homework 2 Due. Submit using glookup
M 4/7 L9: Scaling Up Analytics (with Spark/EC2 Lab); Guest Lecturer: Kay Ousterhout [PDF] [PPTX] "MapReduce," "Word Frequency Problem", and "Other Examples of MapReduce" sections from O'Reilly "Doing Data Science" book (available online or from the library) and Spark Short paper Bunny 9 by 5pm
Lab 8
Homework 3, Part 1 Due 4/14
F 4/11 Final Project update due on glookup
M 4/14 L10: Visualization (D3 lab) [PPTX] [PDF]
Lab Slides
Chapter 9 on Data Visualization from "Doing Data Science" available online or from the library.
D3: Data-Driven Documents by Bostock et al.
Optional: Reading by Edward Tufte on how the Challenger disaster might have been prevented with data visualization
Bunny 10 by 5pm
Homework 3, Part 1 due
Lab 9
Th 4/17 Midterm - 6:00 to 7:30 pm
F 4/18 Homework 3, Part 2 out. Due by 4/25.
M 4/21 L11: Graph Processing (with GraphX Lab); Guest Lecturers: Joey Gonzalez and Dan Crankshaw
[PPTX](19MB) [PDF](19MB)
Chapter 2 from "Networks, Crowds, and Markets: Reasoning About a Highly Connected World" Bunny 11 by 5pm
Lab 10
F 4/25 Homework 3, Part 2 due
M 4/28 L12: Putting it All Together Bunny 12 by 5pm