Organizations use their data for decision support and to build data-intensive products and services. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then equip the students with the ability to deliver against these expectations. The assignments will involve web programming, statistics, and the ability to manipulate data sets with code.

- Course Number: CS 194-16, Spring 2014, UC Berkeley
- Instructor: Mike Franklin
- Time: Monday, 5.30pm - 8.30pm
- Location: 145 Moffitt
- Teaching Assistants: Daniel Bruckner, Evan Sparks and Shivaram Venkataraman
- Discussion: Join Piazza for announcements and to ask questions about the course
- Office hours:
  - Mike Franklin - T 3.30-4.30, Th 2.30-3.30 at 449 Soda
  - GSIs - M 2-3 at 449 Soda, W 11-12 at 751 Soda

Prerequisites for this course are 61A, 61B, 61C and basic programming skills. Knowledge of Python will be useful for the assignments. Students will also be expected to run VirtualBox on their laptops for the assignments.

Please take the class survey here.

Please set up your machine according to these instructions.

- Class Participation and in-class labs: 20%
- Midterm: 20%
- Final Project (in groups): 25% (Final Project information is here)
- Homework: 30% (3 @ 10% each: Homework 1; Homework 2; Homework 3)
- “Bunnies” : 5%

Class Date | Lecture Material | Reading | Assignments |

M 1/27 | L1: Introduction/Data Science Process [PPTX] [PDF] | ||

M 2/3 | L2: Data Preparation (w/Unix Shell Lab) [PPTX] [PDF] | Enterprise Data Analysis and Visualization: An Interview Study | Bunny 1 by 5pm on 2/3; Lab 1 |

M 2/10 | L3: Tabular Data (w/Pandas Lab) [PPTX] [PDF] | From Databases to Dataspaces: A New Abstraction for Information Management; Schemaless SQL and Schema on Write vs. Schema on Read | Bunny 2 by 5pm on 2/10; Lab 2 |

F 2/14 | Homework 1 out. Due by 2/28 | ||

M 2/17 | No class - President's Day | ||

M 2/24 | L4: Data Cleaning (w/Open Refine Lab) [PPTX] [PDF] | | Lab 3 |

F 2/28 | Homework 1 Due! Submit using glookup | ||

M 3/3 | L5: Part 1 - Guest Lecture: Josh Wills, Director of Data Science, Cloudera; followed by Part 2 - Data Integration (w/Pandas) [PPTX] [PDF] | WebTables: Exploring the Power of Tables on the Web (Sections 1, 2 and 4; others optional) and OpenRefine Data Augmentation (video) | Bunny 3 by 5pm; Lab 4; Final Project Group Lists due midnight |

M 3/10 | L6: Exploratory Data Analysis (with Python lab) [PPTX] [PDF] | Statistical Thinking in the Age of Big Data; Exploratory Data Analysis (from the O'Reilly book "Doing Data Science", available on campus or via the library VPN); Introduction to Hypothesis Testing | Bunny 4 by 5pm; Lab 5; Final Project Proposals due Tues 3/11 midnight |

T 3/11 | Homework 2 out. Due by 4/1 | ||

M 3/17 | L7: Regression, Classification, Intro to Supervised Learning (with R Lab) Part 1: [PPTX] [PDF] Part 2: [PPTX] [PDF] Homework Tips | Three Basic Algorithms (from the O'Reilly book "Doing Data Science", available on campus or via the library VPN) | Bunny 5 by 5pm; Lab 6 |

M 3/24 | No Lecture: Spring Break | ||

M 3/31 | L8: Part 1 - Guest Lecture: Peter Skomoroch; Slides: [PDF] (29MB); followed by Part 2 - Unsupervised Learning and K-Means Clustering (in Python) | K-Nearest Neighbors and K-Means clustering from Three Basic Algorithms (part of the O'Reilly book "Doing Data Science", available on campus or via the library VPN) | No Bunny!; Lab 7 |

T 4/1 | Homework 2 Due. Submit using glookup | ||

M 4/7 | L9: Scaling Up Analytics (with Spark/EC2 Lab); Guest Lecturer: Kay Ousterhout [PDF] [PPTX] | "MapReduce", "Word Frequency Problem", and "Other Examples of MapReduce" sections from the O'Reilly book "Doing Data Science" (available online or from the library) and Spark Short paper | Bunny 9 by 5pm; Lab 8; Homework 3, Part 1 due 4/14 |

F 4/11 | Final Project update due on glookup | ||

M 4/14 | L10: Visualization (D3 lab) [PPTX] [PDF] Lab Slides | Chapter 9 on Data Visualization from "Doing Data Science" (available online or from the library); D3: Data-Driven Documents by Bostock et al.; Optional: reading by Edward Tufte on how the Challenger disaster might have been prevented with data visualization | Bunny 10 by 5pm; Homework 3, Part 1 due; Lab 9 |

Th 4/17 | Midterm - 6.00 to 7.30 pm | ||

F 4/18 | Homework 3, Part 2 out. Due by 4/25. | ||

M 4/21 | L11: Graph Processing (with GraphX Lab); Guest Lecturers: Joey Gonzalez and Dan Crankshaw [PPTX] (19MB) [PDF] (19MB) | Chapter 2 from "Networks, Crowds, and Markets: Reasoning About a Highly Connected World" | Bunny 11 by 5pm; Lab 10 |

F 4/25 | Homework 3, Part 2 due | ||

M 4/28 | L12: Putting it All Together |