The final projects for this class are to be done in groups of 3 students (or 2 with special permission). The idea is to perform an end-to-end data science project of your choosing that exercises the entire data science lifecycle.
In the project you will identify two or more data sets you would like to study. Write the code to collect and integrate those data sets, then build two or three visualizations of the data. Once you’ve got a decent feeling for the data, perform an analysis of the data to identify insights, answer questions, examine hypotheses, etc.
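As a rough illustration of the "collect and integrate" step, here is a minimal sketch that joins two small tables on a shared key and counts the gaps left behind. Everything in it is an assumption made for illustration: the inline CSV strings, the `region`, `sales`, and `visits` columns, and the helper names `read_rows`, `outer_join`, and `count_missing` are all hypothetical, and a real project would read from its own sources and would likely use a data-frame library instead.

```python
import csv
import io

# Made-up sample data standing in for two real sources that share a "region" key.
SALES_CSV = "region,sales\nnorth,10\nsouth,20\neast,30\n"
VISITS_CSV = "region,visits\nnorth,100\nsouth,200\nwest,400\n"


def read_rows(text, key):
    """Parse CSV text into a dict mapping each key value to its row."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def outer_join(left, right, key):
    """Combine two keyed tables, keeping keys found in either one.

    Fields absent from one side are set to None so that gaps stay visible.
    """
    left_fields = {field for row in left.values() for field in row}
    right_fields = {field for row in right.values() for field in row}
    fields = sorted(left_fields | right_fields)
    joined = []
    for k in sorted(left.keys() | right.keys()):
        merged = dict.fromkeys(fields)  # every field starts out as None
        merged.update(left.get(k, {}))
        merged.update(right.get(k, {}))
        merged[key] = k
        joined.append(merged)
    return joined


def count_missing(rows):
    """Count, per field, how many rows have no value for it."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


combined = outer_join(
    read_rows(SALES_CSV, "region"), read_rows(VISITS_CSV, "region"), "region"
)
missing = count_missing(combined)
print(missing)
```

An outer join is used here so that records present in only one source are kept rather than silently dropped; the missing-value counts then give an early sense of how well the two data sets actually line up.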
You should produce some interesting visualizations of the data, and develop a prototype of a data product that uses the data and the analyses.
We recommend that throughout the project you keep a diary of your successes and failures. Did you run into problems fetching the data? Coding it? Were there a lot of missing values? Were your visualizations insightful? What are your concerns about the quality of the inferences you can draw? The final submission will consist of a paper documenting your project and experiences, and a presentation/demo (details to be provided).
For the first stage, we would like you to produce a 1-2 page Initial Project Proposal. This proposal should outline:
We understand that these proposals are preliminary. We will meet with the project groups to discuss the proposals so that we can agree on direction and scope, and try to identify any gotchas that may arise.
For some inspiration, you can have a look at the slides from the presentations of the 2011 offering of 194-16 (note that the requirements were somewhat different that year).
There are lots of places collecting interesting data sets or pointers - here are a few:
Quandl - Find, Use and Share numerical data
If you find other good sources of available data, please post them to the Piazza group.