Tutorial 1: Data exploration and wrangling
Before building a machine learning model, it is important to understand and wrangle your data into an appropriate numeric format. In this tutorial, we’ll look at how I like to set up my projects and some tips for exploratory visualizations.
Part 1: Project configuration
Tutorials are not marked in this course, so it’s up to you to keep track of them separately. I recommend copying this directory to a new location rather than forking the entire w26 repo and working out of that fork; otherwise you’ll end up with merge conflicts and unrelated files if you later want to submit a PR to the main repo. Alternatively, you can create a fork and work within a separate branch.
Tools:
Part 2: Exploratory visualizations
Visualizations for the purposes of exploring data (rather than communicating results) can be “quick and dirty”, but there are some guidelines to consider, as well as a few tricks that can help.
Follow along with the notebook and answer the various TODOs.
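As a rough illustration of what “quick and dirty” can look like, here is a minimal sketch that plots one histogram per numeric column and a bar chart of counts for a categorical column. The DataFrame and its column names (height_cm, weight_kg, group) are made up for this example and are not part of the tutorial notebook; swap in your own data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for whatever you are exploring
df = pd.DataFrame({
    "height_cm": [150.0, 162.5, 171.0, 180.2, 168.4, 175.9],
    "weight_kg": [50.1, 61.0, 70.3, 82.5, 65.2, 77.7],
    "group": ["a", "b", "a", "c", "b", "a"],
})

# One panel per numeric column, plus one for the categorical column
numeric_cols = df.select_dtypes(include="number").columns
n_panels = len(numeric_cols) + 1
fig, axes = plt.subplots(1, n_panels, figsize=(4 * n_panels, 3))

# Histograms show the rough shape and range of each numeric feature
for ax, col in zip(axes, numeric_cols):
    df[col].plot.hist(ax=ax, bins=5, title=col)

# Bar chart of category counts reveals class imbalance at a glance
counts = df["group"].value_counts()
counts.plot.bar(ax=axes[-1], title="group")

fig.tight_layout()
```

In a notebook the figure renders inline; in a script you would add plt.show() or fig.savefig(...) at the end. Titles and default styling are enough here, since the goal is to spot skew, outliers, and imbalance rather than to produce a polished figure.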
Part 3: Reverse engineer a cleaned dataset
- Create a new .ipynb file to explore this new dataset
- Read the raw data into a pandas DataFrame. You can either download the zip file, or install the ucimlrepo package and fetch the data directly.
- Read the pre-processed version into a different pandas DataFrame.
- Try to answer the following questions:
  - How were the categorical features handled?
  - Were any of the numerical features manipulated?
  - What additional transformations might be useful for this dataset?
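One way to start on these questions is to compare the two DataFrames column by column. The sketch below is only a starting point: compare_frames is a hypothetical helper, the toy frames stand in for the raw and pre-processed data you loaded above, and the one-hot check assumes the encoded columns are named with the original column name followed by an underscore (the pandas get_dummies convention), which may not hold for your dataset. It also assumes the rows of both frames are in the same order.

```python
import numpy as np
import pandas as pd


def compare_frames(raw, processed):
    """Summarize column-level differences between two versions of a dataset."""
    raw_only = [c for c in raw.columns if c not in processed.columns]
    new_cols = [c for c in processed.columns if c not in raw.columns]
    shared = [c for c in raw.columns if c in processed.columns]

    # Guess at one-hot encoding: new columns named "<raw column>_<value>"
    # (an assumption about naming, not something the data guarantees)
    non_numeric = list(raw.select_dtypes(exclude="number").columns)
    one_hot = {
        c: [p for p in new_cols if str(p).startswith(f"{c}_")]
        for c in non_numeric
    }

    # Shared numeric columns whose values differ were probably scaled,
    # clipped, or otherwise transformed (assumes rows are aligned)
    changed = []
    for c in shared:
        both_numeric = (
            pd.api.types.is_numeric_dtype(raw[c])
            and pd.api.types.is_numeric_dtype(processed[c])
        )
        if both_numeric and len(raw) == len(processed):
            a = raw[c].to_numpy(dtype=float)
            b = processed[c].to_numpy(dtype=float)
            if not np.allclose(a, b, equal_nan=True):
                changed.append(c)

    return {
        "raw_only": raw_only,
        "new_cols": new_cols,
        "one_hot_candidates": one_hot,
        "changed_numeric": changed,
    }


# Hypothetical stand-ins for the raw and pre-processed DataFrames
raw = pd.DataFrame({
    "age": [20, 30, 40],
    "income": [10.0, 20.0, 30.0],
    "color": ["red", "blue", "red"],
})
processed = pd.DataFrame({
    "age": [20, 30, 40],
    "income": [-1.0, 0.0, 1.0],
    "color_blue": [0, 1, 0],
    "color_red": [1, 0, 1],
})

report = compare_frames(raw, processed)
print(report)
```

On the toy data, the report flags color as replaced by color_blue and color_red, and income as a numeric column whose values changed. From there, comparing raw[col].describe() against processed[col].describe() for each flagged column can help you guess which transformation was applied.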