Final Project: Predict something with some data

# | Deliverable | Due date | Weight (of project)
1 | Proposal | Feb 27 (Pitch Presentation in lab on March 2) | 15%
2 | Checkpoint 1 | March 27 | 20%
3 | Checkpoint 2 | April 13 | 20%
4 | Final Report | April 24 (Presentations on April 22*) | 45%

*I asked the registrar to schedule a block of time during final exams for you to share the outcomes of your projects, so this shows up on MyMRU as a final exam. You can put finishing touches on your report in the days afterwards.

You may work in teams of 2 or 3.

Overview

The goal of the final project is to showcase the skills you’ve learned in this course by applying them to a new dataset and task. You are free to choose anything you like, but do not pick a “hello-world” style dataset. I’d rather you take a risk and have it not be successful than recreate yet another Titanic survival prediction model.

Overall, throughout this project you will:

  1. Choose a task, such as classification, regression, time-series forecasting, NLP, etc.
  2. Find an appropriate dataset for that task. This is harder than it sounds, and may require some iterating when you start working with the data.
  3. Explore the data and preprocess as necessary to handle missing values, feature scaling, etc.
  4. Implement a model to accomplish your task.
  5. Write a formal report describing your project.
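
To make steps 3 and 4 concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not a template you must follow: the data is synthetic, the function names are made up for this example, and the nearest-centroid classifier is just a stand-in baseline. In a real project you would more likely reach for a library such as scikit-learn. The point it shows is that imputation and scaling statistics are learned from the training split only and then reused on the held-out split.

```python
import random
import statistics

def fit_preprocessor(rows):
    """Learn per-column mean and standard deviation from training rows only."""
    means, stds = [], []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(statistics.fmean(present))
        stds.append(statistics.pstdev(present) or 1.0)  # avoid dividing by zero
    return means, stds

def transform(rows, means, stds):
    """Fill missing values with the training mean, then standardize."""
    return [
        [((m if v is None else v) - m) / s for v, m, s in zip(row, means, stds)]
        for row in rows
    ]

def fit_centroids(X, y):
    """Nearest-centroid baseline: one mean vector per class."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids

def predict(X, centroids):
    """Assign each row to the class whose centroid is closest."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X]

# Synthetic two-class data with some values knocked out (hypothetical, for illustration only).
rng = random.Random(0)
rows, labels = [], []
for i in range(200):
    label = i % 2
    row = [rng.gauss(3.0 * label, 1.0), rng.gauss(100.0 + 50.0 * label, 20.0)]
    if rng.random() < 0.1:
        row[rng.randrange(2)] = None  # simulate a missing value
    rows.append(row)
    labels.append(label)

# Hold out the last 25% for evaluation; fit preprocessing on the training split only.
split = int(0.75 * len(rows))
means, stds = fit_preprocessor(rows[:split])
X_train = transform(rows[:split], means, stds)
X_test = transform(rows[split:], means, stds)

centroids = fit_centroids(X_train, labels[:split])
predictions = predict(X_test, centroids)
accuracy = sum(p == t for p, t in zip(predictions, labels[split:])) / len(predictions)
print(f"held-out accuracy: {accuracy:.2f}")
```

A simple baseline like this is also useful in your report: it gives you a number that your real model needs to beat.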

Tip

Often when you are writing a report and describing your methodology, you realize there is an error in that methodology and end up having to go back and fix something. Don’t leave the writeup to the very end - it’s a good idea to write as you go along.

Finding a Dataset

First, think of your interests and a possible prediction task, then look to see if there’s a relevant publicly accessible dataset. As you poke around you might do it the other way and stumble across an interesting dataset that inspires a task - that’s okay too!

Some good places to look for datasets:

  • Hugging Face - A popular data- and model-sharing site, and the successor to the now-defunct “Papers with Code” (RIP)
  • Google Dataset Search - of course Google does datasets as well.
  • Wikipedia - a giant list of datasets for machine learning research.
  • Google BigQuery Datasets - we’ll be using this in assignment 3, so I’ll provide guidance on accessing it
  • Kaggle - a website for machine learning competitions with a huge number of datasets. You can filter by file size, topic and more

Submissions

This time, there is no starter code to share, as each of your projects will be different. Please submit your proposal, checkpoints and report as PDFs on D2L, and include a link to your code (e.g. on GitHub or Google Drive) in your final report abstract. The code may be public so you can use it as a portfolio piece, or if you prefer to keep it private, you can invite me to view it on GitHub.

4-Point Scale

Each component will be marked on various criteria using the usual 4-point scale. More details are provided in each component description.

Score | Description
4 | Excellent - thoughtful and creative without any errors or omissions
3 | Pretty good, but with minor errors or omissions
2 | Mostly complete, but with major errors or omissions, lacking in detail
1 | A minimal effort was made, incomplete or incorrect
0 | No effort was made, or the submission is plagiarized