Final Project: Predict something with some data
| | Deliverable | Due date | Weight (of project) |
|---|---|---|---|
| 1 | Proposal | Feb 27; pitch presentation in lab on March 2 | 15% |
| 2 | Checkpoint 1 | March 27 | 20% |
| 3 | Checkpoint 2 | April 13 | 20% |
| 4 | Final Report | April 24; presentations on April 22* | 45% |
*I asked the registrar to schedule a block of time during final exams for you to share the outcomes of your projects, so this shows up on MyMRU as a final exam. You can put finishing touches on your report in the days afterwards.
You may work in teams of 2 or 3.
Overview
The goal of the final project is to showcase the skills you’ve learned in this course by applying them to a new dataset and task. You are free to choose anything you like, but do not pick a “hello-world” style dataset. I’d rather you take a risk and have it not be successful than recreate yet another Titanic survival prediction model.
Overall, throughout this project you will:
- Choose a task, such as classification, regression, time-series forecasting, NLP, etc.
- Find an appropriate dataset for that task. This is harder than it sounds, and may require some iterating when you start working with the data.
- Explore the data and preprocess as necessary to handle missing values, feature scaling, etc.
- Implement a model to accomplish your task.
- Write a formal report describing your project.
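As a rough illustration of the preprocessing step above, here is a minimal sketch of median imputation followed by standardization. It uses only the Python standard library so it runs anywhere; in your project you would more likely use the equivalent tools from a library such as scikit-learn. The toy data and the helper names `fit_preprocessor` and `transform` are made up for this example. The point it tries to show is that the imputation and scaling statistics are learned from the training split only and then reused on the test split.

```python
import math
import random
import statistics


def fit_preprocessor(train_rows):
    """Learn per-column median, mean and standard deviation from training rows only."""
    n_cols = len(train_rows[0])
    params = []
    for j in range(n_cols):
        # Median of the values that are actually present in this column.
        observed = [row[j] for row in train_rows if not math.isnan(row[j])]
        median = statistics.median(observed)
        # Scaling statistics are computed after filling the gaps.
        filled = [median if math.isnan(row[j]) else row[j] for row in train_rows]
        mean = statistics.fmean(filled)
        std = statistics.pstdev(filled) or 1.0  # fall back to 1.0 for constant columns
        params.append((median, mean, std))
    return params


def transform(rows, params):
    """Impute missing values with the training median, then standardize."""
    out = []
    for row in rows:
        new_row = []
        for value, (median, mean, std) in zip(row, params):
            if math.isnan(value):
                value = median
            new_row.append((value - mean) / std)
        out.append(new_row)
    return out


# Hypothetical toy data: two numeric features, with NaN marking a missing value.
rows = [
    [1.0, 200.0],
    [2.0, float("nan")],
    [3.0, 180.0],
    [float("nan"), 220.0],
    [5.0, 210.0],
    [6.0, 190.0],
]

# Split first, so the test rows never influence the learned statistics.
random.seed(0)
random.shuffle(rows)
train_rows, test_rows = rows[:4], rows[4:]

params = fit_preprocessor(train_rows)
train_scaled = transform(train_rows, params)
test_scaled = transform(test_rows, params)
```

Fitting on the full dataset before splitting would leak information from the test rows into the model's inputs, which is one of the methodology errors that tends to surface only when you write it up.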
Tip
Often when you are writing a report and describing your methodology, you realize there is an error in that methodology and end up having to go back and fix something. Don’t leave the writeup to the very end - it’s a good idea to write as you go along.
Finding a Dataset
First, think of your interests and a possible prediction task, then look to see if there’s a relevant publicly accessible dataset. As you poke around you might go the other way around and stumble across an interesting dataset that inspires a task - that’s okay too!
Some good places to look for datasets:
- Hugging Face - A popular data- and model-sharing site, and the successor to the now-defunct “Papers with Code” (RIP)
- Google Dataset Search - of course Google does datasets as well.
- Wikipedia - a giant list of datasets for machine learning research.
- Google BigQuery Datasets - we’ll be using this in assignment 3, so I’ll provide guidance on accessing it
- Kaggle - a website for machine learning competitions with a huge number of datasets. You can filter by file size, topic and more
Submissions
This time, there is no starter code to share, as each of your projects will be different. Please submit your proposal, checkpoints and report as PDFs on D2L, and include a link to your code (e.g. on GitHub or Google Drive) in the abstract of your final report. This code may be public so you can use it as a portfolio piece, or if you prefer to keep it private, you can invite me to view it on GitHub.
4-Point Scale
Each component will be marked on various criteria using the usual 4-point scale. More details are provided in each component description.
| Score | Description |
|---|---|
| 4 | Excellent - thoughtful and creative without any errors or omissions |
| 3 | Pretty good, but with minor errors or omissions |
| 2 | Mostly complete, but with major errors or omissions, lacking in detail |
| 1 | A minimal effort was made, incomplete or incorrect |
| 0 | No effort was made, or the submission is plagiarized |