Syllabus - Sta-395 Intro to ML (Spring 2023)

Course Information

Instructor:

  • Ryan Miller, Noyce 2218, millerry@grinnell.edu

Class Meetings:

  • HSC N3170, Tuesday and Thursday 1:00-2:20pm

Office Hours:

  • Noyce 2218, Monday 1-2pm, Tuesday 2:30-3:30pm, Friday 11am-noon

Class Mentors:

  • Brian Goodell, goodellb@grinnell.edu
  • Evaan Yaser Ahmed, ahmedeva@grinnell.edu

Mentors will primarily provide support for Python-related questions. I encourage you to direct your conceptual/technical machine learning questions towards the instructor.

Mentor sessions dates/times: Sunday 4-5pm (location TBD), another session TBD

Course Description:

Introduction to Machine-Learning using Python. Artificial Intelligence (AI) is gaining attention in both industry and academia. Machine learning is the foundational aspect of AI that learns patterns from data. In this course, students are expected to apply machine learning using Python to problems from a variety of disciplines. The following topics will be covered: CART, Bagging & Boosting, GLM, Model selection, Performance evaluation, Artificial Neural Networks, some Frontier AI topics, and a semester-long capstone project. Prerequisite: MAT-215 and STA-230.

Texts:

There is no required textbook for this course. All required materials will be posted on our course website.

Course content and recommended readings will be drawn from the following sources:

  1. An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani
  2. Introduction to machine learning with Python: a guide for data scientists by Andreas Müller (Author). ISBN-13: 978-1449369415

\(~\)

Aims and Objectives

This course aims to develop conceptual, theoretical, and applied perspectives on commonly used machine learning algorithms.

Learning Objectives

After completing this course, students should be able to:

  • Translate research questions related to machine learning into appropriate models that can be trained and tested.
  • Implement pipelines and training/testing schemes for various machine learning models using Python.
  • Clearly and concisely communicate machine learning results to both technical and non-technical audiences.

\(~\)

Policies

Class Sessions

Most class sessions will be split evenly between “lecture” and “lab”. Sessions will begin with a brief, low-stakes 1-2 question quiz to provide incentives for consistent attendance and staying up to date on content. More details are provided in the “Grading” section of the syllabus.

Portions of the class devoted to “lecture” will focus on conceptual and mathematical details with minimal discussion of coding/software.

Portions of class devoted to “lab” involve working in assigned and/or self-formed groups of 2-3 on Python implementations of course topics.

Attendance

Lab sessions will involve collaboration with others in assigned and self-selected groups. To make this work, attendance on lab days is especially important. That said, I understand that missing class is sometimes necessary. If you will be absent for any reason I ask to be notified as soon as possible. Showing up late or missing class more than twice without prior notice will negatively impact the participation component of your course grade.

Software

Software is an essential component of machine learning, and this class will make extensive use of Python (any version 3 release should work fine). You are welcome to use any Python IDE that you are familiar with, but I encourage you to use JupyterLab to record your work on labs/assignments.

You are welcome to use your own personal laptop, or a classroom computer during the course. If you are working your own laptop, I suggest downloading the most recent Anaconda Distribution to ensure compatibility with the code examples that will be given during class. JupyterLab can be found in the Anaconda Navigator, and Anaconda comes with most of the libraries we’ll be using pre-installed.

Academic Honesty

At Grinnell College you are part of a conversation among scholars, professors, and students, one that helps sustain both the intellectual community here and the larger world of thinkers, researchers, and writers. The tests you take, the research you do, the writing you submit-all these are ways you participate in this conversation.

The College presumes that your work for any course is your own contribution to that scholarly conversation, and it expects you to take responsibility for that contribution. That is, you should strive to present ideas and data fairly and accurately, indicate what is your own work, and acknowledge what you have derived from others. This care permits other members of the community to trace the evolution of ideas and check claims for accuracy.

Failure to live up to this expectation constitutes academic dishonesty. Academic dishonesty is misrepresenting someone else’s intellectual effort as your own. Within the context of a course, it also can include misrepresenting your own work as produced for that class when in fact it was produced for some other purpose. A complete list of dishonest behaviors, as defined by Grinnell College, can be found here.

Inclusive Classroom

Grinnell College makes reasonable accommodations for students with documented disabilities. To receive accommodations, students must provide documentation to the Coordinator for Disability Resources, information can be found here. If you plan on using accommodations in this course, you should speak with me as early as possible in the semester so that we can discuss ways to ensure your full participation in the course.

Religious Holidays

Grinnell College encourages students who plan to observe holy days that coincide with class meetings or assignment due dates to consult with your instructor in the first three weeks of classes so that you may reach a mutual understanding of how you can meet the terms of your religious observance, and the requirements of the course.

\(~\)

Grading

Engagement and Participation - 5%

Participation in a lab-heavy course is absolutely critical. During labs you are expected to help your partner(s) learn the material (which goes beyond simply answering the lab questions), and they are expected to help you. Everyone will begin the semester with a baseline participation score of 80%, which will then move up or down depending on my subjective assessment of your behavior during class. You can very quickly raise this score to 100% by doing a superb job helping your lab partner(s), and working diligently to understand course material during class. Alternatively, you can lower this score by skipping class, letting your lab partner(s) do most of the work, using your phone or surfing the web during class, etc. If you are ever unsure of your participation standing, you can email me and I am happy to provide you an interim assessment.

Labs - 15%

In-class labs will contain a handful of embedded questions that you and your partner(s) should answer to test your understanding of concepts and Python implementations. When working as a group, it is acceptable for one member to submit the entire group’s responses, so long as everyone’s name is included on the work at the time of submission.

Problem Sets - 25%

The semester will be broken up into several units. Units 2-6 will each be accompanied by a problem set containing an assortment of conceptual and applied questions. These problem sets will generally be due on the Friday corresponding to the end of that unit.

In-class Quizzes - 10%

Short in-class quizes will be delivered during the first 5-minutes of most class meetings. These quizzes will contain 1-3 brief questions, often multiple choice, covering concepts from previous lectures. These quizzes are intended to cover essential concepts that any machine learning specialist should have committed memory. Quizzes cannot be retaken or made up, but your lowest three quiz scores will be dropped at the end of the semester. Pending instructor approval and coordination with academic support staff, special exceptions may be made for circumstances involving prolonged absences.

Capstone Project - 45% (Presentation: 10%, Report: 30%, Code: 5%)

For this project you will work in a group of 2-4 students on a self-selected machine learning problem involving a non-trivial data set of your choosing. Your group will be responsible for creating a repository containing the code and data used during the project. You will present your results in a 10-minute in-class presentation intended to mirror a scientific conference presentation (ie: assume a modest amount of machine learning knowledge from the audience, with limited content area knowledge in the domain of your application). You will also prepare a 3-page scientific report summarizing the key methods and results involved in your work.

Additional details on the project will be posted at this link later in the semester.

\(~\)

Misc

Getting Help

In addition to visiting office hours and completing the recommended readings, there are many other ways in which you can find help on assignments and projects.

The Data Science and Social Inquiry Lab (DASIL) is staffed by mentors who are experienced programmers and may be able to troubleshoot coding problems you are having. Many students who’ve successfully completed this course have made extensive use of the DASIL work space and its computing resources.

The online platform Stack Overflow is a useful resource for finding user-generated coding solutions to coding questions. Nearly all professional data scientists have needed to “look up” a coding strategy on a site like Stack Overflow at some point in their career, and I have no problem with you doing the same on assignments or projects. However, if you make substantial use of a Stack Overflow answer (ie: actually integrating lines of code written by someone else into your work, not just getting help identifying the right functions/arguments) the expectation is that you cite or acknowledge doing so.

\(~\)

Units and Topics

Below is a list of course topics and tentative time frames for covering them.

  • Unit #1 - Introduction and Scattered Foundations in Python (1 week)
    • Topics: data structures, data manipulation using Pandas, exploratory data analysis
  • Unit #2 - Concepts in Machine Learning (2 weeks)
    • Topics: classification vs. regression, bias vs. variance trade-off, training vs. testing, performance evaluation, and cross-validation
  • Unit #3 - Generalized Linear Models, Model Optimization, and Regularization (2 weeks)
    • Topics: linear and logistic regression, regularization, gradient descent algorithms
  • Unit #4 - Tree-based Models (2-3 weeks)
    • Topics: decisions trees, ensembles and random forests, boosting
  • Unit #5 - Neural Networks (3-4 weeks)
    • Topics: single and multilayer artificial neural networks, convolutional neural networks, recurrent neural networks
  • Unit #6 - Unsupervised Learning (2 weeks)
    • Topics: k-means clustering, DBSCAN, principal component analysis, t-SNE