Syllabus - Sta-395 Intro to ML (Spring ’24)

Course Information

Instructor:

  • Ryan Miller, Noyce 2218, millerry@grinnell.edu

Class Meetings:

  • Noyce 2402, Tuesday and Thursday 1:00-2:20pm

Office Hours:

  • Drop-in hours (in Noyce 2218): Monday 2:30-3:30pm, Tuesday 2:30-3:30pm, Friday 11:45am-12:45pm
    • Additional availability by appointment

Class Mentors:

  • Brian Goodell, goodellb@grinnell.edu
    • Mentor session dates/times are currently TBD.

Course Description:

Machine learning is a branch of artificial intelligence (AI) rooted in computational statistics that focuses on the development of models and algorithms capable of identifying patterns in data. In this course, students will implement machine learning methods in Python to solve problems from a variety of disciplines. Topics include model validation and optimization, decision trees, boosting and bagging, neural networks, transfer learning, and recently developed methods. Students will complete a semester-long capstone project. Prerequisite: MAT-215 and STA-230.

Texts:

There is no required textbook for this course. All required materials will be posted on our course website.

Course content and recommended readings will be drawn from the following sources:

  1. An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani
  2. Introduction to machine learning with Python: a guide for data scientists by Andreas Müller
  3. Dive into Deep Learning by Zhang, Lipton, Li, and Smola

\(~\)

Aims and Objectives

This course aims to develop conceptual, theoretical, and applied perspectives on commonly used machine learning algorithms.

Learning Objectives

After completing this course, students should be able to:

  • Translate research questions that can be appropriately addressed by machine learning approaches into models that can be trained and tested.
  • Implement data pipelines and proper training/validation schemes in Python for a variety of machine learning methods.
  • Clearly and concisely communicate machine learning methodologies and results in both written and oral formats to technical and non-technical audiences.
  • Organize and execute a machine learning project involving complex, real-world data and modern machine learning methods and implementations.

\(~\)

Policies

Class Sessions

Class time will be split between “lecture” and “lab”. Most class meetings will begin with a short, low-stakes quiz (2-3 conceptual questions) to provide an incentive for consistent attendance and review of course content outside of class. Additional details on these quizzes can be found in the “Grading” section of the syllabus.

Portions of the class devoted to “lecture” will focus on conceptual and mathematical topics with minimal discussion of coding/software.

Portions of class devoted to “lab” involve working in assigned and/or self-formed groups of 2-3 on applications of course topics using Python.

Attendance

Lab sessions will involve collaboration with others in assigned and self-selected groups. This means that attendance during labs is especially important. While I understand that missing class is sometimes necessary, if you will be absent for any reason I ask to be notified as soon as possible so that assigned lab groups can be modified. Showing up late or missing class more than once without prior notice will negatively impact the participation component of your course grade.

Software

Software is an essential component of machine learning, and this class will make extensive use of Python (any version 3 release should be fine). You are welcome to use any Python IDE that you are familiar with, but I encourage you to use JupyterLab to record your work on labs/assignments.

You are welcome to use your own personal laptop, or a classroom computer during the course. If you are working your own laptop, I suggest downloading the most recent Anaconda Distribution to ensure compatibility with the code examples that will be given during class. JupyterLab can be found in the Anaconda Navigator, and Anaconda comes with most of the libraries we’ll be using pre-installed.

You may also use Anaconda Cloud during the early stages of the course if you’d like to use Anaconda but don’t want to download the full version onto your personal computer. That said, a free-tier account has disk space limits that will be problematic when we begin studying deep learning.

Academic Honesty

At Grinnell College you are part of a conversation among scholars, professors, and students, one that helps sustain both the intellectual community here and the larger world of thinkers, researchers, and writers. The tests you take, the research you do, the writing you submit-all these are ways you participate in this conversation.

The College presumes that your work for any course is your own contribution to that scholarly conversation, and it expects you to take responsibility for that contribution. That is, you should strive to present ideas and data fairly and accurately, indicate what is your own work, and acknowledge what you have derived from others. This care permits other members of the community to trace the evolution of ideas and check claims for accuracy.

Failure to live up to this expectation constitutes academic dishonesty. Academic dishonesty is misrepresenting someone else’s intellectual effort as your own. Within the context of a course, it also can include misrepresenting your own work as produced for that class when in fact it was produced for some other purpose. A complete list of dishonest behaviors, as defined by Grinnell College, can be found here.

Inclusive Classroom

Grinnell College makes reasonable accommodations for students with documented disabilities. To receive accommodations, students must provide documentation to the Coordinator for Disability Resources, information can be found here. If you plan on using accommodations in this course, you should speak with me as early as possible in the semester so that we can discuss ways to ensure your full participation in the course.

Religious Holidays

Grinnell College encourages students who plan to observe holy days that coincide with class meetings or assignment due dates to consult with your instructor in the first three weeks of classes so that you may reach a mutual understanding of how you can meet the terms of your religious observance, and the requirements of the course.

\(~\)

Grading

Engagement and Participation - 5%

Participation in a lab-heavy course is absolutely critical. During labs you are expected to help your partner(s) learn the material (which goes beyond simply answering the lab questions), and they are expected to help you. Everyone will begin the semester with a baseline participation score of 80%, which will then move up or down depending on my subjective assessment of your behavior during class. You can very quickly raise this score to 100% by doing a superb job helping your lab partner(s), and working diligently to understand course material during class. Alternatively, you can lower this score by skipping class, letting your lab partner(s) do most of the work, using your phone or surfing the web during class, etc. If you are ever unsure of your participation standing, you can email me and I am happy to provide you an interim assessment.

Labs - 15%

In-class labs will contain a handful of embedded questions that you and your partner(s) should answer to test your understanding of concepts and Python implementations. When working as a group, it is acceptable for one member to submit the entire group’s responses, so long as everyone’s name is included on the work at the time of submission.

Problem Sets - 20%

Course content will be divided into 4-5 units. Each unit will have a comprehensive homework assignment that is due at the end of the unit.

In-class Quizzes - 10%

Short in-class quizes will be delivered during the first 5-minutes of most class meetings. These quizzes will contain 1-3 brief questions, often multiple choice, covering concepts from previous lectures. These quizzes are intended to cover essential concepts that any machine learning specialist should have committed memory. Quizzes cannot be retaken or made up, but your lowest three quiz scores will be dropped at the end of the semester. Pending instructor approval and coordination with academic support staff, special exceptions may be made for circumstances involving prolonged absences.

Midterm Exam - 15%

There will be one in-class exam towards the middle of the semester, with the precise date/time announced no later than 2-weeks in advance of the exam. This exam is intended to assess mathematical and conceptual topics from the course, thus it is complementary to the applied focus of the capstone project (see below). Details and review materials will be provided later in the semester.

Capstone Project - 35% (Presentation: 5%, Report: 25%, Code: 5%)

For this project you will work in a group of 2-4 students on a self-selected machine learning problem involving a non-trivial data set of your choosing. Your group will be responsible for creating a repository containing the code and data used during the project. You will present your results in a 10-minute in-class presentation intended to mirror a scientific conference presentation (ie: assume a modest amount of machine learning knowledge from the audience, with limited content area knowledge in the domain of your application). You will also prepare a 3-page (single-spaced, not including figures) scientific report summarizing your methods and results.

Additional details on the project will be posted at this link later in the semester.

\(~\)

Misc

Getting Help

In addition to visiting office hours and completing the recommended readings, there are many other ways in which you can find help on assignments and projects.

The Data Science and Social Inquiry Lab (DASIL) is staffed by mentors who are experienced programmers and may be able to troubleshoot coding problems you are having. Many students who’ve successfully completed this course have made extensive use of the DASIL work space and its computing resources.

The online platform Stack Overflow is a useful resource for finding user-generated coding solutions to coding questions. Nearly all professional data scientists have needed to “look up” a coding strategy on a site like Stack Overflow at some point in their career, and I have no problem with you doing the same on assignments or projects. However, if you make substantial use of a Stack Overflow answer (ie: actually integrating lines of code written by someone else into your work, not just getting help identifying the right functions/arguments) the expectation is that you cite or acknowledge doing so.

Large language models:

Large language models, such as ChatGPT, Bing Chat, or Google Bard, can be a useful tool for explaining and fixing errors in Python code. You are welcome to use these tools; however, you are ultimately responsible for the accuracy of any code or written work you submit. Relying upon a large language model to write for you is a risky endeavor in the sense that the model might provide inaccurate information or generate text that is superficial and lacking sufficient detail, I encourage you to read Professor Erik Simpson’s write-up on writing with LLMs to see some reasons why you shouldn’t lean too heavily on these technologies. Nevertheless, you’re welcome to use large language models in this course in the same way you’d use a website like Stack Overflow or a peer mentor.

\(~\)

Units and Topics

Below is a list of course topics and tentative time frames for covering them.

  • Unit #1 - Scattered foundations in Python and unsupervised learning
    • Data handling and mathematical operations using numpy and pandas
    • Clustering using \(k\)-means and hierarchical clustering
    • Outlier detection and clustering using DBSCAN
    • Dimension reduction using principal component analysis
  • Unit #2 - Machine learning fundamentals and classic algorithms
    • Performance metrics for classification and regression tasks
    • Preventing data leakage using pipelines for model training and validation
    • Decision trees, support vector machines, \(k\)-nearest neighbors, and generalized linear models
    • Feature engineering
  • Unit #3 - Optimization of structured models
    • Review of select topics in linear algebra and multivariate calculus
    • Cost functions
    • Gradient descent algorithms
  • Unit #4 - Introduction to deep learning
    • Single and multi-layer feed forward neural networks
    • Backpropagation algorithms
    • Convolutional neural networks
    • Recurrent neural networks and long short term memory
    • Transfer learning
  • Unit #5 - Ensembles and feature importance
    • Boosting and bagging
    • Stacked generalizations
    • Gain-based importance and SHAP-based feature importance