Instructor:
Class Meetings:
Office Hours:
Course Mentors:
Course Description:
This course introduces core topics in data science using R programming. This includes introductions to getting and cleaning data, data management, exploratory data analysis, reproducible research, and data visualization. This course incorporates case studies from multiple disciplines and emphasizes the importance of properly communicating statistical ideas. Prerequisite: MAT-209 or STA-209. Suggested CSC-151 or computer programming experience.
Texts:
There is no required textbook for this course. All materials will be posted on the course website.
Materials will be accompanied recommended readings from several published textbooks, all of which are freely available online:
\(~\)
This course aims to develop in students informed critical and theoretical perspectives on data collection, data manipulation and production, and the use algorithmic techniques to process and analyze data.
After completing this course, students should be able to:
R
programming environment to process, manage,
and format unstructured data, create meaningful data visualizations, and
apply supervised or unsupervised learning models\(~\)
Class Sessions
The core component of our class meetings will be working through hands-on labs in a paired programming environment. Pairs will be assigned weekly for the first half of the semester. During the second half you will be free to choose your own partner (or work independently). During labs it is essential that you and your partner(s) work together, making certain that each of you understand your work equally well.
Most labs will begin with a brief “preamble” section that we will go through together as class. The purpose of this section is to introduce the core topic of the lab and provide a smooth start to each class meeting.
Attendance
This course involves substantial collaboration and absences impact not only yourself, but also your classmates. That said, I understand that missing class is sometimes necessary. If you will be absent for any reason I ask to be notified as soon as possible. Showing up late or missing class more than twice without prior notice will negatively impact the participation component of your course grade.
Late Work
Assignments are generally due at 11:59pm on their assigned due-date. All assignments have an automatic 48-hour grace period where they can still be submitted with a 5% penalty applied. After 48-hours, late work may still be accepted on a case-by-case basis subject to an additional penalty. Special exceptions due to individual circumstances are possible, but should be arranged in advance of an assignment’s deadline, and/or coordinated with Grinnell College academic support staff.
Software
Software is an essential component of data science and will play an
important role in this course. We will primarily use R
, an
open-source statistical software program. You will also be expected to
write, document, and submit code used for projects and assignments
throughout the semester.
You are welcome to use your own personal laptop, or a Grinnell
College laptop, during the course. R
is freely available
and you can download it and it’s UI companion, R Studio
,
here (note: R
must be downloaded and installed before
R Studio
):
R
from http://www.r-project.org/R Studio
from http://www.rstudio.com/You may also work on a classroom laptop, all of which will have
R
and R Studio
pre-installed.
Academic Honesty
At Grinnell College you are part of a conversation among scholars, professors, and students, one that helps sustain both the intellectual community here and the larger world of thinkers, researchers, and writers. The tests you take, the research you do, the writing you submit-all these are ways you participate in this conversation.
The College presumes that your work for any course is your own contribution to that scholarly conversation, and it expects you to take responsibility for that contribution. That is, you should strive to present ideas and data fairly and accurately, indicate what is your own work, and acknowledge what you have derived from others. This care permits other members of the community to trace the evolution of ideas and check claims for accuracy.
Failure to live up to this expectation constitutes academic dishonesty. Academic dishonesty is misrepresenting someone’s intellectual effort as your own. Within the context of a course, it also can include misrepresenting your own work as produced for that class when in fact it was produced for some other purpose. Additional information can be found here.
Inclusive Classroom
Grinnell College makes reasonable accommodations for students with documented disabilities. To receive accommodations, students must provide documentation to the Coordinator for Disability Resources, information can be found here. If you plan on using accommodations in this course, you should speak with me as early as possible in the semester so that we can discuss ways to ensure your full participation in the course.
Religious Holidays
Grinnell College encourages students who plan to observe holy days that coincide with class meetings or assignment due dates to consult with your instructor in the first three weeks of classes so that you may reach a mutual understanding of how you can meet the terms of your religious observance, and the requirements of the course.
Academic Support
If you have other needs not addressed in previous sections, please let me know soon so that we can work together for the best possible learning environment. In some cases, I will recommend consulting with the Academic Advising staff. They are an excellent resource for developing strategies for academic success and can connect you with other campus resources as well: http://www.grinnell.edu/about/offices-services/academic-advising. If I notice that you are encountering difficulty, in addition to communicating with you directly about it, I will also likely submit an academic alert via Academic Advising’s SAL portal. This reminds you of my concern, and it notifies the Academic Advising team and your adviser(s) so that they can reach out to you with additional offers of support.
\(~\)
Engagement and Participation - 5%
Participation in a lab-heavy course is absolutely critical. During labs you are expected to help your partner(s) learn the material (which goes beyond simply answering the lab questions), and your partner is expected to help further your understanding. Everyone will begin the semester with a baseline participation score of 80%, which will move up or down depending on my subjective assessment of your behavior during class. You can very quickly raise this score to 100% by doing a superb job helping your lab partner(s), and working diligently to understand course material during class. Alternatively, you can lower this score by skipping class, letting your lab partner(s) do most of the work, using your phone or surfing the web during class, etc. If you are ever unsure of your participation standing, you can email me and I am happy to provide you an interim estimate.
Labs - 20%
In-class labs contain embedded questions that you and your lab partner will answer together. Oftentimes multiple labs will be due at the same time, and you are welcome to upload your answers as a single file. Some lab questions will be scored for accuracy with feedback given, while others may be scored for effort/completion. If it becomes clear that you are your partner are using a “divide and conquer” approach to answering lab questions your score on that assignment will be penalized.
Individual Homework - 20%
There will be 5-7 individual homework assignments throughout the semester. These assignments are naturally cumulative, and intended to force you to combine concepts and methods from multiple labs. You are welcome to get help from mentors, classmates, and Professor Miller on these assignments, but your submissions should be your own work, and you should add the name(s) of anyone you received help from as a footnote on the question that they helped you with. Some homeworks will involve reading and writing questions.
Data Cleaning/Visualization Take-home - 10%
This assessment is intended to individually measure your ability to clean and manipulate complex data to arrive at a specific endpoint (recreating the results and visualizations from an analysis). The task is intended to approximately 4 hours to complete, but you will have 48-hours to work on the assessment once it is assigned.
Midterm Project - 20%
For this project, you will develop an R Shiny
app that
visually explores a data set of your choosing. You may work on this
project individually, or in a group of two. You will
deliver a 5-minute in-class presentation demonstrating the capabilities
of your app and discussing some of the trends in your data that your app
displays. You will be evaluated on the functionality of your app, use of
data, and communication.
Final Project - 25%
This project is a start-to-finish data science application on a non-trivial data set. The final product is a three-page written report accompanied by R code and documentation. You may work on this project individually, or in a group of two or three. This project is intended to mirror the USCLAP Competition guidelines, and I encourage any interested students to prepare their project with a competition submission in-mind.
\(~\)
Why use R for Data Science?
If you’ve spent any time reading about data science online you’ll
undoubtedly have noticed the prominence of the Python programming
language. Indeed, research from Cal State University found Python was
the most popular data science language in private industry, being
mentioned in 42% of data scientist job postings. However,
R
, which was mentioned in 20% of job postings, is not far
behind and offers a few advantages when approaching data science from a
statistical perspective (hence this course having the STA prefix).
Both R
and Python provide plenty of functions for data
manipulation. However, because R was created by academic statisticians,
it offers very strong data visualization and statistical modeling
packages. On the other hand, Python is a general-purpose programming
language that excels in production, deployment, and machine learning.
Regardless of each language’s strengths and weaknesses, as an
introductory course our focus is on the fundamental skills and thought
processes used in data science – which is something that can be
accomplished regardless of the tools used (which will change over time
anyways). Speaking from personal experience, I have used R almost daily
for over a decade and my knowledge of R made it possible to learn Python
very quickly.
Getting Help
In addition to visiting office hours and completing the recommended readings, there are many other ways in which you can find help on assignments and projects.
The Data
Science and Social Inquiry Lab (DASIL) is staffed by mentors who are
experienced in R
programming and may be able to
troubleshoot coding problems you are having. Many students who’ve
successfully completed this course have made extensive use of the DASIL
work space and its computing resources.
The online platform Stack Overflow
is a useful resource to find user-generated coding solutions to common
R
problems. Nearly all professional data scientists have
needed to “look up” a coding strategy on a site like Stack Overflow at
some point in their career, and I have no problem with you doing the
same on assignments or projects. However, if you make substantial use of
a Stack Overflow answer (ie: actually integrating lines of code written
by someone else into your work, not just getting help identifying the
right functions/arguments) the expectation is that you cite or
acknowledge doing so.
\(~\)
R
(data structures, functions,
packages, help documentation, and R Markdown
)ggplot2
and data visualization
principles)tidyr
, wrangling
using dplyr
, merging and joining, string processing and
regular expressions)plotly
, maps with
leaflet
, dashboards with R Shiny
)