Lab #1 - Python Essentials

While Python programming is not the core focus of this course, software is an essential component of any machine learning project.

The goal of this lab is to provide a brief overview of Python concepts, syntax, and semantics that are relevant to machine learning applications. In particular, we'll cover topics related to data handling, preparation, and visualization in this lab.

The lab is written under the assumption that you're using the Anaconda distribution, an open source Python distribution that assists in package and environment management.

Part 1 - Libraries

Most functions we'll use in this course are contained in collections of pre-built code known as libraries. Within a library, related functions are bundled into modules.

If you are using the Anaconda distribution, the code below will import two different libraries, pandas and numpy, that each contain functions for data handling and manipulation.

The code also assigns an alias to each library (ie: pd and np), which reduces the amount of typing needed to reference functions contained in these libraries.

In [1]:
import pandas as pd
import numpy as np

The code below imports the pyplot module from the matplotlib graphics library using the alias plt. As you'll soon see, the . operator has a variety of uses, one of which is to reference modules within a library.

In [2]:
import matplotlib.pyplot as plt

Note: If you are working outside of the Anaconda distribution you may need to install these libraries manually. This guide provides a set of instructions for Windows, and this page provides a few different installation options for Mac.

Part 2 - Basic Data Types

To effectively utilize functions contained in libraries, we'll need to understand a few common data types and how to interact with them. For our purposes, there are 4 basic types of interest (note that technically there are many more types):

  1. Integers (ie: 5)
  2. Floats (ie: 5.0)
  3. Strings (ie: 'five' or '5')
  4. Booleans (ie: True or False)

For some functions, integers and floats can be used interchangeably, while for others integers are expected as representations of distinct categorical outcomes.

If necessary, variable types can be converted using a conversion function:

In [3]:
## x contains the string '4'
x = '4'
type(x)
Out[3]:
str
In [4]:
## Convert x from str to float
new_x = float(x)
print(new_x)
type(new_x)
4.0
Out[4]:
float

Other useful conversion functions are int() and bool().
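For example, int() truncates the decimal portion of a float rather than rounding. A quick sketch:

## int() truncates toward zero rather than rounding
print(int(4.9))     ## prints 4
print(int(-4.9))    ## prints -4
print(int('4'))     ## prints 4, though int('4.9') would raise a ValueError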

Be aware that bool() will convert any number other than zero to True:

In [5]:
## Three different boolean conversions
print(bool(1.0), bool(-2), bool(0.0))
True True False

Storing data

There is a wide variety of ways to store data in Python, but most of what we'll do will focus on three of them:

  1. lists (base Python)
  2. numpy arrays
  3. pandas DataFrames

Lists are the simplest of these structures:

In [6]:
## Simple list
my_list1 = [1,5,3,9]

## Lists within a list
my_list2 = [[1,3,5], [2,4,6]]

In the example above, my_list2 is a list that contains two lists within it.

We can access these lists using indices:

In [7]:
## The first list within `my_list2`
my_list2[0]
Out[7]:
[1, 3, 5]

Note that Python indices begin at 0, so my_list2[0] returns the first item stored in my_list2, which is the list [1,3,5].

Now suppose we want the first 2 elements of my_list1:

In [8]:
my_list1[:2]
Out[8]:
[1, 5]

Recognize that the element in position 2 is not included; thus, :2 returns the elements in positions 0 and 1 (the list [1,5]). Also note that the syntax 0:2 yields the same result as :2.
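Negative indices, which count backward from the end of a list, offer another way to slice. A quick sketch using my_list1:

## Negative indices count backward from the end
print(my_list1[-1])    ## prints 9 (the last element)
print(my_list1[-2:])   ## prints [3, 9] (the last two elements)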

Next, suppose we want to extract the integer 3 from my_list2 (where it is the second element in the first list):

In [9]:
my_list2[0][1]
Out[9]:
3

This code first extracted the item in position 0 of my_list2 (the list [1,3,5]), then it extracted the element in position 1 of the resulting list (the value 3).

Question #1

  • Create a list containing the strings 'one', 'two', 'three', 'four'. Write code that confirms the type of this list, then use indices to print the list's last two elements.

Part 3 - Numpy Arrays

In our previous example, my_list2[0][1] allowed us to access the integer 3; unfortunately, the command my_list2[0,1] will not achieve the same result:

In [10]:
my_list2[0,1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_15364\3441573747.py in <module>
----> 1 my_list2[0,1]

TypeError: list indices must be integers or slices, not tuple

This shortcoming helps motivate our next data structure, numpy arrays:

In [11]:
my_array2 = np.array(my_list2)
my_array2[0,1]
Out[11]:
3

While lists and arrays might seem similar, and some functions can accept either as input, it is important to recognize that arrays belong to the numpy package, while lists exist within base Python. In future labs, I'll frequently use the term "array" without an explicit reference to the numpy library.

numpy arrays can be viewed as a special case of lists with a few additional constraints that convey a number of benefits:

  1. Arrays allow mathematical operations (vectorized addition/multiplication, and matrix algebra)
  2. Arrays allow easier slicing across multiple dimensions
  3. Arrays use substantially less memory to store the same data

Below is an example demonstrating the first advantage (a sketch of the second follows it). Notice what happens when we try to add two lists:

In [12]:
## For lists, + will concatenate
print(my_list1 + my_list1)

## For arrays, + performs elementwise (vectorized) addition
my_array1 = np.array(my_list1)
print(my_array1 + my_array1)
[1, 5, 3, 9, 1, 5, 3, 9]
[ 2 10  6 18]
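The second advantage, easier slicing across multiple dimensions, can be sketched using my_array2:

## Slice every row of the second column (not possible with nested lists)
print(my_array2[:, 1])    ## prints [3 4]

## Slice the entire first row
print(my_array2[0, :])    ## prints [1 3 5]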

numpy arrays are also our first opportunity to learn about two important aspects of Python objects:

  1. Attributes - which can be viewed as the characteristics of an object
  2. Methods - which can be viewed as the operations/actions of an object

Attributes are called using the syntax my_object.attribute_name, while methods are called using the syntax my_object.method_name().

For example, numpy arrays have a shape attribute and a flatten() method:

In [13]:
## The shape attribute describes the array's dimensions
my_array2.shape
Out[13]:
(2, 3)
In [14]:
## The flatten method reshapes an n-dimensional array into a 1-dimensional array
my_array2.flatten()
Out[14]:
array([1, 3, 5, 2, 4, 6])
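A few other commonly used examples are the ndim and dtype attributes and the mean() method. A quick sketch:

## ndim gives the number of dimensions (axes) of the array
print(my_array2.ndim)     ## prints 2

## dtype describes the type of data stored in the array
print(my_array2.dtype)    ## typically int64 (int32 on some Windows builds)

## mean() computes the average of all elements
print(my_array2.mean())   ## prints 3.5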

A complete list of attributes and methods for n-dimensional numpy arrays can be found here.

Question #2

  • Convert the list you created in Question #1 into a numpy array. Then use the reshape() method to organize it such that it contains two rows and two columns. Print the resulting object's shape attribute to verify your manipulations worked as intended. Hint: Use the numpy reference guide linked above to learn more about reshape().

To conclude our introduction to arrays, one last thing to know is that the term axis refers to one of an array's dimensions.

It's common for data to be organized into rows (observations/examples) and columns (variables/features). For such data, 0 is the index of the row axis and 1 is the index of the column axis.

Other types of data, such as colored images, are typically stored in higher dimensional arrays. For example, a single RGB image might be stored in an n by m by 3 array (where n and m are pixel dimensions, and 3 is the number of color channels).
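Many numpy methods accept an axis argument that specifies the dimension along which an operation is applied. A brief sketch using my_array2:

## axis=0 collapses the row axis, producing one sum per column
print(my_array2.sum(axis=0))    ## prints [ 3  7 11]

## axis=1 collapses the column axis, producing one sum per row
print(my_array2.sum(axis=1))    ## prints [ 9 12]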

Part 4 - Pandas DataFrames

Now we're ready to look at some actual data. For this we'll use the read_csv() function in the pandas library to read a .csv file from the web (recall that we gave the alias pd to pandas):

In [15]:
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
ic.shape
Out[15]:
(777, 19)

The dataset stored in ic was scraped from the Johnson County (IA) Assessor, and it contains various characteristics of all homes sold in Iowa City, IA between 2005 and 2007. We can see that these data contain 777 examples and 19 features.

Unlike numpy arrays, pandas DataFrames are intended to operate as 2-dimensional objects with labeled rows and columns (note that 1-dimensional pandas objects are called Series). We can use these labels for selection and subsetting:

In [16]:
## Print column labels attribute
print(ic.columns)
Index(['sale.amount', 'sale.date', 'occupancy', 'style', 'built', 'bedrooms',
       'bsmt', 'ac', 'attic', 'area.base', 'area.add', 'area.bsmt',
       'area.garage1', 'area.garage2', 'area.living', 'area.lot', 'lon', 'lat',
       'assessed'],
      dtype='object')
In [17]:
## Store price as a separate object, using the label `sale.amount` to select that column
ic_price = ic['sale.amount']
print(ic_price.shape)
(777,)
In [18]:
## Drop the sale.amount column (axis 1) from the original DataFrame
ic_no_price = ic.drop('sale.amount', axis=1)
print(ic_no_price.shape)
(777, 18)
In [19]:
## Select only columns with numeric data types
ic_num = ic.select_dtypes(include=['number'])
print(ic_num.shape)
(777, 13)
In [20]:
## Subset rows using a boolean condition
ic_expensive = ic[ic['sale.amount'] > 500000]
print(ic_expensive.shape)
(11, 19)
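Also worth knowing are the loc and iloc indexers, which select by label and by integer position, respectively. A brief sketch:

## Label-based selection: the row labeled 0 and the column 'style'
print(ic.loc[0, 'style'])

## Position-based selection: the first row and first column
print(ic.iloc[0, 0])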

You should make note of this page for a complete list of attributes and methods for pandas DataFrames.

Question #3

  • Part A: Use the select_dtypes() method to select all non-numeric columns in the DataFrame ic, then print the dimensions of the resulting object. Hint: see the reference page for keywords and arguments.
  • Part B: Create a series containing the categorical variable style for homes that sold for more than $300,000. Then use the value_counts() method to find the number of homes sold of each recorded style.

For future reference, here are a few useful DataFrame attributes/methods to be aware of:

In [21]:
## Print the first N rows of a DataFrame
ic.head(3)

## Briefly summarize the DataFrame's column types
ic.info()

## Built-in descriptive summaries of a DataFrame's columns
ic.describe()

## Drop rows with at least one missing value
ic.dropna()

## Number of unique values for each column
ic.nunique()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sale.amount   777 non-null    int64  
 1   sale.date     777 non-null    object 
 2   occupancy     777 non-null    object 
 3   style         777 non-null    object 
 4   built         777 non-null    int64  
 5   bedrooms      777 non-null    int64  
 6   bsmt          777 non-null    object 
 7   ac            777 non-null    object 
 8   attic         777 non-null    object 
 9   area.base     777 non-null    int64  
 10  area.add      777 non-null    int64  
 11  area.bsmt     777 non-null    int64  
 12  area.garage1  777 non-null    int64  
 13  area.garage2  777 non-null    int64  
 14  area.living   777 non-null    int64  
 15  area.lot      777 non-null    int64  
 16  lon           777 non-null    float64
 17  lat           777 non-null    float64
 18  assessed      777 non-null    int64  
dtypes: float64(2), int64(11), object(6)
memory usage: 115.5+ KB
Out[21]:
sale.amount     391
sale.date       605
occupancy         3
style             9
built            95
bedrooms          7
bsmt              5
ac                2
attic             6
area.base       381
area.add        150
area.bsmt       111
area.garage1    127
area.garage2     47
area.living     442
area.lot        448
lon             701
lat             703
assessed        705
dtype: int64

Part 5 - Data Manipulation

Very often we'll need to extensively clean and manipulate our data before it's in a format that can be used by machine learning algorithms. The examples in this section will cover a few common data cleaning/manipulation operations.

  • Grouped Summarization

In this example, the data are grouped by style, then the means of each numeric column are calculated separately for these groupings:

In [22]:
## Group by a column then summarize (means of the numeric columns here)
## Note: recent pandas versions require numeric_only=True to skip non-numeric columns
ic.groupby(by='style').mean(numeric_only=True)
Out[22]:
sale.amount built bedrooms area.base area.add area.bsmt area.garage1 area.garage2 area.living area.lot lon lat assessed
style
1 1/2 Story Frame 186644.000000 1927.320000 3.080000 744.480000 177.040000 123.800000 79.240000 273.280000 1499.160000 8348.440000 -91.525070 41.654289 177443.600000
1 Story Brick 220225.416667 1992.000000 2.333333 1304.000000 17.500000 244.041667 125.583333 33.333333 1321.500000 7325.708333 -91.513483 41.654208 224732.916667
1 Story Condo 121246.600000 2001.800000 2.111111 1041.777778 0.000000 106.000000 108.555556 193.111111 1041.777778 4232.644444 -91.517162 41.653922 118141.777778
1 Story Frame 166179.740634 1969.317003 3.002882 1115.057637 33.997118 412.340058 244.138329 80.020173 1188.129683 9057.158501 -91.522871 41.652600 159532.074928
2 Story Brick 334985.000000 1937.200000 4.100000 923.500000 142.600000 135.000000 20.900000 128.800000 2247.600000 14049.800000 -91.519299 41.651382 305561.000000
2 Story Condo 149766.666667 2003.555556 2.407407 721.740741 21.222222 113.592593 206.296296 0.000000 1537.925926 6579.888889 -91.527477 41.650626 151150.740741
2 Story Frame 215038.451087 1966.141304 3.271739 736.065217 189.891304 284.619565 248.597826 73.641304 1764.902174 10013.983696 -91.523175 41.653444 210999.782609
Split Foyer Frame 160058.333333 1982.023810 3.226190 1085.595238 13.464286 516.738095 78.785714 9.571429 1114.071429 8852.440476 -91.523587 41.652011 151671.547619
Split Level Frame 208351.612903 1984.870968 3.290323 896.548387 266.838710 381.387097 380.193548 0.000000 1614.096774 9119.483871 -91.521810 41.652192 201474.838710
  • Merging/Joining

In this example, a new 2x2 DataFrame, more_data, is created such that sale.date is a key that can be used to link these data to the larger ic DataFrame. A left join is then used to attach information from homes in ic with matching sale dates to the 2 rows in more_data:

In [23]:
## Merge two dataframes
more_data = pd.DataFrame({'sale.date': ['1/3/2005','1/12/2005'],
        'new_variable': ['a','b']})
merged_data = more_data.merge(ic, on='sale.date', how='left')
print(merged_data.shape)
(2, 20)
  • Querying

We've already seen how to subset using boolean conditions, but in many instances query() provides a simpler syntax. Additionally, notice how the use of quotes and backticks allows a column name that contains a period to be referenced:

In [24]:
## Find homes with sale amounts in a given range
ic_midprice = ic.query('200000 < `sale.amount` < 400000')
ic_midprice.shape
Out[24]:
(183, 19)
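query() can also reference ordinary Python variables by prefixing them with @. A quick sketch (the variable name threshold is just an example):

## Reference a local variable inside query() using the @ prefix
threshold = 200000
ic_above = ic.query('`sale.amount` > @threshold')
print(ic_above.shape)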
  • Inserting

This example adds a 20th column to the ic DataFrame containing random values. Notice how ic.shape[0] can be used to obtain the number of rows in a DataFrame, and ic.shape[1] can be used to obtain the number of columns:

In [25]:
## Add a new column to a DataFrame (random values from a standard normal dist in this example)
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
new_var = np.random.randn(ic.shape[0])
ic.insert(ic.shape[1],'my_new_column',new_var)
ic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sale.amount    777 non-null    int64  
 1   sale.date      777 non-null    object 
 2   occupancy      777 non-null    object 
 3   style          777 non-null    object 
 4   built          777 non-null    int64  
 5   bedrooms       777 non-null    int64  
 6   bsmt           777 non-null    object 
 7   ac             777 non-null    object 
 8   attic          777 non-null    object 
 9   area.base      777 non-null    int64  
 10  area.add       777 non-null    int64  
 11  area.bsmt      777 non-null    int64  
 12  area.garage1   777 non-null    int64  
 13  area.garage2   777 non-null    int64  
 14  area.living    777 non-null    int64  
 15  area.lot       777 non-null    int64  
 16  lon            777 non-null    float64
 17  lat            777 non-null    float64
 18  assessed       777 non-null    int64  
 19  my_new_column  777 non-null    float64
dtypes: float64(3), int64(11), object(6)
memory usage: 121.5+ KB
  • Pivoting
In [26]:
## Read data in "long" format
collegeAdm = pd.read_csv("https://remiller1450.github.io/data/college_adm.csv")
collegeAdm
Out[26]:
Adm_Rate Year College
0 28.9 2018 Grinnell
1 24.4 2019 Grinnell
2 23.1 2020 Grinnell
3 21.2 2018 Carlton
4 19.8 2019 Carlton
5 19.1 2020 Carlton
6 33.7 2018 Oberlin
7 36.2 2019 Oberlin
8 36.4 2020 Oberlin
In [27]:
## Pivot to "wide" format
ca_wide = collegeAdm.pivot(index="College", columns="Year", values="Adm_Rate")
ca_wide
Out[27]:
Year 2018 2019 2020
College
Carlton 21.2 19.8 19.1
Grinnell 28.9 24.4 23.1
Oberlin 33.7 36.2 36.4
In [28]:
## Pivot back to "long" format
ca_wide.reset_index(inplace=True)
ca_wide.melt(id_vars=['College'])
Out[28]:
College Year value
0 Carlton 2018 21.2
1 Grinnell 2018 28.9
2 Oberlin 2018 33.7
3 Carlton 2019 19.8
4 Grinnell 2019 24.4
5 Oberlin 2019 36.2
6 Carlton 2020 19.1
7 Grinnell 2020 23.1
8 Oberlin 2020 36.4
  • To pivot "long" data to a "wide" format, the argument index defines what should be the new rows in the wide form, while columns defines the variable whose values should be converted to new columns.
  • To pivot "wide" data to a "long" format, melt() is used. As a preliminary step we convert the row names into an actual column within the DataFrame using reset_index().
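By default, melt() labels the pivoted column using the name attached to ca_wide's columns (Year, in the output above) and labels the measurements value. The var_name and value_name arguments make these labels explicit; a quick sketch:

## Explicitly name the variable and value columns produced by melt()
ca_long = ca_wide.melt(id_vars=['College'], var_name='Year', value_name='Adm_Rate')
print(ca_long.head())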

Question #4

The code given below reads data scraped from the RealClearPolitics website prior to the 2016 US Presidential election. It then uses the Sample column to create two distinct columns: the number of people polled, N, and the type of population polled, Pop (either likely voters, LV, or registered voters, RV).

  • Part A - Reshape these data so that each row represents the polling percentage of a single candidate for a single poll.
  • Part B - Using only polls with at least 1,000 participants, find the mean polling percentage for each candidate. Then, print only these polling averages for your final answer.
In [29]:
## Read polls
polls = pd.read_csv("https://remiller1450.github.io/data/polls2016.csv")

## Split 'Sample' into 'N' and 'Pop' using the space
polls[['N', 'Pop']] = polls['Sample'].str.split(' ', n=1, expand=True)

## Info on the resulting dataset
polls.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Poll         7 non-null      object 
 1   Date         7 non-null      object 
 2   Sample       7 non-null      object 
 3   MoE          6 non-null      float64
 4   Clinton..D.  7 non-null      int64  
 5   Trump..R.    7 non-null      int64  
 6   Johnson..L.  7 non-null      int64  
 7   Stein..G.    7 non-null      int64  
 8   N            7 non-null      object 
 9   Pop          7 non-null      object 
dtypes: float64(1), int64(4), object(5)
memory usage: 688.0+ bytes

Part 6 - Basic Graphics

Data visualization is an important tool in the early stages of a machine learning project. In this context, we can use data visualization to assist us with:

  1. Identifying predictors with unusual distributions
  2. Identifying redundant (collinear) predictors
  3. Assessing whether marginal relationships between a predictor and the outcome tend to be linear, non-linear, etc.

There are many different graphics libraries in Python. I will try to stick to the pyplot module in matplotlib (recall we gave this the alias plt) and the plotting methods that come with pandas DataFrames (occasionally we might also use the seaborn library).

In pyplot, each function call makes some type of change to a figure (ie: creating a plotting area, displaying geometric elements, changing labels, etc.). Aspects of the figure are stored and preserved across different function calls.

The example below demonstrates this by creating a histogram, then adding a title, then changing the label of the x-axis:

In [30]:
## Histogram
plt.hist(ic['sale.amount'])
plt.title('Home Sales in Iowa City, IA (2005-2007)')
plt.xlabel('Sale Price')
plt.show()

In JupyterLab (and most other IDEs) the final command, plt.show(), is not actually needed to display the figure. However, it will suppress any output related to the creation of a figure that is not the figure itself.
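Relatedly, plt.savefig() can be used to write the current figure to a file. It should be called before plt.show(), since in many backends showing the figure clears it. A brief sketch (the file name here is just an example):

## Save the current figure to disk before displaying it
plt.hist(ic['sale.amount'])
plt.xlabel('Sale Price')
plt.savefig('sale_price_hist.png', dpi=150)   ## hypothetical output file
plt.show()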

Templates for a couple of other useful graphics are shown below:

In [31]:
## Scatterplot
plt.scatter(ic['sale.amount'], ic['assessed'])
plt.show()

## Boxplots (these are easier to create using another graphics library, 'seaborn')
import seaborn as sns
sns.boxplot(x=ic['style'], y=ic['sale.amount'])
plt.xticks(rotation=90)
plt.show()

## Line graph (the default of `plot`)
plt.plot(collegeAdm[collegeAdm['College'] == 'Grinnell']['Year'], collegeAdm[collegeAdm['College'] == 'Grinnell']['Adm_Rate'])
plt.plot(collegeAdm[collegeAdm['College'] == 'Carlton']['Year'], collegeAdm[collegeAdm['College'] == 'Carlton']['Adm_Rate'])
plt.show()

In these examples, note that seaborn is built on top of matplotlib, so functions from the two libraries can often be combined (we could use plt.xticks() on our seaborn boxplot).

Displaying Multiple Graphics

For many machine learning applications we'll want to look at several similar visualizations (ie: distributions of different predictors, different possible transformations, performance of different models, etc.). A grid containing several plots can be constructed using plt.subplots():

In [32]:
## 2x2 grid of plots
fig, axs = plt.subplots(2,2)
fig.suptitle('2x2 plot grid')
axs[0,0].hist(ic['sale.amount'])
axs[0,0].set(xlabel='Sale Price')
axs[1,1].hist(ic['assessed'])
plt.show()

The first command might look a bit surprising, but Python allows functions to return multiple objects (plt.subplots returns two). We store these objects as fig and axs.

The object fig controls features of the entire grid, while axs is an array of axes objects that allows you to control each individual subplot (indexed by row and column within the grid).
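Rather than indexing each subplot by hand, you can also loop over the axes. A minimal sketch (the columns chosen here are arbitrary examples from ic):

## Fill a 2x2 grid by looping over the flattened array of axes objects
cols = ['sale.amount', 'assessed', 'area.living', 'built']
fig, axs = plt.subplots(2, 2)
for ax, col in zip(axs.flatten(), cols):
    ax.hist(ic[col])
    ax.set(xlabel=col)
fig.tight_layout()   ## prevent overlapping axis labels
plt.show()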

In general, I'd recommend using subplot grids for the display of final results. For data exploration purposes, pandas DataFrames have a few built-in graphics methods that are much easier than looping through and using subplots:

In [33]:
## Histograms of all numeric variables
ic_num = ic.select_dtypes(include=['number'])
ic_num.hist(figsize=(10, 12), layout=(3,5))
plt.show()
In [34]:
## Scatterplot matrix of all pairings of numeric vars
from pandas.plotting import scatter_matrix
scatter_matrix(ic_num)
plt.show()
In [35]:
## Boxplots (with groupings)
ic[['sale.amount','style']].boxplot(by='style')
plt.xticks(rotation=90)
plt.show()
In [36]:
## Bar chart for a categorical var (note that the data must be converted to counts first)
ic_style_counts = ic['style'].value_counts()
ic_style_counts.plot(kind='bar')
plt.show()

For additional examples and information on pandas graphics, you can use this reference page.

Finally, image data can be handled using the skimage library (scikit-image), which contains functions that will automatically read images into numpy arrays (and display appropriately formatted numpy arrays as images):

In [37]:
from skimage import io
my_img = io.imread("https://www.iowacollegefoundation.org/_assets/image/schools/grinnell-college/grinnell-college.png")
io.imshow(my_img)
io.show()
my_img.shape
Out[37]:
(238, 400, 4)
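Because the image is simply a numpy array, ordinary slicing applies, and matplotlib's imshow() can also display the result. A brief sketch:

## Crop the top-left 100x100 corner of the image and display it
corner = my_img[:100, :100, :]
plt.imshow(corner)
plt.show()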

In future labs we'll occasionally apply machine learning methods to image data. So it's worthwhile knowing how to display an image that is stored as a numpy array.

Question #5

The code below reads a dataset containing 9-month salaries of faculty members at a large university (in 2008-09). You will use these data for this question.

  • Part A - Set up a 2x2 grid of plots. Then, in the first plot of the first row, create a bar chart that displays the average salary for male and female faculty. Hint: use groupby.
  • Part B - In the second plot of the first row, create a scatterplot displaying the relationship between years of service and salary.
  • Part C - In the second row, create one histogram displaying the distribution of years of experience for female faculty and another for male faculty, placing each in its own column of the subplot grid.
In [38]:
## Read data
profs = pd.read_csv("https://remiller1450.github.io/data/Salaries.csv")