Lab 1 - Introduction to Python¶

Learning about Python is not the primary focus of this course, but a basic understanding of the language is necessary to carry out the types of machine learning tasks we will be exploring.

Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic, you should attempt the associated exercises and record your answers in your own Jupyter notebook, which you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test them out).

Part 1 - Libraries¶

We will make extensive use of libraries, or pre-built collections of code, throughout the semester. If you are using the Anaconda distribution you'll have numerous commonly used libraries pre-installed and simply need to load them in order to access their functions.

The code below imports two libraries that we'll use extensively, pandas and numpy. Each library is assigned an alias (i.e., pd and np) so that it can be referenced using fewer characters.

In [1]:
import pandas as pd
import numpy as np

The example below uses the read_csv() function from pandas to load a data set containing various attributes of homes sold in Iowa City, IA between 2005 and 2007. Notice the use of the alias, pd.

In [2]:
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

Some libraries are organized into modules, where each module is a bundle of related functions. Modules within a library are referenced using the . character. We may import all functions in a module using the * character, but best practice is to import only the functions you plan to use.

Some examples involving scikit-learn, a library that is organized into many hierarchical modules, are shown below:

In [3]:
# Import the scikit-learn library
import sklearn 

# Load all functions from the pre-processing module
from sklearn.preprocessing import *

# Load just the 'LinearRegression' function from the linear_model module
from sklearn.linear_model import LinearRegression

Question #1:

  • Part A: Import the pyplot module from the matplotlib library and assign it the alias plt (Hint: you can do this all in one step using the . character)
  • Part B: Import DecisionTreeClassifier, which is contained in the tree module of sklearn.

Part 2 - Basic Data Structures¶

There are 4 native data types in Python that will be relevant to our course:

  1. int - integer values like 5 or -9
  2. float - floating point real numbers like 5.0 or 3.14
  3. str - character strings like '5' or 'five'
  4. bool - boolean logical values, namely True or False

A few things to note:

  • As with any language, passing data of the wrong type can cause functions to throw errors, but, more dangerously, it can also lead to unintended behavior without an explicit error message.
  • We can check a variable's type using the type() function
  • The functions int(), float(), bool(), and str() can be used for coercion

Some examples are shown below:

In [4]:
# Example 1 - Define x as a str type and coerce to float
x = '4'
type(x)     # note: in a notebook only the last expression in a cell is displayed
float(x)

# Example 2 - Check the boolean coercions of a few different numerical values
print(bool(1.0), bool(-2), bool(0.0))
True True False

There are many ways to store collections of data in Python. In the early stages of the course we'll focus on the following:

  1. Lists (native to Python)
  2. Dictionaries (native to Python)
  3. Arrays (from numpy)
  4. DataFrames (from pandas)

Later on we'll also use PyTorch tensors, which are specialized multi-dimensional arrays.

Lists¶

Lists are the most basic data structure we will use. The examples below illustrate a few basic properties of lists:

In [5]:
# Create a simple list of integer values
my_first_list = [3,1,4,5]

# Create a list of lists
my_second_list = [[1,2,3], ['a','b','c']]

# Indexing starts at zero in Python
my_second_list[1]
Out[5]:
['a', 'b', 'c']
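Since nested lists come up in Question #2, here is a quick sketch (not part of the original examples; the list below is made up) reviewing zero-based indexing, negative indices, and slicing:

In [ ]:
# Indexing and slicing sketch (this list is only for illustration)
letters = ['a', 'b', 'c', 'd', 'e']

print(letters[0])      # 'a'  -- indexing starts at zero
print(letters[-1])     # 'e'  -- negative indices count from the end
print(letters[1:3])    # ['b', 'c']  -- slices include the start index but exclude the stop index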

Question #2: Consider the list of lists given below:

Q2_list = [[1.0,2.0,3.0], ['a','b','c'], [True, 1, False]]

  • Part A: Use indices to print the first list stored in Q2_list
  • Part B: Confirm that the type of the second element in the third list stored in Q2_list is int (Hint: in our example my_second_list[1] returns the list ['a', 'b', 'c'], which has its own indices)

Dictionaries¶

Dictionaries store data as key:value pairs. We'll mostly use them to organize data pre-processing steps and parameter combinations when developing models. Two things are worth noting:

  1. Keys are used to access the corresponding values
  2. Unlike our next two data structures, arrays and DataFrames, dictionaries do not require the values associated with different keys to have the same length

Below is an example that uses the keys 'brand', 'year', and 'colors' with the values 'Ford', 1964, and ['red','white','blue'].

In [6]:
# Create the dictionary
my_dict = {'brand': 'Ford',
           'year': 1964,
           'colors': ['red', 'white', 'blue']}

# Access the 'colors' that are stored in this dictionary
my_dict['colors']
Out[6]:
['red', 'white', 'blue']

Arrays¶

Arrays are a data structure provided by the numpy library that share a number of similarities with lists, but have a few key distinctions:

  1. Arrays allow for vectorized mathematical operations
  2. Arrays are more memory efficient than lists
  3. Arrays are easier to reshape and subset (as we'll soon see)

Below is an example of a scenario where lists and arrays behave very differently:

In [7]:
# Two lists used in the example
my_list1 = [1,5,3,9]
my_list2 = [2,5,0,-4]

# The '+' operator will concatenate lists
my_list1 + my_list2
Out[7]:
[1, 5, 3, 9, 2, 5, 0, -4]
In [8]:
# The '+' operator will perform vectorized addition on numpy arrays
my_array1 = np.array(my_list1)
my_array2 = np.array(my_list2)
my_array1 + my_array2
Out[8]:
array([ 3, 10,  3,  5])
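To preview the reshaping and subsetting advantages mentioned above, here is a small sketch (the array values are made up for illustration):

In [ ]:
# Sketch of reshaping and boolean subsetting (values are illustrative)
my_array3 = np.array([1, 5, 3, 9, 2, 7])

# Reshape the 1-d array into 2 rows and 3 columns (rows are filled first)
print(my_array3.reshape(2, 3))

# Keep only the elements greater than 4 using a boolean condition
print(my_array3[my_array3 > 4])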

Arrays also provide us an opportunity to see two important aspects of Python objects:

  1. methods - which are actions an object can perform
  2. attributes - which are characteristics of an object

Below are a few examples involving numpy arrays:

In [9]:
## Create a 2-d array
my_2d_array = np.array([[1,3,5], [2,4,6]]) 

## Use the "shape" attribute to see the dimensions of this array
my_2d_array.shape
Out[9]:
(2, 3)
In [10]:
## Use the "flatten" method to make this 2-d array into a 1-d array
my_2d_array.flatten()
Out[10]:
array([1, 3, 5, 2, 4, 6])

Notice that methods are designed to accept arguments (hence the empty parentheses in the second example) but attributes are fixed characteristics of an object and do not involve any arguments.
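As a quick sketch of a method that does take an argument, the sum() method of the array defined above accepts an axis argument that controls whether values are summed down columns or across rows:

In [ ]:
## Sketch: the 'sum' method accepts an 'axis' argument (using my_2d_array from above)
print(my_2d_array.sum(axis=0))   # sums down each column: [ 3  7 11]
print(my_2d_array.sum(axis=1))   # sums across each row: [ 9 12]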

We will not exhaustively cover all of the attributes and methods of the objects used throughout this class, so you should be prepared to read documentation. Here is the link to the official documentation page for numpy arrays.

Question #3:

  • Part A: Create two arrays, the first containing the elements 1, 2, 3 and the second containing the elements -1, -2, -3. Use the stack() function to create a single 3 by 2 array from these two arrays. After doing this, briefly explain the implications of changing the axis argument from 0 to 1 in the stack() function.
  • Part B: Check the ndim, size, and shape attributes of the new array you created in Part A. Briefly describe the information each of these attributes provides about an array.

DataFrames¶

While numpy arrays are the expected input for many machine learning methods, DataFrames can sometimes be more useful due to their broader set of built-in data manipulation tools.

Unlike numpy arrays, which can have arbitrarily many axes, DataFrames store data along exactly 2 axes, with each row representing an observation/sample and each column representing a variable/feature. The one-dimensional analogue of a DataFrame in pandas is a "Series", but we'll try to avoid doing anything with Series.
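As a small illustrative sketch (using the ic_homes DataFrame loaded earlier with pd.read_csv()), selecting a column with single brackets returns a Series, while double brackets return a one-column DataFrame:

In [ ]:
## Sketch: single vs. double brackets when selecting columns
print(type(ic_homes['sale.amount']))     # pandas Series
print(type(ic_homes[['sale.amount']]))   # pandas DataFrame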

We've already created a pandas DataFrame named ic_homes using the pd.read_csv() command earlier in the lab. The examples below demonstrate some basic capabilities of DataFrames:

In [11]:
## Example 1 - Printing the variable names and types in ic_homes
ic_homes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sale.amount   777 non-null    int64  
 1   sale.date     777 non-null    object 
 2   occupancy     777 non-null    object 
 3   style         777 non-null    object 
 4   built         777 non-null    int64  
 5   bedrooms      777 non-null    int64  
 6   bsmt          777 non-null    object 
 7   ac            777 non-null    object 
 8   attic         777 non-null    object 
 9   area.base     777 non-null    int64  
 10  area.add      777 non-null    int64  
 11  area.bsmt     777 non-null    int64  
 12  area.garage1  777 non-null    int64  
 13  area.garage2  777 non-null    int64  
 14  area.living   777 non-null    int64  
 15  area.lot      777 non-null    int64  
 16  lon           777 non-null    float64
 17  lat           777 non-null    float64
 18  assessed      777 non-null    int64  
dtypes: float64(2), int64(11), object(6)
memory usage: 115.5+ KB
In [12]:
## Example 2 - selecting a single variable and describing it
ic_homes['sale.amount'].describe()
Out[12]:
count       777.000000
mean     180098.329472
std       90655.308636
min       38250.000000
25%      130000.000000
50%      157900.000000
75%      205000.000000
max      815000.000000
Name: sale.amount, dtype: float64
In [13]:
## Example 3 - selecting three variables and printing the first 3 observations
ic_homes[['sale.date','sale.amount', 'assessed']].head(3)
Out[13]:
   sale.date  sale.amount  assessed
0   1/3/2005       172500    173040
1   1/5/2005        90000     89470
2  1/12/2005       168500    164230
In [14]:
## Example 4 - filtering by a logical condition, selecting three variables, and reporting the dimensions of the resulting DataFrame
ic_homes.loc[ic_homes['sale.amount'] > 500000, ['sale.date','sale.amount', 'assessed']].shape
Out[14]:
(11, 3)

Again, this is only a small set of examples. You should take a look at the pandas Cheat Sheet for a brief overview of other commonly used attributes and methods.

Question #4: Order the rows of the ic_homes DataFrame from largest to smallest assessed value. Then, for the five homes with the highest assessed values and three or fewer bedrooms, print their sale amount and sale date.

Part 3 - Data Manipulation using pandas¶

Oftentimes we need to alter the format of a data set or combine it with another data source in order for it to be useful to a machine learning algorithm. This section will cover a few common types of data manipulation that you'll need to use on our first homework assignment; however, this is not an exhaustive data manipulation tutorial and you should be prepared to use pandas documentation (and other resources) to figure out how to execute data manipulation not explicitly covered here.

Aggregation/Grouped Summaries¶

Grouped summarization applies a summary method within each of the groups defined by a categorical variable, which is specified in a prior call to the groupby() method. Shown below is an example that groups homes in the Iowa City Home Sales data set by the variable 'style' and calculates the mean sale amount within each group:

In [15]:
ic_homes.groupby(by='style')['sale.amount'].mean()
Out[15]:
style
1 1/2 Story Frame    186644.000000
1 Story Brick        220225.416667
1 Story Condo        121246.600000
1 Story Frame        166179.740634
2 Story Brick        334985.000000
2 Story Condo        149766.666667
2 Story Frame        215038.451087
Split Foyer Frame    160058.333333
Split Level Frame    208351.612903
Name: sale.amount, dtype: float64
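As a hedged extension of this example (the summary and column choices below are just for illustration), the agg() method can compute several summaries at once, and groupby() also accepts a list of columns when groups should be defined by more than one variable:

In [ ]:
## Sketch: several summaries of sale amount within each style
ic_homes.groupby(by='style')['sale.amount'].agg(['mean', 'median', 'count'])

## Sketch: groups defined by two variables
ic_homes.groupby(by=['style', 'bedrooms'])['sale.amount'].mean()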

Merging and Joining¶

Two data frames can be joined using the merge() method. The example below demonstrates a simple "left join", which adds columns from the second (right) data frame to the first while retaining all rows in the first (left) data frame.

In [16]:
## Create an example dataframe for illustration
more_data = pd.DataFrame({'sale.date': ['1/3/2005','1/12/2005'],
        'new_variable': ['a','b']})

## Left join 'ic_homes' onto this example dataframe according to the 'sale.date' variable
merged_data = more_data.merge(ic_homes, on='sale.date', how='left')

## Print some of the merged data, notice the new columns from 'ic_homes'
merged_data[['sale.date', 'new_variable', 'sale.amount', 'built']]
Out[16]:
   sale.date new_variable  sale.amount  built
0   1/3/2005            a       172500   1993
1  1/12/2005            b       168500   1976
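The how argument determines which rows are retained. As a sketch (the second date below is made up so that it has no match in ic_homes), how='left' keeps the unmatched row with missing values, while how='inner' drops it:

In [ ]:
## Sketch: comparing 'left' and 'inner' joins using a made-up, non-matching date
extra_data = pd.DataFrame({'sale.date': ['1/3/2005', '12/31/2099'],
                           'new_variable': ['a', 'c']})

## 'left' keeps both rows (missing values fill the unmatched one);
## 'inner' keeps only rows whose 'sale.date' appears in both DataFrames
print(extra_data.merge(ic_homes, on='sale.date', how='left').shape)
print(extra_data.merge(ic_homes, on='sale.date', how='inner').shape)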

Note: Question #2 on Homework #1 involves concepts from this section, so you should ask questions about these examples if you have them.

Part 4 - Data Visualizations¶

In this class we will mainly construct graphics for two purposes:

  1. Exploring our data to guide pre-processing steps and perform quality checks
  2. Visualizing model performance or results

For the first of these we can rely upon the built-in data visualization methods of pandas DataFrames and Series.

The most basic of these is the plot() method, where the kind argument is used to determine the type of visualization. Below are some common types:

  • 'line' : line plot (default)
  • 'bar' : vertical bar plot
  • 'barh' : horizontal bar plot
  • 'hist' : histogram
  • 'box' : boxplot
  • 'kde' : Kernel Density Estimation plot
  • 'area' : area plot
  • 'pie' : pie plot
  • 'scatter' : scatter plot (DataFrame only)
  • 'hexbin' : hexbin plot (DataFrame only)

Next is an example showing how to display histograms of sale prices and assessed values in the Iowa City Home Sales data set:

In [17]:
ic_homes[['sale.amount', 'assessed']].plot(kind = 'hist', subplots = True)
Out[17]:
array([<AxesSubplot:ylabel='Frequency'>, <AxesSubplot:ylabel='Frequency'>],
      dtype=object)

The additional argument subplots = True places each column in its own subplot. If this argument is omitted both variables will be graphed in the same plot using shared axes.

Additional details can be found in the plot() method documentation.
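As one more short sketch of the kind argument (not required for the exercises), a scatter plot can be drawn directly from the DataFrame by naming the x and y columns:

In [ ]:
## Sketch: a scatter plot drawn directly from the DataFrame
ic_homes.plot(kind='scatter', x='assessed', y='sale.amount')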

Another useful graphic that we'll use in the future is a scatterplot matrix, which depicts all pairwise relationships between numeric variables. There is a separate pandas function used to create this graphic that is demonstrated below:

In [18]:
## A few imports
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

## Create the scatterplot matrix after selecting only numeric columns
plot = scatter_matrix(ic_homes.select_dtypes(include=['number']))
plt.show()

Line charts created using the matplotlib library are the final type of graphic that we will make extensive use of throughout the semester.

The example below creates a line chart showing three sequences of values for a common set of inputs:

In [19]:
## Setup
## Create some inputs
x = np.linspace(0, 4, num=20) # 20 equally spaced values b/w 0 and 4

## Display 3 different functions of x
plt.plot(x, x)
plt.plot(x, np.power(x,2))
plt.plot(x, np.exp(x))

## Add a legend
plt.legend(['y = x', 'y = x^2', 'y = e^x'], loc='upper left')

## Optional plt.show() - this will prevent superfluous printed output
plt.show()

Question #5:

  • Part A: Create a horizontal barplot displaying the frequencies of homes with each number of bedrooms in the Iowa City Home Sales data set.
  • Part B: Use grouped summarization to find the mean sale price and mean assessed value of homes in Iowa City for each value of built in the data set. Use matplotlib to create a line chart displaying both of these two series in the same figure using different colored lines to represent each.

Part 5 - Functions, Iteration/Looping, and Files¶

Two coding concepts that we will make extensive use of throughout the semester are user-created functions and loops. We've already seen and used functions that others have created. The example below creates a function that returns the sum of squared differences between two vectors, y and yh:

In [20]:
## Define the function
def squared_diff(y, yh):
    diff = y - yh
    ss_diff = np.dot(diff, diff)
    return ss_diff

## Example showing how it is used
squared_diff(y = np.array([1,1,1]), yh = np.array([1,0,2]))
Out[20]:
2

Proper indentation is an essential piece of Python's syntax. By convention, the body of a function is indented by 4 spaces; what Python actually requires is that the indentation within a block be consistent.

Our next important concept is looping, which allows us to repeat a certain set of instructions multiple times. We will encounter two types of loops:

  1. For loops - the body of the loop is repeated once for each element in an iterable object provided in the loop's definition (a short sketch follows this list)
  2. While loops - the body of the loop is repeated indefinitely so long as a certain criterion is True
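Here is the short for loop sketch referenced above (the list is made up); the body runs once per element of the iterable:

In [ ]:
## Minimal for loop: the body runs once for each element of the list
for value in [3, 1, 4]:
    print(value + 10)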

Below is an example of a while loop:

In [21]:
## Example while loop
x = 0
while x <= 3:
    print(x)
    x = x +1
0
1
2
3

We see that the code block inside the loop was executed four times. Prior to what would have been the fifth repetition, the loop's condition was no longer met because x contained the value 4, so the loop terminated.

A more realistic example of how while loops are used in machine learning is to perform a repeated update until a precision-based criterion is met. For example, we might use a while loop to repeatedly apply an algorithm to find optimal values of the parameters in a model within a certain tolerance or level of precision.

As a toy illustration, consider the following loop that shrinks a vector's elements towards zero until a certain squared difference is achieved:

In [22]:
## More realistic use of a while loop
diff = float('inf')
tol = 0.1
yh = np.array([1,1.5,-2])
target = np.array([0,0,0])
while diff > tol:
    diff = squared_diff(yh, target)
    yh = yh/1.5    # shrink values by a factor of 1.5 (roughly 33%)
    print(yh)
[ 0.66666667  1.         -1.33333333]
[ 0.44444444  0.66666667 -0.88888889]
[ 0.2962963   0.44444444 -0.59259259]
[ 0.19753086  0.2962963  -0.39506173]
[ 0.13168724  0.19753086 -0.26337449]
[ 0.0877915   0.13168724 -0.17558299]
[ 0.05852766  0.0877915  -0.11705533]

To motivate our second type of looping, we'll briefly address the topic of files and directories, as some of the data we'll work with throughout the semester is not easily stored in a single file.

To begin, download the following folder containing 50 images of cats and unzip it in an easily accessible directory on your PC. I have placed it in OneDrive - Grinnell College/Documents/cats/; you should modify this path when trying to run my examples.

  • Folder with images of 50 cats: https://remiller1450.github.io/data/cats.zip
In [23]:
## Libraries
import os
import matplotlib.image as mpimg

## Root directory containing the folder (on my PC)
path = 'OneDrive - Grinnell College/Documents/cats/'

## Display the first file
file_list = os.listdir(path) ## This is a list of all files in the directory     
plt.imshow(mpimg.imread(path + file_list[0]))  ## Notice how '+' combines strings
Out[23]:
<matplotlib.image.AxesImage at 0x1e6b3d1d1c0>

Suppose we'd like to display several cats from the folder in a grid. We could manually load each image into a different object, but a better solution is to exploit the iterable nature of list objects using a for loop (the loop below displays the images one at a time; a grid layout is sketched afterwards):

In [24]:
first_9_cats = file_list[0:9]
for file in first_9_cats:
    plt.imshow(mpimg.imread(path + file))
    plt.show()
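To actually arrange the images in a grid, here is a sketch using matplotlib's subplots() (it assumes the path, file_list, and imports defined above):

In [ ]:
## Sketch: arrange the first nine images in a 3 by 3 grid
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for ax, file in zip(axes.flatten(), file_list[0:9]):
    ax.imshow(mpimg.imread(path + file))
    ax.axis('off')    # hide axis ticks for a cleaner layout
plt.show()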

Question #6: For this question you should download the zipped folder located at https://remiller1450.github.io/data/experiment.zip. This folder contains 4 example files that each contain two variables.

  • Part A: Create your own function that finds the corresponding value of the variable VDS.Veh.Speed for a user-specified percentile of the absolute values of the variable SCC.Lane.Deviation.2 for a given CSV contained in the "experiment" folder. Your function should accept the file name/path and percentile as its only inputs, and it should return a single value.
  • Part B: Use a for loop and the function you created in Part A to print each subject's speed at their 90th percentile lane deviation.
  • Part C: Starting with the 99th percentile, use a while loop and the function you created in Part A to find the largest integer percentile for which the corresponding vehicle speed is no more than 1 mile per hour different from 30 miles per hour for the file named run35_treatment.