Learning about Python is not the primary focus of this course, but a basic understanding of the language is necessary to carry out the types of machine learning tasks we will be exploring.
Directions: Please read through the contents of this lab with your partner and try the examples. After you're both confident that you understand a topic you should attempt the associated exercises and record your answers in your own Jupyter notebook that you will submit for credit. The notebook you submit should only contain answers to the lab's exercises (so you should remove any code you ran for the examples, or use a separate notebook to test out the examples).
We will make extensive use of libraries, or pre-built collections of code, throughout the semester. If you are using the Anaconda distribution you'll have numerous commonly used libraries pre-installed and simply need to load them in order to access their functions.
The code below imports two libraries that we'll use extensively, pandas and numpy. Each library is assigned an alias (i.e., pd and np) so that it may be referenced using fewer characters.
import pandas as pd
import numpy as np
The example below uses the read_csv() function from pandas to load a data set containing various attributes of homes sold in Iowa City, IA between 2005 and 2007. Notice the use of the alias, pd.
ic_homes = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
Some libraries are organized into modules, where each module is a bundle of related functions. Modules within a library are referenced using the . character. We may import all functions in a module using the * character, but best practice is to import only the functions you plan on using.
Some examples involving scikit-learn, a library that is organized into many hierarchical modules, are shown below:
# Import the scikit-learn library
import sklearn
# Load all functions from the pre-processing module
from sklearn.preprocessing import *
# Load just the 'LinearRegression' function from the linear_model module
from sklearn.linear_model import LinearRegression
Question #1:
Part A: Import the pyplot module from the matplotlib library and assign it the alias plt (Hint: you can do this all in one step using the . character).
Part B: Import DecisionTreeClassifier, which is contained in the tree module of sklearn.

There are 4 native data types in Python that will be relevant to our course:
int - integer values like 5 or -9
float - floating point real numbers like 5.0 or 3.14
str - character strings like '5' or 'five'
bool - boolean logical values, namely True or False

A few things to note:
The type of an object can be checked using the type() function
The functions int(), float(), bool(), and str() can be used for coercion

Some examples are shown below:
# Example 1 - Define x as a str type and coerce to float
x = '4'
type(x)
float(x)
# Example 2 - Check the boolean coercions of a few different numerical values
print(bool(1.0), bool(-2), bool(0.0))
True True False
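As a further sketch of how these coercion functions behave (the values here are illustrative, not part of the lab's examples):

```python
# Coercion succeeds only when the value makes sense in the target type
print(int('4') + 1)    # the str '4' becomes the int 4, so this prints 5
print(float('3.14'))   # prints 3.14
print(str(5.0))        # prints '5.0'
# int('five') would raise a ValueError, since 'five' is not a numeric string
```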
There are many ways to store collections of data in Python. In the early stages of the course we'll focus on the following:
Lists and dictionaries (native Python data structures)
Arrays (from numpy)
DataFrames (from pandas)

Later on we'll also use PyTorch tensors, which are specialized multi-dimensional arrays.
Lists are the most basic data structure we will use. The examples below illustrate a few basic properties of lists:
# Create a simple list of integer values
my_first_list = [3,1,4,5]
# Create a list of lists
my_second_list = [[1,2,3], ['a','b','c']]
# Indexing starts at zero in Python
my_second_list[1]
['a', 'b', 'c']
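A few other indexing patterns are worth sketching, since the exercises rely on them (the lists here are made up for illustration):

```python
my_list = [3, 1, 4, 5]
print(my_list[-1])    # negative indices count from the end, so this prints 5
print(my_list[1:3])   # slicing returns a sub-list: [1, 4]

nested = [[1, 2, 3], ['a', 'b', 'c']]
print(nested[1][0])   # chained indexing reaches elements of an inner list: 'a'
```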
Question #2: Consider the list of lists given below:
Q2_list = [[1.0,2.0,3.0], ['a','b','c'], [True, 1, False]]
Use indexing to print the element of Q2_list whose type is int (Hint: in our example my_second_list[1] returns the list ['a', 'b', 'c'], which has its own indices).

Dictionaries store data via key:value pairs. We'll mostly use them to organize data pre-processing steps and parameter combinations when developing models.
Below is an example that uses the keys 'brand', 'year', and 'colors' with the values 'Ford', 1964, and ['red', 'white', 'blue'].
# Create the dictionary
my_dict = {'brand': 'Ford',
           'year': 1964,
           'colors': ['red', 'white', 'blue']}
# Access the 'colors' that are stored in this dictionary
my_dict['colors']
['red', 'white', 'blue']
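Dictionaries can also be modified after creation. A small sketch (reusing a dictionary like the one above):

```python
my_dict = {'brand': 'Ford', 'year': 1964}
my_dict['colors'] = ['red', 'white', 'blue']  # assigning to a new key adds it
my_dict['year'] = 1965                        # assigning to an existing key overwrites its value
print(my_dict['year'])                        # prints 1965
```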
Arrays are a data structure provided by the numpy library that share a number of similarities with lists, but have a few key distinctions: every element of an array must be of the same type, and arithmetic operators act elementwise on arrays rather than on the object as a whole.
Below is an example of a scenario where lists and arrays behave very differently:
# Two lists used in the example
my_list1 = [1,5,3,9]
my_list2 = [2,5,0,-4]
# The '+' operator will concatenate lists
my_list1 + my_list2
[1, 5, 3, 9, 2, 5, 0, -4]
# The '+' operator will perform vectorized addition on numpy arrays
my_array1 = np.array(my_list1)
my_array2 = np.array(my_list2)
my_array1 + my_array2
array([ 3, 10, 3, 5])
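The same distinction shows up with other arithmetic operators. For instance, '*' repeats a list but multiplies an array elementwise (a small sketch):

```python
import numpy as np

my_list = [1, 5, 3, 9]
print(my_list * 2)            # list repetition: [1, 5, 3, 9, 1, 5, 3, 9]
print(np.array(my_list) * 2)  # elementwise multiplication: [ 2 10  6 18]
```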
Arrays also provide us an opportunity to see two important aspects of Python objects: attributes and methods.
Below are a few examples involving numpy
arrays:
## Create a 2-d array
my_2d_array = np.array([[1,3,5], [2,4,6]])
## Use the "shape" attribute to see the dimensions of this array
my_2d_array.shape
(2, 3)
## Use the "flatten" method to make this 2-d array into a 1-d array
my_2d_array.flatten()
array([1, 3, 5, 2, 4, 6])
Notice that methods are designed to accept arguments (hence the empty parentheses in the second example) but attributes are fixed characteristics of an object and do not involve any arguments.
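As another sketch of this distinction (using the same array as above): dtype is an attribute accessed without parentheses, while reshape() is a method called with arguments:

```python
import numpy as np

my_2d_array = np.array([[1, 3, 5], [2, 4, 6]])
print(my_2d_array.dtype)          # attribute: the element type, no parentheses
print(my_2d_array.reshape(3, 2))  # method: called with arguments, returns a 3-by-2 array
```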
We will not exhaustively cover all of the attributes and methods of the objects used throughout this class, so you should be prepared to read documentation. Here is the link to the official documentation page for numpy
arrays.
Question #3:
Part A: Use the stack() function to create a single 3 by 2 array from these two arrays. After doing this, briefly explain the implications of changing the axis argument from 0 to 1 in the stack() function.
Part B: Print the ndim, size, and shape attributes of the new array you created in Part A. Briefly describe the information each of these attributes provides about an array.

While numpy arrays are the expected input for many machine learning methods, DataFrames can sometimes be more useful due to their broader set of built-in data manipulation tools.
Unlike numpy
arrays, which can have arbitrarily many axes, DataFrames store data along exactly 2 axes, with each row representing an observation/sample and each column representing a variable/feature. The analogue to a 1-dimensional DataFrame in pandas
is a "Series", but we'll try to avoid doing anything with series.
We've already created a pandas
DataFrame named ic_homes
using the pd.read_csv()
command earlier in the lab. The examples below demonstrate some basic capabilities of DataFrames:
## Example 1 - Printing the variable names and types in ic_homes
ic_homes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sale.amount   777 non-null    int64
 1   sale.date     777 non-null    object
 2   occupancy     777 non-null    object
 3   style         777 non-null    object
 4   built         777 non-null    int64
 5   bedrooms      777 non-null    int64
 6   bsmt          777 non-null    object
 7   ac            777 non-null    object
 8   attic         777 non-null    object
 9   area.base     777 non-null    int64
 10  area.add      777 non-null    int64
 11  area.bsmt     777 non-null    int64
 12  area.garage1  777 non-null    int64
 13  area.garage2  777 non-null    int64
 14  area.living   777 non-null    int64
 15  area.lot      777 non-null    int64
 16  lon           777 non-null    float64
 17  lat           777 non-null    float64
 18  assessed      777 non-null    int64
dtypes: float64(2), int64(11), object(6)
memory usage: 115.5+ KB
## Example 2 - selecting a single variable and describing it
ic_homes['sale.amount'].describe()
count       777.000000
mean     180098.329472
std       90655.308636
min       38250.000000
25%      130000.000000
50%      157900.000000
75%      205000.000000
max      815000.000000
Name: sale.amount, dtype: float64
## Example 3 - selecting three variables and printing the first 3 observations
ic_homes[['sale.date','sale.amount', 'assessed']].head(3)
   sale.date  sale.amount  assessed
0   1/3/2005       172500    173040
1   1/5/2005        90000     89470
2  1/12/2005       168500    164230
## Example 4 - filtering by a logical condition, selecting three variables, and reporting the dimensions of the resulting DataFrame
ic_homes.loc[ic_homes['sale.amount'] > 500000, ['sale.date','sale.amount', 'assessed']].shape
(11, 3)
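DataFrames also provide a sort_values() method for ordering rows by a column. A minimal sketch using a made-up DataFrame (the column names here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'price': [200, 150, 300], 'beds': [3, 2, 4]})
# ascending=False orders the rows from largest to smallest 'price'
print(toy.sort_values(by='price', ascending=False))
```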
Again, this is a small set of examples. You should take a look at the pandas Cheat Sheet for a brief overview of other commonly used attributes and methods.
Question #4: Order the rows of the ic_homes
DataFrame from largest to smallest assessed value. Then, for the five homes with the highest assessed values and three or fewer bedrooms, print their sale amount and sale date.
Data Manipulation with pandas

Oftentimes we need to alter the format of a data set or combine it with another data source in order for it to be useful to a machine learning algorithm. This section will cover a few common types of data manipulation that you'll need to use on our first homework assignment; however, this is not an exhaustive data manipulation tutorial and you should be prepared to use the pandas documentation (and other resources) to figure out how to execute data manipulation not explicitly covered here.
Grouped summarization applies a method across the groups defined by a categorical variable specified in a prior use of the groupby()
method. Shown below is an example that groups homes in the Iowa City Homes data set by the variable 'style'
and calculates the mean sale amount within each group:
ic_homes.groupby(by='style')['sale.amount'].mean()
style
1 1/2 Story Frame    186644.000000
1 Story Brick        220225.416667
1 Story Condo        121246.600000
1 Story Frame        166179.740634
2 Story Brick        334985.000000
2 Story Condo        149766.666667
2 Story Frame        215038.451087
Split Foyer Frame    160058.333333
Split Level Frame    208351.612903
Name: sale.amount, dtype: float64
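The same groupby() pattern can produce several summaries at once via the agg() method. A sketch with made-up data (the column names here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'style': ['A', 'A', 'B'], 'price': [100, 200, 300]})
# agg() applies each listed summary function to every group
print(toy.groupby(by='style')['price'].agg(['mean', 'count']))
```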
Two data frames can be joined using the merge()
method. The example below demonstrates a simple "left join", which adds columns from the second (right) data frame to the first while retaining all rows in the first (left) data frame.
## Create an example dataframe for illustration
more_data = pd.DataFrame({'sale.date': ['1/3/2005','1/12/2005'],
                          'new_variable': ['a','b']})
## Left join 'ic_homes' onto this example dataframe according to the 'sale.date' variable
merged_data = more_data.merge(ic_homes, on='sale.date', how='left')
## Print some of the merged data, notice the new columns from 'ic_homes'
merged_data[['sale.date', 'new_variable', 'sale.amount', 'built']]
   sale.date new_variable  sale.amount  built
0   1/3/2005            a       172500   1993
1  1/12/2005            b       168500   1976
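The how argument controls which rows survive a merge. A sketch contrasting 'left' and 'inner' joins on two made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})
print(left.merge(right, on='key', how='left'))   # keeps both rows of 'left'; y is NaN for key 'a'
print(left.merge(right, on='key', how='inner'))  # keeps only the matching key 'b'
```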
Note: Question #2 on Homework #1 involves concepts from this section, so you should ask questions about these examples if you have them.
In this class we will mainly construct graphics for two purposes: exploring data sets and displaying the results of machine learning methods. For the first of these we can rely upon the built-in data visualization methods of pandas DataFrames and Series.
The most basic of these is the plot() method, where the kind argument is used to determine the type of visualization. Common values of kind include 'line', 'bar', 'hist', 'box', and 'scatter'.
Next is an example showing how to display histograms of sale prices and assessed values in the Iowa City Home Sales data set:
ic_homes[['sale.amount', 'assessed']].plot(kind = 'hist', subplots = True)
array([<AxesSubplot:ylabel='Frequency'>, <AxesSubplot:ylabel='Frequency'>], dtype=object)
The additional argument subplots = True
places each column in its own subplot. If this argument is omitted both variables will be graphed in the same plot using shared axes.
Additional details can be found in the plot() method documentation.
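As a sketch of another kind value, the example below draws a scatterplot from a made-up DataFrame (the Agg backend is assumed here so the snippet runs outside a notebook):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for scripted runs
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
ax = toy.plot(kind='scatter', x='x', y='y')  # scatterplots require x and y columns
```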
Another useful graphic that we'll use in the future is a scatterplot matrix, which depicts all pairwise relationships between numeric variables. There is a separate pandas
function used to create this graphic that is demonstrated below:
## A few imports
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
## Create the scatterplot matrix after selecting only numeric columns
plot = scatter_matrix(ic_homes.select_dtypes(include=['number']))
plt.show()
Line charts created using the matplotlib
library are the final type of graphic that we will make extensive use of throughout the semester.
The example below creates a line chart showing three sequences of values for a common set of inputs:
## Setup
## Create some inputs
x = np.linspace(0, 4, num=20) # 20 equally spaced values b/w 0 and 4
## Display 3 different functions of x
plt.plot(x, x)
plt.plot(x, np.power(x,2))
plt.plot(x, np.exp(x))
## Add a legend
plt.legend(['y = x', 'y = x^2', 'y = e^x'], loc='upper left')
## Optional plt.show() - this will prevent superfluous printed output
plt.show()
Question #5: Create two series of values summarized by the variable built in the data set. Then use matplotlib to create a line chart displaying both series in the same figure, using a different colored line to represent each.

Two coding concepts that we will make extensive use of throughout the semester are user-created functions and loops. We've already seen and used functions that others have created. The example below creates a function that returns the sum of squared differences between two vectors, y and yh:
## Define the function
def squared_diff(y, yh):
    diff = y - yh
    ss_diff = np.dot(diff, diff)
    return ss_diff
## Example showing how it is used
squared_diff(y = np.array([1,1,1]), yh = np.array([1,0,2]))
2
Proper indentation is an essential piece of Python's syntax, and the body of a function should be indented by exactly 4 spaces.
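Function arguments can also be given default values, which callers may omit. A sketch using a hypothetical helper (mean_abs_diff is not part of the lab):

```python
import numpy as np

# 'scale' has a default value, so callers can omit it (hypothetical helper)
def mean_abs_diff(y, yh, scale=1.0):
    diff = np.abs(y - yh)     # elementwise absolute differences
    return scale * np.mean(diff)

print(mean_abs_diff(np.array([1, 2]), np.array([0, 0])))             # prints 1.5
print(mean_abs_diff(np.array([1, 2]), np.array([0, 0]), scale=2.0))  # prints 3.0
```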
Our next important concept is looping, which allows us to repeat a certain set of instructions multiple times. We will encounter two types of loops: while loops, which repeat a code block for as long as a given condition evaluates to True, and for loops, which repeat a code block once for each element of an iterable object.

Below is an example of a while loop:
## Example while loop
x = 0
while x <= 3:
    print(x)
    x = x + 1
0
1
2
3
We see that the code block inside the loop was executed four times. Prior to what would have been the fifth repetition, the loop's condition was no longer met because x contained the value 4, so the loop terminated.
A more realistic example of how while loops are used in machine learning is to perform a repeated update until a precision-based criterion is met. For example, we might use a while loop to repeatedly apply an algorithm to find optimal values of the parameters in a model within a certain tolerance or level of precision.
As a toy illustration, consider the following loop that shrinks a vector's elements towards zero until a certain squared difference is achieved:
## More realistic use of a while loop
diff = float('inf')
tol = 0.1
yh = np.array([1,1.5,-2])
target = np.array([0,0,0])
while diff > tol:
    diff = squared_diff(yh, target)
    yh = yh/1.5  # shrink each value to two-thirds of its previous size
    print(yh)
[ 0.66666667  1.         -1.33333333]
[ 0.44444444  0.66666667 -0.88888889]
[ 0.2962963   0.44444444 -0.59259259]
[ 0.19753086  0.2962963  -0.39506173]
[ 0.13168724  0.19753086 -0.26337449]
[ 0.0877915   0.13168724 -0.17558299]
[ 0.05852766  0.0877915  -0.11705533]
To motivate our second type of looping, we'll briefly address the topic of files and directories, as some of the data we'll work with throughout the semester is not easily stored in a single file.
To begin, download the following folder containing 50 images of cats and unzip it in an easily accessible directory on your PC. I have placed it in OneDrive - Grinnell College/Documents/cats/, a path you should modify when trying to run my examples.
## Libraries
import os
import matplotlib.image as mpimg
## Root directory containing the folder (on my PC)
path = 'OneDrive - Grinnell College/Documents/cats/'
## Display the first file
file_list = os.listdir(path) ## This is a list of all files in the directory
plt.imshow(mpimg.imread(path + file_list[0])) ## Notice how '+' combines strings
<matplotlib.image.AxesImage at 0x1e6b3d1d1c0>
Suppose we'd like to display several cats from the folder in a grid. We could manually load each image into a different object, but a better solution is to exploit the iterable nature of list objects using a for loop:
first_9_cats = file_list[0:9]
for file in first_9_cats:
    plt.imshow(mpimg.imread(path + file))
    plt.show()
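The loop above displays the images one at a time. One way to arrange panels in a single grid is plt.subplots(); a sketch is shown below (random arrays stand in for the images so the snippet is self-contained):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed for scripted runs
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(3, 3)        # a 3-by-3 grid of axes
for ax in axes.flatten():             # axes is a 2-d array, so flatten it to loop
    ax.imshow(np.random.rand(8, 8))   # replace with mpimg.imread(path + file) for real images
    ax.axis('off')                    # hide tick marks on each panel
```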
Question #6: For this question you should download the zipped folder located at: https://remiller1450.github.io/data/experiment.zip This folder contains 4 example files that each contain two variables.
Part A: Write a function that returns the value of VDS.Veh.Speed for a user-specified percentile of the absolute values of the variable SCC.Lane.Deviation.2 for a given CSV contained in the "experiment" folder. Your function should accept the file name/path and percentile as its only inputs, and it should return a single value.
Part B: Demonstrate your function using the file run35_treatment.