Lab 1 - Introduction to Python and $k$-means clustering¶

While learning Python is not the primary focus of this course, a basic understanding of the language and its tools is necessary to succeed in a high-level machine learning project.

The first portion of this lab introduces a few basic concepts related to data handling in Python, and the second portion demonstrates the application of these concepts with $k$-means clustering. The lab is written under the assumption that you're using the Anaconda distribution, an open source Python distribution that assists in package and environment management.

Directions: Please record your responses to all embedded questions in a Jupyter Notebook that will be submitted via P-web. This notebook should contain only the code necessary to answer the lab's questions and should not include code from the given examples that isn't needed to answer these questions. I encourage you to try out the examples in a separate notebook.

Part 1 - Libraries¶

We will make extensive use of pre-built collections of code known as libraries. If you are using the Anaconda distribution, the code below will import two different libraries, pandas and numpy, each of which contains tools for data handling and manipulation. This code also assigns an alias to each library (i.e., pd and np), which reduces the amount of typing needed to reference functions and modules within these libraries.

In [1]:
import pandas as pd
import numpy as np

Python libraries are typically organized hierarchically into modules, which are pre-built bundles of related functions.

The code below uses the . operator to access the pyplot module within the matplotlib library and load it under the alias plt.

In [2]:
import matplotlib.pyplot as plt

Note: if you are working on your own Python installation (rather than the Anaconda distribution) you might need to download these libraries yourself. This guide provides a set of instructions for Windows, and this page provides a few different installation options for Mac.

Part 2 - Data Basics¶

Data Types¶

Python contains several built-in data types, four of which are particularly relevant to us:

  1. int - integer values like 5 or 9
  2. float - floating point real number values like 5.0 or 1.3145
  3. str - character strings like '5' or 'five'
  4. bool - boolean logical values like True or False

As with any language, passing the wrong data type into a function can lead to errors. Thus, the first thing we should know how to do is check the type of a variable and coerce it to another type if necessary.

Demonstrated below are a few examples:

In [3]:
x = '4'  # Define x as '4'
type(x)  # Notice the type is str
Out[3]:
str
In [4]:
float(x)  # Coerce to float
Out[4]:
4.0

int(), bool(), and str() are three additional coercion functions to be aware of. In particular, you might want to note how bool() handles different numerical values:

In [5]:
## Three different boolean conversions
print(bool(1.0), bool(-2), bool(0.0))
True True False

Question 1:

  • Part A: Coerce y (defined below) into the integer value 1, storing the result as yn. Confirm that yn is type int using an appropriate function. Hint: first coerce y to float, then coerce this result to int.
In [6]:
y = '1.0'
  • Part B: Briefly explain why the block of code given below prints int rather than float.
In [7]:
z = 5
float(z)
type(z)
Out[7]:
int

Data objects¶

There are many ways to store data in Python. For the moment we'll concern ourselves with 4 different data objects:

  1. lists (base Python)
  2. dictionaries (base Python)
  3. arrays (numpy)
  4. DataFrames (pandas)

Lists¶

Lists are the simplest of these. The code given below creates two different list objects:

In [8]:
## A simple list
my_list1 = [1,5,3,9]

## Two lists within a list
my_list2 = [[1,3,5], [2,4,6]]

We can access elements of a list using their index positions. Python starts indexing at 0:

In [9]:
## The first list in `my_list2`
my_list2[0]
Out[9]:
[1, 3, 5]
In [10]:
## Elements in positions 0 through 2 (first 3 positions) in my_list1
my_list1[:3]
Out[10]:
[1, 5, 3]

At first it might seem confusing that :3 does not include the element in index position 3, but this convention pairs nicely with the output of functions like len():

In [11]:
## All elements in my_list1
my_list1[:len(my_list1)]
Out[11]:
[1, 5, 3, 9]

Finally, you should be comfortable with more complex uses of indexing, such as:

In [12]:
## Element in position 1 of the first list in my_list2
my_list2[0][1]
Out[12]:
3

Question 2: Create a list named my_numbers consisting of three lists, the first being the character strings 'one', 'two', and 'three', the second being the integers 1, 2, and 3, and the third being the floats 1.0, 2.0, and 3.0. Then, starting with my_numbers, use indices to access and print the last element in the list of integers. Additionally, you should note that we'll generally try to avoid the use of magic numbers, but they're perfectly fine for this exercise.

Dictionaries¶

Dictionaries are used to store data consisting of key:value pairs.

The following is an example dictionary that uses the keys "brand", "year", and "colors", and the values "Ford", 1964, and ["red", "white", "blue"]:

In [13]:
my_dict = {
  "brand": "Ford",
  "year": 1964,
  "colors": ["red", "white", "blue"]
}

We'll primarily use dictionaries to organize preprocessing steps and tuning parameter combinations that we'd like to evaluate in supervised learning tasks. However, it is worth knowing that you can use the name of a key to access the corresponding values, as some models will use dictionaries to store information:

In [14]:
my_dict['colors']
Out[14]:
['red', 'white', 'blue']
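
For example, later in the course we might use a dictionary to organize a set of tuning parameter values to search over. A minimal sketch of this idea is shown below (the parameter names here are purely illustrative, not tied to any particular model):

## Hypothetical grid of tuning parameter values (illustrative names only)
param_grid = {
    'n_clusters': [2, 3, 4, 5],   # candidate numbers of clusters
    'max_iter': [100, 300]        # candidate iteration limits
}

## Values are accessed by key, just like any other dictionary
print(param_grid['n_clusters'])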

Arrays (numpy)¶

We previously created my_list2 and saw that the command my_list2[0][1] could be used to access the element in position 1 of the first list. However, it would be nice if the command my_list2[0,1] could access this element, which helps motivate numpy arrays.

In [15]:
my_list2 = [[1,3,5], [2,4,6]]   # Same list as before
my_array2 = np.array(my_list2)  # Convert to a numpy array, notice our use of the alias np
my_array2[0,1]                  # Now this works!
Out[15]:
3

While many features of numpy arrays might seem similar to lists, and many functions can use them interchangeably, arrays offer a few benefits:

  1. Arrays allow vectorized mathematical operations
  2. Arrays allow easier reshaping and subsetting
  3. Arrays use significantly less memory (a comparison is sketched after the example below)

Below is an example of how vectorized mathematical operations can be used with arrays, but not with lists:

In [16]:
## For lists, + will concatenate
my_list1 = [1,5,3,9]
print(my_list1 + my_list1)

## For arrays, + works as intended (vectorized addition)
my_array1 = np.array(my_list1)
print(my_array1 + my_array1)
[1, 5, 3, 9, 1, 5, 3, 9]
[ 2 10  6 18]
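
To illustrate the third benefit, the sketch below compares the memory used by a list of 1000 integers against an equivalent numpy array (exact byte counts vary by platform, so treat these numbers as rough):

import sys

big_list = list(range(1000))
big_array = np.arange(1000)

## The list stores references plus a separate int object for each element
print(sys.getsizeof(big_list) + sum(sys.getsizeof(v) for v in big_list))

## The array stores raw values in one contiguous buffer
print(big_array.nbytes)   # e.g., 8000 bytes if the dtype is 64-bit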

Methods and attributes¶

Arrays also provide us an opportunity to learn about two important aspects of Python objects:

  1. methods - actions an object can perform, referenced using the general syntax: object_name.method_name(method_arguments)
  2. attributes - characteristics of an object, referenced using the general syntax: object_name.attribute_name

The code below demonstrates these on a 2-d array:

In [17]:
## Create the 2-d array
my_array = np.array([[1,3,5], [2,4,6]]) 

my_array.shape   # arrays have a "shape" attribute
Out[17]:
(2, 3)
In [18]:
my_array.flatten()  # arrays have a "flatten" method
Out[18]:
array([1, 3, 5, 2, 4, 6])

I encourage you to consult this documentation page for information on some of the available attributes and methods of numpy arrays.
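
A few other commonly used ones are demonstrated below (all standard numpy attributes and methods):

print(my_array.ndim)    # attribute: number of axes (2)
print(my_array.dtype)   # attribute: data type of the elements (e.g., int64)
print(my_array.sum())   # method: sum of all elements (21)
print(my_array.T)       # attribute: the transpose, with shape (3, 2)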

Question 3: Use the np.linspace() function to create an array of 10 linearly spaced values between 0 and 90. Then use the reshape() method to reshape this array to have the new shape (2,5).

DataFrames (pandas)¶

We'll later see that numpy arrays have the advantage of being able to coherently store data along more than 2 axes. However, when we have conventionally formatted data (each example/observation as a row and each feature as a column) our preferred method of storage will be the pandas DataFrame due to the wider assortment of data manipulation options they offer.

The code below uses the read_csv() function from the pandas library to load a data set from the web. You should recall that we gave the alias pd to pandas earlier.

In [19]:
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")

These data were scraped from the Johnson County (IA) Assessor between 2005 and 2007 and contain various characteristics of all homes sold in Iowa City, IA during that period.

Unlike numpy arrays, pandas DataFrames contain only 2 axes and named columns (note that a single-axis pandas object is called a Series). This consistent arrangement of data allows for a much wider variety of attributes and methods. A few examples are given below:

In [20]:
## Built-in descriptive summaries of the DataFrame's columns
ic.describe() 
Out[20]:
sale.amount built bedrooms area.base area.add area.bsmt area.garage1 area.garage2 area.living area.lot lon lat assessed
count 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000 777.000000
mean 180098.329472 1972.566281 3.025740 986.942085 81.061776 345.971686 211.738739 76.873874 1365.144144 8886.574003 -91.522543 41.652816 174324.916345
std 90655.308636 31.252192 0.990604 353.324287 182.587922 386.380310 252.103302 162.755911 527.894470 8474.390873 0.033708 0.011389 85370.750928
min 38250.000000 1873.000000 1.000000 240.000000 0.000000 0.000000 0.000000 0.000000 312.000000 137.000000 -91.605747 41.628040 38590.000000
25% 130000.000000 1958.000000 2.000000 726.000000 0.000000 0.000000 0.000000 0.000000 1017.000000 5398.000000 -91.549255 41.645962 126030.000000
50% 157900.000000 1981.000000 3.000000 924.000000 0.000000 250.000000 0.000000 0.000000 1240.000000 7500.000000 -91.515802 41.652618 153930.000000
75% 205000.000000 1998.000000 4.000000 1184.000000 36.000000 600.000000 440.000000 0.000000 1560.000000 9960.000000 -91.496104 41.659200 198200.000000
max 815000.000000 2007.000000 7.000000 3440.000000 2261.000000 2500.000000 1065.000000 856.000000 4988.000000 158123.000000 -91.463069 41.690921 778000.000000
In [21]:
# Select just the 'sale.amount' column and print the first 5 values
ic['sale.amount'].head(n = 5)  
Out[21]:
0    172500
1     90000
2    168500
3    205000
4    121000
Name: sale.amount, dtype: int64
In [22]:
## Select only columns with numeric data types and print the resulting shape
ic.select_dtypes(include=['number']).shape
Out[22]:
(777, 13)
In [23]:
## Subset rows using a boolean condition and print the resulting shape.
ic[ic['sale.amount'] > 500000].shape
Out[23]:
(11, 19)

Again, these examples are only a few of the many methods and attributes of pandas DataFrames. I encourage you to browse this page to see a complete list.
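
As two more quick examples you might find useful in your own exploration (both are standard pandas methods; run each in its own cell to see its output):

## Sort by sale amount (descending) and show the 3 most expensive homes
ic.sort_values('sale.amount', ascending=False).head(3)

## Count the number of homes of each style
ic['style'].value_counts()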

Part 3 - Data manipulation and summarization using pandas¶

A major reason to use pandas DataFrames is the ease of data manipulation and cleaning. Below is an overview of a few key data manipulation operations in pandas. If you took Sta-230, these should be conceptually familiar.

First, grouped summarization:

In [24]:
## Group the data by the 'style' column, then find the mean of 'sale.amount' for each group
ic.groupby(by='style')['sale.amount'].mean()
Out[24]:
style
1 1/2 Story Frame    186644.000000
1 Story Brick        220225.416667
1 Story Condo        121246.600000
1 Story Frame        166179.740634
2 Story Brick        334985.000000
2 Story Condo        149766.666667
2 Story Frame        215038.451087
Split Foyer Frame    160058.333333
Split Level Frame    208351.612903
Name: sale.amount, dtype: float64

Next, merging and joining:

In [25]:
## Create an example dataframe for illustration
more_data = pd.DataFrame({'sale.date': ['1/3/2005','1/12/2005'],
        'new_variable': ['a','b']})

## Left join 'ic' onto this example dataframe according to the 'sale.date' variable
merged_data = more_data.merge(ic, on='sale.date', how='left')

## Print some of the merged data, notice the new columns from 'ic'
merged_data[['sale.date', 'new_variable', 'sale.amount', 'built']]
Out[25]:
sale.date new_variable sale.amount built
0 1/3/2005 a 172500 1993
1 1/12/2005 b 168500 1976
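
For comparison, if we instead wanted to keep every row from both DataFrames, we could request an outer join; a quick sketch:

## An outer join keeps the union of 'sale.date' values from both DataFrames
outer_data = more_data.merge(ic, on='sale.date', how='outer')
print(outer_data.shape)   # many more rows, since all sale dates in 'ic' are kept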

Now, inserting new columns:

In [26]:
## Add a new column containing the log-transformed sale amount as the 1st column
ic.insert(loc=0, column='new_col_name', value=np.log(ic['sale.amount']))
ic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   new_col_name  777 non-null    float64
 1   sale.amount   777 non-null    int64  
 2   sale.date     777 non-null    object 
 3   occupancy     777 non-null    object 
 4   style         777 non-null    object 
 5   built         777 non-null    int64  
 6   bedrooms      777 non-null    int64  
 7   bsmt          777 non-null    object 
 8   ac            777 non-null    object 
 9   attic         777 non-null    object 
 10  area.base     777 non-null    int64  
 11  area.add      777 non-null    int64  
 12  area.bsmt     777 non-null    int64  
 13  area.garage1  777 non-null    int64  
 14  area.garage2  777 non-null    int64  
 15  area.living   777 non-null    int64  
 16  area.lot      777 non-null    int64  
 17  lon           777 non-null    float64
 18  lat           777 non-null    float64
 19  assessed      777 non-null    int64  
dtypes: float64(3), int64(11), object(6)
memory usage: 121.5+ KB

Finally, pivoting:

In [27]:
## Load a data set that is in "wide" format
bluechip_stocks = pd.read_csv("https://remiller1450.github.io/data/bluechips.csv")
bluechip_stocks
Out[27]:
Year AAPL KO JNJ AXP
0 2010 7.643214 28.520000 64.680000 40.919998
1 2011 11.770357 32.610001 62.820000 43.400002
2 2012 14.686786 35.070000 65.879997 48.389999
3 2013 19.608213 37.599998 70.839996 58.750000
4 2014 19.754642 40.660000 91.029999 89.449997
5 2015 27.332500 42.139999 104.519997 93.019997
6 2016 26.337500 42.400002 100.480003 67.589996
7 2017 29.037500 41.799999 115.839996 75.349998
8 2018 43.064999 45.540001 139.229996 98.940002
9 2019 35.547501 46.639999 125.720001 93.430000
10 2020 74.357498 54.689999 144.279999 124.599998
11 2021 129.410004 52.759998 156.500000 118.040001
12 2022 182.009995 59.299999 171.539993 168.210007
In [28]:
## Pivot everything but 'Year' longer (print only the first 6 obs)
bluechip_stocks_long = bluechip_stocks.melt(id_vars=['Year'])
bluechip_stocks_long.head(6)
Out[28]:
Year variable value
0 2010 AAPL 7.643214
1 2011 AAPL 11.770357
2 2012 AAPL 14.686786
3 2013 AAPL 19.608213
4 2014 AAPL 19.754642
5 2015 AAPL 27.332500
In [29]:
## Pivot back to "wide" format
bluechip_stocks_wide = bluechip_stocks_long.pivot(index="Year", columns="variable", values="value")
bluechip_stocks_wide.reset_index(inplace=True)  # Extra step to make 'Year' a column rather than row names
print(bluechip_stocks_wide)
variable  Year        AAPL         AXP         JNJ         KO
0         2010    7.643214   40.919998   64.680000  28.520000
1         2011   11.770357   43.400002   62.820000  32.610001
2         2012   14.686786   48.389999   65.879997  35.070000
3         2013   19.608213   58.750000   70.839996  37.599998
4         2014   19.754642   89.449997   91.029999  40.660000
5         2015   27.332500   93.019997  104.519997  42.139999
6         2016   26.337500   67.589996  100.480003  42.400002
7         2017   29.037500   75.349998  115.839996  41.799999
8         2018   43.064999   98.940002  139.229996  45.540001
9         2019   35.547501   93.430000  125.720001  46.639999
10        2020   74.357498  124.599998  144.279999  54.689999
11        2021  129.410004  118.040001  156.500000  52.759998
12        2022  182.009995  168.210007  171.539993  59.299999

In these pivoting examples you should note:

  1. To pivot "long" data into a "wide" format, the argument index defines what should become the rows in the wide form, while columns defines the variable whose values should be converted to new columns.
  2. The pivot() method will assign the column referenced in index to the row axis names of the resulting object. We can move these values into the DataFrame as a variable using reset_index(inplace=True).

Question 4: The code given below reads data scraped from the RealClearPolitics website prior to the 2016 US Presidential election. In this question you'll use the Sample column to create two distinct columns for the number of people polled, N, and the population polled, Pop (either likely voters, LV, or registered voters, RV).

  • Part A - Use the str.split() method (click here for documentation) to split the variable 'Sample' into two new columns named 'N' and 'Pop'. You might notice that these quantities are separated by a space in the original variable.
  • Part B - Reshape these data into "long" format so that each row represents the polling percentage of a single candidate for a single poll.
  • Part C - Filter your reshaped data to include only polls with at least 600 participants and then find the maximum percentage reached by each candidate in a single poll. Print only these 4 maximum polling percentages as your final answer.
In [30]:
## Read polls data set
polls = pd.read_csv("https://remiller1450.github.io/data/polls2016.csv")

## Info on the dataset
polls.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Poll         7 non-null      object 
 1   Date         7 non-null      object 
 2   Sample       7 non-null      object 
 3   MoE          6 non-null      float64
 4   Clinton..D.  7 non-null      int64  
 5   Trump..R.    7 non-null      int64  
 6   Johnson..L.  7 non-null      int64  
 7   Stein..G.    7 non-null      int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 576.0+ bytes

Part 4 - Data visualization using matplotlib¶

There are many graphics packages available in Python. We'll largely stick to simple graphics that can be created using pandas methods or the library matplotlib.

The data visualizations that can be created through pandas are quick and useful for basic exploratory data analysis. In the context of machine learning, they can help us identify variables with unusual values or distributions, as well as marginal relationships between variables.

The code below imports the scatter_matrix() function from the plotting module of pandas and uses it to display a scatter plot matrix for all pairs of numeric variables in the Iowa City home sales data:

In [31]:
## Select only numeric vars
ic_num = ic.select_dtypes(include=['number'])

## Scatterplot matrix of all pairings of numeric vars
from pandas.plotting import scatter_matrix
scatter_matrix(ic_num)
plt.show()

Scatter plot matrices are useful for quickly reviewing the distributions and bivariate relationships between a large number of variables.

If an individual variable requires closer examination, pandas DataFrames have a more general plot() method that can be used to create the following graphics via the kind argument:

  • 'line' : line plot (default)
  • 'bar' : vertical bar plot
  • 'barh' : horizontal bar plot
  • 'hist' : histogram
  • 'box' : boxplot
  • 'kde' : Kernel Density Estimation plot
  • 'area' : area plot
  • 'pie' : pie plot
  • 'scatter' : scatter plot (DataFrame only)
  • 'hexbin' : hexbin plot (DataFrame only)

A simple example of this is shown below:

In [32]:
ic_num['built'].plot(kind = 'hist')
Out[32]:
<AxesSubplot:ylabel='Frequency'>

The plot() method of pandas DataFrames actually uses matplotlib as its backend plotting library. Sometimes we won't be working with DataFrame objects, so it's helpful to have some familiarity with the pyplot module of matplotlib (recall we've already loaded this module under the alias plt).

In the future we'll mostly use pyplot to create line charts, which also happen to be the default graphic created by plt.plot(). This function is flexible and can display several different lines of data in the same graphic with a simple syntax.

In [33]:
## Create some toy inputs
x = np.linspace(0, 4, num=20) # 20 equally spaced values b/w 0 and 4

## Display 3 different f(x)
plt.plot(x, x)
plt.plot(x, np.power(x,2))
plt.plot(x, np.exp(x))

## Add a legend
plt.legend(['y = x', 'y = x^2', 'y = e^x'], loc='upper left')

## Optional plt.show() 
plt.show()

We can see that plt.plot() naturally adds layers to the current plot while keeping track of what's already there. You should note that plt.show() is not required to display a plot in Jupyter Notebooks, but it has the nice benefit of suppressing some of the metadata that would otherwise be printed by code cells containing pyplot graphics.

Question 5: Considering only the columns of type object in the Iowa City home sales data set, use the plot() method of pandas to create an appropriate visualization of the distribution of the variable stored in the last column of this subset of data. Hint: consider using groupby() and size() before using plot() with an appropriate input to the kind argument.

Part 5 - Functions and iteration¶

There will be several instances where I'll ask you to program a simplified implementation of an algorithm yourself in order to help you better understand it. This will require some familiarity with the basic mechanics of two programming concepts in Python:

  1. Functions - a function is a block of code with an associated name and arguments that carries out a specific, repeatable task
  2. Looping - an instruction that is repeated multiple times, either by iterating over a sequence (for loop) or by repeating as long as a condition holds (while loop)

Functions are created using the def keyword followed by the name you give the function, its arguments in parentheses, and a : to initiate the function's code block. This code block must be indented by 4 spaces or 1 tab, and the function should use the return keyword to indicate which object(s) will be returned when the function is used.

Below is an example of a function that computes the sum of squared differences between two numeric vectors:

In [34]:
## Define function
def squared_diff(y, yh):
    diff = y - yh
    ss_diff = np.dot(diff, diff)
    return ss_diff

## Example use
squared_diff(y = np.array([1,1,1]), yh = np.array([1,0,2]))
Out[34]:
2

There are two types of looping you should know how to use in this course. The simpler of the two is the while loop, which repeats as long as a condition holds. Like functions, the code block used in a while loop is initiated by a : and must be indented by 4 spaces or a tab. Below is an example of a simple while loop:

In [35]:
## Example while loop
x = 0
while x <= 3:
    print(x)
    x = x +1
0
1
2
3

We can see that the code block within the loop executed 4 times. Prior to what would have been the 5th repetition, the loop's condition was no longer met because x now held the value 4, so the loop terminated.

One use of while loops in machine learning is to repeat an optimization procedure until a precision criterion is met. A toy example of this would be shrinking a vector towards zero until we get within a certain margin of error using the squared_diff() function we created earlier:

In [36]:
## More realistic use of a while loop
diff = float('inf')
tol = 0.1
yh = np.array([1,1.5,-2])
target = np.array([0,0,0])
while diff > tol:
    diff = squared_diff(yh, target)
    yh = yh/1.5    # shrink values by a factor of 1.5
    print(yh)
[ 0.66666667  1.         -1.33333333]
[ 0.44444444  0.66666667 -0.88888889]
[ 0.2962963   0.44444444 -0.59259259]
[ 0.19753086  0.2962963  -0.39506173]
[ 0.13168724  0.19753086 -0.26337449]
[ 0.0877915   0.13168724 -0.17558299]
[ 0.05852766  0.0877915  -0.11705533]

Question 6: Initialize a value to 1, then use a while loop to repeatedly divide this initial value by 2 until it is within a tolerance of 0.001 from zero. Print the current value in each iteration of the loop to confirm that the final result is indeed within 0.001 of zero.

The other type of looping you should be comfortable with is the for loop, which iterates through the members of an object in order.

The simplest way to use a for loop is to iterate through a sequence of integers set up using the range() function:

In [37]:
for i in range(3):
    print(i)
0
1
2

Many types of objects are iterable, meaning they can be looped over via a for loop. One place where we'll use this concept in the future will be looping through all images contained in a folder. You're welcome (but not expected) to try this by downloading and extracting this folder containing 50 images of cats: https://remiller1450.github.io/data/cats.zip

In [38]:
## Libraries
import os
import matplotlib.image as mpimg

## Root directory containing the folder (on my PC)
path = 'OneDrive - Grinnell College/Documents/cats/'
file_list = os.listdir(path)[:3]                   # We'll show only the first 3 images

## Loop through file list, notice we can concatenate the file path and name into a single string using '+'
for file in file_list:
    plt.imshow(mpimg.imread(path + file))
    plt.show()

Question 7: Use a for loop to iterate through the following list: ['apple', 'banana', 'watermelon'] printing the number of characters in the current fruit using the len() function at each iteration.

Part 6 - $k$-means clustering¶

$k$-means clustering, along with many other machine learning tools that we'll make extensive use of, is implemented in the sklearn or "scikit-learn" library:

In [39]:
import sklearn
from sklearn.cluster import KMeans # Import k-means function
import warnings
warnings.filterwarnings("ignore") ## We will turn off warnings since k-means will warn about future changes

This portion of the lab will explore $k$-means clustering and related concepts on two simulated data sets that exhibit very different patterns:

In [40]:
## Datasets module
from sklearn import datasets

## Dataset #1 - "moons"
moons = datasets.make_moons(n_samples=500, noise=0.11, random_state=8)
moons_X = moons[0]         # features to use in clustering
moons_labels = moons[1]    # true cluster identities

## Visualization
plt.scatter(moons_X[:,0], moons_X[:,1], c =moons_labels)
plt.show()
In [41]:
## Dataset #2 - "blobs"
blobs = datasets.make_blobs(n_samples=500, cluster_std=2, random_state=8)
blobs_X = blobs[0]         # features to use in clustering
blobs_labels = blobs[1]    # true cluster identities

plt.scatter(blobs_X[:,0], blobs_X[:,1], c = blobs_labels)
plt.show()

Preprocessing¶

In most applications the data you receive are not immediately ready to be input directly into machine learning algorithms. Instead, additional steps known as preprocessing are typically applied before machine learning algorithms are used.

Because $k$-means clustering relies upon distances between data-points and cluster prototypes, the unit scale of each feature is influential.

  • Humans know that heights that are 12 inches apart (e.g., 70 in. vs. 58 in.) and heights that are 1 foot apart (e.g., 6 ft vs. 5 ft) are equally far from each other
  • In the $k$-means algorithm, without standardization, a variable expressed in inches will exert more influence than an identical variable expressed in feet.
  • An easy way to equalize the influence of features with different scales is to standardize them via StandardScaler() in sklearn, which applies a $Z$-score transformation to each feature (a quick check of this appears after the example below):
In [42]:
from sklearn.preprocessing import StandardScaler
moons_XS = StandardScaler().fit_transform(moons_X) # use the fit_transform method of a newly instantiated StandardScaler object

## Plot the standardized features, notice the subtle change in the x/y axes
plt.scatter(moons_XS[:,0], moons_XS[:,1], c =moons_labels) 
Out[42]:
<matplotlib.collections.PathCollection at 0x1bd0204f610>
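
As a quick sanity check, StandardScaler() is simply applying the familiar $Z$-score formula $z = (x - \bar{x})/s$ to each column (using the population standard deviation). The sketch below verifies this on the moons features:

## Manual column-wise Z-score (np.std defaults to ddof=0, matching StandardScaler)
manual_XS = (moons_X - moons_X.mean(axis=0)) / moons_X.std(axis=0)

## Should print True (up to floating point error)
print(np.allclose(manual_XS, moons_XS))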

Choosing $k$¶

The $k$-means algorithm requires that $k$ be pre-specified. A simple strategy to help select an appropriate number of clusters is to search over a range of possible values of $k$ to find the point where the rate of improvement in the clustering objective function begins to diminish. You should note that the objective function will always improve as $k$ increases, so we are not looking for a minimum. Instead, we are looking for an inflection point or "elbow" in the curve produced by the loop given below:

In [43]:
## Loop through choices of k ranging from 2 to 11
scores = []
for k in range(2, 12):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(moons_XS)
    scores.append(kmeans.inertia_)

## Visualize
plt.plot(range(2, 12), scores)
plt.xlabel("k")
plt.ylabel("Objective")
plt.show()

For the moons data we see that this curve shows an inflection point (or elbow) around $k=4$ or $k=5$, suggesting we use that many clusters in our final analysis of these data.

Evaluating cluster fit¶

Silhouette scores provide a method for validating clustering performance. Recall that each data-point's silhouette compares $a$, its mean pairwise distance to the other points in its assigned cluster, with $b$, its mean pairwise distance to the points in the nearest neighboring cluster, using the ratio $\frac{b - a}{\max(a, b)}$.

A silhouette score approaching +1 indicates that a data-point is substantially closer to the data in its assigned cluster than it is to the data in the nearest neighboring cluster, a silhouette score of 0 indicates it is equally close to both, and a negative silhouette score indicates the point is, on average, closer to the members of a different cluster than the one it was assigned to.

The mean silhouette score across all data-points can be used to summarize the clustering performance of your chosen algorithm. To calculate it we can use the silhouette_score() function in the metrics module of sklearn:

In [44]:
from sklearn.metrics import silhouette_score
my_clusters = KMeans(n_clusters=4)   # init the k-means object

## silhouette_score() directly returns the mean silhouette across all data-points
my_sils = silhouette_score(moons_XS, my_clusters.fit_predict(moons_XS))

## Display the mean silhouette score
np.average(my_sils)
Out[44]:
0.42124198131633384
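
If we wanted the individual silhouettes (for example, to see which points are poorly clustered), we could use the related silhouette_samples() function from the same module; a brief sketch:

from sklearn.metrics import silhouette_samples

## Per-point silhouette values for the same clustering
sil_values = silhouette_samples(moons_XS, my_clusters.labels_)

## Averaging the per-point values reproduces silhouette_score()
print(np.average(sil_values))

## Points with negative silhouettes likely sit closer to a neighboring cluster
print(np.sum(sil_values < 0))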

An average silhouette score of 0.42 indicates modest to weak performance. In general, average silhouette scores between 0.25 and 0.5 indicate weak but potentially meaningful clustering, averages between 0.5 and 0.7 indicate moderate to strong clustering, and values above 0.7 indicate excellent performance. The weak performance of $k$-means for these data should be apparent when graphing the clustering results for $k=4$:

In [45]:
plt.scatter(moons_XS[:,0], moons_XS[:,1], c =my_clusters.predict(moons_XS)) 
plt.show()

Before moving on to answer the question that appears below, I encourage you to skim over the KMeans documentation provided by sklearn. Here you can find information not covered in this lab, such as how to find the coordinates of the cluster prototypes or change the $k$-means initialization algorithm that is being used.

Question 8: Earlier in this section we introduced a second simulated data set named blobs. Using this data set, perform a clustering analysis that includes the following steps:

  • Proper pre-processing of the data to account for possible differences in scaling.
  • Appropriate selection of $k$.
  • Use of Silhouette scores to evaluate clustering fit.
  • Visualization of the final clustering results.

In addition to providing code that accomplishes each step, provide a 3-4 sentence written summary of your methods and results using a professional scientific tone and an appropriate level of detail. In your summary you should clearly summarize the steps you used and how you interpret the success of the method.

Part 7 - Application of $k$-means - image segmentation¶

Image segmentation is the process of dividing an image into collections of pixels that are labeled by "masks". There are many approaches to image segmentation, but for a variety of applications simple $k$-means clustering can work surprisingly well.

The code below reads an image from the web and then displays it using functions contained in the skimage library:

In [46]:
from skimage import io
my_img = io.imread("https://remiller1450.github.io/data/beach_img.jpg")
io.imshow(my_img)
io.show()

To begin, you should notice that the image is stored in a 3-d array:

In [47]:
my_img.shape
Out[47]:
(525, 700, 3)

The first two dimensions of this array define a pixel's location, while the third dimension defines the pixel's color channel values under the RGB color model. Thus, each element in the array is a pixel intensity in the respective color channel, with values ranging from 0 to 255.

To apply $k$-means clustering we'll need to reshape the image into a 2-d array where the first axis of the array (i.e., rows) represents the pixels and the second axis (i.e., columns) represents the color channels. This is because we'd like to cluster all pixels, regardless of their location, based upon their color intensities.

The code below accomplishes this using reshape():

In [48]:
my_img_2d = my_img.reshape(my_img.shape[0]*my_img.shape[1], my_img.shape[2])
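
A quick check of the shapes confirms what this accomplished, since $525 \times 700 = 367500$ pixels:

print(my_img.shape)      # (525, 700, 3)
print(my_img_2d.shape)   # (367500, 3)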

While we'll rely upon the data to guide our choice of $k$ in many clustering applications, in image segmentation we might choose $k$ ourselves using domain-specific knowledge and goals.

For example, we might choose $k=2$ if our goal is to separate water from land:

In [49]:
kmeans_fit = KMeans(n_clusters=2, random_state=10, n_init=10).fit(my_img_2d)
plt.imshow(kmeans_fit.labels_.reshape(525,700).astype(int), cmap='gray')
Out[49]:
<matplotlib.image.AxesImage at 0x1bd07e77af0>

Alternatively, we can see that choosing $k=3$ will further segment the water to label a shallower sandy section and a deep water section.

In [50]:
from sklearn.cluster import KMeans
kmeans_fit = KMeans(n_clusters=3, random_state=10, n_init=10).fit(my_img_2d)
plt.imshow(kmeans_fit.labels_.reshape(525,700).astype(int), cmap='gray')
Out[50]:
<matplotlib.image.AxesImage at 0x1bd06f07100>
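
Beyond displaying the integer cluster labels, another common way to visualize a segmentation is to replace each pixel with the average color of its assigned cluster (its prototype). A minimal sketch of this idea, reusing the $k=3$ fit from above, is given below:

## Replace each pixel's RGB values with those of its cluster's prototype
centers = kmeans_fit.cluster_centers_.astype('uint8')   # one RGB row per cluster
segmented = centers[kmeans_fit.labels_]                 # shape (367500, 3)
plt.imshow(segmented.reshape(my_img.shape))             # back to (525, 700, 3)
plt.show()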

While this approach isn't perfect, especially compared to more modern image segmentation approaches, the fact that $k$-means is simple, efficient, and doesn't require extensive training or tuning makes it an attractive option.

Question 9: The aim of this question is for you to perform your own image segmentation using $k$-means and practice describing your methodology and results using an appropriate scientific tone. To receive full credit, your submitted response must include the following steps:

  1. Locate an image on the web (or anywhere else) that you believe is appropriate for segmentation. You are welcome to try an image from the "cats" data folder that was provided in the lab section on looping.
  2. Read your chosen image into Python (either directly from the web or from a local directory).
  3. Use $k$-means clustering with a meaningful choice of $k$ to segment the image. Pay careful attention to the color channels in your image, as not every image you find on the web will be stored using the RGB color model.
  4. Report your methods and results in a short paragraph (at least 4 sentences, written in a scientific tone) accompanied by visuals of the original and segmented images.