While learning Python is not the primary focus of this course, a basic understanding of the language and its tools is necessary to succeed in a high-level machine learning project.
This lab will introduce a few basic concepts related to data handling in Python, and the second portion will demonstrate the application of these concepts with $k$-means clustering. The lab is written under the assumption that you're using the Anaconda distribution, an open source Python distribution that assists in package and environment management.
Directions: Please record your responses to all embedded questions in a Jupyter Notebook that will be submitted via P-web. This notebook should contain only the code necessary to answer the lab's questions and should not include code from the given examples that isn't needed to answer them. I encourage you to try out the examples in a separate notebook.
We will make extensive use of pre-built collections of code known as libraries. If you are using the Anaconda distribution, the code below will import two different libraries, pandas and numpy, that each contain tools for data handling and manipulation. This code also assigns an alias to each library (i.e., pd and np), which reduces the amount of typing needed to reference functions and modules within the libraries.
import pandas as pd
import numpy as np
Python libraries are typically organized hierarchically into modules, which are pre-built bundles of related functions.
The code below uses the . operator to access the pyplot module within the matplotlib library and load it under the alias plt.
import matplotlib.pyplot as plt
Note: if you are working with your own Python installation (rather than the Anaconda distribution), you might need to download these libraries yourself. This guide provides a set of instructions for Windows; this page provides a few different installation options for Mac.
Python contains several built-in data types; four of them are particularly relevant to us:
int - integer values like 5 or 9
float - floating point real number values like 5.0 or 1.3145
str - character strings like '5' or 'five'
bool - boolean logical values like True or False
As with any language, passing the wrong data type into a function can lead to errors. Thus, the first thing we should know how to do is check the type of a variable and coerce it to another type if necessary.
Demonstrated below are a few examples:
x = '4' # Define x as '4'
type(x) # Notice the type is str
str
float(x) # Coerce to float
4.0
The int(), bool(), and str() functions are three additional coercion functions to be aware of. In particular, you might want to note how bool() will handle different numerical values:
## Three different boolean conversions
print(bool(1.0), bool(-2), bool(0.0))
True True False
Question 1: Coerce y (defined below) into the integer value 1, storing the result as yn. Confirm that yn is type int using an appropriate function. Hint: first coerce y to float, then coerce this result to int.
y = '1.0'
Additionally, briefly explain why the code below indicates that z is still type int rather than float.
z = 5
float(z)
type(z)
int
There are many ways to store data in Python. For the moment we'll concern ourselves with 4 different data objects: lists, dictionaries, arrays (numpy), and DataFrames (pandas).
Lists are the simplest of these. The code given below creates two different list objects:
## A simple list
my_list1 = [1,5,3,9]
## Two lists within a list
my_list2 = [[1,3,5], [2,4,6]]
We can access elements of a list using their index positions. Python starts indexing at 0:
## The first list in `my_list2`
my_list2[0]
[1, 3, 5]
## Elements in positions 0 through 2 (first 3 positions) in my_list1
my_list1[:3]
[1, 5, 3]
At first it might seem confusing that :3 does not include the element in index position 3, but this is for compatibility with the output of functions like len():
## All elements in my_list1
my_list1[:len(my_list1)]
[1, 5, 3, 9]
Finally, you should be comfortable with more complex uses of indexing, such as:
## Element in position 1 of the first list in my_list2
my_list2[0][1]
3
Question 2: Create a list named my_numbers consisting of three lists, the first being the character strings 'one', 'two', and 'three', the second being the integers 1, 2, and 3, and the third being the floats 1.0, 2.0, and 3.0. Then, starting with my_numbers, use indices to access and print the last element in the list of integers. Additionally, you should note that we'll generally try to avoid the use of magic numbers, but they're perfectly fine for this exercise.
Dictionaries are used to store data consisting of key:value pairs. The following is an example dictionary that uses the keys "brand", "year", and "colors", and the values "Ford", 1964, and ["red", "white", "blue"]:
my_dict = {
"brand": "Ford",
"year": 1964,
"colors": ["red", "white", "blue"]
}
We'll primarily use dictionaries to organize preprocessing steps and tuning parameter combinations that we'd like to evaluate in supervised learning tasks. However, it is worth knowing that you can use the name of a key to access the corresponding values, as some models will use dictionaries to store information:
my_dict['colors']
['red', 'white', 'blue']
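For a preview of how we'll use dictionaries later, below is a minimal sketch of a dictionary organizing candidate tuning parameter values (the particular parameter names and values here are only for illustration):
## Hypothetical dictionary of tuning parameter candidates (names are illustrative)
param_grid = {
    'n_clusters': [2, 3, 4, 5],      # candidate numbers of clusters
    'init': ['k-means++', 'random']  # candidate initialization methods
}
print(param_grid['n_clusters'])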
Arrays (numpy)
We previously created my_list2 and saw that the command my_list2[0][1] could be used to access the element in position 1 of the first list. However, it would be nice if the command my_list2[0,1] could access this element, which helps motivate numpy arrays.
my_list2 = [[1,3,5], [2,4,6]] # Same list as before
my_array2 = np.array(my_list2) # Convert to a numpy array, notice our use of the alias np
my_array2[0,1] # Now this works!
3
While many features of numpy arrays might seem similar to lists, and many functions can use them interchangeably, arrays offer a few benefits, such as vectorized mathematical operations.
Below is an example of how vectorized mathematical operations can be used with arrays, but not with lists:
## For lists, + will concatenate
my_list1 = [1,5,3,9]
print(my_list1 + my_list1)
## For arrays, + works as intended (vectorized addition)
my_array1 = np.array(my_list1)
print(my_array1 + my_array1)
[1, 5, 3, 9, 1, 5, 3, 9]
[ 2 10  6 18]
Arrays also provide us an opportunity to learn about two important aspects of Python objects:
Methods, which are functions belonging to an object and are called using object_name.method_name(method_arguments)
Attributes, which are values stored within an object and are accessed using object_name.attribute_name
The code below demonstrates these on a 2-d array:
## Create the 2-d array
my_array = np.array([[1,3,5], [2,4,6]])
my_array.shape # arrays have a "shape" attribute
(2, 3)
my_array.flatten() # arrays have a "flatten" method
array([1, 3, 5, 2, 4, 6])
I encourage you to consult this documentation page for information on some of the available attributes and methods of numpy arrays.
Question 3: Use the np.linspace() function to create an array of 10 linearly spaced values between 0 and 90. Then use the reshape() method to reshape this array to have the new shape (2,5).
DataFrames (pandas)
We'll later see that numpy arrays have the advantage of being able to coherently store data along more than 2 axes. However, when we have conventionally formatted data (each example/observation as a row and each feature as a column) our preferred method of storage will be the pandas DataFrame due to the wider assortment of data manipulation options it offers.
The code below uses the read_csv() function from the pandas library to load a data set from the web. You should recall that we gave the alias pd to pandas earlier.
ic = pd.read_csv("https://remiller1450.github.io/data/IowaCityHomeSales.csv")
These data were scraped from the Johnson County (IA) Assessor between 2005 and 2007 and contain various characteristics of all homes sold in Iowa City, IA during that period.
Unlike numpy arrays, pandas DataFrames contain only 2 axes and named columns (note that a single-axis DataFrame is called a "Series"). This consistent arrangement of data allows for a much wider variety of attributes and methods. A few examples are given below:
## Built-in descriptive summaries of the DataFrame's columns
ic.describe()
sale.amount | built | bedrooms | area.base | area.add | area.bsmt | area.garage1 | area.garage2 | area.living | area.lot | lon | lat | assessed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 |
mean | 180098.329472 | 1972.566281 | 3.025740 | 986.942085 | 81.061776 | 345.971686 | 211.738739 | 76.873874 | 1365.144144 | 8886.574003 | -91.522543 | 41.652816 | 174324.916345 |
std | 90655.308636 | 31.252192 | 0.990604 | 353.324287 | 182.587922 | 386.380310 | 252.103302 | 162.755911 | 527.894470 | 8474.390873 | 0.033708 | 0.011389 | 85370.750928 |
min | 38250.000000 | 1873.000000 | 1.000000 | 240.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 312.000000 | 137.000000 | -91.605747 | 41.628040 | 38590.000000 |
25% | 130000.000000 | 1958.000000 | 2.000000 | 726.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1017.000000 | 5398.000000 | -91.549255 | 41.645962 | 126030.000000 |
50% | 157900.000000 | 1981.000000 | 3.000000 | 924.000000 | 0.000000 | 250.000000 | 0.000000 | 0.000000 | 1240.000000 | 7500.000000 | -91.515802 | 41.652618 | 153930.000000 |
75% | 205000.000000 | 1998.000000 | 4.000000 | 1184.000000 | 36.000000 | 600.000000 | 440.000000 | 0.000000 | 1560.000000 | 9960.000000 | -91.496104 | 41.659200 | 198200.000000 |
max | 815000.000000 | 2007.000000 | 7.000000 | 3440.000000 | 2261.000000 | 2500.000000 | 1065.000000 | 856.000000 | 4988.000000 | 158123.000000 | -91.463069 | 41.690921 | 778000.000000 |
# Select just the 'sale.amount' column and print the first 5 values
ic['sale.amount'].head(n = 5)
0    172500
1     90000
2    168500
3    205000
4    121000
Name: sale.amount, dtype: int64
## Select only columns with numeric data types and print the resulting shape
ic.select_dtypes(include=['number']).shape
(777, 13)
## Subset rows using a boolean condition and print the resulting shape.
ic[ic['sale.amount'] > 500000].shape
(11, 19)
Again, these examples are only a few of the many methods and attributes of pandas DataFrames. I encourage you to browse this page to see a complete list.
Data Manipulation in pandas
A major reason to use pandas DataFrames is the ease of data manipulation and cleaning. Below is an overview of a few key data manipulation operations in pandas. If you took Sta-230, these should be conceptually familiar.
First, grouped summarization:
## Group the data by the 'style' column, then find the mean of 'sale.amount' for each group
ic.groupby(by='style')['sale.amount'].mean()
style
1 1/2 Story Frame    186644.000000
1 Story Brick        220225.416667
1 Story Condo        121246.600000
1 Story Frame        166179.740634
2 Story Brick        334985.000000
2 Story Condo        149766.666667
2 Story Frame        215038.451087
Split Foyer Frame    160058.333333
Split Level Frame    208351.612903
Name: sale.amount, dtype: float64
Next, merging and joining:
## Create an example dataframe for illustration
more_data = pd.DataFrame({'sale.date': ['1/3/2005','1/12/2005'],
'new_variable': ['a','b']})
## Left join 'ic' onto this example dataframe according to the 'sale.date' variable
merged_data = more_data.merge(ic, on='sale.date', how='left')
## Print some of the merged data, notice the new columns from 'ic'
merged_data[['sale.date', 'new_variable', 'sale.amount', 'built']]
sale.date | new_variable | sale.amount | built | |
---|---|---|---|---|
0 | 1/3/2005 | a | 172500 | 1993 |
1 | 1/12/2005 | b | 168500 | 1976 |
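For comparison, below is a minimal sketch of an inner join (reusing more_data and ic), which keeps only rows whose 'sale.date' value appears in both DataFrames:
## Inner join keeps only rows with matching 'sale.date' values in both DataFrames
inner_data = more_data.merge(ic, on='sale.date', how='inner')
print(inner_data.shape)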
Now, inserting new columns:
## Add a new column containing the log-transformed sale amount as the 1st column
ic.insert(loc=0, column='new_col_name', value=np.log(ic['sale.amount']))
ic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 20 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   new_col_name  777 non-null    float64
 1   sale.amount   777 non-null    int64
 2   sale.date     777 non-null    object
 3   occupancy     777 non-null    object
 4   style         777 non-null    object
 5   built         777 non-null    int64
 6   bedrooms      777 non-null    int64
 7   bsmt          777 non-null    object
 8   ac            777 non-null    object
 9   attic         777 non-null    object
 10  area.base     777 non-null    int64
 11  area.add      777 non-null    int64
 12  area.bsmt     777 non-null    int64
 13  area.garage1  777 non-null    int64
 14  area.garage2  777 non-null    int64
 15  area.living   777 non-null    int64
 16  area.lot      777 non-null    int64
 17  lon           777 non-null    float64
 18  lat           777 non-null    float64
 19  assessed      777 non-null    int64
dtypes: float64(3), int64(11), object(6)
memory usage: 121.5+ KB
Finally, pivoting:
## Load a data set that is in "wide" format
bluechip_stocks = pd.read_csv("https://remiller1450.github.io/data/bluechips.csv")
bluechip_stocks
Year | AAPL | KO | JNJ | AXP | |
---|---|---|---|---|---|
0 | 2010 | 7.643214 | 28.520000 | 64.680000 | 40.919998 |
1 | 2011 | 11.770357 | 32.610001 | 62.820000 | 43.400002 |
2 | 2012 | 14.686786 | 35.070000 | 65.879997 | 48.389999 |
3 | 2013 | 19.608213 | 37.599998 | 70.839996 | 58.750000 |
4 | 2014 | 19.754642 | 40.660000 | 91.029999 | 89.449997 |
5 | 2015 | 27.332500 | 42.139999 | 104.519997 | 93.019997 |
6 | 2016 | 26.337500 | 42.400002 | 100.480003 | 67.589996 |
7 | 2017 | 29.037500 | 41.799999 | 115.839996 | 75.349998 |
8 | 2018 | 43.064999 | 45.540001 | 139.229996 | 98.940002 |
9 | 2019 | 35.547501 | 46.639999 | 125.720001 | 93.430000 |
10 | 2020 | 74.357498 | 54.689999 | 144.279999 | 124.599998 |
11 | 2021 | 129.410004 | 52.759998 | 156.500000 | 118.040001 |
12 | 2022 | 182.009995 | 59.299999 | 171.539993 | 168.210007 |
## Pivot everything but 'Year' longer (print only the first 6 obs)
bluechip_stocks_long = bluechip_stocks.melt(id_vars=['Year'])
bluechip_stocks_long.head(6)
Year | variable | value | |
---|---|---|---|
0 | 2010 | AAPL | 7.643214 |
1 | 2011 | AAPL | 11.770357 |
2 | 2012 | AAPL | 14.686786 |
3 | 2013 | AAPL | 19.608213 |
4 | 2014 | AAPL | 19.754642 |
5 | 2015 | AAPL | 27.332500 |
## Pivot back to "wide" format
bluechip_stocks_wide = bluechip_stocks_long.pivot(index="Year", columns="variable", values="value")
bluechip_stocks_wide.reset_index(inplace=True) # Extra step to make 'Year' a column rather than row names
print(bluechip_stocks_wide)
variable  Year        AAPL         AXP         JNJ         KO
0         2010    7.643214   40.919998   64.680000  28.520000
1         2011   11.770357   43.400002   62.820000  32.610001
2         2012   14.686786   48.389999   65.879997  35.070000
3         2013   19.608213   58.750000   70.839996  37.599998
4         2014   19.754642   89.449997   91.029999  40.660000
5         2015   27.332500   93.019997  104.519997  42.139999
6         2016   26.337500   67.589996  100.480003  42.400002
7         2017   29.037500   75.349998  115.839996  41.799999
8         2018   43.064999   98.940002  139.229996  45.540001
9         2019   35.547501   93.430000  125.720001  46.639999
10        2020   74.357498  124.599998  144.279999  54.689999
11        2021  129.410004  118.040001  156.500000  52.759998
12        2022  182.009995  168.210007  171.539993  59.299999
In these pivoting examples you should note:
The argument index defines what should become the new rows in the wide form, while columns defines the variable whose values should be converted to new columns.
The pivot() method will assign the column referenced in index to the row axis names of the resulting object. We can move these values back into the DataFrame as a variable using reset_index(inplace=True).
Question 4: The code given below reads data scraped from the RealClearPolitics website prior to the 2016 US Presidential election. Your task is to use the Sample column to create two distinct columns for the number of people polled, N, and the population polled, Pop (either likely voters, LV, or registered voters, RV). Specifically, use the str.split() method (click here for documentation) to split the variable 'Sample' into two new columns named 'N' and 'Pop'. You might notice that these quantities are separated by a space in the original variable.
## Read polls data set
polls = pd.read_csv("https://remiller1450.github.io/data/polls2016.csv")
## Info on the dataset
polls.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Poll         7 non-null      object
 1   Date         7 non-null      object
 2   Sample       7 non-null      object
 3   MoE          6 non-null      float64
 4   Clinton..D.  7 non-null      int64
 5   Trump..R.    7 non-null      int64
 6   Johnson..L.  7 non-null      int64
 7   Stein..G.    7 non-null      int64
dtypes: float64(1), int64(4), object(3)
memory usage: 576.0+ bytes
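If you haven't used str.split() before, below is a minimal sketch of the general pattern on a toy DataFrame (the data here are made up and are not the polls data):
## Toy illustration of str.split() with expand=True (hypothetical data)
toy = pd.DataFrame({'Sample': ['1000 LV', '850 RV']})
toy[['N', 'Pop']] = toy['Sample'].str.split(' ', expand=True)  # expand=True returns a DataFrame
print(toy)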
Data Visualization with matplotlib
There are many graphics packages available in Python. We'll largely stick to simple graphics that can be created using pandas methods or the library matplotlib.
The data visualizations that can be created through pandas are quick and useful for basic exploratory data analysis. In the context of machine learning, they can help us identify variables with unusual values or distributions, as well as marginal relationships between variables.
The code below imports the scatter_matrix() function from the plotting module of pandas and uses it to display a scatter plot matrix for all pairs of numeric variables in the Iowa City home sales data:
## Select only numeric vars
ic_num = ic.select_dtypes(include=['number'])
## Scatterplot matrix of all pairing of numeric vars
from pandas.plotting import scatter_matrix
scatter_matrix(ic_num)
plt.show()
Scatter plot matrices are useful for quickly reviewing the distributions and bivariate relationships between a large number of variables.
If an individual variable requires closer examination, pandas DataFrames have a more generalized plot() method that can be used to create several different graphics using the kind argument (e.g., 'line', 'bar', 'hist', 'box', 'kde', 'area', 'pie', or 'scatter').
A simple example of this is shown below:
ic_num['built'].plot(kind = 'hist')
<AxesSubplot:ylabel='Frequency'>
The plot() method of pandas DataFrames actually uses matplotlib as its backend plotting library. Sometimes we won't be working with DataFrame objects, so it's helpful to have some familiarity with the pyplot module of matplotlib (recall we've already loaded this module under the alias plt).
In the future we'll mostly use pyplot to create line charts, which also happen to be the default graphic created by plt.plot(). This function is flexible and can display several different lines of data in the same graphic with a simple syntax.
## Create some toy inputs
x = np.linspace(0, 4, num=20) # 20 equally spaced values b/w 0 and 4
## Display 3 different f(x)
plt.plot(x, x)
plt.plot(x, np.power(x,2))
plt.plot(x, np.exp(x))
## Add a legend
plt.legend(['y = x', 'y = x^2', 'y = e^x'], loc='upper left')
## Optional plt.show()
plt.show()
We can see that plt.plot() naturally adds layers to the current plot while keeping track of what's already there. You should note that plt.show() is not required to display a plot in Jupyter Notebooks, but it has the nice benefit of suppressing some of the metadata that's otherwise printed by cells containing pyplot graphics.
Question 5: Considering only the columns of type object in the Iowa City home sales data set, use the plot() method of pandas to create an appropriate visualization of the distribution of the variable stored in the last column of this subset of data. Hint: consider using groupby() and size() before using plot() with an appropriate input to the kind argument.
There will be several instances where I'll ask you to program a simplified implementation of an algorithm yourself in order to help you better understand it. This will require some familiarity with the basic mechanics of two programming concepts in Python: writing your own functions and using loops.
Functions are created using the def keyword followed by the name you give the function, its arguments in parentheses, and a : to initiate the function's code block. This code block must be indented by 4 spaces or 1 tab, and the function should use the return keyword to indicate which object(s) will be returned when the function is called.
Below is an example of a function that computes the sum of squared differences between two numeric vectors:
## Define function
def squared_diff(y, yh):
    diff = y - yh
    ss_diff = np.dot(diff, diff)
    return ss_diff
## Example use
squared_diff(y = np.array([1,1,1]), yh = np.array([1,0,2]))
2
There are two types of looping you should know how to use in this course. The simpler of the two is the while loop, which repeats until a condition is met. Like functions, the code block used in a while loop is initiated by a : and must be indented by 4 spaces or a tab. Below is an example of a simple while loop:
## Example while loop
x = 0
while x <= 3:
    print(x)
    x = x + 1
0
1
2
3
We can see that the code block within the loop executed 4 times. Prior to what would have been the 5th repetition, the loop's condition was no longer met because x now held the value 4, so the loop terminated.
One use of while loops in machine learning is to repeat an optimization procedure until a precision criterion is met. A toy example of this would be shrinking a vector towards zero until we get within a certain margin of error, using the squared_diff() function we created earlier:
## More realistic use of a while loop
diff = float('inf')
tol = 0.1
yh = np.array([1,1.5,-2])
target = np.array([0,0,0])
while diff > tol:
    diff = squared_diff(yh, target)
    yh = yh/1.5 # shrink values by dividing by 1.5
    print(yh)
[ 0.66666667  1.         -1.33333333]
[ 0.44444444  0.66666667 -0.88888889]
[ 0.2962963   0.44444444 -0.59259259]
[ 0.19753086  0.2962963  -0.39506173]
[ 0.13168724  0.19753086 -0.26337449]
[ 0.0877915   0.13168724 -0.17558299]
[ 0.05852766  0.0877915  -0.11705533]
Question 6: Initialize a value to 1, then use a while loop to repeatedly divide this initial value by 2 until it is within a tolerance of 0.001 from zero. Print the current value in each iteration of the loop to confirm that the final result is indeed within 0.001 of zero.
The other type of looping you should be comfortable with is the for loop, which iterates through the members of an object in order.
The simplest way to use a for loop is to iterate through a sequence of integers set up using the range() function:
for i in range(3):
    print(i)
0
1
2
Many types of objects are iterable, meaning they can be looped over via a for loop. One place where we'll use this concept in the future will be looping through all images contained in a folder. You're welcome (but not expected) to try this by downloading and extracting this folder containing 50 images of cats: https://remiller1450.github.io/data/cats.zip
## Libraries
import os
import matplotlib.image as mpimg
## Root directory containing the folder (on my PC)
path = 'OneDrive - Grinnell College/Documents/cats/'
file_list = os.listdir(path)[:3] # We'll show only the first 3 images
## Loop through file list, notice we can concatenate the file path and name into a single string using '+'
for file in file_list:
    plt.imshow(mpimg.imread(path + file))
    plt.show()
Question 7: Use a for loop to iterate through the following list: ['apple', 'banana', 'watermelon'], printing the number of characters in the current fruit using the len() function at each iteration.
$k$-means clustering, along with many other machine learning tools that we'll make extensive use of, is implemented in the sklearn or "scikit-learn" library:
import sklearn
from sklearn.cluster import KMeans # Import k-means function
import warnings
warnings.filterwarnings("ignore") ## We will turn off warnings since k-means will warn about future changes
This portion of the lab will explore $k$-means clustering and related concepts on two simulated data sets that exhibit very different patterns:
## Datasets module
from sklearn import datasets
## Dataset #1 - "moons"
moons = datasets.make_moons(n_samples=500, noise=0.11, random_state=8)
moons_X = moons[0] # features to use in clustering
moons_labels = moons[1] # true cluster identities
## Visualization
plt.scatter(moons_X[:,0], moons_X[:,1], c =moons_labels)
plt.show()
## Dataset #2 - "blobs"
blobs = datasets.make_blobs(n_samples=500, cluster_std=2, random_state=8)
blobs_X = blobs[0] # features to use in clustering
blobs_labels = blobs[1] # true cluster identities
plt.scatter(blobs_X[:,0], blobs_X[:,1], c = blobs_labels)
plt.show()
In most applications the data you receive are not immediately ready to be input directly into machine learning algorithms. Instead, additional steps known as preprocessing are typically applied before machine learning algorithms are used.
Because $k$-means clustering relies upon distances between data-points and cluster prototypes, the unit scale of each feature is influential.
A common first preprocessing step is standardization, which can be performed using the StandardScaler() function in sklearn; this applies a $Z$-score transformation to each feature:
from sklearn.preprocessing import StandardScaler
moons_XS = StandardScaler().fit_transform(moons_X) # use the fit_transform method of a newly instantiated StandardScaler object
## Plot the standardized features, notice the subtle change in the x/y axes
plt.scatter(moons_XS[:,0], moons_XS[:,1], c =moons_labels)
<matplotlib.collections.PathCollection at 0x1bd0204f610>
The $k$-means algorithm requires $k$ be pre-specified. A simple strategy to help select an appropriate number of clusters is to search over a range of possible values of $k$ to find the point where the rate of improvement in the clustering objective function begins to diminish. You should note that the objective function will always improve as $k$ increases, so we are not looking for a minimum. Instead, we are looking for an inflection point or "elbow" in the curve produced by the loop given below:
## Loop through choices of k ranging from 2 to 11
scores = []
for k in range(2, 12):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(moons_XS)
    scores.append(kmeans.inertia_)
## Visualize
plt.plot(range(2, 12), scores)
plt.xlabel("k")
plt.ylabel("Objective")
plt.show()
For the moons data we see that this curve shows an inflection point (or elbow) around $k=4$ or $k=5$, suggesting we use that many clusters in our final analysis of these data.
Silhouette scores provide a method for validating clustering performance. Recall that each data-point's silhouette is $s = (b - a)/\max(a, b)$, where $a$ is its mean pairwise distance to the other points in its assigned cluster and $b$ is its mean pairwise distance to the points in the nearest neighboring cluster.
A silhouette score approaching +1 indicates that a data-point is substantially closer to the data in its assigned cluster than it is to data in the nearest neighboring cluster, a silhouette score of 0 indicates it is equally close, and a negative silhouette score indicates the point is, on average, closer to the members of a different cluster than the one it was assigned to.
The mean silhouette score across all data-points can be used to summarize the clustering performance of your chosen algorithm. To calculate this we can use the silhouette_score() function in the metrics module of sklearn:
from sklearn.metrics import silhouette_score
my_clusters = KMeans(n_clusters=4) # initialize the k-means object
## Calculate the mean silhouette score (silhouette_score() averages across all data-points)
my_sils = silhouette_score(moons_XS, my_clusters.fit_predict(moons_XS))
my_sils
0.42124198131633384
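If you'd like to examine individual data-points rather than the overall mean, sklearn also provides a silhouette_samples() function. Below is a minimal sketch (reusing moons_XS from above):
from sklearn.metrics import silhouette_samples
## Per-point silhouettes for a 4-cluster k-means fit
labels = KMeans(n_clusters=4).fit_predict(moons_XS)
sils = silhouette_samples(moons_XS, labels)
print(sils[:5])     # silhouettes of the first 5 data-points
print(sils.mean())  # matches the mean returned by silhouette_score()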
An average silhouette score of 0.42 indicates modest to weak performance. In general, average silhouette scores between 0.25 and 0.5 indicate weak but potentially meaningful clustering, averages between 0.5 and 0.7 indicate moderate to strong clustering, and values above 0.7 indicate excellent performance. The weak performance of $k$-means for these data should be apparent when graphing the clustering results for $k=4$:
plt.scatter(moons_XS[:,0], moons_XS[:,1], c =my_clusters.predict(moons_XS))
plt.show()
Before moving on to answer the question that appears below, I encourage you to skim over the KMeans documentation provided by sklearn. Here you can find information not covered in this lab, such as how to find the coordinates of the cluster prototypes or change the $k$-means initialization algorithm that is being used.
Question 8: Earlier in this section we introduced a second simulated data set named blobs. Using this data set, perform a clustering analysis that includes each of the major steps demonstrated in this section: standardizing the features, using the elbow method to choose $k$, fitting $k$-means, and assessing performance with silhouette scores.
In addition to providing code that accomplishes each step, provide a 3-4 sentence written summary of your methods and results using a professional scientific tone and an appropriate level of detail. In your summary you should clearly summarize the steps you used and how you interpret the success of the method.
Image segmentation is the process of dividing an image into collections of pixels that are labeled by "masks". There are many approaches to image segmentation, but for a variety of applications simple $k$-means clustering can work surprisingly well.
The code below reads an image from the web and then displays it using functions contained in the skimage library:
from skimage import io
my_img = io.imread("https://remiller1450.github.io/data/beach_img.jpg")
io.imshow(my_img)
io.show()
To begin, you should notice that the image is stored in a 3-d array:
my_img.shape
(525, 700, 3)
The first two dimensions of this array define a pixel location, while the third dimension defines the pixel's color channel values under the RGB color model. Thus, each element in the array is a pixel's intensity in the respective color channel, with values ranging from 0 to 255.
To apply $k$-means clustering we'll need to reshape the image into a 2-d array where the first axis of the array (ie: rows) represents the pixel and the second axis (ie: columns) represents the color channel. This is because we'd like to cluster all pixels, regardless of their location, based upon their color intensities.
The code below accomplishes this using reshape():
my_img_2d = my_img.reshape(my_img.shape[0]*my_img.shape[1], my_img.shape[2])
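As an aside, numpy can infer one dimension for you if you pass -1; a compact equivalent of the line above is:
my_img_2d = my_img.reshape(-1, my_img.shape[2]) # -1 lets numpy infer the number of pixels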
While we'll rely upon the data to guide our choice of $k$ in many clustering applications, in image segmentation we might choose $k$ ourselves using domain-specific knowledge and goals.
For example, we might choose $k=2$ if our goal is to separate water from land:
kmeans_fit = KMeans(n_clusters=2, random_state=10, n_init=10).fit(my_img_2d)
plt.imshow(kmeans_fit.labels_.reshape(525,700).astype(int), cmap='gray')
<matplotlib.image.AxesImage at 0x1bd07e77af0>
Alternatively, we can see that choosing $k=3$ will further segment the water, labeling a shallower sandy section and a deeper water section.
from sklearn.cluster import KMeans
kmeans_fit = KMeans(n_clusters=3, random_state=10, n_init=10).fit(my_img_2d)
plt.imshow(kmeans_fit.labels_.reshape(525,700).astype(int), cmap='gray')
<matplotlib.image.AxesImage at 0x1bd06f07100>
While this approach isn't perfect, especially compared to more modern image segmentation approaches, the fact that $k$-means is simple, efficient, and doesn't require extensive training or tuning makes it an attractive option.
Question 9: The aim of this question is for you to perform your own image segmentation using $k$-means and practice describing your methodology and results using an appropriate scientific tone. To receive full credit, your submitted response must include the following steps: