This lab is intended to provide an overview of concepts from linear algebra and calculus that will be important to us as we begin to study some of the mathematical details of machine learning. You might find the early sections to be simple, but the notation introduced is important.
Matrices are structures that organize data in a rectangular array of rows and columns. A matrix containing $r$ rows and $c$ columns is described as an $r$ x $c$ matrix.
For example, the numpy array X created below is a $4$ x $2$ matrix:
import numpy as np
## Random matrix of predictors using n = 4, p = 2
X = np.random.randint(10, size=(4, 2))
print(X)
[[5 5]
 [9 6]
 [6 1]
 [9 8]]
## Print the shape of X
print(X.shape)
(4, 2)
We'll use bolded capital letters to denote a matrix, so I might reference our example matrix, X, using the notation: $\mathbf{X}$.
A matrix consisting of a single row or column can be described as a vector. Below we set up a $4$ x $1$ vector, y:
## Random y
y = np.random.randint(10, size = (4,1))
print(y)
[[4]
 [4]
 [2]
 [5]]
I will use bolded lower case letters to denote a vector, so I might reference this example using the notation: $\mathbf{y}$.
Finally, you should remember that vectors are just a special case of matrices, so the operations described in the next few sections apply to them.
Some calculations require us to switch the rows and columns of a matrix using an operation known as transposition.
For example, if we switch the rows and columns of $\mathbf{X}$, the resulting matrix is known as the transpose of $\mathbf{X}$, which we'll denote $\mathbf{X}^T$
numpy arrays have an attribute, T, that we can use to access their transpose:
## Print transpose of X
print(X.T)
[[5 9 6 9]
 [5 6 1 8]]
In this example, we can see that the rows of $\mathbf{X}$ have become the columns of $\mathbf{X}^T$ (and vice versa), so $(\mathbf{X}^T)_{ij} = \mathbf{X}_{ji}$.
We'll need to concern ourselves with two different types of addition and subtraction when working with matrices:
The first adds (or subtracts) a scalar to each element of the matrix. For example:
3 + X
array([[ 8,  8],
       [12,  9],
       [ 9,  4],
       [12, 11]])
In this example each entry in X was increased by 3.
In contrast, matrix-matrix addition adds the corresponding elements of the two matrices, or $(\mathbf{A} + \mathbf{B})_{ij} = \mathbf{A}_{ij} + \mathbf{B}_{ij}$. For example:
A = np.asarray([[-10,10],[0,0],[0,0],[0,0]])
A + X
array([[-5, 15],
       [ 9,  6],
       [ 6,  1],
       [ 9,  8]])
Notice how the first row of X was impacted by this operation, while the other rows were not (due to the zeros in A).
Similar to addition and subtraction, multiplication can also be performed in two ways:
The first will multiply each element of the matrix by the scalar. For example:
10*X
array([[50, 50],
       [90, 60],
       [60, 10],
       [90, 80]])
Matrix multiplication is more complex and requires us to define an inner product. For the purposes of this course, we'll treat the terms inner product and dot product interchangeably, though pure mathematicians might consider the dot product to be a special case of an inner product.
Next, consider two $n$ x $1$ vectors, $\mathbf{a}$ and $\mathbf{b}$. The inner product (or dot product) of these vectors is defined as $\mathbf{a}^T \mathbf{b} = \sum_{i=1}^{n}a_ib_i$.
For example, consider y.T and y shown below:
print(y.T)
print(y)
[[4 4 2 5]]
[[4]
 [4]
 [2]
 [5]]
The dot product $\mathbf{y}^T\mathbf{y}$ is $4*4 + 4*4 + 2*2 + 5*5 = 61$.
We can verify this calculation using the dot function in numpy:
np.dot(y.T, y)
array([[61]])
Now consider two matrices, $\mathbf{A}$ and $\mathbf{B}$. Their product, $\mathbf{A}\mathbf{B}$, is defined by each combination of inner products between the rows of $\mathbf{A}$ and the columns of $\mathbf{B}$.
It is essential for the dimensions of $\mathbf{A}$ and $\mathbf{B}$ to be compatible for their matrix-matrix product to exist. That is, the length of each row vector in $\mathbf{A}$ (ie: the number of columns) must match the length of each column vector in $\mathbf{B}$ (ie: the number of rows in $\mathbf{B}$).
Consider the following examples of matrix multiplication performed using matmul in numpy:
## Example 1 - Valid matrix multiplication
np.matmul(X.T, A)
array([[-50,  50],
       [-50,  50]])
## Example 2 - Incompatible dimensions
np.matmul(X, A)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_17604\2518866266.py in <module>
      1 ## Example 2 - Incompatible dimensions
----> 2 np.matmul(X, A)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 2)
In Example 1, note that the first row vector of X.T is [5, 9, 6, 9] and the first column vector of A is [-10, 0, 0, 0]. The inner product of these two vectors is $5*-10 + 0 + 0 + 0 = -50$, which we can see is the element in position 1,1 of the product $\mathbf{X}^T\mathbf{A}$.
Generalizing this, the element in position $i$,$j$ of the matrix product $\mathbf{A}\mathbf{B}$ is given by the inner product of the $i^{th}$ row vector of $\mathbf{A}$ and the $j^{th}$ column vector of $\mathbf{B}$.
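We can verify this element-wise definition directly. The sketch below re-creates the version of X printed earlier (stored as X_demo, since your randomly generated X will differ) and compares a single inner product against the corresponding entry of the full matrix product:
## Re-create the X printed above (your random X will differ) and use the A defined earlier
X_demo = np.array([[5, 5], [9, 6], [6, 1], [9, 8]])
## Inner product of the first row of X_demo.T and the first column of A
print(np.dot(X_demo.T[0, :], A[:, 0]))      ## -50
## The element in position 1,1 of the full product
print(np.matmul(X_demo.T, A)[0, 0])         ## -50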
In Example 2, the dimensions of X are 4 x 2, so each of its row vectors has a length of 2. The dimensions of A are also 4 x 2, so each of its column vectors has a length of 4. Because the dot product calculation relies on each vector being equal in length, matrix multiplication is not possible in this example.
Suppose we have a matrix, stored as M1, containing the row vectors: [1,0,1], [0,1,0], [0,0,1], and another matrix, stored as M2, containing the row vectors: [2,2,2], [3,3,3]. Determine whether matrix multiplication is possible for each of the following pairs (taken in the order listed):
- M1 and M2
- M1 and M2.T
- M2.T and M1
- M2 and M1
- M2.T and M2
- M2 and M2.T
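If you'd like to check your answers, one option (a sketch, not the only approach) is to set up M1 and M2 in numpy and attempt each product, catching the error that numpy raises when the dimensions are incompatible:
## Set up M1 (3 x 3) and M2 (2 x 3) using the row vectors above
M1 = np.array([[1, 0, 1], [0, 1, 0], [0, 0, 1]])
M2 = np.array([[2, 2, 2], [3, 3, 3]])
## Attempt each product; incompatible dimensions raise a ValueError
for left, right, label in [(M1, M2, "M1 and M2"), (M1, M2.T, "M1 and M2.T"),
                           (M2.T, M1, "M2.T and M1"), (M2, M1, "M2 and M1"),
                           (M2.T, M2, "M2.T and M2"), (M2, M2.T, "M2 and M2.T")]:
    try:
        print(label, "->", np.matmul(left, right).shape)
    except ValueError:
        print(label, "-> incompatible dimensions")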
Several rules that are routinely used in ordinary algebra also apply to matrix algebra. Let $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ denote matrices and $c$ denote a scalar; then:
- $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$ (matrix addition is commutative)
- $(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$ (matrix addition is associative)
- $(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})$ (matrix multiplication is associative)
- $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$ (matrix multiplication distributes over addition)
- $c(\mathbf{A} + \mathbf{B}) = c\mathbf{A} + c\mathbf{B}$ (scalar multiplication distributes over addition)
However, there are also a few algebraic operations that are not valid for matrices (at least in general settings; there may be special cases where they do hold). The most important distinction between ordinary algebra and matrix algebra is that the order of multiplication matters in matrix algebra, so multiplying "on the left" is not always the same as multiplying "on the right"; in general, $\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}$.
We can easily verify any of these rules using numeric examples. A few such examples are shown below:
## Two simple matrices
X = np.array([[2,2],[2,2]])
Y = np.array([[1,2],[3,4]])
## X*Y
print(np.matmul(X, Y))
## Y*X (not the same as X*Y)
print(np.matmul(Y, X))
[[ 8 12]
 [ 8 12]]
[[ 6  6]
 [14 14]]
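We can check a couple of the other rules in the same way, for example associativity and the distributive property, using the same X and Y along with a third matrix Z introduced here just for illustration:
## A third matrix (introduced here just for illustration)
Z = np.array([[0, 1], [1, 0]])
## Associativity: (X*Y)*Z equals X*(Y*Z)
print(np.array_equal(np.matmul(np.matmul(X, Y), Z), np.matmul(X, np.matmul(Y, Z))))
## Distributive property: X*(Y + Z) equals X*Y + X*Z
print(np.array_equal(np.matmul(X, Y + Z), np.matmul(X, Y) + np.matmul(X, Z)))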
There are a few special cases of matrices that you should be familiar with:
A symmetric matrix has the property $\mathbf{A}^T = \mathbf{A}$. All symmetric matrices are square, meaning they have the same number of rows and columns.
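As a quick check (using a small symmetric matrix made up for this example), we can confirm symmetry by comparing a matrix to its transpose:
## A small symmetric matrix (made up for this example)
S = np.array([[1, 2, 3], [2, 5, 6], [3, 6, 9]])
print(np.array_equal(S, S.T))    ## True, so S is symmetric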
A special type of symmetric matrix is a diagonal matrix, which has zeros in all of its off-diagonal elements. We will use the notation $diag(\mathbf{v})$ to denote a diagonal matrix where the values of the vector $\mathbf{v}$ populate the diagonal.
The code given below displays an example of a diagonal matrix, which can be created using np.diag:
v = [1,2,3,4]
V = np.diag(v)
print(V)
[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]
An identity matrix is a special case of the diagonal matrix where every diagonal element is 1. Identity matrices, which we will denote $\mathbf{I}$, have the property: $\mathbf{A}\mathbf{I} = \mathbf{I}\mathbf{A} = \mathbf{A}$
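We can create an identity matrix with np.eye and confirm this property using the 2 x 2 matrix Y defined above (a quick numerical check, not a proof):
## 2 x 2 identity matrix
I = np.eye(2)
## Multiplying by I (on either side) returns Y unchanged
print(np.array_equal(np.matmul(Y, I), Y))
print(np.array_equal(np.matmul(I, Y), Y))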
One example where the identity matrix appears is the covariance matrix of the errors in linear regression:
\begin{equation*} Cov(\mathbf{\epsilon}) = \sigma^2 \mathbf{I} = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} \end{equation*}This arises from the assumption that errors are independent with a common variance, so we can factor out the constant $\sigma^2$ from the diagonal matrix.
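For instance, with $n = 4$ and an illustrative value of $\sigma^2 = 2.5$ (chosen arbitrarily here), this covariance matrix could be constructed as:
## Covariance matrix sigma^2 * I for n = 4 (sigma^2 = 2.5 is just an illustrative value)
sigma2 = 2.5
print(sigma2 * np.eye(4))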
In algebra we undo multiplication using division, but with matrices the equivalent of division (the operation that undoes multiplication) is achieved by multiplying both sides of an equation by an inverse matrix.
For example, if we'd like to solve $\mathbf{A}\mathbf{x} = \mathbf{B}$ for $\mathbf{x}$, we must multiply both sides of the equation on the left by $\mathbf{A}^{-1}$, the inverse of $\mathbf{A}$. This would result in the solution $\mathbf{x} = \mathbf{A}^{-1}\mathbf{B}$.
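As a sketch (using a small invertible matrix and right-hand side made up for this example, named A_demo and B_demo to avoid overwriting the A defined earlier), this solution can be carried out in numpy with np.linalg.inv, or with np.linalg.solve, which is generally preferred numerically:
## A small invertible matrix and right-hand side (made up for this example)
A_demo = np.array([[2.0, 1.0], [1.0, 3.0]])
B_demo = np.array([[5.0], [10.0]])
## x = A^{-1} B
x = np.matmul(np.linalg.inv(A_demo), B_demo)
print(x)
## Equivalent (and generally preferred) approach
print(np.linalg.solve(A_demo, B_demo))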
Unfortunately, not all matrices have an inverse. First, only square matrices can have inverses. Second, a square matrix $\mathbf{A}$ will be invertible if and only if its columns (or equivalently its rows) are linearly independent.
For the purposes of our class, we'll operate under the assumption that any square matrix we are working with is invertible. However, you should be aware that it is possible for certain collections of data to give rise to non-invertible matrices in certain calculations. A few common examples are: linearly dependent variables (ie: a variable for height measured in inches and another for height measured in cm), or more variables than observations (a data matrix that is "wider" than it is "long").
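As a quick illustration of the first situation, the sketch below builds a made-up data matrix whose second column is an exact multiple of the first (analogous to recording the same variable in two different units); attempting to invert $\mathbf{H}^T\mathbf{H}$ then fails:
## A made-up data matrix whose second column is exactly twice the first
H = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
## H^T H is singular, so attempting to invert it raises an error
try:
    np.linalg.inv(np.matmul(H.T, H))
except np.linalg.LinAlgError:
    print("Singular matrix - no inverse exists")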
The following are a few matrix identities involving algebra, transposition, and inverses that will be useful to us:
- $(\mathbf{A}^T)^T = \mathbf{A}$
- $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
- $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$
- $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$ (when $\mathbf{A}$ is invertible)
- $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ (when $\mathbf{A}$ and $\mathbf{B}$ are invertible)
- $(\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T$

You should be aware of these identities when performing matrix calculations.
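As with the algebra rules, these identities are easy to spot-check numerically; for example, using two small invertible matrices made up for this purpose:
## Two small invertible matrices (made up for this check)
P = np.array([[2.0, 1.0], [1.0, 3.0]])
Q = np.array([[1.0, 2.0], [0.0, 1.0]])
## (PQ)^T equals Q^T P^T
print(np.allclose(np.matmul(P, Q).T, np.matmul(Q.T, P.T)))
## (PQ)^{-1} equals Q^{-1} P^{-1}
print(np.allclose(np.linalg.inv(np.matmul(P, Q)),
                  np.matmul(np.linalg.inv(Q), np.linalg.inv(P))))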
We'll frequently use matrices and vectors to represent unknown model parameters that our methods will learn from the data. Thus, an important topic in machine learning is optimization, which involves techniques from calculus. To introduce a few concepts we'll need, consider the 3-dimensional vector defined below:
$\mathbf{w} = [w_1, w_2, w_3]$
Let's consider a function of this vector:
$f(\mathbf{w}) = 2\mathbf{w}^T\mathbf{w}$
Next, let's suppose our goal is to find the values of $w_1$, $w_2$, and $w_3$ that minimize $f()$.
Calculus tells us that a minimum occurs at a location where the partial derivatives of $f$ with respect to each component of $\mathbf{w}$ all equal zero.
We'll define the gradient as a vector of partial derivatives of our function with respect to each component of the parameter vector. In our example:
$$\nabla f(\mathbf{w}) = \bigg(\tfrac{\partial f}{\partial w_1}, \tfrac{\partial f}{\partial w_2}, \tfrac{\partial f}{\partial w_3}\bigg)$$The symbol $\nabla$ is used to denote the gradient.
In this simple example, we can write out $f()$ using individual elements (rather than vectors), and then we can find each partial derivative:
$$f(\mathbf{w}) = 2(w_1^2 + w_2^2 + w_3^2)$$We can see that $\tfrac{\partial f}{\partial w_1} = 4w_1$, $\tfrac{\partial f}{\partial w_2} = 4w_2$, and $\tfrac{\partial f}{\partial w_3} = 4w_3$, so we can write the gradient: $$\nabla f(\mathbf{w}) = 4\mathbf{w} \equiv 4\mathbf{w}^T$$
For the purposes of our course, we won't worry about row vs. column orientation of the gradient.
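We can sanity-check this result numerically using a simple finite-difference approximation of each partial derivative (a quick sketch; the point w0 and the step size h are arbitrary choices made here for illustration):
## Finite-difference check of the gradient of f(w) = 2 * w^T w
def f(w):
    return 2 * np.dot(w, w)
## Arbitrary point and small step size (chosen only for illustration)
w0 = np.array([1.0, -2.0, 0.5])
h = 1e-6
## Central-difference approximation of each partial derivative
grad_approx = np.array([(f(w0 + h * e) - f(w0 - h * e)) / (2 * h) for e in np.eye(3)])
print(grad_approx)   ## approximately [ 4. -8.  2.]
print(4 * w0)        ## the gradient 4w evaluated at w0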
In our previous example, $f(\mathbf{w})$ mapped a 3-dimensional input, the vector $\mathbf{w}$, to a scalar output. This allowed us to have a single gradient vector. However, let's now consider the function: $f(\mathbf{w}) = 2\mathbf{w}$
This scenario is different because the output of $f$ is now a 3-dimensional vector.
The Jacobian matrix, $\mathbf{J} = \tfrac{d \mathbf{f}}{d \mathbf{w}}$, is defined as the matrix of all partial derivatives (ie: each output with respect to each input). In our example:
\begin{equation*} \mathbf{J} = \begin{pmatrix} \tfrac{\partial f_1}{\partial w_1} & \tfrac{\partial f_1}{\partial w_2} & \tfrac{\partial f_1}{\partial w_3} \\ \tfrac{\partial f_2}{\partial w_1} & \tfrac{\partial f_2}{\partial w_2} & \tfrac{\partial f_2}{\partial w_3} \\ \tfrac{\partial f_3}{\partial w_1} & \tfrac{\partial f_3}{\partial w_2} & \tfrac{\partial f_3}{\partial w_3} \\ \end{pmatrix} \end{equation*}We could also represent $\mathbf{J}$ as:
\begin{equation*} \mathbf{J} = \begin{pmatrix} \nabla^T f_1 \\ \nabla^T f_2 \\ \nabla^T f_3 \\ \end{pmatrix} \end{equation*}In this notation, $\nabla^T f_1$ is the row-vector form of the gradient of component $f_1$ computed with respect to $\mathbf{w}$
When we write out $f(\mathbf{w})$ as $[2w_1, 2w_2, 2w_3]$ it's easy to see that:
\begin{equation*} \tfrac{d \mathbf{f}}{d \mathbf{w}} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0\\ 0 & 0 & 2 \\ \end{pmatrix} \end{equation*}
Now consider a matrix $\mathbf{A}$ containing the row vectors [1,1,0] and [3,0,2], and the function $f(\mathbf{w}) = \mathbf{A}\mathbf{w}^2$, where the notation $\mathbf{w}^2$ represents element-wise squaring of the components of $\mathbf{w}$. Calculate $\tfrac{d \mathbf{f}}{d \mathbf{w}}$.
Below are a few useful shortcuts that you are welcome to use:
- $\tfrac{d}{d\mathbf{w}}\big(\mathbf{X}\mathbf{w}\big) = \mathbf{X}$
- $\tfrac{d}{d\mathbf{w}}\big(\mathbf{w}^T\mathbf{w}\big) = 2\mathbf{w}^T$
- $\tfrac{d}{d\mathbf{w}}\big(\mathbf{w}^T\mathbf{A}\mathbf{w}\big) = \mathbf{w}^T(\mathbf{A} + \mathbf{A}^T)$
You may also use the chain rule, $\tfrac{d \mathbf{f}}{d \mathbf{w}} = \tfrac{d \mathbf{f}}{d \mathbf{v}} \tfrac{d \mathbf{v}}{d \mathbf{w}}$, when $\mathbf{f}$ is defined through an intermediate vector $\mathbf{v}$.
Note that you should be comfortable "proving" any of these shortcuts in a finite dimensional setting to verify that it is accurate.
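For example, the shortcut $\tfrac{d}{d\mathbf{w}}(\mathbf{X}\mathbf{w}) = \mathbf{X}$ can be checked numerically in a small setting (a sketch using a made-up 4 x 3 matrix, an arbitrary point, and a finite-difference approximation of each column of the Jacobian):
## Numerically check that the Jacobian of f(w) = X w is X (made-up small example)
X_check = np.array([[1.0, 2.0, 0.0],
                    [0.0, 1.0, 3.0],
                    [2.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0]])
w0 = np.array([0.5, -1.0, 2.0])
h = 1e-6
## Column j of the Jacobian is the change in X w when w_j is perturbed
J = np.column_stack([(np.matmul(X_check, w0 + h * e) - np.matmul(X_check, w0 - h * e)) / (2 * h)
                     for e in np.eye(3)])
print(np.allclose(J, X_check))   ## True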
Let's suppose we have the function:
$\mathbf{f}(\mathbf{w}) = exp(\mathbf{v})$ where $\mathbf{v} = \mathbf{X}\mathbf{w}$
In this notation, the exponential function, $f(x) = e^x$, is applied element-wise to each component of the vector $\mathbf{v}$. In other words, $exp(\mathbf{v}) = [exp(v_1), exp(v_2), \ldots]$.
Further, let's define $\mathbf{w} = [w_0, w_1, w_2]$ and let $\mathbf{X}$ be an $n$ by $3$ matrix.
If we wanted to differentiate $\mathbf{f}$ with respect to $\mathbf{w}$, we could substitute $\mathbf{X}\mathbf{w}$ for $\mathbf{v}$ and attempt to calculate each partial derivative, or we could use the chain rule to argue:
$\tfrac{d \mathbf{f}}{d \mathbf{w}} = \tfrac{d \mathbf{f}}{d \mathbf{v}} \tfrac{d \mathbf{v}}{d \mathbf{w}}$
This approach is often easier, since it's straightforward to differentiate $exp(\mathbf{v})$ with respect to each component of the vector $\mathbf{v}$.
Doing so yields:
\begin{equation*} \tfrac{d \mathbf{f}}{d \mathbf{v}} = \begin{pmatrix} exp(v_1) & 0 & \cdots & 0 \\ 0 & exp(v_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & exp(v_n) \end{pmatrix} = diag(exp(\mathbf{v})) \end{equation*}Remember that we ended up with a matrix here because we must find the partial derivative of each component of $exp(\mathbf{v})$ (there are $n$ of them) with respect to each component of $\mathbf{v}$ (there are also $n$ of them). However, the procedure was not difficult since each component of $\mathbf{v}$ appears in only one component of $exp(\mathbf{v})$.
Since our matrix derivative shortcuts tell us that $\tfrac{d \mathbf{v}}{d \mathbf{w}}$ is $\mathbf{X}$, we have: $\tfrac{d \mathbf{f}}{d \mathbf{w}} = diag(exp(\mathbf{v}))\mathbf{X}$
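As before, we can sanity-check this result with finite differences. The sketch below uses a small made-up X and w and compares $diag(exp(\mathbf{v}))\mathbf{X}$ to a numerical approximation of the Jacobian of $exp(\mathbf{X}\mathbf{w})$:
## Numerically check d f / d w = diag(exp(v)) X for f(w) = exp(X w)  (made-up small example)
X_small = np.array([[1.0, 0.5, 0.0],
                    [0.0, 1.0, 0.5],
                    [0.5, 0.0, 1.0],
                    [1.0, 1.0, 1.0]])
w0 = np.array([0.1, -0.2, 0.3])
v = np.matmul(X_small, w0)
h = 1e-6
## Analytical Jacobian from the chain rule
J_chain = np.matmul(np.diag(np.exp(v)), X_small)
## Finite-difference Jacobian, one column per component of w
J_fd = np.column_stack([(np.exp(np.matmul(X_small, w0 + h * e)) - np.exp(np.matmul(X_small, w0 - h * e))) / (2 * h)
                        for e in np.eye(3)])
print(np.allclose(J_chain, J_fd))   ## True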