The purpose of this lab is to cover notations and concepts from linear algebra and calculus that will be used in our study of machine learning. The early sections may be simple, but the notation they introduce will be important.
Matrices are structures that organize data in a rectangular array consisting of rows and columns. A matrix containing $r$ rows and $c$ columns is often described as an $r$ x $c$ matrix.
For example, the numpy array X that is created below is a $4$ x $2$ matrix:
import numpy as np
## Random matrix of predictors using n = 4, p = 2
X = np.random.randint(10, size=(4, 2))
print(X)
[[0 1]
 [9 4]
 [7 5]
 [7 8]]
## Print the shape of X
print(X.shape)
(4, 2)
We will use bolded capital letters to denote a matrix, so I might reference our example matrix, X, using the notation: $\mathbf{X}$.
A matrix consisting of a single row or column can be described as a vector. Below we set up a $4$ x $1$ vector, y:
## Random y
y = np.random.randint(10, size = (4,1))
print(y)
[[2]
 [5]
 [3]
 [5]]
I will use bolded lower case letters to denote a vector, so I might reference this example using the notation: $\mathbf{y}$.
Some calculations will require us to switch the rows and columns of a matrix using an operation known as transposition.
For example, if we switch the rows and columns of $\mathbf{X}$, the resulting matrix is known as the transpose of $\mathbf{X}$, and we will denote it using the notation: $\mathbf{X}^T$, where the superscript $T$ indicates that $\mathbf{X}$ has been transposed.
numpy arrays have an attribute, T, that we can use to access their transpose:
## Print transpose of X
print(X.T)
[[0 9 7 7]
 [1 4 5 8]]
In this example, we can see that the first column of $\mathbf{X}$ is now the first row of $\mathbf{X}^T$, the second column of $\mathbf{X}$ is now the second row of $\mathbf{X}^T$, and so on.
We will frequently consider two different types of addition and subtraction:
The first adds (or subtracts) a scalar to each element of the matrix. For example:
3 + X
array([[ 3,  4],
       [12,  7],
       [10,  8],
       [10, 11]])
In this example, notice how each entry in X was increased by 3.
In contrast, matrix-matrix addition adds corresponding elements, or $(\mathbf{A} + \mathbf{B})_{ij} = \mathbf{A}_{ij} + \mathbf{B}_{ij}$. For example:
A = np.asarray([[-10,10],[0,0],[0,0],[0,0]])
A + X
array([[-10,  11],
       [  9,   4],
       [  7,   5],
       [  7,   8]])
Notice how the first row of X was impacted by this operation, while the other rows were not (due to the zeros in A).
Similar to addition and subtraction, multiplication can also be performed in two ways:
The first will multiply each element of the matrix by the scalar. For example:
10*X
array([[ 0, 10],
       [90, 40],
       [70, 50],
       [70, 80]])
Matrix multiplication is more complex and requires us to define an inner product. For the purposes of this course, we treat the terms inner product and dot product interchangeably, though pure mathematicians might note that a dot product is a special case of an inner product.
Next, consider two $n$ x $1$ vectors, $\mathbf{a}$ and $\mathbf{b}$. The inner product (or dot product) of these vectors is defined as $\mathbf{a}^T \mathbf{b} = \sum_{i=1}^{n}a_ib_i$.
For example, consider y.T and y shown below:
print(y.T)
print(y)
[[2 5 3 5]]
[[2]
 [5]
 [3]
 [5]]
The dot product $\mathbf{y}^T\mathbf{y}$ is $2*2 + 5*5 + 3*3 + 5*5 = 63$.
We can verify this calculation using the dot function in numpy:
np.dot(y.T, y)
array([[63]])
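Note that numpy also provides the @ operator for matrix multiplication, which should give the same result for the y shown above:
## Equivalent calculation using the @ operator
y.T @ y
array([[63]])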
Now consider two matrices, $\mathbf{A}$ and $\mathbf{B}$. The product, $\mathbf{A}\mathbf{B}$, of these matrices is defined by each combination of inner products between the rows of $\mathbf{A}$ and the columns of $\mathbf{B}$.
It is essential for the dimensions of $\mathbf{A}$ and $\mathbf{B}$ to be compatible for their matrix-matrix product to exist; specifically, the number of columns of $\mathbf{A}$ must equal the number of rows of $\mathbf{B}$.
Consider the following examples of matrix multiplication performed using matmul in numpy:
## Example 1 - Valid matrix multiplication
np.matmul(X.T, A)
array([[  0,   0],
       [-10,  10]])
## Example 2 - Incompatible dimensions
np.matmul(X, A)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_32428\2518866266.py in <module>
      1 ## Example 2 - Incompatible dimensions
----> 2 np.matmul(X, A)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 2)
In Example 1, note that the first row vector of X.T is [0, 9, 7, 7] and the first column vector of A is [-10, 0, 0, 0]. The inner product of these two vectors is $0*-10 + 0 + 0 + 0 = 0$, which we can see is the element in position 1,1 of the product $\mathbf{X}^T\mathbf{A}$.
Generalizing this, the element in position $i$,$j$ of the matrix product $\mathbf{A}\mathbf{B}$ is given by the inner product of the $i^{th}$ row vector of $\mathbf{A}$ and the $j^{th}$ column vector of $\mathbf{B}$.
In Example 2, the dimensions of X are 4 x 2, so each of its row vectors has a length of 2. The dimensions of A are also 4 x 2, so each of its column vectors has a length of 4. Because the dot product calculation relies on each vector being equal in length, matrix multiplication is not possible in this example.
Question: Consider a matrix, stored as M1, containing the row vectors [1,0,1], [0,1,0], and [0,0,1], and another matrix, stored as M2, containing the row vectors [2,2,2] and [3,3,3]. Determine which of the following products can be computed, and find those that can:
M1 and M2
M1 and M2.T
M2.T and M1
M2 and M1
M2.T and M2
M2 and M2.T
M2.T*M2
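If you'd like to check your answers in numpy, a minimal sketch that sets up the two matrices from the question is given below; you can then try each product with np.matmul and see which ones raise an error:
## Set up the matrices from the question
M1 = np.array([[1,0,1],[0,1,0],[0,0,1]])
M2 = np.array([[2,2,2],[3,3,3]])

## Check the dimensions before attempting each product
print(M1.shape)   ## (3, 3)
print(M2.shape)   ## (2, 3)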
Several rules from algebra also apply to matrix algebra.
Let $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ denote matrices and $c$ denote a scalar, then:
$\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$ (matrix addition is commutative)
$(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$ (matrix addition is associative)
$(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})$ (matrix multiplication is associative)
$\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$ and $(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}$ (multiplication distributes over addition)
$c(\mathbf{A} + \mathbf{B}) = c\mathbf{A} + c\mathbf{B}$ (scalar multiplication distributes over addition)
While the rules described above mirror those in algebra, there are a few algebraic operations that are not valid for matrices (at least in general settings; there may be special cases).
The largest difference is that in matrix algebra the order of multiplication matters, so multiplying "on the left" is not the same as multiplying "on the right".
We can easily verify any of these rules using numeric examples:
## Two simple matrices
X = np.array([[2,2],[2,2]])
Y = np.array([[1,2],[3,4]])
## X*Y
print(np.matmul(X, Y))
## Y*X (not the same as X*Y)
print(np.matmul(Y, X))
[[ 8 12]
 [ 8 12]]
[[ 6  6]
 [14 14]]
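As another quick check, the sketch below verifies the distributive rule $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$ using the same X and Y along with a third matrix, Z, chosen arbitrarily for this example:
## A third matrix chosen arbitrarily
Z = np.array([[5,0],[1,2]])

## Both lines should print the same 2 x 2 matrix
print(np.matmul(X, Y + Z))
print(np.matmul(X, Y) + np.matmul(X, Z))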
There are a few special types of matrices that we should be familiar with:
A symmetric matrix has the property $\mathbf{A}^T = \mathbf{A}$. All symmetric matrices are square, meaning they have the same number of rows and columns.
A special type of symmetric matrix is a diagonal matrix, which is defined by having zeros in all of its off-diagonal elements. We will use the notation $diag(\mathbf{v})$ to denote a diagonal matrix where the values of the vector $\mathbf{v}$ populate the diagonal.
The code given below displays an example of a diagonal matrix, which can be created using np.diag:
v = [1,2,3,4]
V = np.diag(v)
print(V)
[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]
An identity matrix is a special case of the diagonal matrix where every diagonal element is 1. Identity matrices, which we will denote $\mathbf{I}$, have the property: $\mathbf{A}\mathbf{I} = \mathbf{I}\mathbf{A} = \mathbf{A}$
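As a quick illustration of this property, np.eye creates an identity matrix; below we reuse the 2 x 2 matrix Y defined earlier (the results print as floats because np.eye returns a float array):
## 2 x 2 identity matrix
I = np.eye(2)

## Both products return Y
print(np.matmul(Y, I))
print(np.matmul(I, Y))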
One example where the identity matrix appears is the covariance matrix of the errors in linear regression:
\begin{equation*} Cov(\mathbf{\epsilon}) = \sigma^2 \mathbf{I} = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} \end{equation*}
This arises from the assumption that the errors are independent with a common variance, so we can factor the constant $\sigma^2$ out of the diagonal matrix.
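For instance, with an assumed value of $\sigma^2 = 2$ and $n = 4$ errors (both chosen purely for illustration), this covariance matrix could be constructed as:
## Assumed common variance, for illustration only
sigma2 = 2

## Covariance matrix sigma^2 * I for n = 4 errors
print(sigma2 * np.eye(4))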
In algebra we undo multiplication using division, but with matrices the equivalent of division (the operation that undoes multiplication) is achieved by multiplying both sides of an equation by an inverse matrix.
For example, if we'd like to solve $\mathbf{A}\mathbf{x} = \mathbf{B}$ for $\mathbf{x}$, we must multiply both sides of the equation on the left by $\mathbf{A}^{-1}$, the inverse of $\mathbf{A}$. This results in the solution $\mathbf{x} = \mathbf{A}^{-1}\mathbf{B}$.
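As a small sketch of this idea, the code below solves a 2 x 2 system two ways; the matrix A_ex and right-hand side b_ex are made up for illustration, and np.linalg.solve is generally preferred in practice over explicitly forming the inverse:
## A small invertible matrix and right-hand side (made up for illustration)
A_ex = np.array([[2., 1.], [1., 3.]])
b_ex = np.array([[5.], [10.]])

## Solution via the inverse: x = A^{-1} b
x1 = np.matmul(np.linalg.inv(A_ex), b_ex)

## Solution computed directly
x2 = np.linalg.solve(A_ex, b_ex)

## Both should print (approximately) x = (1, 3)
print(x1)
print(x2)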
Unfortunately, not all matrices have an inverse. First, only square matrices can have inverses. Second, a square matrix $\mathbf{A}$ is invertible if and only if its columns (equivalently, its rows) are linearly independent.
For the purposes of our class, we'll operate under the assumption that any square matrix we are working with is invertible. However, you should be aware that it is possible for certain collections of data to give rise to non-invertible matrices in certain calculations. A few common examples are: linearly dependent variables (e.g., a variable for height measured in inches and another for height measured in cm), or more variables than observations (a data matrix that is "wider" than it is "long").
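As a small illustration of the linear dependence problem, the made-up matrix below has a second column that is exactly twice its first, so numpy raises an error when we try to invert it:
## Columns are linearly dependent (second column = 2 * first column)
H = np.array([[1., 2.],
              [2., 4.]])

try:
    print(np.linalg.inv(H))
except np.linalg.LinAlgError as err:
    print("Not invertible:", err)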
The following are a few scattered matrix identities involving algebra, transposition, and inverses that will be useful to us:
$(\mathbf{A}^T)^T = \mathbf{A}$
$(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
$(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$
$\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$
$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ (when both inverses exist)
$(\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T$
You should be aware of these identities when performing matrix calculations.
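We can spot-check a couple of these identities numerically; the two small invertible matrices below are chosen arbitrarily for the check:
## Two small invertible matrices (chosen arbitrarily)
A2 = np.array([[1., 2.], [3., 5.]])
B2 = np.array([[2., 0.], [1., 1.]])

## (AB)^T should equal B^T A^T
print(np.allclose(np.matmul(A2, B2).T, np.matmul(B2.T, A2.T)))

## (AB)^{-1} should equal B^{-1} A^{-1}
print(np.allclose(np.linalg.inv(np.matmul(A2, B2)),
                  np.matmul(np.linalg.inv(B2), np.linalg.inv(A2))))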
We'll frequently use matrices and vectors to represent unknown model parameters that our methods will learn from the data. In this setting, a major component of machine learning is the application of optimization techniques from calculus. To introduce a few concepts we'll need, consider the 3-dimensional vector defined below:
$\mathbf{w} = [w_1, w_2, w_3]$
Let's consider a function of this vector:
$f(\mathbf{w}) = 2\mathbf{w}^T\mathbf{w}$
Next, let's suppose our goal is to find the values of $w_1$, $w_2$, and $w_3$ that minimize $f()$.
Calculus tells us that a minimum occurs at a location where the partial derivatives for each component of $\mathbf{w}$ all equal zero. We'll define the gradient as a vector of partial derivatives of our function with respect to each component of the parameter vector. In our example:
$$\nabla f(\mathbf{w}) = \bigg(\tfrac{\partial f}{\partial w_1}, \tfrac{\partial f}{\partial w_2}, \tfrac{\partial f}{\partial w_3}\bigg)$$
The symbol $\nabla$ is used to denote the gradient.
In this simple example, we can write out $f()$ using individual elements (rather than vectors), and then we can find each partial derivative:
$$f(\mathbf{w}) = 2(w_1^2 + w_2^2 + w_3^2)$$
We can see that $\tfrac{\partial f}{\partial w_1} = 4w_1$, $\tfrac{\partial f}{\partial w_2} = 4w_2$, and $\tfrac{\partial f}{\partial w_3} = 4w_3$, so we can write the gradient:
$$\nabla f(\mathbf{w}) = 4\mathbf{w} \equiv 4\mathbf{w}^T$$
For the purposes of our course, we won't worry about the row vs. column orientation of the gradient.
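If you'd like a numerical sanity check, the sketch below compares the gradient $4\mathbf{w}$ against a central finite-difference approximation at an arbitrarily chosen point (the point w0 and step size h are assumptions made for this illustration):
## f(w) = 2 w^T w, returned as a scalar
def f(w):
    return 2 * np.dot(w, w)

## Central finite-difference approximation of the gradient at w0
w0 = np.array([1.0, -2.0, 0.5])
h = 1e-6
approx_grad = np.array([(f(w0 + h*np.eye(3)[i]) - f(w0 - h*np.eye(3)[i])) / (2*h)
                        for i in range(3)])

## The approximation should be close to the exact gradient, 4w
print(approx_grad)
print(4 * w0)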
In our previous example, $f(\mathbf{w})$ mapped a 3-dimensional input, the vector $\mathbf{w}$, into a scalar output. This allowed us to have a single gradient vector. However, let's now consider the function: $f(\mathbf{w}) = 2\mathbf{w}$
This scenario is different because the output of $f$ is now a 3-dimensional vector.
The Jacobian matrix, $\mathbf{J} = \tfrac{d \mathbf{f}}{d \mathbf{w}}$, is defined as the matrix of all partial derivatives (ie: each output with respect to each input). In our example:
\begin{equation*} \mathbf{J} = \begin{pmatrix} \tfrac{\partial f_1}{\partial w_1} & \tfrac{\partial f_1}{\partial w_2} & \tfrac{\partial f_1}{\partial w_3} \\ \tfrac{\partial f_2}{\partial w_1} & \tfrac{\partial f_2}{\partial w_2} & \tfrac{\partial f_2}{\partial w_3} \\ \tfrac{\partial f_3}{\partial w_1} & \tfrac{\partial f_3}{\partial w_2} & \tfrac{\partial f_3}{\partial w_3} \\ \end{pmatrix} \end{equation*}
We could also represent $\mathbf{J}$ as:
\begin{equation*} \mathbf{J} = \begin{pmatrix} \nabla^T f_1 \\ \nabla^T f_2 \\ \nabla^T f_3 \\ \end{pmatrix} \end{equation*}
In this notation, $\nabla^T f_1$ is the row-vector form of the gradient of component $f_1$ computed with respect to $\mathbf{w}$.
When we write out $f(\mathbf{w})$ as $[2w_1, 2w_2, 2w_3]$ it's easy to see that:
\begin{equation*} \tfrac{d \mathbf{f}}{d \mathbf{w}} = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0\\ 0 & 0 & 2 \\ \end{pmatrix} \end{equation*}
Question: Consider a matrix, $\mathbf{A}$, containing the row vectors [1,1,0] and [3,0,2], and the function $f(\mathbf{w}) = \mathbf{A}\mathbf{w}^2$, where the notation $\mathbf{w}^2$ represents element-wise squaring of the components of $\mathbf{w}$. Calculate: $\tfrac{d \mathbf{f}}{d \mathbf{w}}$
Below are a few useful shortcuts that you are welcome to use:
$\tfrac{d}{d\mathbf{w}}(\mathbf{X}\mathbf{w}) = \mathbf{X}$
$\tfrac{d}{d\mathbf{w}}(\mathbf{w}^T\mathbf{w}) = 2\mathbf{w}^T$
$\tfrac{d}{d\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = \mathbf{w}^T(\mathbf{A} + \mathbf{A}^T)$
You may also use the chain rule, which we will demonstrate shortly. Note that you should be comfortable "proving" any of these shortcuts in a finite-dimensional setting to verify that they are accurate.
Let's suppose we have the function:
$\mathbf{f}(\mathbf{w}) = exp(\mathbf{v})$ where $\mathbf{v} = \mathbf{X}\mathbf{w}$
In this notation, the exponential function, $f(x) = e^x$, is applied element-wise to each component of the vector $\mathbf{v}$. In other words, $exp(\mathbf{v}) = [exp(v_1), exp(v_2), \ldots]$.
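In numpy, the np.exp function behaves exactly this way, applying the exponential to each element; a quick example:
## np.exp is applied to each element of the array
print(np.exp(np.array([0., 1., 2.])))   ## approximately [1.    2.718 7.389]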
Further, let's define $\mathbf{w} = [w_0, w_1, w_2]$ and let $\mathbf{X}$ be an $n$ by $3$ matrix.
If we wanted to differentiate $f$ with respect to $\mathbf{w}$, we could substitute $\mathbf{X}\mathbf{w}$ for $\mathbf{v}$ and attempt to calculate each partial derivative, or we could use the chain rule to argue:
$\tfrac{d \mathbf{f}}{d \mathbf{w}} = \tfrac{d \mathbf{f}}{d \mathbf{v}} \tfrac{d \mathbf{v}}{d \mathbf{w}}$
This approach might be a bit easier, since it's easy to differentiate $exp(\mathbf{v})$ with respect to each component of the vector $\mathbf{v}$.
Doing so yields:
\begin{equation*} \tfrac{d \mathbf{f}}{d \mathbf{v}} = \begin{pmatrix} exp(v_1) & 0 & \cdots & 0 \\ 0 & exp(v_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & exp(v_n) \end{pmatrix} = diag(exp(\mathbf{v})) \end{equation*}
Remember that we ended up with a matrix here because we must find the partial derivative of each component of $exp(\mathbf{v})$ (there are $n$ of them) with respect to each component of $\mathbf{v}$ (there are also $n$ of them). However, the procedure was not difficult since each component of $exp(\mathbf{v})$ depends on only one component of $\mathbf{v}$.
Since our matrix derivative shortcuts tell us that $\tfrac{d \mathbf{v}}{d \mathbf{w}}$ is $\mathbf{X}$, we have: $\tfrac{d \mathbf{f}}{d \mathbf{w}} = diag(exp(\mathbf{v}))\mathbf{X}$
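To convince ourselves that this chain-rule expression is correct, the sketch below builds $diag(exp(\mathbf{v}))\mathbf{X}$ for a small, made-up $\mathbf{X}$ and $\mathbf{w}$ (both chosen arbitrarily for illustration) and compares it against a finite-difference Jacobian:
## Small made-up X (n = 4 rows, 3 columns) and w
X_ex = np.array([[1., 0., 2.],
                 [0., 1., 1.],
                 [2., 1., 0.],
                 [1., 1., 1.]])
w_ex = np.array([0.5, -1.0, 0.25])

## f(w) = exp(Xw), applied element-wise
def f_vec(w):
    return np.exp(np.matmul(X_ex, w))

## Jacobian from the chain rule: diag(exp(v)) X, where v = Xw
v = np.matmul(X_ex, w_ex)
J_chain = np.matmul(np.diag(np.exp(v)), X_ex)

## Finite-difference Jacobian, built column by column, for comparison
h = 1e-6
J_approx = np.column_stack([(f_vec(w_ex + h*np.eye(3)[j]) - f_vec(w_ex - h*np.eye(3)[j])) / (2*h)
                            for j in range(3)])

## Should print True (agreement up to numerical error)
print(np.allclose(J_chain, J_approx))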