Lab 10 - Linear Algebra Review¶

This lab will review some key mathematical concepts that are necessary to understand the optimization methods behind many of the most powerful machine learning algorithms. The only library we will need is numpy:

In [1]:

import numpy as np

Part 1 - Matrix Notation and Basic Operations¶

As you're aware, matrices are used to represent data in a rectangular array of rows and columns. In our course, we'll use $i \in \{1, 2, \ldots, n\}$ to index the rows of a matrix and $j \in \{1, 2, \ldots, p\}$ to index the columns. We will use bolded upper case letters to denote that an object is a matrix (ie: $\mathbf{X}$ is a matrix, while $X$ is a variable).

The code below uses numpy to create $\mathbf{X}$, which is an $4$ x $2$ matrix of random integer values, and $\mathbf{y}$, which is a $4$ by $1$ matrix. Note that we typically use the term "vector" to describe matrices with either 1 row or 1 column.

In [2]:

## Create X and y
X = np.random.randint(10, size=(4,2))
y = np.random.randint(10, size=(4,1))

## Print element 2,1 in X (remembering zero-indexing in Python)
print(X[1,0])

The subsections that follow will cover the following operations involving $\mathbf{X}$ and $\mathbf{y}$ (and other matrices/vectors as needed):

Transposition
Addition and subtraction
Multiplication

Transposition¶

Transposition switches the rows and columns of a matrix. If our original matrix is $\mathbf{X}$, we'll use $\mathbf{X}^T$ to denote the transpose of $\mathbf{X}$.

We can use the .T attribute of numpy arrays to find the transpose of a matrix in Python:

In [3]:

## Transpose of X
print(X.T)

[[5 1 1 3]
 [7 8 6 9]]

Notice that X.T is a $2$ x $4$ matrix whose rows previously were the columns of X.

Addition and Subtraction¶

There are two distinct types of addition and subtraction that we must be aware of when working with matrices and vectors:

Scalar - Matrix
Matrix - Matrix

In the former, the scalar value is added or subtracted from every element in the matrix. For example:

In [4]:

## Matrix plus a scalar
print(X + 3)

[[ 8 10]
 [ 4 11]
 [ 4  9]
 [ 6 12]]

In the latter, values are added elementwise such that $(\mathbf{A} + \mathbf{B})_{ij} = \mathbf{A}_{ij} + \mathbf{B}_{ij}$. For example:

In [5]:

## Matrix plus a matrix
A = np.asarray([[-10,10],[0,0],[0,0],[0,0]])
A + X

Out[5]:

array([[-5, 17],
       [ 1,  8],
       [ 1,  6],
       [ 3,  9]])

Note that matrix - matrix addition/subtraction is only possible when the dimensions of each matrix are identical.

Multiplication¶

Like addition and subtraction, there are two types of multiplication we need to be aware of:

Scalar - Matrix
Matrix - Matrix

In the former, each element of the matrix is simply multiplied by the scalar. For example:

In [6]:

## Scalar times a matrix
10*X

Out[6]:

array([[50, 70],
       [10, 80],
       [10, 60],
       [30, 90]])

Matrix - matrix multiplication is more complex, and before describing it we will introduce the term inner product, which we might at times call the "dot product" (though mathematicians would describe the dot product as a special case of the inner product).

Consider two $n$ x $1$ vectors, $\mathbf{a}$ and $\mathbf{b}$, their inner product (dot product) is defined as: $$\mathbf{a}^T\mathbf{b} = \sum_{i=1}^{n}a_i\cdot b_i$$

As an example, consider the vector y, which was defined at the start of this section:

In [7]:

## Note the shape of y^T
print(y.T)

## And of y itself
print(y)

[[3 4 4 6]]
[[3]
 [4]
 [4]
 [6]]

The inner product of y with itself (ie: $\mathbf{y}^T \mathbf{y}$) is defined as y[0]*y[0] + y[1]*y[1] + y[2]*y[2] + y[3]*y[3]

In [8]:

## Inner product "by hand"
y[0]*y[0] + y[1]*y[1] + y[2]*y[2] + y[3]*y[3]

Out[8]:

array([77])

We can verify this using the dot() function in numpy:

In [9]:

## Inner product using numpy
np.dot(y.T, y)

Out[9]:

array([[77]])

Now consider two matrices, $\mathbf{A}$ and $\mathbf{B}$ and let $\mathbf{AB}$ denote their product. Each element in $\mathbf{AB}$ is an inner product of the corresponding row from $\mathbf{A}$ and column from $\mathbf{B}$. In other words, $\mathbf{AB}_{ij}$ is the inner of the $i^{th}$ row of $\mathbf{A}$ and the $j^{th}$ column of $\mathbf{B}$.

Because of this definition, matrix multiplication is only possible when the number of columns in the left matrix equals the number of rows in the right matrix. The dot() function can also be used to perform matrix multiplication, and the examples below demonstrate compatible and incompatible dimensions:

In [10]:

## Incompatible dimensions, X is 4 x 2 and A is 4 x2
np.dot(X,A)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 2
      1 ## Incompatible dimensions, X is 4 x 2 and A is 4 x2
----> 2 np.dot(X,A)

File <__array_function__ internals>:200, in dot(*args, **kwargs)

ValueError: shapes (4,2) and (4,2) not aligned: 2 (dim 1) != 4 (dim 0)

In [11]:

## Compatible dimensions, X.T is 2 x 4 and A is 4 x
np.dot(X.T,A)

Out[11]:

array([[-50,  50],
       [-70,  70]])

Question #1: I encourage you to attempt these without using Python, then check your answers with Python if necessary.

Part A: Consider a 3 x 3 matrix, $\mathbf{M}$, consisting of the row vectors [1,0,1], [0,1,0], and [0,0,1]. Which elements of the transpose of this matrix contain the value 1?
Part B: Now consider a 2 x 3 matrix, $\mathbf{N}$, consisting of the row vectors [2,2,2] and [3,3,3]. Which of the following matrix multiplications are possible:
- I: $\mathbf{M}$ and $\mathbf{N}$
- II: $\mathbf{M}$ and $\mathbf{N}^T$
- III: $\mathbf{M}^T$ and $\mathbf{N}$
- IV: $\mathbf{M}^T$ and $\mathbf{N}^T$
- V: $\mathbf{M}$ and $\mathbf{M}^T$
- VI: $\mathbf{N}$ and $\mathbf{N}^T$
- VII: $\mathbf{N}^T$ and $\mathbf{N}$
Part C: What is the value of $(\mathbf{M}^T - \mathbf{M})_{33}$?
Part D: What is the value of $(\mathbf{M}^T \cdot \mathbf{M})_{33}$?

Part 2 - Matrix Algebra and Special Cases¶

Some of the rules you routinely use in algebra can be applied to matrices, but others cannot:

Let $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ denote matrices of compatible dimensions, and let $c$ denote a scalar. The following rules apply:

$\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$
$(\mathbf{A}\cdot \mathbf{B}) \cdot \mathbf{C} = \mathbf{A} \cdot (\mathbf{B} \cdot \mathbf{C})$
$\mathbf{A}\cdot(\mathbf{B} + \mathbf{C}) = \mathbf{A}\cdot\mathbf{B} + \mathbf{A}\cdot\mathbf{C}$
$c \cdot (\mathbf{A} + \mathbf{B}) = c\cdot\mathbf{A} + c\cdot\mathbf{B}$

Importantly, the following algebraic operations are not valid with matrices:

$\mathbf{A}\cdot \mathbf{B} \ne \mathbf{B} \cdot \mathbf{A}$
$\mathbf{A}\cdot(\mathbf{B} + \mathbf{C}) \ne (\mathbf{B} + \mathbf{C}) \cdot \mathbf{A}$

Special Cases¶

There are three special cases of matrices that each hold beneficial properties which you should be aware of:

Symmetric Matrices
Diagonal Matrices
Identity Matrices

A symmetric matrix has the property $\mathbf{A} = \mathbf{A}^T$, meaning transposition does not change the matrix.

A diagonal matrix is a symmetric matrix with zeros in all of its off-diagonal elements.

An identity matrix is a diagonal matrix with ones in all of its on-diagonal elements. Identity matrices, which we'll denote as $\mathbf{I}$, have the property: $\mathbf{A}\cdot \mathbf{I} = \mathbf{I} \cdot \mathbf{A} = \mathbf{A}$. In other words, they do not change a matrix when multiplying it.

Inversion¶

In algebra we use the division operation to undo multiplication. The matrix equivalent is an operation known as inversion. Consider the matrix equation: $$\mathbf{A} \cdot \mathbf{x} = \mathbf{B}$$

To solve this equation for $\mathbf{x}$ we need to undo the multiplication of $\mathbf{A}$ with $\mathbf{x}$. We do this by multiplying both sides of the equation by $\mathbf{A}^{-1}$, the inverse of $\mathbf{A}$, on the left, which yields: $$\mathbf{x} = \mathbf{A}^{-1} \cdot \mathbf{B}$$

It is essential to know that not all matrices have an inverse. First, only square matrices can have inverses, and these matrices will only be invertible if their component vectors are linearly independent. In our course you may assume that any square matrix you are asked to work with has an inverse, but more generally you should be aware that "wide" data (more columns than rows) can lead to problems in algorithms that require matrix inversions.

We will use the numpy function linalg.inv() to perform matrix inversions.

Additional Properties¶

To close off this section, below are a few additional matrix properties involving inverses, transposes, and other operations:

$(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$
$(\mathbf{A} \cdot \mathbf{B})^T = \mathbf{B} \cdot \mathbf{A}^T$
$(\mathbf{A} \cdot \mathbf{B})^{-1} = \mathbf{B}^{-1} \cdot \mathbf{A}^{-1}$
$(\mathbf{A}^{T})^{-1} = (\mathbf{A}^{-1})^{T}$

Question #2: Like the previous question, you are encouraged to attempt Parts A and B without Python, but you may use Python to check your work. You must use Python on Part C.

Part A: Consider an $n$ x $p$ matrix, $\mathbf{M}$, where $n \ne p$. is the matrix $\mathbf{M}^T \mathbf{M}$ symmetric? Explain your reasoning.
Part B: Let $\mathbf{M}$ be an $n$ x $p$ matrix, $\mathbf{a}$ be an $n$-dimensional vector, $\mathbf{b}$ be a $p$-dimensional vector, and $\mathbf{0}$ be an $n$-dimensional vector of zeroes. Solve the equation $\mathbf{M}^T (\mathbf{a} - \mathbf{M} \cdot \mathbf{b}) = \mathbf{0}$ for $\mathbf{b}$.
Part C: For the matrices and vectors defined in Part B, suppose we observe the data given below. Use linear algebra operations in numpy to find the values of each element of $\mathbf{b}$ in the equation given in Part B.

$$\mathbf{a} = [1,-1,2,3,-2,1,2] \quad \text{and} \quad \mathbf{M} = \begin{bmatrix} 1 & -1 \\ 2 & -1 \\ 1 & 2 \\ 2 & 2 \\ 1 & -2 \\ 2 & 0 \\ 0 & 2 \end{bmatrix}$$

Part 3 - Matrix Calculus¶

In machine learning, matrices and vectors are frequently used to represent the learnable parameters (weights and biases) in complex models like neural networks. The optimal values of these parameters are found by applying optimization methods to a cost function involving the model structure and training data. This optimization requires us to use calculus, which we will review using the examples in this section.

To begin, consider a 3-dimensional vector of weight parameters: $\mathbf{w} = [w_1, w_2, w_3]$ and a function involving this vector: $f(\mathbf{w}) = 2\mathbf{w}^T\mathbf{w}$

Notice that the output of this function is a scalar, and suppose we would like to find values of $w_1$, $w_2$, and $w_3$ that minimize $f()$. Calculus tells us that a minimum can only occur at a location where the partial derivatives of $f()$ with respect to each component of $\mathbf{w}$ are equal to zero.

We will use the term gradient to define the vector of partial derivatives of a function we are seeking to minimize with respect to each component of the parameter vector. In our example:

$$\nabla f(\mathbf{w}) = \bigg[\frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}, \frac{\partial f}{\partial w_3}\bigg]$$

Our example is simple enough that we can write out $f(\mathbf{w})$ and find each partial derivative from the resulting expression: $$f(\mathbf{w}) = 2(w_1^2 + w_2^2 + w_3^2)$$

We see $\frac{\partial f}{\partial w_1} = 4w_1$, $\frac{\partial f}{\partial w_2} = 4w_2$, and $\frac{\partial f}{\partial w_3}=4w_3$, so we can write: $\nabla f(\mathbf{w}) = 4\mathbf{w} \equiv 4\mathbf{w}^T$

In this course we will not worry about whether we set things up such that our gradient vector is a single row or single column.

The Jacobian Matrix¶

In our previous example, the function $f(\mathbf{w})$ mapped out weight vector to a scalar output, which allowed us to consider just a single gradient vector. However, some functions we encounter will map our weight vector to another vector, for example: $f(\mathbf{w}) = 2\mathbf{w}$

The Jacobian matrix, or $\frac{\partial \mathbf{f}}{\partial \mathbf{w}}$, is the matrix of all partial derivatives (each output: $f_1, f_2, \ldots$ with respect to each parameter: $w_1, w_2, \ldots$). In our example $f(\mathbf{w})$ is 3-dimensional, as is $\mathbf{w}$, so we may write out the Jacobian as:

$$ \frac{\partial \mathbf{f}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial f_1}{\partial w_1} & \frac{\partial f_1}{\partial w_2} & \frac{\partial f_1}{\partial w_3} \\ \frac{\partial f_2}{\partial w_1} & \frac{\partial f_2}{\partial w_2} & \frac{\partial f_2}{\partial w_3} \\ \frac{\partial f_3}{\partial w_1} & \frac{\partial f_3}{\partial w_2} & \frac{\partial f_3}{\partial w_3} \end{bmatrix} $$

We could also represent this Jacobian as follows:

$$ \frac{\partial \mathbf{f}}{\partial \mathbf{w}} = \begin{bmatrix} \nabla^T f_1 \\ \nabla^T f_2 \\ \nabla^T f_3 \\ \end{bmatrix} $$

In this representation $\nabla^T f_1$ is the row-vector form of the gradient of the component $f_1$ computed with respect to the parameter vector $\mathbf{w}$.

In our example we can use matrix algebra to recognize that: $f(\mathbf{w})=2\mathbf{w}=[2w_1, 2w_2, 2w_3]$, thus:

$$ \frac{\partial \mathbf{f}}{\partial \mathbf{w}} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix} $$

Question #3:

Part A: Consider the parameter vector $\mathbf{w} = [w_1, w_2, w_3]$ and a matrix of constants $\mathbf{A}$ consisting of the row vectors $[1, 1, 0]$ and $[3, 0, 2]$. Let $f(\mathbf{w}) = \mathbf{A} \mathbf{w}^2$, where $\mathbf{w}^2$ contains the squared elements of $\mathbf{w}$ (ie: $[w_1^2, w_2^2, w_3^2]$). Find the Jacobian $\frac{\partial \mathbf{f}}{\partial \mathbf{w}}$.
Part B: Consider $\mathbf{w} = [w_1, w_2, w_3]$ and the function $f(\mathbf{w}) = w_1 + \mathbf{w} \cdot \mathbf{I} \cdot [1, 0, 1]^T$ where $\mathbf{I}$ is the identity matrix. Find the Jacobian $\frac{\partial \mathbf{f}}{\partial \mathbf{w}}$.

Part 4 - Derivative Shortcuts and Chain Rule¶

While it is sometimes helpful to consider each individual element of a matrix or vector when finding the gradient or Jacobian, there are several shortcuts that you should be aware of:

$\text{For } f(\mathbf{x}) = c \cdot \mathbf{x}^T \mathbf{x}, \quad \frac{df}{d\mathbf{x}} = 2c \mathbf{x}$
$\text{For } f(\mathbf{x}) = c \cdot \mathbf{x}^T A \mathbf{x}, \quad \frac{df}{d\mathbf{x}} = 2c A \mathbf{x}$
$\text{For } f(\mathbf{x}) = c \cdot \mathbf{x}^T \mathbf{a}, \quad \frac{df}{d\mathbf{x}} = c \mathbf{a}$
$\text{For } f(\mathbf{x}) = c \cdot \mathbf{x}^T A, \quad \frac{df}{d\mathbf{x}} = c A$

You might note that for "squared" forms we may keep the parameter vector on either side:

$\text{For } f(\mathbf{x}) = c \cdot \mathbf{x}^T A \mathbf{x} \quad \frac{df}{d\mathbf{x}} = 2c \mathbf{x}^T A$
$\text{For } f(\mathbf{x}) = c \cdot \mathbf{x}^T \mathbf{x} \quad \frac{df}{d\mathbf{x}} = 2c \mathbf{x}^T$

Now we will introduce chain rule, which is one of the most important concepts used in machine learning optimization (namely the backpropagation algorithm, which is used to train neural networks).

Consider the function $f(\mathbf{w}) = exp(\mathbf{v})$ where $\mathbf{v} = \mathbf{X}\cdot \mathbf{w}$. Here the exponent operation, $e^{x}$, is applied element-wise to each component of $\mathbf{v}$. Furthermore, let $\mathbf{X}$ be an $n$ by $3$ matrix and $\mathbf{w}$ be a 3-dimensional vector of weights.

Suppose we would ultimately like to find $\frac{\partial \mathbf{f}}{\partial \mathbf{w}}$. One approach would be to substitute $\mathbf{X}\cdot\mathbf{v}$, find each resulting element, and calculate the partial derivatives. However, a better approach would be to express the differentiation in two steps using chain rule: $$\frac{\partial \mathbf{f}}{\partial \mathbf{w}} = \frac{\partial \mathbf{f}}{\partial \mathbf{v}} \frac{\partial \mathbf{v}}{\partial \mathbf{w}}$$

This approach is superior because $\frac{\partial \mathbf{f}}{\partial \mathbf{v}}$ is easy to find, it is $diag(e^\mathbf{v})$ if you recognize that each of the $n$ elements of $exp(\mathbf{v})$ has a non-zero derivative with respect to a single element of $\mathbf{v}$

It is also easy to find $\frac{\partial \mathbf{v}}{\partial \mathbf{w}}$, which we can do by applying the matrix derivative short cuts to arrive at $\mathbf{X}$.

Thus, we have: $\frac{\partial \mathbf{f}}{\partial \mathbf{w}} = diag(e^\mathbf{v}) \cdot \mathbf{X}$

Question #4:

Part A: Consider the function $f(\mathbf{w}) = log\big(\frac{1}{1+exp(-\mathbf{x}^T \mathbf{w})}\big)$ where $\mathbf{x}$ is a $p$-dimensional vector of constants, and $\mathbf{w}$ is a $p$-dimensional vector of weight parameters. Use chain rule to find $\frac{\partial \mathbf{f}}{\partial \mathbf{w}}$. Hint: You should introduce your own notation to facilitate the use of chain rule.
Part B: Now suppose $f(\mathbf{w}) = \sum_{i=1}^{n}log\big(\frac{1}{1+exp(-\mathbf{x}_i^T \mathbf{w})}\big)$ where each $\mathbf{x}_i$ is a different $p$-dimensional vector of constants. Using your results from Part A, what is $\frac{\partial \mathbf{f}}{\partial \mathbf{w}}$ in this scenario? Hint: You may introduce some additional notation of your own here, and you do not need to find any new partial derivatives.