Pretty (and) fast matrix multiplication

Combining C++ and BLAS to get readable and efficient code

Assume we have square matrices $A$, $B$, and $C$ of dimension $N \times N$ and want to compute $$ C = A\,B\,, $$

or, in components, $$ C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj} \quad \forall i, j \in [0, N-1]\,. $$

The naïve implementation

#include <vector>

int constexpr N = 2000;

int main() {
    // assumed to be stored in column-major format;
    // std::vector value-initialises its elements, so C starts out as zero
    std::vector<double> A(N*N), B(N*N), C(N*N);

    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                C[i + N * j] += A[i + N * k] * B[k + N * j];
            }
        }
    }

    return 0;
}

This code is quite readable – it is not too hard to see that it computes a matrix product – but it is certainly not pretty. Incidentally, it is also not fast. This can be seen by, e.g., timing it with the <chrono> library:

#include <vector>
#include <chrono>
#include <iostream>

int constexpr N = 2000;

int main() {
    // assumed to be stored in column-major format
    std::vector<double> A(N*N), B(N*N), C(N*N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                C[i + N * j] += A[i + N * k] * B[k + N * j];
            }
        }
    }
    auto timerStop  = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}

Running on my machine, this results in

Run time: 27493ms

A cache-friendlier version

Taking a page from the book “Matrix Computations” by Golub and Van Loan1, we should reorder those loops to make better use of the CPU cache. In particular, the code

#include <vector>
#include <chrono>
#include <iostream>

int constexpr N = 2000;

int main() {
    // assumed to be stored in column-major format
    std::vector<double> A(N*N), B(N*N), C(N*N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < N; ++j) {
        for (int k = 0; k < N; ++k) {
            for (int i = 0; i < N; ++i) {
                C[i + N * j] += A[i + N * k] * B[k + N * j];
            }
        }
    }
    auto timerStop  = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}

accesses the elements of both A and C sequentially (consecutive values of i address consecutive memory locations in column-major storage), while B[k + N * j] is constant within the inner loop. The new execution time after this simple change is

Run time: 5250ms

So the code is now faster, but still not pretty.

Here comes Eigen

You may be asking yourself at this point: “What about Eigen?” Using Eigen, the code is considerably clearer,

#include <chrono>
#include <iostream>
#include <Eigen/Core>

int constexpr N = 2000;

int main() {
    Eigen::MatrixXd A(N, N), B(N, N), C(N, N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    C = A * B;
    auto timerStop  = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}

and also faster:

Run time: 1020ms

From our initial version, we have gained a speedup factor of $\sim 27$ and a “pretty” factor of $+ \infty$.

This is, of course, not the end of the story. Eigen is a well-known library that has been around for a while now. If I were here to tell people to use it, I’d be a few years too late.

But how fast can we really go?

Those of you who are into the numerical linear algebra business are undoubtedly aware of BLAS. It is a FORTRAN library2 that provides a standard set of Basic Linear Algebra Subprograms (hence the name). However, BLAS was first standardised in the 1970s, and its subroutines are meant to be as generic as they can be. In particular, the matrix multiplication routine for double-precision floating point numbers, dgemm, can be used to perform any of the following operations: $$C = \alpha A\,B + \beta C\,,$$ $$C = \alpha A^T B + \beta C\,,$$ $$C = \alpha A\,B^T + \beta C\,,$$ $$C = \alpha A^T B^T + \beta C\,,$$

as well as multiplications of blocks of $A$ and $B$. Hermitian conjugation is also a possibility when complex matrices are used. Of course, with great ~~power~~ generality comes ~~great responsibility~~ a large number of parameters.

Add to that the fact that FORTRAN arguments are always passed by reference, and our simple code would, using BLAS, read

#include <vector>
#include <chrono>
#include <iostream>

// declaration of the FORTRAN routine, see footnote 2
extern "C" {
    void dgemm_(char const *transa, char const *transb, int const *m,
        int const *n, int const *k, double const *alpha, double const *a,
        int const *lda, double const *b, int const *ldb, double const *beta,
        double *c, int const *ldc);
}

int constexpr N = 2000;
char constexpr ta = 'N';

int main() {
    // assumed to be stored in column-major format
    std::vector<double> A(N*N), B(N*N), C(N*N);

    double alpha = 1.0;
    double beta  = 0.0;

    auto timerStart = std::chrono::high_resolution_clock::now();

    dgemm_(&ta, &ta, &N, &N, &N, &alpha, A.data(), &N, B.data(), &N, &beta, C.data(), &N);

    auto timerStop  = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}

The details of what each parameter does are not important for this post; the reader is encouraged to have a look at the BLAS documentation. But let us get down to business: how long does this matrix multiplication take?

For this test, I have used the OpenBLAS implementation of BLAS.

Run time: 297ms

Note that OpenBLAS uses OpenMP by default, when available. For this test I ran the executable with OMP_NUM_THREADS=1 in order to have a fair comparison with the other, single-threaded, methods.

This is a speedup factor, compared to the naïve implementation, of $\sim 90$! But it came at the cost of calling a cryptically named function with 13 parameters. So long, readability!

…or not?

Lazy evaluation to the rescue

One “problem” with dgemm_ is that the matrix $C$ that holds the result is passed as an argument. Assume we have a matrix structure

struct Matrix {
    std::vector<double> mat;

    Matrix(int N)
    : mat(N*N)
    { }
};

A “standard” implementation of operator* would have the signature

Matrix operator*(Matrix const& A, Matrix const& B);

In this case, the product $A\,B$, potentially computed with BLAS, would have to be written to some temporary matrix that is then returned by the function. This temporary would have to be (re-)allocated every time operator* is called via, e.g., A * B. In practice, the code would do something like tmp $\leftarrow$ A*B and then C $\leftarrow$ tmp, i.e., the contents of the temporary matrix would be copied or moved into C. After that step, tmp is de-allocated.
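For concreteness, here is a minimal sketch of what such a temporary-returning operator* could look like, assuming Matrix also stores its size (as the extended version below does) and that dgemm_ is declared as in footnote 2:

// Sketch of the temporary-returning approach; assumes Matrix has a size
// member and dgemm_ is declared as in footnote 2.
Matrix operator*(Matrix const& A, Matrix const& B) {
    char const ta = 'N';
    double const alpha = 1.0;
    double const beta  = 0.0;
    int const N = A.size;

    Matrix tmp(N);  // freshly allocated on every call
    dgemm_(&ta, &ta, &N, &N, &N, &alpha, A.mat.data(), &N,
        B.mat.data(), &N, &beta, tmp.mat.data(), &N);

    return tmp;  // moved (or copied) into the caller's C
}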

What we would like here is to have both operator* and operator= working together, such that all the information about the matrices involved, $A$, $B$, and $C$, is available to dgemm_. One way of accomplishing this is to make operator* return something that knows about the matrices being multiplied, but that does not actually perform the multiplication.

#include <vector>

// dgemm_ declared as before (see footnote 2)

struct Matrix;  // forward declaration

struct mul_op {
    Matrix const& A;
    Matrix const& B;
};

struct Matrix {
    std::vector<double> mat;
    int const size;

    Matrix(int N)
    : mat(N*N)
    , size{N}
    { }

    Matrix& operator=(mul_op mul);
};

mul_op operator*(Matrix const& A, Matrix const& B) { return mul_op{A, B}; }

char constexpr ta = 'N';
double constexpr alpha = 1.0;
double constexpr beta  = 0.0;

Matrix& Matrix::operator=(mul_op mul) {
    int const N = mul.A.size;

    dgemm_(&ta, &ta, &N, &N, &N, &alpha, mul.A.mat.data(), &N,
        mul.B.mat.data(), &N, &beta, this->mat.data(), &N);

    return *this;
}

Let us unpack this. struct mul_op is merely a container that holds references to the two matrices being multiplied. Because C++ is strongly typed, we could have similar structs for other operations involving two matrices without problems, as long as their names differ. Note that operator* now simply returns a mul_op and does nothing further: no actual multiplication takes place within it!
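As an aside, such a sibling type for another operation could be sketched as follows (add_op and its operator+ are illustrative names, not part of the code above):

// Illustrative sketch: a deferred element-wise sum, analogous to mul_op.
struct add_op {
    Matrix const& A;
    Matrix const& B;
};

add_op operator+(Matrix const& A, Matrix const& B) { return add_op{A, B}; }

// A matching Matrix::operator=(add_op) would then write A + B directly
// into the target matrix, again without temporaries.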

When operator= is called, it receives a mul_op that carries the information necessary for the multiplication. operator= itself then calls dgemm_, passing its own storage as the output matrix. Using the code above, we can finally run

#include <chrono>
#include <iostream>

int constexpr N = 2000;

int main() {
    Matrix A(N), B(N), C(N);

    auto timerStart = std::chrono::high_resolution_clock::now();
    C = A * B;
    auto timerStop  = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(timerStop - timerStart);
    std::cout << "Run time: " << duration.count() << "ms\n";

    return 0;
}

and get

Run time: 286ms

We get the crazy efficiency of BLAS combined with code just as readable as if it used Eigen.

I took a particularly simple case here, the product of two square matrices, as an example. The idea shown in this last section can, with a bit of effort, be expanded to deal with the general case handled by BLAS of $C = \alpha\, op(A)\, op(B) + \beta C$, where $op(X)$ can be $X$, $X^T$, or $X^\dagger$.
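As a rough sketch of one possible direction (the transposed wrapper and the flags below are illustrative, not taken from the code above), one could tag each operand with the transposition character that dgemm_ expects:

// Lightweight view marking a matrix as transposed; no data is touched.
struct transposed {
    Matrix const& M;
};

transposed transpose(Matrix const& M) { return transposed{M}; }

// mul_op extended with the flags dgemm_ expects ('N' or 'T'),
// replacing the two-member version above.
struct mul_op {
    Matrix const& A;
    Matrix const& B;
    char transA;
    char transB;
};

mul_op operator*(Matrix const& A, Matrix const& B) { return {A,   B,   'N', 'N'}; }
mul_op operator*(transposed A,    Matrix const& B) { return {A.M, B,   'T', 'N'}; }
mul_op operator*(Matrix const& A, transposed B)    { return {A,   B.M, 'N', 'T'}; }
mul_op operator*(transposed A,    transposed B)    { return {A.M, B.M, 'T', 'T'}; }

Matrix::operator=(mul_op) would then forward mul.transA and mul.transB as the first two arguments of dgemm_, so that C = transpose(A) * B computes $C = A^T\,B$ without ever forming $A^T$ explicitly.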

Eigen makes extensive use of lazy evaluation to achieve the flexibility and readability it is well known for. It is also important to keep in mind that Eigen already provides a way to use BLAS as a backend; see the Eigen documentation on using BLAS/LAPACK from Eigen.
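For example, defining EIGEN_USE_BLAS before including Eigen (and linking against a BLAS library) should make Eigen dispatch its large matrix products to BLAS routines:

// With this macro defined (and a BLAS implementation linked in), Eigen
// forwards supported operations, including matrix products, to BLAS.
#define EIGEN_USE_BLAS
#include <Eigen/Core>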


  1. Matrix Computations (Golub, Gene H. and Van Loan, Charles F.), The Johns Hopkins University Press, 2012. ↩︎

  2. I am not aware of C headers that declare the FORTRAN functions directly. CBLAS is a wrapper around BLAS that allows the user to choose between row- and column-major storage format for the matrices; BLAS itself only supports column-major. My usual approach is to simply write the function declarations myself from the BLAS documentation for the functions I need. Here is the declaration for dgemm_ that is used in this post:

    extern "C" {
        void dgemm_(char const *transa, char const *transb, int const *m,
            int const *n, int const *k, double const *alpha, double const *a,
            int const *lda, double const *b, int const *ldb, double const *beta,
            double *c, int const *ldc);
    }
    
     ↩︎