# A predictable SIMD library for GEMM routines

Iryna De Albuquerque Silva, Thomas Carle, Adrien Gauffriau, Victor Jegu, Claire Pagetti

RTAS, May 14th 2024



## Introduction

- General Matrix Multiplication (GEMM) used for scientific simulations and ML-based applications
- Heavily optimized in HPC domain
  - Blocked matrix multiplication (improved cache usage)
  - SIMD extensions (vector instructions)
- Safety-critical real-time systems ⇒ certification standards
  - Traceability
  - Timing-predictability
- Objective: optimized and certifiable GEMM library
  - Analyzable and traceable code
    - No dynamic memory management
    - No compiler optimization
  - Choice of blocking parameters to improve predictability

## **Blocked GEMM**

- Goal : compute  $C = A \cdot B + C$
- Improve spatial and temporal locality in caches
  - Divide A, B and C in blocks (submatrices) that fit in the various levels of cache
- Use vector instructions
- 3 routines:
  - Macro-kernel (shaping the blocks, driving the algorithm)
  - Micro-kernel (computing partial products)
  - Packing (copying and reordering elements in the blocks)



- Call micro-kernel on micro-panels
  - $\circ \quad \rightarrow \text{compute micro-tile}$



Matrix A



- Switch to next micro-panel of A
  - $\circ \rightarrow$  compute next micro-tile



Matrix A



- Switch to next micro-panel of A
  - $\circ \rightarrow$  compute next micro-tile



Matrix A



- Switch to next micro-panel of A
  - $\circ \rightarrow$  compute next micro-tile



Matrix A



- Switch to next micro-panel of B
  - $\circ \rightarrow$  compute next micro-tile



Matrix A



- Switch to next micro-panel of B
  - $\circ \rightarrow$  compute next micro-tile



Matrix A



# Blocked GEMM (Micro-kernel)



Matrix A





Matrix C

# Blocked GEMM (Micro-kernel)



Matrix B

Matrix A







#### Blocked GEMM (Micro-kernel) Matrix A Matrix B Loading elements in SIMD registers Vector register -Need to re-order them 0 $\otimes \otimes \otimes \otimes$ Matrix C $\circ \rightarrow \mathsf{Packing}$ Vector Multiply-Matrix A Accumulate Vector register Matrix C 13





















## Contributions

- Contribution 1: Traceable Blocked GEMM library
  - No compiler optimization
  - Optimized NEON code, no intrinsics
- Contribution 2: Cache miss prediction
  - Implementation rules for predictability
  - Formulas for worst-case cache misses
- Target = Texas Keystone II
  - 1 ARM Cortex A15 (ARM v7)
  - NEON extension: 16 128-bits registers (4 FP32 elements each)
  - L1D: 32KB, 64 Bytes cache line, **2-ways associative (256 sets), LRU**
  - L2D: 4MB, 64 Bytes cache line, 16-ways associative (4096 sets), random

### Library code optimization

- Force register usage in C code
  - Reduce pressure on the stack
- Select NEON instructions in assembly code blocks
  - With -O0, ARM NEON intrinsics translation by compiler is not efficient

| TABLE I                                                                 |                                                   |
|-------------------------------------------------------------------------|---------------------------------------------------|
| MEASURED EXECUTION TIMES, IN CYCLES, ON AN ARM CORTEX-A15 WITH -O0 FLAC | B. DIFFERENCE W.R.T. Base IS GIVEN IN PARENTHESES |

| Matr | ix confi | guration |                                    | GEMM implementation                                         |                                                             |             |                                                             |                            |                                                       |                          |
|------|----------|----------|------------------------------------|-------------------------------------------------------------|-------------------------------------------------------------|-------------|-------------------------------------------------------------|----------------------------|-------------------------------------------------------|--------------------------|
| m    | n        | k        | 5                                  | Base                                                        | Firs                                                        | t optimized | wit                                                         | Optimized<br>th intrinsics |                                                       | Optimized<br>in assembly |
| 272  | 272      | 272      | mean<br>min<br>max<br>min-max var. | 2 543 708 367<br>2 543 704 067<br>2 543 711 485<br>2.92e-4% | 988 682 239<br>988 676 103<br>988 692 654<br>1.67e-3%       | (-61.13%)   | 2 220 393 123<br>2 220 386 698<br>2 220 398 214<br>5.19e-4% | (-12.71%)                  | 235 168 329<br>235 153 669<br>235 181 181<br>1.17e-2% | (-90.75%)                |
| 528  | 528      | 528      | mean<br>min<br>max<br>min-max var. | 6 920 673 125<br>6 920 653 603<br>6 920 698 753<br>6.52e-4% | 2 883 613 750<br>2 883 100 642<br>2 888 548 217<br>1.89e-1% | (-58.33%)   | 3 312 934 607<br>3 312 863 994<br>3 312 956 562<br>2.79e-3% | (-52.13%)                  | 1566589440156656568415666039682.44e-3%                | (-77.36%)                |
| 192  | 736      | 528      | mean<br>min<br>max<br>min-max var. | 5 789 246 930<br>5 789 237 191<br>5 789 266 393<br>5.02e-4% | 3 622 514 818<br>3 622 514 793<br>3 622 514 892<br>2.73e-6% | (-37.43%)   | 3 875 069 543<br>3 875 069 520<br>3 875 069 565<br>1.16e-6% | (-33.06%)                  | 778 924 693<br>778 923 061<br>778 926 162<br>3.98e-4% | (-85.55%)                |
| 256  | 784      | 2,016    | mean<br>min<br>max<br>min-max var. | 2 120 705 709<br>2 120 700 674<br>2 120 709 745<br>4.28e-4% | 579964569<br>579962833<br>579966116<br>5.66e-4%             | (-72.65%)   | 857 163 412<br>857 161 851<br>857 165 391<br>4.13e-4%       | (-59.58%)                  | 134 562 545<br>134 561 357<br>134 563 809<br>1.82e-3% | (-93.65%)                |

- Objective: statically bound memory accesses and cache misses
  - For each library routine (packing A, packing B, macro kernel)
  - Take advantage of regular code (no if-then-else)
- These values can then be used as input to static WCET analyser
  - Out of scope

• Corollary (accesses to matrix slices = complete blocks in column)

**Corollary 1.** If  $N_b$  is an odd integer, any slice of shape  $s_{Li} \times X_{Li}$  of matrix M, whose first row is aligned to the start of a cache block, has each of its  $s_{Li}$  cache blocks mapped to a different cache set.

| B4   B5   B6   B7     B8   B9   B10   B11     B12   Image: Constraint of the state                                                                                                                                                          | BO  | B1 | B2  | B3  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|----|-----|-----|
| B8   B9   B10   B11     B12   B10   B11     B12   Comparison   Comparison     B16   Comparison   Comparison     B20   Comparison   Comparison     B24   Comparison   Comparison     B28   Comparison   Comparison                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | B4  | B5 | B6  | B7  |
| B12 B16   B16 B16   B20 B24   B28 B28                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | B&  | B9 | B10 | B11 |
| B16 Image: Constraint of the second | B12 |    |     |     |
| B20<br>B24<br>B28                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | B16 |    |     |     |
| B24<br>B28                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | B20 |    |     |     |
| B28                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | B24 |    |     |     |
| <                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | B28 |    |     |     |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |     |    |     |     |



• Corollary (accesses to matrix slices = complete blocks in column)

**Corollary 1.** If  $N_b$  is an odd integer, any slice of shape  $s_{Li} \times X_{Li}$  of matrix M, whose first row is aligned to the start of a cache block, has each of its  $s_{Li}$  cache blocks mapped to a different cache set.

| BO  | B1 | B2  | B3  | Padding |  |
|-----|----|-----|-----|---------|--|
| B4  | B5 | B6  | B7  | Padding |  |
| B&  | B9 | B10 | B11 | Padding |  |
| B12 |    |     |     | Padding |  |
| B16 |    |     |     | Padding |  |
| B20 |    |     |     | Padding |  |
| B24 |    |     |     | Padding |  |
| B28 |    |     |     | Padding |  |
|     |    |     |     |         |  |



- Packing B
  - Rule 1: Align Matrix B and packing buffer to cache block boundaries
  - Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
  - Rule 3: Choose slice height equal to number of cache sets





21

|   | 4 |  |  |   |  |
|---|---|--|--|---|--|
|   |   |  |  |   |  |
| 1 |   |  |  |   |  |
|   |   |  |  | / |  |

- Packing B
  - Rule 1: Align Matrix B and packing buffer to cache block boundaries
  - Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
  - Rule 3: Choose slice height equal to number of cache sets





25

|  |  |   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                                | 6 |
|--|--|---|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|---|
|  |  |   |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                                                                                                                |   |
|  |  |   | a second s |                                                                                                                |   |
|  |  | 1 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | the second s | 4 |

- Packing B
  - Rule 1: Align Matrix B and packing buffer to cache block boundaries
  - Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
  - Rule 3: Choose slice height equal to number of cache sets





26

. . .

- Packing B
  - Rule 1: Align Matrix B and packing buffer to cache block boundaries
  - Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
  - Rule 3: Choose slice height equal to number of cache sets



1 miss + 1 miss



| 1 |  |  |  |  |
|---|--|--|--|--|
|   |  |  |  |  |
|   |  |  |  |  |
|   |  |  |  |  |

- Packing B
  - Rule 1: Align Matrix B and packing buffer to cache block boundaries
  - Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
  - Rule 3: Choose slice height equal to number of cache sets



|                                   | Cache    |
|-----------------------------------|----------|
| Way 1                             | Way 2    |
| $\otimes \otimes \otimes \otimes$ |          |
|                                   | 88888888 |
|                                   |          |
|                                   |          |
|                                   |          |
| $\otimes \otimes \otimes \otimes$ |          |
|                                   |          |
|                                   |          |

#### 2 misses + 1 miss



| <u> </u> |  |  |
|----------|--|--|
|          |  |  |



- Packing B
  - Rule 1: Align Matrix B and packing buffer to cache block boundaries
  - Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
  - Rule 3: Choose slice height equal to number of cache sets



8 misses + 4 misses



|                                         |            | 0         |  |
|-----------------------------------------|------------|-----------|--|
| 000000000000000000000000000000000000000 | 000000000  | 0000000   |  |
|                                         |            |           |  |
|                                         | 0000000000 | 000000000 |  |
|                                         | 00000000   | 00000000  |  |

- Packing B
  - Rule 1: Align Matrix B and packing buffer to cache block boundaries Ο
  - Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary) 0

Way 2

Rule 3: Choose slice height equal to number of cache sets 0



8 misses + 5 misses



. . .

- Packing B
  - Exact prediction of number of memory accesses and L1 cache misses
  - Cache miss bound is theoretical minimum
- Packing A is handled in a similar fashion



8 misses + 8 misses i.e. number of blocks

Matrix B



. . .

- Macro kernel
  - Rule 1: Matrix C is aligned to a cache block boundary 0
  - Rule 2: Matrix C is padded if necessary 0
  - Prop. 1: Buffer A, Buffer B aligned to a cache block boundary Ο
  - Prop. 2: Micro-panels of A are contiguous in memory (same for B)  $\bigcirc$
  - Prop. 3: Successive rows of each C<sub>r</sub>(i) mapped to distinct cache sets Ο
  - Prop. 4: All micro-panels of A and B have the same size 0
- $C_{r}(0) += A_{r}(0)^{*}B_{r}(0);$  $C_{r}(1) += A_{r}(1)^{*}B_{r}(0);$
- $C'_{r}(2) += A_{r}(2)^{*}B_{r}(0);$
- $C_{r}(3) += A_{r}(3)*B_{r}(0);$
- $C_r(9) += A_r(1)^*B_r(2);$

Matrix A

Matrix C

Br(0) Br(1) Br(2) Br(3) CH(D) CH(4) CH(8) CH(12) C(1) C(5) C(9) C(13) c(2) c(6) c(10) c(14) CA3) CA7) CA11) CA15)

Ar(0)

Ar(1)

Ar(2)

Ar(3)







# Improving predictability

- Macro kernel
  - Rule 1: Matrix C is aligned to a cache block boundary 0
  - Rule 2: Matrix C is padded if necessary 0
  - Prop. 1: Buffer A, Buffer B aligned to a cache block boundary Ο
  - Prop. 2: Micro-panels of A are contiguous in memory (same for B)  $\bigcirc$
  - Prop. 3: Successive rows of each C (i) mapped to distinct cache sets Ο
  - Prop. 4: All micro-panels of A and B have the same size 0

• 
$$C_r(0) += A_r(0)^*B_r(0);$$
  
•  $C_r(1) += A_r(1)^*B_r(0);$  Reuse  $B_r(0)$ 

- $C_{r}(2) += A_{r}(2)^{*}B_{r}(0); \ J$   $C_{r}(3) += A_{r}(3)^{*}B_{r}(0);$

. . .

 $C_r(9) += A_r(1)^*B_r(2);$ 

| Ar(0) | ۲(۵)  |
|-------|-------|
| Ar(1) | Cr(1) |
| Ar(2) | ८८(२) |
| Ar(3) | Cr(3) |

Matrix A

B~(0)

Matrix C

- Macro kernel

  - $\bigcirc \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $\circ$  C<sub>r</sub>(3) += A<sub>r</sub>(3)\*B<sub>r</sub>(0);

| C | B | A |
|---|---|---|
|   |   |   |
|   |   |   |
|   |   |   |
|   |   |   |
|   |   |   |

Required cache sets



Matrix C

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // \mathbf{R-C}_{r}(0); \ R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $\circ$  C<sub>r</sub>(3) += A<sub>r</sub>(3)\*B<sub>r</sub>(0);



#### Required cache sets

Matrix A

Matrix B

35

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $\circ$  C<sub>r</sub>(3) += A<sub>r</sub>(3)\*B<sub>r</sub>(0);



#### Required cache sets

Matrix A

36
- Macro kernel
  - $\bigcirc \qquad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $\circ$  C<sub>r</sub>(3) += A<sub>r</sub>(3)\*B<sub>r</sub>(0);

| ( <sup>(0)</sup> , <b>C</b> | B | A |         |
|-----------------------------|---|---|---------|
|                             |   |   | Read Br |
|                             |   |   |         |
|                             |   |   |         |
|                             |   |   |         |
|                             |   |   |         |

#### Required cache sets

Matrix A

- Macro kernel
  - $\bigcirc \qquad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $\circ$  C<sub>r</sub>(3) += A<sub>r</sub>(3)\*B<sub>r</sub>(0);



Required cache sets



- Macro kernel
  - $\bigcirc \qquad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $\circ$  C<sub>r</sub>(3) += A<sub>r</sub>(3)\*B<sub>r</sub>(0);







Matrix B

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $C_r(3) += A_r(3)*B_r(0);$







40

Matrix B

- Macro kernel
  - $\circ \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // \mathbf{R-C_{r}(1)}; R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $C_r(3) += A_r(3)*B_r(0);$



#### Required cache sets

Matrix A

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); \ // \ R-C_{r}(0); \ R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\bigcirc \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);
  - $C_r(3) += A_r(3)*B_r(0);$



| C | R | A |         |
|---|---|---|---------|
|   |   |   | 5 1 1   |
|   |   |   | Read Ar |
|   |   |   |         |
|   |   |   |         |
|   |   |   |         |
|   |   |   |         |
|   |   |   |         |
|   |   |   | }       |

Required cache sets

T

Matrix A

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\bigcirc \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$

~

 $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);

• 
$$C_r(3) += A_r(3) B_r(0);$$

Case 1: Br gets old because of Ar and is evicted by Cr

| C | B   | A     |     |
|---|-----|-------|-----|
|   |     |       |     |
|   |     |       | ト   |
|   |     |       |     |
|   | old | young | , V |
|   |     |       |     |
|   |     |       |     |
|   |     |       |     |
|   |     |       |     |
|   |     |       |     |
|   |     |       |     |
|   |     |       |     |
|   |     |       | 1   |

Required cache sets

D

λ

Read Ar//Br; Write Cr



Matrix C

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // \mathbf{R-C_{r}(1)}; R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);

• 
$$C_r(3) += A_r(3) + B_r(0);$$



| с.    | B       | A   |                            |
|-------|---------|-----|----------------------------|
| young | evicted | old | Read Cr<br>Br gets evicted |
|       |         |     |                            |
|       |         |     |                            |

Required cache sets

Matrix A

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$

~

 $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);

• 
$$C_r(3) += A_r(3)*B_r(0);$$



| C | B     | A |              |
|---|-------|---|--------------|
|   |       |   | Road 10//Rai |
|   |       |   | Read AIT DI, |
|   |       |   | Write Cr     |
|   | young |   |              |
|   |       |   |              |
|   |       |   |              |
|   |       |   |              |
|   |       |   |              |
|   | 1     |   | 1            |
|   |       |   |              |

.



T



- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); \ // \ R-C_{r}(0); \ R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\bigcirc \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // \mathbf{R-C}_{r}(1); \ R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$

~

 $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);

• 
$$C_r(3) += A_r(3) B_r(0);$$



|          | A | D   | C     |
|----------|---|-----|-------|
| Read Co  |   |     |       |
| Thead CI |   |     |       |
| 1        |   | old | young |
| 1        |   |     |       |
|          |   |     |       |
| 4        |   |     |       |
| 4        |   |     |       |
| J        |   |     |       |

A

Required cache sets

D

Matrix A

Matrix B

Matrix C

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\bigcirc \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$

~

 $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);

• 
$$C_r(3) += A_r(3) B_r(0);$$



| C   | В       | A     |                 |
|-----|---------|-------|-----------------|
|     |         |       | Read Ar         |
|     |         |       | Bo note aviated |
| old | evicted | young | DI GELS EVICLEN |
|     |         |       |                 |
|     |         |       |                 |
|     |         |       |                 |
|     |         |       |                 |
|     |         |       | L               |

.

Required cache sets

T

# Matrix A

- Macro kernel
  - $C_r(0) += A_r(0)*B_r(0); // R-C_r(0); R-A_r(0)||R-B_r(0); W-C_r(0);$ Ο
  - $C_r(1) += A_r(1)*B_r(0); // R-C_r(1); R-A_r(1)||R-B_r(0);W-C_r(1);$ Ο

 $\bigcirc$  $C_r(2) += A_r(2)^*B_r(0);$ 

• 
$$C_r(3) += A_r(3)*B_r(0);$$



| c C   | B | A |         |
|-------|---|---|---------|
| young |   |   | Read Cr |
|       |   |   |         |
|       |   |   |         |
|       |   |   |         |
|       |   |   |         |
|       |   |   | ]       |

Required cache sets

T

Matrix A Matrix C

- Macro kernel
  - $\bigcirc \quad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); \mathbf{R-A}_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\bigcirc \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);

• 
$$C_r(3) += A_r(3)*B_r(0);$$



| C     | B | A     |      |    |
|-------|---|-------|------|----|
| young |   | young | Read | Ar |
|       |   |       |      |    |
|       |   |       |      |    |
|       |   |       |      |    |
|       |   |       |      |    |
|       |   |       | ]    |    |
|       |   |       | l    |    |

Required cache sets

Matrix A

- Macro kernel
  - $\bigcirc \qquad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$

~

•  $C_r(2) += A_r(2)^*B_r(0);$ 

• 
$$C_r(3) += A_r(3) + B_r(0);$$



| C   | R     | A     |         |
|-----|-------|-------|---------|
| old | young | young | Read Br |
|     |       |       | ]       |
|     |       |       |         |
|     |       |       |         |
|     |       |       |         |

Required cache sets

T

Matrix A

- Macro kernel
  - $\bigcirc \qquad C_{r}(0) += A_{r}(0)^{*}B_{r}(0); // R-C_{r}(0); R-A_{r}(0)||R-B_{r}(0); W-C_{r}(0);$
  - $\circ \quad C_{r}(1) += A_{r}(1)^{*}B_{r}(0); // R-C_{r}(1); R-A_{r}(1)||R-B_{r}(0); W-C_{r}(1);$
  - $\circ$  C<sub>r</sub>(2) += A<sub>r</sub>(2)\*B<sub>r</sub>(0);

• 
$$C_r(3) += A_r(3)*B_r(0);$$



| C C     | B   | A              |                 |
|---------|-----|----------------|-----------------|
| evicted | old | young<br>young | Read Ar         |
|         |     |                | Cr gets evicted |
|         |     |                |                 |
|         |     |                |                 |
|         |     |                |                 |
|         |     |                | ł               |

Required cache sets

Matrix A

- For packing A and B, formulas predict misses with max 0.03% overhead
- For macro-kernel, overhead of 5% on average
  - Except particular case with small blocks  $A_c$  and  $C_c$  (up to 13% overhead)
- Less than 5% overhead in average execution time
- For packing B, L1D refill reduced by up to 60% in the best case

|             | MEASURES FOR THE NUMBER OF MEMORY ACCESS AND LTD REFILL OF THE GEMM ROUTINE. |            |        |              |            |               |          |           |           |           |           |          |
|-------------|------------------------------------------------------------------------------|------------|--------|--------------|------------|---------------|----------|-----------|-----------|-----------|-----------|----------|
|             | Matr                                                                         | ix configu | ration |              | Men        | nory accesses |          |           | Ι         | 1D refill |           |          |
|             | m                                                                            | n          | k      |              | Ours       | Low et        | al.      | Ours      | Low of    | et al.    | Theore    | etical   |
| <i>(i)</i>  | 272                                                                          | 272        | 272    | Packing B    | 148 032    | 148 000       | (-0.02%) | 9 2 5 3   | 10612     | (12.81%)  | 9254      | (0.01%)  |
|             |                                                                              |            |        | Packing A    | 148 032    | 148 000       | (-0.02%) | 9252      | 9251      | (-0.01%)  | 9254      | (0.02%)  |
|             |                                                                              |            |        | Macro-kernel | 2811414    | 2 663 435     | (-5.26%) | 333 950   | 347 307   | (3.85%)   | 384 886   | (13.23%) |
|             |                                                                              |            |        | Total        | 3 107 478  | 2959435       | (-4.76%) | 352 455   | 367 170   | (4.01%)   | 403 394   | (12.63%) |
| <i>(ii)</i> | 528                                                                          | 528        | 528    | Packing B    | 557 664    | 557 632       | (-0.01%) | 34 856    | 78145     | (55.40%)  | 34 857    | (0.00%)  |
|             |                                                                              |            |        | Packing A    | 557 664    | 557 632       | (-0.01%) | 34 854    | 34852     | (-0.01%)  | 34 857    | (0.01%)  |
|             |                                                                              |            |        | Macro-kernel | 20072481   | 19514902      | (-2.78%) | 2 552 600 | 2 509 309 | (-1.73%)  | 2703897   | (5.60%)  |
|             |                                                                              |            |        | Total        | 21 187 809 | 20 630 166    | (-2.63%) | 2622310   | 2 622 306 | (0.00%)   | 2773611   | (5.46%)  |
| (iii)       | 256                                                                          | 784        | 2,016* | Packing B    | 3 161 344  | 3 161 216     | (0.00%)  | 197 591   | 341 213   | (42.09%)  | 197 592   | (0.00%)  |
|             |                                                                              |            |        | Packing A    | 1 032 448  | 1 032 320     | (-0.01%) | 64 528    | 64 520    | (-0.01%)  | 64 536    | (0.01%)  |
|             |                                                                              |            |        | Macro-kernel | 53 788 760 | 52 183 084    | (-2.99%) | 6894927   | 6766186   | (-1.90%)  | 7217528   | (4 47%)  |
|             |                                                                              |            |        | Total        | 57 982 552 | 56376620      | (-2.77%) | 7157046   | 7171919   | (0.21%)   | 7479656   | (4.31%)  |
| (iv)        | 192                                                                          | 736*       | 528    | Packing B    | 777 312    | 777 248       | (-0.01%) | 48 584    | 121 948   | (60.16%)  | 48 585    | (0.00%)  |
|             |                                                                              |            |        | Packing A    | 202 848    | 202 784       | (-0.03%) | 12677     | 12675     | (-0.02%)  | 12681     | (0.03%)  |
|             |                                                                              |            |        | Macro-kernel | 10174497   | 9891862       | (-2.78%) | 1 248 905 | 1 258 791 | (0.79%)   | 1 322 057 | (5.53%)  |
| * Pad       | ding app                                                                     | plied.     |        | Total        | 11 154 657 | 10871894      | (-2.53%) | 1 310 166 | 1 393 414 | (5.97%)   | 1 383 323 | (5.29%)  |

| TABLE III                                                 |                     |
|-----------------------------------------------------------|---------------------|
| Measures for the number of memory access and $L1D$ refill | OF THE GEMM ROUTINE |

# Conclusion

- Predictable and traceable yet efficient implementation of a blocked GEMM algorithm
  - Execution time reduced by up to 99% compared to unoptimized code, with no compiler optimization
  - L1D cache misses over-estimated by 5% on average, except for one matrix configuration
  - $\circ$   $\,$  Cost of predictability is less than 5%  $\,$
- Future work
  - Extend our formulas to predictable L2 caches
  - Refine analysis for the problematic case
  - Automatic code generation targeting particular architectures and matrix configurations
    - Instead of generic library

#### Thank you for your attention

Questions ?

# Blocked GEMM (Micro-kernel)

Matrix B



Matrix A

Matrix C

Outer product

# Blocked GEMM (Micro-kernel)

Matrix B



Matrix A





# Blocked GEMM (Micro-kernel)

Matrix B



Matrix A





# Blocked GEMM (Micro-kernel)

Matrix B



Matrix A





# Blocked GEMM (Micro-kernel)

Matrix B





#### Blocked GEMM (macro-kernel)

- Switch to next panel of A
  - $\circ \quad \rightarrow \text{compute next block of C}$



Matrix A



Matrix B

#### Blocked GEMM (macro-kernel)

- Switch back to previous panels of A
  - Use next blocks in panels
  - $\circ \quad \rightarrow \text{accumulate in blocks of C}$



Matrix A









Blocked GEMM (Packing A)

Packed Panel Buffer Ãc





























#### Parameter tuning

- Traditionally, the blocks dimensions are chosen empirically so that:
  - Micro-panels  $B_r$  stay in the L1 cache, one at a time
  - $\circ$  The packed buffer  $\tilde{A}_c$  stays in the L2 cache
  - The packed buffer  $\widetilde{B}_{c}$  stays in the L3 cache, if any

- Low et al. [1] proposed an analytical method to tune the parameters automatically to reduce execution time
  - Here, same philosophy for predictability

[1] T. M. Low, F. D. Igual, T. M. Smith, and E. S. Quintana-Orti, "Analytical modeling is enough for high-performance BLIS" ACM Transactions on Mathematical Software, vol. 43, no. 2, pp. 1–18, Aug. 2016

• General theorem (strided accesses in a matrix)

**Theorem 1.** Let  $b, n \in \mathbb{N}$ . If n is odd, then the cache blocks of indices  $b, b+n, b+2 \cdot n, \ldots, b+(s_{Li}-1) \cdot n$  are all mapped to different cache sets.

| BO | B1                   | B2                               |
|----|----------------------|----------------------------------|
| B3 | B4                   | B5                               |
| B6 | B7                   | B&                               |
| Ba |                      |                                  |
|    |                      |                                  |
|    |                      |                                  |
|    |                      |                                  |
|    |                      |                                  |
|    | B0<br>B3<br>B6<br>B9 | B0  B1    B3  B4    B6  B7    B9 |


• General theorem (strided accesses in a matrix)

**Theorem 1.** Let  $b, n \in \mathbb{N}$ . If n is odd, then the cache blocks of indices  $b, b+n, b+2 \cdot n, \ldots, b+(s_{Li}-1) \cdot n$  are all mapped to different cache sets.



• General theorem (strided accesses in a matrix)

**Theorem 1.** Let  $b, n \in \mathbb{N}$ . If n is odd, then the cache blocks of indices  $b, b+n, b+2 \cdot n, \ldots, b+(s_{Li}-1) \cdot n$  are all mapped to different cache sets.

Matrix B

| BO          | B1 | B2 |
|-------------|----|----|
| B3 V        | B4 | B5 |
| B6 V        | B7 | B& |
| B9 <b>V</b> |    |    |
| B12 ¥       |    |    |
| B15 V       |    |    |
| B18 V       |    |    |
| B21 🗸       |    |    |

Ex. 2: 6=0, n=3

| Way 1 | Way 2 |
|-------|-------|
| BO    |       |
| B9    |       |
| B18   |       |
| B3    |       |
| B12   |       |
| B21   |       |
| B6    |       |
| B15   |       |

General theorem (mapping a matrix to a data cache with s<sub>Li</sub> sets)
Theorem 1. Let b, n ∈ N. If n is odd, then the cache blocks of indices b, b+n, b+2·n, ..., b+(s<sub>Li</sub>-1)·n are all mapped to different cache sets.



#### • Packing B

- Rule 1: Align Matrix B and packing buffer to cache block boundaries
- Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)

Cooko

- Rule 3: Choose vector width as a divisor of cache block size
- Rule 4: Choose slice height equal to number of cache sets



| Way 1    | Way 2                             |
|----------|-----------------------------------|
| <u> </u> |                                   |
|          | 00000000                          |
| 8888     | $\otimes \otimes \otimes \otimes$ |
|          |                                   |
|          |                                   |
| 8888     |                                   |
|          |                                   |
|          |                                   |



| 000000000000 |  |  |                                                                                                                |
|--------------|--|--|----------------------------------------------------------------------------------------------------------------|
|              |  |  |                                                                                                                |
|              |  |  |                                                                                                                |
|              |  |  | And a second |



#### • Packing B

- Rule 1: Align Matrix B and packing buffer to cache block boundaries
- Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
- Rule 3: Choose vector width as a divisor of cache block size
- Rule 4: Choose slice height equal to number of cache sets



|                                   | Cache    |
|-----------------------------------|----------|
| Way 1                             | Way 2    |
| $\otimes \otimes \otimes \otimes$ |          |
| $\otimes \otimes \otimes \otimes$ | <u> </u> |
| $\otimes \otimes \otimes \otimes$ |          |
| $\otimes \otimes \otimes \otimes$ | <u> </u> |
| $\otimes \otimes \otimes \otimes$ | <u> </u> |
| $\otimes \otimes \otimes \otimes$ |          |
| 8888                              |          |
| $\infty$                          |          |



Buffer  $\tilde{B}_{c}$ 

|                                         |                                                                                                                                                             |           | - |
|-----------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|---|
|                                         | $ \otimes $ | $\otimes$ |   |
| 000000000000000000000000000000000000000 |                                                                                                                                                             | 000000000 |   |

#### • Packing B

- Rule 1: Align Matrix B and packing buffer to cache block boundaries
- Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
- Rule 3: Choose vector width as a divisor of cache block size
- Rule 4: Choose slice height equal to number of cache sets







Cache

Way 2

 $\otimes$ 

 $\otimes \otimes \otimes \otimes$ 

#### • Packing B

- Rule 1: Align Matrix B and packing buffer to cache block boundaries
- Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)

Cache

- Rule 3: Choose vector width as a divisor of cache block size
- Rule 4: Choose slice height equal to number of cache sets



| Way 1                             | Way 2     |
|-----------------------------------|-----------|
| wayı                              | way       |
| $\otimes$                         |           |
| 8888                              | 000000000 |
| $\otimes \otimes \otimes \otimes$ | 8888888   |
| $\otimes \otimes \otimes \otimes$ | 8888888   |
| $\otimes \otimes \otimes \otimes$ | 8888888   |
| <u> </u>                          |           |
| 8888                              |           |
| $\infty \infty \infty$            | 1         |



#### • Packing B

- Rule 1: Align Matrix B and packing buffer to cache block boundaries
- Rule 2: Pad rows of matrix B to an odd number of complete cache blocks (if necessary)
- Rule 3: Choose vector width as a divisor of cache block size
- Rule 4: Choose slice height equal to number of cache sets



Buffer  $\widetilde{B}_{c}$ 



Matrix B