Twelve Labs

Cutting-Edge Technology Is Not Plug-and-Play: Customizing FlashAttention-4 on B300

Sam Choi

If you truly want to be on the cutting edge, plug and play is no longer an option. We aim to be a team that moves fast, solving problems directly whenever we need to.

2026. 5. 29.

10 min read

Why is B300 Slower Than H100?

This was the first thought that came to mind during our initial B300 training run. According to the spec sheet, the B300 has 3.5 times the VRAM and over twice the peak FLOPs of the previous generation (Hopper, H100), yet our model's forward and backward passes were actually slower on it. As we dug into the code to find the root cause, we realized the bottleneck lay at the very heart of the Transformer: attention. More specifically, the issue was with the Flash Attention kernel designed to accelerate it.

Up until then, we had been using the Flash Attention 3 (FA3) kernel, which is highly optimized specifically for Hopper. However, this kernel could not be used on the B300 (which features the Blackwell architecture), forcing a fallback to a more generic, previous-generation kernel: Flash Attention 2 (FA2). In other words, while our hardware had leapt a generation forward, our software had slipped a generation backward.

Fortunately, Flash Attention 4 (FA4), written specifically for Blackwell, was available as a pre-release at the time. Much like FA3 before it, FA4 was a complete rewrite aimed at delivering massive performance gains on Blackwell compared to older kernels. Unfortunately, we couldn't just use it out of the box. Our model's attention head dimensions fell outside the list of shapes supported by FA4 at that stage.

Typically, you face one of two choices at this point:

  1. Redesign the model architecture to fit the head dimensions supported by FA4.

  2. Keep the architecture as-is and stick with the older fallback kernel.

We chose option three: write our own kernel tailored to our model's exact head dimensions.

This post is a record of how a Research Scientist rolled up their sleeves to write a custom GPU kernel. It also covers what we learned about why cutting-edge hardware is rarely plug-and-play, and just how deep a model team has to go to extract the performance it needs.


Flash Attention Recap & Why It Must Be Rewritten for Every Generation

Let’s do a quick, high-level recap.
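As a baseline for the recap, here is a minimal pure-Python sketch of naive attention, plus two helpers that estimate memory footprints. It is an illustration only, not the kernel discussed in this post. The function names and the example sizes (batch 1, 16 heads, head dimension 128, 2-byte elements) are assumptions chosen for the arithmetic. The point is that the full score matrix S, of shape [T_q, T_k], exists in memory before softmax is applied.

```python
import math


def naive_attention(q, k, v):
    """Single-head attention over plain Python lists (illustrative only).

    q: [T_q][d], k: [T_k][d], v: [T_k][d_v]. The full score matrix `s`
    ([T_q][T_k]) is built before softmax; on a GPU, a naive implementation
    would write this intermediate to HBM and read it back.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    # S = Q · Kᵀ, materialized in full: T_q * T_k entries.
    s = [[scale * sum(qi * ki for qi, ki in zip(q_row, k_row)) for k_row in k]
         for q_row in q]
    out = []
    for row in s:
        m = max(row)  # subtract the row max for numerical stability
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        p = [x / z for x in e]  # softmax over the key axis
        # O = P · V for this query row
        out.append([sum(p_j * v_row[c] for p_j, v_row in zip(p, v))
                    for c in range(len(v[0]))])
    return out


def score_matrix_bytes(batch, heads, t_q, t_k, bytes_per_elem=2):
    """Bytes needed to hold S for every batch element and head."""
    return batch * heads * t_q * t_k * bytes_per_elem


def qkv_bytes(batch, heads, t, head_dim, bytes_per_elem=2):
    """Bytes for Q, K and V together at sequence length t."""
    return 3 * batch * heads * t * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the quadratic growth of S.
    for t in (1024, 8192):
        s_mib = score_matrix_bytes(1, 16, t, t) / 2**20
        qkv_mib = qkv_bytes(1, 16, t, 128) / 2**20
        print(f"T={t}: S = {s_mib:.0f} MiB, Q+K+V = {qkv_mib:.0f} MiB")
```

With these assumed sizes, a sequence length of 8192 puts S at 2 GiB while Q, K and V together take 96 MiB. The inputs grow linearly with sequence length and the score matrix grows quadratically.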

If you calculate attention naively, the entire score matrix S = Q · Kᵀ, with shape [..., T_q, T_k], must be written to HBM (GPU High Bandwidth Memory). As the sequence length grows, the GPU spends more time writing and reading these intermediate values to and from memory than it does on the actual matrix multiplication. FlashAttention solved this by
