lhallee committed on
Commit 14cfa03 · verified · 1 Parent(s): d30e1f3

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -18,7 +18,7 @@ Load any ESM2 models into a FastEsm model to dramatically speed up training and
  | Backend | Key | Notes |
  | :--- | :--- | :--- |
  | PyTorch SDPA | `"sdpa"` | Default. Exact numerics, stable on all hardware. |
- | Flash Attention | `"kernels_flash"` | Fastest. Requires `pip install kernels` (pre-built — no hours-long compilation). Outputs differ slightly from SDPA (online softmax reordering) but are numerically harmless. |
+ | Flash Attention | `"kernels_flash"` | Fastest. Requires `pip install kernels` (pre-built — no hours-long compilation). Outputs are not bitwise identical to SDPA due to online softmax reordering; differences are often small but not guaranteed to be inconsequential — use `"sdpa"` if exact numerics matter. |
  | Flex Attention | `"flex"` | Skips padding tokens via block mask — faster on variable-length batches. Near-exact numerics. First use compiles a Triton kernel (30–120 s). |
  | Auto | `"auto"` | Picks the best available: `kernels_flash` → `flex` → `sdpa`. |
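The `"auto"` fallback order described in the table (`kernels_flash` → `flex` → `sdpa`) can be sketched as a small helper. This is a hypothetical illustration of the selection logic, not FastEsm's actual API; the function name and signature are assumptions.

```python
# Preference order for the "auto" attention backend, per the README table:
# kernels_flash -> flex -> sdpa. Hypothetical helper, not FastEsm's real API.
PREFERENCE = ("kernels_flash", "flex", "sdpa")

def resolve_backend(requested: str, available: set) -> str:
    """Return the attention backend key to use.

    "auto" picks the first available key in preference order; any
    explicit key is returned as-is after an availability check.
    """
    if requested == "auto":
        for key in PREFERENCE:
            if key in available:
                return key
        # SDPA ships with PyTorch, so it is always a safe final default.
        return "sdpa"
    if requested not in available:
        raise ValueError(f"backend {requested!r} is not available")
    return requested
```

For example, on a machine without the `kernels` package but with a compiled Triton, `resolve_backend("auto", {"flex", "sdpa"})` would fall through to `"flex"`.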