System Info
- transformers version: 5.0.0.dev0
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.12.9
- Huggingface_hub version: 1.1.5
- Safetensors version: 0.6.2
- Accelerate version: 1.11.0
- Accelerate config: not found
- DeepSpeed version: 0.18.2
- PyTorch version (accelerator?): 2.9.0+cu128 (CUDA)
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA H100 80GB HBM3
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
As part of the training CI (#42597), we noticed that the BLT model can't overfit a single sentence (and thus can't generate the same sentence back). There must be a bug somewhere; raising this issue to flag it.
```
pytest tests/models/blt/test_modeling_blt.py::BltModelTest::test_training_overfit -s
```
Expected behavior
The generation should match the training sentence, and the loss and grad_norm should each be reduced by at least 90% over the course of the overfit run.
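
For context, here is a minimal sketch of the kind of check such an overfit test performs, assuming a generic causal-LM + tokenizer interface; the checkpoint id, learning rate, step count, and prompt length below are illustrative placeholders, not the actual test code in `test_modeling_blt.py`.

```python
# Illustrative sketch only: repeatedly train on one sentence, then check that
# loss and grad_norm collapse and that greedy decoding reproduces the sentence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

sentence = "The quick brown fox jumps over the lazy dog."
model_id = "facebook/blt-1b"  # hypothetical checkpoint id, for illustration

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.train()

inputs = tokenizer(sentence, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

initial_loss, initial_grad_norm = None, None
for step in range(100):
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    if step == 0:
        initial_loss, initial_grad_norm = outputs.loss.item(), grad_norm.item()
    optimizer.step()
    optimizer.zero_grad()

# Expected: at least a 90% reduction in both loss and grad_norm ...
assert outputs.loss.item() <= 0.1 * initial_loss
assert grad_norm.item() <= 0.1 * initial_grad_norm

# ... and greedy decoding from a short prefix should reproduce the sentence.
model.eval()
prefix_len = 4  # illustrative prompt length
prompt_ids = inputs["input_ids"][:, :prefix_len]
generated = model.generate(
    prompt_ids,
    max_new_tokens=inputs["input_ids"].shape[1] - prefix_len,
    do_sample=False,
)
assert tokenizer.decode(generated[0], skip_special_tokens=True) == sentence
```

The reported failure is that the BLT model never reaches this state: the loss/grad_norm do not drop as expected and the generated text does not match the training sentence.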