
Conversation

@pushkar-hue

Initializing a gpt-oss model with attn_implementation="flash_attention_2" or "flash_attention_3" results in silent failures and garbage generation output, as reported in #42533.

gpt-oss models rely on attention sinks, which are not yet implemented for flash attention. As suggested, the safest path is to strictly block unsupported attention backends rather than failing silently or assuming a fallback.
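For reference, a minimal sketch of the failure mode described above; the checkpoint name and generation arguments here are illustrative assumptions, not taken from the linked issue:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repro, not copied from the issue: before this PR the model loads
# without complaint, but the flash backend ignores the attention sinks that
# gpt-oss depends on, so generation quality silently degrades.
model_id = "openai/gpt-oss-20b"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # silently unsupported before this fix
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0]))
```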

@vasqu can you see if this is what you had in mind?


github-actions bot commented Dec 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: gpt_oss

@vasqu (Contributor) left a comment:


Added some comments. I have some concerns about the check, and somehow there are a lot of changes in the modeling file that shouldn't be there (e.g. `__` changed to `_`).

def __init__(self, config: GptOssConfig):
    super().__init__(config)

    if config._attn_implementation in ["flash_attention_2", "flash_attention_3"]:
@vasqu (Contributor) commented on the new check:

Two things that might not work as expected:

  • This only checks during init, but the implementation can also be updated afterwards with e.g. `set_attn_implementation`. We probably cannot avoid checking at runtime (e.g. in `forward`).
  • There are more FA implementations, especially the kernel-based ones. Afaik, we only have the vllm kernel working, so we should check along the lines of `if "flash" in config._attn_implementation and config._attn_implementation != "vllm-kernel": raise ValueError(...)` (don't remember the name of the kernel); a rough sketch follows below.
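A rough sketch of what that combined check might look like; the helper name, its call sites, and the "vllm-kernel" string are assumptions here (the exact kernel identifier still needs to be confirmed), not the final implementation:

```python
# Hypothetical helper, not part of this PR: reject flash-attention backends
# that do not handle gpt-oss attention sinks. "vllm-kernel" stands in for the
# one kernel implementation known to work; the real identifier may differ.
def _check_attn_implementation(attn_implementation: str) -> None:
    if "flash" in attn_implementation and attn_implementation != "vllm-kernel":
        raise ValueError(
            f"`{attn_implementation}` is not supported for gpt-oss models because "
            "flash attention does not implement attention sinks."
        )
```

Calling a helper like this from both `__init__` and `forward` would also cover a later `set_attn_implementation(...)` switching to an unsupported backend at runtime.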

@vasqu (Contributor) commented:

This is weird, there are a lot of changes like `__init__` -> `_init_` which shouldn't happen. Maybe some ruff version mismatch?

