gpt-oss is not working with flash-attention #42736
Open
Initializing a `gpt-oss` model with `attn_implementation="flash_attention_2"` or `"flash_attention_3"` results in silent failures and garbage generation output, as reported in #42533. `gpt-oss` models rely on attention sinks, which are not yet implemented for the flash-attention backends. As suggested, the safest path is to strictly block unsupported attention backends rather than failing silently or assuming a fallback.

@vasqu can you see if this is what you had in mind?
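A minimal sketch of the kind of guard this PR is proposing (the function and constant names below are illustrative, not the actual transformers internals):

```python
# Illustrative sketch only: reject flash-attention backends for gpt-oss
# instead of failing silently and producing garbage output.

UNSUPPORTED_ATTN_IMPLEMENTATIONS = {"flash_attention_2", "flash_attention_3"}


def validate_attn_implementation(attn_implementation: str) -> None:
    """Raise early if the requested backend cannot handle attention sinks."""
    if attn_implementation in UNSUPPORTED_ATTN_IMPLEMENTATIONS:
        raise ValueError(
            f"`{attn_implementation}` is not supported for gpt-oss because it does "
            "not implement attention sinks. Please choose a supported backend such "
            'as `attn_implementation="eager"`.'
        )


# Called during model initialization, before loading weights:
validate_attn_implementation("eager")                # passes
# validate_attn_implementation("flash_attention_2")  # raises ValueError
```

The idea is to raise at initialization time rather than assume a fallback, so users get an explicit error instead of degraded generations.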