
gh-68451: Fix unittest discovery to support Unicode module names #144853

Open
RoryGlenn wants to merge 1 commit into python:main from RoryGlenn:gh-68451-unittest-unicode-discovery


@RoryGlenn

gh-68451: Fix unittest discovery to support Unicode module names

Summary

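unittest test discovery previously validated module filenames with the regex [_a-z]\w*\.py$, whose first-character class is ASCII-only, so test files whose names start with a non-ASCII Unicode letter (e.g., 測試.py) were rejected. Names with non-ASCII characters only after the first character (e.g., café.py) already matched, because \w is Unicode-aware for str patterns. This PR replaces the regex with str.isidentifier(), which accepts all valid Python identifiers per PEP 3131.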
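
A quick way to see where the two checks diverge is to run both over a handful of filenames. The sketch below is illustrative only: old_accepts and stem_is_identifier are hypothetical helper names, and the re.IGNORECASE flag is an assumption taken from the review remark that the old check was case-insensitive.

```python
import re

# Pattern as quoted in the summary. The IGNORECASE flag is an assumption,
# based on the review remark that the old check was case-insensitive.
OLD_PATTERN = re.compile(r'[_a-z]\w*\.py$', re.IGNORECASE)


def old_accepts(filename):
    """Return True if the regex-based check would accept this filename."""
    return OLD_PATTERN.match(filename) is not None


def stem_is_identifier(filename):
    """Return True if the part before '.py' is a valid Python identifier."""
    return filename.endswith('.py') and filename[:-3].isidentifier()


names = ['test_ascii.py', 'café.py', '測試.py', '1bad.py', 'has-hyphen.py']

# Map each filename to (regex result, isidentifier result) for comparison.
results = {name: (old_accepts(name), stem_is_identifier(name)) for name in names}
```

Only the first character class in the pattern is restricted to ASCII, and \w is Unicode-aware for str patterns, so the two results differ only for names whose first character is non-ASCII.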

Changes

Lib/unittest/loader.py

  • Removed VALID_MODULE_NAME regex and the unused import re
  • Added _valid_module_name() function that uses os.path.splitext() + str.isidentifier() to validate module filenames
  • Added directory name validation in _find_test_path() — directories with invalid identifier names (e.g., containing hyphens) are now properly skipped during package discovery
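
The two loader checks described above can be sketched roughly as follows. This is not the PR's code: valid_module_name is reconstructed from the description, and looks_like_package is a hypothetical helper that only illustrates the idea of skipping directories whose names are not identifiers (the __init__.py requirement is an assumption made for the example).

```python
import os
import tempfile


def valid_module_name(path):
    # Split off the extension, then require '.py' and an identifier stem.
    root, ext = os.path.splitext(path)
    return ext == '.py' and root.isidentifier()


def looks_like_package(full_path):
    # Hypothetical helper (not from the PR): a directory qualifies only if
    # its name is a valid identifier and it contains an __init__.py.
    name = os.path.basename(full_path)
    return (os.path.isdir(full_path)
            and name.isidentifier()
            and os.path.isfile(os.path.join(full_path, '__init__.py')))


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for dirname in ('good_pkg', 'bad-pkg'):
        os.makedirs(os.path.join(tmp, dirname))
        # Create an empty __init__.py so only the name decides the outcome.
        open(os.path.join(tmp, dirname, '__init__.py'), 'w').close()
    good_result = looks_like_package(os.path.join(tmp, 'good_pkg'))
    bad_result = looks_like_package(os.path.join(tmp, 'bad-pkg'))
```

In this sketch good_pkg passes and bad-pkg is skipped, since a hyphenated name could never be imported as a package anyway.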

Lib/test/test_unittest/test_discovery.py

  • test_valid_module_name: Tests the new _valid_module_name() function with ASCII, Unicode, and invalid names
  • test_find_tests_with_unicode_modules: Integration test verifying Unicode-named modules are discovered alongside ASCII ones
  • test_find_test_path_rejects_invalid_dir_name: Tests that directories with invalid identifier names are skipped
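
For readers who want a feel for what a test like test_valid_module_name might cover, here is a minimal sketch using subTest. The class name, the chosen filenames, and the stand-in valid_module_name function are all assumptions; the PR's actual tests are not reproduced here.

```python
import os
import unittest


def valid_module_name(path):
    # Stand-in for the PR's _valid_module_name(), reconstructed from the
    # description; the real helper lives in Lib/unittest/loader.py.
    root, ext = os.path.splitext(path)
    return ext == '.py' and root.isidentifier()


class ValidModuleNameSketch(unittest.TestCase):
    def test_accepts_ascii_and_unicode(self):
        for name in ('test_foo.py', '_private.py', 'café.py', '測試.py'):
            with self.subTest(name=name):
                self.assertTrue(valid_module_name(name))

    def test_rejects_invalid_names(self):
        # Leading digit, hyphen, wrong extension, empty stem, no extension.
        for name in ('1test.py', 'test-foo.py', 'test_foo.txt', '.py', 'test_foo'):
            with self.subTest(name=name):
                self.assertFalse(valid_module_name(name))


# Run the sketch programmatically instead of calling unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ValidModuleNameSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Using subTest keeps each filename as a separately reported case, so a single failing name does not hide the others.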

Prior work

This issue has two stale PRs (#1338 from 2017, #13149 from 2019) that were never merged. This PR incorporates reviewer feedback from those PRs.

Testing

All 1,095 unittest tests pass.

Replace the ASCII-only VALID_MODULE_NAME regex with str.isidentifier()
to support test modules whose names start with non-ASCII Unicode letters
(e.g., 測試.py).

Also add a directory name validation check so that directories with
invalid identifier names (e.g., containing hyphens) are skipped during
package discovery.
@python-cla-bot

python-cla-bot bot commented Feb 15, 2026

All commit authors signed the Contributor License Agreement.

CLA signed


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ac73104fa


# gh-68451: use str.isidentifier() to support Unicode module names,
# rather than a restrictive ASCII-only regex.
root, ext = os.path.splitext(path)
return ext == '.py' and root.isidentifier()


P2: Preserve case-insensitive .py matching in module filter

This new check makes extension validation case-sensitive (ext == '.py'), so files like test_example.PY are now silently skipped even when discovery patterns match them (notably on Windows/case-insensitive filesystems where fnmatch('test*.py') can match test_example.PY). The previous regex was case-insensitive, so this change regresses discovery for existing suites that use uppercase/mixed-case .py extensions.
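
One possible way to address this concern, assuming only the extension comparison needs to change, is to lower-case the extension before comparing. The function name valid_module_name_any_case is hypothetical and this variant is not part of the PR; it is a sketch of the idea rather than a proposed patch.

```python
import os


def valid_module_name_any_case(path):
    # Hypothetical variant, not part of the PR: lower-case the extension
    # before comparing, so '.PY' and '.Py' are accepted alongside '.py'.
    root, ext = os.path.splitext(path)
    return ext.lower() == '.py' and root.isidentifier()


# Mixed-case extensions pass, while the identifier rule still applies.
checks = {
    name: valid_module_name_any_case(name)
    for name in ('test_example.py', 'test_example.PY', 'test-example.PY')
}
```

The stem is still validated with str.isidentifier(), so the Unicode support from the PR is unaffected by this change.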

