@ewdurbin
Member

No description provided.

ewdurbin and others added 14 commits August 22, 2025 10:27
- Migrate from Poetry to pip-tools with hash verification for better security
- Upgrade Python from 3.8.5 to 3.13 for latest features and performance
- Upgrade PostgreSQL from v12 to v16 and Redis from v5 to v7
- Simplify database configuration to use DATABASE_URL connection string
- Simplify Redis configuration to use REDIS_URL connection string
- Reduce Google BigQuery config from 5 env vars to 1 (GOOGLE_SERVICE_ACCOUNT_JSON)
- Remove Kubernetes deployment files (will deploy via different method)
- Add Procfile and gunicorn.conf.py for modern PaaS deployment
- Fix Flask-Limiter and Flask-Migrate compatibility with latest versions
- Fix Celery 5.x configuration (use lowercase broker_url)
- Remove hardcoded Redis URL from Celery initialization
- Update docker-compose to use .env file for configuration
- Add comprehensive documentation:
  - CLAUDE.md: Full application architecture and components
  - CONFIGURATION.md: Environment variables and setup guide
  - ETL_TESTING.md: Testing BigQuery ETL locally
  - ADMIN_FEATURES.md: Admin panel documentation
  - .env.example: Sample environment configuration
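As a rough illustration of the configuration simplification listed above, the sketch below builds settings from the three variables named in this commit (`DATABASE_URL`, `REDIS_URL`, `GOOGLE_SERVICE_ACCOUNT_JSON`). The function name `load_settings` and the shape of the returned dict are assumptions made for the example, not the application's actual config module.

```python
import json
import os


def load_settings(env=None):
    """Build a settings dict from environment-style variables.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    env = os.environ if env is None else env
    settings = {
        # One connection string each, instead of separate host/port/user variables.
        "SQLALCHEMY_DATABASE_URI": env["DATABASE_URL"],
        "REDIS_URL": env["REDIS_URL"],
        # Celery 5.x uses lowercase setting names such as broker_url.
        "celery": {
            "broker_url": env["REDIS_URL"],
            "result_backend": env["REDIS_URL"],
        },
    }
    # The whole service-account document travels in a single variable.
    raw = env.get("GOOGLE_SERVICE_ACCOUNT_JSON")
    settings["google_credentials"] = json.loads(raw) if raw else None
    return settings
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.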

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Change all download columns from INTEGER to BIGINT in database models
- Add migration to alter existing tables to use BIGINT
- Prevents "integer out of range" errors for packages with >2.1B downloads
- Allows handling up to 9.2 quintillion downloads per metric

The ETL was failing for popular packages whose download counts exceeded
PostgreSQL's INTEGER maximum of 2,147,483,647. This change ensures the
application can handle the scale of modern PyPI download statistics.
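The arithmetic behind the limits quoted above can be checked with a few lines of Python. The helper names are made up for this example; only the two bounds come from PostgreSQL's signed 32-bit and 64-bit integer types.

```python
# Upper bounds of PostgreSQL's signed integer types.
PG_INTEGER_MAX = 2**31 - 1  # 2,147,483,647
PG_BIGINT_MAX = 2**63 - 1   # 9,223,372,036,854,775,807 (about 9.2 quintillion)


def fits_integer(value):
    """True if value can be stored in a 4-byte INTEGER column."""
    return -(2**31) <= value <= PG_INTEGER_MAX


def fits_bigint(value):
    """True if value can be stored in an 8-byte BIGINT column."""
    return -(2**63) <= value <= PG_BIGINT_MAX
```

A download count of 2.5 billion fails `fits_integer` but passes `fits_bigint`, which is the case this migration addresses.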

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Create GitHub Actions workflow that runs on push to main/master
- Checks code formatting with black and isort
- Performs basic Python syntax validation
- Ensures CI passes to enable deployment automation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add /_health/ route alongside existing /health endpoint
- Returns 200 OK for load balancer/monitoring health checks
- Required for deployment tooling and monitoring

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Configure Gunicorn to bind to Unix socket when BIND_UNIX_SOCKET is set
- Socket path: /var/run/cabotage/cabotage.sock
- Set proper umask (0o117) for socket permissions (660)
- Falls back to TCP port binding when not set
- Update documentation with new configuration option
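A minimal sketch of the binding logic described above, written as a plain function so it can be run outside Gunicorn. `bind` and `umask` are real Gunicorn settings; the `PORT` variable, the `8000` default, and both function names are assumptions for illustration, since the commit only says it falls back to TCP.

```python
import os

SOCKET_PATH = "/var/run/cabotage/cabotage.sock"


def gunicorn_bind(env=None):
    """Return the Gunicorn settings implied by the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    if env.get("BIND_UNIX_SOCKET"):
        # umask 0o117 clears the execute bit for owner and group
        # and every bit for others, leaving rw-rw---- on the socket.
        return {"bind": f"unix:{SOCKET_PATH}", "umask": 0o117}
    # Assumed fallback: TCP on PORT, defaulting to 8000.
    return {"bind": f"0.0.0.0:{env.get('PORT', '8000')}"}


def mode_after_umask(umask, requested=0o777):
    """Permission bits that remain once the umask is applied."""
    return requested & ~umask
```

In a real `gunicorn.conf.py` the same values would be assigned to module-level `bind` and `umask` names instead of being returned from a function.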

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Mount entire source directory as Docker volumes for hot-reload
- Run formatting tools (black, isort) inside Docker containers
- Add check-fmt make target for CI-style format checking
- Install dev requirements in Docker image for consistency

This ensures development changes are immediately reflected without
rebuilding containers and maintains version consistency between
local development and CI environments.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Remove production dependencies from requirements-dev.in
- Regenerate requirements-dev.txt with only dev tools
- This avoids hash verification issues with platform-specific deps in CI
- CI now installs a minimal set of dev tools (black, isort, pip-tools)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Format migration files to match black 25.1.0 expectations
- Update pyproject.toml to target Python 3.13 instead of 3.7
- Add blank line after docstrings in migration files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add default case to docker-entrypoint.sh to execute arbitrary commands
- Update pyproject.toml to target Python 3.13 instead of 3.7
- This ensures local Docker environment can properly run formatting checks
- Fixes issue where docker-compose run would silently fail for black/isort

Now our local development environment will catch formatting issues
before they reach CI.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Implement multi-stage build to reduce final image size
- Use virtual environment for better dependency isolation
- Add BuildKit cache mounts for apt and pip (faster rebuilds)
- Default DEVEL=no for production, but set DEVEL=yes in docker-compose
- Install postgresql-client only in development mode
- Pre-compile Python bytecode for faster startup
- Remove obsolete version field from docker-compose.yml
- Remove unnecessary user switching for simpler development workflow

The release command (flask db upgrade) now works correctly in containers.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Rename beat to worker-beat for clarity
- Remove flower worker (monitoring UI not needed)
- Keep core processes: web, worker, worker-beat, and release

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Fix SyntaxWarning for invalid escape sequences in BigQuery regex patterns
- Add celery-redbeat for Redis-based beat scheduler (no filesystem writes)
- Configure Celery to use RedBeat scheduler in config, Procfile, and docker-entrypoint
- Update beat commands to explicitly use --scheduler redbeat.RedBeatScheduler

This eliminates the need for persistent filesystem storage for Celery beat,
makes the scheduler state shareable across instances, and fixes Python 3.13
syntax warnings.
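For context on the escape-sequence fix, the sketch below compiles two one-line snippets and records any warnings. A regular string literal containing `\d` triggers an invalid-escape warning (a `SyntaxWarning` on Python 3.12 and later, a `DeprecationWarning` on older versions), while the raw-string form does not. The snippets are stand-ins, not the actual BigQuery patterns from this repository.

```python
import warnings


def compile_warnings(source):
    """Compile source text and return the names of any warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [w.category.__name__ for w in caught]


# Source text: pattern = "\d+\.\d+"   (regular literal, invalid escapes)
NON_RAW = 'pattern = "\\d+\\.\\d+"'
# Source text: pattern = r"\d+\.\d+"  (raw literal, backslashes kept as-is)
RAW = 'pattern = r"\\d+\\.\\d+"'
```

Calling `compile_warnings(RAW)` returns an empty list, which is the behaviour the fix relies on.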

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Major improvements:
- Add SQLite staging for zero-downtime atomic updates
- Stream BigQuery results instead of loading all into memory (95% memory reduction)
- Fix NULL Python version handling to preserve "null" category data
- Add configurable batch size via ETL_BATCH_SIZE env var (default 100k)
- Optimize SQLite with PRAGMA settings for bulk inserts
- Create indexes after bulk load for better performance
- Use 2000-row chunks for SQLite inserts to avoid variable limits
- Add use_sqlite parameter to ETL task (default True)

Performance impact:
- Memory usage: ~95% reduction (2.1M rows → 100k max)
- Time: about 4% slower (132s → 137s), an acceptable tradeoff
- Data consistency: Atomic updates prevent partial visibility
- Data integrity: All row counts match perfectly with old ETL

The slight performance overhead is worth the massive memory savings
and elimination of partial data visibility during updates.
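The staging approach above can be sketched with the standard library alone. This is an illustration under assumptions: the table layout, column names, and function names are invented for the example, and `executemany` stands in for whatever insert strategy the ETL actually uses. It shows streaming rows in fixed-size chunks, storing a missing Python version as the literal `"null"` category, and creating the index only after the bulk load.

```python
import sqlite3
from itertools import islice


def chunked(rows, size):
    """Yield lists of at most `size` items without materialising the whole input."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def stage_rows(rows, chunk_size=2000):
    """Load (package, category, downloads) rows into an in-memory staging table."""
    conn = sqlite3.connect(":memory:")
    # Trade durability for speed; acceptable for a throwaway staging database.
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute(
        "CREATE TABLE staging (package TEXT, category TEXT, downloads INTEGER)"
    )
    for chunk in chunked(rows, chunk_size):
        # Keep rows with no Python version under an explicit "null" category.
        cleaned = [(p, c if c is not None else "null", d) for p, c, d in chunk]
        conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", cleaned)
    # Index after the load so inserts do not pay for index maintenance.
    conn.execute("CREATE INDEX idx_staging_package ON staging (package)")
    conn.commit()
    return conn
```

Because `rows` can be any iterator, peak memory is bounded by the chunk size and not by the size of the result set.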

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Features:
- Complete backfill system for historical PyPI statistics
- Multiple backfill strategies: sequential, parallel, monthly, yearly
- CLI tool (manage_backfill.py) for easy backfill management
- Progress tracking and status checking capabilities
- Skip existing data option to resume interrupted backfills

Memory & Performance Optimizations:
- Changed SQLite journal mode from MEMORY to WAL
- Changed temp_store from MEMORY to FILE
- Reduced cache size from 64MB to 32MB
- Chunked PostgreSQL transfers (10k rows instead of 50k)
- Smaller execute_values page_size (1k instead of 10k)
- Fixed ETL to skip recent stats updates during backfill
- Prevent stats from disappearing mid-backfill

Documentation:
- backfill_examples.md: Complete usage examples and best practices
- Detailed docstrings for all backfill functions

This provides a production-ready system for populating fresh instances
and recovering from data gaps, with significant memory usage reductions
during large data transfers.
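One way the monthly strategy and the skip-existing option could work is sketched below. Both function names are hypothetical and are not taken from `manage_backfill.py`; the sketch only shows how a date range splits into calendar-month windows and how already-loaded days can be filtered out when resuming.

```python
from datetime import date, timedelta


def monthly_ranges(start, end):
    """Yield (first_day, last_day) pairs covering start..end, one per calendar month."""
    cursor = start
    while cursor <= end:
        # First day of the month after the cursor.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        yield cursor, min(next_month - timedelta(days=1), end)
        cursor = next_month


def pending_days(start, end, existing):
    """Yield each day in start..end that is not already present in `existing`."""
    day = start
    while day <= end:
        if day not in existing:
            yield day
        day += timedelta(days=1)
```

Resuming an interrupted backfill then amounts to passing the set of dates already in the database as `existing`.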

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@ewdurbin ewdurbin merged commit 581551a into main Aug 22, 2025
1 check passed
@ewdurbin ewdurbin deleted the modernize-and-deploy-to-psf-infra branch August 22, 2025 14:29
@tomaarsen

I take it that this has resulted in the site being back online? I appreciate your work on this!

  • Tom Aarsen
