High-performance AWS Greengrass v2 component for asynchronous raw data dumping to Parquet format with optimized compression and metadata management for the {____} platform.
{____} AWS Greengrass RawDataDumper Component
The {____} RawDataDumper component is a high-performance AWS Greengrass v2 component designed for asynchronous raw data persistence in industrial IoT environments. It subscribes to filtered data streams and efficiently writes them to Apache Parquet files with optimized compression, intelligent file rotation, and comprehensive metadata management.
- High-Performance Async I/O: Asynchronous processing with configurable batch sizes for optimal throughput
- Parquet Optimization: ZSTD compression, dictionary encoding, and comprehensive statistics
- Data Integrity: Complete metadata preservation and data provenance tracking
- Intelligent File Rotation: Rotation based on row count, time intervals, or idle periods
- Resource Management: Graceful shutdown and proper cleanup mechanisms
- Industrial Grade: Designed for 24/7 operation in harsh industrial environments
- Asynchronous Processing: High-throughput async pipeline with queue management
- Batch Optimization: Configurable batch processing for optimal I/O performance
- Memory Efficiency: Streaming operations with minimal memory footprint
- Schema Evolution: Dynamic schema handling and validation
- Error Recovery: Robust error handling with file corruption prevention
- Intelligent Rotation: Row count, time-based, and idle time rotation strategies
- Directory Organization: Hierarchical date-based directory structure
- Metadata Sidecar: JSON sidecar files with comprehensive metadata
- Temporary File Handling: Safe atomic file operations with .tmp extensions
- Cleanup Operations: Automatic cleanup of temporary and orphaned files
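The .tmp-and-rename pattern and the three rotation triggers above can be sketched as follows (function names and thresholds are illustrative, not the component's actual API):

```python
import json
import os
import time


def atomic_write_sidecar(final_path: str, metadata: dict) -> None:
    """Write JSON metadata to <final_path>.tmp, then rename into place.

    os.replace() is atomic on POSIX file systems, so readers never
    observe a half-written sidecar file.
    """
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(metadata, f)
        f.flush()
        os.fsync(f.fileno())          # ensure bytes are on disk before rename
    os.replace(tmp_path, final_path)


def should_rotate(rows: int, opened_at: float, last_write_at: float,
                  *, max_rows: int = 100_000, max_age_s: float = 3600.0,
                  max_idle_s: float = 300.0) -> bool:
    """Rotate on row count, file age, or idle time, whichever fires first."""
    now = time.time()
    return (rows >= max_rows
            or now - opened_at >= max_age_s
            or now - last_write_at >= max_idle_s)
```

A crashed write leaves only a stale .tmp file behind, which the cleanup pass can safely delete.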
- IPC Communication: Native Greengrass IPC for high-performance data streaming
- Topic Subscription: Reliable subscription to filter output topics
- Asynchronous Operations: Non-blocking message handling and processing
- Performance Monitoring: Built-in statistics and health monitoring
- ZSTD Compression: High-performance compression with configurable levels
- Dictionary Encoding: Optimized encoding for categorical data
- Batch Processing: Configurable batch sizes for optimal write performance
- Async File I/O: Optional async file operations for better resource utilization
- Statistics Collection: Comprehensive performance metrics and logging
┌─────────────────┐    ┌───────────────────┐    ┌─────────────────┐
│  Filter Output  │───▶│   RawDataDumper   │───▶│  Parquet Files  │
│  (IPC Stream)   │    │     Component     │    │ + JSON Metadata │
└─────────────────┘    └───────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Performance   │
                       │   Statistics    │
                       └─────────────────┘
- Dumper Manager (src/dumper_manager.py) - Asynchronous data processing and buffering engine
- Parquet file writing with optimization settings
- File rotation and lifecycle management
- Performance metrics collection and monitoring
- Schema management and validation
- Main Service (src/main.py) - Component lifecycle management and orchestration
- Command-line argument parsing and configuration
- IPC subscription handling and event processing
- Signal handling and graceful shutdown
- Logging configuration and management
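Signal handling and graceful shutdown in an asyncio service typically follow the pattern below (the names and structure are illustrative, not src/main.py's actual code):

```python
import asyncio
import signal


def install_shutdown_handlers(loop: asyncio.AbstractEventLoop,
                              stop: asyncio.Event) -> None:
    """Translate SIGINT/SIGTERM into an event instead of killing the process."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)


async def run_service() -> str:
    stop = asyncio.Event()
    install_shutdown_handlers(asyncio.get_running_loop(), stop)
    # ... start IPC subscription and dumper tasks here ...
    await stop.wait()   # park until a termination signal arrives
    # On shutdown: flush open batches, close writers, finalize .tmp files.
    return "clean shutdown"
```

Routing signals through an event gives in-flight batches a chance to be flushed before the process exits.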
- Stream Subscription: Subscribe to filter output via AWS Greengrass IPC
- Message Queuing: Asynchronous message queuing with overflow protection
- Batch Processing: Accumulate data in configurable batch sizes
- Parquet Writing: Write batches to optimized Parquet files with compression
- File Management: Handle rotation and cleanup based on configured policies
- Metadata Generation: Create JSON sidecar files with comprehensive metadata
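The subscribe → queue → batch → write flow above reduces to this toy pipeline (the real component writes Parquet files rather than collecting lists, and all names are illustrative):

```python
import asyncio


async def dump_pipeline(messages, batch_size: int = 3):
    """Queue incoming messages and flush them in fixed-size batches."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # overflow protection
    written = []                                       # stand-in for Parquet files

    async def producer():
        for msg in messages:               # stand-in for the IPC subscription
            await queue.put(msg)
        await queue.put(None)              # sentinel: stream closed

    async def consumer():
        batch = []
        while (msg := await queue.get()) is not None:
            batch.append(msg)
            if len(batch) >= batch_size:   # batch full → write it out
                written.append(list(batch))
                batch.clear()
        if batch:                          # final flush on shutdown
            written.append(list(batch))

    await asyncio.gather(producer(), consumer())
    return written
```

The bounded queue applies backpressure to the producer instead of letting memory grow without limit.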
- Processor: x86_64 architecture (Intel/AMD)
- Memory: Minimum 2GB RAM (4GB+ recommended for high-throughput processing)
- Storage: Sufficient space for data files and rotation buffers (SSD recommended)
- I/O Performance: Fast storage recommended for optimal write performance
- Operating System: Linux amd64 (tested on Ubuntu 20.04+ LTS)
- Python: 3.8 or higher
- AWS Greengrass Core: Nucleus Classic v2.8.0 or higher
- File System: ext4, xfs, or other POSIX-compliant file system
- {____}: Provides filtered data streams for dumping
- aws.greengrass.ipc.pubsub: Required for inter-component communication
- Apache Arrow/Parquet: PyArrow library for Parquet file operations
# Clone the repository
git clone <repository-url>
cd <repository-name>
# Setup development environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 -m pip install -U git+https://github.com/aws-greengrass/[email protected]

# Deploy locally (for dev & testing)
./local_deploy.sh
# Remove local deployment
./remove_local_deploy.sh

# Build component package
gdk component build

# Publish component package
gdk component publish

# Subscribe to a topic to verify output
sudo /greengrass/v2/bin/greengrass-cli pubsub sub -t /TOPIC_NAME/

Where TOPIC_NAME is the name of the topic you want to subscribe to.
Branch model (left to right = promotion path): feature branch → dev → staging → main
Environments and version patterns:
- feature branches: isolated feature / fix work (no version tagging)
- dev: integration / active development (pre-release versions vX.Y.Z-alpha.N)
- staging: pre-production verification (pre-release versions vX.Y.Z-beta.N)
- main: production (stable releases vX.Y.Z)
Rules:
- Only merge into dev via PR from feature branches
- staging only receives squash merges from dev
- main only receives fast-forward or squash from staging (no direct commits)
- Hotfix (critical) may branch from main, then merge back to staging and dev
semantic-release (single source of truth) runs on push to dev, staging, main:
- Analyze commits (Conventional Commits) to determine version bump:
- BREAKING CHANGE: / feat! → major
- feat: → minor
- fix / perf / refactor: → patch
- Other types (docs, chore, ci, style, test) → no version bump
- Prepare phase writes the computed version to:
- gdk-config.json (component version)
- recipe.yaml (ComponentVersion)
- Generate changelog + Git tag:
- dev → vX.Y.Z-alpha.N
- staging → vX.Y.Z-beta.N
- main → vX.Y.Z
- Create GitHub Release / Pre-release with artifacts
- Build & publish to AWS Greengrass ONLY for staging and main
- No separate version-bump workflow; remove any legacy bumps
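Under these rules, the semantic-release branch configuration could look roughly like this sketch (the plugin list and the prepare script path are assumptions, not the repository's actual config):

```json
{
  "branches": [
    "main",
    { "name": "staging", "prerelease": "beta" },
    { "name": "dev", "prerelease": "alpha" }
  ],
  "plugins": [
    "@semantic-release/commit-analyzer",
    "@semantic-release/release-notes-generator",
    "@semantic-release/changelog",
    ["@semantic-release/exec", {
      "prepareCmd": "./scripts/write_version.sh ${nextRelease.version}"
    }],
    ["@semantic-release/git", {
      "assets": ["CHANGELOG.md", "gdk-config.json", "recipe.yaml"]
    }],
    "@semantic-release/github"
  ]
}
```

Here the hypothetical write_version.sh stands in for whatever step copies the computed version into gdk-config.json and recipe.yaml before the release commit.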
ci.yml (runs on every PR and push):
- Formatting / style: black, isort, flake8
- (Soft) typing: mypy (non-blocking initially)
- JSON / YAML schema & syntax validation
- Greengrass component structure & recipe consistency checks
- (Soft) security & secret scanning (upgradeable to blocking later)
- Fails fast on critical structural or syntax errors
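A ci.yml implementing those checks might start out like this sketch (job names and action versions are illustrative; the schema, recipe, and security checks are omitted for brevity):

```yaml
name: CI
on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.8"
      - run: pip install black isort flake8 mypy
      - run: black --check src/          # formatting
      - run: isort --check-only src/     # import ordering
      - run: flake8 src/                 # style / lint
      - run: mypy src/ || true           # soft typing: non-blocking initially
```

The `|| true` suffix is what makes the mypy step non-blocking; dropping it later upgrades typing to a hard gate.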
Format: &lt;type&gt;[optional scope]: &lt;description&gt;

[optional body]

[optional footer(s)]
Common types:
- feat: new feature (triggers minor)
- fix: bug fix (patch)
- perf: performance improvement (patch)
- refactor: code change w/o behavior change (patch)
- docs, style, test, chore, ci: no version bump by default
Breaking changes:
- Use feat! / refactor! suffix OR
- Add footer: BREAKING CHANGE: explanation
Recommended workflow:
- Use squash merge
- Ensure PR title is a valid Conventional Commit (final squashed commit = PR title)
Examples:
- feat: add new sensor data filtering algorithm
- fix: resolve memory leak in data processing loop
- feat!: change API response format
- feat: change API response format
  BREAKING CHANGE: response changed from array to object
Automated (preferred):
- Develop on feature branch → open PR to dev
- Ensure commits / PR title comply with Conventional Commits
- Merge → semantic-release assigns pre-release (alpha) on dev
- Promote dev → staging → beta release (build + publish to Greengrass)
- Promote staging → main → stable release (build + publish)
Manual trigger (if needed):
- Go to Actions: "Release (semantic-release → build → publish)"
- Select target branch (dev / staging / main)
- Workflow auto-detects:
- dev: pre-release alpha (no publish to AWS)
- staging: pre-release beta (publish)
- main: stable (publish)
Safeguards:
- Never edit versions manually
- All version fields are derived; only the workflow commits the version files it modifies
- Avoid tagging manually to prevent semantic-release drift
Operational Notes:
- alpha churn is expected; do not pin production to alpha/beta
- Consumers should pin only stable main releases unless testing
- Monitor release notes for BREAKING CHANGE sections
Copyright (c) 2025 imCloud Co., Ltd. All rights reserved.
This software is proprietary and confidential. Unauthorized copying, modification, distribution, or use of this software is strictly prohibited.