High-performance AWS Greengrass v2 component for asynchronous raw data dumping to Parquet format with optimized compression and metadata management for the {____} platform.
{____} AWS Greengrass RawDataDumper Component
The {____} RawDataDumper component is a high-performance AWS Greengrass v2 component designed for asynchronous raw data persistence in industrial IoT environments. It subscribes to filtered data streams and efficiently writes them to Apache Parquet files with optimized compression, intelligent file rotation, and comprehensive metadata management.
- High-Performance Async I/O: Asynchronous processing with configurable batch sizes for optimal throughput
- Parquet Optimization: ZSTD compression, dictionary encoding, and comprehensive statistics
- Data Integrity: Complete metadata preservation and data provenance tracking
- Intelligent File Rotation: Rotation based on row count, time intervals, or idle periods
- Resource Management: Graceful shutdown and proper cleanup mechanisms
- Industrial Grade: Designed for 24/7 operation in harsh industrial environments
- Asynchronous Processing: High-throughput async pipeline with queue management
- Batch Optimization: Configurable batch processing for optimal I/O performance
- Memory Efficiency: Streaming operations with minimal memory footprint
- Schema Evolution: Dynamic schema handling and validation
- Error Recovery: Robust error handling with file corruption prevention
- Intelligent Rotation: Row count, time-based, and idle time rotation strategies
- Directory Organization: Hierarchical date-based directory structure
- Metadata Sidecar: JSON sidecar files with comprehensive metadata
- Temporary File Handling: Safe atomic file operations with .tmp extensions
- Cleanup Operations: Automatic cleanup of temporary and orphaned files
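The .tmp-and-rename pattern and the three rotation triggers above can be sketched as follows (function names and thresholds are illustrative, not the component's actual API):

```python
import json
import os
import time


def atomic_write_sidecar(final_path: str, metadata: dict) -> None:
    """Write JSON metadata to <final_path>.tmp, then rename into place.

    os.replace() is atomic on POSIX file systems, so readers never
    observe a half-written sidecar file.
    """
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(metadata, f)
        f.flush()
        os.fsync(f.fileno())          # ensure bytes are on disk before rename
    os.replace(tmp_path, final_path)


def should_rotate(rows: int, opened_at: float, last_write_at: float,
                  *, max_rows: int = 100_000, max_age_s: float = 3600.0,
                  max_idle_s: float = 300.0) -> bool:
    """Rotate on row count, file age, or idle time, whichever fires first."""
    now = time.time()
    return (rows >= max_rows
            or now - opened_at >= max_age_s
            or now - last_write_at >= max_idle_s)
```

A crashed write leaves only a stale .tmp file behind, which the cleanup pass can safely delete.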
- IPC Communication: Native Greengrass IPC for high-performance data streaming
- Topic Subscription: Reliable subscription to filter output topics
- Asynchronous Operations: Non-blocking message handling and processing
- Performance Monitoring: Built-in statistics and health monitoring
- ZSTD Compression: High-performance compression with configurable levels
- Dictionary Encoding: Optimized encoding for categorical data
- Batch Processing: Configurable batch sizes for optimal write performance
- Async File I/O: Optional async file operations for better resource utilization
- Statistics Collection: Comprehensive performance metrics and logging
┌─────────────────┐    ┌───────────────────┐    ┌─────────────────┐
│  Filter Output  │───▶│   RawDataDumper   │───▶│  Parquet Files  │
│  (IPC Stream)   │    │     Component     │    │ + JSON Metadata │
└─────────────────┘    └───────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Performance   │
                       │   Statistics    │
                       └─────────────────┘
- Dumper Manager (src/dumper_manager.py) - Asynchronous data processing and buffering engine
- Parquet file writing with optimization settings
- File rotation and lifecycle management
- Performance metrics collection and monitoring
- Schema management and validation
- Main Service (src/main.py) - Component lifecycle management and orchestration
- Command-line argument parsing and configuration
- IPC subscription handling and event processing
- Signal handling and graceful shutdown
- Logging configuration and management
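Signal handling and graceful shutdown in an asyncio service typically follow the pattern below (the names and structure are illustrative, not src/main.py's actual code):

```python
import asyncio
import signal


def install_shutdown_handlers(loop: asyncio.AbstractEventLoop,
                              stop: asyncio.Event) -> None:
    """Translate SIGINT/SIGTERM into an event instead of killing the process."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(sig, stop.set)


async def run_service() -> str:
    stop = asyncio.Event()
    install_shutdown_handlers(asyncio.get_running_loop(), stop)
    # ... start IPC subscription and dumper tasks here ...
    await stop.wait()   # park until a termination signal arrives
    # On shutdown: flush open batches, close writers, finalize .tmp files.
    return "clean shutdown"
```

Routing signals through an event gives in-flight batches a chance to be flushed before the process exits.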
- Stream Subscription: Subscribe to filter output via AWS Greengrass IPC
- Message Queuing: Asynchronous message queuing with overflow protection
- Batch Processing: Accumulate data in configurable batch sizes
- Parquet Writing: Write batches to optimized Parquet files with compression
- File Management: Handle rotation and cleanup based on configured policies
- Metadata Generation: Create JSON sidecar files with comprehensive metadata
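The subscribe → queue → batch → write flow above reduces to this toy pipeline (the real component writes Parquet files rather than collecting lists, and all names are illustrative):

```python
import asyncio


async def dump_pipeline(messages, batch_size: int = 3):
    """Queue incoming messages and flush them in fixed-size batches."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)  # overflow protection
    written = []                                       # stand-in for Parquet files

    async def producer():
        for msg in messages:               # stand-in for the IPC subscription
            await queue.put(msg)
        await queue.put(None)              # sentinel: stream closed

    async def consumer():
        batch = []
        while (msg := await queue.get()) is not None:
            batch.append(msg)
            if len(batch) >= batch_size:   # batch full → write it out
                written.append(list(batch))
                batch.clear()
        if batch:                          # final flush on shutdown
            written.append(list(batch))

    await asyncio.gather(producer(), consumer())
    return written
```

The bounded queue applies backpressure to the producer instead of letting memory grow without limit.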
- Processor: x86_64 architecture (Intel/AMD)
- Memory: Minimum 2GB RAM (4GB+ recommended for high-throughput processing)
- Storage: Sufficient space for data files and rotation buffers (SSD recommended)
- I/O Performance: Fast storage recommended for optimal write performance
- Operating System: Linux amd64 (tested on Ubuntu 20.04+ LTS)
- Python: 3.8 or higher
- AWS Greengrass Core: Nucleus Classic v2.8.0 or higher
- File System: ext4, xfs, or other POSIX-compliant file system
- {____}: Provides filtered data streams for dumping
- aws.greengrass.ipc.pubsub: Required for inter-component communication
- Apache Arrow/Parquet: PyArrow library for Parquet file operations
# Clone the repository
git clone <repository-url>
cd <repository-name>
# Setup development environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 -m pip install -U git+https://github.com/aws-greengrass/[email protected]

# Deploy locally (for dev & testing)
./local_deploy.sh
# Remove local deployment
./remove_local_deploy.sh

# Build component package
gdk component build

# Publish component package
gdk component publish

# Subscribe to a topic to verify output
sudo /greengrass/v2/bin/greengrass-cli pubsub sub -t /TOPIC_NAME/

Where TOPIC_NAME is the name of the topic you want to subscribe to.
Branch model (left to right = promotion path): feature branch → dev → staging → main
Environments and version patterns:
- feature branches: isolated feature / fix work (no version tagging)
- dev: integration / active development (pre-release versions vX.Y.Z-alpha.N)
- staging: pre-production verification (pre-release versions vX.Y.Z-beta.N)
- main: production (stable releases vX.Y.Z)
Rules:
- Only merge into dev via PR from feature branches
- staging only receives squash merges from dev
- main only receives fast-forward or squash from staging (no direct commits)
- Hotfix (critical) may branch from main, then merge back to staging and dev
semantic-release (single source of truth) runs on push to dev, staging, main:
- Analyze commits (Conventional Commits) to determine version bump:
- BREAKING CHANGE: / feat! → major
- feat: → minor
- fix / perf / refactor: → patch
- Other types (docs, chore, ci, style, test) → no version bump
- Prepare phase writes the computed version to:
- gdk-config.json (component version)
- recipe.yaml (ComponentVersion)
- Generate changelog + Git tag:
- dev → vX.Y.Z-alpha.N
- staging → vX.Y.Z-beta.N
- main → vX.Y.Z
- Create GitHub Release / Pre-release with artifacts
- Build & publish to AWS Greengrass ONLY for staging and main
- No separate version-bump workflow; remove any legacy bumps
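Under these rules, the semantic-release branch configuration could look roughly like this sketch (the plugin list and the prepare script path are assumptions, not the repository's actual config):

```json
{
  "branches": [
    "main",
    { "name": "staging", "prerelease": "beta" },
    { "name": "dev", "prerelease": "alpha" }
  ],
  "plugins": [
    "@semantic-release/commit-analyzer",
    "@semantic-release/release-notes-generator",
    "@semantic-release/changelog",
    ["@semantic-release/exec", {
      "prepareCmd": "./scripts/write_version.sh ${nextRelease.version}"
    }],
    ["@semantic-release/git", {
      "assets": ["CHANGELOG.md", "gdk-config.json", "recipe.yaml"]
    }],
    "@semantic-release/github"
  ]
}
```

Here the hypothetical write_version.sh stands in for whatever step copies the computed version into gdk-config.json and recipe.yaml before the release commit.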
ci.yml (runs on every PR and push):
- Formatting / style: black, isort, flake8
- (Soft) typing: mypy (non-blocking initially)
- JSON / YAML schema & syntax validation
- Greengrass component structure & recipe consistency checks
- (Soft) security & secret scanning (upgradeable to blocking later)
- Fails fast on critical structural or syntax errors
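A ci.yml implementing those checks might start out like this sketch (job names and action versions are illustrative; the schema, recipe, and security checks are omitted for brevity):

```yaml
name: CI
on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.8"
      - run: pip install black isort flake8 mypy
      - run: black --check src/          # formatting
      - run: isort --check-only src/     # import ordering
      - run: flake8 src/                 # style / lint
      - run: mypy src/ || true           # soft typing: non-blocking initially
```

The `|| true` suffix is what makes the mypy step non-blocking; dropping it later upgrades typing to a hard gate.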
Format: &lt;type&gt;[optional scope]: &lt;description&gt;

[optional body]

[optional footer(s)]
Common types:
- feat: new feature (triggers minor)
- fix: bug fix (patch)
- perf: performance improvement (patch)
- refactor: code change w/o behavior change (patch)
- docs, style, test, chore, ci: no version bump by default
Breaking changes:
- Use feat! / refactor! suffix OR
- Add footer: BREAKING CHANGE: explanation
Recommended workflow:
- Use squash merge
- Ensure PR title is a valid Conventional Commit (final squashed commit = PR title)
Examples:
- feat: add new sensor data filtering algorithm
- fix: resolve memory leak in data processing loop
- feat!: change API response format
- feat: change API response format
  BREAKING CHANGE: response changed from array to object
Automated (preferred):
- Develop on feature branch → open PR to dev
- Ensure commits / PR title comply with Conventional Commits
- Merge → semantic-release assigns pre-release (alpha) on dev
- Promote dev → staging → beta release (build + publish to Greengrass)
- Promote staging → main → stable release (build + publish)
Manual trigger (if needed):
- Go to Actions: "Release (semantic-release → build → publish)"
- Select target branch (dev / staging / main)
- Workflow auto-detects:
- dev: pre-release alpha (no publish to AWS)
- staging: pre-release beta (publish)
- main: stable (publish)
Safeguards:
- Never edit versions manually
- All version fields are derived; only the workflow commits the version files it modifies
- Avoid tagging manually to prevent semantic-release drift
Operational Notes:
- alpha churn is expected; do not pin production to alpha/beta
- Consumers should pin only stable main releases unless testing
- Monitor release notes for BREAKING CHANGE sections
Copyright (c) 2025 imCloud Co., Ltd. All rights reserved.
This software is proprietary and confidential. Unauthorized copying, modification, distribution, or use of this software is strictly prohibited.