
Commit 1201af2

doc: implement a memory bank for AI agents

1 parent: d764db7

20 files changed, +640 -21 lines changed

README.md

Lines changed: 67 additions & 21 deletions
@@ -1,28 +1,46 @@
 # pg_fusion

-Currently, PostgreSQL operates a row-based engine built on the Volcano
-architecture. While this is a good solution for OLTP workloads, it is
-very slow for analytical processing. Modern OLAP engines leverage columnar
-data representation in memory to enable SIMD optimizations and take advantage
-of data locality in CPU caches.
+pg_fusion is a PostgreSQL extension that delegates query execution to an external Apache DataFusion runtime. The extension hooks PG planning/execution, streams heap pages via shared memory to the runtime, and streams back results as wire-friendly MinimalTuple frames.

-Additionally, PostgreSQL operates on a process-based model, where each process
-executes a single thread. The issue with this approach is that these threads
-handle not only data processing but also I/O tasks: reading blocks from disk
-into a shared memory cache, and communicating with clients over the network.
-As a result, the operating system scheduler frequently removes threads from CPU
-cores and later reinstates them, which negatively impacts TLB efficiency.
+Why: PostgreSQL's Volcano (row-at-a-time) engine is great for OLTP but slow for OLAP. DataFusion provides a modern, vectorized execution engine in Rust. pg_fusion integrates it in-process via a background worker and shared-memory IPC.

-These limitations make historical analytics in PostgreSQL a slow and cumbersome
-process. In comparison, DataFusion processes queries an order of magnitude faster.
-This leads to the hypothesis that PostgreSQL users need a CPU-efficient engine
-capable of significantly accelerating most read-heavy queries on heap tables.
-This is the motivation behind the development of `pg_fusion`.
+## Architecture (high level)

-## How to run
+- Extension (pgrx) sits in the backend process and drives Parse/Bind/Optimize/Translate/Begin/Exec/End.
+- Executor (runtime) runs DataFusion; `PgTableProvider → PgScanExec → PgScanStream` scans heap pages from shared memory and produces Arrow RecordBatches.
+- Protocol defines control/data packets and a compact wire tuple format with explicit alignment.

-After installing `postgres`, you need to set up `rustup`, `cargo-pgrx` to build
-the extension.
+See `ai/memory/architecture.md` and the component notes in `ai/memory/components/`.
+
+## Repository layout
+
+- `postgres/` — pgrx extension (hooks, IPC, TupleTableSlot fill)
+- `executor/` — DataFusion runtime (planning/execution, SHM access, result encoding)
+- `protocol/` — shared packets and wire formats
+- `storage/` — heap page reader + zero-allocation tuple decoder to DataFusion `ScalarValue`
+- `common/` — shared errors/types
+
+## Build & test
+
+The workspace targets Rust 1.89.
+
+Basics:
+
+```
+cargo check --workspace
+cargo test --workspace
+```
+
+Lint/format:
+
+```
+cargo fmt --all
+cargo clippy --all-targets --features "pg17, pg_test" --no-default-features
+```
+
+## pgrx quickstart (PG 17)
+
+Install and initialize pgrx:

 ```
 # install rustup
@@ -34,9 +52,37 @@ cargo install cargo-pgrx
 # configure pgrx
 cargo pgrx init --pg17 $(which pg_config)

-# append the extension to shared_preload_libraries in ~/.pgrx/data-17/postgresql.conf
+# enable extension in the dev cluster
 echo "shared_preload_libraries = 'pg_fusion'" >> ~/.pgrx/data-17/postgresql.conf

-# run cargo-pgrx to build and install the extension
+# run the dev cluster and build/install the extension
 cargo pgrx run
 ```
+
+Extension-specific commands:
+
+```
+# build only the extension crate
+cargo build -p pg_fusion
+
+# run pgrx tests for storage's pg_test
+cargo pgrx test pg17 -p pg_test
+```
+
+## Developer guidelines
+
+- Rust 2021; keep changes small and focused; surface structured errors (no panics in extension paths).
+- Before PR: `cargo fmt`, `cargo clippy -D warnings`, `cargo test --workspace`.
+- Commit style: `area: concise change` (e.g., `executor: fix buffer rollback`).
+
+## Memory bank for agents (RAG)
+
+We maintain a human-readable "memory bank" for agents and humans under `/ai/memory` (Markdown + YAML frontmatter). Start with:
+
+- `/ai/memory/index.md` — how to read the bank
+- `/ai/memory/architecture.md` — overview
+- `/ai/memory/components/` — component facts
+- `/ai/memory/decisions/` — ADR-lite decisions
+- `/ai/memory/invariants.md` — project invariants
+
+Agent workflow requirement: after you implement or change behavior, update the relevant files under `/ai/memory` (components, decisions, invariants, architecture) so future agents have accurate context.
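A note on the control path named in the README's architecture bullets: the sketch below spells the phase order out as a Rust enum. It is illustrative only; the `ControlPhase` and `next` names are invented here, and the real packet/state definitions live in the `protocol/` and `postgres/` crates.

```rust
// Illustrative only: the control phases the backend drives, in order.
// Names are hypothetical; the actual messages are defined in `protocol/`.
enum ControlPhase {
    Parse,     // hand the query to the runtime
    Bind,      // bind parameters / resolve columns
    Optimize,  // optimize the logical plan
    Translate, // produce a physical plan
    Begin,     // register scan channels and SHM slots
    Exec,      // start execution and stream result frames
    End,       // reset scan state
}

// A query advances strictly forward through the phases.
fn next(phase: ControlPhase) -> Option<ControlPhase> {
    use ControlPhase::*;
    match phase {
        Parse => Some(Bind),
        Bind => Some(Optimize),
        Optimize => Some(Translate),
        Translate => Some(Begin),
        Begin => Some(Exec),
        Exec => Some(End),
        End => None,
    }
}
```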

ai/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Codex Memory Model for This Repo

This repository adopts a human-readable memory model for agents and humans.

Structure:

```
/ai/
  memory/      # invariants, decisions, architecture, components
  log/         # project log (optional, day-by-day)
  README.md    # this file — how to use the memory bank
```

Agent usage quickstart:

- Read `/ai/memory/index.md` first.
- For architecture answers: load `architecture.md`, relevant component memos, and all invariants with `importance >= 0.8`.
- For code generation: respect all invariants in `/ai/memory`.
- If an invariant would be violated, state which one and propose an alternative.

See `/ai/memory/CODEX_MEMORY_MODEL.md` for the complete model description.

Agent workflow requirement:

- After implementing or changing behavior, update the corresponding files under `/ai/memory` (components, decisions, invariants, architecture). Keep the memory bank current so future agents have accurate context.
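As a sketch of how the quickstart's importance filter could be implemented mechanically: the snippet below splits off a fragment's YAML frontmatter and keeps invariants with `importance >= 0.8`. It assumes the `serde` and `serde_yaml` crates; the `frontmatter` and `critical_invariants` helpers are invented for the example, and only the field names come from the memory record format.

```rust
use serde::Deserialize;

// Frontmatter fields, matching the memory record format.
#[derive(Deserialize)]
struct Front {
    id: String,
    #[serde(rename = "type")]
    kind: String, // invariant | decision | fact | gotcha | todo | note
    importance: f64,
}

/// Parse the YAML between the leading `---` markers of a fragment.
fn frontmatter(doc: &str) -> Option<Front> {
    let rest = doc.strip_prefix("---")?;
    let (yaml, _body) = rest.split_once("\n---")?;
    serde_yaml::from_str(yaml).ok()
}

/// Keep only high-importance invariants, per the quickstart rule.
fn critical_invariants(docs: &[String]) -> Vec<Front> {
    docs.iter()
        .filter_map(|doc| frontmatter(doc))
        .filter(|f| f.kind == "invariant" && f.importance >= 0.8)
        .collect()
}
```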

ai/log/.gitkeep

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@

ai/memory/CODEX_MEMORY_MODEL.md

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
Below is a clean, compact memory dump for Codex/LLM agents that captures the key ideas of a "human-readable memory bank" and "when to start a vector index." Use it as a system prompt, an agent README, or a memory-bank insert.

---

# 📦 Codex Memory Model — Dump

## 1. Big Picture

Project memory should be human-readable, durable, and agent-friendly.
Primary format: Markdown + YAML frontmatter.
Top directory:

```
/ai/
  memory/      # invariants, ADR/decisions, architecture
  log/         # project diary
  README.md    # how to use this
```

---

## 2. Memory Record Format (Knowledge Fragments)

Every memory fragment is Markdown with YAML frontmatter.

```markdown
---
id: inv-raft-001
type: invariant        # invariant | decision | fact | gotcha | todo | note
scope: planner
tags: ["raft", "sharding"]
updated_at: "2025-11-29"
importance: 0.95       # 0..1 — how critical it is to follow
---

# Invariant: The Planner Operates Within a Single Raft Group

(human-readable explanation)
```

Why this format:

- readable as documentation,
- easy to parse by agents,
- YAML provides the structure,
- Markdown provides narrative and context.

---

## 3. Memory Types

Use a small fixed set:

- `invariant` — a principle that must not be violated
- `decision` — an architectural decision (ADR-lite)
- `fact` — important information about the system
- `gotcha` — pitfalls and caveats
- `todo` — long-lived improvements, not tasks
- `note` — useful observations

This tells agents where to look: for architecture, `invariant` and `decision`; for debugging, `gotcha`; for system analysis, `fact`.

---

## 4. Separate Memory Files

Example layout:

```
/ai/memory/index.md
/ai/memory/invariants.md
/ai/memory/architecture.md
/ai/memory/components/planner.md
/ai/memory/components/storage.md
/ai/memory/decisions/0001-sharding-model.md
/ai/memory/decisions/0002-datafusion-integration.md
```

`index.md` describes the structure and reading order.

---

## 5. Instructions for LLM Agents

Agents must:

1. Read `/ai/memory/index.md` first.
2. Before architecture answers, load:
   - `architecture.md`
   - relevant components
   - all `invariant` records with `importance >= 0.8`
3. Before code generation, honor all invariants in `/ai/memory`.
4. If an invariant would be violated, name it and propose an alternative.
5. Avoid stale notes (`updated_at` too old, or a `deprecated` flag if present).

---

## 6. When to Start a Vector Index

Two layers of indexing:

### Layer A — Vector Index of Memory (`/ai/memory`)

Start early, after ~10–20 meaningful notes. Cheap, stable, useful.

### Layer B — Vector Index of Source Code

Start when two or three of these conditions hold:

- multiple subsystems/crates,
- the architecture has relatively stabilized,
- answering "where is X implemented?" requires significant search,
- the context no longer fits in a single prompt.

Initial scope only:

1. public APIs,
2. key modules (planner/storage/executor),
3. doc comments and gotchas.

Grow beyond this as the project scales.

---

## 7. Why Not Start Too Early

- code structure changes rapidly → the index rots quickly;
- noise outweighs signal;
- maintenance cost exceeds value early on;
- while the project is small, LSP/grep is enough.

---

## 8. Practical Timeline

### Phase 1: first weeks

Create `/ai/memory/*.md` with no index yet.

### Phase 2: 10–20 memory fragments exist

Start the memory vector index (still not code).

### Phase 3: architecture stabilized, project grew

Start the source-code index.

---

## 9. Project Log (Optional)

In `/ai/log/YYYY-MM-DD.md` keep human-readable diaries:

- key decisions,
- observations,
- issues.

This is a RAG data source, but not part of the invariants.

---

# ✔ Recommended Standard of Memory for Codex/LLM Agents

Optionally, we can:

- generate templates (a memory template generator),
- provide a JSON Schema for frontmatter validation,
- generate an example `/ai/memory` for your project (pg_fusion / picodata),
- suggest an index refresh strategy (pre-commit hook + partial refresh).
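Until that JSON Schema exists, the two structural rules the format already implies (a `type` drawn from the fixed set, and `importance` within 0..1) can be checked directly in code. A minimal sketch; the `validate` helper is invented for the example:

```rust
// Hypothetical frontmatter validator for the rules stated in this document.
const KINDS: [&str; 6] = ["invariant", "decision", "fact", "gotcha", "todo", "note"];

fn validate(kind: &str, importance: f64) -> Result<(), String> {
    // `type` must be one of the fixed set of memory types.
    if !KINDS.contains(&kind) {
        return Err(format!("unknown memory type: {kind}"));
    }
    // `importance` must lie in the documented 0..1 range.
    if !(0.0..=1.0).contains(&importance) {
        return Err(format!("importance {importance} is outside 0..1"));
    }
    Ok(())
}
```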

ai/memory/architecture.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
---
id: arch-overview-0001
type: fact
scope: repo
tags: ["architecture", "datafusion", "pgrx", "shared-memory", "ipc"]
updated_at: "2025-11-29"
importance: 0.8
---

# pg_fusion Architecture Overview

In short: a PostgreSQL (pgrx) extension intercepts planning/execution and delegates to a separate Apache DataFusion runtime. Communication uses shared memory (lock-free rings + slot buffers for heap pages). The wire protocol is defined under `protocol/`.

## Top-Level Directories

- `postgres/`: pgrx extension — plan/execute, IPC with the runtime, builds Slot/MinimalTuple.
- `executor/`: DataFusion runtime — parse/optimize/plan/execute, PgScanExec/Stream, encodes results into wire tuples.
- `protocol/`: control and data messages, wire tuple/attribute formats, types.
- `storage/`: low-level heap page reader and attribute decoder to ScalarValue.
- `common/`: shared errors/types (FusionError).

## Control Path

1. Parse → Metadata → Compile (logical plan)
2. Bind (Columns) → Optimize → Translate (physical plan)
3. BeginScan (register channels/slots) → ExecScan (start) → ExecReady
4. EndScan (state reset)

## Data Path

- The executor requests heap blocks (scan_id, table_oid, slot_id).
- The backend reads the blocks, copies them into SHM slots, and sends metadata plus the visibility bitmap length.
- `PgScanStream` reads pages from SHM, decodes tuples via `storage::heap`, and builds Arrow RecordBatches.
- Results are encoded as wire MinimalTuple and written to the result ring; the backend reads the frames and fills the `TupleTableSlot`.

## Responsibilities

- Backend (`postgres/`): PG memory safety, `TupleTableSlot` formation, control FSM, heap IO.
- Executor (`executor/`): DataFusion planning/execution, heap requests, decoding/encoding of results, backpressure.
- Protocol: stable binary formats/messages.
- Storage: precise heap/attribute decoding (zero-copy where possible).
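To make the data path concrete, here is a minimal sketch of the heap-block exchange described above. The struct and field layout are hypothetical; only `scan_id`, `table_oid`, `slot_id`, and the visibility-bitmap length come from the prose, and the real wire structs in `protocol/` may differ.

```rust
// Hypothetical shapes for the data-path exchange; the real definitions
// live in the `protocol/` crate.

/// Executor -> backend: request the next heap block of a scan.
struct HeapBlockRequest {
    scan_id: u32,   // which active scan this request belongs to
    table_oid: u32, // PostgreSQL relation OID to read from
    slot_id: u32,   // SHM slot the backend should copy the page into
}

/// Backend -> executor: the page has been copied into the SHM slot.
struct HeapBlockResponse {
    slot_id: u32,        // slot now holding the heap page
    visibility_len: u32, // length of the accompanying visibility bitmap
}
```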

ai/memory/components/.gitkeep

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@

ai/memory/components/executor.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
---
id: comp-executor-0001
type: fact
scope: executor
tags: ["datafusion", "runtime", "pgscan", "ipc", "arrow"]
updated_at: "2025-11-29"
importance: 0.8
---

# Component: Executor (DataFusion Runtime)

## Essentials

- Table source: `PgTableProvider` → physical node `PgScanExec` → stream `PgScanStream`.
- On `scan.execute()`, the heap-block receiver is registered in a per-connection `ScanRegistry`.
- `PgScanStream` decodes heap pages (via `storage::heap`) to Arrow; results are then encoded as wire MinimalTuple and written to the result ring.
- Partition strategy: temporarily force `target_partitions = 1` (see decision `dec-0004`).

## Public Surfaces

- `server::{parse, optimize, translate, start_data_flow, end_data_flow}` — control-path FSM.
- `pgscan::{PgTableProvider, PgScanExec, ScanRegistry}` — sources/scans.

## Gotchas

- JOIN in multi-partition mode emits no rows without partition-aware reading — use a single partition.
- SHM: borrowed slices must not outlive the producer's write; do not cache references.
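The SHM gotcha is worth a concrete illustration: a borrowed page slice is valid only until the producer writes into the slot again, so bytes that must outlive the read have to be copied out while the borrow is alive. A minimal sketch, with the hypothetical `SlotGuard` standing in for whatever guard type the real SHM layer provides:

```rust
// Sketch of the SHM borrowing rule; `SlotGuard` is a stand-in type.
struct SlotGuard<'a> {
    page: &'a [u8], // borrowed view into the shared-memory slot
}

fn consume(slot: SlotGuard<'_>) -> Vec<u8> {
    // Wrong: stashing `slot.page` in a longer-lived struct would let the
    // reference outlive the producer's next write into the same slot.
    // Right: copy out what you need, then let the guard drop so the
    // producer can safely reuse the slot.
    slot.page.to_vec()
}
```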
