Conversation
Pull request overview
Adds an initial CSV-backed Datasource implementation (Arrow schema + record batches) and wires it into the CMake build/test setup, addressing issues #30 and #31.
Changes:
- Introduces `types` (Arrow-backed vectors + `RecordBatch`) and `datasource` libraries, including a CSV datasource + iterator.
- Adds GoogleTest coverage and sample CSV fixtures for schema inference and projection.
- Restructures library CMake targets (replaces `planner` with `logical_plan`) and updates top-level build wiring.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `test/datasource_test.cpp` | Adds unit tests for CSV schema inference, provided schema, and projection. |
| `test/data/csv/without_header.csv` | Adds CSV fixture without header. |
| `test/data/csv/with_header.csv` | Adds CSV fixture with header. |
| `test/data/csv/invalid.csv` | Adds invalid CSV fixture for negative-path testing. |
| `test/CMakeLists.txt` | Adds a new `datasource_test` executable and registers it with CTest. |
| `src/main.cpp` | Minor formatting/style adjustments to `main`. |
| `lib/types/types.h` | Introduces Arrow type aliases, `ColumnVector` abstraction, and `RecordBatch`. |
| `lib/types/types.cpp` | Implements `ArrowFieldVector` value accessors. |
| `lib/types/CMakeLists.txt` | Adds `types` library target and links Arrow. |
| `lib/planner/CMakeLists.txt` | Removes old `planner` library target. |
| `lib/logical-plan/planner.h` | Minor formatting adjustment. |
| `lib/logical-plan/planner.cpp` | Reorders includes / formatting. |
| `lib/logical-plan/CMakeLists.txt` | Adds `logical_plan` library target. |
| `lib/datasource/datasource.h` | Defines `Datasource` interface and `CsvDatasource` API. |
| `lib/datasource/csv_datasource_iter.h` | Declares a generic iterator interface and CSV iterator. |
| `lib/datasource/csv_datasource_iter.cpp` | Implements CSV scanning into Arrow arrays / `RecordBatch`. |
| `lib/datasource/csv_datasource.cpp` | Implements schema inference and projection schema construction. |
| `lib/datasource/CMakeLists.txt` | Adds `datasource` library target and links `types`. |
| `cmake/AddGoogleTest.cmake` | Tweaks googletest `FetchContent` configuration (compile options commented out). |
| `CMakeLists.txt` | Adds Arrow dependency and new subdirectories/targets wiring. |
| `.clang-format` | Introduces project formatting configuration. |
```cpp
auto get_path(const std::string &file_name) -> std::string
{
    std::filesystem::path cwd(__FILE__);
    return cwd.parent_path().append("data").append("csv").append(file_name);
```
get_path returns a std::filesystem::path expression, but the function return type is std::string. std::filesystem::path is not implicitly convertible to std::string, so this won’t compile as written; convert explicitly (e.g., via .string()).
```diff
- return cwd.parent_path().append("data").append("csv").append(file_name);
+ return cwd.parent_path().append("data").append("csv").append(file_name).string();
```
```cpp
    this->fields = std::move(fields);
}

auto RowCount() -> int { return fields[0]->Size(); }
```
RecordBatch::RowCount() unconditionally indexes fields[0], which will crash if a RecordBatch is constructed with zero columns (e.g., empty projection / schema inference failure). Handle the empty-fields case explicitly (return 0 or throw) before indexing.
```diff
- auto RowCount() -> int { return fields[0]->Size(); }
+ auto RowCount() -> int
+ {
+     if (fields.empty())
+     {
+         return 0;
+     }
+     return fields[0]->Size();
+ }
```
```cpp
std::any value;
int size;

public:
```
LiteralValueVector’s arrowType, value, and size members are never initialized (no constructor), so any use of this type will be undefined behavior. Add a constructor (or in-class initializers) to fully initialize these members.
```diff
  public:
+     // Default constructor initializes members to a safe "empty" state.
+     LiteralValueVector()
+         : arrowType(arrow::Type::NA), value(), size(0)
+     {
+     }
+     // Convenience constructor to create a literal vector with a given value and size.
+     LiteralValueVector(arrow::Type::type type, std::any literal_value, int literal_size)
+         : arrowType(type), value(std::move(literal_value)), size(literal_size)
+     {
+     }
```
|
|
```cpp
CsvDatasource::CsvDatasource(std::optional<std::shared_ptr<Schema>> schema, const String &file_name)
    : schema_(std::move(schema)), file_(file_name)
{
```
The CSV file stream is opened in the member initializer list, but there’s no check that file_ actually opened successfully. If the file can’t be opened, infer_schema() will read an empty line and downstream code can produce empty schemas / crashes; validate file_.is_open() (and/or file_.good()) and fail fast with a clear error.
```diff
  {
+     if (!file_.is_open() || !file_.good())
+     {
+         throw std::runtime_error("failed to open CSV file for reading");
+     }
```
```cpp
auto CsvDatasource::create_final_schema(const std::vector<String> &projection) -> std::shared_ptr<Schema>
{
    std::vector<std::shared_ptr<arrow::Field>> projected_fields;
    if (!schema_.has_value())
    {
        throw std::runtime_error("error creating final schema");
    }

    auto schema = schema_.value();
    for (int i = 0; i < int(schema->fields().size()); i++)
    {
        auto field = schema->fields()[i];
        if (std::ranges::contains(projection, field->name()))
        {
            auto meta = std::unordered_map<std::string, std::string>{{"idx", std::to_string(i)}};
            auto metadata = std::make_shared<arrow::KeyValueMetadata>(meta);
            auto field_with_meta = field->WithMetadata(metadata);
            projected_fields.push_back(field_with_meta);
        }
    }

    return arrow::schema(projected_fields);
}
```
create_final_schema returns a schema containing only fields present in projection. If projection is empty or contains no matching column names, this returns an empty schema, which later leads to constructing a RecordBatch with zero columns (and RowCount() currently crashes). Decide on a contract (e.g., empty projection means “all columns”, or throw on empty/no-match) and enforce it here.
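One way to make that contract explicit, sketched with plain strings rather than `arrow::Field`s (the helper name `select_columns` is illustrative, not part of the PR): treat an empty projection as "all columns" and reject a projection that matches nothing.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: resolve a projection against the full column list.
// Contract sketched here: empty projection means "all columns";
// a projection that matches no column is a caller error.
std::vector<std::string> select_columns(const std::vector<std::string> &all_columns,
                                        const std::vector<std::string> &projection)
{
    if (projection.empty())
    {
        return all_columns; // empty projection selects everything
    }
    std::vector<std::string> selected;
    for (const auto &name : all_columns)
    {
        if (std::find(projection.begin(), projection.end(), name) != projection.end())
        {
            selected.push_back(name);
        }
    }
    if (selected.empty())
    {
        throw std::invalid_argument("projection matched no columns");
    }
    return selected;
}
```

Whichever contract is chosen, enforcing it here keeps `RecordBatch` from ever being constructed with zero columns.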
```cpp
arrow::StringBuilder builder;
auto appen_status = builder.AppendValues(items);
std::shared_ptr<arrow::Array> array;
auto stat = builder.Finish().Value(&array);

auto field_vector = std::make_shared<ArrowFieldVector>(array);
field_vectors.push_back(field_vector);
```
The iterator always builds arrow::StringArray columns (arrow::StringBuilder) regardless of the field’s declared Arrow type. This will mismatch provided schemas (e.g., Int16Type) and make GetType()/GetValue() inconsistent with the schema. Build arrays using the field’s type (and parse CSV text into that type), or constrain the datasource to string-only schemas and enforce that contract.
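The real fix would dispatch on the field's Arrow type to the matching builder (e.g. `arrow::Int16Builder` vs. `arrow::StringBuilder`). The parsing side of that idea can be sketched with stdlib stand-ins (`ColumnType`, `Cell`, and `parse_cell` are illustrative, not Arrow API):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <variant>

// Stand-ins for a declared column type and a typed parsed cell.
enum class ColumnType { Int16, Utf8 };
using Cell = std::variant<int16_t, std::string>;

// Parse one CSV cell according to the column's declared type,
// instead of unconditionally treating every cell as a string.
Cell parse_cell(ColumnType type, const std::string &text)
{
    switch (type)
    {
    case ColumnType::Int16:
    {
        const int value = std::stoi(text); // throws on non-numeric input
        if (value < INT16_MIN || value > INT16_MAX)
        {
            throw std::out_of_range("value does not fit in int16");
        }
        return static_cast<int16_t>(value);
    }
    case ColumnType::Utf8:
        return text;
    }
    throw std::logic_error("unhandled column type");
}
```

With per-type parsing in place, `GetType()`/`GetValue()` stay consistent with whatever schema the caller provided.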
```cpp
template <typename T>
auto CsvDatasourceIterator<T>::has_next() -> bool
{
    return !file_.eof();
}
```
has_next() uses !file_.eof(), which is a brittle way to detect remaining data (EOF is only set after a failed read, and other stream error states aren’t considered). Prefer checking file_.peek() != EOF, or attempt a read and buffer the next row so has_next() is accurate.
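The `peek()` variant can be sketched on any `std::istream` (the free function `has_more_data` is illustrative; the PR's version would be a member on the iterator):

```cpp
#include <istream>
#include <sstream> // only needed for the string-stream usage example
#include <string>

// Report whether the stream still has data by peeking one character.
// eof() alone is only set *after* a failed read, and ignores other
// stream error states, so good() + peek() is a more reliable check.
bool has_more_data(std::istream &in)
{
    return in.good() && in.peek() != std::char_traits<char>::eof();
}
```

Alternatively, reading ahead and buffering the next row makes `has_next()` exact even for trailing newlines or blank lines.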
```cpp
TEST(Datasource, handle_missing_values) {}

TEST(Datasource, handle_different_column_types) {}
```
These two tests are empty bodies, so they don’t validate any behavior and will silently pass. Either implement them (especially since missing values / type parsing are core CSV datasource behaviors) or remove them until the functionality exists.
```diff
- TEST(Datasource, handle_missing_values) {}
- TEST(Datasource, handle_different_column_types) {}
+ TEST(Datasource, handle_missing_values)
+ {
+     GTEST_SKIP() << "TODO: implement handling of missing values in CSV datasource.";
+ }
+ TEST(Datasource, handle_different_column_types)
+ {
+     GTEST_SKIP() << "TODO: implement handling of different column types in CSV datasource.";
+ }
```
```cpp
for (int i = 0; i < int(row_items.size()); i++)
{
    try
    {
        columns.at(std::to_string(i)).push_back(row_items.at(i));
    } catch (const std::exception &e)
    {
        continue;
    }
```
Missing/short rows are currently handled by catching columns.at(...) and continue, which results in different column vectors having different lengths. That can produce inconsistent RecordBatch semantics (and potential out-of-bounds on later access). Consider explicitly appending null/empty values for missing columns so all columns remain aligned by row.
```diff
- for (int i = 0; i < int(row_items.size()); i++)
- {
-     try
-     {
-         columns.at(std::to_string(i)).push_back(row_items.at(i));
-     } catch (const std::exception &e)
-     {
-         continue;
-     }
+ // Ensure that every schema column receives a value (real or empty)
+ // so that all column vectors remain aligned by row.
+ int num_columns = static_cast<int>(columns.size());
+ for (int i = 0; i < num_columns; ++i)
+ {
+     auto key_it = columns.find(std::to_string(i));
+     if (key_it == columns.end())
+     {
+         // Schema/metadata inconsistency: skip this index rather than throwing.
+         continue;
+     }
+     if (i < static_cast<int>(row_items.size()))
+     {
+         key_it->second.push_back(row_items[static_cast<std::size_t>(i)]);
+     }
+     else
+     {
+         // Short row: append an empty value to keep column lengths aligned.
+         key_it->second.push_back(String{});
+     }
```
```cpp
std::string idx;
auto status = field->metadata()->Get("idx").Value(&idx);

auto const &items = columns.at(idx);

arrow::StringBuilder builder;
auto appen_status = builder.AppendValues(items);
std::shared_ptr<arrow::Array> array;
auto stat = builder.Finish().Value(&array);
```
Arrow builder results/statuses are computed but ignored (AppendValues, Finish, and earlier metadata Get). If any of these fail, the code will proceed with partially-built data and later throw/crash. Check and handle these statuses/results explicitly (propagate an error or throw with context).
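The check-and-throw pattern can be sketched with a stand-in `Status` type (the real code would check `arrow::Status`, which likewise exposes `ok()` and `ToString()`; `throw_if_not_ok` is an illustrative helper, not Arrow API):

```cpp
#include <stdexcept>
#include <string>

// Minimal stand-in for arrow::Status, just enough to show the pattern.
struct Status
{
    bool ok_flag;
    std::string message;
    bool ok() const { return ok_flag; }
    std::string ToString() const { return message; }
};

// Fail fast with context instead of silently dropping a bad status.
void throw_if_not_ok(const Status &status, const std::string &context)
{
    if (!status.ok())
    {
        throw std::runtime_error(context + ": " + status.ToString());
    }
}
```

Applied to the hunk above, each of `AppendValues`, `Finish`, and the metadata `Get` would be checked before the array is handed to `ArrowFieldVector`.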
Overview
closes jobala/cabin#30
closes jobala/cabin#31