embed_table_storage crashes (SIGKILL) on sharded datasets with Sequence() nested types #7894

@The-Obstacle-Is-The-Way

Description

Summary

embed_table_storage crashes with SIGKILL (exit code 137) when processing sharded datasets containing Sequence() nested types like Sequence(Nifti()). Likely affects Sequence(Image()) and Sequence(Audio()) as well.

The crash occurs at the C++ level with no Python traceback.

Related Issues

  • #7893 — OOM issue hit while uploading the same dataset
Context

Discovered while uploading the Aphasia Recovery Cohort (ARC) neuroimaging dataset to HuggingFace Hub. Even after fixing the OOM issue (#7893), this crash blocked uploads.

Working implementation with workaround: arc-aphasia-bids

Reproduction

from datasets import Dataset, Features, Sequence, Value
from datasets.features import Nifti
from datasets.table import embed_table_storage

features = Features({
    "id": Value("string"),
    "images": Sequence(Nifti()),
})

ds = Dataset.from_dict({
    "id": ["a", "b"],
    "images": [["/path/to/file.nii.gz"], []],
}).cast(features)

# This works fine:
table = ds._data.table.combine_chunks()
embedded = embed_table_storage(table)  # OK

# This crashes with SIGKILL:
shard = ds.shard(num_shards=2, index=0)
shard_table = shard._data.table.combine_chunks()
embedded = embed_table_storage(shard_table)  # CRASH - no Python traceback

Key Observations

| Scenario | Result |
| --- | --- |
| Single `Nifti()` column | Works |
| `Sequence(Nifti())` on full dataset | Works |
| `Sequence(Nifti())` after `ds.shard()` | CRASHES |
| `Sequence(Nifti())` after `ds.select([i])` | CRASHES |
| Empty `Sequence` (`[]`) | CRASHES — not file-size related |

Workaround

Convert the shard to pandas and recreate the Dataset, which breaks the internal Arrow references:

shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)

# CRITICAL: Pandas round-trip breaks problematic references
shard_df = shard.to_pandas()
fresh_shard = Dataset.from_pandas(shard_df, preserve_index=False)
fresh_shard = fresh_shard.cast(ds.features)

# Now embedding works
table = fresh_shard._data.table.combine_chunks()
embedded = embed_table_storage(table)  # OK!

Disproven Hypotheses

| Hypothesis | Test | Result |
| --- | --- | --- |
| PyArrow 2GB binary limit | Monkey-patched `Nifti.pa_type` to `pa.large_binary()` | Still crashed |
| Memory fragmentation | Called `table.combine_chunks()` | Still crashed |
| File-size issue | Tested with tiny NIfTI files | Still crashed |

Root Cause Hypothesis

When ds.shard() or ds.select() creates a subset, the resulting Arrow table retains internal references/views to the parent table. When embed_table_storage processes nested struct types like Sequence(Nifti()), these references cause a crash in the C++ layer.

The pandas round-trip forces a full data copy, breaking these problematic references.

Environment

  • datasets version: main branch (post-0.22.0)
  • Platform: macOS 14.x ARM64 (may be platform-specific)
  • Python: 3.13
  • PyArrow: 18.1.0

Notes

This may ultimately be a PyArrow issue surfacing through datasets. Happy to help debug further if maintainers can point to where to look in the embedding logic.
