Summary
`embed_table_storage` crashes with SIGKILL (exit code 137) when processing sharded datasets containing nested `Sequence()` types such as `Sequence(Nifti())`; `Sequence(Image())` and `Sequence(Audio())` are likely affected as well. The crash occurs at the C++ level, with no Python traceback.
Related Issues
- #7852: Problems with NifTI (closed, but related embedding issues)
- #6790: PyArrow "Memory mapping file failed: Cannot allocate memory" (potentially related)
- #7893: `push_to_hub` OOM, `_push_parquet_shards_to_hub` accumulates all shard bytes in memory (separate bug, discovered together)
Context
Discovered while uploading the Aphasia Recovery Cohort (ARC) neuroimaging dataset to HuggingFace Hub. Even after fixing the OOM issue (#7893), this crash blocked uploads.
Working implementation with workaround: arc-aphasia-bids
Reproduction
```python
from datasets import Dataset, Features, Sequence, Value
from datasets.features import Nifti
from datasets.table import embed_table_storage

features = Features({
    "id": Value("string"),
    "images": Sequence(Nifti()),
})

ds = Dataset.from_dict({
    "id": ["a", "b"],
    "images": [["/path/to/file.nii.gz"], []],
}).cast(features)

# This works fine:
table = ds._data.table.combine_chunks()
embedded = embed_table_storage(table)  # OK

# This crashes with SIGKILL:
shard = ds.shard(num_shards=2, index=0)
shard_table = shard._data.table.combine_chunks()
embedded = embed_table_storage(shard_table)  # CRASH - no Python traceback
```

Key Observations
| Scenario | Result |
|---|---|
| Single `Nifti()` column | Works |
| `Sequence(Nifti())` on full dataset | Works |
| `Sequence(Nifti())` after `ds.shard()` | Crashes |
| `Sequence(Nifti())` after `ds.select([i])` | Crashes |
| Crash with empty `Sequence` (`[]`) | Yes (so not file-size related) |
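For completeness, the `ds.select()` row in the table can be reproduced with the same objects as the Reproduction snippet above; a minimal sketch:

```python
# Same crash signature via select() instead of shard(),
# reusing `ds` and `embed_table_storage` from the reproduction above:
sel_table = ds.select([0])._data.table.combine_chunks()
embedded = embed_table_storage(sel_table)  # CRASH - SIGKILL, no traceback
```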
Workaround
Convert the shard to pandas and recreate the `Dataset` to break the internal Arrow references:

```python
shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)

# CRITICAL: the pandas round-trip breaks the problematic references
shard_df = shard.to_pandas()
fresh_shard = Dataset.from_pandas(shard_df, preserve_index=False)
fresh_shard = fresh_shard.cast(ds.features)

# Now embedding works
table = fresh_shard._data.table.combine_chunks()
embedded = embed_table_storage(table)  # OK!
```
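For larger uploads, the workaround can be wrapped into a loop over all shards; `embed_shards_safely` below is a hypothetical helper sketching this, not part of `datasets`:

```python
from datasets import Dataset
from datasets.table import embed_table_storage

def embed_shards_safely(ds: Dataset, num_shards: int):
    """Yield one embedded Arrow table per shard, applying the pandas
    round-trip first so each shard owns its own buffers."""
    for i in range(num_shards):
        shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)
        # Hypothetical application of the workaround above
        fresh = Dataset.from_pandas(shard.to_pandas(), preserve_index=False)
        fresh = fresh.cast(ds.features)
        yield embed_table_storage(fresh._data.table.combine_chunks())
```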
Disproven Hypotheses

| Hypothesis | Test | Result |
|---|---|---|
| PyArrow 2GB binary limit | Monkey-patched `Nifti.pa_type` to `pa.large_binary()` | Still crashed |
| Memory fragmentation | Called `table.combine_chunks()` | Still crashed |
| File size issue | Tested with tiny NIfTI files | Still crashed |
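For reference, the 2GB-limit test amounted to something like the following sketch (the exact patch point depends on how the `Nifti` feature defines its storage type):

```python
import pyarrow as pa
from datasets.features import Nifti

# Sketch of the disproven 2GB-limit test: swap the binary storage type
# for large_binary before building the dataset. Still crashed.
Nifti.pa_type = pa.large_binary()
```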
Root Cause Hypothesis
When `ds.shard()` or `ds.select()` creates a subset, the resulting Arrow table retains internal references/views into the parent table. When `embed_table_storage` processes nested struct types like `Sequence(Nifti())`, these references cause a crash in the C++ layer.
The pandas round-trip forces a full data copy, breaking these problematic references.
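If this hypothesis is right, any operation that fully materializes the buffers should also work, not just pandas. A minimal Arrow-only sketch via an IPC round-trip (`deep_copy_table` is a hypothetical helper, untested against this crash):

```python
import pyarrow as pa

def deep_copy_table(table: pa.Table) -> pa.Table:
    """Round-trip a table through the Arrow IPC stream format to force
    a full copy, dropping any views into a parent table's buffers."""
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    with pa.ipc.open_stream(sink.getvalue()) as reader:
        return reader.read_all()
```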
Environment
- datasets version: main branch (post-0.22.0)
- Platform: macOS 14.x ARM64 (may be platform-specific)
- Python: 3.13
- PyArrow: 18.1.0
Notes
This may ultimately be a PyArrow issue surfacing through `datasets`. Happy to help debug further if maintainers can point me to the relevant part of the embedding logic.
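One possible starting point: check whether the sharded table still contains sliced arrays (non-zero offsets into parent buffers), which would support the reference hypothesis. A probe reusing `ds` from the Reproduction section:

```python
shard_table = ds.shard(num_shards=2, index=0)._data.table.combine_chunks()
for name in shard_table.column_names:
    for chunk in shard_table.column(name).chunks:
        # A non-zero offset means this array is a view into a larger parent buffer
        print(name, type(chunk).__name__, "offset =", chunk.offset)
```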