Summary
`embed_table_storage` crashes with SIGKILL (exit code 137) when processing sharded datasets containing nested `Sequence()` types such as `Sequence(Nifti())`; `Sequence(Image())` and `Sequence(Audio())` are likely affected as well. The crash occurs at the C++ level, with no Python traceback.
Related Issues
- #7852: Problems with NifTI (closed, but related embedding issues)
- #6790: PyArrow "Memory mapping file failed: Cannot allocate memory" (potentially related)
- #7893: `push_to_hub` OOM, `_push_parquet_shards_to_hub` accumulates all shard bytes in memory (separate bug, discovered together)
Context
Discovered while uploading the Aphasia Recovery Cohort (ARC) neuroimaging dataset to HuggingFace Hub. Even after fixing the OOM issue (#7893), this crash blocked uploads.
Working implementation with workaround: arc-aphasia-bids
Reproduction
```python
from datasets import Dataset, Features, Sequence, Value
from datasets.features import Nifti
from datasets.table import embed_table_storage

features = Features({
    "id": Value("string"),
    "images": Sequence(Nifti()),
})

ds = Dataset.from_dict({
    "id": ["a", "b"],
    "images": [["/path/to/file.nii.gz"], []],
}).cast(features)

# This works fine:
table = ds._data.table.combine_chunks()
embedded = embed_table_storage(table)  # OK

# This crashes with SIGKILL:
shard = ds.shard(num_shards=2, index=0)
shard_table = shard._data.table.combine_chunks()
embedded = embed_table_storage(shard_table)  # CRASH - no Python traceback
```

Key Observations
| Scenario | Result |
|---|---|
| Single `Nifti()` column | Works |
| `Sequence(Nifti())` on full dataset | Works |
| `Sequence(Nifti())` after `ds.shard()` | Crashes |
| `Sequence(Nifti())` after `ds.select([i])` | Crashes |
| Crash with empty `Sequence` (`[]`) | Yes (so not file-size related) |
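For completeness, the `ds.select()` row in the table can be reproduced with the same objects as the Reproduction snippet above; a minimal sketch:

```python
# Same crash signature via select() instead of shard(),
# reusing `ds` and `embed_table_storage` from the reproduction above:
sel_table = ds.select([0])._data.table.combine_chunks()
embedded = embed_table_storage(sel_table)  # CRASH - SIGKILL, no traceback
```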
Workaround
Convert the shard to pandas and recreate the `Dataset` to break the internal Arrow references:

```python
shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)

# CRITICAL: the pandas round-trip breaks the problematic references
shard_df = shard.to_pandas()
fresh_shard = Dataset.from_pandas(shard_df, preserve_index=False)
fresh_shard = fresh_shard.cast(ds.features)

# Now embedding works
table = fresh_shard._data.table.combine_chunks()
embedded = embed_table_storage(table)  # OK!
```
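For larger uploads, the workaround can be wrapped into a loop over all shards; `embed_shards_safely` below is a hypothetical helper sketching this, not part of `datasets`:

```python
from datasets import Dataset
from datasets.table import embed_table_storage

def embed_shards_safely(ds: Dataset, num_shards: int):
    """Yield one embedded Arrow table per shard, applying the pandas
    round-trip first so each shard owns its own buffers."""
    for i in range(num_shards):
        shard = ds.shard(num_shards=num_shards, index=i, contiguous=True)
        # Hypothetical application of the workaround above
        fresh = Dataset.from_pandas(shard.to_pandas(), preserve_index=False)
        fresh = fresh.cast(ds.features)
        yield embed_table_storage(fresh._data.table.combine_chunks())
```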
Disproven Hypotheses

| Hypothesis | Test | Result |
|---|---|---|
| PyArrow 2GB binary limit | Monkey-patched `Nifti.pa_type` to `pa.large_binary()` | Still crashed |
| Memory fragmentation | Called `table.combine_chunks()` | Still crashed |
| File size issue | Tested with tiny NIfTI files | Still crashed |
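For reference, the 2GB-limit test amounted to something like the following sketch (the exact patch point depends on how the `Nifti` feature defines its storage type):

```python
import pyarrow as pa
from datasets.features import Nifti

# Sketch of the disproven 2GB-limit test: swap the binary storage type
# for large_binary before building the dataset. Still crashed.
Nifti.pa_type = pa.large_binary()
```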
Root Cause Hypothesis
When `ds.shard()` or `ds.select()` creates a subset, the resulting Arrow table retains internal references/views into the parent table. When `embed_table_storage` processes nested struct types like `Sequence(Nifti())`, these references cause a crash in the C++ layer.
The pandas round-trip forces a full data copy, breaking these problematic references.
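If this hypothesis is right, any operation that fully materializes the buffers should also work, not just pandas. A minimal Arrow-only sketch via an IPC round-trip (`deep_copy_table` is a hypothetical helper, untested against this crash):

```python
import pyarrow as pa

def deep_copy_table(table: pa.Table) -> pa.Table:
    """Round-trip a table through the Arrow IPC stream format to force
    a full copy, dropping any views into a parent table's buffers."""
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    with pa.ipc.open_stream(sink.getvalue()) as reader:
        return reader.read_all()
```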
Environment
- datasets version: main branch (post-0.22.0)
- Platform: macOS 14.x ARM64 (may be platform-specific)
- Python: 3.13
- PyArrow: 18.1.0
Notes
This may ultimately be a PyArrow issue surfacing through `datasets`. Happy to help debug further if maintainers can point me to the relevant part of the embedding logic.
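One possible starting point: check whether the sharded table still contains sliced arrays (non-zero offsets into parent buffers), which would support the reference hypothesis. A probe reusing `ds` from the Reproduction section:

```python
shard_table = ds.shard(num_shards=2, index=0)._data.table.combine_chunks()
for name in shard_table.column_names:
    for chunk in shard_table.column(name).chunks:
        # A non-zero offset means this array is a view into a larger parent buffer
        print(name, type(chunk).__name__, "offset =", chunk.offset)
```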