WIP: Paginated list support to allow substring list prefix#545

Open
kylebarron wants to merge 5 commits into main from kyle/paginated-list

Member

@kylebarron kylebarron commented Aug 29, 2025

Currently, obstore only supports listing by path segments: if you pass a prefix into list_with_delimiter or list, it is assumed to be a full path segment. This means it's currently impossible to efficiently perform the desired query from #494:

I have tons of log files with data at the beginning for example 202506272215_blabla on S3 I can use prefix as substring, basically I can get all files for this day by my_folder/20250627* but it's not working in obstore.
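
To make the difference concrete, here is a minimal sketch of the two matching rules. Both function names are hypothetical, and the segment rule is a simplified model of the behavior described above rather than object_store's exact logic.

```rust
/// Path-segment semantics (simplified model): `prefix` must cover whole
/// `/`-delimited segments of `path`.
fn matches_segment_prefix(path: &str, prefix: &str) -> bool {
    let prefix = prefix.trim_end_matches('/');
    if prefix.is_empty() {
        return true;
    }
    match path.strip_prefix(prefix) {
        // Either an exact match, or the remainder starts a new segment.
        Some(rest) => rest.is_empty() || rest.starts_with('/'),
        None => false,
    }
}

/// Substring-prefix semantics: a plain string prefix test on the full key,
/// which is what the query from #494 relies on.
fn matches_substring_prefix(path: &str, prefix: &str) -> bool {
    path.starts_with(prefix)
}
```

With the path from the issue, `my_folder/20250627` ends in the middle of a segment, so the segment rule rejects it while the substring rule accepts it.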

object_store supports substring-based prefix listing in its PaginatedListStore API. So if I use that and provide my own pagination -> stream conversion, then I should be able to essentially match the current list API.
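
The pagination -> stream conversion could look roughly like the synchronous sketch below. It uses only the standard library: `Page`, `PagedIter`, and `fake_fetch` are hypothetical stand-ins, not object_store types, and the real code would be async and return a stream rather than an `Iterator`.

```rust
/// One page of a hypothetical paginated listing.
struct Page {
    items: Vec<String>,
    next_token: Option<String>,
}

/// Lazily flattens pages into individual items, requesting the next page
/// only once the current one is exhausted.
struct PagedIter<F: FnMut(Option<&str>) -> Page> {
    fetch_page: F,
    buffer: std::vec::IntoIter<String>,
    token: Option<String>,
    done: bool,
}

impl<F: FnMut(Option<&str>) -> Page> PagedIter<F> {
    fn new(fetch_page: F) -> Self {
        PagedIter { fetch_page, buffer: Vec::new().into_iter(), token: None, done: false }
    }
}

impl<F: FnMut(Option<&str>) -> Page> Iterator for PagedIter<F> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        loop {
            if let Some(item) = self.buffer.next() {
                return Some(item);
            }
            if self.done {
                return None;
            }
            // Buffer is empty and more pages remain: fetch the next one.
            let page = (self.fetch_page)(self.token.as_deref());
            self.done = page.next_token.is_none();
            self.token = page.next_token;
            self.buffer = page.items.into_iter();
        }
    }
}

/// Stand-in for a backend call (hypothetical): two pages linked by a token.
fn fake_fetch(token: Option<&str>) -> Page {
    match token {
        None => Page {
            items: vec!["a/1".to_string(), "a/2".to_string()],
            next_token: Some("p2".to_string()),
        },
        Some(_) => Page { items: vec!["a/3".to_string()], next_token: None },
    }
}
```

Iterating `PagedIter::new(fake_fetch)` yields the items of both pages in order, with the second page requested only after the first is drained.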

However, this PaginatedListStore is only implemented for S3, Azure, and GCS. It's not implemented for HTTPStore or LocalStore, because those don't have a concept of pagination. See apache/arrow-rs-object-store#388.

This means that to support ...


... or, better idea, in obstore.list we:

  • Avoid type erasure, so instead of bringing in an Arc<dyn ObjectStore>, we have essentially an enum of the different stores
  • Implement S3/GCS/Azure via PaginatedListStore, to support efficient querying of substring prefix
  • Implement LocalStore/HTTPStore via a transform on the stream from ObjectStore::list, so that we never materialize the entire stream
    • That would mean removing this implementation of PaginatedListStore
    • Keep fetching the stream until a batch with valid responses exists, so that we don't return empty batches.
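
The LocalStore/HTTPStore fallback described in the last bullets can be sketched as a lazy transform over batches. This is a standard-library model only: `filter_batches` is a hypothetical name, and the real implementation would operate on the async stream from ObjectStore::list.

```rust
/// Wraps an iterator of batches, keeps only the paths that start with
/// `prefix`, and never yields an empty batch. Nothing is collected up front:
/// each input batch is processed only when the consumer asks for more.
fn filter_batches<'a, I>(batches: I, prefix: &'a str) -> impl Iterator<Item = Vec<String>> + 'a
where
    I: Iterator<Item = Vec<String>> + 'a,
{
    batches
        .map(move |batch| {
            batch
                .into_iter()
                .filter(|path| path.starts_with(prefix))
                .collect::<Vec<String>>()
        })
        // Skip batches in which nothing matched, so the consumer keeps
        // pulling until a non-empty batch (or the end) is reached.
        .filter(|batch| !batch.is_empty())
}
```

Because both adapters are lazy, at most one batch is held in memory at a time, which is the point of not materializing the entire listing.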

For now, as a first pass, we'll only use this to improve obstore.list, while not touching list_with_delimiter. Later we can explore making that return type a stream as well.

Closes #494

@kylebarron
Member Author

The latest two commits added an implementation of substring-match list, partially written by Claude.

  • Fix clippy lints
  • Add test using minio? I need to have a python test that runs on the paginated implementation too
  • Propagate error through the stream
    Err(_e) => {
  • Make create_filtered_stream more concise. Did I write a helper for that initially?
  • Use set equality for paths in Python tests
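
For the "propagate error through the stream" item, one possible shape is to let errors pass through the filter untouched instead of dropping them in an `Err(_e)` arm. The sketch below is an assumption about the intent, not the PR's code; `filter_fallible` is a hypothetical name and the listing is modeled as a plain iterator of `Result`s.

```rust
/// Filters a fallible listing by prefix while passing every error through
/// to the consumer.
fn filter_fallible<'a, I, E>(
    items: I,
    prefix: &'a str,
) -> impl Iterator<Item = Result<String, E>> + 'a
where
    I: Iterator<Item = Result<String, E>> + 'a,
    E: 'a,
{
    items.filter(move |item| match item {
        Ok(path) => path.starts_with(prefix),
        // Keep errors so the caller sees them instead of a silently
        // shorter listing.
        Err(_) => true,
    })
}
```

A consumer can then decide how to react, for example by collecting into `Result<Vec<String>, E>` to stop at the first error.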

@TomNicholas

@kylebarron I can try to pick this up if you don't have time - is that list above all the remaining to-dos?

///
/// Instead, we collect _all_ results and filter them in memory with the provided substring.
#[async_trait::async_trait]
impl PaginatedListStore for PyLocalStore {
Member Author

I think I intended to remove this and/or make it generic over any Arc<dyn ObjectStore>.

@kylebarron
Member Author

kylebarron commented Feb 2, 2026

Yeah, the tl;dr is that object_store has a separate trait for backends that support paginated listing.

So the goal of this PR is to change list to use those paginated stores when possible, and to fall back to a naive implementation of "list everything in the directory, then filter based on substring".

It should be straightforward to test our custom substring filter implementation based on a MemoryStore. We should be able to test the paginated list store behavior via our minio-based tests in test_s3.py.

lmk if you still have questions

Development

Successfully merging this pull request may close these issues.

Wildcard prefix in the list command

2 participants