WIP: Paginated list support to allow substring list prefix #545
kylebarron wants to merge 5 commits into main
The latest two commits added an implementation of substring-match
@kylebarron I can try to pick this up if you don't have time - is that list above all the remaining to-dos?
    ///
    /// Instead, we collect _all_ results and filter them in memory with the provided substring.
    #[async_trait::async_trait]
    impl PaginatedListStore for PyLocalStore {
I think I intended to remove this and/or make generic over any Arc<dyn ObjectStore>
Yeah, the tl;dr is that object_store has a separate trait for backends that support paginated listing. So the goal of this PR is to change [...]

It should be straightforward to test our custom substring filter implementation based on a [...]

lmk if you still have questions
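To make the filter semantics concrete for testing, here is a minimal std-only sketch of the two matching rules discussed in this thread. The function names are hypothetical and are not taken from this PR or from object_store; the segment rule is a simplified illustration of "prefix must be a full path segment", not a copy of the library's behavior.

```rust
/// Segment-based match (illustrative): `prefix` must end on a path-segment
/// boundary, so "data/file_20" does NOT match "data/file_2024.parquet".
fn matches_segment_prefix(path: &str, prefix: &str) -> bool {
    let prefix = prefix.trim_end_matches('/');
    if prefix.is_empty() {
        // An empty prefix lists everything.
        return true;
    }
    match path.strip_prefix(prefix) {
        // The remainder must start at a segment boundary.
        Some(rest) => rest.starts_with('/'),
        None => false,
    }
}

/// Raw match: the key only has to start with the given string.
fn matches_raw_prefix(path: &str, prefix: &str) -> bool {
    path.starts_with(prefix)
}

/// Hypothetical in-memory fallback for stores without pagination:
/// take every listed key and keep those that start with `prefix`.
fn filter_by_raw_prefix<'a>(paths: &[&'a str], prefix: &str) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|p| matches_raw_prefix(p, prefix))
        .collect()
}
```

A test for the real implementation could follow the same shape: list a small fixed set of keys, apply the filter with a prefix that cuts a file name in half, and compare against the expected subset.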
Currently, obstore only supports listing by path segments. So if you pass a prefix into `list_with_delimiter` or `list`, that will be assumed to be a full path segment. This means that it's currently impossible to efficiently perform the desired query from #494:

`object_store` supports substring-based prefix listing in its `PaginatedListStore` API. So if I use that and provide my own pagination -> stream conversion, then I should be able to essentially match the current `list` API.

However, this `PaginatedListStore` is only implemented for S3, Azure, and GCS. It's not implemented for HTTPStore or LocalStore, because those don't have a concept of pagination. See apache/arrow-rs-object-store#388.

This means that to support ...

... or, better idea, in `obstore.list` we:

- ... `Arc<dyn ObjectStore>`, we have essentially an enum of the different stores
- ... `PaginatedListStore`, to support efficient querying of substring prefix
- ... `ObjectStore::list`, so that we never materialize the entire stream
- ... `PaginatedListStore`

For now, as a first pass, we'll only use this to improve `obstore.list`, while not touching `list_with_delimiter`. Later we can explore making that return type a stream as well.

Closes #494