[Project] Research - Document Upload, Search, and “Chat with Docs” #17532

@joebudi

This is a research spike to understand how we can support document search within AI workflows.

For example, allowing users to upload PDFs to Budibase AI chat and converse with the agent about the content of those PDFs.


Goal

Define a practical, OSS-friendly architecture that enables Budibase to:
1. Let builders upload/manage documents (or connect an external knowledge base).
2. Search those docs using a vector store.
3. Use the user’s configured models (via LiteLLM / AI SDK) to chat with and reason over those docs.

Deliver a clear recommendation + prototype that can become the v1 “Document Search / Chat with Docs” implementation.


Use cases

1. HR Knowledge Assistant
Who: Employees, HR staff
Scenario: “I need to know our remote-work policy, parental leave rules, and health benefit details.”
How it works:
• HR has uploaded or synced their policy PDFs into Budibase.
• Employee asks: “How many days of remote work am I allowed?”
• The assistant searches the HR documents and answers with grounded text.
Outcome: Instant retrieval of HR policies without searching SharePoint, PDFs, or internal wikis.

2. Compliance & Audit Queries
Who: Compliance team, auditors, legal
Scenario: “What does our GDPR Data Retention section say about backups?”
How it works:
• Compliance PDFs + policies loaded.
• Chat returns the exact clause and a summarized interpretation.
Outcome: Fast and accurate compliance answers without combing through hundreds of pages.

3. Security Incident Response Assistant
Who: Security team, on-call engineers
Scenario: “What’s the step-by-step procedure for a suspected credential leak?”
How it works:
• Security runbooks uploaded to dedicated workspace.
• During an incident, the responder asks the assistant.
Outcome: Faster, consistent, error-free incident handling.


Rules

  • Whatever approach we take, it must be OSS-friendly: if we support three or four vector store options, at least one must be OSS.
  • Simplicity is important for v1. It's better to have one basic solution that's simple than an extensible solution that's complicated.
  • Self-hostable
  • Works with user’s models via LiteLLM
  • Doesn’t lock Budibase into a proprietary RAG provider

Questions

1. Storage & Ingestion

  • Where do documents live?
    • Budibase storage (S3/MinIO/etc.) vs “bring your own KB” only vs hybrid.
  • How do we ingest docs?
    • Text extraction (PDF, DOCX, HTML, MD, etc.)
    • Chunking strategy (by page, heading, paragraph, tokens)
    • File size and page limits
  • Who owns the embedding step?
    • Budibase calling the user’s model (via LiteLLM) or using a dedicated embedding model (OpenAI, local, etc.). (A rough chunk-and-embed sketch follows this list.)
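
Before answering these, it may help to see the moving parts. Below is a minimal chunk-and-embed sketch in TypeScript, assuming a LiteLLM proxy that exposes the OpenAI-compatible /v1/embeddings endpoint; chunkByTokens, the ~4-chars-per-token heuristic, and the model name are illustrative placeholders, not recommendations.

```ts
// Illustrative ingestion helpers: naive token-window chunking plus an
// embeddings call against a LiteLLM proxy (OpenAI-compatible API).

interface Chunk {
  text: string;
  docId: string;
  position: number;
}

// Placeholder chunker: fixed-size windows approximated as ~4 chars per token.
function chunkByTokens(text: string, docId: string, maxTokens = 512): Chunk[] {
  const approxChars = maxTokens * 4;
  const chunks: Chunk[] = [];
  for (let i = 0, pos = 0; i < text.length; i += approxChars, pos++) {
    chunks.push({ text: text.slice(i, i + approxChars), docId, position: pos });
  }
  return chunks;
}

// Embeds a batch of chunk texts via the user's LiteLLM proxy.
async function embed(
  texts: string[],
  litellmUrl: string,
  apiKey: string
): Promise<number[][]> {
  const res = await fetch(`${litellmUrl}/v1/embeddings`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    // Model name is a placeholder; in practice this is whatever embedding
    // model the user has configured in LiteLLM.
    body: JSON.stringify({ model: "text-embedding-3-small", input: texts }),
  });
  if (!res.ok) throw new Error(`Embedding call failed: ${res.status}`);
  const { data } = await res.json();
  return data.map((d: { embedding: number[] }) => d.embedding);
}
```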

2. Vector Store / KB Backend

  • Which OSS vector DBs should we support first?
    • Qdrant vs Weaviate vs Milvus vs pgvector
  • What are the evaluation criteria?
    • Self-hostability
    • Operational complexity
    • Performance and scaling
    • Multi-tenant story for Budibase Cloud vs self-host
    • Do we need a pluggable “Retriever” interface so we can add more stores later? (A minimal interface sketch follows this list.)
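
A pluggable interface could be as small as the sketch below. The Qdrant implementation assumes the @qdrant/js-client-rest package; treat the exact method shapes as illustrative rather than settled.

```ts
// Sketch of a store-agnostic Retriever interface with one Qdrant-backed
// implementation; other stores (pgvector, Milvus, ...) would implement the
// same contract.

import { QdrantClient } from "@qdrant/js-client-rest";

interface RetrievedChunk {
  text: string;
  documentId: string;
  score: number;
  metadata: Record<string, unknown>;
}

interface Retriever {
  search(queryVector: number[], topK: number): Promise<RetrievedChunk[]>;
}

class QdrantRetriever implements Retriever {
  constructor(
    private client: QdrantClient,
    private collection: string
  ) {}

  async search(queryVector: number[], topK: number): Promise<RetrievedChunk[]> {
    const hits = await this.client.search(this.collection, {
      vector: queryVector,
      limit: topK,
      with_payload: true,
    });
    return hits.map((h) => ({
      text: String(h.payload?.text ?? ""),
      documentId: String(h.payload?.documentId ?? ""),
      score: h.score,
      metadata: (h.payload ?? {}) as Record<string, unknown>,
    }));
  }
}
```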

3. Retrieval + LLM Orchestration

  • Where does RAG logic live?
    • In Budibase (our own retrieval + prompt injection), or
    • In the model provider / LiteLLM (if we use their KB abstractions – probably not, given the OSS goal).
  • How do we expose retrieval to:
    • Chat (builder config + runtime tool calling)
    • Agents (as a Tool in the Tool Registry)
  • What does the tool contract look like? (A typed sketch follows this list.)
    • Inputs: query, filters, topK, KB id
    • Outputs: list of chunks with text, documentId, source, score, metadata
  • How do we control which documents a particular chat/agent can see?
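
As a starting point for the tool contract question above, here is one possible typed shape; field names mirror the bullets and nothing here is final.

```ts
// Possible document_search tool contract. Names mirror the inputs/outputs
// listed above; defaults and optionality are open questions.

interface DocumentSearchInput {
  query: string;                       // natural-language search query
  kbId: string;                        // knowledge base / collection to search
  topK?: number;                       // max chunks to return
  filters?: Record<string, unknown>;   // metadata filters (docId, tags, ...)
}

interface DocumentSearchResult {
  text: string;                        // chunk content
  documentId: string;                  // owning document
  source?: string;                     // filename / URL, useful for citations
  score: number;                       // similarity score from the vector store
  metadata: Record<string, unknown>;
}

interface DocumentSearchOutput {
  results: DocumentSearchResult[];
}
```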

4. UX & Configuration

  • Builder UX:
    • Upload docs UI vs “Connect external KB” vs both.
    • How to configure which KB/collection a chat/agent uses.
    • How to test retrieval from the editor (test query).
  • End-user UX:
    • Do we show sources/citations? How?
    • How do we handle failures (no results, backend offline)?
    • How do we indicate that an answer is grounded vs pure LLM?

5. Security, Access & Multi-Tenancy

  • Access model:
    • Which docs can which users, chats, and agents access?
    • Is access enforced at Budibase layer, KB layer, or both?
  • Multi-tenant concerns:
    • How do we keep tenants isolated in shared clusters (especially Qdrant/Chroma)? (A query-filter sketch follows this list.)
  • Data lifecycle:
    • Deletion guarantees (user deletes doc → embeddings removed).
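
One common isolation approach, sketched below under the assumption that every point carries a tenantId payload field written at ingest time: scope every query with a mandatory payload filter (Qdrant client shown; other stores have equivalents).

```ts
// Sketch: tenant isolation enforced at query time via a payload filter.
// Assumes each point was written with a tenantId payload field at ingest.

import { QdrantClient } from "@qdrant/js-client-rest";

async function tenantScopedSearch(
  client: QdrantClient,
  collection: string,
  queryVector: number[],
  tenantId: string,
  topK = 5
) {
  return client.search(collection, {
    vector: queryVector,
    limit: topK,
    filter: {
      must: [{ key: "tenantId", match: { value: tenantId } }],
    },
  });
}
```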

6. Compatibility with “User’s Models”

  • How do we ensure:
    • Retrieval works with any model reachable via LiteLLM (OpenAI, Mistral, local, etc.)?
    • We can use different embedding models than chat models if needed.
  • How do we handle context-window limits? (A token-budgeting sketch follows this list.)
    • Chunk selection/truncation
    • Max tokens allocated to retrieved context
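
A naive token-budgeting pass might look like the sketch below; the 4-chars-per-token estimate is a stand-in for a real tokenizer.

```ts
// Sketch: greedily pack the highest-scoring chunks into a fixed token budget
// before building the prompt. Token counting here is a rough approximation.

function selectChunksForContext(
  chunks: { text: string; score: number }[],
  maxContextTokens: number
): string[] {
  const estimateTokens = (s: string) => Math.ceil(s.length / 4);
  const selected: string[] = [];
  let used = 0;
  for (const chunk of [...chunks].sort((a, b) => b.score - a.score)) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxContextTokens) continue; // skip chunks that don't fit
    selected.push(chunk.text);
    used += cost;
  }
  return selected;
}
```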

Workstreams

A. Landscape & Technical Options (2–3 providers)

Deep dive on:

  • Qdrant
  • Chroma
  • Milvus

For each:

  • Deployment story (Cloud vs self-host)
  • API surface for search & upsert
  • How to represent metadata and multi-tenant isolation
  • Supported embedding dimensions / index types

Output: Comparison table + recommendation.

B. Minimal RAG Architecture for Budibase

Design 1–2 reference architectures:

  1. “Bring Your Own RAG” v1
  • Users ingest docs into Qdrant/Chroma themselves.
  • Budibase only does retrieval via a Document Search Tool.
  2. “Budibase Uploads” v1.1
  • Budibase handles file upload → extraction → embeddings → upsert to vector DB. (An end-to-end sketch follows this list.)
  • Still fully OSS: embeddings via user’s model or a configured embedding provider.
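
Tying the pieces together, a hypothetical end-to-end ingest for the “Budibase Uploads” path, reusing the chunkByTokens/embed helpers sketched earlier in this issue:

```ts
// Sketch: upload pipeline = extract -> chunk -> embed -> upsert.
// chunkByTokens and embed are the illustrative helpers from the ingestion
// sketch above; nothing here is a committed design.

import { QdrantClient } from "@qdrant/js-client-rest";

async function ingestDocument(
  client: QdrantClient,
  collection: string,
  docId: string,
  extractedText: string, // output of the text-extraction step
  litellmUrl: string,
  apiKey: string
): Promise<void> {
  const chunks = chunkByTokens(extractedText, docId);
  const vectors = await embed(chunks.map((c) => c.text), litellmUrl, apiKey);
  await client.upsert(collection, {
    wait: true,
    points: chunks.map((c, i) => ({
      id: crypto.randomUUID(), // Qdrant point IDs must be ints or UUIDs
      vector: vectors[i],
      payload: { text: c.text, documentId: docId, position: c.position },
    })),
  });
}
```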

Output: Architecture diagrams + pros/cons + recommended sequence.

C. Tool Contract & Runtime Design

Define:

  • Tool schema (document_search):
    • Inputs: query, kbId/collection, topK, filters
    • Outputs: results[] with text, docId, metadata
  • How Chat uses the tool:
    • Builder config in Chat Editor
    • Prompt guidelines to encourage tool usage
  • How Agents use the tool:
    • Allowed tools list
    • Scratchpad integration (store chunks, add to prompts)

Output: Tool spec and example payloads.
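
For illustration, hypothetical example payloads (all values invented; types from the contract sketch above):

```ts
// Hypothetical document_search payloads; every value is illustrative.

const exampleInput: DocumentSearchInput = {
  query: "How many days of remote work am I allowed?",
  kbId: "hr-policies",
  topK: 3,
  filters: { tag: "remote-work" },
};

const exampleOutput: DocumentSearchOutput = {
  results: [
    {
      text: "Employees may work remotely up to three days per week...",
      documentId: "doc_remote_work_policy",
      source: "remote-work-policy.pdf",
      score: 0.87,
      metadata: { page: 2 },
    },
  ],
};
```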


Deliverables

  1. Research findings, constraints, and recommendation
  2. Implementation plan (Project with PRDs)
