Jekyll plugin to generate llms.txt and llms-full.txt#920

Merged
heypoom merged 24 commits into master from emb-1218-deploy-llmtxt-to-metabasecom on Jan 27, 2026

@heypoom heypoom commented Jan 23, 2026

Closes EMB-1218
Closes EMB-1223
Closes EMB-1186

Context

We want to generate a clean, versioned llms.txt file and publish it on the metabase.com/docs website. The idea is that developers copy the llms.txt URL, for example https://metabase.com/docs/llms.txt (latest) or https://metabase.com/docs/v0.57/llms.txt (for Metabase 57), and paste it into their AI coding tool of choice. It then helps them embed Metabase or convert between types of embedding (e.g. full app embedding to modular embedding).

This PR comes from three requirements for improving llms.txt, plus one bonus:

  1. Developers will have to insert the llms.txt URL into their coding tool. It should be a clean, elegant URL on the metabase.com domain.

  2. The main llms.txt file should not be a full index of everything available in the docs. That is already indexed and discoverable on the web. The llms.txt file should focus on content that we think will be relevant for LLMs to use. For now the only use case we have in mind is coding with Metabase, so I think we should reduce the content to 1) embedding integration guides (modular embedding & SDK), 2) embedding-related setup and config (auth, SSO, embedding settings), 3) REST API.

  3. Update llms.txt so that when migrating apps (e.g. from EAJS to the React SDK), the LLM first infers the Metabase version by querying the Metabase API (which does not require authentication), and then uses the corresponding embedding SDK package version. The goal is to avoid cases where the model picks an incorrect SDK version (such as using @56-stable from the docs instead of the correct version) and to make this behavior part of the default guidance in llms.txt.

  4. As a bonus, it should try to prevent gotchas. Roman and I ran into the error "Your fetchRefreshToken function must return an object with the shape { jwt: string }, but instead received ...", so I added a specific prompt for that.
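The version-matching step in requirement 3 can be sketched as a small mapping from an instance's version tag to the matching docs URL and SDK tag. This is only an illustration: the helper names are hypothetical, the "v0.NN.x" / "v1.NN.x" tag shape is an assumption, and how the tag is fetched from the instance is left out because the description does not name the endpoint.

```ruby
# Hypothetical helpers, not part of the plugin in this PR.

# Extracts the major release number from a tag such as "v0.57.3".
# Assumption: tags look like "v0.NN.x" or "v1.NN.x".
# Returns nil when the tag has another shape.
def metabase_major(version_tag)
  match = version_tag.to_s.match(/\Av?[01]\.(\d+)/)
  match && match[1].to_i
end

# Versioned llms.txt URL, following the URL pattern quoted in the description.
def llms_txt_url(major)
  "https://metabase.com/docs/v0.#{major}/llms.txt"
end

# SDK tag in the "@56-stable" style mentioned in requirement 3.
def sdk_dist_tag(major)
  "#{major}-stable"
end

major = metabase_major("v0.57.3")
puts llms_txt_url(major) # prints https://metabase.com/docs/v0.57/llms.txt
puts sdk_dist_tag(major) # prints 57-stable
```

Returning nil for an unrecognized tag keeps the caller in charge of the fallback, for instance using the unversioned /docs/llms.txt.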

Behavior

We want to generate the docs/llms.txt and docs/embedding/llms-full.txt files for embedding in the docs site.

  • /docs/llms.txt is the table of contents file that links to other Markdown files hosted on GitHub's raw content host (raw.githubusercontent.com). This avoids the SEO problem where the raw Markdown files themselves might get indexed.
  • /docs/embedding/llms-full.txt is the complete reference that concatenates the other files. Right now, it is filtered to embedding content only.
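The difference between the two outputs can be illustrated with a rough sketch. Everything here is assumed for illustration: the `Page` struct stands in for a Jekyll document, `RAW_BASE` is a placeholder rather than the real repository path, and the exact line formats the plugin emits are not shown in this description.

```ruby
# Hypothetical page record; the real plugin works with Jekyll documents.
Page = Struct.new(:title, :path, :body)

# Placeholder base URL for raw Markdown (assumption, not the real path).
RAW_BASE = "https://raw.githubusercontent.com/example/docs/main"

# llms.txt style: one Markdown link per page, no page bodies.
def build_index(pages)
  pages.map { |page| "- [#{page.title}](#{RAW_BASE}/#{page.path})" }.join("\n")
end

# llms-full.txt style: page bodies concatenated under their titles.
def build_full(pages)
  pages.map { |page| "## #{page.title}\n\n#{page.body.strip}" }.join("\n\n")
end

pages = [
  Page.new("Embedding introduction", "embedding/introduction.md", "Intro text.\n"),
  Page.new("SDK quickstart", "embedding/sdk/quickstart.md", "Quickstart text.\n")
]

puts build_index(pages)
puts
puts build_full(pages)
```

The index stays small because it only carries links, while the full file grows with every page body it includes.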

Demo

Please see my Loom video!

@heypoom heypoom changed the title from "add jekyll plugin to generate llms.txt and llms-full.txt" to "Add jekyll plugin to generate llms.txt and llms-full.txt" Jan 23, 2026
@heypoom heypoom changed the title from "Add jekyll plugin to generate llms.txt and llms-full.txt" to "Jekyll plugin to generate llms.txt and llms-full.txt" Jan 23, 2026
cloudflare-workers-and-pages bot commented Jan 23, 2026

Deploying docs-metabase-github-io with Cloudflare Pages

Latest commit: f04ce1a
Status: ✅  Deploy successful!
Preview URL: https://e294595f.docs-metabase-github-io.pages.dev
Branch Preview URL: https://emb-1218-deploy-llmtxt-to-me.docs-metabase-github-io.pages.dev

@heypoom heypoom marked this pull request as ready for review January 24, 2026 04:05
@WiNloSt WiNloSt left a comment

There are quite a few changes compared with the previous JS version. I'm also not that familiar with Ruby, so I can't comment on whether the code is idiomatic. But overall, it looks good.

#
# Use prefix matching - a path matches if it starts with any of these.
# For specific files, include the full path. For directories, include trailing slash.
INCLUDED_PATHS = [
Member

Who decides which docs get included in this llms.txt?
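For context, the prefix-matching rule described in the code comment above can be sketched as follows. The paths are made up, since the real INCLUDED_PATHS entries are not visible in this excerpt, and the constant and method names are hypothetical.

```ruby
# Hypothetical allowlist; the real entries are not shown in the excerpt.
EXAMPLE_INCLUDED_PATHS = [
  "embedding/",      # directory: trailing slash matches everything under it
  "api/overview.md"  # single file: full path
].freeze

# A path is included when it starts with any allowlisted prefix.
def included?(relative_path)
  EXAMPLE_INCLUDED_PATHS.any? { |prefix| relative_path.start_with?(prefix) }
end

puts included?("embedding/sdk/quickstart.md") # prints true
puts included?("embedding-old/notes.md")      # prints false
puts included?("api/overview.md")             # prints true
puts included?("dashboards/introduction.md")  # prints false
```

The trailing slash on directory entries matters: without it, "embedding" would also match a sibling such as "embedding-old/".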

Member Author

end

# Generate llms.txt for each version
docs_by_version.each do |version, docs|
Member

I don't know exactly how our doc site is generated. Like, is it generated once in a job for every version? Or does each version have its own build pipeline, so they're generated in separate processes? I ask because this workflow seems to assume that every time the doc site is built, it will generate llms.txt for every Metabase version we have docs for.

Member Author

Like, is it generated once in a job for every version?

@WiNloSt It does exactly that: when the Jekyll build runs, the Jekyll plugin builds the pages for every version. This is how our docs site is able to have docs going all the way back to the first versions of Metabase: it essentially builds every version every time.

this workflow seems to assume every time the doc site is built, it will generate llms.txt for every Metabase versions we have the documents for.

You got that right, yes. It does generate llms.txt for every single version.
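A minimal sketch of that "every version, every build" loop is below. The `VersionedDoc` struct, the sample versions, and the output paths are assumptions that follow the URL shapes in the PR description; the plugin's real data structures are not shown here.

```ruby
# Hypothetical stand-in for a Jekyll document tagged with its docs version.
VersionedDoc = Struct.new(:relative_path, :version)

all_docs = [
  VersionedDoc.new("embedding/introduction.md", "latest"),
  VersionedDoc.new("embedding/introduction.md", "v0.57"),
  VersionedDoc.new("embedding/sdk/quickstart.md", "v0.57")
]

# One pass over the whole site: group the docs, then emit one file per version.
docs_by_version = all_docs.group_by(&:version)

outputs = {}
docs_by_version.each do |version, docs|
  # Assumption: the latest docs live at the unversioned path.
  target = version == "latest" ? "docs/llms.txt" : "docs/#{version}/llms.txt"
  outputs[target] = docs.map(&:relative_path)
end

outputs.each { |path, entries| puts "#{path}: #{entries.length} page(s)" }
```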

return if section_docs.empty?

# Sort by path for consistent ordering
section_docs.sort_by!(&:relative_path)
Member

In the JS version, we sort the doc files before passing them to either function: the one that generates the full file or the one that generates just llms.txt.

https://github.com/metabase/metabase/blob/ae05f0bab78e26a86a1b39d7d4aa4b98d2974515/.github/scripts/generate-llms-txt.js#L313

I think we should do the same here.

Member Author

Definitely, moved the sort to outside the function in d5b870a 👍🏻
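The shape of that change, sorting once and handing the same ordered list to both generators, might look roughly like this. The struct and the two render functions are hypothetical placeholders, not the code from d5b870a.

```ruby
# Hypothetical stand-in for a Jekyll document.
SourceDoc = Struct.new(:relative_path)

# Both generators assume their input is already sorted.
def render_index(sorted_docs)
  sorted_docs.map { |doc| "- #{doc.relative_path}" }
end

def render_full(sorted_docs)
  sorted_docs.map { |doc| "## #{doc.relative_path}" }
end

docs = [SourceDoc.new("embedding/sdk.md"), SourceDoc.new("api/overview.md")]

# sort_by returns a new array, so the caller's list is left untouched,
# unlike the in-place sort_by! used inside the function before.
sorted = docs.sort_by(&:relative_path)

index = render_index(sorted)
full = render_full(sorted)
puts index
puts full
```

Sorting at the call site means both files list pages in the same order and neither generator mutates its argument.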

lines = []
lines << "# Metabase #{section_capitalized} - Complete Reference for AI agents"
lines << ''
lines << "> **This documentation is for Metabase #{format_version_for_display(version, latest_branch)}.**"
Member

👍 This makes it clear which version it targets.

# We add the most important context for LLMs to avoid
# confusion and pitfalls like out-of-date APIs in trained data.
def get_modular_embedding_gotcha_notes
<<~NOTES.chomp
Member

I'm not familiar with Ruby, but it seems you already use two ways to declare multiline strings.

  1. <<~NOTES.chomp
  2. lines << "xxx". This seems to operate on an array, but maybe we could just put one big multiline string in there using the same approach as 1. I don't know if that's possible, or whether it would make things look messier.

I just can't help but notice this difference. Ignore this comment if it doesn't make sense.

Member Author

Ah yes, we can definitely just use the heredoc multi-line string. Fixed this in dedbdbb
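For readers less familiar with Ruby, the two styles discussed in this thread produce the same string. A small sketch with made-up content (the variable names and text are illustrative only):

```ruby
version = "v0.58"

# Style 1: build an array of lines, then join them.
lines = []
lines << "# Metabase Embedding"
lines << ''
lines << "> This documentation is for Metabase #{version}."
from_array = lines.join("\n")

# Style 2: squiggly heredoc. <<~ strips the common leading indentation
# and .chomp drops the final newline.
from_heredoc = <<~TEXT.chomp
  # Metabase Embedding

  > This documentation is for Metabase #{version}.
TEXT

puts from_array == from_heredoc # prints true
```

The heredoc keeps the text looking like the final output, while the array form is handy when lines are appended conditionally.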

> 2. `authProviderUri` field no longer exist.
> 3. `jwtProviderUri` is an optional field that only exists in v58+. This is used to make JWT auth faster by skipping the `GET /auth/sso` discovery request. Not needed for initial implementation.
> 4. Numeric IDs must be integers not strings, e.g. `dashboardId={1}`. When the ID is retrieved from the router as a string AND it is numeric, `parseInt` it before passing it to the SDK.
> 5. IDs can also be strings for entity ids, so you should NOT parse all IDs as numbers if entity ids are also to be expected.
Member

Suggested change
> 5. IDs can also be strings for entity ids, so you should NOT parse all IDs as numbers if entity ids are also to be expected.
> 5. IDs can also be strings for entity IDs, so you should NOT parse all IDs as numbers if entity IDs are also to be expected.

Member Author

reformatted in 4fd8d52

# 3. Fallback to filename converted to title case
def extract_title(doc)
# First, try frontmatter title
return doc.data['title'] if doc.data['title'] && !doc.data['title'].empty?
Member

The "return xxx if condition" form is pretty cool syntax.
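That trailing-if form is usually called a guard clause. A self-contained sketch of a title lookup in the same spirit is below; `TitledDoc` and `title_for` are hypothetical names, and the intermediate step between the front matter check and the filename fallback is omitted because it does not appear in the excerpt.

```ruby
# Hypothetical stand-in for a Jekyll document: only the fields used here.
TitledDoc = Struct.new(:data, :relative_path)

def title_for(doc)
  # Guard clause: return early when the front matter has a usable title.
  title = doc.data['title']
  return title if title && !title.empty?

  # (Intermediate step from the original is not shown in the excerpt.)

  # Fallback: turn the filename into title case.
  File.basename(doc.relative_path, '.*')
      .split(/[-_]/)
      .map(&:capitalize)
      .join(' ')
end

puts title_for(TitledDoc.new({ 'title' => 'Modular embedding' }, 'embedding/modular.md'))
# prints Modular embedding
puts title_for(TitledDoc.new({}, 'embedding/sdk-quickstart.md'))
# prints Sdk Quickstart
```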

<<~INSTRUCTIONS.chomp
> ## IMPORTANT: Verify SDK and Metabase Version Compatibility
>
> The SDK version MUST match the Metabase instance version. Mismatched versions cause errors.
Member

Suggested change
> The SDK version MUST match the Metabase instance version. Mismatched versions cause errors.
> The SDK version MUST match the Metabase instance version. Mismatched versions can cause errors.

Technically, they will work despite being mismatched unless there are breaking changes.

Member Author

Fixed in 8abf0a7 🙏🏻

>
> **Step 4: Ensure versions match**
>
> - If the version mismatches, you MUST fetch the version-specific llms.txt documentation that matches the Metabase instance version: `https://metabase.com/docs/v0.{VERSION}/llms.txt` (e.g., `/docs/v0.58/llms.txt` for Metabase 58)
Member

Suggested change
> - If the version mismatches, you MUST fetch the version-specific llms.txt documentation that matches the Metabase instance version: `https://metabase.com/docs/v0.{VERSION}/llms.txt` (e.g., `/docs/v0.58/llms.txt` for Metabase 58)
> - If the versions mismatch, you MUST fetch the version-specific llms.txt documentation that matches the Metabase instance version: `https://metabase.com/docs/v0.{VERSION}/llms.txt` (e.g., `/docs/v0.58/llms.txt` for Metabase 58)

Member Author

Fixed in f04ce1a


# Add modular embedding gotchas for v57+ (same as in llms-full.txt)
if above_version?(version, 57)
lines << get_modular_embedding_gotcha_notes
Member

I think this is new. We only included the gotchas in the llms-full version previously, correct?

Member Author

@heypoom heypoom Jan 27, 2026

Yep, it's new! We now expect people to use llms.txt more than llms-full.txt, so it's important that we also have it in llms.txt so they don't run into weird edge cases.
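As an illustration of the version gate in the hunk above, a check of this kind could be written as follows. This is not the plugin's `above_version?` implementation, which is not shown here; `at_least_version?` and `header_lines` are hypothetical, and the treatment of non-numeric version names is an assumption.

```ruby
# Hypothetical check, not the plugin's actual above_version? method.
def at_least_version?(version, minimum_major)
  match = version.to_s.match(/\Av\d+\.(\d+)/)
  # Assumption: a name that is not "v0.NN" shaped (for example "latest")
  # is treated as newer than every numbered release.
  return true unless match

  # Inclusive comparison, matching the "v57+" wording in the code comment.
  match[1].to_i >= minimum_major
end

def header_lines(version)
  lines = ["# Metabase docs for #{version}"]
  lines << "> Gotcha notes for modular embedding" if at_least_version?(version, 57)
  lines
end

puts header_lines("v0.56").length  # prints 1
puts header_lines("v0.57").length  # prints 2
puts header_lines("latest").length # prints 2
```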

Co-authored-by: Mahatthana (Kelvin) Nomsawadi <me@bboykelvin.dev>
@heypoom heypoom merged commit a56d79c into master Jan 27, 2026
1 check passed
@heypoom heypoom deleted the emb-1218-deploy-llmtxt-to-metabasecom branch January 27, 2026 14:29
3 participants