Jekyll plugin to generate llms.txt and llms-full.txt #920
Deploying docs-metabase-github-io with
| Latest commit: | f04ce1a |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://e294595f.docs-metabase-github-io.pages.dev |
| Branch Preview URL: | https://emb-1218-deploy-llmtxt-to-me.docs-metabase-github-io.pages.dev |
WiNloSt
left a comment
There are quite a few changes from the previous JS counterpart. I'm also not that familiar with Ruby, so I can't comment on whether the code is idiomatic. But overall, it looks good.
| # | ||
| # Use prefix matching - a path matches if it starts with any of these. | ||
| # For specific files, include the full path. For directories, include trailing slash. | ||
| INCLUDED_PATHS = [ |
Who decides that these are all the docs we're going to include in this llms.txt?
@WiNloSt Alberto chooses these, please see https://linear.app/metabase/issue/EMB-1223/make-llmstxt-more-lightweight. cc @albertoperdomo
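For anyone skimming this thread, the prefix matching described in the comment above the constant could look roughly like the sketch below. The two path entries and the `included_path?` helper name are illustrative assumptions, not the plugin's actual list or API.

```ruby
# Hypothetical entries; the real list lives in the plugin and follows EMB-1223.
INCLUDED_PATHS = [
  'embedding/', # directory: the trailing slash matches everything below it
  'api.md'      # specific file: the full path
].freeze

# A path is included if it starts with any of the configured prefixes.
# String#start_with? accepts several prefixes at once.
def included_path?(relative_path)
  relative_path.start_with?(*INCLUDED_PATHS)
end
```

The trailing slash matters here: without it, `embedding` would also match a sibling file such as `embedding.md`.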
| end | ||
| | ||
| # Generate llms.txt for each version | ||
| docs_by_version.each do |version, docs| |
I don't know well how our doc site is generated. Like, is it generated once in a job for every version? Or does each version have its own build pipeline, so that they're generated in separate processes? I ask because this workflow seems to assume every time the doc site is built, it will generate llms.txt for every Metabase version we have the documents for.
> Like, is it generated once in a job for every version?
@WiNloSt It does exactly that: once the Jekyll build runs, the Jekyll plugin builds the pages for every version. This is how our docs site is able to have the docs for versions all the way back to the first versions of Metabase: it essentially builds every version every time.
> this workflow seems to assume every time the doc site is built, it will generate llms.txt for every Metabase version we have the documents for.
You got that right, yes. It does generate llms.txt for every single version.
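To make the "every version in one build" flow concrete, here is a minimal sketch. The `VersionedDoc` struct and the sample data are assumptions for illustration; only the output URL shapes (`docs/llms.txt` for latest, `docs/v0.57/llms.txt` for a pinned version) come from the PR description.

```ruby
# Hypothetical stand-in for a Jekyll document; only the fields used here.
VersionedDoc = Struct.new(:version, :relative_path)

docs = [
  VersionedDoc.new('latest', 'embedding/start.md'),
  VersionedDoc.new('v0.57', 'embedding/start.md'),
  VersionedDoc.new('v0.57', 'embedding/sdk.md')
]

# A single build pass groups every doc by its version...
docs_by_version = docs.group_by(&:version)

# ...and then produces one llms.txt target per version.
outputs = docs_by_version.to_h do |version, version_docs|
  path = version == 'latest' ? 'docs/llms.txt' : "docs/#{version}/llms.txt"
  [path, version_docs.map(&:relative_path)]
end
```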
| return if section_docs.empty? | ||
| | ||
| # Sort by path for consistent ordering | ||
| section_docs.sort_by!(&:relative_path) |
In the JS version, we sort the doc files before passing them to either generator function (the one for llms-full.txt or the one for llms.txt).
I think we should do the same here.
Definitely, moved the sort to outside the function in d5b870a 👍🏻
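A minimal sketch of what sorting outside the function means in practice, assuming two generator functions that both trust their input order. The `SortDoc` struct and the `toc_lines` and `full_text` names are hypothetical, not the plugin's real methods.

```ruby
# Hypothetical stand-in for a Jekyll document.
SortDoc = Struct.new(:relative_path)

# Both generators assume the docs they receive are already sorted.
def toc_lines(docs)
  docs.map { |doc| "- #{doc.relative_path}" }
end

def full_text(docs)
  docs.map { |doc| "# #{doc.relative_path}" }.join("\n\n")
end

docs = [SortDoc.new('embedding/sdk.md'), SortDoc.new('embedding/auth.md')]

# Sort once at the call site so both outputs share the same ordering.
sorted = docs.sort_by(&:relative_path)
toc = toc_lines(sorted)
full = full_text(sorted)
```

Using the non-mutating `sort_by` at the call site also avoids the side effect that `sort_by!` inside a helper has on the caller's array.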
| lines = [] | ||
| lines << "# Metabase #{section_capitalized} - Complete Reference for AI agents" | ||
| lines << '' | ||
| lines << "> **This documentation is for Metabase #{format_version_for_display(version, latest_branch)}.**" |
👍 This makes it clear which version it's targeted at.
| # We add the most important context for LLMs to avoid | ||
| # confusion and pitfalls like out-of-date APIs in trained data. | ||
| def get_modular_embedding_gotcha_notes | ||
| <<~NOTES.chomp |
I'm not familiar with Ruby, but it seems you already use two ways to declare multiline strings:
1. `<<~NOTES.chomp`
2. `lines << "xxx"`. This seems to operate on an array, but maybe we could just put one big multiline string in there using the same approach as in 1. I don't know if that's possible, or whether it would make things look messier.
I just can't help but notice this difference. Ignore this comment if it doesn't make sense.
Ah yes, we can definitely just use the heredoc multi-line string. Fixed this in dedbdbb
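For reference, the two styles discussed in this thread produce the same string. The sketch below uses made-up header text and a made-up `version` value; it is not the plugin's actual output.

```ruby
version = 'v0.58'

# Style 1: build an array of lines and join them.
lines = []
lines << '# Metabase Embedding'
lines << ''
lines << "> This documentation is for Metabase #{version}."
from_array = lines.join("\n")

# Style 2: a squiggly heredoc. <<~ strips the common leading indentation,
# and .chomp drops the trailing newline.
from_heredoc = <<~TEXT.chomp
  # Metabase Embedding

  > This documentation is for Metabase #{version}.
TEXT
```

The heredoc keeps the text visually close to the rendered output, which is why it reads better for long blocks of Markdown.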
| > 2. `authProviderUri` field no longer exist. | ||
| > 3. `jwtProviderUri` is an optional field that only exists in v58+. This is used to make JWT auth faster by skipping the `GET /auth/sso` discovery request. Not needed for initial implementation. | ||
| > 4. Numeric IDs must be integers not strings, e.g. `dashboardId={1}`. When the ID is retrieved from the router as a string AND it is numeric, `parseInt` it before passing it to the SDK. | ||
| > 5. IDs can also be strings for entity ids, so you should NOT parse all IDs as numbers if entity ids are also to be expected. |
| > 5. IDs can also be strings for entity ids, so you should NOT parse all IDs as numbers if entity ids are also to be expected. | |
| > 5. IDs can also be strings for entity IDs, so you should NOT parse all IDs as numbers if entity IDs are also to be expected. |
| # 3. Fallback to filename converted to title case | ||
| def extract_title(doc) | ||
| # First, try frontmatter title | ||
| return doc.data['title'] if doc.data['title'] && !doc.data['title'].empty? |
`return xxx if condition` is a pretty cool syntax.
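That modifier-`if` form is Ruby's usual way of writing guard clauses. Below is a rough sketch of the fallback chain using it. The `TitleDoc` struct is a hypothetical stand-in for a Jekyll document, and the middle step of the real method is not visible in this excerpt, so it is left out rather than guessed.

```ruby
# Hypothetical stand-in for a Jekyll document: frontmatter in `data`, plus a path.
TitleDoc = Struct.new(:data, :relative_path)

def extract_title(doc)
  title = doc.data['title']
  # Guard clause: return early when the frontmatter title is usable.
  return title if title && !title.empty?

  # (The plugin has an intermediate step here that the excerpt does not show.)

  # Fallback: convert the filename to title case,
  # e.g. "full-app-embedding.md" becomes "Full App Embedding".
  File.basename(doc.relative_path, '.*').split(/[-_]/).map(&:capitalize).join(' ')
end
```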
| <<~INSTRUCTIONS.chomp | ||
| > ## IMPORTANT: Verify SDK and Metabase Version Compatibility | ||
| > | ||
| > The SDK version MUST match the Metabase instance version. Mismatched versions cause errors. |
| > The SDK version MUST match the Metabase instance version. Mismatched versions cause errors. | |
| > The SDK version MUST match the Metabase instance version. Mismatched versions can cause errors. |
Technically, they will work despite being mismatched unless there are breaking changes.
| > | ||
| > **Step 4: Ensure versions match** | ||
| > | ||
| > - If the version mismatches, you MUST fetch the version-specific llms.txt documentation that matches the Metabase instance version: `https://metabase.com/docs/v0.{VERSION}/llms.txt` (e.g., `/docs/v0.58/llms.txt` for Metabase 58) |
| > - If the version mismatches, you MUST fetch the version-specific llms.txt documentation that matches the Metabase instance version: `https://metabase.com/docs/v0.{VERSION}/llms.txt` (e.g., `/docs/v0.58/llms.txt` for Metabase 58) | |
| > - If the versions mismatch, you MUST fetch the version-specific llms.txt documentation that matches the Metabase instance version: `https://metabase.com/docs/v0.{VERSION}/llms.txt` (e.g., `/docs/v0.58/llms.txt` for Metabase 58) |
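As an illustration of the URL pattern in this instruction, a helper that maps a Metabase version tag to the matching llms.txt URL might look like this. The `llms_txt_url` name is hypothetical, and the accepted tag shapes (`v0.58.1` and `v1.58.1`) are an assumption about how the instance reports its version.

```ruby
# Builds the version-specific llms.txt URL from a Metabase version tag.
# Assumes tags look like "v0.58.1" or "v1.58.1"; returns nil for anything else.
def llms_txt_url(version_tag)
  match = version_tag.match(/\Av[01]\.(\d+)/)
  return nil unless match

  "https://metabase.com/docs/v0.#{match[1]}/llms.txt"
end
```

Returning `nil` for an unrecognized tag lets the caller fall back to the unversioned `https://metabase.com/docs/llms.txt` if that is the desired behavior.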
| | ||
| # Add modular embedding gotchas for v57+ (same as in llms-full.txt) | ||
| if above_version?(version, 57) | ||
| lines << get_modular_embedding_gotcha_notes |
I think this is new. We previously only included the gotchas in llms-full.txt, correct?
Yep, it's new! We now expect people to use llms.txt more than llms-full.txt, so it's important that we also have it in llms.txt so they don't run into weird edge cases.
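The body of `above_version?` is not part of this excerpt, so the following is only a guess at how such a version gate could behave. It assumes the check is inclusive (the code comment says "v57+") and that an unversioned build such as `latest` counts as newest.

```ruby
# Sketch only: the plugin's real above_version? is not shown in this excerpt.
def above_version?(version, minimum_major)
  match = version.to_s.match(/\Av\d+\.(\d+)/)
  # Assumption: unversioned builds such as "latest" are treated as newest.
  return true unless match

  # Assumption: inclusive comparison, since the code comment says "v57+".
  match[1].to_i >= minimum_major
end

lines = []
lines << '> gotcha notes go here' if above_version?('v0.58', 57)
```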
Co-authored-by: Mahatthana (Kelvin) Nomsawadi <me@bboykelvin.dev>
Closes EMB-1218
Closes EMB-1223
Closes EMB-1186
Context
We want to generate a clean, versioned `llms.txt` file and put it on the `metabase.com/docs` website. The idea is that developers copy the URL for `llms.txt`, i.e. `https://metabase.com/docs/llms.txt` (latest) or `https://metabase.com/docs/v0.57/llms.txt` (for MB 57), put it in their AI coding tool of choice, and it helps them embed Metabase or convert between types of embedding (e.g. full app embedding to modular embedding).
This PR comes from the three requirements for improving llms.txt:
1. Developers will have to insert the llms.txt URL into their coding tool. It should be a clean, elegant URL on the metabase.com domain.
2. The main llms.txt file should not be a full index of everything available in the docs. That is already indexed and discoverable on the web. The llms.txt file should focus on content that we think will be relevant for LLMs to use. For now, the only use case we have in mind is coding with Metabase, so I think we should reduce the content to 1) embedding integration guides (modular embedding & SDK), 2) embedding-related setup and config (auth, SSO, embedding settings), 3) REST API.
3. Update llms.txt so that when migrating apps (e.g. from EAJS to the React SDK), the LLM first infers the Metabase version by querying the Metabase API (which does not require authentication), and then uses the corresponding embedding SDK package version. The goal is to avoid cases where the model picks an incorrect SDK version (such as using @56-stable from the docs instead of the correct version) and to make this behavior part of the default guidance in llms.txt.
As a bonus, it should try to prevent gotchas. Roman and I ran into `Your fetchRefreshToken function must return an object with the shape { jwt: string }, but instead received ...`, so I added a specific prompt for that.
Behavior
We want to generate the `docs/llms.txt` and `docs/embedding/llms-full.txt` files for embedding in the docs site.
`/docs/llms.txt` is the table-of-contents file that links to other Markdown files hosted on GitHub's `raw.githubusercontent.com`. This avoids the SEO problem where the raw Markdown files themselves might get indexed.
`/docs/embedding/llms-full.txt` is the complete reference that concatenates other files. Right now, it is filtered to embedding-specific docs only.
Demo
Please see my Loom video!
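As a rough illustration of the two outputs described under Behavior, the sketch below builds a table of contents that links out to raw Markdown, and a full reference that concatenates the docs. The `PageDoc` struct, the builder names, and the `RAW_BASE_URL` value are all hypothetical and do not reflect the plugin's real names or repository.

```ruby
# Hypothetical stand-in for a Jekyll document.
PageDoc = Struct.new(:title, :relative_path, :content)

# Hypothetical base URL; the real repository path is not stated in this PR.
RAW_BASE_URL = 'https://raw.githubusercontent.com/example-org/example-repo/main/docs'

# llms.txt: a table of contents with one link per doc, pointing at raw Markdown.
def build_llms_txt(docs)
  docs.map { |doc| "- [#{doc.title}](#{RAW_BASE_URL}/#{doc.relative_path})" }.join("\n")
end

# llms-full.txt: the docs themselves, concatenated under their titles.
def build_llms_full_txt(docs)
  docs.map { |doc| "# #{doc.title}\n\n#{doc.content.strip}" }.join("\n\n")
end

docs = [
  PageDoc.new('Authentication', 'embedding/auth.md', "Set up JWT.\n"),
  PageDoc.new('SDK', 'embedding/sdk.md', "Install the package.\n")
]
toc = build_llms_txt(docs)
full = build_llms_full_txt(docs)
```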