Skip to content

HTMLSemanticPreservingSplitter elements_to_preserve replacement order issue #31568

@strawgate

Description

@strawgate

Checked other resources

  • I added a very descriptive title to this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Example Code

from langchain_text_splitters import HTMLSemanticPreservingSplitter

body = """
<p>Hello1</p>
<p>Hello2</p>
<p>Hello3</p>
<p>Hello4</p>
<p>Hello5</p>
<p>Hello6</p>
<p>Hello7</p>
<p>Hello8</p>
<p>Hello9</p>
<p>Hello10</p>
<p>Hello11</p>
<p>Hello12</p>
<p>Hello13</p>
<p>Hello14</p>
"""


splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[],
    elements_to_preserve=["p"],
)


if __name__ == "__main__":
    print(splitter.split_text(body))

outputs

[Document(metadata={}, page_content='Hello1 Hello2 Hello3 Hello4 Hello5 Hello6 Hello7 Hello8 Hello9 Hello10 Hello20 Hello21 Hello22 Hello23')]

Error Message and Stack Trace (if applicable)

No response

Description

the PLACEHOLDER replacements need to be done in reverse order or the PLACEHOLDER names need to be padded so that PLACEHOLDER_1 doesnt match PLACEHOLDER_11

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 24.5.0: Tue Apr 22 19:54:25 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6020
Python Version: 3.12.10 (main, Apr 8 2025, 11:35:47) [Clang 16.0.0 (clang-1600.0.26.6)]

Package Information

langchain_core: 0.3.65
langchain: 0.3.25
langchain_community: 0.3.25
langsmith: 0.3.45
langchain_experimental: 0.3.4
langchain_text_splitters: 0.3.8

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.51: Installed. No version info available.
langchain-core<1.0.0,>=0.3.58: Installed. No version info available.
langchain-core<1.0.0,>=0.3.65: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-perplexity;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.8: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.25: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
langsmith<0.4,>=0.3.45: Installed. No version info available.
numpy>=1.26.2;: Installed. No version info available.
numpy>=2.1.0;: Installed. No version info available.
openai-agents: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.18
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.11.5
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic>=2.7.4: Installed. No version info available.
pytest: 8.4.0
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.4
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 13.9.4
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugRelated to a bug, vulnerability, unexpected error with an existing featurehelp wantedGood issue for contributorstext-splittersRelated to the package `text-splitters`

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions