
Nested objects in array lose information on toonification #6

@nomenclature95

Summary

When toonifying nested JSON structures, properties whose values are arrays (of objects or of strings) are dropped from the output. The token count goes down, but only because that data is missing, so the toonified result is incomplete.


Steps to Reproduce

  1. Use the following JSON input (inline below).
    The data includes a nested array of objects under offset and an array of strings under hierarchy.

    {
      "categorization": [
        {
          "id": "01.04.04.01.",
          "label": "Aspetti generali",
          "hierarchy": [
            "Prodotti",
            "Organizzazione altro e Sito Internet",
            "Aspetti generali",
            "Aspetti generali"
          ],
          "score": 900,
          "winner": true,
          "namespace": "$namespace",
          "frequency": 0,
          "offset": [
            { "start": 511, "end": 520 },
            { "start": 524, "end": 527 },
            { "start": 528, "end": 543 }
          ]
        }
      ]
    }
  2. Run toonify on this JSON.

  3. Inspect the toonified output.


Observed Behavior

  • The resulting toonified output omits both the offset (array of objects) and hierarchy (array of strings) properties.
  • Token count is indeed reduced, but this reduction comes from the loss of meaningful structure and data.
  • See the attached screenshot for reference.

Expected Behavior

Toonified output should preserve all properties (including nested arrays and objects), ensuring structural and semantic fidelity while still optimizing token usage.
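
One way to state this expectation as a check is a round trip: encode, decode, and list every path that did not survive. The sketch below is only an illustration. `diff_paths` and `lost_in_round_trip` are hypothetical helper names, and the `encode`/`decode` arguments stand for whichever encoder and decoder pair is under test; nothing here assumes a particular toonify API.

```python
import json


def diff_paths(expected, actual, path="$"):
    """Return the paths at which `actual` differs from `expected`."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        out = []
        for key in expected:
            if key not in actual:
                # The whole property was dropped.
                out.append(f"{path}.{key}")
            else:
                out += diff_paths(expected[key], actual[key], f"{path}.{key}")
        return out
    if (isinstance(expected, list) and isinstance(actual, list)
            and len(expected) == len(actual)):
        out = []
        for i, (e, a) in enumerate(zip(expected, actual)):
            out += diff_paths(e, a, f"{path}[{i}]")
        return out
    # Scalars, type mismatches, and lists of different length.
    return [] if expected == actual else [path]


def lost_in_round_trip(data, encode, decode):
    """Encode then decode `data`; an empty list means nothing was lost."""
    return diff_paths(data, decode(encode(data)))


if __name__ == "__main__":
    sample = {"categorization": [{"id": "01.04.04.01.",
                                  "offset": [{"start": 511, "end": 520}]}]}
    # json.dumps/json.loads is a lossless stand-in pair, so this prints [].
    print(lost_in_round_trip(sample, json.dumps, json.loads))
```

With the input from this report, a lossless encoder should yield an empty list, while an encoder that drops the two array properties would report `$.categorization[0].hierarchy` and `$.categorization[0].offset`.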


Environment

  • package version: toonify-1.4.0
  • IDE: PyCharm (Notebook mode)
  • Experiment Context: Quick local test to measure token impact

Additional Context

I understand that this type of JSON doesn't fit neatly into traditional tabular data structures. However, supporting more complex, nested formats would significantly improve toonify's robustness, especially for real-world datasets used with LLMs where hierarchical or relational structures are common.


Suggested Improvement

Consider extending toonify's serialization logic to:

  • Preserve nested arrays and object properties.
  • Optionally flatten or represent them symbolically (e.g., offset[start:end] shorthand) without dropping information.
  • Provide a fallback or warning when certain structures can't be safely toonified.
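
To make the three points above concrete, here is a minimal sketch of a serializer that keeps nested arrays. It is not toonify's implementation; the function names (`encode_field`, `to_text`) and the exact notation are assumptions chosen for illustration. Arrays of scalars go on one line, uniform arrays of flat objects become a compact table with a `key[n]{col,...}:` header, other arrays of objects are expanded item by item, and anything else falls back to inline JSON with a warning instead of being dropped.

```python
import json
import warnings


def is_primitive(v):
    return v is None or isinstance(v, (str, int, float, bool))


def fmt(v):
    # JSON-encode scalars so strings containing commas stay unambiguous.
    return json.dumps(v, ensure_ascii=False)


def is_uniform_table(items):
    """True for a non-empty list of dicts with identical keys and scalar values."""
    if not items or not all(isinstance(i, dict) for i in items):
        return False
    cols = list(items[0])
    return all(list(i) == cols and all(is_primitive(v) for v in i.values())
               for i in items)


def encode_field(key, value, indent=0):
    """Return the output lines for one key/value pair."""
    pad = "  " * indent
    if is_primitive(value):
        return [f"{pad}{key}: {fmt(value)}"]
    if isinstance(value, dict):
        lines = [f"{pad}{key}:"]
        for k, v in value.items():
            lines += encode_field(k, v, indent + 1)
        return lines
    if isinstance(value, list):
        n = len(value)
        if all(is_primitive(v) for v in value):
            # Array of scalars, e.g. hierarchy.
            return [f"{pad}{key}[{n}]: " + ",".join(fmt(v) for v in value)]
        if is_uniform_table(value):
            # Array of flat objects, e.g. offset: one header, one row per item.
            cols = list(value[0])
            header = pad + key + "[" + str(n) + "]{" + ",".join(cols) + "}:"
            rows = [pad + "  " + ",".join(fmt(row[c]) for c in cols)
                    for row in value]
            return [header, *rows]
        if all(isinstance(v, dict) for v in value):
            # Objects that themselves contain arrays: expand item by item.
            lines = [f"{pad}{key}[{n}]:"]
            for item in value:
                lines.append(f"{pad}  -")
                for k, v in item.items():
                    lines += encode_field(k, v, indent + 2)
            return lines
        # Fallback: keep the data as inline JSON and tell the caller.
        warnings.warn(f"{key!r}: mixed array kept as inline JSON")
        return [f"{pad}{key}: {json.dumps(value, ensure_ascii=False)}"]
    raise TypeError(f"unsupported type for {key!r}: {type(value).__name__}")


def to_text(data):
    """Serialize a top-level dict."""
    return "\n".join(line for k, v in data.items() for line in encode_field(k, v))
```

Applied to the input from this report, this sketch emits `hierarchy[4]:` followed by the four strings on one line, and `offset[3]{start,end}:` followed by three rows such as `511,520`, so the key names are written once per array rather than once per element and no property is lost.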
