Optimize GPT2 inference: Remove redundant autoregressive_latent_graph and enable streaming output #18

@candlewill

Description

Thank you for this excellent implementation. I'd like to suggest an optimization that could significantly speed up inference and enable streaming output.

Currently, there are two GPT2 graphs:

  1. `autoregressive`: generates speech codes (originally so CLVP could select the best result)
  2. `autoregressive_latent_graph`: generates latents from the selected result

Since CLVP has been removed, we can streamline this to a single GPT2 graph that directly generates latents. I've implemented this with minimal changes:

  1. In `autoregressive_graph`, add the following after `cur = ggml_add(ctx0, cur, model.language_model_head_layer_norm_bias);`:

```cpp
// Output latents: take a contiguous 1024-dim view of the hidden state at
// the last real token position (test_dimension - 1) for each batch entry
ggml_tensor *final_output_2 = ggml_cont(
    ctx0, ggml_view_4d(ctx0, cur, 1024, 1, batch_size, 1,
                       cur->nb[1], cur->nb[2], cur->nb[3],
                       (test_dimension - 1) * sizeof(float) * 1024));

ggml_set_name(final_output_2, "output_latents");
ggml_set_output(final_output_2);
ggml_build_forward_expand(gf, final_output_2);
```
  2. In the main inference loop, extract the latent:

```cpp
extract_tensor_to_vector(ggml_graph_get_tensor(gf, "output_latents"), latent);
```

Benefits:

  1. Faster inference by eliminating redundant GPT2 runs
  2. Enables potential streaming output of latents
  3. Simplifies code structure

This change should help anyone aiming to speed up inference or implement streaming latent generation.
