Open
Description
Thank you for this excellent implementation. I'd like to suggest an optimization that could significantly speed up inference and enable streaming output.
Currently, there are two GPT2 graphs:
- autoregressive: Generates speech codes (originally for CLVP to select the best result)
- autoregressive_latent_graph: Generates latents based on the best result
Since CLVP has been removed, we can streamline this to a single GPT2 graph that directly generates latents. I've implemented this with minimal changes:
- In autoregressive_graph, add the following after the line cur = ggml_add(ctx0, cur, model.language_model_head_layer_norm_bias);

      // Output latents: view the 1024-wide row at index test_dimension - 1
      // for each of the batch_size entries, then make it contiguous.
      ggml_tensor *final_output_2 = ggml_cont(
          ctx0, ggml_view_4d(ctx0, cur, 1024, 1, batch_size, 1,
                             cur->nb[1], cur->nb[2], cur->nb[3],
                             (test_dimension - 1) * sizeof(float) * 1024));
      ggml_set_name(final_output_2, "output_latents");
      ggml_set_output(final_output_2);
      ggml_build_forward_expand(gf, final_output_2);

- In the main inference loop, extract the latent:

      extract_tensor_to_vector(ggml_graph_get_tensor(gf, "output_latents"), latent);

Benefits:
- Faster inference by eliminating redundant GPT2 runs
- Enables potential streaming output of latents
- Simplifies code structure
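To make the byte offset in the ggml_view_4d call above concrete, here is a minimal standalone sketch of the same indexing on a plain float buffer. It does not use ggml. The function name last_token_rows and the constant kEmbd are hypothetical, and it assumes the hidden states are stored contiguously as [batch][token][1024] floats and that test_dimension equals the number of token positions in cur.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed embedding width, matching the 1024 used in the view above.
constexpr std::size_t kEmbd = 1024;

// Hypothetical helper: copy the row at token index (n_tokens - 1)
// for every batch entry out of a contiguous [batch][token][kEmbd] buffer.
std::vector<float> last_token_rows(const std::vector<float> &hidden,
                                   std::size_t n_tokens,
                                   std::size_t batch_size) {
    assert(n_tokens > 0);
    assert(hidden.size() == batch_size * n_tokens * kEmbd);

    std::vector<float> out;
    out.reserve(batch_size * kEmbd);

    // Same arithmetic as (test_dimension - 1) * sizeof(float) * 1024,
    // but counted in elements instead of bytes.
    const std::size_t offset = (n_tokens - 1) * kEmbd;
    // Distance between consecutive batch entries (the role of cur->nb[2]).
    const std::size_t batch_stride = n_tokens * kEmbd;

    for (std::size_t b = 0; b < batch_size; ++b) {
        const float *row = hidden.data() + b * batch_stride + offset;
        out.insert(out.end(), row, row + kEmbd);
    }
    return out;
}
```

With 3 token positions and a batch of 2, this returns 2 * 1024 floats: the third row of the first batch entry followed by the third row of the second.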
This optimization could significantly benefit users looking to speed up inference or implement streaming latent generation.
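As a rough illustration of what streaming could look like once each GPT2 step yields a latent directly, the sketch below forwards every latent to a consumer as soon as it is produced. All names here (Latent, StepFn, SinkFn, stream_latents) are hypothetical and not part of the project. In the real loop, the step callback would compute the graph and call extract_tensor_to_vector(ggml_graph_get_tensor(gf, "output_latents"), latent).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Latent = std::vector<float>;

// Hypothetical step callback: runs one forward pass and fills `latent`.
// Returns false once generation is finished (for example, on a stop token).
using StepFn = std::function<bool(std::size_t step, Latent &latent)>;

// Hypothetical consumer, e.g. something that feeds a downstream decoder.
using SinkFn = std::function<void(std::size_t step, const Latent &latent)>;

// Run up to max_steps steps, handing each latent to the sink immediately
// instead of collecting all of them first. Returns the number emitted.
std::size_t stream_latents(const StepFn &run_step, const SinkFn &on_latent,
                           std::size_t max_steps) {
    assert(run_step && on_latent);

    Latent latent;
    std::size_t emitted = 0;
    for (std::size_t step = 0; step < max_steps; ++step) {
        latent.clear();
        if (!run_step(step, latent)) {
            break;  // generation finished before max_steps
        }
        on_latent(step, latent);
        ++emitted;
    }
    return emitted;
}
```

The design choice is that the consumer sees latents one at a time, so downstream work can start before the full sequence is generated.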