
GGUF model stops generation around 900-1000 tokens

#2 opened by High-U

The GGUF version of rnj-1-instruct consistently stops generation at around 900-1000 tokens when asked to generate longer content (e.g., complete HTML applications).
The model appears to complete generation early rather than continuing to the requested length. Is this expected behavior, or is there a way to generate longer outputs with the GGUF version?

Hi, can you give an example of a prompt that behaves this way?

Generate a complete HTML and JS application that looks like a screensaver of wandering through a maze. It should be large and intricate. Write extensive unit tests for it as well.

I'm sometimes able to trigger it with this prompt, but other times it continues without an issue. If you're using llama-server, there's an option in the settings to enable a "Continue" button that sometimes works. Will look into this more.

After enabling the "Continue" button setting in llama-server's web UI (General → Enable "Continue" button), generation behavior changed in that UI. Generation now often completes fully (1500+ tokens), though it still sometimes stops around 1000 tokens.
The environment that consistently stopped at around 1000 tokens was not the llama-server web UI, but curl (or my own UI) calling the llama-server endpoint.
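
One thing worth ruling out on the endpoint side is the per-request token limit. Here is a minimal curl sketch against llama-server's OpenAI-compatible endpoint with an explicit `max_tokens`; the host, port, model name, and limit are assumptions, not values taken from this thread:

```bash
# Sketch only: set max_tokens explicitly so the request is not capped by a
# client-side or UI default. Host, port, and model name are placeholders.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "rnj-1-instruct",
        "messages": [
          {"role": "user", "content": "Generate a complete HTML and JS application that looks like a screensaver of wandering through a maze."}
        ],
        "max_tokens": 4096
      }'
```

The `finish_reason` field in the response should indicate whether generation hit a token limit (`"length"`) or the model emitted an end-of-sequence token (`"stop"`), which helps separate a request-side cap from the model genuinely stopping early.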
