The model was converted to GGUF format using llama.cpp.
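The conversion step is not shown in detail here, but it can typically be reproduced with llama.cpp's `convert_hf_to_gguf.py` script. The script name and flags vary between llama.cpp versions, and the paths below are placeholders:

```bash
# Sketch of the conversion (paths are placeholders); --outtype bf16 keeps the original bf16 weights
python convert_hf_to_gguf.py /path/to/HyperCLOVAX-SEED-Text-Instruct-1.5B \
    --outtype bf16 \
    --outfile HyperCLOVAX-SEED-Text-Instruct-1.5B-gguf-bf16.gguf
```

The resulting GGUF file can then be loaded and run with the llama-cpp-python bindings: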
```python
from llama_cpp import Llama

# Load the bf16 GGUF model, offloading all layers to GPU 0
llm = Llama(
    model_path="HyperCLOVAX-SEED-Text-Instruct-1.5B-gguf-bf16.gguf",
    n_gpu_layers=-1,  # offload every layer to the GPU
    main_gpu=0,
    n_ctx=2048,
)

# Prompt (Korean): "Write me an interesting story. It has to be at least 1,000 characters. Story:"
output = llm(
    "์ฌ๋ฏธ์๋ ์ด์ผ๊ธฐ ํ๋ ๋ง๋ค์ด์ค. 1000์ ์ด์์ด์ด์ผ ํด. ์์:",
    max_tokens=2048,
    echo=True,  # include the prompt in the returned text
)
print(output)
```
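The call returns an OpenAI-style completion dict, so if only the generated text is needed, something like the following should work (field layout as documented by llama-cpp-python):

```python
# Extract just the generated text from the completion dict
text = output["choices"][0]["text"]
print(text)
```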
Tested on a GeForce RTX 3070; performance was as follows (bf16, peak memory: 4 GB):

```
llama_perf_context_print:        load time =     210.50 ms
llama_perf_context_print: prompt eval time =     210.42 ms /    19 tokens (   11.07 ms per token,    90.30 tokens per second)
llama_perf_context_print:        eval time =   17923.17 ms /  2028 runs   (    8.84 ms per token,   113.15 tokens per second)
llama_perf_context_print:       total time =   21307.79 ms /  2047 tokens
```
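These timers are printed by llama.cpp itself (the Python bindings emit them when `verbose=True`, the default). A rough end-to-end throughput check can also be done from Python using the usage counters in the returned dict; this is only a sketch and the numbers will differ slightly from the internal timers:

```python
import time

start = time.time()
result = llm("์ฌ๋ฏธ์๋ ์ด์ผ๊ธฐ ํ๋ ๋ง๋ค์ด์ค.", max_tokens=256)  # same Korean prompt, shorter output
elapsed = time.time() - start

# llama-cpp-python returns OpenAI-style token usage counters
generated = result["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/sec")
```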