The model was converted to GGUF format using llama.cpp.
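The conversion step is not shown in detail here, but it can typically be reproduced with llama.cpp's `convert_hf_to_gguf.py` script. The script name and flags vary between llama.cpp versions, and the paths below are placeholders:

```bash
# Sketch of the conversion (paths are placeholders); --outtype bf16 keeps the original bf16 weights
python convert_hf_to_gguf.py /path/to/HyperCLOVAX-SEED-Text-Instruct-1.5B \
    --outtype bf16 \
    --outfile HyperCLOVAX-SEED-Text-Instruct-1.5B-gguf-bf16.gguf
```

The resulting GGUF file can then be loaded and run with the llama-cpp-python bindings: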
```python
from llama_cpp import Llama

# Load the bf16 GGUF model, offloading all layers to GPU 0
llm = Llama(
    model_path="HyperCLOVAX-SEED-Text-Instruct-1.5B-gguf-bf16.gguf",
    n_gpu_layers=-1,  # offload every layer to the GPU
    main_gpu=0,
    n_ctx=2048,
)

# Prompt (Korean): "Write me an interesting story. It has to be at least 1,000 characters. Story:"
output = llm(
    "์ฌ๋ฏธ์๋ ์ด์ผ๊ธฐ ํ๋ ๋ง๋ค์ด์ค. 1000์ ์ด์์ด์ด์ผ ํด. ์์:",
    max_tokens=2048,
    echo=True,  # include the prompt in the returned text
)
print(output)
```
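The call returns an OpenAI-style completion dict, so if only the generated text is needed, something like the following should work (field layout as documented by llama-cpp-python):

```python
# Extract just the generated text from the completion dict
text = output["choices"][0]["text"]
print(text)
```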
Tested on a GeForce RTX 3070; performance was as follows (bf16, peak memory: 4 GB):

```
llama_perf_context_print:        load time =     210.50 ms
llama_perf_context_print: prompt eval time =     210.42 ms /    19 tokens (   11.07 ms per token,    90.30 tokens per second)
llama_perf_context_print:        eval time =   17923.17 ms /  2028 runs   (    8.84 ms per token,   113.15 tokens per second)
llama_perf_context_print:       total time =   21307.79 ms /  2047 tokens
```
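These timers are printed by llama.cpp itself (the Python bindings emit them when `verbose=True`, the default). A rough end-to-end throughput check can also be done from Python using the usage counters in the returned dict; this is only a sketch and the numbers will differ slightly from the internal timers:

```python
import time

start = time.time()
result = llm("์ฌ๋ฏธ์๋ ์ด์ผ๊ธฐ ํ๋ ๋ง๋ค์ด์ค.", max_tokens=256)  # same Korean prompt, shorter output
elapsed = time.time() - start

# llama-cpp-python returns OpenAI-style token usage counters
generated = result["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/sec")
```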