%pip install --quiet bitsandbytes==0.41.1 transformers==4.34.1 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 auto-gptq==0.4.2
import torch
import transformers

# The quantized model needs a GPU, so fail early if CUDA is missing.
assert torch.cuda.is_available(), "you need cuda for this part"
device = torch.device("cuda")

model_name = "TheBloke/Llama-2-13B-GPTQ"

# Load the Llama tokenizer (tokenizers run on the CPU, so no device_map is needed)
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_name)
tokenizer.pad_token_id = tokenizer.eos_token_id

# And the Llama model itself
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    offload_state_dict=True,
)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(

Downloading model.safetensors:   0%|          | 0.00/7.26G [00:00<?, ?B/s]

WARNING:auto_gptq.nn_modules.qlinear.qlinear_cuda_old:CUDA extension not installed.
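
As a rough sanity check on the checkpoint size, the sketch below estimates the memory needed for the weights alone. It assumes about 13e9 parameters (taken from the "13B" in the model name) and 4-bit quantized weights; the bit width is an assumption, since this notebook does not state it. `approx_weight_size_gb` is a hypothetical helper, not part of any library.

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter count implied by "13B" in the model name (assumption).
n_params = 13e9

fp16_gb = approx_weight_size_gb(n_params, 16)  # unquantized half precision
int4_gb = approx_weight_size_gb(n_params, 4)   # assumed 4-bit GPTQ weights

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```

Under these assumptions the 4-bit estimate lands in the same range as the 7.26G `model.safetensors` download in the log. The remainder is plausibly quantization metadata and layers kept in higher precision, though that is a guess rather than something the log confirms.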

prompt = "The first discovered martian lifeform looks like"
batch = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(
    device
)
print("Input batch (encoded):", batch)

output_tokens = model.generate(
    **batch, max_new_tokens=64, do_sample=True, temperature=0.8
)
# Alternative decoding strategies (pass these instead of do_sample/temperature):
#   greedy decoding:                      do_sample=False
#   beam search for highest probability:  num_beams=4

print("\nOutput:", tokenizer.decode(output_tokens[0].cpu()))

Input batch (encoded): {'input_ids': tensor([[    1,   450,   937, 10943, 14436,   713,  2834,   689,  3430,   763]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1421: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use and modify the model generation configuration (see https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration )
  warnings.warn(

Output: <s>The first discovered martian lifeform looks like a "fancy spheroid"
It’s not an alien, but it’s close.
A methane-rich meteorite from Mars is the planet’s first known lifeform.
NASA / JSC / SCIENCE PHOTO LIBR
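
To see what `do_sample=True` and `temperature=0.8` change compared with greedy decoding, here is a minimal, library-free sketch of a single decoding step. The three-element `logits` list is made up for illustration, and the helper names (`softmax_with_temperature`, `greedy_pick`, `sample_pick`) are hypothetical; `model.generate` handles this internally and its actual implementation may differ in details.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature < 1 sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_pick(logits, temperature, rng):
    """Sampling: draw a token index according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for a toy 3-token vocabulary
rng = random.Random(0)    # fixed seed so the run is repeatable

probs_t10 = softmax_with_temperature(logits, 1.0)
probs_t08 = softmax_with_temperature(logits, 0.8)
greedy_token = greedy_pick(logits)
sampled_tokens = [sample_pick(logits, 0.8, rng) for _ in range(1000)]

print("probabilities at T=1.0:", [round(p, 3) for p in probs_t10])
print("probabilities at T=0.8:", [round(p, 3) for p in probs_t08])
print("greedy token index:", greedy_token)
print("sampled token counts:", [sampled_tokens.count(i) for i in range(3)])
```

Greedy decoding returns the same token on every run, while sampling can pick any token with nonzero probability, which is why rerunning the generation cell with `do_sample=True` usually yields a different continuation.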

Физтех.Статистика

Introduction to Data Analysis

Natural Language Processing. Text Generation with the LLAMA Model.
