{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "k4kpdDCUGe_i" }, "source": [ "# Introduction to Data Analysis\n", "\n", "\n", "## Natural Language Processing. Text Generation with the Llama 2 Model." ] }, { "cell_type": "markdown", "metadata": { "id": "H_h6MAj1SrtP" }, "source": [ "In the previous [notebook](https://miptstats.github.io/courses/ad_fivt/nlp_sem.html) we learned how to build recurrent neural networks. In this notebook we will apply the large language model Llama 2, running it on a GPU.\n", "Llama 2 is a family of modern open-access large language models. The original paper describing it was published in 2023.\n", "\n", "The model takes some text as input and continues it. Note that by default language models are not *conversational*: using them differs from using models such as ChatGPT, which are designed for interactive dialogue with a user. A single model is often released in several configurations, with both a plain and a conversational interface." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7xeRF_hSKgzs" }, "outputs": [], "source": [ "%pip install --quiet bitsandbytes==0.41.1 transformers==4.34.1 accelerate==0.24.0 sentencepiece==0.1.99 optimum==1.13.2 auto-gptq==0.4.2\n", "import torch\n", "import transformers\n", "\n", "# This notebook requires a CUDA GPU, so fail early if none is available.\n", "assert torch.cuda.is_available(), \"you need cuda for this part\"\n", "device = torch.device(\"cuda\")" ] }, { "cell_type": "markdown", "metadata": { "id": "xfYEmczvjmsk" }, "source": [ "Let's load the `TheBloke/Llama-2-13B-GPTQ` model from Hugging Face."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 365, "referenced_widgets": [ "0a57aa3fb8af4b6d904b879ace771551", "475a554352744fed8da0ba9c92410898", "0a200547d48d488f8c44c77d2f73483b", "accd345ccfc74afaa2b4147ab7dcef9a", "3a36131f141e4da7afc3e00a12d69a86", "bf482a3f2e0c4305afbe918d3ede3ee0", "845cb6c473154f8eb553aa66f9b3fa7d", "9abb03a508d64b3ca373527ad73eee43", "29cf52c50a1a4e82957cdca500d00f23", "93b833aeda394f298cbff99f2b8e04e9", "bae1737063ee40dd8ef5e6553659cbf7", "00696440baba428490dc95ebc72e788e", "67874a98790c41c1adc8cda98c4ca950", "cab1d4211ec044229324c80a3b14c84c", "4ed6854cfdf54a8ea3c8114b8b5c6b31", "df2871c9f3ca4189b9cb7fcb18faf55c", "d7db1cb1374644708a751e8987c7f3b6", "1767702750324c6bb94c9d54350e5322", "d27c5606c07c48039f32803b1bc67dc6", "0f8939827c5a4b9b9daae69900fab24d", "280de6ba38614e1aba1dfc73d1b6fc4b", "e343c08ee7174cf19b4712950fd6f75d", "ab6941b786604960a6184de5f83d6c6f", "48fd878592b44c75b8e89012b38a7c98", "17119b5128b24cfe89b5d6e62f1773cb", "a6818b39c89b4cbd9d5ce85103cee07a", "b6186aadad97458e9974cbb5c6127e00", "393ea146c52c4a1aa96d377e4de9f801", "13d71cd481b248398131519584911180", "1a9a79adc5fb470793e06d9c322b6926", "7c75f36d1f144fe09dd02351a2f74668", "898bcad4a82642d688464550fc8025c2", "fee0b5097c2e45a89bd9d71b56472589", "3e400844696a47edb97fb85404390425", "369eecf94279469980fe1449f903dc7d", "64e707df1c954ed39f68dbb37dd5a7c6", "23f97903fdc94f9088b157c82f05688a", "e685dee405874ffe8986629bdb06bfa8", "105c86d260214e68ad024ae5f7c6257e", "f3489a41493f453fb6de30abb8c55134", "4d1df5c6848f4cffb64763619206e629", "ce8335014fb248d29ae14ea5d6146b3e", "8170e01b38c04d3b988b87c0cd0b33a9", "d8f24f11cdea4a4c9ecf3c414630dd69", "9eebf1f4fe884b67a0b04eecc99bf6b5", "07299e026d794c3ba5ddd41ba2a7bd2a", "626185f4dd974afab772dea7c58964bd", "202947386e9b4468af6b5321a1bdfb73", "84d4624212b8410aa13bbca392dfd694", "e51ef80b5d4241ae85789d31670ac9de", "a16c4bbdb48241f8a88433d32ad25823", 
"db91a529aa2b4eb0b5764c451a3fbb9f", "eae35522244f4e778548775a37c9cb73", "22d5ccba29934652a1e96888ec3741d9", "7e1e56111053477a8086a397e57f221d", "feff1892c44a42c787bd9467eabc1712", "c461603036b0450490912f6741c16235", "9c5ce1657e7444f7a542831d06fe533c", "13f0725bffd64a98bbb6a555d4cc4464", "a631fee570394506ada1cdcabf719fb8", "b4118e45f3b945d9b95f942dccd815cb", "326a5fb332be4e41bb75e026d25c9fdd", "c93f78e236234d3a927c9ab696915f69", "cd1400d1251d43e7be7496c3ffd67b90", "256923e9ed844d66b7b4a490a52aefcb", "8c72760ae196449dbbdb5f5ac81e10a2", "df504ac591534a5bb4184e93fefafecd", "67ec9a06b70a49d9b965e74d1fd601f7", "0dd9103b87f94f4096c1acc76ae2b24e", "c1dfc401974d4a20b553fbd46ec73cf1", "e81111dd5105409ba5a71ba192d0ce79", "bde80d3a6ac14cfcac2fd9dd7dcac1ae", "365aec1ab7ad4d8984526f9d55d4b9d6", "1e3c8879103146979caccb8a9265c1a7", "f8382986155d41c29acdf217d237561f", "0cdcd08e951844e48545d9c9b7ae2e1b", "770fdabc09b74f50ac64473ade4548a6" ] }, "id": "VMzFwx29Kgzu", "outputId": "12d0cf33-a940-422b-81fe-a8d3117b729b" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0a57aa3fb8af4b6d904b879ace771551", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading tokenizer_config.json: 0%| | 0.00/727 [00:00. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. 
This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9eebf1f4fe884b67a0b04eecc99bf6b5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Downloading config.json: 0%| | 0.00/913 [00:00The first discovered martian lifeform looks like a \"fancy spheroid\"\n", "It’s not an alien, but it’s close.\n", "A methane-rich meteorite from Mars is the planet’s first known lifeform.\n", "NASA / JSC / SCIENCE PHOTO LIBR\n" ] } ], "source": [ "prompt = \"The first discovered martian lifeform looks like\"\n", "# Tokenize the prompt and move the resulting tensors to the GPU.\n", "batch = tokenizer(prompt, return_tensors=\"pt\", return_token_type_ids=False).to(\n", "    device\n", ")\n", "print(\"Input batch (encoded):\", batch)\n", "\n", "# Sample up to 64 new tokens with temperature 0.8.\n", "output_tokens = model.generate(\n", "    **batch, max_new_tokens=64, do_sample=True, temperature=0.8\n", ")\n", "# Alternative decoding settings:\n", "# greedy inference: do_sample=False\n", "# beam search for highest probability: num_beams=4\n", "\n", "print(\"\\nOutput:\", tokenizer.decode(output_tokens[0].cpu()))" ] }, { "cell_type": "markdown", "metadata": { "id": "6SJ-c-lTj27E" }, "source": [ "**Conclusion:** In this notebook we saw how to generate text with the pretrained language model Llama 2." ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "hide_input": false, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 1 }