diff --git a/olive_quantization/README.md b/olive_quantization/README.md
index b9e17df..932aedb 100644
--- a/olive_quantization/README.md
+++ b/olive_quantization/README.md
@@ -1,38 +1,56 @@
-# OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization [[paper](https://arxiv.org/abs/2304.07493)]
+## Environment Setup
 
-![](figures/intro_victor.png)
-
-## Abstract
-
-Transformer-based large language models (LLMs) have achieved great success with the growing model size. LLMs’ size grows by 240× every two years, which outpaces the hardware progress and makes model inference increasingly costly. Model quantization is a promising approach to mitigate the widening gap between LLM size and hardware capacity. However, the existence of outliers, values with significant magnitudes, in LLMs makes existing quantization methods less effective. Prior outlier-aware quantization schemes adopt sparsity encoding techniques to separate outliers from normal values where the process requires global coordination (e.g., a global sparsity coordination list). This incurs complex encoding/decoding hardware logics and an extra orchestration controller for the computation between outlier and normal values. As such, it is not hardware-efficient and hence only achieves sub-optimal quantization benefits.
-
-We propose OliVe, an algorithm/architecture co-designed solution that adopts an outlier-victim pair (OVP) quantization and handles outlier values locally with low hardware overheads and high performance gains. The key insight of OliVe is that outliers are important while the normal values next to them are not. Thus those normal values (called victims) can be sacrificed to accommodate outliers. This enables a memory-aligned OVP encoding scheme, which can be efficiently integrated to the existing hardware accelerators like systolic array and tensor core. As a result, OliVe-based accelerator surpasses the existing outlier-aware accelerator, GOBO, by 4.5× speedup and 4.0× energy reduction, respectively, with a superior model accuracy.
-
-## Environment
-```bash
+```bash
 conda create -n OliVe python=3.8
 conda activate OliVe
 conda install pytorch=1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
+
+cd ./olive_quantization
+
 pip install -r requirements.txt
 pip install ./quant
 ```
 
-## Paper's Hardware Configuration
-+ AMD EPYC 7302 16-Core Processor
-+ NVIDIA A40 GPU (48GB)
-## Usage
-### BERT / BART
+## Adapting to LLaMA
+
+Once the environment is set up, upgrade the following packages inside the conda environment:
+
+```bash
+pip install --upgrade evaluate
+pip install --upgrade datasets
+pip install --upgrade transformers==4.33
+pip install accelerate==0.20.3
+```
+
+## Running OliVe
+
+```bash
+cd olive_quantization/llm
+./scripts/run_all.sh
+```
+
+An example invocation from `run_all.sh`, which you can edit by hand to match your experiment:
+
+```bash
+CUDA_VISIBLE_DEVICES=1 ./scripts/clm_run.sh LLAMA/llama-7b c4 realnewslike ant-int-flint 4 2 46666 outlier
+```
+
+The positional arguments map to:
 
-We adopt the BERT and BART models for the NLP task with five datasets, MNLI, CoLA, SST-2, QQP and MRPC.
-
-For reproducing the results in the paper, please refer to `./bert`.
-
-### Large Language Models
-
-We adopt the GPT-2, OPT and Bloom models for the NLP task with two datasets, wikitext and C4.
+- `LLAMA/llama-7b`: a folder holding a symlink to the model weights (see the sketch below); for an OPT model use e.g. `OPT/opt-6.7b`.
+- `c4 realnewslike`: the dataset name and config; for WikiText use `wikitext wikitext-103-raw-v1` instead.
+- `ant-int-flint`: the quantization mode (`q_mode` in `clm_run.sh`).
+- `4`: the weight/activation bit-width (`wbit`/`abit`); this example uses 4-bit, which is also the script default.
+- `2`: the batch size used for evaluation, training and quantization.
+- `46666`: the master port passed to `torchrun`.
+- `outlier`: a free-form tag appended to the log file name.
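+
+The model folder is expected to contain a symlink to locally downloaded weights. A minimal sketch for creating the symlinks, assuming hypothetical local download paths:
+
+```bash
+cd olive_quantization/llm
+mkdir -p LLAMA OPT
+# Both source paths below are illustrative; point them at your own checkpoints.
+ln -s /data/models/llama1-hf/7B LLAMA/llama-7b
+ln -s /data/models/opt/opt-6.7b OPT/opt-6.7b
+```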
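+
+Putting the pieces together, a single-GPU 8-bit run of OPT-6.7B on WikiText with batch size 2 (this variant also appears in `run_all.sh`) would be:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 ./scripts/clm_run.sh OPT/opt-6.7b wikitext wikitext-103-raw-v1 ant-int-flint 8 2 46666 outlier
+```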
-
-For reproducing the results in the paper, please refer to `./llm`.
\ No newline at end of file
+
+All script parameter defaults live in `clm_run.sh`.
+Experiment results are written to `./llm/checkpoints`.
+Experiment logs are written to `./llm/log`.
\ No newline at end of file
diff --git a/olive_quantization/llm/LLAMA/llama-7b b/olive_quantization/llm/LLAMA/llama-7b
new file mode 120000
index 0000000..cd15454
--- /dev/null
+++ b/olive_quantization/llm/LLAMA/llama-7b
@@ -0,0 +1 @@
+/data/gaozh/llama1-hf/7B/
\ No newline at end of file
diff --git a/olive_quantization/llm/OPT/opt-125m b/olive_quantization/llm/OPT/opt-125m
new file mode 120000
index 0000000..3c65738
--- /dev/null
+++ b/olive_quantization/llm/OPT/opt-125m
@@ -0,0 +1 @@
+/data/gaozh/opt/opt-125m/
\ No newline at end of file
diff --git a/olive_quantization/llm/OPT/opt-6.7b b/olive_quantization/llm/OPT/opt-6.7b
new file mode 120000
index 0000000..c0e58b2
--- /dev/null
+++ b/olive_quantization/llm/OPT/opt-6.7b
@@ -0,0 +1 @@
+/data/gaozh/opt/opt-6.7b/
\ No newline at end of file
diff --git a/olive_quantization/llm/accuracy/accuracy.py b/olive_quantization/llm/accuracy/accuracy.py
new file mode 100644
index 0000000..aa5a073
--- /dev/null
+++ b/olive_quantization/llm/accuracy/accuracy.py
@@ -0,0 +1,106 @@
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Accuracy metric."""
+
+import datasets
+from sklearn.metrics import accuracy_score
+
+import evaluate
+
+
+_DESCRIPTION = """
+Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
+Accuracy = (TP + TN) / (TP + TN + FP + FN)
+Where:
+TP: True positive
+TN: True negative
+FP: False positive
+FN: False negative
+"""
+
+
+_KWARGS_DESCRIPTION = """
+Args:
+    predictions (`list` of `int`): Predicted labels.
+    references (`list` of `int`): Ground truth labels.
+    normalize (`boolean`): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
+    sample_weight (`list` of `float`): Sample weights. Defaults to None.
+
+Returns:
+    accuracy (`float` or `int`): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0 if `normalize` is set to `True`, or the number of examples input if `normalize` is set to `False`. A higher score means higher accuracy.
+
+Examples:
+
+    Example 1 - A simple example
+        >>> accuracy_metric = evaluate.load("accuracy")
+        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0])
+        >>> print(results)
+        {'accuracy': 0.5}
+
+    Example 2 - The same as Example 1, except with `normalize` set to `False`.
+        >>> accuracy_metric = evaluate.load("accuracy")
+        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], normalize=False)
+        >>> print(results)
+        {'accuracy': 3.0}
+
+    Example 3 - The same as Example 1, except with `sample_weight` set.
+        >>> accuracy_metric = evaluate.load("accuracy")
+        >>> results = accuracy_metric.compute(references=[0, 1, 2, 0, 1, 2], predictions=[0, 1, 1, 2, 1, 0], sample_weight=[0.5, 2, 0.7, 0.5, 9, 0.4])
+        >>> print(results)
+        {'accuracy': 0.8778625954198473}
+"""
+
+
+_CITATION = """
+@article{scikit-learn,
+  title={Scikit-learn: Machine Learning in {P}ython},
+  author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
+         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
+         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
+         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
+  journal={Journal of Machine Learning Research},
+  volume={12},
+  pages={2825--2830},
+  year={2011}
+}
+"""
+
+
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class Accuracy(evaluate.Metric):
+    def _info(self):
+        return evaluate.MetricInfo(
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Sequence(datasets.Value("int32")),
+                    "references": datasets.Sequence(datasets.Value("int32")),
+                }
+                if self.config_name == "multilabel"
+                else {
+                    "predictions": datasets.Value("int32"),
+                    "references": datasets.Value("int32"),
+                }
+            ),
+            reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
+        )
+
+    def _compute(self, predictions, references, normalize=True, sample_weight=None):
+        return {
+            "accuracy": float(
+                accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
+            )
+        }
diff --git a/olive_quantization/llm/checkpoints/LLAMA/llama-7b/all_results.json b/olive_quantization/llm/checkpoints/LLAMA/llama-7b/all_results.json
new file mode 100644
index 0000000..79c1512
--- /dev/null
+++ b/olive_quantization/llm/checkpoints/LLAMA/llama-7b/all_results.json
@@ -0,0 +1,9 @@
+{
+    "eval_accuracy": 0.24537370580455747,
+    "eval_loss": 5.000532150268555,
+    "eval_runtime": 801.0457,
+    "eval_samples": 289,
+    "eval_samples_per_second": 0.361,
+    "eval_steps_per_second": 0.181,
+    "perplexity": 148.49215822288735
+}
\ No newline at end of file
diff --git a/olive_quantization/llm/checkpoints/LLAMA/llama-7b/eval_results.json b/olive_quantization/llm/checkpoints/LLAMA/llama-7b/eval_results.json
new file mode 100644
index 0000000..79c1512
--- /dev/null
+++ b/olive_quantization/llm/checkpoints/LLAMA/llama-7b/eval_results.json
@@ -0,0 +1,9 @@
+{
+    "eval_accuracy": 0.24537370580455747,
+    "eval_loss": 5.000532150268555,
+    "eval_runtime": 801.0457,
+    "eval_samples": 289,
+    "eval_samples_per_second": 0.361,
+    "eval_steps_per_second": 0.181,
+    "perplexity": 148.49215822288735
+}
\ No newline at end of file
diff --git a/olive_quantization/llm/checkpoints/facebook/opt-125m/README.md b/olive_quantization/llm/checkpoints/facebook/opt-125m/README.md
new file mode 100644
index 0000000..9d85ca1
--- /dev/null
+++ b/olive_quantization/llm/checkpoints/facebook/opt-125m/README.md
@@ -0,0 +1,57 @@
+---
+license: other
+tags:
+- generated_from_trainer
+datasets:
+- wikitext
+model-index:
+- name: opt-125m
+  results: []
+---
+
+# opt-125m
+
+This model is a fine-tuned version of [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) on the wikitext wikitext-103-raw-v1 dataset.
+It achieves the following results on the evaluation set:
+- eval_loss: 4.4711
+- eval_accuracy: 0.2692
+- eval_runtime: 37.8287
+- eval_samples_per_second: 6.424
+- eval_steps_per_second: 3.225
+- step: 0
+
+## Model description
+
+More information needed
+
+## Intended uses & limitations
+
+More information needed
+
+## Training and evaluation data
+
+More information needed
+
+## Training procedure
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 5e-05
+- train_batch_size: 2
+- eval_batch_size: 2
+- seed: 42
+- distributed_type: multi-GPU
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- num_epochs: 3.0
+
+### Framework versions
+
+- Transformers 4.26.1
+- Pytorch 1.11.0
+- Datasets 2.15.0
+- Tokenizers 0.13.3
diff --git a/olive_quantization/llm/checkpoints/facebook/opt-125m/all_results.json b/olive_quantization/llm/checkpoints/facebook/opt-125m/all_results.json
new file mode 100644
index 0000000..86496d3
--- /dev/null
+++ b/olive_quantization/llm/checkpoints/facebook/opt-125m/all_results.json
@@ -0,0 +1,9 @@
+{
+    "eval_accuracy": 0.2692234974194353,
+    "eval_loss": 4.471142292022705,
+    "eval_runtime": 37.8287,
+    "eval_samples": 243,
+    "eval_samples_per_second": 6.424,
+    "eval_steps_per_second": 3.225,
+    "perplexity": 87.45656691585893
+}
\ No newline at end of file
diff --git a/olive_quantization/llm/checkpoints/facebook/opt-125m/eval_results.json b/olive_quantization/llm/checkpoints/facebook/opt-125m/eval_results.json
new file mode 100644
index 0000000..86496d3
--- /dev/null
+++ b/olive_quantization/llm/checkpoints/facebook/opt-125m/eval_results.json
@@ -0,0 +1,9 @@
+{
+    "eval_accuracy": 0.2692234974194353,
+    "eval_loss": 4.471142292022705,
+    "eval_runtime": 37.8287,
+    "eval_samples": 243,
+    "eval_samples_per_second": 6.424,
+    "eval_steps_per_second": 3.225,
+    "perplexity": 87.45656691585893
+}
\ No newline at end of file
diff --git a/olive_quantization/llm/run_clm.py b/olive_quantization/llm/run_clm.py
index 3983421..122f102 100644
--- a/olive_quantization/llm/run_clm.py
+++ b/olive_quantization/llm/run_clm.py
@@ -526,6 +526,7 @@ def tokenize_function(examples):
                 "Picking 1024 instead. You can change that default value by passing --block_size xxx."
             )
             block_size = 1024
+            tokenizer.model_max_length = block_size // 4
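+            # The cap above is this patch's choice: it holds the tokenizer's max
+            # length to a quarter of the grouping block size (the same cap is
+            # applied in the else-branch below); integer division keeps
+            # model_max_length an int as the tokenizer expects.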
         else:
             if data_args.block_size > tokenizer.model_max_length:
                 logger.warning(
@@ -533,6 +534,7 @@
                     f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}."
                 )
             block_size = min(data_args.block_size, tokenizer.model_max_length)
+            tokenizer.model_max_length = block_size // 4
 
     # Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
     def group_texts(examples):
@@ -590,7 +592,7 @@ def preprocess_logits_for_metrics(logits, labels):
             logits = logits[0]
         return logits.argmax(dim=-1)
 
-    metric = evaluate.load("accuracy")
+    metric = evaluate.load("./accuracy/accuracy.py")
 
     def compute_metrics(eval_preds):
         preds, labels = eval_preds
diff --git a/olive_quantization/llm/scripts/clm_run.sh b/olive_quantization/llm/scripts/clm_run.sh
index dd381c2..1f4ebbf 100755
--- a/olive_quantization/llm/scripts/clm_run.sh
+++ b/olive_quantization/llm/scripts/clm_run.sh
@@ -3,14 +3,14 @@ dataset=${2:-"wikitext"}
 dataset_config=${3:-"wikitext-103-raw-v1"}
 q_mode=${4:-"ant-int-flint"}
 q_bit=${5:-"4"}
-batch_size=${6:-"8"}
+batch_size=${6:-"4"}
 port=${7:-46666}
 desc=${8:-""}
 n8=${9:-"0"}
 
 mkdir -p ./log
-mkdir -p ./log/bigscience
-mkdir -p ./log/facebook
+mkdir -p ./log/LLAMA
+mkdir -p ./log/OPT
 
 log_name=""
 if [ "$dataset" = "wikitext" ] ; then
@@ -19,7 +19,7 @@ else
     log_name=$transformer_model"_"$dataset"_"$q_bit"bit_batch"$batch_size"_"$desc
 fi
 
-python -u -m torch.distributed.launch --nproc_per_node=1 --master_port $port run_clm.py \
+torchrun --nproc_per_node=1 --master_port $port run_clm.py \
     --model_name_or_path $transformer_model \
     --dataset_name $dataset --dataset_config_name $dataset_config \
     --output_dir checkpoints/$transformer_model \
diff --git a/olive_quantization/llm/scripts/run_all.sh b/olive_quantization/llm/scripts/run_all.sh
old mode 100644
new mode 100755
index ffc240b..0acf19c
--- a/olive_quantization/llm/scripts/run_all.sh
+++ b/olive_quantization/llm/scripts/run_all.sh
@@ -1,11 +1,6 @@
-./scripts/clm_run.sh bigscience/bloom-7b1 wikitext wikitext-103-raw-v1 ant-int-flint 4 1 46666 outlier
-./scripts/clm_run.sh bigscience/bloom-7b1 c4 realnewslike ant-int-flint 4 1 46666 outlier
-./scripts/clm_run.sh facebook/opt-6.7b wikitext wikitext-103-raw-v1 ant-int-flint 4 2 46666 outlier
+./scripts/clm_run.sh LLAMA/llama-7b wikitext wikitext-103-raw-v1 ant-int-flint 4 2 46666 outlier
 ./scripts/clm_run.sh facebook/opt-6.7b c4 realnewslike ant-int-flint 4 2 46666 outlier
-./scripts/clm_run.sh gpt2-xl wikitext wikitext-103-raw-v1 ant-int-flint 4 8 46666 outlier
-./scripts/clm_run.sh gpt2-xl c4 realnewslike ant-int-flint 4 8 46666 outlier
-
-CUDA_VISIBLE_DEVICES=0 ./scripts/clm_run.sh bigscience/bloom-7b1 c4 realnewslike ant-int-flint 4 1 46666 outlier &
-CUDA_VISIBLE_DEVICES=1 ./scripts/clm_run.sh facebook/opt-6.7b c4 realnewslike ant-int-flint 4 2 46667 outlier &
-CUDA_VISIBLE_DEVICES=2 ./scripts/clm_run.sh gpt2-xl c4 realnewslike ant-int-flint 4 8 46668 outlier &
\ No newline at end of file
+CUDA_VISIBLE_DEVICES=0 ./scripts/clm_run.sh OPT/opt-6.7b wikitext wikitext-103-raw-v1 ant-int-flint 8 2 46666 outlier &
+CUDA_VISIBLE_DEVICES=1 ./scripts/clm_run.sh LLAMA/llama-7b c4 realnewslike ant-int-flint 4 2 46667 outlier &
+#CUDA_VISIBLE_DEVICES=2 ./scripts/clm_run.sh gpt2-xl c4 realnewslike ant-int-flint 4 8 46668 outlier &
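+# Hypothetical further job: OPT on C4 at 8-bit. Give each concurrent run its
+# own GPU (CUDA_VISIBLE_DEVICES) and a distinct master port so the torchrun
+# rendezvous ports do not collide.
+#CUDA_VISIBLE_DEVICES=2 ./scripts/clm_run.sh OPT/opt-6.7b c4 realnewslike ant-int-flint 8 2 46668 outlier &
\ No newline at end of file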