Llama 2 notes

imoldpan · July 19, 2023, 07:45

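The main features and upgrades of the Llama 2 models are:

Three released sizes: 7B, 13B, and 70B parameters.
The 70B version uses grouped-query attention (GQA), which improves inference performance.
Llama 2-Chat, a model fine-tuned specifically for dialogue, was released alongside it and is reported to be on par with ChatGPT.
Compared with Llama 1, the training data grew by 40%, the context length doubled to 4096, and the data cleaning is more rigorous.
On a range of reasoning, coding, and knowledge benchmarks, Llama 2 outperforms other open-source language models.
Llama 2-Chat is further improved with reinforcement learning from human feedback (RLHF), with a focus on safety and helpfulness.
Llama 2 is optimized mainly for English. Because of its limited vocabulary size, it performs only moderately when applied directly to Chinese and needs Chinese-specific additional training.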

Code details

This refers to the Llama 2 PR in transformers:

[Llama2] Add support for Llama 2

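For the details, see https://github.com/huggingface/transformers/pull/24891/files .

The code changes are modest: the PR builds on the existing llama code and adds a few configuration options, with no major rewrite.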
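To make "a few configuration options" concrete, here is a minimal sketch of how a key-value head count can sit next to the existing attention fields. As far as I can tell the relevant new field in the PR is `num_key_value_heads`, but the class below (`AttentionConfigSketch`) and its helper properties are hypothetical and are not the transformers API; the fallback to the query head count when the field is unset is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionConfigSketch:
    """Hypothetical stand-in for the attention-related fields of a Llama config."""

    hidden_size: int
    num_attention_heads: int
    # Assumption: when unset, fall back to one KV head per query head (plain MHA).
    num_key_value_heads: Optional[int] = None

    def __post_init__(self) -> None:
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads
        if self.num_attention_heads % self.num_key_value_heads != 0:
            raise ValueError("num_attention_heads must be divisible by num_key_value_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads

    @property
    def num_key_value_groups(self) -> int:
        # How many query heads share each key-value head.
        return self.num_attention_heads // self.num_key_value_heads

    @property
    def kv_proj_out_features(self) -> int:
        # Output width of the K (or V) projection; shrinks as KV heads are removed.
        return self.num_key_value_heads * self.head_dim


mha = AttentionConfigSketch(hidden_size=8192, num_attention_heads=64)
gqa = AttentionConfigSketch(hidden_size=8192, num_attention_heads=64, num_key_value_heads=8)
print(mha.kv_proj_out_features, gqa.kv_proj_out_features)
```

With 64 query heads and 8 key-value heads, each KV head serves 8 query heads, and the K and V projections are 8 times narrower than in the MHA setting.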

Grouped Query Attention

"Grouped-query attention" (GQA), from Google, proposes a method to convert models from multi-head attention (MHA) to GQA. GQA claims to offer benefits similar to multi-query attention (MQA), with faster inference through a reduced number of key-value heads.

You can uptrain an MHA model to MQA or GQA
Only 5% of the original pre-training compute is needed to convert MHA to GQA
GQA achieves inference speed close to that of multi-query attention
GQA quality is close to MHA and better than MQA
GQA uses an intermediate number of key-value heads (between 1 and the original number of heads)
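The head sharing described above can be sketched in plain Python. This is only an illustration of the idea, not the transformers implementation: the function name `grouped_query_attention`, the nested-list layout, and the omission of a causal mask are all simplifying assumptions.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v):
    """q: [num_q_heads][seq][head_dim]; k, v: [num_kv_heads][seq][head_dim].

    Illustrative only: no causal mask, no batching.
    """
    num_q_heads, num_kv_heads = len(q), len(k)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of key-value heads")
    group_size = num_q_heads // num_kv_heads
    head_dim = len(q[0][0])
    scale = 1.0 / math.sqrt(head_dim)

    outputs = []
    for head in range(num_q_heads):
        # All query heads in the same group read the same key-value head.
        kv_head = head // group_size
        keys, values = k[kv_head], v[kv_head]
        head_out = []
        for q_vec in q[head]:
            scores = [scale * sum(a * b for a, b in zip(q_vec, key)) for key in keys]
            weights = softmax(scores)
            head_out.append([
                sum(w * value[j] for w, value in zip(weights, values))
                for j in range(head_dim)
            ])
        outputs.append(head_out)
    return outputs


# 4 query heads sharing 2 key-value heads, sequence length 2, head_dim 2.
q = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
k = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
v = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
out = grouped_query_attention(q, k, v)
```

Setting the number of key-value heads equal to the number of query heads gives ordinary MHA, and setting it to 1 gives MQA, so the same function covers all three cases.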

GQA introduces an interesting approach in which inference optimizations can be applied after pre-training. It will be exciting to see whether open-source models adopt it. LLaMA, MPT, and GPT-NeoX all currently use MHA.
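For the conversion step itself, the GQA paper describes building each grouped key and value head by mean-pooling the original heads in that group, followed by the short uptraining run. Below is a minimal sketch of the pooling part only; `mean_pool_heads` is a hypothetical helper name, the weights are plain nested lists rather than real tensors, and the uptraining is not shown.

```python
def mean_pool_heads(per_head_weights, num_kv_heads):
    """Average consecutive groups of per-head K (or V) projection matrices.

    per_head_weights: list of H matrices (each a list of rows) with identical shapes.
    Returns num_kv_heads matrices, one per group.
    """
    num_heads = len(per_head_weights)
    if num_kv_heads < 1 or num_heads % num_kv_heads != 0:
        raise ValueError("num_kv_heads must divide the original number of heads")
    group_size = num_heads // num_kv_heads

    pooled = []
    for g in range(num_kv_heads):
        group = per_head_weights[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([
            [sum(m[r][c] for m in group) / group_size for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


# Four original key heads, each a tiny 1x2 "projection matrix".
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
gqa_heads = mean_pool_heads(key_heads, num_kv_heads=2)  # two groups of two heads
mqa_heads = mean_pool_heads(key_heads, num_kv_heads=1)  # one shared head
```

The same pooling would be applied to the value projections; the query and output projections keep all of their heads.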

Structure

llama2-70b vs llama1-65b
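A rough way to compare the two models is to count parameters and KV cache size from their shapes. The numbers below (hidden size 8192, 80 layers, 64 heads, vocabulary 32000, intermediate sizes 22016 and 28672, and 8 key-value heads for the 70B model) are what I recall from the published configurations, so treat them as assumptions; `ModelShape` is a hypothetical helper, and the count ignores normalization weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical container for the shape fields needed for a rough estimate."""

    name: str
    hidden_size: int
    intermediate_size: int
    num_layers: int
    num_heads: int
    num_kv_heads: int
    vocab_size: int = 32000

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_heads

    def approx_params(self) -> int:
        kv_dim = self.num_kv_heads * self.head_dim
        # Q and O projections are hidden x hidden; K and V shrink with fewer KV heads.
        attn = 2 * self.hidden_size * self.hidden_size + 2 * self.hidden_size * kv_dim
        # Gated MLP: gate, up, and down projections.
        mlp = 3 * self.hidden_size * self.intermediate_size
        # Input embeddings plus output head (assumed untied).
        embeddings = 2 * self.vocab_size * self.hidden_size
        return self.num_layers * (attn + mlp) + embeddings

    def kv_cache_bytes_per_token(self, bytes_per_value: int = 2) -> int:
        # One key and one value vector per KV head per layer.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * bytes_per_value


# Assumed shapes, recalled from the released configs.
llama1_65b = ModelShape("llama1-65b", 8192, 22016, 80, 64, 64)
llama2_70b = ModelShape("llama2-70b", 8192, 28672, 80, 64, 8)

for shape in (llama1_65b, llama2_70b):
    print(
        f"{shape.name}: ~{shape.approx_params() / 1e9:.1f}B params, "
        f"{shape.kv_cache_bytes_per_token() / 1024:.0f} KiB of KV cache per token"
    )
```

Under these assumptions the estimate comes out at about 65.3B and 69.0B parameters. The 70B model spends fewer parameters on attention and more on the MLP, and its 16-bit KV cache is 320 KiB per token versus 2.5 MiB for the 65B model, an 8x reduction.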
