Compared to LLM-only build commands, the LLM part of a multimodal model takes an additional parameter, --max_multimodal_len. Under the hood, max_multimodal_len and max_prompt_embedding_table_size are effectively the same concept: both reserve room for embeddings (multimodal feature embeddings or prompt-tuning embeddings) that are prepended/concatenated to the LLM's input embeddings. The multimodal features produced by the visual encoder, of shape [batch_size, num_visual_features, visual_hidden_dim], are flattened to [batch_size * num_visual_features, visual_hidden_dim] and passed in as if they were a prompt embedding table.
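A minimal sketch of that flattening step, assuming PyTorch tensors and made-up dimension values (variable names here are illustrative, not TensorRT-LLM API):

```python
import torch

# Hypothetical visual-encoder output; the sizes are illustrative only.
batch_size, num_visual_features, visual_hidden_dim = 2, 576, 4096
visual_features = torch.randn(batch_size, num_visual_features, visual_hidden_dim,
                              dtype=torch.float16)

# Flatten [batch_size, num_visual_features, visual_hidden_dim]
# -> [batch_size * num_visual_features, visual_hidden_dim],
# i.e. the same layout as a prompt embedding table.
prompt_embedding_table = visual_features.view(-1, visual_hidden_dim)

# The engine must be built with max_prompt_embedding_table_size /
# max_multimodal_len >= batch_size * num_visual_features.
print(prompt_embedding_table.shape)  # torch.Size([1152, 4096])
```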
Problems encountered
Runtime error: satisfyProfile Runtime dimension does not satisfy any optimization profile, i.e., an input shape at inference time (e.g., the flattened prompt embedding table) falls outside the range covered by the engine's optimization profiles.
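One common way this happens with multimodal inputs is that the flattened visual features need more rows than the table size the engine was built with. A minimal sanity-check sketch, using hypothetical values (built_max_prompt_embedding_table_size is an illustrative name, not a TensorRT-LLM API):

```python
# Illustrative numbers only; read the real limit from your build config
# (the value passed as --max_prompt_embedding_table_size / --max_multimodal_len).
built_max_prompt_embedding_table_size = 2048  # assumed build-time limit

batch_size = 4                 # runtime batch
num_visual_features = 576      # e.g. one image's worth of visual tokens

required = batch_size * num_visual_features
if required > built_max_prompt_embedding_table_size:
    raise ValueError(
        f"Prompt table needs {required} rows but the engine was built for "
        f"{built_max_prompt_embedding_table_size}; rebuild with a larger "
        "limit or reduce the batch size.")
```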
LLaVA: in the LLM's forward pass, the prompt-tuning arguments are forwarded to the vocab embedding:
ptuning_args = [
    prompt_embedding_table, prompt_tasks, prompt_vocab_size
] if prompt_embedding_table is not None else []

if self.mapping.is_first_pp_rank():
    hidden_states = self.vocab_embedding(input_ids, *ptuning_args)
else:
    hidden_states = recv(hidden_states, self.mapping.prev_pp_rank())
input_ids: TensorRT-LLM Tensor: self.name='input_ids' self.dtype=<DataType.INT32: 3> self.shape=(-1,)
prompt_embedding_table: TensorRT-LLM Tensor: self.name='prompt_embedding_table' self.dtype=<DataType.HALF: 1> self.shape=(-1, 4096)
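For context, a rough sketch of how these inputs are typically assembled on the runtime side: the visual features are flattened into the prompt table, the corresponding positions in input_ids are filled with "fake" ids at or above vocab_size, tasks maps each token to a segment of the table, and prompt_vocab_size carries the table length. The names and shapes below are illustrative, not the exact TensorRT-LLM runner code:

```python
import torch

vocab_size = 32000                        # assumed LLM vocab size
visual_features = torch.randn(1, 576, 4096, dtype=torch.float16)
text_ids = torch.tensor([1, 3148, 1001, 29901], dtype=torch.int32)  # some text tokens

# Flatten visual features into the prompt embedding table.
prompt_embedding_table = visual_features.view(-1, visual_features.shape[-1])
num_visual_tokens = prompt_embedding_table.shape[0]

# "Fake" ids >= vocab_size select rows of the prompt table inside
# PromptTuningEmbedding; regular ids < vocab_size hit the normal table.
fake_ids = torch.arange(vocab_size, vocab_size + num_visual_tokens, dtype=torch.int32)
input_ids = torch.cat([fake_ids, text_ids])          # image tokens, then text

tasks = torch.zeros_like(input_ids)                  # single task (task 0) here
prompt_vocab_size = torch.tensor([num_visual_tokens], dtype=torch.int32)
```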
class PromptTuningEmbedding(Embedding):
    """
    PromptTuningEmbedding handles fine-tuned prompts with virtual tokens. At runtime,
    a supplementary embedding dictionary is passed. Tokens whose ids are >= vocab_size are embedded
    with that additional dictionary.
    The prompt tuning dictionary holds multiple tasks, and each sequence is assigned a given task.
    Prompt-tuned tokens from a given sequence use the adequate task dictionary, as defined by the `tasks` input.
    """

    def __init__(self,
                 num_embeddings,
                 embedding_dim,
                 vocab_size=None,
                 dtype=None,
                 tp_size=1,
                 tp_group=None,
                 sharding_dim=0,
                 tp_rank=0):
        super().__init__(num_embeddings, embedding_dim, dtype, tp_size,
                         tp_group, sharding_dim, tp_rank)
        if vocab_size is None:
            vocab_size = num_embeddings
        self.vocab_size = vocab_size

    def forward(self, tokens, prompt_embedding_table, tasks, task_vocab_size):
        """
        Pass all tokens through both normal and prompt embedding tables.
        Tokens are masked so that the "normal" embedding only sees "normal" tokens. Same logic for the "prompt" embedding.
        After those two embeddings, the results are combined based on whether the token was "normal" or "prompt-tuned".

        Parameters:
            tokens : Tensor
                the ids to embed, size [batch_size, seq_len]

            prompt_embedding_table : Tensor
                the additional embedding table for prompt-tuned tokens, size [num_tasks * num_tokens_per_task, hidden_size]

            tasks : Tensor
                the task required by each token, size [batch_size, seq_len]

            task_vocab_size : Tensor
                the number of tokens used for each task, should be equal to prompt_embedding_table's num_tokens_per_task, size [1]

        Returns:
            Tokens' embeddings
        """
        # do not use ">=" because internally the layer works with floating points
        prompt_tokens_mask = tokens > (self.vocab_size - 1)

        # clip tokens in the [0, vocab_size) range
        normal_tokens = where(prompt_tokens_mask, self.vocab_size - 1, tokens)
        normal_embeddings = embedding(normal_tokens, self.weight.value,
                                      self.tp_size, self.tp_group,
                                      self.sharding_dim, self.tp_rank)

        # put virtual tokens in the [0, max_prompt_vocab_size) range
        prompt_tokens = where(prompt_tokens_mask, tokens - self.vocab_size, 0)

        # add offsets to match the concatenated embedding tables
        tasks = tasks * task_vocab_size

        # tasks: [batch_size, seq_len]
        # prompt_tokens: [batch_size, seq_len]
        prompt_tokens = prompt_tokens + tasks
        prompt_embeddings = embedding(prompt_tokens, prompt_embedding_table)

        # prompt_tokens_mask: [batch_size, seq_len] -> [batch_size, seq_len, 1]
        # combine the correct sources of embedding: normal/prompt
        return where(unsqueeze(prompt_tokens_mask, -1), prompt_embeddings,
                     normal_embeddings)
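To make the masking logic concrete, here is a small standalone re-implementation of the same routing step in plain PyTorch (not the TensorRT-LLM graph ops), with toy sizes; it only illustrates how ids below vocab_size hit the normal table while ids at or above it index the prompt table:

```python
import torch

vocab_size, hidden = 8, 4
normal_table = torch.randn(vocab_size, hidden)
prompt_table = torch.randn(3, hidden)            # one task with 3 virtual tokens
task_vocab_size = torch.tensor(3)

tokens = torch.tensor([[1, 5, 8, 9]])            # 8 and 9 are "virtual" (>= vocab_size)
tasks = torch.zeros_like(tokens)                 # every token uses task 0

prompt_mask = tokens > (vocab_size - 1)

# normal path: clip virtual ids so the lookup stays in [0, vocab_size)
normal_tokens = torch.where(prompt_mask, torch.tensor(vocab_size - 1), tokens)
normal_emb = normal_table[normal_tokens]

# prompt path: shift virtual ids into [0, num_tokens_per_task) and add the task offset
prompt_tokens = torch.where(prompt_mask, tokens - vocab_size, torch.tensor(0))
prompt_tokens = prompt_tokens + tasks * task_vocab_size
prompt_emb = prompt_table[prompt_tokens]

# pick the right source per token
out = torch.where(prompt_mask.unsqueeze(-1), prompt_emb, normal_emb)
print(out.shape)  # torch.Size([1, 4, 4])
```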
vLLM support
Related issues and PRs:
References
- Poor performance of serving vision-language models using batching (InternLM/lmdeploy#1357): https://github.com/InternLM/lmdeploy/issues/1357
- triton-inference-server/tensorrtllm_backend#344: https://github.com/triton-inference-server/tensorrtllm_backend/issues/344
- NVIDIA/TensorRT-LLM#800: https://github.com/NVIDIA/TensorRT-LLM/issues/800
- NVIDIA/TensorRT-LLM#913: https://github.com/NVIDIA/TensorRT-LLM/issues/913
- NVIDIA/TensorRT-LLM#444: https://github.com/NVIDIA/TensorRT-LLM/issues/444
- NVIDIA/TensorRT-LLM#461: https://github.com/NVIDIA/TensorRT-LLM/issues/461
- How to handle variable length decoder_input_ids for batch prediction in the Nougat family model? (NVIDIA/TensorRT-LLM#1166): https://github.com/NVIDIA/TensorRT-LLM/issues/1166