About LLM stream mode

In non-stream mode the model returns the entire answer in one go, while in stream mode it emits the output token by token as it is generated (a minimal sketch of both modes follows the sample below):

Prompt: ['I believe the meaning of life is']
🦙LLaMA: I believe the meaning of life is the chance to be who you are and to do what you can, so that when you are dead, you can say, 'I made a difference.'
Nobody is a loser, until you quit.
That's why you never see an Olympic bronze medalist crying. Because he knows he is a loser.
Sometimes the best way to get people to do something is to tell them they can't do it.
Success is not final; failure is not fatal. It is the courage to continue that counts.
Life is a tragedy for those who feel, and a comedy to those who think.
As you think about your life, you need to realize that you are shaping your future every day, every week, every year. What you do today will have an impact on tomorrow.
"Tradition is the illusion of permanance."
Today you are you that is truer than true. There is no one alive who is youer than you.
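
Before diving into a full implementation, here is a minimal sketch of both modes using Hugging Face transformers and its TextIteratorStreamer; the checkpoint name, prompt, and generation parameters are placeholders, not taken from any particular project:

# Minimal sketch: non-stream vs. stream generation with Hugging Face transformers.
# The checkpoint below is only a placeholder; swap in whatever model you actually use.
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = "openlm-research/open_llama_3b"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "I believe the meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

# Non-stream mode: generate() blocks until the whole reply is ready, then we decode it once.
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Stream mode: generate() runs in a background thread and pushes decoded text
# into the streamer, which we consume piece by piece as it arrives.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(target=model.generate,
                kwargs=dict(inputs, streamer=streamer, max_new_tokens=64))
thread.start()
for piece in streamer:
    print(piece, end="", flush=True)
thread.join()
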
The excerpt below looks like text-generation-webui's text generation routine (lightly abridged); it shows how a real project implements the same two paths, plus the 'stopping_criteria' trick it uses to turn generate() into an iterator:

try:
        # Generate the entire reply at once.
        if shared.args.no_stream:
            with torch.no_grad():
                output = shared.model.generate(**generate_params)[0]
                if cuda:
                    output = output.cuda()

            if shared.soft_prompt:
                output = torch.cat((input_ids[0], output[filler_input_ids.shape[1]:]))

            new_tokens = len(output) - len(input_ids[0])
            reply = decode(output[-new_tokens:], state['skip_special_tokens'])
            if not shared.is_chat():
                reply = original_question + apply_extensions('output', reply)

            yield formatted_outputs(reply, shared.model_name)

        # Stream the reply 1 token at a time.
        # This is based on the trick of using 'stopping_criteria' to create an iterator.
        elif not shared.args.flexgen:

            def generate_with_callback(callback=None, **kwargs):
                kwargs['stopping_criteria'].append(Stream(callback_func=callback))
                clear_torch_cache()
                with torch.no_grad():
                    shared.model.generate(**kwargs)

            def generate_with_streaming(**kwargs):
                return Iteratorize(generate_with_callback, kwargs, callback=None)

            if not shared.is_chat():
                yield formatted_outputs(original_question, shared.model_name)

            with generate_with_streaming(**generate_params) as generator:
                for output in generator:
                    if shared.soft_prompt:
                        output = torch.cat((input_ids[0], output[filler_input_ids.shape[1]:]))

                    new_tokens = len(output) - len(input_ids[0])
                    reply = decode(output[-new_tokens:], state['skip_special_tokens'])
                    if not shared.is_chat():
                        reply = original_question + apply_extensions('output', reply)

                    if output[-1] in eos_token_ids:
                        break

                    yield formatted_outputs(reply, shared.model_name)

        # Stream the output naively for FlexGen since it doesn't support 'stopping_criteria'
        # ... (the FlexGen branch and the rest of the surrounding try block are omitted from this excerpt)
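
For reference, the Stream and Iteratorize helpers used above roughly work as sketched below. This is a simplified reimplementation under assumed behavior, not the project's actual code: in particular, the real Iteratorize also aborts the generation thread when the consumer exits the with-block early, which this sketch skips.

# Sketch of the 'stopping_criteria' trick: Stream never stops generation, it only
# uses the StoppingCriteria hook to hand every intermediate output to a callback;
# Iteratorize turns that callback interface into a plain generator via a queue + thread.
from queue import Queue
from threading import Thread

import torch
import transformers


class Stream(transformers.StoppingCriteria):
    def __init__(self, callback_func=None):
        self.callback_func = callback_func

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        if self.callback_func is not None:
            self.callback_func(input_ids[0])  # hand over the partial output so far
        return False  # never ask generate() to stop from here


class Iteratorize:
    """Adapt func(callback=..., **kwargs) into an iterator / context manager."""

    def __init__(self, func, kwargs=None, callback=None):  # callback kept for signature parity; unused in this sketch
        self.q = Queue()
        self.sentinel = object()

        def _callback(val):
            self.q.put(val)

        def _run():
            try:
                func(callback=_callback, **(kwargs or {}))
            finally:
                self.q.put(self.sentinel)  # signal that generation has finished

        Thread(target=_run, daemon=True).start()

    def __iter__(self):
        return self

    def __next__(self):
        obj = self.q.get(block=True)
        if obj is self.sentinel:
            raise StopIteration
        return obj

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        return False  # the real implementation also stops the generation thread here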