
[Question]: What are the definitions of the different stages? #98

Open · crazyofapple opened this issue Dec 19, 2024 · 10 comments
Assignees: iofu728
Labels: question (Further information is requested)

@crazyofapple

Describe the issue

I would like to ask: what are the definitions of the different stages here, and how are they distinguished? I am a novice in this field.

ATTN_KV_TYPES=(
"vllm;dense" # FullAttention
"vllm_minference;dense" "vllm_a_shape;dense" "vllm_tri_shape;dense" # 1) KV Cache Generation Stage
"dense;streamingllm" "dense;snapkv" "dense;pyramidkv" "dense;kivi" # 2) KV Cache Compression Stage
"vllm_blend;dense" # 3) KV Cache Retrieval Stage
"dense;quest" "dense;retr_attn" # 4) KV Cache Loading Stage
)

@crazyofapple crazyofapple added the question Further information is requested label Dec 19, 2024
@iofu728 iofu728 self-assigned this Dec 19, 2024
@iofu728 (Contributor) commented Dec 19, 2024

Hi @crazyofapple, thanks for your interest in our work.

These four stages are part of the KV cache lifecycle that we recently introduced in SCBench. You can find the details in Section 2 and Figure 1.

[Figure: one-page overview of SCBench]
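
For quick reference, here is a minimal Python sketch that regroups the `attn_type;kv_type` pairs from the question above by the lifecycle stage they exercise. It only restates the mapping already implicit in the `ATTN_KV_TYPES` array; the split on `";"` mirrors the pair convention and is not SCBench's own configuration parser.

```python
# Illustrative regrouping of the "attn_type;kv_type" pairs from the question
# above by KV-cache lifecycle stage. Not SCBench's actual config parsing.
STAGES = {
    "FullAttention baseline": ["vllm;dense"],
    "1) KV Cache Generation": ["vllm_minference;dense", "vllm_a_shape;dense", "vllm_tri_shape;dense"],
    "2) KV Cache Compression": ["dense;streamingllm", "dense;snapkv", "dense;pyramidkv", "dense;kivi"],
    "3) KV Cache Retrieval": ["vllm_blend;dense"],
    "4) KV Cache Loading": ["dense;quest", "dense;retr_attn"],
}

for stage, pairs in STAGES.items():
    for pair in pairs:
        attn_type, kv_type = pair.split(";")
        print(f"{stage:<26} attn_type={attn_type:<16} kv_type={kv_type}")
```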

@crazyofapple (Author)

Thank you. I had read the ICLR submission before. Thanks for your excellent code base and timely response.

@crazyofapple (Author)

@iofu728
I have a few more questions I'd like you to answer: What is the difference between vllm_minference and minference? What is the difference between flash_attn and dense? Are retrieval_attn in kv_type and attn_type the same thing?

choices=[
"vllm",
"vllm_minference",
"vllm_a_shape",
"vllm_tri_shape",
"vllm_blend",
"hf",
"a_shape",
"tri_shape",
"inf_llm",
"flash_attn",
"minference",
"minference_with_dense",
"minference_with_dense_sink",
"dilated1",
"dilated2",
"retrieval_attn",
"minference_with_retr_attn",
"vllm_kv",
"dense",
],

@crazyofapple crazyofapple reopened this Dec 19, 2024
@iofu728 (Contributor) commented Dec 20, 2024

Hi @crazyofapple, thanks for your question, and apologies for any confusion caused by unclear code.

  1. vllm_minference vs. minference: These are essentially the same method. However, we recommend vllm_minference, which integrates seamlessly with many vLLM optimization features, such as tensor parallelism (TP) and prefix caching; minference refers to the Hugging Face (HF) implementation.

  2. flash_attn and dense: These two are identical, both referring to the Flash Attention-2 kernel.

  3. retrieval_attn in kv_type and attn_type: Apologies for the oversight here. The option for attn_type should have been removed, as RetrievalAttention currently only supports kv_type="retr_attn".

Let me know if you have further questions!
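
To make the three clarifications above concrete, here is a small, hypothetical helper (not part of the MInference/SCBench codebase) that normalizes a configuration accordingly: `flash_attn` maps to `dense`, `minference` maps to the recommended `vllm_minference`, and `attn_type="retrieval_attn"` is rejected in favor of `kv_type="retr_attn"`.

```python
# Hypothetical helper that encodes the clarifications above as a config check.
EQUIVALENT_ATTN = {
    "minference": "vllm_minference",  # same method; the vllm_* variant uses the vLLM backend (TP, prefix caching)
    "flash_attn": "dense",            # both select the FlashAttention-2 kernel
}

def normalize_config(attn_type: str, kv_type: str) -> tuple[str, str]:
    """Normalize an (attn_type, kv_type) pair per the notes in this thread."""
    if attn_type == "retrieval_attn":
        # RetrievalAttention is selected via kv_type only.
        raise ValueError('Use kv_type="retr_attn" instead of attn_type="retrieval_attn".')
    return EQUIVALENT_ATTN.get(attn_type, attn_type), kv_type

print(normalize_config("flash_attn", "retr_attn"))  # ('dense', 'retr_attn')
```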

@crazyofapple (Author)

Thank you very much for your answer, which resolved my question. BTW, may I ask which vLLM version you use (mine is vllm==0.6.5, torch==2.5.1, Python 3.12.4)? My current vllm_minference run reports some errors.

@crazyofapple (Author)

Also, I can't find a Python package for papyfaiss. I understand the effort it takes to open-source it.

@iofu728 (Contributor) commented Dec 20, 2024

Hi @crazyofapple,

I’ve been testing locally with vllm==0.6.0. I’ll check the higher versions soon. Could you share your error log? I’ll double-check if the specific error matches when I look into it later.

As for papyfaiss, it will be open-sourced at https://github.com/microsoft/RetrievalAttention. It’s currently undergoing internal code review.

@crazyofapple (Author)

Many thanks for the reply. I switched to your version and it worked. The following is the error message from the higher version.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1683, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:                                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 568, in forward
[rank0]:     model_output = self.model(input_ids, positions, kv_caches,
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 168, in __call__
[rank0]:     return self.forward(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: minference_patch_vllm_executor.<locals>.llama_model_forward_vllm() takes from 5 to 6 positional arguments but 7 were given

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "ldf/MInference/scbench/run_scbench.py", line 392, in <module>
[rank0]:     pred = get_pred(
[rank0]:            ^^^^^^^^^
[rank0]:   File "ldf/MInference/scbench/run_scbench.py", line 125, in get_pred
[rank0]:     outputs = model.test(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "ldf/MInference/scbench/eval_utils.py", line 1148, in test
[rank0]:     result = self.llm.generate(
[rank0]:              ^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/utils.py", line 1025, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 454, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1234, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1405, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 88, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 343, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
[rank0]:     raise type(err)(
[rank0]: TypeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241220-085214.pkl): minference_patch_vllm_executor.<locals>.llama_model_forward_vllm() takes from 5 to 6 positional arguments but 7 were given
```
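
The failing call is a vLLM-internal model forward whose signature apparently changed between the tested vllm==0.6.0 and vllm==0.6.5, so the MInference monkey patch passes a mismatched number of positional arguments. As a purely illustrative sketch (not from the repository), one could warn when the installed vLLM is newer than the tested version before running:

```python
# Hypothetical pre-flight check, not part of SCBench: warn when the installed
# vLLM is newer than the version the maintainers report testing (0.6.0),
# since the patched llama_model_forward_vllm signature may no longer match.
import re
from importlib.metadata import version

TESTED_VLLM = (0, 6, 0)

installed = tuple(int(x) for x in re.findall(r"\d+", version("vllm"))[:3])
if installed > TESTED_VLLM:
    print(
        f"Warning: vllm {'.'.join(map(str, installed))} is newer than the tested "
        f"{'.'.join(map(str, TESTED_VLLM))}; the MInference vLLM patch may not "
        "match this version's model forward signature."
    )
```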

@crazyofapple (Author)

@iofu728 BTW, I see that the results in your paper all come from single methods. Have you tried combining different types of methods? If so, what did you find? If not, do you think such combinations have a future? I saw the improvement from SnapKV + MInference in the original MInference paper, but there seem to be no large-scale empirical results for combinations.

@iofu728 (Contributor) commented Dec 23, 2024

Hi @crazyofapple,

That’s a great question! We haven’t conducted detailed research in this area, but I personally believe there’s potential for exploration and some interesting insights might be uncovered.

Currently, aside from the MInference w/ SnapKV results in MInference, there’s also MInference w/ ShadowKV provided by ShadowKV for reference.
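
As an editorial aside on the combination question: in the `attn_type;kv_type` convention used earlier in this thread, combining methods would amount to pairing a non-dense attn_type with a non-dense kv_type. The pairs below are hypothetical candidates for illustration only; the thread does not confirm that SCBench supports them.

```python
# Purely hypothetical pairings to illustrate the combination idea discussed
# above; whether SCBench accepts these configurations is NOT confirmed here.
COMBINED_CANDIDATES = [
    "vllm_minference;snapkv",  # sparse prefill + eviction-style KV compression
    "vllm_minference;quest",   # sparse prefill + page-level KV loading
]

for pair in COMBINED_CANDIDATES:
    attn_type, kv_type = pair.split(";")
    print(f"candidate combination: attn_type={attn_type}, kv_type={kv_type}")
```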
