
[Question]: What are the definitions of the different stages? #98

Open · crazyofapple opened this issue Dec 19, 2024 · 10 comments
Assignees: iofu728
Labels: question (Further information is requested)

@crazyofapple

Describe the issue

I would like to ask: what are the definitions of the different stages here, and how are they distinguished? I am a novice in this field.

ATTN_KV_TYPES=(
"vllm;dense" # FullAttention
"vllm_minference;dense" "vllm_a_shape;dense" "vllm_tri_shape;dense" # 1) KV Cache Generation Stage
"dense;streamingllm" "dense;snapkv" "dense;pyramidkv" "dense;kivi" # 2) KV Cache Compression Stage
"vllm_blend;dense" # 3) KV Cache Retrieval Stage
"dense;quest" "dense;retr_attn" # 4) KV Cache Loading Stage
)

@crazyofapple crazyofapple added the question Further information is requested label Dec 19, 2024
@iofu728 iofu728 self-assigned this Dec 19, 2024
@iofu728 (Contributor) commented Dec 19, 2024

Hi @crazyofapple, thanks for your interest in our work.

These four stages are part of the KV cache lifecycle that we recently introduced in SCBench. You can find the details in Section 2 and Figure 1.

[Figure: one-page overview of SCBench]
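
For quick reference, here is a minimal Python sketch that regroups the `attn_type;kv_type` pairs from the question above by the lifecycle stage they exercise. It only restates the mapping already implicit in the `ATTN_KV_TYPES` array; the split on `";"` mirrors the pair convention and is not SCBench's own configuration parser.

```python
# Illustrative regrouping of the "attn_type;kv_type" pairs from the question
# above by KV-cache lifecycle stage. Not SCBench's actual config parsing.
STAGES = {
    "FullAttention baseline": ["vllm;dense"],
    "1) KV Cache Generation": ["vllm_minference;dense", "vllm_a_shape;dense", "vllm_tri_shape;dense"],
    "2) KV Cache Compression": ["dense;streamingllm", "dense;snapkv", "dense;pyramidkv", "dense;kivi"],
    "3) KV Cache Retrieval": ["vllm_blend;dense"],
    "4) KV Cache Loading": ["dense;quest", "dense;retr_attn"],
}

for stage, pairs in STAGES.items():
    for pair in pairs:
        attn_type, kv_type = pair.split(";")
        print(f"{stage:<26} attn_type={attn_type:<16} kv_type={kv_type}")
```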

@crazyofapple (Author)

Thank you. I had read the ICLR submission before. Thanks for your excellent code base and timely response.

@crazyofapple (Author)

@iofu728
I have a few more questions I'd like you to answer: What is the difference between vllm_minference and minference? What is the difference between flash_attn and dense? Are retrieval_attn in kv_type and attn_type the same thing?

choices=[
"vllm",
"vllm_minference",
"vllm_a_shape",
"vllm_tri_shape",
"vllm_blend",
"hf",
"a_shape",
"tri_shape",
"inf_llm",
"flash_attn",
"minference",
"minference_with_dense",
"minference_with_dense_sink",
"dilated1",
"dilated2",
"retrieval_attn",
"minference_with_retr_attn",
"vllm_kv",
"dense",
],

@crazyofapple crazyofapple reopened this Dec 19, 2024
@iofu728 (Contributor) commented Dec 20, 2024

Hi @crazyofapple, thanks for your question, and apologies for any confusion caused by unclear code.

  1. vllm_minference vs. minference: These are essentially the same method. However, we recommend vllm_minference, which integrates seamlessly with many vLLM optimization features, such as tensor parallelism (TP) and prefix caching; minference refers to the Hugging Face (HF) implementation.

  2. flash_attn and dense: These two are identical, both referring to the Flash Attention-2 kernel.

  3. retrieval_attn in kv_type and attn_type: Apologies for the oversight here. The option for attn_type should have been removed, as RetrievalAttention currently only supports kv_type="retr_attn".

Let me know if you have further questions!
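
To make the three clarifications above concrete, here is a small, hypothetical helper (not part of the MInference/SCBench codebase) that normalizes a configuration accordingly: `flash_attn` maps to `dense`, `minference` maps to the recommended `vllm_minference`, and `attn_type="retrieval_attn"` is rejected in favor of `kv_type="retr_attn"`.

```python
# Hypothetical helper that encodes the clarifications above as a config check.
EQUIVALENT_ATTN = {
    "minference": "vllm_minference",  # same method; the vllm_* variant uses the vLLM backend (TP, prefix caching)
    "flash_attn": "dense",            # both select the FlashAttention-2 kernel
}

def normalize_config(attn_type: str, kv_type: str) -> tuple[str, str]:
    """Normalize an (attn_type, kv_type) pair per the notes in this thread."""
    if attn_type == "retrieval_attn":
        # RetrievalAttention is selected via kv_type only.
        raise ValueError('Use kv_type="retr_attn" instead of attn_type="retrieval_attn".')
    return EQUIVALENT_ATTN.get(attn_type, attn_type), kv_type

print(normalize_config("flash_attn", "retr_attn"))  # ('dense', 'retr_attn')
```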

@crazyofapple (Author)

Thank you very much for your answer, which resolved my question. BTW, may I ask which vLLM version you use (mine is vllm==0.6.5, torch==2.5.1, Python 3.12.4)? My current vllm_minference run reports some errors.

@crazyofapple (Author)

Also, I can't find a Python package for papyfaiss. I understand the effort it takes to open-source it.

@iofu728 (Contributor) commented Dec 20, 2024

Hi @crazyofapple,

I’ve been testing locally with vllm==0.6.0. I’ll check the higher versions soon. Could you share your error log? I’ll double-check if the specific error matches when I look into it later.

As for papyfaiss, it will be open-sourced at https://github.com/microsoft/RetrievalAttention. It’s currently undergoing internal code review.

@crazyofapple (Author)

Many thanks for the reply. I switched to your version and it worked. The following is the error message from the higher version.

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1683, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:                                     ^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 568, in forward
[rank0]:     model_output = self.model(input_ids, positions, kv_caches,
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 168, in __call__
[rank0]:     return self.forward(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: minference_patch_vllm_executor.<locals>.llama_model_forward_vllm() takes from 5 to 6 positional arguments but 7 were given

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "ldf/MInference/scbench/run_scbench.py", line 392, in <module>
[rank0]:     pred = get_pred(
[rank0]:            ^^^^^^^^^
[rank0]:   File "ldf/MInference/scbench/run_scbench.py", line 125, in get_pred
[rank0]:     outputs = model.test(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "ldf/MInference/scbench/eval_utils.py", line 1148, in test
[rank0]:     result = self.llm.generate(
[rank0]:              ^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/utils.py", line 1025, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 454, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 1234, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1405, in step
[rank0]:     outputs = self.model_executor.execute_model(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 88, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 343, in execute_model
[rank0]:     output = self.model_runner.execute_model(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
[rank0]:     raise type(err)(
[rank0]: TypeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241220-085214.pkl): minference_patch_vllm_executor.<locals>.llama_model_forward_vllm() takes from 5 to 6 positional arguments but 7 were given
```
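
The failing call is a vLLM-internal model forward whose signature apparently changed between the tested vllm==0.6.0 and vllm==0.6.5, so the MInference monkey patch passes a mismatched number of positional arguments. As a purely illustrative sketch (not from the repository), one could warn when the installed vLLM is newer than the tested version before running:

```python
# Hypothetical pre-flight check, not part of SCBench: warn when the installed
# vLLM is newer than the version the maintainers report testing (0.6.0),
# since the patched llama_model_forward_vllm signature may no longer match.
import re
from importlib.metadata import version

TESTED_VLLM = (0, 6, 0)

installed = tuple(int(x) for x in re.findall(r"\d+", version("vllm"))[:3])
if installed > TESTED_VLLM:
    print(
        f"Warning: vllm {'.'.join(map(str, installed))} is newer than the tested "
        f"{'.'.join(map(str, TESTED_VLLM))}; the MInference vLLM patch may not "
        "match this version's model forward signature."
    )
```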

@crazyofapple (Author)

@iofu728 BTW, I see that the results in your paper all come from single methods. Have you tried combining different types of methods? If so, what did you find? If not, do you think such combinations have a future? I saw the improvement from SnapKV + MInference in the original MInference paper, but there seem to be no large-scale empirical results for combinations.

@iofu728 (Contributor) commented Dec 23, 2024

Hi @crazyofapple,

That’s a great question! We haven’t conducted detailed research in this area, but I personally believe there’s potential for exploration and some interesting insights might be uncovered.

Currently, aside from the MInference w/ SnapKV results in MInference, there’s also MInference w/ ShadowKV provided by ShadowKV for reference.
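
As an editorial aside on the combination question: in the `attn_type;kv_type` convention used earlier in this thread, combining methods would amount to pairing a non-dense attn_type with a non-dense kv_type. The pairs below are hypothetical candidates for illustration only; the thread does not confirm that SCBench supports them.

```python
# Purely hypothetical pairings to illustrate the combination idea discussed
# above; whether SCBench accepts these configurations is NOT confirmed here.
COMBINED_CANDIDATES = [
    "vllm_minference;snapkv",  # sparse prefill + eviction-style KV compression
    "vllm_minference;quest",   # sparse prefill + page-level KV loading
]

for pair in COMBINED_CANDIDATES:
    attn_type, kv_type = pair.split(";")
    print(f"candidate combination: attn_type={attn_type}, kv_type={kv_type}")
```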
