[Question]: RuntimeError encountered when trying to reproduce results in needle in a haystack #88

lepangdan opened this issue Nov 26, 2024 · 3 comments

@lepangdan

Describe the issue

Hi,

Thanks again for your help. I encountered an error while reproducing the needle_in_a_haystack results by running bash experiments/needle_in_a_haystack/run_needle.sh, and would appreciate any insights:

[   1000   72357  143714  215071  286429  357786  429143  500500  571857
  643214  714571  785929  857286  928643 1000000]
[ 286429  357786  429143  500500  571857  643214  714571  785929  857286
  928643 1000000]
# Too long, ignore some logs
 File "/home/far/MInference/minference/modules/minference_forward.py", line 656, in forward
    part_o = self.gather_last_q_vertical_slash_topk_v4(part_q, part_k, part_v, head)
  File "/home/far/MInference/minference/modules/minference_forward.py", line 463, in gather_last_q_vertical_slash_topk_v4
    return fc(q, k, v, vertical_size, slash_size)
  File "/home/far/MInference/minference/modules/minference_forward.py", line 383, in vertical_and_slash_kernel
    slash = sum_all_diagonal_matrix(qk)[...,:-last_q + 1]
  File "/home/far/MInference/minference/modules/minference_forward.py", line 103, in sum_all_diagonal_matrix
    zero_mat = torch.zeros((b, h, n, n)).to(mat.device) # Zero matrix used for padding
  File "/home/far/MInference/minference/modules/minference_forward.py", line 103, in sum_all_diagonal_matrix
    zero_mat = torch.zeros((b, h, n, n)).to(mat.device) # Zero matrix used for padding
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
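
As the log itself notes, CUDA errors are reported asynchronously, so the frame above may not be the actual faulting call. One way to get a synchronous, more precise trace (a sketch assuming the same run_needle.sh entry point; CUDA_LAUNCH_BLOCKING is a standard CUDA/PyTorch environment variable, not a flag of this script) is:

CUDA_LAUNCH_BLOCKING=1 bash experiments/needle_in_a_haystack/run_needle.sh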

I noticed the error only occurs starting from job 4, when the --kv_cache_cpu argument is used; jobs in the range [0, 4) run fine. Any suggestions on this?

Additionally, I found that the vllm module is required to run the needle_in_a_haystack experiment. As far as I can tell, vllm shouldn't be necessary for minference. Is there a specific reason for this, or have I missed something?

Looking forward to your response!

@lepangdan lepangdan added the question Further information is requested label Nov 26, 2024
@iofu728 iofu728 self-assigned this Nov 26, 2024
@iofu728
Contributor

iofu728 commented Nov 26, 2024

Hi @lepangdan, thanks for your feedback.

It doesn't seem to be related to vLLM. It might be due to GPU memory not being fully reclaimed yet. Could you try running the Python command separately or upgrading Triton?

python experiments/needle_in_a_haystack/needle_test.py \
    --model_name gradientai/Llama-3-8B-Instruct-Gradient-1048k \
    --max_length 1000000 \
    --min_length 1000 \
    --rounds 5 \
    --attn_type minference \
    --kv_cache_cpu \
    --output_path ./needle \
    --run_name minference_LLaMA_1M \
    --jobs 4-15
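
If the problem is unreclaimed GPU memory or an outdated Triton, a minimal sketch of the two checks (assuming a pip-managed environment with the NVIDIA driver tools on PATH) would be:

pip install -U triton   # upgrade Triton to the latest release
nvidia-smi              # verify GPU memory is actually released between jobs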

@lepangdan
Author

lepangdan commented Nov 27, 2024

Hi @iofu728 ,

The error persists after running the command you suggested. Do you have any further suggestions?

Additionally, could you confirm how many A100s and how much total GPU memory were used to run the needle experiment?

@iofu728
Contributor

iofu728 commented Nov 28, 2024

Hi @lepangdan,

For the NIAH experiments, we used a single A100 GPU with 216 GB of CPU memory for inputs up to 800K tokens, while the 900K and 1M token settings were run on a single A100 GPU with 1 TB of CPU memory.

Could you try setting specific job ranges like “5-6” or “6-7”? Let me know if you encounter any issues!
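
For reference, narrowing the job range only changes the --jobs argument of the command above; all other flags stay the same (this is just the earlier command with --jobs 5-6 substituted in):

python experiments/needle_in_a_haystack/needle_test.py \
    --model_name gradientai/Llama-3-8B-Instruct-Gradient-1048k \
    --max_length 1000000 \
    --min_length 1000 \
    --rounds 5 \
    --attn_type minference \
    --kv_cache_cpu \
    --output_path ./needle \
    --run_name minference_LLaMA_1M \
    --jobs 5-6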
