Describe the issue

Hi,
Thanks again for your help. I encountered an error while reproducing the needle_in_a_haystack results by running bash experiments/needle_in_a_haystack/run_needle.sh, and would appreciate any insights:
[ 1000 72357 143714 215071 286429 357786 429143 500500 571857
643214 714571 785929 857286 928643 1000000]
[ 286429 357786 429143 500500 571857 643214 714571 785929 857286
928643 1000000]
# ... (long log output omitted)
File "/home/far/MInference/minference/modules/minference_forward.py", line 656, in forward
part_o = self.gather_last_q_vertical_slash_topk_v4(part_q, part_k, part_v, head)
File "/home/far/MInference/minference/modules/minference_forward.py", line 463, in gather_last_q_vertical_slash_topk_v4
return fc(q, k, v, vertical_size, slash_size)
File "/home/far/MInference/minference/modules/minference_forward.py", line 383, in vertical_and_slash_kernel
slash = sum_all_diagonal_matrix(qk)[...,:-last_q + 1]
File "/home/far/MInference/minference/modules/minference_forward.py", line 103, in sum_all_diagonal_matrix
zero_mat = torch.zeros((b, h, n, n)).to(mat.device) # Zero matrix used for padding
File "/home/far/MInference/minference/modules/minference_forward.py", line 103, in sum_all_diagonal_matrix
zero_mat = torch.zeros((b, h, n, n)).to(mat.device) # Zero matrix used for padding
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
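(Side note: since CUDA errors are reported asynchronously, the frame shown above may not be the one that actually faulted. A minimal debugging sketch, not MInference code, for forcing synchronous kernel launches so the trace lands on the real offending call, assuming a CUDA-capable machine:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # imported after the env var so every kernel launch is synchronous

x = torch.randn(4, 4, device="cuda")  # a small GPU workload stands in for the real run here
y = x @ x
torch.cuda.synchronize()              # surface any pending errors at a known point
print(y.sum().item())

Running the failing job with this setting, or exporting CUDA_LAUNCH_BLOCKING=1 in the shell before launching the script, usually moves the error to the kernel that actually faulted rather than a later call such as the torch.zeros above.)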
I noticed the error only occurs when starting from job 4 with the --kv_cache_cpu argument. Jobs in the range [0-4) work fine. Any suggestions on this?
Additionally, I found that the vllm module is required to run the needle_in_a_haystack experiment. In my opinion, vllm isn't necessary for MInference. Is there a specific reason for this, or is there something I might have missed?
Looking forward to your response!
It doesn't seem to be related to vLLM. It might be due to GPU memory not being fully reclaimed yet. Could you try running the Python command separately or upgrading Triton?
For the NIAH experiments, we used a single A100 GPU with 216 GB of CPU memory for inputs up to 800K tokens, while 900K and 1M tokens were tested on a single A100 GPU with 1 TB of CPU memory.
Could you try setting specific job ranges like “5-6” or “6-7”? Let me know if you encounter any issues!
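For reference, a hedged sketch of how a job range presumably maps onto the context-length grid printed in the log above. The exact flag name and slicing logic live in run_needle.sh and the needle script, so the names and the half-open range semantics here are assumptions reconstructed from the two arrays in the log:

import numpy as np

# Hypothetical reconstruction: 15 evenly spaced context lengths from 1K to 1M,
# rounded to integers, reproduces the first array in the log.
context_lengths = np.round(np.linspace(1000, 1_000_000, num=15)).astype(int)
print(context_lengths)

# Starting from job 4 keeps only the suffix of the grid, which matches the
# second array in the log (286429 ... 1000000). A range like "5-6" would
# presumably run a single context length in isolation.
start_job, end_job = 4, 15
print(context_lengths[start_job:end_job])

Running one job at a time this way keeps each length in its own process, so GPU memory from a previous length cannot carry over into the next run.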