
[Question]: different eval results compared to the results in paper #99

Open
unicorneeee opened this issue Dec 19, 2024 · 9 comments

@unicorneeee

Describe the issue

Hello, I am using the following environment:
transformers 4.47.0
torch 2.3.0
triton 2.1.0
flash_attn 2.5.8
MInference from the support_tp branch
However, when I set attn_type="hf" to evaluate InfiniteBench, the results differ from those reported in the paper.
Self-evaluated results (using Llama-3-8B-Instruct-262k):
Llama-3-8B-Instruct-262k_hf,code_debug,24.62
Llama-3-8B-Instruct-262k_hf,math_find,18.00
Llama-3-8B-Instruct-262k_hf,longdialogue_qa_eng,0.50
Could you please share the environment requirements you used when testing InfiniteBench? Thank you!

unicorneeee added the question (Further information is requested) label on Dec 19, 2024
iofu728 self-assigned this on Dec 20, 2024
@iofu728 (Contributor) commented Dec 20, 2024

Hi @unicorneeee, thanks for your question.

Could you share your scripts with us?
In our experiments, we used "minference_with_dense" to avoid OOM issues in long-context scenarios. You can give that a try as well.

Let me know if it helps!
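
For reference, here is a minimal sketch of how this is typically wired up on the Python side, assuming the MInference(attn_type, model_name) entry point from the project README (the exact signature may differ on the support_tp branch):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # pick up the dtype recorded in config.json
    device_map="auto",
)

# "minference_with_dense" keeps attention dense but routes it through the
# FlashAttention-based kernel, which is what avoids OOM at ~160k-token contexts.
minference_patch = MInference("minference_with_dense", model_name)
model = minference_patch(model)

This is roughly what run_infinitebench.py sets up when --attn_type minference_with_dense is passed.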

@unicorneeee (Author)

I use run_infinitebench.sh to run InfiniteBench. The command is as follows, and I use run_infinitebench.py directly, without any changes:
CUDA_VISIBLE_DEVICES=4,5,6,7 python "$SCRIPT_DIR/run_infinitebench.py" \
    --task $task \
    --model_name_or_path gradientai/Llama-3-8B-Instruct-262k \
    --data_dir InfiniteBench/data/infiniteBench \
    --output_dir ./result_hf \
    --max_seq_length 160000 \
    --rewrite \
    --num_eval_examples -1 --topk 1 --starting_layer 0 --attn_type hf
Are any of the parameters different from yours?

@iofu728 (Contributor) commented Dec 20, 2024

Hi @unicorneeee, we use attn_type=minference_with_dense to avoid OOM issues. You can try it again and let us know if it works.

@unicorneeee (Author)

Thanks for your answer! I will try it again!

@unicorneeee (Author)

Hello!
When I evaluate Llama-3-8B-262k in float32, using minference_with_dense requires casting all of the Q, K, and V matrices to torch.bfloat16; otherwise it raises an error because flash_attn only supports bf16 or fp16. So in the forward code, just before the flash_attn call, I cast the matrices to bfloat16:
def dense(q, k, v, vertical_size=None, slash_size=None):
    # cast Q/K/V to bf16 because flash_attn only accepts fp16/bf16 inputs
    return flash_attn_func(
        q.transpose(1, 2).to(torch.bfloat16),
        k.transpose(1, 2).to(torch.bfloat16),
        v.transpose(1, 2).to(torch.bfloat16),
        0.0,
        softmax_scale=None,
        causal=q_len != 1,
    ).view(bsz, 1, q_len, self.head_dim)
Have you ever encountered this issue? Could it be the reason for the results differing from the paper?
Thank you!

@iofu728 (Contributor) commented Dec 23, 2024

Hi @unicorneeee,

I haven’t encountered this issue before; Llama-3 checkpoints are already bf16 by default: https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k/blob/main/config.json#L24

By the way, are you using flash_attn or our custom triton ops?
You might want to fix this issue and give it another try.
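
For anyone who hits the same dtype error: a quick check of what the checkpoint declares, plus loading in bf16 up front, avoids the cast inside dense(). A minimal sketch using standard transformers arguments (torch_dtype is a transformers option, nothing MInference-specific):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "gradientai/Llama-3-8B-Instruct-262k"  # or your local copy

config = AutoConfig.from_pretrained(model_path)
print(config.torch_dtype)  # the hub config records bfloat16; float32 here would explain the flash_attn error

# Loading with an explicit dtype keeps Q/K/V in bf16, so no cast is needed
# right before the flash_attn call.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)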

@unicorneeee (Author)

Thanks for your answer! The model evaluated in the paper is Llama-3-8B-262k, so why does this link point to the 1048k model? Also, I used the link in the paper to download the 262k model and found that it is fp32, but that link is unavailable now.

@iofu728 (Contributor) commented Dec 23, 2024

Hi @unicorneeee, apologies for the confusion—I just shared a random model as an example. You can refer to this configuration instead: https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k/blob/main/config.json#L24.

@unicorneeee (Author)

Hello! I have used the model downloaded from that link, and the results still differ from those in the paper (especially on the code_debug and longdialogue_qa_eng tasks):
llama3-8B-262K_minference_with_dense,code_debug,24.62
llama3-8B-262K_minference_with_dense,math_find,18.00
llama3-8B-262K_minference_with_dense,longdialogue_qa_eng,0.50
My script is as follows:
TASKS=("code_debug" "math_find" "longdialogue_qa_eng")

export TOKENIZERS_PARALLELISM=false
SCRIPT_DIR=$(dirname "$0")
for task in "${TASKS[@]}"; do
    echo $task
    CUDA_VISIBLE_DEVICES=4,5,6,7 python "$SCRIPT_DIR/run_infinitebench.py" \
        --task $task \
        --model_name_or_path models/llama3-8B-262K \
        --data_dir data \
        --output_dir ./result_hf \
        --max_seq_length 160000 \
        --rewrite \
        --num_eval_examples -1 --topk 1 --starting_layer 0 --attn_type minference_with_dense
done
