
[Question]: The evaluation code of scbench does not match the provided dataset. #103

rainstorm12 opened this issue Dec 26, 2024 · 2 comments

rainstorm12 commented Dec 26, 2024

Describe the bug

When I tested the "scbench_kv" task provided by scbench, I encountered the following problem in the compute_scores.py file during evaluation.

[rank0]:   File "/myfile/MInference-main/scbench/compute_scores.py", line 365, in get_score_one
[rank0]:     assert task_name in NAME_TO_SCORE_GETTER, f"Invalid task name: {task_name}"
[rank0]: AssertionError: Invalid task name: scbench_kv

I found that the task names supported in the compute_scores.py file are as follows; they do not match the scbench task names.

def get_score_one(pred: str, label: str, task_name: str, model_name: str) -> float:
    """
    Computes the score for one prediction.
    Returns one float (zero and one for boolean values).
    """
    NAME_TO_SCORE_GETTER = {
        # Retrieve
        "kv_retrieval": get_score_one_kv_retrieval,
        "kv_retrieval_prefix": get_score_one_kv_retrieval,
        "kv_retrieval_both": get_score_one_kv_retrieval,
        "passkey": get_score_one_passkey,
        "number_string": get_score_one_number_string,
        # Code
        "code_run": get_score_one_code_run,
        "code_debug": get_score_one_code_debug,
        # Longbook
        "longdialogue_qa_eng": get_score_one_longdialogue_qa_eng,
        "longbook_qa_eng": get_score_one_longbook_qa_eng,
        "longbook_sum_eng": get_score_one_longbook_sum_eng,
        "longbook_choice_eng": get_score_one_longbook_choice_eng,
        "longbook_qa_chn": get_score_one_longbook_qa_chn,
        # Math
        "math_find": get_score_one_math_find,
        "math_calc": get_score_one_math_calc,
        # multi-turn native
        "multi_turn_summary": get_score_one_longbook_sum_eng,
        "multi_turn_vt": string_match_all,
        "multi_turn_many_shot": get_score_one_longdialogue_qa_eng,
        "multi_turn_kv_compressible": get_score_one_kv_retrieval,
    }
    assert task_name in NAME_TO_SCORE_GETTER, f"Invalid task name: {task_name}"
    score = NAME_TO_SCORE_GETTER[task_name](pred, label, model_name)
    return float(score)
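
As a minimal local workaround before updating, one could alias the dataset's scbench_* task names onto the scorer keys that the table above already defines. The mapping below is only a guess at the intended correspondence, not necessarily what the official fix does:

# Hypothetical alias table: map scbench_* task names from the dataset onto the
# existing scorer keys in NAME_TO_SCORE_GETTER. These pairs are assumptions for
# illustration, not the official mapping.
SCBENCH_TASK_ALIASES = {
    "scbench_kv": "kv_retrieval",
    "scbench_passkey": "passkey",
    "scbench_summary": "multi_turn_summary",
}

def get_score_one_aliased(pred: str, label: str, task_name: str, model_name: str) -> float:
    # Normalize the dataset task name to a known scorer key before dispatching.
    task_name = SCBENCH_TASK_ALIASES.get(task_name, task_name)
    return get_score_one(pred, label, task_name, model_name)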
rainstorm12 added the bug (Something isn't working) label on Dec 26, 2024
iofu728 self-assigned this on Dec 26, 2024
iofu728 added the question (Further information is requested) label and removed the bug label on Dec 26, 2024
iofu728 changed the title from "[Bug]: The evaluation code of scbench does not match the provided dataset." to "[Question]: The evaluation code of scbench does not match the provided dataset." on Dec 26, 2024

iofu728 (Contributor) commented Dec 26, 2024

Hi @rainstorm12, thank you for pointing out this issue.

We have already fixed it in #101.

Please fetch the updated code and let us know if you encounter any further problems!

git clone https://github.com/microsoft/MInference
pip install -e .

rainstorm12 (Author) commented

Thank you very much for your help! That solved my problem!
However, when I try the multi-task datasets, I encounter a new problem. My test.sh file is as follows:

python run_scbench.py \
    --task scbench_repoqa_and_kv \
    --model_name_or_path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --data_dir ./data \
    --output_dir ./results \
    --rewrite \
    --attn_type minference \
    --kv_type dense \
    --use_chat_template \
    --trust_remote_code

and the error is as follows:

==== Evaluation scbench_repoqa_and_kv====
# examples: 88
Num eval examples: -1
Verbose: False
Max new tokens: {'scbench_repoqa': 1024, 'scbench_kv': 80}
Num of turns: 5
0it [00:00, ?it/s]# tokens before: 67598
# tokens after: 67598
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/myfile/MInference/scbench/run_scbench.py", line 397, in <module>
    pred = get_pred(
  File "/myfile/MInference/scbench/run_scbench.py", line 125, in get_pred
    outputs = model.test(
  File "/myfile/MInference/scbench/eval_utils.py", line 1246, in test
    max_length_per_turn = max_length[example["task"][idx]]
KeyError: 'multi_turn_kv'

It looks like the key 'multi_turn_kv' is not present in the Max new tokens dict.
When I run the scbench_summary_with_needles task, the error is as follows (similar to the problem above):

==== Evaluation scbench_summary_with_needles====
# examples: 70
Num eval examples: -1
Verbose: False
Max new tokens: {'scbench_summary': 800, 'scbench_passkey': 15}
Num of turns: 5
0it [00:00, ?it/s]# tokens before: 98057
# tokens after: 97962
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/myfile/MInference/scbench/run_scbench.py", line 397, in <module>
    pred = get_pred(
  File "/myfile/MInference/scbench/run_scbench.py", line 125, in get_pred
    outputs = model.test(
  File "/myfile/MInference/scbench/eval_utils.py", line 1246, in test
    max_length_per_turn = max_length[example["task"][idx]]
KeyError: 'multi_turn_passkey'
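
Both tracebacks fail on the same lookup in eval_utils.py: the per-turn max_length dict is keyed by the scbench_* names shown under "Max new tokens", while example["task"] holds names such as multi_turn_kv and multi_turn_passkey. As a temporary local patch, that lookup could be made tolerant of either naming scheme; the alias table and default below are hypothetical names introduced only for illustration:

# Hypothetical sketch of a tolerant lookup for the failing line in eval_utils.py.
# TASK_NAME_ALIASES and DEFAULT_MAX_NEW_TOKENS are assumptions for illustration,
# not identifiers from the repository.
TASK_NAME_ALIASES = {
    "multi_turn_kv": "scbench_kv",
    "multi_turn_passkey": "scbench_passkey",
}
DEFAULT_MAX_NEW_TOKENS = 1024

def max_new_tokens_for(max_length: dict, task: str) -> int:
    # Try the raw task name first, then a known alias, then fall back to a default.
    if task in max_length:
        return max_length[task]
    alias = TASK_NAME_ALIASES.get(task)
    if alias is not None and alias in max_length:
        return max_length[alias]
    return DEFAULT_MAX_NEW_TOKENS

# The failing line would then become:
# max_length_per_turn = max_new_tokens_for(max_length, example["task"][idx])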
