
CM script failed to run harness after docker done #1998

Open
Bob123Yang opened this issue Dec 24, 2024 · 2 comments

Comments

@Bob123Yang

Hi @arjunsuresh

I am running the ResNet50 benchmark with the command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=5000

The docker container was created successfully, but the run then failed at the harness step as shown below. How can I resolve it?
log with dock done.txt

make: *** [Makefile:45: run_harness] Error 1

CM error: Portable CM script failed (name = benchmark-program, return code = 512)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
cmuser@9951fc73ce5b:~$ pwd
/home/cmuser
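
For reference, the re-run that the message above asks for is just the original command with --repro appended (assuming the flag is accepted alongside the other cm run script options shown here):

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 --implementation=nvidia --framework=tensorrt \
   --category=edge --scenario=Offline --execution_mode=test \
   --device=cuda --docker --quiet --test_query_count=5000 --repro

The resulting cm-repro directory and the full log are what the note asks to be attached to the issue.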
@arjunsuresh (Contributor)

It looks like you have GPUs from different generations in the system. AFAIK this is not supported by the Nvidia implementation.

{GPU(name='NVIDIA RTX A6000', accelerator_type=<AcceleratorType.Discrete: AliasedName(name='Discrete', aliases=(), patterns=())>, vram=Memory(quantity=47.98828125, byte_suffix=<ByteSuffix.GiB: (1024, 3)>, _num_bytes=51527024640), max_power_limit=300.0, pci_id='0x223010DE', compute_sm=86): 1, GPU(name='NVIDIA T1000 8GB', accelerator_type=<AcceleratorType.Discrete: AliasedName(name='Discrete', aliases=(), patterns=())>, vram=Memory(quantity=8.0, byte_suffix=<ByteSuffix.GiB: (1024, 3)>, _num_bytes=8589934592), max_power_limit=50.0, pci_id='0x1FF010DE', compute_sm=75): 1})), numa_conf=None, system_id='Nvidia_9951fc73ce5b')
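
A quick way to confirm which GPUs the detection step sees (a minimal check independent of CM, assuming nvidia-smi is on the PATH inside and outside the container):

# List every GPU the driver exposes; for the Nvidia implementation all rows
# should report the same model/generation.
nvidia-smi --query-gpu=index,name,memory.total --format=csv

With both cards installed this would show one RTX A6000 row (compute SM 86) and one T1000 row (compute SM 75), matching the mixed-generation detection above.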

@Bob123Yang (Author)

Thank you @arjunsuresh.

That error no longer appears after I removed the T1000 and kept only the A6000 in the system.

But when I run the command below inside the docker container (the same container created by CM before the system reboot):

cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Another error occurred:
log_in_docker_resnet50_1.txt

Traceback (most recent call last):
  File "/home/cmuser/CM/repos/local/cache/1406981516ca4974/inference/vision/classification_and_detection/tools/accuracy-imagenet.py", line 89, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/1406981516ca4974/inference/vision/classification_and_detection/tools/accuracy-imagenet.py", line 54, in main
    with open(args.mlperf_accuracy_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/cmuser/CM/repos/local/cache/d54d1a74ced144d0/valid_results/9951fc73ce5b-nvidia_original-gpu-tensorrt-vdefault-default_config/resnet50/offline/accuracy/mlperf_log_accuracy.json'

CM error: Portable CM script failed (name = process-mlperf-accuracy, return code = 256)
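
The traceback shows the accuracy parser failing at open(), which suggests the harness accuracy pass never wrote mlperf_log_accuracy.json into the valid_results directory, so the downstream process-mlperf-accuracy step has nothing to read. A minimal, hypothetical sanity check (the path is copied verbatim from the traceback) that makes the missing-file cause explicit before the parser runs:

# Hypothetical check, not part of CM: verify the accuracy log exists before
# accuracy-imagenet.py is invoked.
ACC_LOG=/home/cmuser/CM/repos/local/cache/d54d1a74ced144d0/valid_results/9951fc73ce5b-nvidia_original-gpu-tensorrt-vdefault-default_config/resnet50/offline/accuracy/mlperf_log_accuracy.json
if [ -f "$ACC_LOG" ]; then
    echo "accuracy log found: $ACC_LOG"
else
    echo "accuracy log missing - the harness accuracy run did not produce it" >&2
fi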
