
CM script failed to run harness after docker done #1998

Open
Bob123Yang opened this issue Dec 24, 2024 · 2 comments

Comments

@Bob123Yang

Hi @arjunsuresh

I am running the ResNet50 benchmark with the command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda \
   --docker --quiet \
   --test_query_count=5000

The docker container was created successfully, but the run then failed at the harness step as shown below. How can I resolve it?
log with dock done.txt

make: *** [Makefile:45: run_harness] Error 1

CM error: Portable CM script failed (name = benchmark-program, return code = 512)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
cmuser@9951fc73ce5b:~$ pwd
/home/cmuser
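
For reference, the re-run that the message above asks for is just the original command with --repro appended (assuming the flag is accepted alongside the other cm run script options shown here):

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
   --model=resnet50 --implementation=nvidia --framework=tensorrt \
   --category=edge --scenario=Offline --execution_mode=test \
   --device=cuda --docker --quiet --test_query_count=5000 --repro

The resulting cm-repro directory and the full log are what the note asks to be attached to the issue.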
@arjunsuresh (Contributor)

It looks like you have GPUs from different generations in the system. AFAIK this is not supported by the Nvidia implementation.

{GPU(name='NVIDIA RTX A6000', accelerator_type=<AcceleratorType.Discrete: AliasedName(name='Discrete', aliases=(), patterns=())>, vram=Memory(quantity=47.98828125, byte_suffix=<ByteSuffix.GiB: (1024, 3)>, _num_bytes=51527024640), max_power_limit=300.0, pci_id='0x223010DE', compute_sm=86): 1, GPU(name='NVIDIA T1000 8GB', accelerator_type=<AcceleratorType.Discrete: AliasedName(name='Discrete', aliases=(), patterns=())>, vram=Memory(quantity=8.0, byte_suffix=<ByteSuffix.GiB: (1024, 3)>, _num_bytes=8589934592), max_power_limit=50.0, pci_id='0x1FF010DE', compute_sm=75): 1})), numa_conf=None, system_id='Nvidia_9951fc73ce5b')
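
A quick way to confirm which GPUs the detection step sees (a minimal check independent of CM, assuming nvidia-smi is on the PATH inside and outside the container):

# List every GPU the driver exposes; for the Nvidia implementation all rows
# should report the same model/generation.
nvidia-smi --query-gpu=index,name,memory.total --format=csv

With both cards installed this would show one RTX A6000 row (compute SM 86) and one T1000 row (compute SM 75), matching the mixed-generation detection above.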

@Bob123Yang (Author)

Thank you @arjunsuresh.

That error no longer appears after I removed the T1000 and kept only the A6000 in the system.

But when I run the command below inside the docker container (the same container created by CM before the system reboot):

cm run script --tags=run-mlperf,inference,_r4.1-dev \
   --model=resnet50 \
   --implementation=nvidia \
   --framework=tensorrt \
   --category=edge \
   --scenario=Offline \
   --execution_mode=valid \
   --device=cuda \
   --quiet

Another error occurred:
log_in_docker_resnet50_1.txt

Traceback (most recent call last):
  File "/home/cmuser/CM/repos/local/cache/1406981516ca4974/inference/vision/classification_and_detection/tools/accuracy-imagenet.py", line 89, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/1406981516ca4974/inference/vision/classification_and_detection/tools/accuracy-imagenet.py", line 54, in main
    with open(args.mlperf_accuracy_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/cmuser/CM/repos/local/cache/d54d1a74ced144d0/valid_results/9951fc73ce5b-nvidia_original-gpu-tensorrt-vdefault-default_config/resnet50/offline/accuracy/mlperf_log_accuracy.json'

CM error: Portable CM script failed (name = process-mlperf-accuracy, return code = 256)
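
The traceback shows the accuracy parser failing at open(), which suggests the harness accuracy pass never wrote mlperf_log_accuracy.json into the valid_results directory, so the downstream process-mlperf-accuracy step has nothing to read. A minimal, hypothetical sanity check (the path is copied verbatim from the traceback) that makes the missing-file cause explicit before the parser runs:

# Hypothetical check, not part of CM: verify the accuracy log exists before
# accuracy-imagenet.py is invoked.
ACC_LOG=/home/cmuser/CM/repos/local/cache/d54d1a74ced144d0/valid_results/9951fc73ce5b-nvidia_original-gpu-tensorrt-vdefault-default_config/resnet50/offline/accuracy/mlperf_log_accuracy.json
if [ -f "$ACC_LOG" ]; then
    echo "accuracy log found: $ACC_LOG"
else
    echo "accuracy log missing - the harness accuracy run did not produce it" >&2
fi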
