[ReshardingV3] - forknet testing and follow ups #12552

Open
10 of 16 tasks
Tracked by #11881
wacban opened this issue Dec 3, 2024 · 3 comments
@wacban
Contributor

wacban commented Dec 3, 2024

Description

Run forknet

  • without any traffic
    • Fix errors in get_postponed_receipt_count_for_shard: shard_layout.shard_ids().any(|i| i == shard_id)
    • TrieQueueIndices assertion failing ref
  • with traffic after resharding
  • with single shard tracking
    • Fix verify_path failing ref
    • Fix index out of bounds ref
  • with traffic before and after resharding
  • with heavy traffic to trigger congestion
  • with shard shuffling
  • with RPC & archival nodes (no memtries)
  • with node restarts
  • with forks
  • with missing chunks & blocks
  • with decentralised state sync
  • with multiple reshardings
@Longarithm
Member

Longarithm commented Dec 4, 2024

My current setup:

alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id eshardnet"
NODE_BINARY_URL=https://storage.googleapis.com/logunov/neard-1203
mirror init-neard-runner --neard-binary-url $NODE_BINARY_URL
mirror new-test \
  --epoch-length 5500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id eshardnet \
  --gcs-state-sync \
  --yes
RUST_LOG="client=debug,chain=debug,mirror=debug,actix_web=warn,mio=warn,tokio_util=warn,actix_server=warn,actix_http=warn,resharding=debug,fork-network=info,metrics=trace,doomslug=trace,sync=debug,catchup=debug,info"
mirror --host-type nodes run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/log_config.json > tmp && mv tmp /home/ubuntu/.near/log_config.json"
mirror --host-type traffic run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/target/log_config.json > tmp && mv tmp /home/ubuntu/.near/target/log_config.json"
mirror update-config --set 'gc_num_epochs_to_keep=5'
mirror update-config --set 'p_produce_chunk=0.3'
mirror update-config --set 'resharding_config.batch_delay={"secs":0,"nanos":10000000}'
mirror start-nodes

Then wait for the protocol version to reach 74 and start traffic:
while true; do
  result=$(curl --silent http://34.13.138.46:3030/metrics | grep 'near_current_protocol_version')
  echo "$result"
  if [[ $result == *"74"* ]]; then
    sleep 10
    mirror --host-filter '.*([0-9A-Fa-f]{4}|traffic)$' start-traffic
    break
  fi
  sleep 1
done
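The loop above matches the substring "74" anywhere in the grep output, so it could also fire on an unrelated value such as 174. A minimal sketch of a stricter check follows. It assumes the metric is exposed as an unlabelled Prometheus line of the form `near_current_protocol_version <value>`. The helper names `protocol_version_from_metrics` and `protocol_version_reached` are hypothetical and not part of mirror.py.

```shell
# Hypothetical helper: print the value of near_current_protocol_version
# from Prometheus text on stdin. Comment lines (# HELP / # TYPE) have "#"
# as their first field, so they never match the exact-name comparison.
protocol_version_from_metrics() {
  awk '$1 == "near_current_protocol_version" && !seen { print $2; seen = 1 }'
}

# Hypothetical helper: succeed only if the reported version equals the
# target exactly (first argument), reading metrics text from stdin.
protocol_version_reached() {
  [ "$(protocol_version_from_metrics)" = "$1" ]
}
```

Under those assumptions, the condition in the loop could become `curl --silent http://34.13.138.46:3030/metrics | protocol_version_reached 74`.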

@Longarithm
Member

Longarithm commented Dec 6, 2024

Current status:
Resharding works with single shard tracking, but nodes crash in the following epochs.

Follow-ups:

Latest setup:
alias mirror="python3 tests/mocknet/mirror.py --chain-id mainnet --start-height 128293844 --unique-id hshardnet"
### SEPARATE COMMAND ###
NODE_BINARY_URL=https://storage.googleapis.com/logunov/neard-1206
mirror init-neard-runner --neard-binary-url $NODE_BINARY_URL
mirror new-test \
  --epoch-length 4500 \
  --genesis-protocol-version 73 \
  --num-validators 7 \
  --num-seats 7 \
  --stateless-setup \
  --new-chain-id hshardnet \
  --gcs-state-sync \
  --yes
RUST_LOG="client=debug,chain=debug,mirror=debug,actix_web=warn,mio=warn,tokio_util=warn,actix_server=warn,actix_http=warn,resharding=debug,fork-network=info,metrics=trace,doomslug=trace,indexer=info,info"
mirror --host-type nodes run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/log_config.json > tmp && mv tmp /home/ubuntu/.near/log_config.json"
mirror --host-type traffic run-cmd --cmd "jq '.opentelemetry = \"${RUST_LOG}\" | .rust_log = \"${RUST_LOG}\"' /home/ubuntu/.near/target/log_config.json > tmp && mv tmp /home/ubuntu/.near/target/log_config.json"
mirror --host-type nodes run-cmd --cmd 'for f in /home/ubuntu/.near/epoch_configs/73.json; do jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true" "$f" > tmp && mv tmp "$f"; done'
mirror --host-type traffic run-cmd --cmd 'for f in /home/ubuntu/.near/target/epoch_configs/73.json; do jq ".validator_selection_config.shuffle_shard_assignment_for_chunk_producers = true" "$f" > tmp && mv tmp "$f"; done'
mirror update-config --set 'p_produce_chunk=0.3'
mirror update-config --set 'resharding_config.batch_delay={"secs":0,"nanos":10000000}'
mirror start-nodes
mirror start-traffic
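For reference, the `resharding_config.batch_delay` value used above is given as a secs/nanos pair. A small sketch of the conversion to milliseconds, assuming the two fields simply add up to one duration (the helper name `duration_to_ms` is hypothetical):

```shell
# Hypothetical helper: convert a {"secs": S, "nanos": N} duration to whole
# milliseconds. Assumes the two fields are summed into a single duration.
duration_to_ms() {
  secs="$1"
  nanos="$2"
  echo $(( secs * 1000 + nanos / 1000000 ))
}

# {"secs":0,"nanos":10000000} from the setup above
duration_to_ms 0 10000000   # prints 10
```

Under that reading, the configured batch delay is 10 ms.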

@Longarithm
Member

Forknet survived 10 epochs with:

  • CurrentEpochStateSync pre-enabled, protocol upgraded to SimpleNightshadeV4, BandwidthScheduler not enabled at all
  • Only 30% of chunks are produced to test missing chunks behaviour aggressively
  • Default transaction rate, 30 tx/s
  • Shard shuffling, single shard tracking

https://near.zulipchat.com/#narrow/channel/407288-core.2Fresharding/topic/forknet/near/489974124
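To give a sense of how aggressive the missing-chunk setting is, here is a rough back-of-the-envelope sketch. It assumes the latest setup's `--epoch-length 4500`, one chunk opportunity per shard per block height, and that `p_produce_chunk=0.3` applies independently at each height. The helper name `expected_chunks_per_epoch` is hypothetical.

```shell
# Hypothetical helper: expected chunks per shard per epoch, computed as
# epoch_length * p_produce_chunk. The probability is passed as a whole
# percentage so the calculation stays in integer arithmetic.
expected_chunks_per_epoch() {
  epoch_length="$1"
  produce_percent="$2"
  echo $(( epoch_length * produce_percent / 100 ))
}

expected_chunks_per_epoch 4500 30   # prints 1350
```

Under these assumptions, each shard would see about 1350 chunks and about 3150 missing chunks per epoch.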
