Deployment not adhering to update_config parallelism, stopping containers prematurely #3132

ironhalik · 2023-04-27T17:25:00Z

I have a service with following deployment config:

Service Mode:	Global
UpdateStatus:
 State:		updating
 Started:	About a minute ago
 Message:	update in progress
Placement:
UpdateConfig:
 Parallelism:	1
 On failure:	rollback
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first

The service runs in Global mode on three nodes and has a health check configured, which waits for an application server to respond. This usually takes about a minute.

    healthcheck:
      test: curl -sfA "healthcheck" 127.0.0.1/health
      start_period: 120s
      interval: 10s
      timeout: 3s
      retries: 3

Deployment is done using

docker stack deploy --prune --with-registry-auth -c ops/docker-compose.base.yml -c ${STACK_COMPOSE_FILE} ${STACK_NAME}

The problem is that during deployment, all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state. This, of course, results in downtime.

Here's the service state during deployment. I have three containers in a "Preparing" state, and the rest were shut down.

# docker service ls
ccfp2dk17dao   some_app                       global       0/3        registry.project.co/project/some_app/app:3d042a72

# docker service ps some_app
ID             NAME                                           IMAGE                                                 NODE           DESIRED STATE   CURRENT STATE              ERROR                       PORTS
9sb43fxulrul   some_app.kywz5ibig0bpibq39ngmle4in       registry.project.co/project/some_app/app:3d042a72   backend01   Running         Preparing 25 seconds ago
plvgnderyrft    \_ some_app.kywz5ibig0bpibq39ngmle4in   registry.project.co/project/some_app/app:1d6fef56   backend01   Shutdown        Shutdown 11 seconds ago
4q0ozinrnygr    \_ some_app.kywz5ibig0bpibq39ngmle4in   registry.project.co/project/some_app/app:192856c4   backend01   Shutdown        Shutdown 9 minutes ago
clihr34x9dbc   some_app.tw3924jw42wlrpaqwtcl94z00       registry.project.co/project/some_app/app:3d042a72   backend02   Running         Preparing 28 seconds ago
yxpzyheq00ye    \_ some_app.tw3924jw42wlrpaqwtcl94z00   registry.project.co/project/some_app/app:1d6fef56   backend02   Shutdown        Shutdown 23 seconds ago
44ldh3vfe7m8    \_ some_app.tw3924jw42wlrpaqwtcl94z00   registry.project.co/project/some_app/app:192856c4   backend02   Shutdown        Shutdown 10 minutes ago
os1r1sd92vq1   some_app.yv3n4ex7514cxigx1oyj0hka3       registry.project.co/project/some_app/app:3d042a72   backend03   Running         Preparing 58 seconds ago
ywctjs3psb2s    \_ some_app.yv3n4ex7514cxigx1oyj0hka3   registry.project.co/project/some_app/app:1d6fef56   backend03   Shutdown        Shutdown 54 seconds ago
9zf0vkeuuqny    \_ some_app.yv3n4ex7514cxigx1oyj0hka3   registry.project.co/project/some_app/app:192856c4   backend03   Shutdown        Shutdown 11 minutes ago

If I inspect one of the prematurely stopped containers, there's:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2023-04-27T17:01:10.930540738Z",
            "FinishedAt": "2023-04-27T17:12:22.720695291Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2023-04-27T19:11:39.812486279+02:00",
                        "End": "2023-04-27T19:11:39.961464877+02:00",
                        "ExitCode": 0,
                        "Output": "OK!"
                    },
                    [...]
                  ]

Interestingly enough, I have another service configured identically. Same health check, same deploy config, and on the same cluster. The only difference is that it's a different app and boots up a bit quicker. It deploys correctly.

The text was updated successfully, but these errors were encountered:

foertel · 2024-01-28T22:08:47Z

all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state.

I would expect that to happen, when you configure your upgrade strategy to be "stop-first". (-:

shawnritchie · 2024-09-17T08:01:12Z

by any chance did you resolve this one? we are encountering the same issue?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deployment not adhering to update_config parallelism, stopping containers prematurely #3132

Deployment not adhering to update_config parallelism, stopping containers prematurely #3132

ironhalik commented Apr 27, 2023

foertel commented Jan 28, 2024

shawnritchie commented Sep 17, 2024

Deployment not adhering to update_config parallelism, stopping containers prematurely #3132

Deployment not adhering to update_config parallelism, stopping containers prematurely #3132

Comments

ironhalik commented Apr 27, 2023

foertel commented Jan 28, 2024

shawnritchie commented Sep 17, 2024