Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deployment not adhering to update_config parallelism, stopping containers prematurely #3132

Open
ironhalik opened this issue Apr 27, 2023 · 2 comments

Comments

@ironhalik
Copy link

I have a service with following deployment config:

Service Mode:	Global
UpdateStatus:
 State:		updating
 Started:	About a minute ago
 Message:	update in progress
Placement:
UpdateConfig:
 Parallelism:	1
 On failure:	rollback
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first

The service runs in Global mode on three nodes and has a health check configured, which waits for an application server to respond. This usually takes about a minute.

    healthcheck:
      test: curl -sfA "healthcheck" 127.0.0.1/health
      start_period: 120s
      interval: 10s
      timeout: 3s
      retries: 3

Deployment is done using

docker stack deploy --prune --with-registry-auth -c ops/docker-compose.base.yml -c ${STACK_COMPOSE_FILE} ${STACK_NAME}

The problem is that during deployment, all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state. This, of course, results in downtime.

Here's the service state during deployment. I have three containers in a "Preparing" state, and the rest were shut down.

# docker service ls
ccfp2dk17dao   some_app                       global       0/3        registry.project.co/project/some_app/app:3d042a72

# docker service ps some_app
ID             NAME                                           IMAGE                                                 NODE           DESIRED STATE   CURRENT STATE              ERROR                       PORTS
9sb43fxulrul   some_app.kywz5ibig0bpibq39ngmle4in       registry.project.co/project/some_app/app:3d042a72   backend01   Running         Preparing 25 seconds ago
plvgnderyrft    \_ some_app.kywz5ibig0bpibq39ngmle4in   registry.project.co/project/some_app/app:1d6fef56   backend01   Shutdown        Shutdown 11 seconds ago
4q0ozinrnygr    \_ some_app.kywz5ibig0bpibq39ngmle4in   registry.project.co/project/some_app/app:192856c4   backend01   Shutdown        Shutdown 9 minutes ago
clihr34x9dbc   some_app.tw3924jw42wlrpaqwtcl94z00       registry.project.co/project/some_app/app:3d042a72   backend02   Running         Preparing 28 seconds ago
yxpzyheq00ye    \_ some_app.tw3924jw42wlrpaqwtcl94z00   registry.project.co/project/some_app/app:1d6fef56   backend02   Shutdown        Shutdown 23 seconds ago
44ldh3vfe7m8    \_ some_app.tw3924jw42wlrpaqwtcl94z00   registry.project.co/project/some_app/app:192856c4   backend02   Shutdown        Shutdown 10 minutes ago
os1r1sd92vq1   some_app.yv3n4ex7514cxigx1oyj0hka3       registry.project.co/project/some_app/app:3d042a72   backend03   Running         Preparing 58 seconds ago
ywctjs3psb2s    \_ some_app.yv3n4ex7514cxigx1oyj0hka3   registry.project.co/project/some_app/app:1d6fef56   backend03   Shutdown        Shutdown 54 seconds ago
9zf0vkeuuqny    \_ some_app.yv3n4ex7514cxigx1oyj0hka3   registry.project.co/project/some_app/app:192856c4   backend03   Shutdown        Shutdown 11 minutes ago

If I inspect one of the prematurely stopped containers, there's:

        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2023-04-27T17:01:10.930540738Z",
            "FinishedAt": "2023-04-27T17:12:22.720695291Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2023-04-27T19:11:39.812486279+02:00",
                        "End": "2023-04-27T19:11:39.961464877+02:00",
                        "ExitCode": 0,
                        "Output": "OK!"
                    },
                    [...]
                  ]

Interestingly enough, I have another service configured identically. Same health check, same deploy config, and on the same cluster. The only difference is that it's a different app and boots up a bit quicker. It deploys correctly.

@foertel
Copy link

foertel commented Jan 28, 2024

all the containers on all three nodes get stopped one after another without waiting for other/previous containers to reach a healthy state.

I would expect that to happen, when you configure your upgrade strategy to be "stop-first". (-:

@shawnritchie
Copy link

by any chance did you resolve this one? we are encountering the same issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants