-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race condition around supervisor's Commander #291
base: main
Are you sure you want to change the base?
Fix race condition around supervisor's Commander #291
Conversation
Signed-off-by: Paschalis Tsilias <[email protected]>
@@ -102,21 +102,21 @@ func (c *Commander) Pid() int { | |||
|
|||
// ExitCode returns Agent process exit code if it exited or 0 if it is not. | |||
func (c *Commander) ExitCode() int { | |||
if c.cmd == nil || c.cmd.ProcessState == nil { | |||
if c.IsRunning() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I've actually flipped the condition from what it was before, as per the method's docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this potentially leave the agent process running in some cases? The NewSupervisor method starts a goroutine runAgentProcess
to launch the agent process. There's a possibility that Shutdown
might be called before c.cmd.Start()
finishes. If c.cmd.Start()
completes successfully, the process would keep running, which isn't expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, let me think about this (and if the previous code prevented that from happening). We could potentially store the running value a little earlier to avoid that IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there, sorry for taking so long to get back to you. As far as I can tell from the code, this was already the case and it isn't a regression from this PR.
To avoid this, we'd make it the caller's responsibility to call only call Stop() after IsRunning() returns true (for example having a backoff mechanism for properly cleaning up resources), or add more explicit synchronization like we do in clientcommon
opamp-go/client/internal/clientcommon.go
Lines 53 to 55 in ed38d5f
// True when stopping is in progress. | |
isStoppingFlag bool | |
isStoppingMutex sync.RWMutex |
IMO this is a separate issue than the race condition the one's fixed here, so tackling the cleanup of resoueces might be a new discussion. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think it was possible to leave the process running in the background because we are checking if c.cmd
is non-nil (which is the source of the race condition) instead of running status. I think the Stop
should be updated to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay.
Here is the case I think this still doesn't address
Start()
is called- The process begins starting up but hasn't yet reached the point where
c.running.Store(true)
is executed Shutdown()
is called during this window- Since
!c.IsRunning()
evaluates to true at this point,Shutdown()
returns immediately - This could result in the process continuing to run
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I've pushed another commit in the Stop method to set the isStoppingFlag immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, let me try to get this right.
The commander's Start is called only in two cases
a) In the Supervisor's startAgent.
b) During a restart (not pertinent to our discussion)
The startAgent method is only called via the runAgentProcess; in turn, this is only called as a new goroutine in the NewSupervisor function.
In the Shutdown method, the Commander's Stop sets the isStopping flag.
So let's say that NewSupervisor starts, and the new goroutine is launched.
Immediately, the commander's Shutdown method is called, to call the commander's Stop and set the isStoppingFlag. When the call stack gets to the commander's Start it will immediately exit with the IsStopping method without having a chance to launch any process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @srikanthccv I've put out a test in f135e3b to try and verify the behavior we want.
I couldn't find any more elegant way than the ugly DelayLogger, but I'd like to see what you think. (This test fails in main, without my fix here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So let's say that NewSupervisor starts, and the new goroutine is launched.
Immediately, the commander's Shutdown method is called, to call the commander's Stop and set the isStoppingFlag. When the call stack gets to the commander's Start it will immediately exit with the IsStopping method without having a chance to launch any process.
Take this case
NewSupervisor starts, and the goroutine runAgentProcess
is launched, -> startAgent
-> s.commander.Start
-> the execution reaches c.cmd.Start
if err := c.cmd.Start(); err != nil { |
And say cmd.Start
didn't return yet
Now, assume the commander's Shutdown method is called (c.cmd.Start
still didn't return) so c.running
is false. In this flow isStoppingFlag
is not helping because we would still return as IsRunning
evaluates to true.
I think if we add a waitgroup before the c.cmd.Start
and wait in shutdown should solve the issue.
Signed-off-by: Paschalis Tsilias <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #291 +/- ##
==========================================
+ Coverage 76.32% 77.06% +0.74%
==========================================
Files 25 25
Lines 1681 1696 +15
==========================================
+ Hits 1283 1307 +24
+ Misses 291 286 -5
+ Partials 107 103 -4 ☔ View full report in Codecov by Sentry. |
@tpaschalis @srikanthccv is this still in progress? |
Signed-off-by: Paschalis Tsilias <[email protected]>
Srikanth had suggested using more explicit synchronization to avoid potentially leaving a dangling process; if the Shutdown method of the supervisor is called, then the I'm looking forward to feedback! |
Signed-off-by: Paschalis Tsilias <[email protected]>
Signed-off-by: Paschalis Tsilias <[email protected]>
@tpaschalis @srikanthccv I assume you are working on this PR. Please ping me when it is ready. |
From what I'd see under normal circumstances (and also the example), starting up the supervisor which also launches the commander would happen in a separate goroutine than the one that inspects its state or calls Stop.
The current implementation has a race condition on the
cmd *exec.Cmd
field, using it both for figuring out whether the commander is running and also actually executing a new Agent process, which could potentially lead to a panic.This PR fixes a small race condition around the supervisor's Commander by utilizing an existing
running
field and turning it into aatomic.Bool
to store the commander's state.Fixes #290