Temporal kills #169

Conversation

eric-corumdigital
Contributor

Resolves #168

Instead of starting the asynchronous effect immediately and scheduling the processing of its result, the start of the asynchronous effect is now scheduled and the result is processed immediately.

Before, a Fiber could be PENDING even though its result had already been received, because it was waiting on the scheduler. Now, if a Fiber is PENDING, it is guaranteed that the result has not yet been received.

Before, when a Fiber was killed in PENDING status, it was possible that the result had already been received. This meant that a kill which happened after the effect completed would retroactively kill the Fiber: a temporal inconsistency. Now, when a Fiber is killed in PENDING status, it is necessarily the case that the kill occurred before the result was received.
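
As an illustration, here is a minimal sketch (my own construction, not a test from this PR) of the situation being tightened up: a forked computation that resolves its callback synchronously, followed by a kill from the parent fiber.

import Prelude

import Data.Either (Either(..))
import Effect.Aff (Aff, forkAff, joinFiber, killFiber, makeAff, nonCanceler, try)
import Effect.Class (liftEffect)
import Effect.Console as Console
import Effect.Exception (error, message)

example :: Aff Unit
example = do
  -- The result is delivered synchronously, before a canceler is even returned.
  fiber <- forkAff $ makeAff \k -> do
    k (Right "done")
    pure nonCanceler
  -- Whether the kill or the forked makeAff runs first is up to the scheduler;
  -- the point of this change is that if the kill wins, it is because the
  -- result had genuinely not been received yet, never retroactively after the
  -- callback has fired.
  killFiber (error "Nope.") fiber
  outcome <- try (joinFiber fiber)
  liftEffect $ Console.log $ case outcome of
    Left err -> "killed or failed: " <> message err   -- "Nope." if the kill won
    Right s -> "completed: " <> s                      -- "done" if the result won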

@eric-corumdigital
Contributor Author

The failing test case is this:

test_regression_kill_sync_async ∷ Aff Unit
test_regression_kill_sync_async = assert "regression/kill-sync-async" do
  ref ← newRef ""
  f1 ← forkAff $ makeAff \k -> k (Left (error "Boom.")) *> mempty
  killFiber (error "Nope.") f1
  pure true

Let me explain why I think this test is nonsense.

First, we must agree that there is no guarantee as to how actions from two different fibers are to be interleaved by the scheduler. They fall as they may.

Second, we must agree that there are two fibers at play in this test. One fiber comes from forkAff, and the other fiber is the one in which forkAff was evaluated.

Therefore, there is no guarantee as to whether the killFiber action occurs before or after the makeAff action. Either the error "Boom." or the error "Nope." is a possible and valid outcome.

@natefaubion
Collaborator

I agree that the test is vague. I had to go back to the original ticket to see what it was guarding against (#156). The original issue has to do with re-entry in the case of killing a fiber that is in the scheduler, which shouldn't be an issue. So maybe the question with this test, then, is not necessarily which error takes priority, but whether the off-stack exception is appropriate. There are no semantics for off-stack reporting; it's just meant to be a last-ditch reporting mechanism, so I'm inclined to replace the test with something else.

// It's possible to interrupt the fiber between enqueuing and
// resuming, so we need to check that the runTick is still
// valid.
var asyncAction = step._1;
Collaborator

The tmp register is for stuff like this. I'd prefer to not introduce a var in the middle of a switch.

if (runTick !== localRunTick) {
return;
}
++runTick;
Collaborator

I don't think this ++runTick is necessary. There's no reentry issue here because we aren't dealing with a user-invoked callback yet.

// This is initialized here (rather than in the Fiber constructor) because
// otherwise, in V8, this Error object indefinitely hangs on to memory that
// otherwise would be garbage collected.
var early = new Error("[ParAff] Early exit");
Collaborator

Good find. Any thoughts as to why this happens? Just curious.

Collaborator

I would guess it has something to do with stack frames and traces or whatever.

Contributor Author

<symbol> in Error is what is hanging onto the memory, whatever that is, and I do not know why Error is being referenced from the root. I figure it is some special debugging sauce in V8 being quirky.

return;
}
var issync = true;
var resolved = false;
Collaborator

I don't think two separate pieces of state are necessary for this check. This pattern is done in the sequential code several times.
https://github.com/slamdata/purescript-aff/blob/202a3eac12a207b6f109c9708f3efb034791bd36/src/Effect/Aff.js#L776-L795

Contributor Author

Thanks. Just so you know, I am doing this so that run(runTick) is a tail call.

@natefaubion
Collaborator

@eric-corumdigital You had some concerns about leaks with this PR? Aside from the failing test, which I can address, was there anything about this PR that concerned you?

@felixSchl

felixSchl commented Feb 25, 2019

I feel compelled to mention, even if it's just anecdotal, that this pull request causes spurious off-stack errors during cancellation in a fairly large and complex project we're developing. Even though I don't have a minimal repro case, here's what I witnessed in one of my fibers that is being killed: in a bracket, I allocate an AVar v and create a fiber f that forever takes from v. In the cleanup handler of the bracket I only kill f. That results in an uncaught off-stack error (under cancellation). Maybe that gives you some clues about missed edge cases. I am happy to keep trying new iterations.

@erisco

erisco commented Feb 28, 2019

@natefaubion The only lingering concern is that if Error objects are instantiated in an Aff computation, there is the potential for them to indefinitely retain memory on V8. I have moved all Error instantiations within the Aff implementation so as to not cause a problem. However, any user of Aff can instantiate Error themselves and cause a memory leak. Why this happens is beyond me. Even in the best case of the V8 developers immediately correcting this issue, that would still leave millions of unpatched V8 instances which PureScript will still have to run on.

@felixSchl What is an "off-stack error"? Also, is your program this?

bracket
  (AVar.empty >>= forkAff <<< forever <<< AVar.take)
  (killFiber (error ""))
  (\_ → pure unit)

@natefaubion By the way, that is an example of how error may be commonly used in a dangerous way.

@natefaubion
Collaborator

In trying to write a test for this, I have a scenario where this fix might fail.

  • I have a queue of Aff callbacks (like an AVar), one for fiber f1, and one for fiber f2.
  • When f1 resumes it immediately kills f2.
  • When f2 resumes it enters a completed state.
  • The queue synchronously invokes the callback for f1, followed by f2.

What state should f2 be in? f1 invokes killFiber first, but because killFiber is async, it gets preemptively scheduled, which may result in its effects being deferred (this depends on whether a scheduler loop is already running). f2 then gets invoked and is tagged as completed. The scheduler then runs the effects in f1 to kill the fiber, but f2 is already tagged as completed.
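
A minimal sketch of this scenario (my own construction, not a test from this PR), using a hypothetical callback queue built on Effect.Ref; the wait and flush helpers are illustrative, and the zero-millisecond delay is only there to make sure both fibers have parked before the queue is flushed:

import Prelude

import Data.Either (Either(..))
import Data.Foldable (for_, sequence_)
import Data.Maybe (Maybe(..))
import Data.Time.Duration (Milliseconds(..))
import Effect (Effect)
import Effect.Aff (Aff, Fiber, delay, forkAff, killFiber, launchAff_, makeAff, nonCanceler)
import Effect.Class (liftEffect)
import Effect.Exception (error)
import Effect.Ref as Ref

-- Park the calling fiber by pushing its resume callback onto the queue.
wait :: Ref.Ref (Array (Effect Unit)) -> Aff Unit
wait queue = makeAff \k -> do
  Ref.modify_ (_ <> [ k (Right unit) ]) queue
  pure nonCanceler

-- Invoke every queued callback synchronously, in registration order.
flush :: Ref.Ref (Array (Effect Unit)) -> Effect Unit
flush queue = do
  callbacks <- Ref.read queue
  Ref.write [] queue
  sequence_ callbacks

main :: Effect Unit
main = do
  queue <- Ref.new []
  f2Ref <- Ref.new (Nothing :: Maybe (Fiber Unit))
  launchAff_ do
    -- f1 parks first, so flush resumes it first; on resumption it kills f2.
    f1 <- forkAff do
      wait queue
      mf2 <- liftEffect (Ref.read f2Ref)
      for_ mf2 (killFiber (error "killed by f1"))
    -- f2 parks second; once resumed it simply completes.
    f2 <- forkAff (wait queue)
    liftEffect (Ref.write (Just f2) f2Ref)
    -- Let both fibers park, then resume f1 followed by f2 synchronously.
    -- Should f2 end up killed or completed?
    delay (Milliseconds 0.0)
    liftEffect (flush queue)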

I would think this is just another (harder to trigger) temporal violation.

@natefaubion
Collaborator

Maybe killFiber needs to be an opaque primitive so that it isn't "just" a makeAff, and it can avoid the async scheduling situation.

@eric-corumdigital
Contributor Author

eric-corumdigital commented Apr 1, 2019

@natefaubion What is missing is a semantics for Aff. There is no cohesive way to answer this question yet.

However, short of that, I think what is missing from your scenario is exactly how these Fibers come about. Effects of multiple Fibers have an undefined interleaving (though locking mechanisms can add some constraints). Therefore, there is no defined time ordering either. The current method of execution is to choose an ordering arbitrarily.

In other words, the state of f2 is undefined, or can be defined nondeterministically as either killed or completed. The current implementation may always choose one over the other but there is no prescription.

This scenario is different from the temporal consistency I am trying to address. Within a single Fiber, effects happen one after the other, and one of those effects can be a kill. This is not a concern between Fibers; it is only a concern within a single Fiber. The problem is that a kill issued after a Fiber reaches a success state should not retroactively kill it, especially because the effects that brought the Fiber to that success state have already happened.

Note that the point at which a Fiber receives a kill from another Fiber is undefined, and that is not an issue, nor anything I am trying to address here.

@felixSchl

With respect to my previous comments, we have identified and fixed the issues in our codebase, where we were implicitly relying on undefined behaviour in Aff. We've been running with this patch applied for a while now and have had no further issues. So for what it's worth, it has my blessing.

@natefaubion
Collaborator

That's good to hear @felixSchl. I would like to provide alternative APIs in purescript-bus, since it seems to be a common offender. I will very likely merge this as a breaking change at some point after that.

@eric-corumdigital
Contributor Author

@felixSchl Good to hear. Have you run on V8? Are you instantiating Error objects repeatedly over time? I have replicated this issue reliably on three different V8 versions (two Chrome versions and a Node.js version) across two different machines. It would be interesting if it were not happening to you, and I'd like to see your program.

I am on this issue again and may soon have to isolate the exact circumstances that cause a problem in V8. That is the only way it is practically possible to devise a workaround, if one exists.

@natefaubion
Collaborator

I don't think that error objects leaking memory has anything to do with the specifics of this PR, would it?

@felixSchl

By the way, the tests are failing with this change: https://travis-ci.com/purescript-contrib/purescript-aff/builds/111644315#L1934.

@natefaubion
Collaborator

Yes, as discussed above, there is a test that currently relies on undefined behavior in order to pass.

@eric-corumdigital
Contributor Author

I don't think that error objects leaking memory has anything to do with the specifics of this PR, would it?

Possibly my investigation was flawed. It may have been that Error objects were simply large objects in comparison to all else, and so a memory leak of a different cause was simply most obvious when many Error objects were instantiated.

I have been instantiating Error objects within Aff computations for a couple of months now and have not observed memory leak problems. What I did learn is that I had an incorrect and incomplete understanding of supervisor contexts, and that was a source of memory leaking. Possibly that is the cause of all the memory leaks I have seen with respect to this patch.

I now think this patch is not a memory leak risk. Also @felixSchl has evidently been using it without issue, so that makes two of us.

eric-corumdigital mentioned this pull request Aug 3, 2019
safareli mentioned this pull request Sep 25, 2019
@safareli
Contributor

safareli commented Sep 25, 2019

It's a bit hard to understand what the implications of this PR are. It would be great if there were some examples outlining the issues with the current behavior and the implications of this change. If there were a way to even encode that as a test, that would be even better.


Also, if this #168 is related to #166, can you please add more details on how?

@natefaubion
Collaborator

natefaubion commented Sep 26, 2019

@safareli The implications of this PR are that this potentially changes the order of fiber scheduling, so if you have written code that depends on a specific ordering of interleaved fiber effects, you will likely have issues. I don't think there's really a way to write it as a test, because I don't consider scheduling strategy (how multiple concurrent fiber effects are interleaved) to be part of Aff's guarantees. That is, Aff the implementation must be able to freely choose a scheduling strategy. The problem is that in a single-threaded runtime, it's extremely easy and tempting to rely on an observable ordering since it's deterministic.

The simplest example is:

example = do
  fiber <- forkAff blork
  killFiber (error "Boom") fiber

Can we reason about the effects here and the state of fiber? You might say "it depends on what blork does", in which case you would fall into the trap of relying on observable ordering. The answer is that we've given up control of ordering to the runtime by using forkAff, and the runtime can choose to run this now, in the next tick, or (potentially, but hopefully not) never depending on the strategy chosen.

So what actually happens now? Well it depends on what blork does! The current evaluation strategy is basically:

  • Eagerly evaluate all effects up to an async barrier (where an async barrier means we invoke the async effect)
  • Upon resolution of an async effect (invoking the callback), schedule evaluation to continue not now but some time in the current tick.

If blork is defined only in terms of pure operations and liftEffect, then the state of fiber will be that it has either failed or completed.

If blork is defined in terms of makeAff, then you can observe that it initiates the async effects of blork, and may initiate the canceler. Why "may"? What if the execution of blork results in the operational equivalent of makeAff \k -> k (Right unit) $> doesntMatter? That is, the callback is invoked before yielding a canceler. This can happen for all sorts of reasons in practice. With the current strategy, the continuation is put into a queue to run sometime in the current tick. This means there is a window where a fiber has "resolved" (the continuation has been invoked), but has not transitioned its internal state to completed.
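
For concreteness, here is a sketch (my assumption of what such a blork could look like, not code from this thread) of an async action whose callback fires before its canceler is returned:

import Prelude

import Data.Either (Either(..))
import Effect.Aff (Aff, makeAff, nonCanceler)

-- A synchronously-resolving async action: the callback is invoked before
-- the canceler is handed back to the runtime.
blork :: Aff Unit
blork = makeAff \k -> do
  k (Right unit)
  pure nonCanceler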

So in this case, with the above example:

  • blork is initiated, and the completion callback is immediately invoked.
  • The continuation is scheduled.
  • Evaluation yields to the parent fiber, and the fiber is killed.

So should the final state of the fiber be completed or killed? The callback was invoked first, so you might expect completed, but because the continuation was scheduled and the kill was not, the final state is killed. This is a problem in that generalBracket lets you observe the state of the inner block (completed, failed, killed). If we wrapped blork in generalBracket (as sketched after the list below), we could observe that we invoked the completion callback, but the final state would nonetheless be killed. This is potentially unintuitive because it makes it difficult to reason about the robustness of various abstractions. This effectively means you can kill "between binds", and there is no way to atomically transition between fiber states. This is a question about Aff semantics, and I don't have a specific answer, unfortunately. But there are a lot of options:

  1. Transitions between binds are always atomic. That is, if the effect in one bind completes, we atomically transition to the next bind. Is this portable?
  2. Transitions between binds are not atomic. This makes it difficult to build some abstractions, but is it fundamentally impossible?
  3. Maybe certain operations offer atomic transitions, such as through bracket, or some other mask primitive.
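
For reference, here is a sketch (my own construction, reusing the hypothetical blork from the sketch above) of the generalBracket observation described before this list; which line gets logged depends on the scheduling strategy in effect:

import Prelude

import Effect.Aff (Aff, forkAff, generalBracket, killFiber)
import Effect.Class (liftEffect)
import Effect.Console as Console
import Effect.Exception (error)

-- Observe the final state of the inner block: acquisition is trivial, the
-- use block is the synchronously-resolving blork, and each condition logs
-- which branch actually ran.
observe :: Aff Unit
observe = do
  fiber <- forkAff $ generalBracket (pure unit)
    { killed: \_ _ -> liftEffect (Console.log "observed: killed")
    , failed: \_ _ -> liftEffect (Console.log "observed: failed")
    , completed: \_ _ -> liftEffect (Console.log "observed: completed")
    }
    (\_ -> blork)
  killFiber (error "Boom") fiber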

This PR effectively does 1. So, say we want binds to always transition atomically. For the implementation as is, the options are:

  1. Async resolution does not schedule. An async action that immediately invokes its callback is observably equivalent to a liftEffect. This is what old Aff did, and the problem here is that a fiber can easily starve the system. If we wanted to introduce fairness, then we would have to expose a yield operation for authors to manually say "it's ok to run other stuff now", and it would be the author's responsibility to put these in the correct place.
  2. Everything always schedules, including liftEffect. This means you can interleave synchronous effects, and liftEffect (a *> b) = liftEffect a *> liftEffect b is not something that observationally holds. Note this pseudo-law is not documented, and is not portable.
  3. Schedule async actions before initiating the effect, and make resolution synchronous. This keeps the same kind of fairness as it exists now, but we just do it in a different place. This eliminates the weird temporary scheduled-but-not-completed state we have now.

I am not considering options that do things like "just schedule killFiber" because it isn't a general solution to fairness and I'm not confident that it wouldn't just break in another way. This PR implements 3, which is a minor change to the actual implementation. All of these solutions change the scheduling strategy, so any code that relies on observable interleaving of effects (such as using bus with the read-fork-loop pattern and not expecting messages to drop, or expecting to receive them or run in a specific order) will probably break.

This solution is not the only solution of course, so I'm happy to hear other thoughts.

@safareli
Contributor

safareli commented Sep 26, 2019

A couple of days ago I was looking at purescript-contrib/purescript-concurrent-queues#4 and purescript-contrib/purescript-aff-bus#17 (comment). I was thinking about them and came up with this module: Effect.Aff.Finally.

This is how it can be used with the AVarLock from #166 (comment):

newtype AVarLock a = AVarLock (AVar a)

modify :: forall a b. AVarLock a -> (a -> Aff (Tuple a b)) -> Aff b
modify (AVarLock var) f = F.runFinallyM do
  a <- AVar.take var # F.finallyAff do
    -- The onResult handler will be executed if take succeeds, even on the
    -- current version of Aff.
    F.onResult (flip AVar.put var)
  res <- f a # F.finallyAff do
    -- If `f` succeeds then we put back the `fst` of its result.
    F.onResult (fst >>> flip AVar.put var)
    -- If it fails then we put back the initial value.
    F.onError (const $ AVar.put a var)
  pure $ snd res

I tried to use it with Bus too; you can take a look at it here.

This approach might be considered a bit hacky, but I think it provides a somewhat nice workaround for #166 and #168:

  • It's possible to nest FinallyM computations.
  • There is no "unkillable acquisition by default" issue; it's opt-in using invincible.
  • All of this works with the current version of Aff; the changes in this PR are not even needed.
  • It's just a library solution, without any internals of Aff leaking (regular PS code and public functions are used).

Would love to see what you think on this!

If folks find this useful I will publish it as a lib, and we can continue the discussion on API specifics once there are more tests and we are sure it works 100%.

@safareli
Contributor

safareli commented Sep 26, 2019

@eric-corumdigital @erisco I would especially be interested in your thoughts on this:

-- | As acquisition can fail, `killed` and `failed` accept `Maybe a`
-- | instead of `a` like regular BracketConditions.
type BracketConditions' a b =
  { killed :: Error -> Maybe a -> Aff Unit
  , failed :: Error -> Maybe a -> Aff Unit
  , completed :: b -> a -> Aff Unit
  }

-- | Like normal generalBracket but the acquisition computation
-- | can be killed or it can fail.
generalBracket' :: forall a b. Aff a -> BracketConditions' a b -> (a -> Aff b) -> Aff b
generalBracket' acquisition b computation = runFinallyM do
  handle <- acquisition # finallyAff do
    {current, final} <- ask
    lift case current of
      Resolved (Left err) -> b.failed err Nothing
      Resolved (Right handle) -> final # runExists case _ of
        Resolved (Left err) -> b.failed err (Just handle)
        Resolved (Right res) ->
          -- Note this can be written without `unsafeCoerce`
          -- it's just easier like this as proof of concept.
          b.completed (unsafeCoerce res) handle
        Killed err -> b.killed err (Just handle)
      Killed err -> b.killed err Nothing
  lift $ computation handle

Using that, here is how the AVarLock implementation looks:

newtype AVarLock a = AVarLock (AVar a)

modify :: forall a b. AVarLock a -> (a -> Aff (Tuple a b)) -> Aff b
modify (AVarLock var) f = snd <$> generalBracket'
  (AVar.take var)
  { killed: \_ mbVal -> for_ mbVal \val -> AVar.put val var
  , failed: \_ mbVal -> for_ mbVal \val -> AVar.put val var
  , completed: \res _ -> AVar.put (fst res) var
  }
  f

@eric-corumdigital
Contributor Author

I suspect this is never going to be merged and that is okay, so I am closing the PR. Obviously the commit history is a disaster and so would need to be cleaned up, should anyone want to see this back again.

What I have learned is that this patch is myopic. It cares about a level of detail that is missing in other ways and places. I think Aff should continue to be the fast and loose and mostly working creation that it is. At least, I have failed to retrofit it in a backwards compatible and elegant manner.

My plan is to create a new library that learns from Aff but isn't bounded by it.
