Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios #50440

Closed
1 task done
Dhugal opened this issue Aug 30, 2023 · 5 comments
Closed
1 task done
Labels
area-dataprotection Includes: DataProtection

Comments

@Dhugal
Copy link

Dhugal commented Aug 30, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

The asp.net core data protection API's appear to fail intermittently in some load-balanced scenarios.
Specifically key generation across instances does not involve locking so keys generated on different instances are not always shared correctly.

In our case, we're using the asp.net core data protection APIs in a load-balanced environment (Azure App Service with scaled-out instances running asp.net core 7 app).

Occasionally we're hitting an issue with multiple instances starting simultaneously and doing the following:

  1. Read the existing Keys - there may be none or the existing keys may have expired.
  2. Each instance creates a new key and adds it to the store - and then reads the store again.
  3. The keys added by the latter instances are not available on earlier ones.

As we're using these for auth cookies, this means that authenticated users hitting some instances are not validated and therefore have to log back into the application.

The following shows an example what I believe is happening in a simple scenario of 2 instances:

Instance A Instance B
Read Keys (No Valid Keys Found)
Read Keys (No Valid Keys Found)
Create Key A
Read Keys
Create Key B
Read Keys
Key A loaded Keys A & B loaded

This means a user authenticated on Instance B may get issued a cookie using Key B and their subsequent requests to Instance A (via a load balancer) fail authentication (but requests to B work).

We're trying to mitigate this by implementing a cross-instance lock (in our case we use a DistrubutedLock, backed by SQL Server).

The problem is that we don't appear to have a good place to lock the process of reading and writing the new key, as they're separate methods. We can wrap the call to read the keys or create a key in a lock but that doesn't help as we can still get the overlap shown above between these two steps.

What we'd like to be able to do is apply a lock around the process which both reads the keys and generates a new one if required, so that the next instance will always get the key generated by the first one to start and will therefore not generate another one.

Instance A Instance B
Aquire Lock
Read Keys (No Valid Keys Found) Attempt to Aquire Lock
Create Key A Wait for lock
Read Keys Wait for lock
Release Lock Wait for lock
Aquire Lock
Read Keys
Release Lock
Key A loaded Key A loaded

The following section of the KeyRingProvider implements locking, which implies this scenario has been considered, but it doesn't cater for multiple instance scenarios:

https://github.com/dotnet/aspnetcore/blob/2dc991393c29a65df82efdb75e8467b7ace5bb74/src/DataProtection/DataProtection/src/KeyManagement/KeyRingProvider.cs#L180C17-L180C17

Expected Behavior

I believe this could be solved if the IKeyManager (and the default XmlKeyManager) provided a means to wrap/lock the process of reading the keys and determining if they're valid as well as generating a new one.

E.G. IKeyManager.ProcessingKeys (which calls GetAllKeys and, when required, also calls CreateNewKey)

We could then override this and place a lock around the process.

Steps To Reproduce

No response

Exceptions (if any)

No response

.NET Version

7.0.400

Anything else?

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-dataprotection Includes: DataProtection label Aug 30, 2023
@Dhugal Dhugal changed the title IKeyManager / XMLManager - Fail intermittently in load balanced scenarios IKeyManager / XmlKeyManager - Fail intermittently in load balanced scenarios Aug 30, 2023
@pacorreia
Copy link

Any updates on this? Since August no reply from anyone

@amcasey
Copy link
Member

amcasey commented Jan 23, 2024

Sorry we didn't respond sooner - this is one of a family of issues and more of the chatter has happened on others.

Are you seeing this only on initialization, or while the app is running too?

  • Initialization is tricky because Instance B doesn't necessarily want to block requests until Instance A is ready.
  • In steady state, the expected behavior is that, while A and B might simultaneously notice that a new key is required, they would each generate a key to be activated far enough into the future that all other instances should have picked it up before it begins to be used

Depending on how your orchestration works, one option might be to run with a single instance until the first request has been processed, at which point any new instances will be able to see the key generated by the first instance. (Obviously, this is a quick-and-dirty mitigation and not a real solution.)

We're working on a more general approach, which you can track via #52915 and #52916.

@amcasey
Copy link
Member

amcasey commented Jan 23, 2024

Here's the issue describing how the steady-state refresh can fail: #52678

@ghost
Copy link

ghost commented Jan 26, 2024

Thanks for contacting us.

We're moving this issue to the .NET 9 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue, during our next planning meeting(s).
If we later determine, that the issue has no community involvement, or it's very rare and low-impact issue, we will close it - so that the team can focus on more important and high impact issues.
To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024
@wtgodbe wtgodbe removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024
@dotnet-policy-service dotnet-policy-service bot added the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 6, 2024
@wtgodbe wtgodbe removed the pending-ci-rerun When assigned to a PR indicates that the CI checks should be rerun label Feb 13, 2024
@dotnet dotnet deleted a comment from dotnet-policy-service bot Feb 13, 2024
@dotnet dotnet deleted a comment from dotnet-policy-service bot Feb 13, 2024
@amcasey
Copy link
Member

amcasey commented Oct 4, 2024

I'm going to close this issue. Please post additional feedback in #36157 so we can track it centrally and don't drop any.

@amcasey amcasey closed this as completed Oct 4, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Nov 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-dataprotection Includes: DataProtection
Projects
None yet
Development

No branches or pull requests

4 participants