Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Improve the Data Protection experience for apps with multiple instances #53654

Closed
amcasey opened this issue Jan 26, 2024 · 17 comments
Closed
Assignees
Labels
area-dataprotection Includes: DataProtection Epic Groups multiple user stories. Can be grouped under a theme.
Milestone

Comments

@amcasey
Copy link
Member

amcasey commented Jan 26, 2024

The golden path for Data Protection is having a single app instance, ideally on Windows. In that case, pulling anti-forgery, Razor pages, etc into the app, will automatically pull in Data Protection with its default settings. Keys will be stored locally in an XML file on disk and protected with (e.g.) Windows DPAPI. Along the golden path, app developers don't have to think about, let alone configure, Data Protection.

Things get more complicated when there are multiple app instances. For scalability and resilience, it's desirable for any app instance to be able to handle requests from any user session, so each app instance needs to be able to use keys generated by any other instance. Unfortunately, a mechanism for sharing keys is not something Data Protection can figure out on its own, so explicit configuration is required. Still more unfortunately, common approaches like storing the keys in Azure occasionally encounter races and communication failures that result in key-not-found errors.

What are we going to do about it? In 9.0, we're going to focus on:

  1. Making existing implementations more resilient against communication errors and initialization races.
  2. Evaluating a different approach in which a single, external component is responsible for generating keys and pushing them to shared storage where app instances can share read-only access to them.
  3. Documenting best practices for multi-instance apps.
@amcasey amcasey added area-dataprotection Includes: DataProtection Epic Groups multiple user stories. Can be grouped under a theme. labels Jan 26, 2024
@amcasey amcasey added this to the 9.0.0 milestone Jan 26, 2024
@amcasey amcasey self-assigned this Jan 26, 2024
@amcasey
Copy link
Member Author

amcasey commented Jan 26, 2024

A list of key-not-found issues: #36157
New pattern issues: #52915, #52916

@josephdecock
Copy link
Contributor

I'm really heartened to see this epic, because in my work on IdentityServer, the by far most common support topic is data protection. The problems that I see are all to do with this:

Along the golden path, app developers don't have to think about, let alone configure, Data Protection.

My users are on the golden path on their local dev machines, but not when they deploy! And IdentityServer uses the data protection services itself to protect sensitive data at rest, some of which can be very long lived (e.g., signing keys, refresh token data, etc), so data protection issues are especially pronounced in that context. I'm really hopeful to see what all comes of this.

@amcasey
Copy link
Member Author

amcasey commented Feb 22, 2024

After staring at the code for a while, this seems like a key source of these problems: #21096

Data Protection has a number of catch-and-swallow blocks, but this seems like the only one that wouldn't result in negligible harm (like losing a trace-level log message). Interestingly, a Warning is logged when this happens (which is how I found that issue), so it would be interesting to know if people are seeing this in prod.

[LoggerMessage(12, LogLevel.Warning, "Key {KeyId:B} is ineligible to be the default key because its {MethodName} method failed.", EventName = "KeyIsIneligibleToBeTheDefaultKeyBecauseItsMethodFailed")]
public static partial void KeyIsIneligibleToBeTheDefaultKeyBecauseItsMethodFailed(this ILogger logger, Guid keyId, string methodName, Exception exception);

@amcasey
Copy link
Member Author

amcasey commented Feb 22, 2024

Assuming the above is correct, we probably need:

  1. better handling for CanCreateAuthenticatedEncryptor failures
  2. some sort of retry logic for when the failure is transient

As noted in #21096, the existing recovery mechanism (swallow and eventually generate a new key) is probably helpful in apps with session affinity (not recommended) or a single instance, so we need to try to avoid making things worse for those apps.

amcasey added a commit to amcasey/aspnetcore that referenced this issue Feb 28, 2024
It used to be the case that part of IDefaultKeyResolver.ResolveDefaultKeyPolicy's job was to determine whether the current default key was close enough to expiration that a new one ought to be generated.  This didn't make sense as the definition of "too close" depended upon the refresh period and propagation time of the ICacheableKeyRingProvider.  That is, the IDefaultKeyResolver had to make assumptions about how often it would be polled for changed.

The old logic was also very subtle and, as far as I was able to determine, slightly incorrect.  Formerly, the presence of any key activated prior to the current default key's expiration date and not expiring during the next propagation cycle was considered an acceptable replacement.  Several things seem strange about this:

1. The logic for finding a successor key is not the same as the logic for finding a preferred key (e.g. CanCreateAuthenticatedEncryptor is not checked)
2. The propagation window is counted forward from the current time, rather than backward from the expiration time
3. It's not immediately clear what happens if the successor key is unexpired at the end of the propagation window but expired before the default key's expiration time (maybe that's impossible or maybe that would be caught next refresh?)
4. As mentioned above, it doesn't seem like the resolver should know about the refresh period or make assumptions about how often it's called

Now, the ICacheableKeyRingProvider is responsible for determining whether the returned default key is close enough to expiration that a new key should be generated.  It checks whether the current time is within one propagation cycle of the expiration time, padding by an extra refresh period to account for the fact that we don't know where in the refresh cycle expiration will fall (i.e. so that we never generate a new key _less_ than a full propagation cycle ahead of when it's needed).

Part of dotnet#53654
amcasey added a commit that referenced this issue Mar 1, 2024
* Move computation of ShouldGenerateNewKey to KeyRingProvider

It used to be the case that part of IDefaultKeyResolver.ResolveDefaultKeyPolicy's job was to determine whether the current default key was close enough to expiration that a new one ought to be generated.  This didn't make sense as the definition of "too close" depended upon the refresh period and propagation time of the ICacheableKeyRingProvider.  That is, the IDefaultKeyResolver had to make assumptions about how often it would be polled for changed.

The old logic was also very subtle and, as far as I was able to determine, slightly incorrect.  Formerly, the presence of any key activated prior to the current default key's expiration date and not expiring during the next propagation cycle was considered an acceptable replacement.  Several things seem strange about this:

1. The logic for finding a successor key is not the same as the logic for finding a preferred key (e.g. CanCreateAuthenticatedEncryptor is not checked)
2. The propagation window is counted forward from the current time, rather than backward from the expiration time
3. It's not immediately clear what happens if the successor key is unexpired at the end of the propagation window but expired before the default key's expiration time (maybe that's impossible or maybe that would be caught next refresh?)
4. As mentioned above, it doesn't seem like the resolver should know about the refresh period or make assumptions about how often it's called

Now, the ICacheableKeyRingProvider is responsible for determining whether the returned default key is close enough to expiration that a new key should be generated.  It checks whether the current time is within one propagation cycle of the expiration time, padding by an extra refresh period to account for the fact that we don't know where in the refresh cycle expiration will fall (i.e. so that we never generate a new key _less_ than a full propagation cycle ahead of when it's needed).

Part of #53654

* Don't repeat the second resolution after key generation

* Update comment

* Add explanatory comment

* Make comment more explicit
@amcasey
Copy link
Member Author

amcasey commented Mar 8, 2024

Telemetry: #54451

@ProTip
Copy link

ProTip commented Mar 27, 2024

Further to josephdecock's point I do not like the implications of point 2 per the description. Namely needing a second, external system to handle key generation and rotation.

This seems like something that could be resolved through distributed locking either in the IXmlRepository or through some distributed locking service provided to DP. ie the EF Core storage already has access to the DB. It could either handle locking itself or configure DP with a db-backed lock provider.

If the external component is the way forward I'd likely implement this as a background timer task that runs on each node and handles the locking.

Distributed locking is so useful and common(and with many good backing options via Redis, Dynamo, K8s, etc) that it would almost make sense to introduce the concept into Asp.Net Core.. But I digress 😄

@amcasey
Copy link
Member Author

amcasey commented Mar 27, 2024

The default application discriminator is the content root path, which can vary between a local instance and a deployed instance (e.g. to App Service). It would be nice if we could do something about this foot-gun.

@amcasey
Copy link
Member Author

amcasey commented Mar 27, 2024

@ProTip I'm not sure I fully understand your comment, so please let me know if I've missed your point.

I think you're saying that we could/should resolve key-write races using some sort of coordination mechanism. While introducing a new component does add complexity, I think it has some advantages over keeping the logic in the app instances:

  1. The key ring can be seeded before the app starts running so that instances don't have to wait for it to be read and updated.
  2. No coordination code is required since there is only one writer
  3. The interface we use to communicate with the backing store, IXmlRepository, is very restrictive and we can't change it without breaking numerous existing implementations

Note that there are no plans to eliminate the existing mechanisms, so you can continue to have each node do its own key generation and you're free to implement your own IXmlRepository with distributed locking (though, again, the interface is not very conducive to sophisticated implementations).

@ProTip
Copy link

ProTip commented Mar 27, 2024

Heya, yes I think you grok what I'm saying fully.

I think it's a great option to manually handle key generation and rotation however:

The key ring can be seeded before the app starts running so that instances don't have to wait for it to be read and updated.

Can't this be done anyway based on implementer requirements? I think a short delay when firing up a brand new application with an empty keyring isn't a concern for the vast majority.

No coordination code is required since there is only one writer

But it won't work as expected OOTB..

The interface we use to communicate with the backing...

I'm not as familiar with the interfaces or issues here. However this coordination logic may be able to be moved into the core DP code that reaches out to the IXmlRepository and work with a configured distributed lock provider. The extension methods that configure specific IXmlRepository implementations could also, optionally, register a bundled lock provider implementation.

Ultimately it feels a bit yucky that DP, which is supposed to just transparently work, would be the only(?) foundational Asp.Net Core component that introduces a requirement on an external service to function as expected(I'm tempted to say correctly) in a multi-node environment.

@amcasey
Copy link
Member Author

amcasey commented Mar 28, 2024

I think it's a great option to manually handle key generation and rotation however:

#53539 and possibly #53860 are the only changes I'm expecting to aspnetcore to support this new pattern. Without additional configuration, I wouldn't expect either to have any effect.

The key ring can be seeded before the app starts running so that instances don't have to wait for it to be read and updated.

Can't this be done anyway based on implementer requirements? I think a short delay when firing up a brand new application with an empty keyring isn't a concern for the vast majority.

It's not just the delay. Instances don't know how many other instances there are, so it's hard for them to ensure they're using a key available to all other instances (i.e. so that sessions can migrate between them).

No coordination code is required since there is only one writer

But it won't work as expected OOTB..

Tell me more about what it means for things to work out of the box? It sounds like you don't mean "without user configuration", because the app developer would need to establish a shared storage location and point all the app instances at it.

I'm not as familiar with the interfaces or issues here. However this coordination logic may be able to be moved into the core DP code that reaches out to the IXmlRepository and work with a configured distributed lock provider. The extension methods that configure specific IXmlRepository implementations could also, optionally, register a bundled lock provider implementation.

I'm not sure what design you have in mind, but it sounds like it would require instances to either know about each other or know about a shared coordination component. Is that right?

Ultimately it feels a bit yucky that DP, which is supposed to just transparently work, would be the only(?) foundational Asp.Net Core component that introduces a requirement on an external service to function as expected(I'm tempted to say correctly) in a multi-node environment.

None of this is expected to be required. At present, it is expected to be sample code demonstrating a way app developers can set up their application that we think will make them more reliable. There may also be a nuget package or something, but that's TBD.

Obviously, we're a long way from actually shipping and things could change.

@lukasz-zoglowek
Copy link

@amcasey will the work also address the instabilities/transient failures described here: #33071
We discussed that back in January, and I think the proposed solution was to not create new keys if the remote store is not accessible

@amcasey
Copy link
Member Author

amcasey commented Apr 24, 2024

@lukasz-zoglowek Sorry, you'll have to refresh my memory - a link to the conversation would help.

Having said that, #54299 should have addressed the fact that those timestamps can be out of order (which was harmless) and, yes, the goal of this epic is to eliminate most (but not all) of those cases. Not creating a new key when a remote store is not accessible is not viable as a general solution because it can leave the app without a key to use but #54711 added some retry support and #54490 made many potentially-flaky calls unnecessary. My expectations is that the combination of those changes will substantially mitigate the issue. I believe all of that work made it into Preview 4 and your feedback would be greatly appreciated.

@amcasey
Copy link
Member Author

amcasey commented Apr 29, 2024

Reliability should be substantially better in 9.0 Preview 4. If you continue to see key-not-found errors in Preview 4, please tag me.

@adityamandaleeka
Copy link
Member

@amcasey Is this epic done? It feels like the key deletion one isn't necessary to gate this on?

@md-redwan-hossain
Copy link

AES encryption of the DP key will be a very welcoming feature. Currently, the only way to encrypt the key when persisted via EF or Redis is to use a certificate. Why I can't just use AES encryption to encrypt and decrypt the key?

@amcasey
Copy link
Member Author

amcasey commented Jul 18, 2024

@md-redwan-hossain Data Protection is remarkably pluggable - I'd be surprised to learn you can't do that by providing your own encryptor. Regardless, this issue probably isn't the best place to get eyes on the question - can you please file a new one?

@amcasey
Copy link
Member Author

amcasey commented Jul 18, 2024

Key deletion is in and the docs PR is looking good.

@amcasey amcasey closed this as completed Jul 18, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Aug 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-dataprotection Includes: DataProtection Epic Groups multiple user stories. Can be grouped under a theme.
Projects
None yet
Development

No branches or pull requests

7 participants
@adityamandaleeka @ProTip @josephdecock @amcasey @md-redwan-hossain @lukasz-zoglowek and others