Add ability for OTLP to indicate which records were rejected for retryable reasons #470
Comments
If I got it right, in your example it would be something like the following. Given a request where X marks the records that failed ingestion:
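(one possible such request, with 1-based indices, matching the response below:)

resource_logs 1
  scope_logs 1
    log 1  X
    log 2  X
  scope_logs 2
    log 1  X
resource_logs 2
  scope_logs 1
    log 1  X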
Then we would receive a (pseudo) response like:
{
"rejected_log_records": 3,
"error_message": "some logs could not be ingested..",
"retryable_resource_logs": [
{
"resource_logs": 1,
"scope_logs": 1,
"logs": [1, 2]
},
{
"resource_logs": 1,
"scope_logs": 2,
"logs": [1]
},
{
"resource_logs": 2,
"scope_logs": 1,
"logs": [1]
}
]
}

For the record, there was some older discussion about this when the initial partial success feature was introduced: #390 (comment). We also need to think about whether we want to make this generic, so the other signals can benefit from it as well. |
Thanks for taking a look @joaopgrassi.
I meant for this to be a list of indices, not the actual logs. So the pseudo response for my proposed structure would be:
{
"rejected_log_records": 3,
"error_message": "some logs could not be ingested..",
"retryable_resource_logs": [
{
"resource_logs": 1,
"retryable_scope_logs": [1, 2]
},
{
"resource_logs": 2,
"retryable_scope_logs": [1]
},
{
"resource_logs": 3,
"retryable_scope_logs": [3]
}
]
}

There would never be more than one entry for a given resource_logs index. To me, the more important aspect of this discussion is whether or not this is something that OTLP should support at all. And I definitely agree that we'd want to duplicate the pattern for every other signal as well, similar to how partial success was added 👍
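As a rough illustration only (the message and field names here are hypothetical, not an agreed proposal), the structure above could be expressed in proto as:

message RetryableResourceLogs {
  // Index of the ResourceLogs entry in the original export request.
  int64 resource_logs = 1;
  // Indices of the ScopeLogs entries under that resource whose
  // records may be retried.
  repeated int64 retryable_scope_logs = 2;
}
|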
But in your example, it's missing the scope as well, no?
Logs 1 and 2 refer to the resource with index 1, but which scope do they belong to? There could be more than one scope inside resource 1.
I'm also having a hard time coming up with alternatives; I'm not sure we have another possibility. If things had "ids", that would be one way. To me, the tricky part is the hierarchy of the structure. That makes it all way more complicated, for both parties (receivers and exporters). For logs we have ResourceLogs → ScopeLogs → LogRecords, so pinpointing a single record requires an index at each level (see the excerpt below). Another point: for exporters, it could be "easy" to add a configuration option to enable such detailed retry behavior, but for a receiver/server I wonder how that would work. I would not imagine a server always doing such complicated handling for all requests, especially when it's under pressure; but then how does a server decide when to perform such checks and respond accordingly?
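For reference, the nesting in the logs data model looks roughly like this (abridged from opentelemetry/proto/logs/v1/logs.proto):

message ResourceLogs {
  opentelemetry.proto.resource.v1.Resource resource = 1;
  repeated ScopeLogs scope_logs = 2;
}

message ScopeLogs {
  opentelemetry.proto.common.v1.InstrumentationScope scope = 1;
  repeated LogRecord log_records = 2;
}
|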
Moving the discussion to https://github.com/open-telemetry/opentelemetry-proto/, where the OTLP spec is currently being migrated. |
This needs a more elaborate proposal, one that also explains how interoperability between old and new versions is ensured. |
What are you trying to achieve?
For logging use cases, it is desirable to deliver logs with as little data loss as possible, without duplicating data. This matters more for logs than for metrics or traces, since a single log message may contain critical information about the behavior of the system being observed. A dropped metric or trace is rarely a critical issue: metrics will be repopulated on the next reporting interval, and traces are typically heavily sampled anyway.
What did you expect to see?
An ability for an OTLP client to know which records should be resent in order to meet delivery guarantees. This should be considered optional for clients/exporters to implement, but highly encouraged for a robust logging implementation.
Additional context.
The OTLP protocol was enhanced in open-telemetry/opentelemetry-specification#2454 to allow server implementations to indicate if all data had been accepted, but it does not indicate which records were rejected or why they were rejected. This issue is to restart the conversation about adding this capability to the protocol.
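For reference, the partial success message that change added for logs looks roughly like this:

message ExportLogsPartialSuccess {
  // The number of rejected log records.
  int64 rejected_log_records = 1;
  // A developer-facing, human-readable message in English.
  string error_message = 2;
}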
In Elasticsearch's architecture, some data in a single bulk ingestion request may be written, while other data may not be. Elasticsearch indicates this with a status code on each individual event: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html#bulk-api-response-body
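An abridged sketch of that bulk response shape (the values here are illustrative, not taken from a real response):

{
  "errors": true,
  "items": [
    { "create": { "status": 201 } },
    { "create": { "status": 429, "error": { "type": "es_rejected_execution_exception" } } }
  ]
}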
Proposed path forward
We should strive to provide the minimal amount of information needed for this case in the protocol, in order to simplify server and client support for this feature.
A first attempt at this:
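One possible shape, purely as a sketch (the new message and field names are hypothetical), extending the existing partial success message with index-based pointers into the original request:

message ExportLogsPartialSuccess {
  int64 rejected_log_records = 1;
  string error_message = 2;
  // Hypothetical addition: identifies records that were rejected
  // for retryable reasons.
  repeated RetryableLogRecords retryable_log_records = 3;
}

message RetryableLogRecords {
  // Index of the ResourceLogs entry in the original request.
  int64 resource_logs = 1;
  // Index of the ScopeLogs entry within that ResourceLogs.
  int64 scope_logs = 2;
  // Indices of the LogRecord entries within that ScopeLogs.
  repeated int64 log_records = 3;
}

A client that wants stronger delivery guarantees could rebuild a retry request containing only the referenced records, while clients and servers that ignore the new field would keep today's behavior, which matters for interoperability between old and new versions.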