Fixes a mistake in Dedup function #125
Conversation
…within the definition of the Aggregation operator
My opinion is that errors should participate in the de-duplication. That is, …
I don't think this matters, since the only visible effects of errors are ones where the cause of the errors is lost (e.g. …).

I think we should create some tests to improve coverage in this area. There's an existing syntax test for …
Errors (of the aggregated expression) have different impacts e.g. for …
@hartig and I ran across this case while working on a custom aggregate that has special handling for errors, so selfishly I'd like to try for a resolution here that did not special-case …

I think any impacts errors have on existing aggregates (like …)
I see your point. Other possibilities are bag or multiset.
It still has some "special" - that is probably inevitable in some form unless there is a mapping of …

Can we explore that as a separate issue/PR if you can give an example of that special handling of errors?
No, and it seems intuitive.
It's special cased, but still uses the machinery of …
So I think either way we need to be clear about what Dedup/Distinct does with multiple errors in its input sequence.
Sure. If we handle that as a separate issue, then I'm OK with this current PR getting merged.
…al case of COUNT(DISTINCT *)
Now, with this issue aside, we can focus on the question that I asked earlier (which is not relevant for …) and consider the following query (where …).

Since there is no …

At the point when the definition of the evaluation semantics of the query arrives at the Aggregation operator, the multiset of these three solution mappings has been turned into a sequence Ψ, containing the three mappings in an arbitrary order. For the sake of the example, assume the sequence is Ψ = [μ1, μ2, μ3]. Then, the corresponding sequence M(Ψ) in the definition of the Aggregation operator becomes: M(Ψ) = [("o"), (error), (error)]. Now, the question is, should Dedup(M(Ψ)) be

1. [("o"), (error), (error)] (i.e., errors do not participate in the deduplication), or
2. [("o"), (error)] (i.e., duplicate errors are deduplicated)?
The definition as given currently in this PR says the former, whereas @kasei argues for the latter (and I am fine with that as well). @afs @Tpt @rubensworks what do you think it should be?
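A minimal Python sketch of the two candidate behaviors for the example above (illustrative only; the ERROR sentinel and the function names are hypothetical stand-ins, not spec text):

```python
ERROR = object()  # hypothetical sentinel standing in for an evaluation error

m_psi = [("o",), (ERROR,), (ERROR,)]  # M(Ψ) from the example above

def dedup_option1(seq):
    """Option 1: lists containing errors never count as duplicates."""
    out = []
    for lst in seq:
        if ERROR in lst or lst not in out:
            out.append(lst)
    return out

def dedup_option2(seq):
    """Option 2: errors compare equal, so duplicate error lists collapse."""
    out = []
    for lst in seq:
        if lst not in out:
            out.append(lst)
    return out

print(dedup_option1(m_psi))  # [("o",), (ERROR,), (ERROR,)] - both errors kept
print(dedup_option2(m_psi))  # [("o",), (ERROR,)] - duplicate error removed
```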
Not really. The elements e1, ..., en in the "lists" that ListEval returns have an order.
If we go for option 2, then we'll need a way to check for equality between errors, which may lead us into a rabbit-hole of what it means for errors to be equal to each other (by type of error, by stack trace, ...?)
Not necessarily. We can also simply …
Yet, I have a different argument in favor of the first option: If we go for the second option, we are closing the door for any set function that would require the first option (i.e., once the sequence Dedup(M(Ψ)) without duplicate errors is passed to such a set function, the function cannot anymore recover the duplicated errors). In contrast, if we go for the first option, set functions that require the second option are still possible (that is, the de-duplication of errors can still be defined as part of such a set function).
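The information-preservation argument, continuing the sketch above (same hypothetical ERROR, m_psi, dedup_option1, and dedup_option2):

```python
def custom_set_function_deduping_errors(seq):
    # Option-2-style behavior can be layered on top of option 1:
    return dedup_option2(seq)

def custom_set_function_counting_errors(seq):
    # Needs the duplicate errors to survive the general Dedup:
    return sum(1 for lst in seq if ERROR in lst)

after1 = dedup_option1(m_psi)  # [("o",), (ERROR,), (ERROR,)]
after2 = dedup_option2(m_psi)  # [("o",), (ERROR,)]

assert custom_set_function_deduping_errors(after1) == [("o",), (ERROR,)]
assert custom_set_function_counting_errors(after1) == 2  # duplicates recoverable
assert custom_set_function_counting_errors(after2) == 1  # duplicates lost for good
```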
That's true, but only because the user has opted in to this behavior with the use of DISTINCT.

But going with option 1 means that to allow handling of option-2-style semantics in a custom aggregate, you'd need to either have it be opt-in via the aggregate itself (e.g. …)
In my mind, this isn't necessary. Since this is about the behavior of the language, the semantics can't be tied to any particular implementation, meaning nothing like stack trace could be a part of the "equality" of errors. Since SPARQL lacks any sort of rich error typing system (the spec links to XPath which for our purposes only distinguishes type errors and dynamic errors), error handling elsewhere in the language is just based on the presence of an error and not its type or other details. I think …
I don't think so. If we go for option 2, then custom set functions won't even get to see the duplicated errors because the Dedup function is applied before the resulting sequence of lists is passed to the set function.
Maybe we're talking past each other. I'm under the impression that …

The proposed spec text is: …
My point is that if the set function wants to handle duplicated errors in some custom way, queries can just omit the DISTINCT.
Ah! It wasn't obvious to me that this is what you meant. Thanks for the clarification. I agree.
@Tpt I see that you approved this PR. Does this mean you prefer option 1 from the two options described in my earlier comment (see above)? @afs Same question.
Are we agreed that with the spec-defined aggregates, this makes no observable difference? (because …)

If DISTINCT is to be handled at the general level, not per aggregate function, I prefer that errors are deduped.

There is one runtime "error" in SPARQL - expression eval error. (Enumerating the possibilities seemed implementation-extension and optimization-limiting IIRC, but it was a while ago.)

I'd like to see a concrete example of how DISTINCT would be used where each error is considered different. When treating errors differently, the "distinct" might vary by error class: "undef", "divide by zero", even custom reasons like "empty string not acceptable here", and include "ignore".

If a custom aggregate wants to keep control, it should not rely on DISTINCT but handle the different situations itself. If there is a need to allow duplicate errors while having DISTINCT values, then maybe it would be better to have DISTINCT as a flag and put the execution decision into the aggregate definition itself. It might work to allow a custom aggregate to "override" Dedup.
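A rough sketch of this "DISTINCT as a flag" idea, reusing the hypothetical ERROR, m_psi, dedup_option1, and dedup_option2 from the earlier sketch; the Aggregate class and its method names are invented for illustration, not part of any spec text:

```python
class Aggregate:
    def dedup(self, seq):
        # Default: the general Dedup (option 2, errors deduplicated);
        # a custom aggregate may override this.
        return dedup_option2(seq)

    def evaluate(self, seq, distinct=False):
        # The DISTINCT flag is handed to the aggregate, whose own
        # definition decides what deduplication means for it.
        if distinct:
            seq = self.dedup(seq)
        return self.apply(seq)

class CountLike(Aggregate):
    def apply(self, seq):
        # COUNT-like: count the error-free lists
        return sum(1 for lst in seq if ERROR not in lst)

class ErrorCounting(Aggregate):
    def dedup(self, seq):
        # "Overrides" Dedup to keep duplicate errors:
        return dedup_option1(seq)

    def apply(self, seq):
        return sum(1 for lst in seq if ERROR in lst)

print(CountLike().evaluate(m_psi, distinct=True))      # 1
print(ErrorCounting().evaluate(m_psi, distinct=True))  # 2
```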
Yes!
Okay. Then you are aligned with @kasei and, as I wrote somewhere above, I am fine with changing the definition of the Dedup function in this PR to make the deduplication of errors an explicit part of this definition. Give me a few minutes to create and push the commit that implements this change. |
That's a great question. I have no strong opinion; after thinking about it, I would tend to think that deduplication is better (so +1 to @kasei and @afs), but I am not sure how the deduplication should be done: should all errors be merged into a single error (the current SPARQL behavior, which does not distinguish between different expression evaluation errors), or should we use an implementation-defined definition of error (so the number of deduplicated errors is based on the implementation of errors by the implementor)? Using the first definition might lead to a breaking change if we introduce specialized expression evaluation errors in the future (but that sounds very unlikely...).
I have pushed the commit with the change (see commit e84412b). With this change, the behavior of the Dedup function in my example above is option 2 now; i.e., errors are deduplicated.
Good question. By the current definition of the Dedup function (i.e., after the commit that I just pushed), one of the deduplicated errors survives, but it does not matter which. The particular part of the definition that is relevant for this behavior is the second property in the definition (i.e., "Every list in Dedup(M(Ψ)) is contained in M(Ψ).")
… result in replacing or merging of duplicate errors
Just to clarify this aspect a bit more, by that definition, a query processor must pick one of the deduplicated errors and use that error in the result of the deduplication. In other words, by this definition, query processors are not permitted to create a new error during the deduplication, where this new error could be some form of a merge of the errors that are deduplicated. Assuming that this aspect of the definition is probably too restrictive, I have changed the definition a bit more (see commit 05521cf). Now it is possible that the result of a deduplication contains a new error as a replacement of all the errors that were deduplicated.
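A self-contained sketch of the revised behavior, assuming a hypothetical ERROR sentinel and a hypothetical implementation-chosen replacement error FRESH:

```python
ERROR = object()  # hypothetical sentinel for an original evaluation error
FRESH = object()  # hypothetical implementation-chosen replacement error

def dedup(seq, fresh_error=None):
    seen, out = [], []
    for lst in seq:
        if lst not in seen:  # duplicates agree position by position
            seen.append(lst)
            if fresh_error is not None and ERROR in lst:
                # Permitted after the latest change: the surviving errors
                # may be replaced by a new, implementation-created error.
                lst = tuple(fresh_error if x is ERROR else x for x in lst)
            out.append(lst)
    return out

m_psi = [("o",), (ERROR,), (ERROR,)]
print(dedup(m_psi))                     # one of the original errors survives
print(dedup(m_psi, fresh_error=FRESH))  # the survivor is a replacement error
```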
I think this is the effect in the latest commit.
There is only one error value in SPARQL. It does not carry a reason. That would get lost in …

As @kasei mentioned, if a custom aggregate wants to be smart in some way, which I think is more of a corner case, it can get all the evaluations.
I am not sure what you mean by "this".
Interesting. That is an aspect of the spec that was not clear to me, and certainly not obvious. I mean, I would say I know most parts of the document quite well, but I don't think I ever came across a definition or statement that explicitly introduces this one error value. In contrast, most of the subsections about the various built-in functions talk about "rais[ing] an error" or "produc[ing] an error" (notice the indefinite article), which does not sound like returning the one error value. Also, already the very first paragraph of Section 17 (Expressions and Testing Values) talks about "errors" (plural!) and refers to different kinds of errors defined by XQuery, and then there are different places that refer to "a type error" (indefinite article again). All of these things give me the impression that there can be all sorts of errors and these may be thrown/raised as an effect of evaluating expressions, rather than the existence of a specific single value that captures all errors and that is explicitly returned in the way any RDF term may be returned by the evaluation of an expression.

Of course, I know that there is the third value, "error (E)", that is considered in the three-valued logic used for FILTER evaluation, but I was always assuming that "error (E)" is simply a symbol used for the purpose of defining the two operators, logical-and and logical-or.

Anyways, are you okay with the definition of Dedup as given after the latest commit in this PR?
I don't think that an exception mechanism needs to be added.
Yes.
I suppose that, strictly, there is only one kind of error (type, not token). The functional forms do something with them, but the error situation is everywhere in the evaluation.
Thanks @afs ! :-) @rubensworks you had some concerns related to the option that this PR implements now. Are you okay with the responses to this concern and with the PR as it is now?
@hartig I agree with the text as-is. As implementers can now choose to create their own custom error to represent deduplicated errors, my previous concern has been resolved.
Perfect, @rubensworks. Thanks! Then we are all in agreement and I can merge this PR.
I just realized that I made another smaller mistake in PR #98. This PR is meant to fix this mistake.
The mistake is in the definition of the (new) Dedup function that PR #98 introduces as a helper function within the definition of the Aggregation operator. In particular, the issue is that the definition of Dedup assumes that M(Ψ) is a sequence of RDF terms, which is not the case. Instead, M(Ψ) is a sequence of lists that are produced by the ListEval function, where each such list may contain RDF terms and errors (which may have occurred when evaluating any of the expressions given as arguments to the corresponding aggregate).
The fix that I am proposing in this PR is twofold: First, it makes explicit that M(Ψ) is a sequence of (a specific type of) lists. Second, in the part of the definition of Dedup that focuses on the removal of duplicates (i.e., the third property in the definition of Dedup), I have now made clear what it means for two such lists to be the same (i.e., duplicates). While the first of these two things is probably not controversial, the second one is a bit more delicate. The issue here is: what to do with lists that contain errors?
The stance that the PR in its current form takes regarding this question is that lists with errors cannot be the same lists. While I think that this is a reasonable option, I also notice that there is a sentence directly below the definition of the ListEval function that says that "errors may be used to group." For the sake of consistency, I would be okay with changing the corresponding part of this PR to say that two lists are considered the same not only if they have the same RDF terms in the same positions but also if they have errors in the same positions. The only thing that worries me about this is that it means throwing all types of errors into the same bucket (in fact, I have the same worry about the aforementioned sentence below the definition of ListEval). Opinions?
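To illustrate the "same bucket" concern, a small Python sketch of the duplicate test just described; the Error class with a reason field is purely hypothetical (SPARQL does not define such a type), which is exactly what makes the bucketing visible:

```python
class Error:
    """Hypothetical stand-in for an evaluation error; the reason is kept
    for illustration but deliberately ignored by the test below."""
    def __init__(self, reason):
        self.reason = reason

def same_list(l1, l2):
    # Duplicates iff the lists agree position by position, where an error
    # at a position matches any error there, regardless of kind or reason.
    return len(l1) == len(l2) and all(
        isinstance(a, Error) == isinstance(b, Error)
        and (isinstance(a, Error) or a == b)
        for a, b in zip(l1, l2)
    )

assert same_list(("a", Error("divide by zero")), ("a", Error("undef")))
assert not same_list(("a", Error("x")), (Error("x"), "a"))  # positions differ
assert not same_list(("a", "b"), ("a", Error("x")))         # term vs. error
```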
As a side note, I have to say that I do not like the term "list" in this context. I would prefer to use the term "tuple" instead. That is, the ListEval function produces a sequence of tuples that consist of RDF terms and errors, and thus the Dedup function operates over sequences of tuples, and sequences of tuples are then passed to the set functions.