Fixes a mistake in Dedup function #125

Merged: 4 commits merged into main from FixOfDedupFunction on Oct 13, 2023
Conversation

@hartig (Contributor) commented Oct 8, 2023

I just realized that I made another smaller mistake in PR #98. This PR is meant to fix this mistake.

The mistake is in the definition of the (new) Dedup function that PR #98 introduces as a helper function within the definition of the Aggregation operator. In particular, the issue is that the definition of Dedup assumes that M(Ψ) is a sequence of RDF terms, which is not the case. Instead, M(Ψ) is a sequence of lists that are produced by the ListEval function, where each such list may contain RDF terms and errors (that may have occurred when evaluating any of the expressions given as arguments to the corresponding aggregate).

The fix that I am proposing in this PR is twofold: First, it makes explicit that M(Ψ) is a sequence of (a specific type of) lists. Second, in the part of the definition of Dedup that focuses on the removal of duplicates (i.e., the third property in the definition of Dedup), I have now made clear what it means for two such lists to be the same (i.e., duplicates). While the first of these two things is probably not controversial, the second one is a bit more delicate. The issue here is: what to do with lists that contain errors?

The stance that the PR in its current form takes regarding this question is that two lists containing errors are never considered the same. While I think that this is a reasonable option, I also notice that there is a sentence directly below the definition of the ListEval function that says that "errors may be used to group." For the sake of consistency, I would be okay with changing the corresponding part of this PR to say that two lists are considered the same not only if they have the same RDF terms in the same positions but also if they have errors in the same positions. The only thing that worries me about this is that it throws all types of errors into the same bucket (in fact, I have the same worry about the aforementioned sentence below the definition of ListEval). Opinions?

As a side note, I have to say that I do not like the term "list" in this context. I would prefer to use the term "tuple" instead. That is, the ListEval function produces a sequence of tuples that consist of RDF terms and errors, and thus the Dedup function operates over sequences of tuples, and sequences of tuples are then passed to the set functions.



…within the definition of the Aggregation operator
@hartig added the spec:substantive label (Change in the spec affecting its normative content (class 3); see also spec:bug, spec:new-feature) on Oct 8, 2023
@hartig requested review from kasei, afs, rubensworks and Tpt on October 8, 2023 11:02
@kasei (Contributor) commented Oct 8, 2023

My opinion is that errors should participate in the deduplication. That is, Dedup([ (error), (error) ]) should produce [ (error) ].

The only thing that worries me about this is that it means to throw all types of errors in the same bucket

I don't think this matters, since the only visible effects of errors are ones where the cause of the errors is lost (e.g. COUNT(*), or the absence of a variable binding).

I think we should create some tests to improve coverage in this area. There's an existing syntax test for COUNT(DISTINCT *) (syntax-query/manifest#test_7), but no eval tests as far as I can tell.

@afs (Contributor) commented Oct 8, 2023

Errors (of the aggregated expression) have different impacts; e.g., for Sum the summation becomes invalid, which is caused by passing the error along, as long as at least one error is present.

COUNT(DISTINCT *) may be better handled in the "special case" for COUNT(*).

@kasei (Contributor) commented Oct 8, 2023

@hartig and I ran across this case while working on a custom aggregate that has special handling for errors, so selfishly I'd like to try for a resolution here that does not special-case COUNT(DISTINCT *).

I think any impacts errors have on existing aggregates (like Sum, as you mention) would be unaffected by the use of DISTINCT, since the current impact is based on whether there are any errors present, not on the difference in cardinality of those errors. Can you think of any existing cases that would be negatively impacted by a definition of Dedup that included deduplication of errors?

@afs (Contributor) commented Oct 8, 2023

do not like the term "list" in this context. I would prefer to use the term "tuple" instead.

I see your point. Other possibilities are bag or multiset.

@afs (Contributor) commented Oct 8, 2023

@hartig and I ran across this case while working on a custom aggregate that has special handling for errors, so selfishly I'd like to try for a resolution here that does not special-case COUNT(DISTINCT *).

COUNT(*) is currently a special case because it is the cardinality of the rows. That could be unspecial'ed by having an expression meaning for *. Maybe an expression that returns a bag of bindings, which might make COUNT(DISTINCT *) work.

It still has some "special" - that is probably inevitable in some form unless there is a mapping of * to RDF term(s).

Can we explore that as a separate issue/PR if you can give an example of that special handling of errors?

I think any impacts errors have on existing aggregates (like Sum, as you mention) would be unaffected by the use of DISTINCT, since the current impact is based on whether there are any errors present, not on the difference in cardinality of those errors. Can you think of any existing cases that would be negatively impacted by a definition of Dedup that included deduplication of errors?

No, and it seems intuitive.

@kasei (Contributor) commented Oct 8, 2023

COUNT(*) is currently a special case because it is the cardinality of the rows. That could be unspecial'ed by having an expression meaning for *. Maybe an expression that returns a bag of bindings which might make COUNT(DISTINCT *) work.

It still has some "special" - that is probably inevitable in some form unless there is a mapping of * to RDF term(s).

It's special-cased, but still uses the machinery of Dedup (née Distinct):

when COUNT is used with the expression * the value of F will be … card[Distinct(Ω)] if the DISTINCT keyword is present.

So I think either way we need to be clear about what Dedup/Distinct does with multiple errors in its input sequence.

Can we explore that as a separate issue/PR if you can give an example of that special handling or errors?

Sure. If we handle that as a separate issue, then I'm OK with this current PR getting merged.

@hartig (Contributor, Author) commented Oct 8, 2023

COUNT(DISTINCT *) is indeed different, and I see now that PR #98 made (yet another) mistake by using the new Dedup function in the definition of this special case. For this special case, using the Distinct operator was indeed correct because the given argument is just Ψ rather than M(Ψ). I have just fixed this mistake in commit 48b19ee (added to this PR).

Now, with this issue aside, we can focus on the question that I asked earlier (which is relevant neither for COUNT(*) nor for COUNT(DISTINCT *)). Let me rephrase the question based on a concrete example. Consider the following RDF graph (given as a Turtle file, prefix declarations omitted).

ex:s1  ex:pa  ex:o1 .
ex:o1  ex:pb  "o" .

ex:s2  ex:pa  ex:o2 .

ex:s3  ex:pa  ex:o3 .

and consider the following query (where SomeSetFct is some set function).

SELECT ( SomeSetFct(DISTINCT ?z) AS ?cnt ) WHERE {
   ?x  ex:pa  ?y .
   OPTIONAL {
      ?y  ex:pb  ?z .
   }
}

Since there is no GROUP BY, we just have a single group, consisting of the following three solution mappings:
μ1 = { ?x -> ex:s1, ?y -> ex:o1, ?z -> "o" },
μ2 = { ?x -> ex:s2, ?y -> ex:o2 },
μ3 = { ?x -> ex:s3, ?y -> ex:o3 }.

At the point when the definition of the evaluation semantics of the query arrives at the Aggregation operator, the multiset of these three solution mappings has been turned into a sequence Ψ, containing the three mappings in an arbitrary order. For the sake of the example, assume the sequence is Ψ = [μ1, μ2, μ3]. Then, the corresponding sequence M(Ψ) in the definition of the Aggregation operator becomes: M(Ψ) = [("o"), (error), (error)].

Now, the question is, should Dedup(M(Ψ)) be

  1. [("o"), (error), (error)] or
  2. [("o"), (error)] ?

The definition as given currently in this PR says the former, whereas @kasei argues for the latter (and I am fine with that as well). @afs @Tpt @rubensworks what do you think it should be?
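
To make the two options concrete, here is a small Python sketch (illustrative only, not spec text; Err is a made-up marker standing in for an expression-evaluation error, since errors are not values we can write down directly):

class Err:
    """Made-up marker for an expression-evaluation error."""
    def __repr__(self):
        return "(error)"

def dedup_option_1(m_psi):
    # Lists containing errors are never considered the same, so every
    # such list survives; only error-free duplicate lists are removed.
    out = []
    for lst in m_psi:
        if any(isinstance(e, Err) for e in lst) or lst not in out:
            out.append(lst)
    return out

def dedup_option_2(m_psi):
    # Two lists are the same if they have the same RDF terms and errors
    # in the same positions; all errors are treated as interchangeable.
    seen, out = set(), []
    for lst in m_psi:
        key = tuple("ERR" if isinstance(e, Err) else e for e in lst)
        if key not in seen:
            seen.add(key)
            out.append(lst)
    return out

m_psi = [("o",), (Err(),), (Err(),)]
print(dedup_option_1(m_psi))  # [('o',), ((error),), ((error),)]  i.e. option 1
print(dedup_option_2(m_psi))  # [('o',), ((error),)]              i.e. option 2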

@hartig (Contributor, Author) commented Oct 8, 2023

@afs

do not like the term "list" in this context. I would prefer to use the term "tuple" instead.

I see your point. Other possibilities are bag or multiset.

Not really. The elements e1, ..., en in the "lists" that ListEval returns have an order.

@rubensworks (Member) commented

Now, the question is, should Dedup(M(Ψ)) be

  1. [("o"), (error), (error)] or
  2. [("o"), (error)] ?

If we go for option 2, then we'll need a way to check for equality between errors, which may lead us into a rabbit-hole of what it means for errors to be equal to each other (by type of error, by stack trace,... ?)
Intuitively, I therefore feel like option 1 might be the safer one. (but I don't have a strong opinion on this)

@hartig (Contributor, Author) commented Oct 9, 2023

[@rubensworks] If we go for option 2, then we'll need a way to check for equality between errors, which may lead us into a rabbit-hole of what it means for errors to be equal to each other (by type of error, by stack trace,... ?)

Not necessarily. We can also simply throw all errors into a single bucket (which I guess is also the intention for the aforementioned statement that "errors may be used to group").

@hartig (Contributor, Author) commented Oct 9, 2023

Yet, I have a different argument in favor of the first option: If we go for the second option, we are closing the door for any set function that would require the first option (i.e., once the sequence Dedup(M(Ψ)) without duplicate errors is passed to such a set function, the function can no longer recover the duplicated errors). In contrast, if we go for the first option, set functions that require the second option are still possible (that is, the deduplication of errors can still be defined as part of such a set function).

@kasei (Contributor) commented Oct 9, 2023

If we go for the second option, we are closing the door for any set function that would require the first option (i.e., once the sequence Dedup(M(Ψ)) without duplicate errors is passed to such a set function, the function can no longer recover the duplicated errors).

That's true, but only because the user has opted in to this behavior with the use of DISTINCT. Custom aggregates are free to define their own semantics for handling duplicated errors when passed non-distinct data.

But going with option 1 means that to allow handling of option-2-style semantics in a custom aggregate, you'd need to either have it be opt-in via the aggregate itself (e.g. SELECT (myCustomSetFuncDistinctErrors(?x) AS ?agg) { … }) or define new scalarvals to opt in to that behavior (e.g. SELECT (myCustomSetFunc(?x ; DISTINCT_ERRORS=true) AS ?agg) { … }). Both of these seem unpalatable to me.

@kasei (Contributor) commented Oct 9, 2023

If we go for option 2, then we'll need a way to check for equality between errors, which may lead us into a rabbit-hole of what it means for errors to be equal to each other (by type of error, by stack trace,... ?)

In my mind, this isn't necessary. Since this is about the behavior of the language, the semantics can't be tied to any particular implementation, meaning nothing like stack trace could be a part of the "equality" of errors. Since SPARQL lacks any sort of rich error typing system (the spec links to XPath which for our purposes only distinguishes type errors and dynamic errors), error handling elsewhere in the language is just based on the presence of an error and not its type or other details. I think DISTINCT handling in aggregates should be the same.
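
As a sketch of that notion of sameness (illustrative Python, not spec text; is_error is a made-up predicate meaning "this entry is an error"):

def same_entry(a, b, is_error):
    # Two entries are the same if both are errors (of whatever kind) or
    # both are the same RDF term; error details never participate.
    if is_error(a) and is_error(b):
        return True
    return not is_error(a) and not is_error(b) and a == b

def same_list(xs, ys, is_error):
    # Duplicates for Dedup: same length, pairwise-same entries.
    return len(xs) == len(ys) and all(
        same_entry(a, b, is_error) for a, b in zip(xs, ys))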

@hartig (Contributor, Author) commented Oct 9, 2023

If we go for the second option, we are closing the door for any set function that would require the first option (i.e., once the sequence Dedup(M(Ψ)) without duplicate errors is passed to such a set function, the function can no longer recover the duplicated errors).

That's true, but only because the user has opted in to this behavior with the use of DISTINCT. Custom aggregates are free to define their own semantics for handling duplicated errors when passed non-distinct data.

I don't think so. If we go for option 2, then custom set functions won't even get to see the duplicated errors because the Dedup function is applied before the resulting sequence of lists is passed to the set function.

@kasei (Contributor) commented Oct 9, 2023

I don't think so. If we go for option 2, then custom set functions won't even get to see the duplicated errors because the Dedup function is applied before the resulting sequence of lists is passed to the set function.

Maybe we're talking past each other. I'm under the impression that Dedup is only involved here if the aggregation in the query string has the DISTINCT keyword before the args (e.g. agg(DISTINCT ?x)).

The proposed spec text is:

F(Ψ) = func(M(Ψ), scalarvals), for non-DISTINCT
F(Ψ) = func(Dedup(M(Ψ)), scalarvals), for DISTINCT

My point is that if the set function wants to handle duplicated errors in some custom way, queries can just omit the DISTINCT keyword and get the full sequence of input without any deduplication. Do you agree with that? Have I misunderstood what you're saying?
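
A minimal sketch of that dispatch (illustrative Python; list_eval, dedup, and the callable expressions are made-up stand-ins for ListEval, Dedup, and expression evaluation):

def F(psi, exprs, func, scalarvals, distinct, dedup):
    # Build M(Ψ): one list per solution mapping, where each entry is the
    # value of one expression, or an error marker if evaluation failed.
    def list_eval(mu):
        vals = []
        for e in exprs:
            try:
                vals.append(e(mu))
            except Exception:
                vals.append("ERROR")  # stand-in for the evaluation error
        return tuple(vals)

    m = [list_eval(mu) for mu in psi]
    # Only the DISTINCT form ever applies Dedup; a custom set function
    # that wants the duplicated errors is simply used without DISTINCT.
    return func(dedup(m) if distinct else m, scalarvals)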

@hartig (Contributor, Author) commented Oct 9, 2023

Ah! It wasn't obvious to me that this is what you meant. Thanks for the clarification. I agree.

@hartig (Contributor, Author) commented Oct 10, 2023

@Tpt I see that you approved this PR. Does this mean you prefer option 1 from the two options described in my earlier comment (see above)?

@afs Same question.

@afs (Contributor) commented Oct 10, 2023

Are we agreed that with the spec-defined aggregates, this makes no observable difference? (because COUNT(*) does not have errors).

If DISTINCT is to be handled at the general level, not per aggregate function, I prefer that errors are deduped.
(I confess that's how I read the PR at "Dedup(M(Ψ)) is free of duplicates." - seems I fell into the list/tuple trap!)

There is one runtime "error" in SPARQL: expression eval error. (Enumerating the possibilities seemed to limit implementation extensions and optimizations, IIRC, but it was a while ago.)

I'd like to see a concrete example of how DISTINCT would be used where each error is considered different.

When treating errors differently, the "distinct" might vary by error class: "undef", "divide by zero", even custom reasons such as "empty string not acceptable here", and might include "ignore".

If a custom aggregate wants to keep control, it should not rely on DISTINCT but handle the different situations itself.

If there is a need to allow duplicate errors while having DISTINCT values, then maybe it would be better to have DISTINCT as a flag and put the execution decision into the aggregate definition itself. It might work to allow a custom aggregate to "override" Dedup.

@hartig (Contributor, Author) commented Oct 10, 2023

@afs

Are we agreed that with the spec-defined aggregates, this makes no observable difference? (because COUNT(*) does not have errors).

Yes!

If DISTINCT is to be handled at the general level, not per aggregate function, I prefer that errors are deduped.
[...]
If a custom aggregate wants to keep control, it should not rely on DISTINCT but handle the different situations itself.
[...]

Okay. Then you are aligned with @kasei and, as I wrote somewhere above, I am fine with changing the definition of the Dedup function in this PR to make the deduplication of errors an explicit part of this definition. Give me a few minutes to create and push the commit that implements this change.

@Tpt (Contributor) commented Oct 10, 2023

@Tpt I see that you approved this PR. Does this mean you prefer option 1 from the two options described in my earlier comment?

That's a great question. I have no strong opinion. After thinking about it, I would tend to think that deduplication is better (so +1 to @kasei and @afs), but I am not sure how the deduplication should be done: should all errors be merged into a single error (current SPARQL behavior, which does not distinguish between different expression evaluation errors), or should it use an implementation-defined definition of error (so the number of deduplicated errors is based on the implementation of errors by the implementor)? Using the first definition might lead to a breaking change if we introduce specialized expression evaluation errors in the future (but that sounds very unlikely...).

@hartig (Contributor, Author) commented Oct 10, 2023

I have pushed the commit with the change (see commit e84412b). With this change, the behavior of the Dedup function in my example above is option 2 now; i.e., errors are deduplicated.

[@Tpt] ... I am not sure how the deduplication should be done: should all errors be merged into a single error ... or ... ?

Good question. By the current definition of the Dedup function (i.e., after the commit that I just pushed), one of the deduplicated errors survives, but it does not matter which. The particular part of the definition that is relevant for this behavior is the second property in the definition (i.e., "Every list in Dedup(M(Ψ)) is contained in M(Ψ).")

… result in replacing or merging of duplicate errors
@hartig (Contributor, Author) commented Oct 11, 2023

[@Tpt] ... I am not sure how the deduplication should be done: should all errors be merged into a single error ... or ... ?

[@hartig] Good question. By the current definition of the Dedup function (i.e., after the commit that I just pushed), one of the deduplicated errors survives, but it does not matter which. The particular part of the definition that is relevant for this behavior is the second property in the definition (i.e., "Every list in Dedup(M(Ψ)) is contained in M(Ψ).")

Just to clarify this aspect a bit more, by that definition, a query processor must pick one of the deduplicated errors and use that error in the result of the deduplication. In other words, by this definition, query processors are not permitted to create a new error during the deduplication, where this new error could be some form of a merge of the errors that are deduplicated.

Since this aspect of the definition is probably too restrictive, I have changed the definition a bit more (see commit 05521cf). Now it is possible that the result of a deduplication contains a new error as a replacement for all the errors that were deduplicated.
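
A sketch of the revised behavior (illustrative Python, not the spec text; make_error is a made-up hook for the processor's freely chosen replacement error):

class Err:
    """Made-up marker for an expression-evaluation error."""

def dedup(m_psi, make_error=Err):
    # Keep the first occurrence of each list, treating all errors as
    # interchangeable when comparing lists ...
    seen, out = set(), []
    for lst in m_psi:
        key = tuple("ERR" if isinstance(e, Err) else e for e in lst)
        if key in seen:
            continue
        seen.add(key)
        # ... and, in the surviving lists, the processor may replace each
        # error entry by a new error (e.g. one merging the deduplicated
        # errors) rather than reusing an error taken from M(Ψ).
        out.append(tuple(make_error() if isinstance(e, Err) else e
                         for e in lst))
    return out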

@afs (Contributor) commented Oct 12, 2023

I think this is the effect in the latest commit.

Error remains a special value. It's special in that any function involving the error evals to an error. There are some special "function-like" operations that handle error in some way.

There is only one error value in SPARQL. It does not carry a reason. That would get lost in function(error) anyway. We don't have to have an exception mechanism. (SPARQL could - it just doesn't at the moment).

As @kasei mentioned, if a custom aggregate wants to be smart in some way, which I think is more of a corner case, it can get all the evaluations.

@hartig (Contributor, Author) commented Oct 13, 2023

@afs

I think this is the effect in the latest commit.

I am not sure what you mean by "this".

Error remains a special value. It's special in that any function involving the error evals to an error. [...] There is only one error value in SPARQL.

Interesting. That is an aspect of the spec that was not clear to me, and certainly not obvious. I mean, I would say I know most parts of the document quite well, but I don't think I ever came across a definition or statement that explicitly introduces this one error value. In contrast, most of the subsections about the various built-in functions talk about "rais[ing] an error" or "produc[ing] an error" (notice the indefinite article), which does not sound like returning the one error value. Also, already the very first paragraph of Section 17 (Expressions and Testing Values) talks about "errors" (plural!) and refers to different kinds of errors defined by XQuery, and then there are different places that refer to "a type error" (indefinite article again). All of these things give me the impression that there can be all sorts of errors and these may be thrown/raised as an effect of evaluating expressions, rather than the existence of a specific single value that captures all errors and that is explicitly returned in the way any RDF term may be returned by the evaluation of an expression.

Of course, I know that there is the third value, "error (E)", that is considered in the three-valued logic used for FILTER evaluation, but I was always assuming that "error (E)" is simply a symbol used for the purpose of defining the two operators, logical-and and logical-or.

Anyways, are you okay with the definition of Dedup as given after the latest commit in this PR?

We don't have to have an exception mechanism. (SPARQL could - it just doesn't at the moment).

I don't think that an exception mechanism needs to be added.

@afs (Contributor) commented Oct 13, 2023

Anyways, are you okay with the definition of Dedup as given after the latest commit in this PR?

Yes.

@afs (Contributor) commented Oct 13, 2023

I know that there is the third value, "error (E)", that is considered in the three-valued logic used for FILTER evaluation, but I was always assuming that "error (E)" is simply a symbol used for the purpose of defining the two operators, logical-and and logical-or.

I suppose that, strictly, there is only one kind of error (type, not token).

The functional forms do something with them, but the error situation is everywhere in the evaluation.
function(1/0), function(?unset), strlen(1.0e-2) -- error-ing gets everywhere :-)

@hartig (Contributor, Author) commented Oct 13, 2023

Thanks @afs ! :-)

@rubensworks you had some concerns related to the option that this PR now implements. Are you okay with the responses to those concerns and with the PR as it is now?

@rubensworks (Member) left a comment

@hartig I agree with the text as-is. As implementers can now choose to create their own custom error to represent deduplicated errors, my previous concern has been resolved.

@hartig (Contributor, Author) commented Oct 13, 2023

Perfect @rubensworks . Thanks!

Then we are all in agreement and I can merge this PR.

@hartig merged commit cefbd53 into main Oct 13, 2023
1 check passed
@hartig deleted the FixOfDedupFunction branch October 13, 2023 13:08