Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 17350 - Formatting fractions of a second for POSIXt #83

Open
hturner opened this issue Nov 18, 2024 · 20 comments
Open

Bug 17350 - Formatting fractions of a second for POSIXt #83

hturner opened this issue Nov 18, 2024 · 20 comments
Labels
I/O Issues related to the input/output system LatinR 2024 needs patch Implement the agreed fix and prepare a patch for review R Issue should require knowledge of R only

Comments

@hturner
Copy link
Member

hturner commented Nov 18, 2024

The original report for Bug 17350 queried whether fractions of a second for a POSIXct were displayed correctly when printed. The discussion started to pin down some issues which I explore further here.

We'll use Sebastian Meyer's (@bastistician's) example code:

x <- as.POSIXct("2009-08-03 12:01:59") + 0:3/10

As Sebastian notes, print.POSIXct(x) calls format.POSIXct(x) which is a wrapper around format.POSIXlt(). If we check the seconds after converting x to POSIXct, we see there are lots of digits due to the floating point arithmetic (i.e. adding 0, 0.1, 0.2, 0.3 does not give exactly 59.0, 59.1, 59.2, 59.3).

secs <- as.POSIXlt(x)$sec
print(secs, digits = 15)
# [1] 59.0000000000000 59.0999999046326 59.2000000476837 59.2999999523163

Main Bug: inconsistent formatting of digits for fractional seconds

As documented in ?format.POSIXlt the default format is "%Y-%m-%d %H:%M:%S" when the global option digits.secs is NULL (or 0). In other words, no fractional seconds are displayed.

options(digits.secs = 0)
format(x)
# [1] "2009-08-03 12:01:59" "2009-08-03 12:01:59" "2009-08-03 12:01:59" "2009-08-03 12:01:59"

If we use the "%OS" format, 0:6 digits are used for the fractional seconds, with the number taken from the global option digits.secs, e.g.

options(digits.secs = 3)
format(x, "%Y-%m-%d %H:%M:%OS")
# [1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099" "2009-08-03 12:01:59.200"
# [4] "2009-08-03 12:01:59.299"

If we have set the global option digit.sec > 0 and use the default format, however, we only get 1 decimal place:

options(digits.secs = 3)
format(x)
# [1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2" "2009-08-03 12:01:59.2"

The same happens if the global option is at the default and we set digits = 3

options(digits.secs = 0)
format(x, digits = 3)
# [1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2" "2009-08-03 12:01:59.2"

This is due to a block of code in format.POSIXlt() (L385-400) that is only run when the default format is used (format == "") and adjusts the number of digits to avoid trailing zeroes.

The task here is to discuss how format.POSIXlt(x) could be modified to be more consistent and to prepare a corresponding patch.

Side issue: digits ignored when "%OS" format used

A side issue here is that when specifying "%OS" format directly, only the global digit.secs option has an effect:

options(digits.secs = 3)
format(x, "%Y-%m-%d %H:%M:%OS", digits = 2)
# [1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099" "2009-08-03 12:01:59.200"
# [4] "2009-08-03 12:01:59.299"
options(digits.secs = 0)
format(x, "%Y-%m-%d %H:%M:%OS", digits = 2)
# [1] "2009-08-03 12:01:59" "2009-08-03 12:01:59" "2009-08-03 12:01:59"
# [4] "2009-08-03 12:01:59"

This is inconsistent with how options in R usually behave, e.g.

options(digits = 2)
print(secs)
# [1] 59 59 59 59
print(secs, digits = 15)
# 59.0000000000000 59.0999999046326 59.2000000476837 59.2999999523163

and is due to the C code looking up the option to get the digits datetime.c#L986, rather than taking an argument.

So, an extension to this task could be adjusting the R or C code to respect the digits argument.

Other

Potentially the help file for format.POSIXlt() needs to be updated to document the behaviour of the function, e.g. the help for the format and digits arguments could be reviewed.

@hturner hturner added needs patch Implement the agreed fix and prepare a patch for review I/O Issues related to the input/output system LatinR 2024 R Issue should require knowledge of R only labels Nov 18, 2024
@eddelbuettel
Copy link

eddelbuettel commented Nov 18, 2024

This is a good and gnarly issue. (And my younger self appears to have been in the bugreport, albeit on a side-topic not fully groking the original issue -- it happens.) The subsequent July 2020 analysis by @bastistician is very good as is @mmaechler's followup. Implementing this in its simplest form, namely trusting options(digits.secs) and not looping to determine np works:

> options(digits.secs = 6)
> x <- as.POSIXct("2009-08-03 12:01:59") + 0:3/10
> format(x)  # we now get full precision as opposed to the truncation
[1] "2009-08-03 12:01:59.000000" "2009-08-03 12:01:59.099999" 
[3] "2009-08-03 12:01:59.200000" "2009-08-03 12:01:59.299999"
> format(x, format = "%Y-%m-%d %H:%M:%OS")
[1] "2009-08-03 12:01:59.000000" "2009-08-03 12:01:59.099999" 
[3] "2009-08-03 12:01:59.200000" "2009-08-03 12:01:59.299999"
> 

Selecting, say, three digits either as the digits argument or via the format string also works and still demonstration fractional error due to the well-known floating-point representation issues:

> format(x, digits = 3)
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
> format(x, format = "%Y-%m-%d %H:%M:%OS3")
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"

(Part of me is now wondering of the secs component should be rounded to digits precision. But that may be a different question. The next comment looks into this.)

For completeness, code for that would be as follows commenting out secs and the np calculation using it.

format.POSIXlt <- function(x, format = "", usetz = FALSE,
                           digits = getOption("digits.secs"), ...)
{
    if(!inherits(x, "POSIXlt")) stop("wrong class")
    if(any(f0 <- format == "")) {
        ## need list [ method here.
        times <- unlist(unclass(x)[1L:3L])[f0]
        #secs <- x$sec[f0]; secs <- secs[is.finite(secs)]
        np <- if(is.null(digits)) 0L else min(6L, digits)
        #if(np >= 1L) # no unnecessary trailing '0' :
        #    for (i in seq_len(np)- 1L)
        #        if(all( abs(secs - round(secs, i)) < 1e-6 )) {
        #            np <- i
        #            break
        #        }
        format[f0] <-
            if(all(times[is.finite(times)] == 0)) "%Y-%m-%d"
            else if(np == 0L) "%Y-%m-%d %H:%M:%S"
            else paste0("%Y-%m-%d %H:%M:%OS", np)
    }
    .Internal(format.POSIXlt(x, format, usetz))
}

@eddelbuettel
Copy link

eddelbuettel commented Nov 18, 2024

If we added rounding, for example via

if (!is.null(digits)) x$sec <- round(x$sec, digits)

before the other processing that the default output (i.e. no format specified) is a truncation-free single digit supressing the trailing digits as was desired:

[1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.1" 
[3] "2009-08-03 12:01:59.2" "2009-08-03 12:01:59.3"

and the full-precision format is also rounded:

[1] "2009-08-03 12:01:59.000000" "2009-08-03 12:01:59.100000"
[3] "2009-08-03 12:01:59.200000" "2009-08-03 12:01:59.300000"

For completeness, code for that would be as follows showing the added line

format.POSIXlt <- function(x, format = "", usetz = FALSE,
                           digits = getOption("digits.secs"), ...)
{
    if(!inherits(x, "POSIXlt")) stop("wrong class")
    if (!is.null(digits)) x$sec <- round(x$sec, digits)
    if(any(f0 <- format == "")) {
        ## need list [ method here.
        times <- unlist(unclass(x)[1L:3L])[f0]
        secs <- x$sec[f0]; secs <- secs[is.finite(secs)]
        np <- if(is.null(digits)) 0L else min(6L, digits)
        if(np >= 1L) # no unnecessary trailing '0' :
            for (i in seq_len(np)- 1L)
                if(all( abs(secs - round(secs, i)) < 1e-6 )) {
                    np <- i
                    break
                }
        format[f0] <-
            if(all(times[is.finite(times)] == 0)) "%Y-%m-%d"
            else if(np == 0L) "%Y-%m-%d %H:%M:%S"
            else paste0("%Y-%m-%d %H:%M:%OS", np)
    }
    .Internal(format.POSIXlt(x, format, usetz))
}

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

I'm getting different results from you @eddelbuettel. Here with R --vanilla to be safe:

R.Version()$version.string
#> [1] "R version 4.4.2 (2024-10-31)"
R.Version()$platform
#> [1] "aarch64-apple-darwin20"
# chunk 1 from https://github.com/r-devel/r-dev-day/issues/83#issuecomment-2484226381
options(digits.secs = 6)
x <- as.POSIXct("2009-08-03 12:01:59") + 0:3/10
format(x)
#> [1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2"
#> [4] "2009-08-03 12:01:59.2"
format(x, format = "%Y-%m-%d %H:%M:%OS")
#> [1] "2009-08-03 12:01:59.000000" "2009-08-03 12:01:59.099999"
#> [3] "2009-08-03 12:01:59.200000" "2009-08-03 12:01:59.299999"
# chunk 2 from https://github.com/r-devel/r-dev-day/issues/83#issuecomment-2484226381
format(x, digits = 3)
#> [1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2"
#> [4] "2009-08-03 12:01:59.2"
format(x, format = "%Y-%m-%d %H:%M:%OS3")
#> [1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
#> [3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
# chunk 2 resetting `digits.secs` first
options(digits.secs = 0)
format(x, digits = 3)
#> [1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2"
#> [4] "2009-08-03 12:01:59.2"
format(x, format = "%Y-%m-%d %H:%M:%OS3")
#> [1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
#> [3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"

Created on 2024-11-19 with reprex v2.1.1

Can you check and if still getting differences, try to work out why?

@eddelbuettel
Copy link

eddelbuettel commented Nov 19, 2024

Sure but I applied my draft patch against (by convention) R-devel, not R-release.

R-release

edd@rob:~$ Rscript --vanilla /tmp/r/heather_reprex.R 
[1] "R version 4.4.2 (2024-10-31)"
[1] "x86_64-pc-linux-gnu"
[1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2"
[4] "2009-08-03 12:01:59.2"
[1] "2009-08-03 12:01:59.000000" "2009-08-03 12:01:59.099999"
[3] "2009-08-03 12:01:59.200000" "2009-08-03 12:01:59.299999"
[1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2"
[4] "2009-08-03 12:01:59.2"
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
[1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2"
[4] "2009-08-03 12:01:59.2"
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
edd@rob:~$ 

R-devel with draft patch

edd@rob:~$ RDscript --vanilla /tmp/r/heather_reprex.R 
[1] "R Under development (unstable) (2024-11-18 r87347)"
[1] "x86_64-pc-linux-gnu"
[1] "2009-08-03 12:01:59.000000" "2009-08-03 12:01:59.099999"
[3] "2009-08-03 12:01:59.200000" "2009-08-03 12:01:59.299999"
[1] "2009-08-03 12:01:59.000000" "2009-08-03 12:01:59.099999"
[3] "2009-08-03 12:01:59.200000" "2009-08-03 12:01:59.299999"
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
[1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.099"
[3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.299"
edd@rob:~$ 

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

Ah okay, I missed that those results were with your patch.

@eddelbuettel
Copy link

eddelbuettel commented Nov 19, 2024

No worries. That is actually a nice summary of the change because a) R-release clearly shows how bad the truncation is and b) that the patch helps some and how it have me the idea to also round - but that may be a step too far.

I ran the patch through CI at r-svn as well getting a cherished ✔️ there.

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

I think your first patch is closer to what we want here.

We want to avoid rounding, because as noted by @mmaechler in comment 7 of the bug report, the C code always truncates rather than rounds to avoid bug 14579, i.e. to avoid invalid times like "19:59:60.000".

So I think the bit left to fix is the issue of trailing digits. The comment in the code says the intention is to avoid trailing zeros, similar to how the trailing zeros are not shown in the first case here:

> print(1.00000, digits = 6)
[1] 1
> print(1.00001, digits = 6)
[1] 1.00001

However, in our example, the code is not working correctly, because it is dropping non-zero digits. I haven't worked it through but I guess this is due to the R code assuming rounding and the C code actually truncating.

@eddelbuettel
Copy link

eddelbuettel commented Nov 19, 2024

I think your first patch is closer to what we want here.

Yes of course. That is the patch I tested in r-svn.

However, in our example, the code is not working correctly,

I know you do not have easy access to the patched build but can you help me follow along? Which input produces which wrong output and which expected output where we looking for?

The C code in datetime is a little beyond what I have appetite for. So all I may have accomplished here is to illustrate that the suggestion by @mmaechler to maybe just stick with the chosen digits is an improvement.

Returning one digit when digits.sevs is six is indeed bad. It can always be overrules by explicit giving at argument, %OS6 will return six digits so it is not fatal but this is still an irritant of a corner case.

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

I don't think we need to touch the C code. I'll try to come up with some examples in time for the Dev Day session.

@eddelbuettel
Copy link

Sounds good. I started hard and long at the np calculation you highlight and I only saw one approach of not doing all this ... which is suboptimal even if it moves the needly in the right direction for some inputs. Not addressing the underlying truncation in the C ties our hands a little too.

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

Okay, the trailing zeros behaviour isn't quite what I expected in R-release.

Example 1

y <- as.POSIXlt(c("2009-08-03 12:01:59.0001", "2009-08-03 12:01:59.1001",
                  "2009-08-03 12:01:59.2001", "2009-08-03 12:01:59.3001"))
print(as.POSIXlt(y)$sec, digits = 6)
#> [1] 59.0001 59.1001 59.2001 59.3001

format(y, format = "%Y-%m-%d %H:%M:%OS3")
#> [1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.100"
#> [3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.300"
options(digits.secs = 3)
format(y)
#> [1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.100"
#> [3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.300"

I was expecting at least the second case there to drop the trailing zeros, because

print(1.1001, digits = 4)
#> [1] 1.1

and with format = "", we go through loop to adjust np.

Example 2

z <- as.POSIXlt(c("2009-08-03 12:01:59.000", "2009-08-03 12:01:59.100",
                  "2009-08-03 12:01:59.200", "2009-08-03 12:01:59.300"))
print(as.POSIXlt(z)$sec, digits = 6)
#> [1] 59.0 59.1 59.2 59.3

format(z, format = "%Y-%m-%d %H:%M:%OS3")
#> [1] "2009-08-03 12:01:59.000" "2009-08-03 12:01:59.100"
#> [3] "2009-08-03 12:01:59.200" "2009-08-03 12:01:59.300"
options(digits.secs = 3)
format(z)
#> [1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.1" "2009-08-03 12:01:59.2"
#> [4] "2009-08-03 12:01:59.3"

Here we have an inconsistency in R-release. I guess that your patch will make these both like the first case. If that's true, we have two options:

  • Update your patch to modify np to drop trailing zeros (e.g. by working out the desired np and then updating the format to "%OS<np>" before passing to the C code)
  • Propose your patch as the way forward, since the trailing zero behaviour isn't fully documented anyway - the only hint I see in ?format.POSIXct is (my emphasis):

    If options("digits.secs") is set, up to the specified number of digits will be printed for seconds.

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

Example 0

This is what I meant by my earlier comment

However, in our example, the code is not working correctly, because it is dropping non-zero digits.

options(digits.secs = 7)
x <- as.POSIXct("2009-08-03 12:01:59") + 0:3/10
print(as.POSIXlt(x)$sec, digits = 9)
#> [1] 59.0000000 59.0999999 59.2000000 59.3000000
format(x)
#> [1] "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.0" "2009-08-03 12:01:59.2"
#> [4] "2009-08-03 12:01:59.2"

I think the format command should have the same digits as the print in this case. This is for R-release, so I mean the current code to adjust np is not working correctly if the aim is only to drop trailing zeros.

@eddelbuettel
Copy link

I was expecting at least the second case there to drop the trailing zeros, because

No this is as coded. You provided a format string which makes f0 <- format == "") false so the entire code block does not apply!. The comparison to print may what is desired but what never realized.

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

I do not provide a format string in the second call to format in Example 1.

@eddelbuettel
Copy link

I am getting lost here and I fear we are talking past each other.

This is for R-release, so I mean the current code to adjust np is not working correctly if the aim is only to drop trailing zeros.

Yes. That is what I have been saying a few times by now.

@eddelbuettel
Copy link

eddelbuettel commented Nov 19, 2024

I do not provide a format string in the second call to format in Example 1.

You do here:

image

That zeros are being dropped in the subsequent call is AFAICT the bug documted in 2017 and discussed in 2020.

@hturner
Copy link
Member Author

hturner commented Nov 19, 2024

Let's talk it through in the Dev Day session and see if we can get on the same page!

@eddelbuettel
Copy link

While CI is running the diff (with the whitespace changes as the indentation seems wrong to me) is:

Index: src/library/base/R/datetime.R
===================================================================
--- src/library/base/R/datetime.R       (revision 87347)
+++ src/library/base/R/datetime.R       (working copy)
@@ -382,21 +382,14 @@
                            digits = getOption("digits.secs"), ...)
 {
     if(!inherits(x, "POSIXlt")) stop("wrong class")
-    if(any(f0 <- format == "")) {
+    if(any(f0 <- format == "" | grepl("%OS$", format))) {
         ## need list [ method here.
-    times <- unlist(unclass(x)[1L:3L])[f0]
-    secs <- x$sec[f0]; secs <- secs[is.finite(secs)]
+        times <- unlist(unclass(x)[1L:3L])[f0]
         np <- if(is.null(digits)) 0L else min(6L, digits)
-        if(np >= 1L) # no unnecessary trailing '0' :
-            for (i in seq_len(np)- 1L)
-                if(all( abs(secs - round(secs, i)) < 1e-6 )) {
-                    np <- i
-                    break
-                }
-    format[f0] <-
-        if(all(times[is.finite(times)] == 0)) "%Y-%m-%d"
-        else if(np == 0L) "%Y-%m-%d %H:%M:%S"
-        else paste0("%Y-%m-%d %H:%M:%OS", np)
+        format[f0] <-
+            if(all(times[is.finite(times)] == 0)) "%Y-%m-%d"
+            else if(np == 0L) "%Y-%m-%d %H:%M:%S"
+            else paste0("%Y-%m-%d %H:%M:%OS", np)
     }
     .Internal(format.POSIXlt(x, format, usetz))
 }
Index: src/library/base/man/strptime.Rd
===================================================================
--- src/library/base/man/strptime.Rd    (revision 87347)
+++ src/library/base/man/strptime.Rd    (working copy)
@@ -39,7 +39,7 @@
     methods is
     \code{"\%Y-\%m-\%d \%H:\%M:\%S"} if any element has a time
     component which is not midnight, and \code{"\%Y-\%m-\%d"}
-    otherwise.  If \code{\link{options}("digits.secs")} is set, up to
+    otherwise.  If \code{\link{options}("digits.secs")} is set, 
     the specified number of digits will be printed for seconds.}
   \item{\dots}{further arguments to be passed from or to other methods.}
   \item{usetz}{logical.  Should the time zone abbreviation be appended

@eddelbuettel
Copy link

eddelbuettel commented Nov 19, 2024

GH pull request is r-devel/r-svn#187, second commit at r-devel/r-svn@fa7b99b contains today's refinement, above is the diff we will likely submitt to bugzilla. CI is complete for 6 of 8 right now.

And patch submitted to bugzilla too.

@hturner
Copy link
Member Author

hturner commented Nov 20, 2024

Patch updated following Martin Maechler's feedback, tested on the PR and now submitted to Bugzilla.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I/O Issues related to the input/output system LatinR 2024 needs patch Implement the agreed fix and prepare a patch for review R Issue should require knowledge of R only
Projects
None yet
Development

No branches or pull requests

2 participants