Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stable all.requests #5

Merged
merged 97 commits into from
Oct 7, 2024
Merged

Stable all.requests #5

merged 97 commits into from
Oct 7, 2024

Conversation

max-ostapenko
Copy link
Contributor

@max-ostapenko max-ostapenko commented Sep 2, 2024

Schema changes

Intermediate table

all.requests_stable

Run reprocessing jobs with all_requests_stable tag.

After reprocessing

DROP TABLE `all.requests`;

CREATE TABLE `all.requests`
COPY `all_dev.requests_stable`;

@max-ostapenko
Copy link
Contributor Author

Test run

Original and processed table

@max-ostapenko
Copy link
Contributor Author

Would really appreciate review here, as I'm almost ready with backfill PR #10.

@tunetheweb
Copy link
Member

Would really appreciate review here, as I'm almost ready with backfill PR #10.

Hey @max-ostapenko sorry for radio silence here. I mistakenly presumed this was discussed on the last HTTP Archive maintainers call (that I missed) but just found out that @rviscomi missed this too so think we both presumed the other was dealing with this.

I think it would be good to discuss this on a call (whether the next planned one on 15th October or before). Questions/concerns I have include:

  • Is this definitely all the changes we plan?
  • What happens while this is running? Is the table not accessible? Or will it continue to be accessible from the run as each data is back-populated?
  • How long it will take to run?
  • Any impact on the next crawl (assuming it's not complete by then)?
  • Any impact on Web Almanac and if we're better to wait until majority of analysis is completed there?

@max-ostapenko
Copy link
Contributor Author

I think it would be good to discuss this on a call (whether the next planned one on 15th October or before). Questions/concerns I have include:

I would definitely like to discuss it before the next crawl. And notify users of the upcoming changes beforehand.

  • Is this definitely all the changes we plan?

Please drop here any issues or comments that I've missed.

  • What happens while this is running? Is the table not accessible? Or will it continue to be accessible from the run as each data is back-populated?

The reprocessing will write into all_dev.requests_stable.

Once the reprocessing is done we can:

  • run checks on the new table,
  • test all the production queries changes,
  • notify users with the exact date,
  • update Almanac queries from this and last years,

and eventually replace the live table with CREATE TABLE COPY DDL. So no downtime.

  • How long it will take to run?

We can only estimate by running the first month.

In September 2024 it took 6h for requests tables without any transformations.
Let's assume it would take 2 times longer for this one - 12h.
So ~14 days.

  • Any impact on the next crawl (assuming it's not complete by then)?

Ideally we start this before the new crawl starts (so that we can already have it synced with new schema and cleaned objects to just run INSERT INTO SELECT *).

If anything not yet synced in WPT agent before the crawl, we can always cleanup the last crawl data within the pipeline.

  • Any impact on Web Almanac and if we're better to wait until majority of analysis is completed there?

Queries reading from struct columns need to be adjusted to a new schema to be valid.
All the analysis PRs are still pending, I'll need to go through them, and push the review where the articles were written.
Then we merge adjustments to the published queries at the moment we replace the live table. It will give a smooth experience for readers.

@tunetheweb
Copy link
Member

We could pull _font_details stuff into it's own column? However I think it's niche enough (and the payload will be small enough with above changes) that it's OK to leave in the payload column. WDYT @rviscomi ?

    "_font_details": {
        "table_sizes": {
            "GDEF": 100,
            "GPOS": 3264,
            "GSUB": 458,
            "OS/2": 96,
...
        "counts": {
            "num_cmap_codepoints": 215,
            "num_glyphs": 238
        }
    },

@tunetheweb
Copy link
Member

We need to look at native JSON columns as there were some that couldn't be processed in that and so had to use JavaScript JSON columns. See: HTTPArchive/httparchive.org#923 (comment). Is this going to be a problem if we move to native JSON columns?

@tunetheweb
Copy link
Member

Ignore. SAFE.PARSE_JSON(payload, wide_number_mode => 'round') is the answer as you already implemented.

Copy link
Member

@tunetheweb tunetheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one small request to leave the crawlids and pagesid in summary.

definitions/output/all/reprocess_requests.js Outdated Show resolved Hide resolved
@rviscomi
Copy link
Member

rviscomi commented Oct 7, 2024

We could pull _font_details stuff into it's own column? However I think it's niche enough (and the payload will be small enough with above changes) that it's OK to leave in the payload column. WDYT @rviscomi ?

Agreed. It could work, but I'm happy to leave it as is.

@max-ostapenko max-ostapenko merged commit 94718f9 into main Oct 7, 2024
3 checks passed
@max-ostapenko max-ostapenko deleted the dev1 branch October 7, 2024 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants