Backfills & deprecating legacy tables (#10)
* deprecated

* backfill draft

* cleanup

* null placeholders

* sql fix

* fix month range

* literal table names

* backfill tested

* dates reset

* requests_summary

* requests backfill for mid month

* remove legacy pipelines

* checked against new schema

* adjusted to a new schema

* backfill_pages

* legacy removed

* remove legacy datasets

* metrics sorted

* parse features

* lint

* jscpd off

* update js variable names

* other cm format

* pages completed

* summary_pages completed

* without other headers

* fix

* fix

* fix

* actual reprocessing queries

* fix

* requests complete

* fix casts

* wptid from summary

* Update definitions/output/all/backfill_requests.js

* summary update

* only valid other headers

* move tables

* fix json parsing

* fix summary metrics

* crawl pipeline updated

* update dependents

* response_bodies adjustment

* lint
max-ostapenko authored Nov 1, 2024
1 parent 5f52ce3 commit 38d9f01
Showing 26 changed files with 2,130 additions and 524 deletions.
1 change: 1 addition & 0 deletions .github/workflows/linter.yaml
```diff
@@ -30,5 +30,6 @@ jobs:
         env:
           DEFAULT_BRANCH: main
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          VALIDATE_JSCPD: false
           VALIDATE_JAVASCRIPT_PRETTIER: false
           VALIDATE_MARKDOWN_PRETTIER: false
```
13 changes: 8 additions & 5 deletions README.md
```diff
@@ -8,11 +8,11 @@ The pipelines are run in Dataform service in Google Cloud Platform (GCP) and are

 ### Crawl results

-Tag: `crawl_results_all`
+Tag: `crawl_complete`

-- httparchive.all.pages
-- httparchive.all.parsed_css
-- httparchive.all.requests
+- httparchive.crawl.pages
+- httparchive.crawl.parsed_css
+- httparchive.crawl.requests

 ### Core Web Vitals Technology Report

@@ -39,6 +39,9 @@ Consumers:

 Tag: `crawl_results_legacy`

+- httparchive.all.pages
+- httparchive.all.parsed_css
+- httparchive.all.requests
 - httparchive.lighthouse.YYYY_MM_DD_client
 - httparchive.pages.YYYY_MM_DD_client
 - httparchive.requests.YYYY_MM_DD_client
@@ -51,7 +54,7 @@ Tag: `crawl_results_legacy`

 1. [crawl-complete](https://console.cloud.google.com/cloudpubsub/subscription/detail/dataformTrigger?authuser=7&project=httparchive) PubSub subscription

-   Tags: ["crawl_results_all", "blink_features_report", "crawl_results_legacy"]
+   Tags: ["crawl_complete", "blink_features_report", "crawl_results_legacy"]

 2. [bq-poller-cwv-tech-report](https://console.cloud.google.com/cloudscheduler/jobs/edit/us-east4/bq-poller-cwv-tech-report?authuser=7&project=httparchive) Scheduler
```
3 changes: 2 additions & 1 deletion definitions/extra/test_env.js
```diff
@@ -1,4 +1,5 @@
 const date = constants.currentMonth
+operate('test')

 // List of resources to be copied to the test environment. Comment out the ones you don't need.
 const resourcesList = [
@@ -15,7 +16,7 @@ const resourcesList = [
 resourcesList.forEach(resource => {
   operate(
     `test_table ${resource.datasetId}_dev_dev_${resource.tableId}`
-  ).queries(`
+  ).dependencies(['test']).queries(`
    CREATE SCHEMA IF NOT EXISTS ${resource.datasetId}_dev;
    DROP TABLE IF EXISTS ${resource.datasetId}_dev.dev_${resource.tableId};
```
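For context, the `test_env.js` change above gates every per-resource copy behind a shared root operation. A minimal sketch of the resulting Dataform pattern (a config fragment that only runs inside the Dataform runtime, not standalone; the `crawl`/`pages` resource entry is illustrative):

```js
// Root operation; every per-resource copy declares a dependency on it.
operate('test')

// Illustrative resource entry; the real file builds this list from the
// resources to mirror into the test environment.
const resourcesList = [{ datasetId: 'crawl', tableId: 'pages' }]

resourcesList.forEach(resource => {
  operate(`test_table ${resource.datasetId}_dev_dev_${resource.tableId}`)
    .dependencies(['test']) // ensures the root operation runs first
    .queries(`
      CREATE SCHEMA IF NOT EXISTS ${resource.datasetId}_dev;
      DROP TABLE IF EXISTS ${resource.datasetId}_dev.dev_${resource.tableId};
    `)
})
```

Declaring `.dependencies(['test'])` makes the dependency explicit in the compiled graph, so the copies cannot be scheduled before the root `test` operation succeeds.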
2 changes: 1 addition & 1 deletion definitions/output/all/pages.js
```diff
@@ -7,7 +7,7 @@ publish('pages', {
     clusterBy: ['client', 'is_root_page', 'rank'],
     requirePartitionFilter: true
   },
-  tags: ['crawl_results_all']
+  tags: ['crawl_results_legacy']
 }).preOps(ctx => `
 DELETE FROM ${ctx.self()}
 WHERE date = '${constants.currentMonth}';
```
21 changes: 5 additions & 16 deletions definitions/output/all/parsed_css.js
```diff
@@ -7,21 +7,10 @@ publish('parsed_css', {
     clusterBy: ['client', 'is_root_page', 'rank', 'page'],
     requirePartitionFilter: true
   },
-  tags: ['crawl_results_all']
+  tags: ['crawl_results_legacy']
 }).preOps(ctx => `
-DELETE FROM ${ctx.self()}
-WHERE date = '${constants.currentMonth}';
-`).query(ctx => `
-SELECT *
-FROM ${ctx.ref('crawl_staging', 'parsed_css')}
-WHERE date = '${constants.currentMonth}'
-  AND client = 'desktop'
-  ${constants.devRankFilter}
-`).postOps(ctx => `
-INSERT INTO ${ctx.self()}
-SELECT *
-FROM ${ctx.ref('crawl_staging', 'parsed_css')}
-WHERE date = '${constants.currentMonth}'
-  AND client = 'mobile'
-  ${constants.devRankFilter};
+DROP SNAPSHOT TABLE IF EXISTS ${ctx.self()};
+CREATE SNAPSHOT TABLE ${ctx.self()}
+CLONE ${ctx.ref('crawl', 'parsed_css')};
 `)
```
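The rewritten `parsed_css.js` stops copying staging rows per client (the old DELETE, per-client SELECT, and INSERT) and instead republishes the legacy table as a snapshot of the new canonical `crawl.parsed_css` table. In plain BigQuery DDL the pattern looks like this (a sketch with illustrative fully-qualified table names, not the exact statements Dataform generates):

```sql
-- Recreate the legacy table as a snapshot of the new canonical table.
-- A snapshot is a read-only, low-cost clone; dropping and recreating it
-- on each crawl keeps the legacy name pointing at current data.
DROP SNAPSHOT TABLE IF EXISTS `httparchive.all.parsed_css`;

CREATE SNAPSHOT TABLE `httparchive.all.parsed_css`
CLONE `httparchive.crawl.parsed_css`;
```

This avoids storing and maintaining two full copies of the data while the deprecated `all` dataset remains queryable.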
274 changes: 0 additions & 274 deletions definitions/output/all/reprocess_pages.js

This file was deleted.

