Backfills & deprecating legacy tables #10

Merged
merged 59 commits into from
Nov 1, 2024
Changes from 41 commits
99fc4b6
deprecated
max-ostapenko Sep 19, 2024
0dbb96c
backfill draft
max-ostapenko Sep 19, 2024
1c30ac3
cleanup
max-ostapenko Sep 19, 2024
78e4a23
null placeholders
max-ostapenko Sep 19, 2024
6738785
sql fix
max-ostapenko Sep 19, 2024
4d69fc6
fix month range
max-ostapenko Sep 19, 2024
52a8eec
literal table names
max-ostapenko Sep 19, 2024
a7b7a53
backfill tested
max-ostapenko Sep 27, 2024
6514d89
Merge branch 'main' into main
max-ostapenko Sep 27, 2024
96fee15
dates reset
max-ostapenko Sep 27, 2024
a82927c
requests_summary
max-ostapenko Sep 27, 2024
c504878
requests backfill for mid month
max-ostapenko Sep 29, 2024
9dc4cf0
remove legacy pipelines
max-ostapenko Sep 29, 2024
c316e25
checked against new schema
max-ostapenko Sep 29, 2024
e3cf47b
adjusted to a new schema
max-ostapenko Sep 29, 2024
8832ffe
backfill_pages
max-ostapenko Sep 29, 2024
06d6cb4
legacy removed
max-ostapenko Sep 30, 2024
9ba236d
remove legacy datasets
max-ostapenko Sep 30, 2024
a57df2d
Merge branch 'main' into main
max-ostapenko Sep 30, 2024
57da6fb
metrics sorted
max-ostapenko Sep 30, 2024
3683a89
parse features
max-ostapenko Sep 30, 2024
c55adb6
Merge branch 'main' into main
max-ostapenko Sep 30, 2024
6866120
Merge branch 'main' into fiscal-owl
max-ostapenko Oct 14, 2024
2870012
lint
max-ostapenko Oct 14, 2024
992802f
jscpd off
max-ostapenko Oct 14, 2024
23c29b9
update js variable names
max-ostapenko Oct 14, 2024
14b9585
other cm format
max-ostapenko Oct 14, 2024
b176ee3
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko Oct 14, 2024
4138af1
Merge branch 'main' into fiscal-owl
max-ostapenko Oct 18, 2024
4cafc6f
pages completed
max-ostapenko Oct 19, 2024
d1dfd49
summary_pages completed
max-ostapenko Oct 19, 2024
23a522d
Merge branch 'main' into main
max-ostapenko Oct 19, 2024
3940d6a
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko Oct 19, 2024
1244e95
without other headers
max-ostapenko Oct 20, 2024
4179197
Merge branch 'fiscal-owl' of https://github.com/HTTPArchive/dataform …
max-ostapenko Oct 20, 2024
e55d8b4
fix
max-ostapenko Oct 20, 2024
c7afc11
fix
max-ostapenko Oct 20, 2024
e03a353
fix
max-ostapenko Oct 20, 2024
4a6101a
actual reprocessing queries
max-ostapenko Oct 20, 2024
4eb39ae
fix
max-ostapenko Oct 20, 2024
8d54b1b
requests complete
max-ostapenko Oct 20, 2024
a38efe0
fix casts
max-ostapenko Oct 20, 2024
86fff73
wptid from summary
max-ostapenko Oct 20, 2024
e030acc
Update definitions/output/all/backfill_requests.js
max-ostapenko Oct 20, 2024
d9ce5fb
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko Oct 20, 2024
c8e2343
summary update
max-ostapenko Oct 21, 2024
6032434
only valid other headers
max-ostapenko Oct 21, 2024
bc0a104
Merge branch 'main' into main
max-ostapenko Oct 21, 2024
193027e
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko Oct 21, 2024
e61df0a
move tables
max-ostapenko Oct 21, 2024
b2e7b7d
fix json parsing
max-ostapenko Oct 21, 2024
14816f8
fix summary metrics
max-ostapenko Oct 22, 2024
2898c82
crawl pipeline updated
max-ostapenko Oct 22, 2024
d94ca11
update dependents
max-ostapenko Oct 22, 2024
01101db
response_bodies adjustment
max-ostapenko Nov 1, 2024
47ebb36
Merge branch 'main' into main
max-ostapenko Nov 1, 2024
ff5b06e
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko Nov 1, 2024
530ecd4
Merge branch 'fiscal-owl' of https://github.com/HTTPArchive/dataform …
max-ostapenko Nov 1, 2024
5ac19db
lint
max-ostapenko Nov 1, 2024
1 change: 1 addition & 0 deletions .github/workflows/linter.yaml
Expand Up @@ -30,5 +30,6 @@ jobs:
env:
DEFAULT_BRANCH: main
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
VALIDATE_JSCPD: false
VALIDATE_JAVASCRIPT_PRETTIER: false
VALIDATE_MARKDOWN_PRETTIER: false
321 changes: 321 additions & 0 deletions definitions/output/all/backfill_pages.js
@@ -0,0 +1,321 @@
const iterations = []
const clients = constants.clients

let midMonth
for (

@rviscomi please list (if you can remember) the backfill table cases that need to be handled separately.

For example, I see some non-standard crawl dates discussed in HTTPArchive/data-pipeline#114.

let date = '2016-01-01';
date >= '2016-01-01';
date = constants.fnPastMonth(date)
) {
clients.forEach((client) => {
iterations.push({
date,
client
})
})

if (date <= '2018-12-01') {
midMonth = new Date(date)
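// Set the day in UTC: new Date('YYYY-MM-DD') is parsed as UTC midnight, so a local-time setDate() can land in the previous month.
midMonth.setUTCDate(15)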

clients.forEach((client) => {
iterations.push({
date: midMonth.toISOString().substring(0, 10),
client
})
})
}
}

iterations.forEach((iteration, i) => {
operate(`backfill_pages ${iteration.date} ${iteration.client}`).tags([
'backfill_pages'
]).dependencies([
i === 0 ? '' : `backfill_pages ${iterations[i - 1].date} ${iterations[i - 1].client}`
]).queries(ctx => `
DELETE FROM all_dev.pages_stable
WHERE date = '${iteration.date}'
AND client = '${iteration.client}';

CREATE TEMPORARY FUNCTION GET_OTHER_CUSTOM_METRICS(
jsonObject JSON,
keys ARRAY<STRING>
) RETURNS JSON
LANGUAGE js AS """
try {
let other_metrics = {};
keys.forEach(function(key) {
other_metrics[key.slice(1)] = JSON.parse(jsonObject[key]);
});
return other_metrics;
} catch (e) {
return null;
}
""";

CREATE TEMP FUNCTION GET_FEATURES(payload JSON)
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS
'''
function getFeatureNames(featureMap, featureType) {
try {
return Object.entries(featureMap).map(([key, value]) => {
// After Feb 2020 keys are feature IDs.
if (value.name) {
return {'feature': value.name, 'type': featureType, 'id': key};
}
// Prior to Feb 2020 keys fell back to IDs if the name was unknown.
if (idPattern.test(key)) {
return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]};
}
// Prior to Feb 2020 keys were names by default.
return {'feature': key, 'type': featureType, 'id': ''};
});
} catch (e) {
return [];
}
}

let blinkFeatureFirstUsed = payload._blinkFeatureFirstUsed;
if (!blinkFeatureFirstUsed) return [];

var idPattern = new RegExp('^Feature_(\\\\d+)$');
return getFeatureNames(blinkFeatureFirstUsed.Features, 'default')
.concat(getFeatureNames(blinkFeatureFirstUsed.CSSFeatures, 'css'))
.concat(getFeatureNames(blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css'));
''';

INSERT INTO all_dev.pages_stable
SELECT
DATE('${iteration.date}') AS date,
'${iteration.client}' AS client,
pages.url AS page,
TRUE AS is_root_page,
pages.url AS root_page,
crux.rank AS rank,
STRING(payload.testID) AS wptid,
JSON_REMOVE(
payload,
'$._metadata',
'$._detected',
'$._detected_apps',
'$._detected_technologies',
'$._detected_raw',
'$._custom',
'$._00_reset',
'$._a11y',
'$._ads',
'$._almanac',
'$._aurora',
'$._avg_dom_depth',
'$._cms',
'$._Colordepth',
'$._cookies',
'$._crawl_links',
'$._css-variables',
'$._css',
'$._doctype',
'$._document_height',
'$._document_width',
'$._Dpi',
'$._ecommerce',
'$._element_count',
'$._event-names',
'$._fugu-apis',
'$._generated-content',
'$._has_shadow_root',
'$._Images',
'$._img-loading-attr',
'$._initiators',
'$._inline_style_bytes',
'$._javascript',
'$._lib-detector-version',
'$._localstorage_size',
'$._markup',
'$._media',
'$._meta_viewport',
'$._num_iframes',
'$._num_scripts_async',
'$._num_scripts_sync',
'$._num_scripts',
'$._observers',
'$._origin-trials',
'$._parsed_css',
'$._performance',
'$._privacy-sandbox',
'$._privacy',
'$._pwa',
'$._quirks_mode',
'$._Resolution',
'$._responsive_images',
'$._robots_meta',
'$._robots_txt',
'$._sass',
'$._security',
'$._sessionstorage_size',
'$._structured-data',
'$._third-parties',
'$._usertiming',
'$._valid-head',
'$._well-known',
'$._wpt_bodies',
'$._blinkFeatureFirstUsed',
'$._CrUX'
) AS payload,
TO_JSON( STRUCT(
SpeedIndex,
TTFB,
_connections,
bytesAudio,
bytesCSS,
bytesFlash,
bytesFont,
bytesGif,
bytesHtml,
bytesHtmlDoc,
bytesImg,
bytesJpg,
bytesJS,
bytesJson,
bytesOther,
bytesPng,
bytesSvg,
bytesText,
bytesTotal,
bytesVideo,
bytesWebp,
bytesXml,
cdn,
payload._CrUX,
fullyLoaded,
gzipSavings,
gzipTotal,
maxDomainReqs,
maxage0,
maxage1,
maxage30,
maxage365,
maxageMore,
maxageNull,
numCompressed,
numDomElements,
numDomains,
numErrors,
numGlibs,
numHttps,
numRedirects,
onContentLoaded,
onLoad,
renderStart,
reqAudio,
reqCSS,
reqFlash,
reqFont,
reqGif,
reqHtml,
reqImg,
reqJpg,
reqJS,
reqJson,
reqOther,
reqPng,
reqSvg,
reqText,
reqTotal,
reqVideo,
reqWebp,
reqXml,
visualComplete
)) AS summary,
STRUCT<
a11y JSON,
cms JSON,
cookies JSON,
css_variables JSON,
ecommerce JSON,
element_count JSON,
javascript JSON,
markup JSON,
media JSON,
origin_trials JSON,
performance JSON,
privacy JSON,
responsive_images JSON,
robots_txt JSON,
security JSON,
structured_data JSON,
third_parties JSON,
well_known JSON,
wpt_bodies JSON,
other JSON
>(
payload._a11y,
payload._cms,
payload._cookies,
payload["_css-variables"],
payload._ecommerce,
payload._element_count,
payload._javascript,
payload._markup,
payload._media,
payload["_origin-trials"],
payload._performance,
payload._privacy,
payload._responsive_images,
payload._robots_txt,
payload._security,
payload["_structured-data"],
payload["_third-parties"],
payload["_well-known"],
payload._wpt_bodies,
GET_OTHER_CUSTOM_METRICS(
payload,
["_Colordepth", "_Dpi", "_Images", "_Resolution", "_almanac", "_avg_dom_depth", "_css", "_doctype", "_document_height", "_document_width", "_event-names", "_fugu-apis", "_has_shadow_root", "_img-loading-attr", "_initiators", "_inline_style_bytes", "_lib-detector-version", "_localstorage_size", "_meta_viewport", "_num_iframes", "_num_scripts", "_num_scripts_async", "_num_scripts_sync", "_pwa", "_quirks_mode", "_sass", "_sessionstorage_size", "_usertiming"]
)
) AS custom_metrics,
NULL AS lighthouse,
GET_FEATURES(pages.payload) AS features,
tech.technologies AS technologies,
pages.payload._metadata AS metadata
FROM (
SELECT
* EXCEPT(payload),
SAFE.PARSE_JSON(payload, wide_number_mode => 'round') AS payload
FROM pages.${constants.fnDateUnderscored(iteration.date)}_${iteration.client} ${constants.devTABLESAMPLE}
) AS pages

LEFT JOIN summary_pages.${constants.fnDateUnderscored(iteration.date)}_${iteration.client} AS summary_pages ${constants.devTABLESAMPLE}
ON pages.url = summary_pages.url

LEFT JOIN (
SELECT DISTINCT
CONCAT(origin, '/') AS page,
experimental.popularity.rank AS rank
FROM ${ctx.resolve('chrome-ux-report', 'experimental', 'global')}
WHERE yyyymm = ${constants.fnPastMonth(iteration.date).substring(0, 7).replace('-', '')}
) AS crux
ON pages.url = crux.page

LEFT JOIN (
SELECT
page,
ARRAY_AGG(technology) AS technologies
FROM(
SELECT
url AS page,
STRUCT<
technology STRING,
categories ARRAY<STRING>,
info ARRAY<STRING>
>(
app,
ARRAY_AGG(category),
ARRAY_AGG(info)
) AS technology
FROM technologies.${constants.fnDateUnderscored(iteration.date)}_${iteration.client} ${constants.devTABLESAMPLE}
GROUP BY page, app
)
GROUP BY page
) AS tech
ON pages.url = tech.page;
`)
})
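
The loop at the top of backfill_pages.js builds an ordered list of (date, client) steps, adding a mid-month crawl for dates up to 2018-12-01. A minimal standalone sketch of that schedule is below. It rests on assumptions: `pastMonth` is a hypothetical stand-in for `constants.fnPastMonth` (whose implementation is not shown in this diff), `midMonthOf` and `buildIterations` are illustrative names, and the client values `desktop` and `mobile` are guesses at what `constants.clients` contains.

```javascript
// Hypothetical stand-in for constants.fnPastMonth: returns the first day of
// the previous month as YYYY-MM-DD. The real helper is not shown in this diff.
function pastMonth(isoDate) {
  const d = new Date(`${isoDate}T00:00:00Z`)
  d.setUTCDate(1)
  d.setUTCMonth(d.getUTCMonth() - 1)
  return d.toISOString().substring(0, 10)
}

// Mid-month crawl date (the 15th), computed entirely in UTC so the result
// does not depend on the time zone of the machine compiling the project.
function midMonthOf(isoDate) {
  const d = new Date(`${isoDate}T00:00:00Z`)
  d.setUTCDate(15)
  return d.toISOString().substring(0, 10)
}

// Builds the ordered list of { date, client } steps, newest month first.
// Dates are compared as strings, which works for the YYYY-MM-DD format.
function buildIterations(startDate, endDate, clients, midMonthUntil) {
  const iterations = []
  for (let date = startDate; date >= endDate; date = pastMonth(date)) {
    clients.forEach((client) => iterations.push({ date, client }))
    if (date <= midMonthUntil) {
      const mid = midMonthOf(date)
      clients.forEach((client) => iterations.push({ date: mid, client }))
    }
  }
  return iterations
}

// Assumed client names; the actual values come from constants.clients.
const steps = buildIterations('2016-02-01', '2016-01-01', ['desktop', 'mobile'], '2018-12-01')
console.log(steps.map((s) => `${s.date} ${s.client}`))
```

With these inputs the sketch yields eight steps: two clients for each of 2016-02-01, 2016-02-15, 2016-01-01 and 2016-01-15, in that order. Each step would then depend on the one before it, which is how the operations in the diff are chained.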
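
The comments in GET_FEATURES describe three key formats for `_blinkFeatureFirstUsed` entries. The sketch below replays that branching outside BigQuery so the three cases can be checked locally. The sample payloads are made up for illustration, and the regular expression is written as a JavaScript literal rather than the escaped string used inside the SQL template.

```javascript
// Matches legacy keys such as "Feature_123" and captures the numeric ID.
const idPattern = /^Feature_(\d+)$/

function getFeatureNames(featureMap, featureType) {
  if (!featureMap) return []
  return Object.entries(featureMap).map(([key, value]) => {
    // Newer payloads: the key is the feature ID and the value carries the name.
    if (value && value.name) {
      return { feature: value.name, type: featureType, id: key }
    }
    // Older payloads: the key fell back to "Feature_<id>" when the name was unknown.
    const match = key.match(idPattern)
    if (match) {
      return { feature: '', type: featureType, id: match[1] }
    }
    // Older payloads: the key is the feature name.
    return { feature: key, type: featureType, id: '' }
  })
}

// Made-up sample payloads, for illustration only.
const newer = getFeatureNames({ 2: { name: 'SampleFeature', firstUsed: 1.5 } }, 'default')
const older = getFeatureNames({ Feature_123: 10.5, SampleFeature: 20.1 }, 'css')
console.log(newer, older)
```

Unlike the UDF in the diff, this version guards against a missing map up front instead of relying on a try/catch; either approach returns an empty array for absent feature groups.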