Backfills & deprecating legacy tables #10
Merged
Changes from 41 commits
59 commits
99fc4b6
deprecated
max-ostapenko 0dbb96c
backfill draft
max-ostapenko 1c30ac3
cleanup
max-ostapenko 78e4a23
null placeholders
max-ostapenko 6738785
sql fix
max-ostapenko 4d69fc6
fix month range
max-ostapenko 52a8eec
literal table names
max-ostapenko a7b7a53
backfill tested
max-ostapenko 6514d89
Merge branch 'main' into main
max-ostapenko 96fee15
dates reset
max-ostapenko a82927c
requests_summary
max-ostapenko c504878
requests backfill for mid month
max-ostapenko 9dc4cf0
remove legacy pipelines
max-ostapenko c316e25
checked against new schema
max-ostapenko e3cf47b
adjusted to a new schema
max-ostapenko 8832ffe
backfill_pages
max-ostapenko 06d6cb4
legacy removed
max-ostapenko 9ba236d
remove legacy datasets
max-ostapenko a57df2d
Merge branch 'main' into main
max-ostapenko 57da6fb
metrics sorted
max-ostapenko 3683a89
parse features
max-ostapenko c55adb6
Merge branch 'main' into main
max-ostapenko 6866120
Merge branch 'main' into fiscal-owl
max-ostapenko 2870012
lint
max-ostapenko 992802f
jscpd off
max-ostapenko 23c29b9
update js variable names
max-ostapenko 14b9585
other cm format
max-ostapenko b176ee3
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko 4138af1
Merge branch 'main' into fiscal-owl
max-ostapenko 4cafc6f
pages completed
max-ostapenko d1dfd49
summary_pages completed
max-ostapenko 23a522d
Merge branch 'main' into main
max-ostapenko 3940d6a
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko 1244e95
without other headers
max-ostapenko 4179197
Merge branch 'fiscal-owl' of https://github.com/HTTPArchive/dataform …
max-ostapenko e55d8b4
fix
max-ostapenko c7afc11
fix
max-ostapenko e03a353
fix
max-ostapenko 4a6101a
actual reprocessing queries
max-ostapenko 4eb39ae
fix
max-ostapenko 8d54b1b
requests complete
max-ostapenko a38efe0
fix casts
max-ostapenko 86fff73
wptid from summary
max-ostapenko e030acc
Update definitions/output/all/backfill_requests.js
max-ostapenko d9ce5fb
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko c8e2343
summary update
max-ostapenko 6032434
only valid other headers
max-ostapenko bc0a104
Merge branch 'main' into main
max-ostapenko 193027e
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko e61df0a
move tables
max-ostapenko b2e7b7d
fix json parsing
max-ostapenko 14816f8
fix summary metrics
max-ostapenko 2898c82
crawl pipeline updated
max-ostapenko d94ca11
update dependents
max-ostapenko 01101db
response_bodies adjustment
max-ostapenko 47ebb36
Merge branch 'main' into main
max-ostapenko ff5b06e
Merge branch 'fiscal-owl' into fiscal-owl
max-ostapenko 530ecd4
Merge branch 'fiscal-owl' of https://github.com/HTTPArchive/dataform …
max-ostapenko 5ac19db
lint
max-ostapenko
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,321 @@ | ||
const iterations = [] | ||
const clients = constants.clients | ||
|
||
let midMonth | ||
for ( | ||
let date = '2016-01-01'; | ||
date >= '2016-01-01'; | ||
date = constants.fnPastMonth(date) | ||
) { | ||
clients.forEach((client) => { | ||
iterations.push({ | ||
date, | ||
client | ||
}) | ||
}) | ||
|
||
if (date <= '2018-12-01') { | ||
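// new Date('YYYY-MM-DD') is parsed as UTC midnight, so set the day in UTC too; | ||
// setDate() uses local time and can land in the previous month in negative-offset zones. | ||
midMonth = new Date(date) | ||
midMonth.setUTCDate(15) | ||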
|
||
clients.forEach((client) => { | ||
iterations.push({ | ||
date: midMonth.toISOString().substring(0, 10), | ||
client | ||
}) | ||
}) | ||
} | ||
} | ||
|
||
iterations.forEach((iteration, i) => { | ||
operate(`backfill_pages ${iteration.date} ${iteration.client}`).tags([ | ||
'backfill_pages' | ||
]).dependencies([ | ||
i === 0 ? '' : `backfill_pages ${iterations[i - 1].date} ${iterations[i - 1].client}` | ||
]).queries(ctx => ` | ||
DELETE FROM all_dev.pages_stable | ||
WHERE date = '${iteration.date}' | ||
AND client = '${iteration.client}'; | ||
|
||
CREATE TEMPORARY FUNCTION GET_OTHER_CUSTOM_METRICS( | ||
jsonObject JSON, | ||
keys ARRAY<STRING> | ||
) RETURNS JSON | ||
LANGUAGE js AS """ | ||
try { | ||
let other_metrics = {}; | ||
keys.forEach(function(key) { | ||
other_metrics[key.substr(1)] = JSON.parse(jsonObject[key]); | ||
}); | ||
return other_metrics; | ||
} catch (e) { | ||
return null; | ||
} | ||
"""; | ||
|
||
CREATE TEMP FUNCTION GET_FEATURES(payload JSON) | ||
RETURNS ARRAY<STRUCT<feature STRING, id STRING, type STRING>> LANGUAGE js AS | ||
''' | ||
function getFeatureNames(featureMap, featureType) { | ||
try { | ||
return Object.entries(featureMap).map(([key, value]) => { | ||
// After Feb 2020 keys are feature IDs. | ||
if (value.name) { | ||
return {'feature': value.name, 'type': featureType, 'id': key}; | ||
} | ||
// Prior to Feb 2020 keys fell back to IDs if the name was unknown. | ||
if (idPattern.test(key)) { | ||
return {'feature': '', 'type': featureType, 'id': key.match(idPattern)[1]}; | ||
} | ||
// Prior to Feb 2020 keys were names by default. | ||
return {'feature': key, 'type': featureType, 'id': ''}; | ||
}); | ||
} catch (e) { | ||
return []; | ||
} | ||
} | ||
|
||
let blinkFeatureFirstUsed = payload._blinkFeatureFirstUsed; | ||
if (!blinkFeatureFirstUsed) return []; | ||
|
||
var idPattern = new RegExp('^Feature_(\\\\d+)$'); | ||
return getFeatureNames(blinkFeatureFirstUsed.Features, 'default') | ||
.concat(getFeatureNames(blinkFeatureFirstUsed.CSSFeatures, 'css')) | ||
.concat(getFeatureNames(blinkFeatureFirstUsed.AnimatedCSSFeatures, 'animated-css')); | ||
'''; | ||
|
||
INSERT INTO all_dev.pages_stable | ||
SELECT | ||
DATE('${iteration.date}') AS date, | ||
'${iteration.client}' AS client, | ||
pages.url AS page, | ||
TRUE AS is_root_page, | ||
pages.url AS root_page, | ||
crux.rank AS rank, | ||
STRING(payload.testID) AS wptid, | ||
JSON_REMOVE( | ||
payload, | ||
'$._metadata', | ||
'$._detected', | ||
'$._detected_apps', | ||
'$._detected_technologies', | ||
'$._detected_raw', | ||
'$._custom', | ||
'$._00_reset', | ||
'$._a11y', | ||
'$._ads', | ||
'$._almanac', | ||
'$._aurora', | ||
'$._avg_dom_depth', | ||
'$._cms', | ||
'$._Colordepth', | ||
'$._cookies', | ||
'$._crawl_links', | ||
'$._css-variables', | ||
'$._css', | ||
'$._doctype', | ||
'$._document_height', | ||
'$._document_width', | ||
'$._Dpi', | ||
'$._ecommerce', | ||
'$._element_count', | ||
'$._event-names', | ||
'$._fugu-apis', | ||
'$._generated-content', | ||
'$._has_shadow_root', | ||
'$._Images', | ||
'$._img-loading-attr', | ||
'$._initiators', | ||
'$._inline_style_bytes', | ||
'$._javascript', | ||
'$._lib-detector-version', | ||
'$._localstorage_size', | ||
'$._markup', | ||
'$._media', | ||
'$._meta_viewport', | ||
'$._num_iframes', | ||
'$._num_scripts_async', | ||
'$._num_scripts_sync', | ||
'$._num_scripts', | ||
'$._observers', | ||
'$._origin-trials', | ||
'$._parsed_css', | ||
'$._performance', | ||
'$._privacy-sandbox', | ||
'$._privacy', | ||
'$._pwa', | ||
'$._quirks_mode', | ||
'$._Resolution', | ||
'$._responsive_images', | ||
'$._robots_meta', | ||
'$._robots_txt', | ||
'$._sass', | ||
'$._security', | ||
'$._sessionstorage_size', | ||
'$._structured-data', | ||
'$._third-parties', | ||
'$._usertiming', | ||
'$._valid-head', | ||
'$._well-known', | ||
'$._wpt_bodies', | ||
'$._blinkFeatureFirstUsed', | ||
'$._CrUX' | ||
) AS payload, | ||
TO_JSON( STRUCT( | ||
SpeedIndex, | ||
TTFB, | ||
_connections, | ||
bytesAudio, | ||
bytesCSS, | ||
bytesFlash, | ||
bytesFont, | ||
bytesGif, | ||
bytesHtml, | ||
bytesHtmlDoc, | ||
bytesImg, | ||
bytesJpg, | ||
bytesJS, | ||
bytesJson, | ||
bytesOther, | ||
bytesPng, | ||
bytesSvg, | ||
bytesText, | ||
bytesTotal, | ||
bytesVideo, | ||
bytesWebp, | ||
bytesXml, | ||
cdn, | ||
payload._CrUX, | ||
fullyLoaded, | ||
gzipSavings, | ||
gzipTotal, | ||
maxDomainReqs, | ||
maxage0, | ||
maxage1, | ||
maxage30, | ||
maxage365, | ||
maxageMore, | ||
maxageNull, | ||
numCompressed, | ||
numDomElements, | ||
numDomains, | ||
numErrors, | ||
numGlibs, | ||
numHttps, | ||
numRedirects, | ||
onContentLoaded, | ||
onLoad, | ||
renderStart, | ||
reqAudio, | ||
reqCSS, | ||
reqFlash, | ||
reqFont, | ||
reqGif, | ||
reqHtml, | ||
reqImg, | ||
reqJpg, | ||
reqJS, | ||
reqJson, | ||
reqOther, | ||
reqPng, | ||
reqSvg, | ||
reqText, | ||
reqTotal, | ||
reqVideo, | ||
reqWebp, | ||
reqXml, | ||
visualComplete | ||
)) AS summary, | ||
STRUCT< | ||
a11y JSON, | ||
cms JSON, | ||
cookies JSON, | ||
css_variables JSON, | ||
ecommerce JSON, | ||
element_count JSON, | ||
javascript JSON, | ||
markup JSON, | ||
media JSON, | ||
origin_trials JSON, | ||
performance JSON, | ||
privacy JSON, | ||
responsive_images JSON, | ||
robots_txt JSON, | ||
security JSON, | ||
structured_data JSON, | ||
third_parties JSON, | ||
well_known JSON, | ||
wpt_bodies JSON, | ||
other JSON | ||
>( | ||
payload._a11y, | ||
payload._cms, | ||
payload._cookies, | ||
payload["_css-variables"], | ||
payload._ecommerce, | ||
payload._element_count, | ||
payload._javascript, | ||
payload._markup, | ||
payload._media, | ||
payload["_origin-trials"], | ||
payload._performance, | ||
payload._privacy, | ||
payload._responsive_images, | ||
payload._robots_txt, | ||
payload._security, | ||
payload["_structured-data"], | ||
payload["_third-parties"], | ||
payload["_well-known"], | ||
payload._wpt_bodies, | ||
GET_OTHER_CUSTOM_METRICS( | ||
payload, | ||
["_Colordepth", "_Dpi", "_Images", "_Resolution", "_almanac", "_avg_dom_depth", "_css", "_doctype", "_document_height", "_document_width", "_event-names", "_fugu-apis", "_has_shadow_root", "_img-loading-attr", "_initiators", "_inline_style_bytes", "_lib-detector-version", "_localstorage_size", "_meta_viewport", "_num_iframes", "_num_scripts", "_num_scripts_async", "_num_scripts_sync", "_pwa", "_quirks_mode", "_sass", "_sessionstorage_size", "_usertiming"] | ||
) | ||
) AS custom_metrics, | ||
NULL AS lighthouse, | ||
GET_FEATURES(pages.payload) AS features, | ||
tech.technologies AS technologies, | ||
pages.payload._metadata AS metadata | ||
FROM ( | ||
SELECT | ||
* EXCEPT(payload), | ||
SAFE.PARSE_JSON(payload, wide_number_mode => 'round') AS payload | ||
FROM pages.${constants.fnDateUnderscored(iteration.date)}_${iteration.client} ${constants.devTABLESAMPLE} | ||
) AS pages | ||
|
||
LEFT JOIN summary_pages.${constants.fnDateUnderscored(iteration.date)}_${iteration.client} AS summary_pages ${constants.devTABLESAMPLE} | ||
ON pages.url = summary_pages.url | ||
|
||
LEFT JOIN ( | ||
SELECT DISTINCT | ||
CONCAT(origin, '/') AS page, | ||
experimental.popularity.rank AS rank | ||
FROM ${ctx.resolve('chrome-ux-report', 'experimental', 'global')} | ||
WHERE yyyymm = ${constants.fnPastMonth(iteration.date).substring(0, 7).replace('-', '')} | ||
) AS crux | ||
ON pages.url = crux.page | ||
|
||
LEFT JOIN ( | ||
SELECT | ||
page, | ||
ARRAY_AGG(technology) AS technologies | ||
FROM( | ||
SELECT | ||
url AS page, | ||
STRUCT< | ||
technology STRING, | ||
categories ARRAY<STRING>, | ||
info ARRAY<STRING> | ||
>( | ||
app, | ||
ARRAY_AGG(category), | ||
ARRAY_AGG(info) | ||
) AS technology | ||
FROM technologies.${constants.fnDateUnderscored(iteration.date)}_${iteration.client} ${constants.devTABLESAMPLE} | ||
GROUP BY page, app | ||
) | ||
GROUP BY page | ||
) AS tech | ||
ON pages.url = tech.page; | ||
`) | ||
}) |
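
The iteration list and the dependency chain built at the top of this file can be exercised outside Dataform. The following is a minimal sketch, not the repository's code: `pastMonth` is a hypothetical stand-in for `constants.fnPastMonth` (not shown in this diff), the client names `desktop` and `mobile` are assumed values for `constants.clients`, and `buildIterations`, `midMonthOf`, `operationName`, and `dependenciesOf` are illustrative names. It computes the mid-month date entirely in UTC and returns an empty dependency list for the first operation rather than a list containing an empty string.

```javascript
// Hypothetical stand-in for constants.fnPastMonth (not shown in the diff):
// returns the first day of the previous month as 'YYYY-MM-DD'.
function pastMonth(isoDate) {
  const d = new Date(`${isoDate}T00:00:00Z`)
  d.setUTCMonth(d.getUTCMonth() - 1)
  return d.toISOString().substring(0, 10)
}

// The 15th of the same month, computed in UTC so the result does not
// depend on the time zone of the machine compiling the definitions.
function midMonthOf(isoDate) {
  const d = new Date(`${isoDate}T00:00:00Z`)
  d.setUTCDate(15)
  return d.toISOString().substring(0, 10)
}

// Walks backwards month by month from startDate to endDate (inclusive),
// adding a mid-month crawl for every month on or before midMonthUntil.
function buildIterations(startDate, endDate, clients, midMonthUntil) {
  const iterations = []
  for (let date = startDate; date >= endDate; date = pastMonth(date)) {
    clients.forEach((client) => iterations.push({ date, client }))
    if (date <= midMonthUntil) {
      const mid = midMonthOf(date)
      clients.forEach((client) => iterations.push({ date: mid, client }))
    }
  }
  return iterations
}

// Assumed client names; the real list comes from constants.clients.
const iterations = buildIterations('2016-02-01', '2016-01-01', ['desktop', 'mobile'], '2018-12-01')

// Same naming scheme as the operate() calls in the diff.
const operationName = ({ date, client }) => `backfill_pages ${date} ${client}`

// Each operation depends on the previous one, so the backfill runs sequentially.
const dependenciesOf = (i) => (i === 0 ? [] : [operationName(iterations[i - 1])])

console.log(iterations.map(operationName))
```

Running this with Node prints the operation names in execution order, which makes it easy to check a date range before launching the actual backfill.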
@rviscomi please list (if you can remember) the backfill table cases that need to be handled separately.
For example, I see some non-standard crawl dates discussed in HTTPArchive/data-pipeline#114.
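
One possible way to handle such cases is a lookup applied to the generated iterations before the operations are declared. This is only a sketch under assumptions: `crawlDateOverrides`, `resolveCrawlDate`, and `applyOverrides` are hypothetical names, and the dates in the map are placeholders rather than actual HTTP Archive crawl dates (the real exceptions would have to come from the list requested above).

```javascript
// Hypothetical override map. Key: the nominal date the loop generates.
// Value: the date actually used in the legacy table name, or null to skip the crawl.
// These entries are placeholders, NOT real HTTP Archive crawl dates.
const crawlDateOverrides = new Map([
  ['2000-01-01', '2000-01-02'],
  ['2000-01-15', null]
])

// Returns the source date for a nominal date; unlisted dates pass through unchanged.
function resolveCrawlDate(nominalDate, overrides) {
  return overrides.has(nominalDate) ? overrides.get(nominalDate) : nominalDate
}

// Attaches sourceDate to each iteration and drops the ones marked as skipped.
function applyOverrides(iterations, overrides) {
  return iterations
    .map((iteration) => ({
      ...iteration,
      sourceDate: resolveCrawlDate(iteration.date, overrides)
    }))
    .filter((iteration) => iteration.sourceDate !== null)
}

const adjusted = applyOverrides(
  [
    { date: '2000-01-01', client: 'desktop' },
    { date: '2000-01-15', client: 'desktop' },
    { date: '2000-02-01', client: 'desktop' }
  ],
  crawlDateOverrides
)
console.log(adjusted)
```

With this shape, the legacy table name would be built from `sourceDate` while the `date` written to the new table stays the nominal one; whether that split is desirable depends on how the non-standard crawls should appear in the new schema.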