Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More corrupted values in technology detection data #29

Open
max-ostapenko opened this issue Dec 10, 2024 · 1 comment
Open

More corrupted values in technology detection data #29

max-ostapenko opened this issue Dec 10, 2024 · 1 comment
Assignees

Comments

@max-ostapenko
Copy link

max-ostapenko commented Dec 10, 2024

Based on the corrupted data here is the list of pages with corrupted ca:

WITH wappalyzer AS (
  SELECT
    category
  FROM wappalyzer.apps,
    UNNEST(categories) AS category
)

SELECT
  technology,
  category,
  count(distinct page) AS cnt_pages,
  ARRAY_AGG(DISTINCT page LIMIT 3) AS sample_pages
FROM crawl.pages,
  UNNEST (technologies) AS technology,
  UNNEST (technology.categories) AS category
LEFT JOIN wappalyzer
USING (category)
WHERE date = '2024-11-01'
AND wappalyzer.category IS NULL
GROUP BY 1,2
order by category ASC

The detection seems to work fine. It looks like page context is messing with some built-in objects again.
Maybe we could avoid using any values that could be impacted by it.

A few cases:

One of the observations - in most of these cases only the values within detected_technologies have correct data (keys are also impacted).
Maybe we should switch to it for the BigQuery data?
For example:

technologies = [
    {
        "technology": technology["name"],
        "categories": [category["name"] for category in technology["categories"]],
        "info": [technology["version"]]
    }
    for technology in detected_technologies.values()
]
@pmeenan pmeenan self-assigned this Dec 10, 2024
pmeenan added a commit that referenced this issue Dec 10, 2024
@max-ostapenko
Copy link
Author

max-ostapenko commented Dec 10, 2024

Here is a cleanup query:

DECLARE crawl_month DATE DEFAULT DATE('2024-11-01');

CREATE TEMP TABLE technologies_cleaned AS (
  WITH wappalyzer AS (
    SELECT
      name as technology,
      category
    FROM wappalyzer.apps,
      UNNEST(categories) AS category
  ), pages AS (
    SELECT
      date,
      client,
      page,
      technologies
    FROM crawl.pages
    WHERE date = crawl_month
  ), impacted_pages AS (
    SELECT DISTINCT
      date,
      client,
      page
    FROM pages,
      UNNEST (technologies) AS tech,
      UNNEST (tech.categories) AS category
    LEFT JOIN wappalyzer
    USING (technology, category)
    WHERE wappalyzer.category IS NULL OR
      wappalyzer.technology IS NULL
  ), flattened_technologies AS (
    SELECT
      date,
      client,
      page,
      technology,
      category,
      info
    FROM pages,
      UNNEST(technologies) AS tech,
      UNNEST(tech.categories) AS category
    WHERE page IN (SELECT DISTINCT page FROM impacted_pages)
  ), whitelisted_technologies AS (
    SELECT
      date,
      client,
      page,
      f.technology,
      f.category,
      f.info
    FROM flattened_technologies f
    INNER JOIN wappalyzer
    USING (technology, category)
  ), reconstructed_technologies AS (
    SELECT
      date,
      client,
      page,
      ARRAY_AGG(STRUCT(
        technology,
        categories,
        info
      )) AS technologies
    FROM (
      SELECT
        date,
        client,
        page,
        technology,
        ARRAY_AGG(DISTINCT category IGNORE NULLS) AS categories,
        info
      FROM whitelisted_technologies
      GROUP BY date, client, page, technology, info
    )
    GROUP BY date, client, page
  )

  SELECT
    date,
    client,
    page,
    r.technologies
  FROM impacted_pages
  LEFT JOIN reconstructed_technologies r
  USING (date, client, page)
);

UPDATE crawl.pages
SET technologies = technologies_cleaned.technologies
FROM technologies_cleaned
WHERE pages.date = crawl_month AND
  pages.client = technologies_cleaned.client AND
  pages.page = technologies_cleaned.page;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants