forked from catchpoint/WebPageTest.agent
Description
Based on the corrupted data, here is a query that lists the pages with corrupted categories:
WITH wappalyzer AS (
  SELECT
    category
  FROM wappalyzer.apps,
    UNNEST(categories) AS category
)
SELECT
  technology,
  category,
  COUNT(DISTINCT page) AS cnt_pages,
  ARRAY_AGG(DISTINCT page LIMIT 3) AS sample_pages
FROM crawl.pages,
  UNNEST (technologies) AS technology,
  UNNEST (technology.categories) AS category
LEFT JOIN wappalyzer
USING (category)
WHERE date = '2024-11-01'
AND wappalyzer.category IS NULL
GROUP BY 1,2
ORDER BY category ASC
The detection seems to work fine. It looks like page context is messing with some built-in objects again.
Maybe we could avoid using any values that could be impacted by it.
A few cases:
- https://newcar.one2car.com/search (capitalised)
- https://ascf.amorepacific.co.kr/ (whitespace removed)
- https://advancement.shu.edu/get-involved/events-calendar.html (replaced with HTML)
- https://www.gmi.go.kr/ (lowercase with dashes)
- https://iot.lostnfound.com/en/functions/ (replaced with undefined)
- etc
One observation: in most of these cases, only the values within detected_technologies hold correct data (its keys are impacted too).
Maybe we should switch to it for the BigQuery data?
For example:
technologies = [
    {
        "technology": technology["name"],
        "categories": [category["name"] for category in technology["categories"]],
        "info": [technology["version"]]
    }
    for technology in detected_technologies.values()
]
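To make the proposal easier to try out, here is a minimal self-contained sketch of the same comprehension. The sample `detected_technologies` dict and the `build_technologies` helper are hypothetical. The shape of the values (`name`, `categories`, `version`) is assumed from the snippet above and not checked against the agent code. The keys are deliberately corrupted to show that only the values are read.

```python
# Hypothetical sample data: the structure is assumed from the snippet above,
# not taken from the agent's actual output.
detected_technologies = {
    "undefined": {  # key corrupted by the page context
        "name": "jQuery",
        "categories": [{"name": "JavaScript libraries"}],
        "version": "3.6.0",
    },
    "WORDPRESS": {  # key capitalised by the page context
        "name": "WordPress",
        "categories": [{"name": "CMS"}, {"name": "Blogs"}],
        "version": "",
    },
}


def build_technologies(detected):
    """Build the technology rows from the dict values only, ignoring the keys."""
    return [
        {
            "technology": technology["name"],
            "categories": [category["name"] for category in technology["categories"]],
            "info": [technology["version"]],
        }
        for technology in detected.values()
    ]


technologies = build_technologies(detected_technologies)
for row in technologies:
    print(row)
```

Because the comprehension iterates over `.values()`, the corrupted keys never reach the output. Whether the values themselves are always intact is still the open question from the observation above.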