How to know what resources a page has downloaded using Puppeteer
To find out which resources a page downloads behind the scenes (CSS, JavaScript files, images, and so on), you can use Puppeteer's network events to capture requests and responses as the page loads. This lets you inspect exactly what the page is loading. Here’s how you can achieve this using Puppeteer:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Enable request interception; every request now pauses until it is
  // continued, aborted, or answered
  await page.setRequestInterception(true);

  // Array to store captured resources
  const resources = [];

  // Record each intercepted request, then let it proceed
  page.on('request', interceptedRequest => {
    resources.push({
      url: interceptedRequest.url(),
      type: interceptedRequest.resourceType(),
    });
    interceptedRequest.continue();
  });

  // Navigate to a website
  await page.goto('https://example.com');

  // Wait a few seconds for late requests. page.waitForTimeout() was removed
  // in Puppeteer v22, so a plain timer is used here; adjust as needed
  await new Promise(resolve => setTimeout(resolve, 5000));

  // Display captured resources
  console.log('Captured Resources:', resources);

  await browser.close();
})();
Explanation:
- Launch Puppeteer: Start a Puppeteer-controlled browser instance using puppeteer.launch().
- Create a New Page: Open a new browser tab with browser.newPage().
- Enable Request Interception: Call page.setRequestInterception(true) to intercept all network requests made by the page. Once interception is on, each request stalls until it is continued, aborted, or responded to.
- Capture Resources: Use page.on('request', ...) to listen for intercepted requests. In the callback, push the details of each request (url and resourceType) into the resources array, then call interceptedRequest.continue() so the request proceeds.
- Navigate to a Website: Use page.goto('https://example.com') to load a specific URL. Replace 'https://example.com' with the URL of the website you want to inspect.
- Wait for Requests: Pause after navigation so that late requests are captured. Older Puppeteer versions offer page.waitForTimeout() for this; it was removed in v22, where a setTimeout-based promise or the waitUntil: 'networkidle0' option of page.goto() can be used instead.
- Display Captured Resources: Log or process the resources array, which now contains an entry for every resource (CSS, JS, images, etc.) that the page requested. A request event means the resource was asked for, not that the download succeeded.
- Close the Browser: Always close the browser instance with browser.close() to free up system resources once you've finished using Puppeteer.
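The steps above record every request the page issues, which is not quite the same as every resource it actually received. As a hedged variation, the sketch below listens for response and requestfailed events instead, and does so without request interception. The helper name trackDownloads, the main wrapper, and the record fields are made up for illustration; page.on, response.url(), response.status(), response.request().resourceType(), request.failure(), and the waitUntil: 'networkidle0' option are existing Puppeteer APIs.

```javascript
// Hypothetical helper (not part of Puppeteer): records responses, i.e.
// resources the browser actually received, plus requests that failed.
// Works with any object exposing Puppeteer's page.on(event, handler).
function trackDownloads(page) {
  const downloaded = [];
  const failed = [];

  page.on('response', response => {
    downloaded.push({
      url: response.url(),
      status: response.status(),
      type: response.request().resourceType(),
    });
  });

  page.on('requestfailed', request => {
    const failure = request.failure(); // null if no failure details exist
    failed.push({
      url: request.url(),
      type: request.resourceType(),
      error: failure ? failure.errorText : 'unknown',
    });
  });

  return { downloaded, failed };
}

// Usage sketch. puppeteer is required lazily so that trackDownloads can be
// loaded and tried with a stub page without launching a browser.
async function main(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const { downloaded, failed } = trackDownloads(page);
    // Resolve once the network has been idle for a short while,
    // instead of sleeping for a fixed number of seconds
    await page.goto(url, { waitUntil: 'networkidle0' });
    console.log('Downloaded:', downloaded);
    console.log('Failed:', failed);
  } finally {
    await browser.close();
  }
}

// Uncomment to run against a live page:
// main('https://example.com').catch(console.error);
```

Because nothing is intercepted here, no request needs to be continued by hand; the listeners only observe traffic. Pages that keep a connection open (polling, streaming) may never reach 'networkidle0', in which case a fixed delay is the simpler choice.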
Notes:
- Resource Types: The interceptedRequest.resourceType() method returns the type of the intercepted resource (e.g., 'document', 'script', 'stylesheet', 'image', 'xhr'). This helps identify what kind of resource is being requested.
- Event Handling: Puppeteer's event-driven approach (page.on('request', ...)) lets you capture and handle network requests in real time as they occur. The request event also fires when interception is off, so page.setRequestInterception(true) is only required if you want to block or modify requests.
- Security Considerations: Always ensure that you have the proper permissions or rights to access and use the network resources of a website, especially when automating interactions or scraping data.
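Since every captured entry carries a resource type, the resources array from the script above is easy to summarize. A small sketch, assuming entries shaped like { url, type }; the function name countByType and the sample data are hypothetical and not part of Puppeteer.

```javascript
// Hypothetical helper: counts captured resources per resource type.
// Expects entries shaped like { url, type }, as built in the script above.
function countByType(resources) {
  const counts = {};
  for (const { type } of resources) {
    counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}

// Sample data standing in for a real capture (URLs are illustrative only).
const sample = [
  { url: 'https://example.com/', type: 'document' },
  { url: 'https://example.com/style.css', type: 'stylesheet' },
  { url: 'https://example.com/app.js', type: 'script' },
  { url: 'https://example.com/vendor.js', type: 'script' },
];

console.log(countByType(sample));
// → { document: 1, stylesheet: 1, script: 2 }
```

In a real run, you would pass the resources array in place of sample just before closing the browser.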