Puppeteer - Connecting to HTTPS website through a proxy with authentication doesn't work

I have been trying to solve this issue for the past two days without success. I've searched everywhere and still haven't found a solution. Here's the code:
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
const PROXY_SERVER_IP = 'IP.IP.IP.IP';
const PROXY_SERVER_PORT = '1234';
const PROXY_USERNAME = 'username';
const PROXY_PASSWORD = 'password';
(async () => {
  const browser = await puppeteer.launch({
    args: [`--proxy-server=http://${PROXY_SERVER_IP}:${PROXY_SERVER_PORT}`],
  });
  const page = await browser.newPage();
  await page.authenticate({
    username: PROXY_USERNAME,
    password: PROXY_PASSWORD,
  });
  await page.goto('https://www.google.ca/', {
    timeout: 0,
  });
  await page.screenshot({ path: 'test4.png', fullPage: true });
  await browser.close();
})();
I get a navigation timeout error on the page.goto() call because it just hangs, and I can't figure out why. With a proxy that doesn't require authentication, it works. I'm thinking of switching to another headless solution because of this one issue, and I would really appreciate some help.

So I figured it out. The proxy itself turned out to be very slow. Axios and cURL responded quickly because they only fetch the initial HTML document and, unlike a headless browser, do nothing further with it. A headless browser also requests every asset the page references (CSS, images, and so on) and makes any other network requests the page triggers. All of that traffic goes through the proxy, so loading is much slower. When I tried a different proxy (one that also requires authentication), it was much faster.
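To see how much traffic a page actually sends through the proxy, one option is to tally its requests by resource type. The sketch below is only an illustration: `tallyRequests` is a hypothetical helper name, and it assumes a Puppeteer `page` object, which emits a `request` event whose argument exposes `resourceType()`.

```javascript
// Hypothetical helper (not part of Puppeteer): counts the requests a page
// makes, grouped by resource type, so asset traffic becomes visible.
function tallyRequests(page) {
  const counts = {};
  // Puppeteer pages emit 'request' for every network request they issue.
  page.on('request', (request) => {
    const type = request.resourceType(); // e.g. 'document', 'stylesheet', 'image'
    counts[type] = (counts[type] || 0) + 1;
  });
  return counts; // live object, filled in as the page loads
}

// Usage sketch (assumes the `page` from the question's code):
//   const counts = tallyRequests(page);
//   await page.goto('https://www.google.ca/');
//   console.log(counts);
```

A client such as cURL would account for only the single `document` request in this tally; everything else is extra load on the proxy.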

Related

Playwright: unable to login via API setting cookie (able to do it with Cypress)

I'm trying to implement login via API following Playwright's guidelines, but nothing seems to work.
As a comparison I've built the same in Cypress and it works out of the box:
Context:
Playwright Version: 1.30
Operating System: Mac
Node.js version: v16.19.0
Browser: Chromium
I am unable to get a simple API login working in Playwright, although the same flow works perfectly in Cypress. Let me share the two code snippets for comparison.
Simple test case:
1. Send an API request to the login endpoint and retrieve the auth token.
2. Set the auth token as a cookie.
3. Navigate to a page that is accessible only when authenticated.
Code Snippet
Cypress (working fine)
const body = {
  username: 'username...',
  password: 'password',
  rememberMe: true,
};
describe('Login via API to management console', () => {
  it('Login via API to management console', () => {
    cy.request({
      method: 'POST',
      url: loginEndPoint,
      headers: {
        'Content-Type': 'application/json',
      },
      body,
    }).then((response) => {
      cy.setCookie('Authorization', `Token ${response.body.data.token}`);
    });
    cy.visit(`/management`);
  });
});
Playwright (not working)
test('Login via API', async ({ browser }) => {
  const context = await browser.newContext();
  const page = await context.newPage();
  const loginResponse = await context.request.post(`https://${process.env.MANAGEMENT_URL}/web/api/v2.1/users/login`, {
    data: {
      username: process.env.MANAGEMENT_USER,
      password: process.env.MANAGEMENT_PASSWORD,
      rememberMe: true,
    }
  });
  const {
    data: { token },
  } = await loginResponse.body().then((b) => {
    return JSON.parse(b.toString());
  });
  expect(token).toMatch(/^[a-z0-9]{80}$/)
  await context.addCookies([{ name: 'Authorization', value: `Token ${token}`, path: '/', domain: `https://${process.env.MANAGEMENT_URL}` }]);
  await page.goto(`https://${process.env.MANAGEMENT_URL}/management/`);
  await expect(page).toHaveURL(/management/);
});
Describe the bug
Both scripts successfully retrieve the authentication token, but either I'm doing something wrong when setting the cookie in Playwright or there is an issue in Playwright itself. I'd expect the two scripts to be comparable.
Furthermore, I've tried logging in via the UI in global-setup, saving the storage state, and loading it before running the test, and that fails too. So something is not setting the state properly in that case, or the cookie in the previous one.
I'm not entirely sure why the cookie approach wasn't working. Perhaps the https:// part should be removed from the domain, since a cookie domain is a host name rather than a URL?
That being said, in Playwright you shouldn't even need to do that, especially within a single test. See the Playwright docs on signing in via the API, and the related page about the request context, particularly the part on cookie management. The associated request context and browser context share cookies, so once the login request completes, the browser should already have the cookie state and be logged in. You should be able to simply remove the code that extracts the token and adds the cookie. Alternatively, you can log in with the API in the global setup, as that doc shows. In that case, just make sure to save the storage state and specify the same file in your config.
I see you tried the global setup approach (through the UI, but you can use the API since you have it); I'm not sure what happened there. I would check that you specified storageState in the config. I'd be curious how you loaded it, and if you're still having problems, maybe share the code you're using for that piece?
Hope that helps or we can troubleshoot further!
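If you do want to keep setting the cookie manually, the domain suggestion above could look like the following sketch. `buildAuthCookie` is a hypothetical helper, not a Playwright API, and it assumes `MANAGEMENT_URL` holds a host name (with or without a scheme). It returns an object of the shape passed to `context.addCookies()` in the question, with `domain` reduced to the bare host name.

```javascript
// Hypothetical helper: builds the cookie object for context.addCookies().
// A cookie domain is a host name only, so any scheme, port, or path is dropped.
function buildAuthCookie(hostOrUrl, token) {
  // Accept either "example.com" or "https://example.com/..." as input.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(hostOrUrl);
  const { hostname } = new URL(hasScheme ? hostOrUrl : `https://${hostOrUrl}`);
  return {
    name: 'Authorization',
    value: `Token ${token}`,
    domain: hostname,
    path: '/',
  };
}

// Usage sketch inside the test from the question:
//   await context.addCookies([buildAuthCookie(process.env.MANAGEMENT_URL, token)]);
```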

How can I combine puppeteer-extra with express?

I'm trying to combine puppeteer-extra with Express. For each request I want to be able to load a different plugin, for example:
const express = require("express");
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
const router = express.Router();
router.get("/", async (req, res) => {
  const { useStealth } = req.query;
  if (useStealth) puppeteer.use(StealthPlugin());
  const browser = await puppeteer.launch(parameters..)
  const page = await browser.newPage();
});
The problem is that when I send the first request with the useStealth query parameter, the puppeteer-extra module (which Node caches) is configured to use StealthPlugin, so all subsequent requests use it too. I tried to solve this by clearing the Node module cache. It works, but it causes problems with concurrent requests. My attempt (which has the concurrent-request problem):
delete require.cache[require.resolve('puppeteer-extra')];
puppeteer = require('puppeteer-extra');
Is there any way to reset what puppeteer.use has registered? (So that there would be a new instance of puppeteer-extra per request.)
Thanks!
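One possible direction is the `addExtra` function that puppeteer-extra exports, which wraps a plain Puppeteer module in a new plugin-capable instance instead of mutating the shared default one. The sketch below is untested against the setup in the question. `createPuppeteer` is a hypothetical helper, and the dependencies are passed in as parameters so the per-request logic stays explicit. Note that query-string values are strings, so `useStealth=false` would be truthy unless compared explicitly.

```javascript
// Hypothetical helper: returns a fresh puppeteer-extra instance per call, so
// a plugin registered for one request does not leak into other requests.
// `addExtra`, `vanillaPuppeteer` and `stealthPlugin` are injected; in an app
// they would come from:
//   const { addExtra } = require('puppeteer-extra');
//   const vanillaPuppeteer = require('puppeteer');
//   const stealthPlugin = require('puppeteer-extra-plugin-stealth');
function createPuppeteer({ addExtra, vanillaPuppeteer, stealthPlugin }, query) {
  const instance = addExtra(vanillaPuppeteer); // new wrapper, no plugins yet
  // Query values are strings: "false" is truthy, so compare explicitly.
  if (query.useStealth === 'true') {
    instance.use(stealthPlugin());
  }
  return instance;
}

// Usage sketch inside the route handler:
//   const puppeteer = createPuppeteer(deps, req.query);
//   const browser = await puppeteer.launch();
```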

How to use Puppeteer with an Express API and properly close the browser without affecting other concurrent requests

I have a website which has some data as HTML (loaded via ajax) and I have to convert that to JSON with custom formatting.
So, for this I chose Puppeteer.
const browser = await puppeteer.launch({
  headless: true, args: ['--no-sandbox']
});
const page = await browser.newPage();
The web API that I'm developing will receive concurrent requests, so I thought browser.close() might affect other in-flight requests, and I decided to call only page.close().
One problem I'm facing is that puppeteer.launch() opens two about:blank tabs in a new window for each request.
When browser.newPage() is called, it returns one of the blank tabs and leaves the other one open.
That leads to multiple windows being open, each with its own about:blank tab.
I don't know the right way to handle this. I can't close the browser, because that would close all the pages being used by other requests.
You are seeing an empty (about:blank) tab each time you run this code because you are doing two things here:
Launching a new browser, which already starts with an open tab:
const browser = await puppeteer.launch({
  headless: true, args: ['--no-sandbox']
});
Opening a new tab:
const page = await browser.newPage();
If you don't want to have "zombie" blank tabs, then you can just reuse the initial tab like this:
const browser = await puppeteer.launch({
  headless: true, args: ['--no-sandbox']
});
const currentPages = await browser.pages(); // list the opened tabs
const [page] = currentPages; // use the first (and only) opened tab
Note that in this case, since you are reusing the single initial tab, closing it with page.close() will have the same effect as closing the browser with browser.close().
Exploring some Express + Browser concurrency alternatives
The right approach depends on whether you want to reuse the same browser instance for the lifetime of your Express server (i.e. serve all requests from one browser) or launch a new browser instance for each individual request.
1. One browser instance per server
In this case it might make sense, depending on your requirements, to manage one tab per request.
// launch the browser instance, once
const browser = await puppeteer.launch({
  headless: true, args: ['--no-sandbox']
});
// handle incoming requests
app.get("/foo", async (req, res) => {
  const page = await browser.newPage();
  try {
    // ... execute some logic on this new page
  } catch (error) {
    // whoops, logic went wrong, respond with 500 or something
  } finally {
    // cleanup: close the opened tab, no matter how the logic resulted
    await page.close();
  }
});
Note that in this scenario the browser context is still shared across pages, including cookies, local storage, and so on. Take this into account if you allow concurrent requests that could conflict when reusing the same shared context.
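If the shared cookies and storage are a concern, Puppeteer can create separate browser contexts inside a single browser process. The following is a sketch under assumptions: `withIsolatedPage` is a hypothetical helper, and the method name depends on the Puppeteer version (`createBrowserContext()` in recent releases, `createIncognitoBrowserContext()` in older ones), so check which one your version provides.

```javascript
// Hypothetical helper: runs `task` on a page that lives in its own browser
// context, so cookies and local storage are not shared with other requests.
async function withIsolatedPage(browser, task) {
  // The method name differs between Puppeteer versions (assumption: one of
  // these two exists on the browser object).
  const create = browser.createBrowserContext || browser.createIncognitoBrowserContext;
  const context = await create.call(browser);
  try {
    const page = await context.newPage();
    return await task(page);
  } finally {
    // Closing the context also closes every page that belongs to it.
    await context.close();
  }
}

// Usage sketch inside a route handler:
//   const html = await withIsolatedPage(browser, (page) => page.content());
```

The `finally` block plays the same role as `page.close()` in the handler above: cleanup happens whether the task succeeds or throws.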
2. One browser instance per request
In this scenario you launch a new browser instance per request, which ensures each request has a clean context and won't collide with other requests.
app.get("/foo", async (req, res) => {
  // launch the browser instance, one per request
  const browser = await puppeteer.launch({
    headless: true, args: ['--no-sandbox']
  });
  // no need to open a new tab, reuse the first one
  const [page] = await browser.pages();
  try {
    // ... execute some logic on the page
  } catch (error) {
    // whoops, logic went wrong, respond with 500 or something
  } finally {
    // cleanup: close the browser
    // await page.close() is not really needed if you close the entire browser,
    // and would have the same effect as browser.close()
    // if you haven't opened more tabs
    await browser.close();
  }
});
But consider that spinning up a new browser process is more resource-intensive, and each request will take longer to resolve than it would when reusing an already available browser process.
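For the shared-browser variant, one common pattern is to launch lazily and memoize the launch promise, so that concurrent first requests do not each start their own process. This is a minimal sketch: `createBrowserProvider` is a hypothetical name, and the `launch` argument stands in for something like `() => puppeteer.launch({ headless: true, args: ['--no-sandbox'] })`.

```javascript
// Hypothetical helper: returns a function that launches the browser on first
// use and hands the same instance to every later caller.
function createBrowserProvider(launch) {
  let browserPromise = null;
  return function getBrowser() {
    if (!browserPromise) {
      // Store the promise itself, so concurrent callers share one launch.
      browserPromise = launch().catch((error) => {
        browserPromise = null; // allow a retry after a failed launch
        throw error;
      });
    }
    return browserPromise;
  };
}

// Usage sketch:
//   const getBrowser = createBrowserProvider(() => puppeteer.launch({ headless: true }));
//   app.get('/foo', async (req, res) => {
//     const browser = await getBrowser();
//     const page = await browser.newPage();
//     // ...
//   });
```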

How to retrieve cookies in express via browser?

I've been trying to figure this out for two days now, to no avail. I'm using cookie-parser and have followed other people's code, but it's still not working. It works perfectly in Postman, but not in the browser. The browser I'm currently using is Google Chrome, but I've also tested it in Microsoft Edge, which gives the same result.
app.get('/testCookie', (req, res) => {
  res.cookie('username', 'flavio');
  res.json({ message: req.cookies });
});
// frontend
const data = await axios.get('http://localhost:5000/testCookie');
console.log(data); // returns {}
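As I understand it, you want to set a cookie and read it on the client side; if so, this might help.
Server-side
const cookieConfig = {
  httpOnly: false, // allow client-side JavaScript to read the cookie
  maxAge: 315569260000, // cookie lifetime in milliseconds; any value is possible
};
app.get('/testCookie', (req, res) => {
  res.cookie('username', 'flavio', cookieConfig);
  res.json({ message: 'cookie set' }); // send a response so the request completes
});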
Client side
const data = document.cookie
  .split('; ')
  .find((row) => row.startsWith('username='))
  ?.split('=')[1]; // undefined if the cookie is not set
console.log(data);
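A reusable version of this lookup can be sketched as a small function. `readCookie` is a hypothetical helper; it takes the cookie string as a parameter (in the browser you would pass `document.cookie`) so that it can also be tried outside the browser. It returns `undefined` when the cookie is absent and decodes the value, since Express URL-encodes cookie values by default.

```javascript
// Hypothetical helper: looks up one cookie in a "name=value; name2=value2"
// string such as document.cookie. Returns undefined if the cookie is absent.
function readCookie(cookieString, name) {
  const prefix = `${name}=`;
  const row = cookieString
    .split('; ')
    .find((entry) => entry.startsWith(prefix));
  if (row === undefined) return undefined;
  // Everything after the first "=" is the value; it may itself contain "=".
  return decodeURIComponent(row.slice(prefix.length));
}

// Usage sketch in the browser:
//   const username = readCookie(document.cookie, 'username');
```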

express and static assets with external resources

I am using Express to serve static assets. Frontend is AngularJS 1.x and I have html5mode enabled. Trying to implement Recaptcha is where I noticed the following in Chrome dev tools:
Uncaught SyntaxError: Unexpected token <
api.js?onload=vcRecaptchaApiLoaded&render=explicit“:1
When I click on the function to initiate the Recaptcha process I receive:
Error: reCaptcha has not been loaded yet.
So far this makes sense to me, because I noticed that the string reported in the first error is part of the URL path used to load Recaptcha from Google.
When I click on the url (api.js?onload=vcRecaptchaApiLoaded&render=explicit“:1) in chrome tools it loads my index.html! Strange!
This has me believing it has something to do with my static asset serving. I have played around with my Express server until the cows came home and cannot figure out how to fix it.
Live example:
http://ninjacape.herokuapp.com
Here is my code and thank you for taking a look!
index.html
<script src=“https://www.google.com/recaptcha/api.js?onload=vcRecaptchaApiLoaded&render=explicit“ async defer></script>
express.js
var express = require('express');
var compression = require('compression');
var app = module.exports.prod = exports.prod = express();
var devAPI = 'http://localhost:1337';
app.use(compression());
app.use(express.static('.tmp'));
app.get('/*', function(req, res) {
  res.sendFile(__dirname + '/.tmp/index.html');
});
var proxy = require('express-http-proxy');
app.use('/api', proxy(devAPI));
var port = process.env.PORT || 8000;
app.listen(port);
Well... I wish I had a better answer however I am just happy I got it to work. Something in the way I am statically serving files is appending any url in index.html to http://localhost:8000. To work around this I took a look at the actual request coming into Express and found the url. Then added logic to redirect that request to the real url. See commented code below for more info:
// Handle any request matching /*
app.get('/*', function(req, res) {
  // Log the original URL Express is trying to serve
  console.log(req.url);
  // This is the URL found in the step above (where are the extra characters coming from?!)
  var url = '/%E2%80%9Chttps://www.google.com/recaptcha/api.js?onload=vcRecaptchaApiLoaded&render=explicit%E2%80%9C';
  if (req.url === url) {
    // Respond by redirecting the request to the real URL.
    // The redirect ends the response, so next() must not be called afterwards.
    res.redirect('https://www.google.com/recaptcha/api.js?onload=vcRecaptchaApiLoaded&render=explicit');
  } else {
    // If it doesn't match the above URL, proceed as normal
    res.sendFile(__dirname + '/.tmp/index.html');
  }
});
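On the question in the code comment about where the extra characters come from: `%E2%80%9C` is the percent-encoded UTF-8 form of `“` (U+201C, a typographic left double quotation mark), the same character that surrounds the `src` value in the index.html shown in the question. Because that character is not a real attribute quote, the browser appears to treat it as part of the URL, which then no longer starts with a scheme and is resolved relative to the page. The sketch below only decodes the logged path and resolves the mis-quoted value to illustrate this. It uses the standard `decodeURIComponent` and `URL` globals, and the base URL `http://localhost:8000/` is taken from the answer.

```javascript
// The path Express logged, copied from the answer above.
const loggedPath =
  '/%E2%80%9Chttps://www.google.com/recaptcha/api.js?onload=vcRecaptchaApiLoaded&render=explicit%E2%80%9C';

// Decoding reveals a typographic quote (U+201C) at both ends of the URL.
const decodedPath = decodeURIComponent(loggedPath);
const firstChar = decodedPath[1]; // character right after the leading "/"
const lastChar = decodedPath[decodedPath.length - 1];
console.log(firstChar === '\u201C', lastChar === '\u201C');

// Resolve the mis-quoted src value the way a relative URL is resolved:
// it does not start with a scheme, so it lands under the page's own origin.
const misquotedSrc =
  '\u201Chttps://www.google.com/recaptcha/api.js?onload=vcRecaptchaApiLoaded&render=explicit\u201C';
const resolved = new URL(misquotedSrc, 'http://localhost:8000/');
console.log(resolved.origin, resolved.pathname);
```

If this is indeed the cause, replacing the typographic quotes around the `src` value with plain `"` characters should make the redirect workaround unnecessary.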