Changing window.navigator within puppeteer to bypass antibot system - selenium

I'm trying to make my online bot undetectable. I read number of articles how to do it and I took all tips together and used them. One of them is to change window.navigator.webdriver.
I managed to change window.navigator.webdriver within puppeteer by this code:
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
});
I'm bypassing this test just fine:
However this test is still laughing at me somehow:
Why WEBDRIVER is inconsistent?

Try this,
First, remove the definition, it will not work if you define and delete from prototype.
Object.defineProperty(navigator, 'webdriver', ()=>{}) // <-- delete this part
Replace your code with this one.
delete navigator.__proto__.webdriver;
Result:
Why does it work?
Removing directly just remove the instance of the object rather than the actual definition. The getter and setter is still there, so browser can find it.
However if you remove from the actual prototype, it will not exist at all in any instance anymore.
Additional Tips
You mentioned you want to make your app undetectable, there are many plugins which achieve the same, for example this package called puppeteer-extra-plugin-stealth includes some cool anti-bot detection techniques. Sometimes it's better to just reuse some packages than to re-create a solution over and over again.
PS: I might be wrong above the above explanation, feel free to guide me so I can improve the answer.

Related

Filtering out assets from precaching in create-react-app

I'm using React 17 with cra-template-pwa to create a PWA. One of my UI libraries has several hundred static image resources that all get preloaded in the PWA (and I don't use most of them). This causes a long delay in enabling the PWA, and even causes Lighthouse to crash. I'm looking at various approaches to fixing the problem, but for a quick fix just to run lighthouse, I'd like to just disable precaching. I haven't been able to find concrete info how to do this. Any advice?
The cleanest solution would entail using the exclude option in the workbox-webpack-plugin configuration, but that requires ejecting in create-react-app.
Something you can do without ejecting, though, is to explicitly filter out entries from the injected self.__WB_MANIFEST array before passing the value to precacheAndRoute().
Your service-worker.js could look something like:
import {precacheAndRoute} from 'workbox-precaching';
// self.__WB_MANIFEST will be replaced with an
// Array<{url: string, revision: string}> during the build process.
// This will filter out all manifest entries with URLs ending in .jpg
// Adjust the criteria as needed.
const filteredManifest = self.__WB_MANIFEST.filter((entry) => {
return !entry.url.endsWith('.jpg');
});
precacheAndRoute(filteredManifest);
The downsides of this approach is that your service-worker.js file will be a bit larger than necessary (since it will include inline {url, revision} entries that aren't needed), and that you'll end up triggering the service worker update flow more than strictly necessary, if the contents of one of your images changes. Those unnecessary service worker updates won't actually harm anything or change the behavior of your web app, though.

Workbox/Vue: Create a custom variation on an existing caching strategy handler

Background:
I'm building an SPA (Single Page Application) PWA (Progressive Web App) using Vue.js. I've a remote PostgreSQL database, serving the tables over HTTP with PostgREST. I've a working Workbox Service Worker and IndexedDB, which hold a local copy of the database tables. I've also registered some routes in my service-worker.js; everything is fine this far....
I'm letting Workbox cache GET calls that return tables from the REST service. For example:
https://www.example.com/api/customers will return a json object of the customers.
workbox.routing.registerRoute('https://www.example.com/api/customers', workbox.strategies.staleWhileRevalidate())
At this point, I need Workbox to do the stale-while-revalidate pattern, but to:
Not use a cache, but instead return the local version of this table, which I have stored in IndexedDB. (the cache part)
Make the REST call, and update the local version, if it has changed. (the network part)
I'm almost certain that there is no configurable option for this in this workbox strategy. So I would write the code for this, which should be fairly simple. The retrieval of the cache is simply to return the contents of the requested table from IndexedDB. For the update part, I'm thinking to add a data revision number to compare against. And thus decide if I need to update the local database.
Anyway, we're now zooming in on the actual question:
Question:
Is this actually a good way to use Workbox Routes/Caching, or am I now misusing the technology because I use IndexedDB as the cache?
and
How can I make my own version of the StaleWhileRevalidate strategy? I would be happy to understand how to simply make a copy of the existing Workbox version and be able to import it and use it in my Vue.js Service Worker. From there I can make my own necessary code changes.
To make this question a bit easier to answer, these are the underlying subquestions:
First of all, the StaleWhileRevalidate.ts (see link below) is a .ts (TypeScript?) file. Can (should) I simply import this as a module? I propably can. but then I get errors:
When I to import my custom CustomStaleWhileRevalidate.ts in my main.js, I get errors on all of the current import statements because (of course) the workbox-core/_private/ directory doesn't exist.
How to approach this?
This is the current implementation on Github:
https://github.com/GoogleChrome/workbox/blob/master/packages/workbox-strategies/src/StaleWhileRevalidate.ts
I don't think using the built-in StaleWhileRevalidate strategy is the right approach here. It might be possible to do what you're describing using StaleWhileRevalidate along with a number of custom plugin callbacks to override the default behavior... but honestly, you'd end up changing so much via plugins that starting from scratch would make more sense.
What I'd recommend that you do instead is to write a custom handlerCallback function that implements exactly the logic you want, and returns a Response.
// Your full logic goes here.
async function myCustomHandler({event, request}) {
event.waitUntil((() => {
const idbStuff = ...;
const networkResponse = await fetch(...);
// Some IDB operation go here.
return finalResponse;
})());
}
workbox.routing.registerRoute(
'https://www.example.com/api/customers',
myCustomHandler
);
You could do this without Workbox as well, but if you're using Workbox to handle some of your unrelated caching needs, it's probably easiest to also register this logic via a Workbox route.

How would you redirect calls to the top object in Cypress?

In my application code, there are a lot of calls (like 100+) to the "top object" referring to window.top such as top.$("title") and so forth. Now, I've run into the problem using Cypress to perform end-to-end testing. When trying to log into the application, there are some calls to top.$(...) but the DevTools shows a Uncaught TypeError: top.$ is not a function. This resulted in my team and I discovering that the "top" our application is trying to reach is the Cypress environment itself.
The things I've tried before coming here are:
1) Trying to stub the window.top with the window object referencing our app. This resulted in us being told window.top is a read-only object.
2) Researching if Cypress has some kind of configuration that would smartly redirect calls to top in our code to be the top-most environment within our app. We figured we probably weren't the only ones coming across this issue.
If there were articles, I couldn't find any, so I came to ask if there was a way to do that, or if anyone would know of an alternate solution?
Another solution we considered: Looking into naming window objects so we can reference them by name instead of "window" or "top". If there isn't a way to do what I'm trying to do through Cypress, I think we're willing to do this as a last resort, but hopefully, we don't have to change that, since we're not sure how much of the app it will break upfront.
#Mikkel Not really sure what code I can provide to be useful, but here's the code that causes Cypress to throw the uncaught exception
if (sample_condition) {
top.$('title').text(...).find('content') // Our iframe
} else {
top.$('title').text(page_title)
}
And there are more instances in our code where we access the top object, but they are generally similar. We found out the root cause of the issue is that within Cypress calls to "top" actually interface with Cypress instead of their intended environment which is our app.
This may not be a direct answer to your question, it's just expanding on your request for more information about the technique that I used to pass info from one script to another. I tried to do it within the same script without success - basically because the async nature of .then() stopped it from working.
This snippet is where I read a couple of id's from sessionStorage, and save them to a json file.
//
// At this point the cart is set up, and in sessionStorage
// So we save the details to a fixtures file, which is read
// by another test script (e2e-purchase.js)
//
cy.window().then(window => {
const contents = {
memberId: window.sessionStorage.getItem('memberId'),
cartId: window.sessionStorage.getItem('mycart')
}
cy.writeFile(`tests/cypress/fixtures/cart.json`, contents)
})
In another script, it loads the file as a fixture (fixtures/cart.json) to pull in a couple of id's
cy.fixture(`cart`).then(cart => {
cy.visit(`/${cart.memberId}/${cart.cartId}`)
})

LibGit2Sharp: how to Task-Async

Hi I have begun to use the package for some very simple tasks, mainly cloning a Git-Wiki repo and subsequently pulling the changes from the server when needed.
Now I can not see any methods corresponding to the Task-Async (TAP) pattern. Also in the documentation I could not find anything concerning.
Could you please give me some direction how to wrap the LibGit2Sharp methods into a TAP construct? Link to documentation (if I missed something) or just telling me which callback to hook up to the TaskCompletionSource object would be nice.
It also doesn't really help that I am a newbie with Git, and normally I only do basic branching, merging, pushing with it.
For cloning I use:
Repository.Clone(#"https://MyName#bitbucket.org/MyRepo/MyProject.git/wiki", "repo");
For pulling I use:
using (var repo = new Repository("repo"))
{
// Credential information to fetch
LibGit2Sharp.PullOptions options = new LibGit2Sharp.PullOptions();
options.FetchOptions = new FetchOptions();
var signature = new LibGit2Sharp.Signature(new Identity("myname", "mymail#google.com"), DateTimeOffset.Now);
Commands.Pull(repo, signature, options);
}
Thanks in advance
First of all, you should never try sync over async or async over sync. See this article.
If you're thinking of using Task.Run, don't. That will just trade on thread pool thread for another with the added cost of 2 context switches.
But you should reconsider your whole approach to this. You don't need to clone the repository just to get the contents of a file. Each version of a file, has a unique URL. You can even get the URL of a file for a specific branch.

Why is Mage_Persistent breaking /api/wsdl?soap

I get the following error within Magento CE 1.6.1.0
Warning: session_start() [<a href='function.session-start'>function.session-start</a>]: Cannot send session cookie - headers already sent by (output started at /home/dev/env/var/www/user/dev/wdcastaging/lib/Zend/Controller/Response/Abstract.php:586) in /home/dev/env/var/www/user/dev/wdcastaging/app/code/core/Mage/Core/Model/Session/Abstract/Varien.php on line 119
when accessing /api/soap/?wsdl
Apparently, a session_start() is being attempted after the entire contents of the WSDL file have already been output, resulting in the error.
Why is magento attempting to start a session after outputting all the datums? I'm glad you asked. So it looks like controller_front_send_response_after is being hooked by Mage_Persistent in order to call synchronizePersistentInfo(), which in turn ends up getting that session_start() to fire.
The interesting thing is that this wasn't always happening, initially the WSDL loaded just fine for me, initially I racked my brains to try and see what customization may have been made to our install to cause this, but the tracing I've done seems to indicate that this is all happening entirely inside of core.
We have also experienced a tiny bit of (completely unrelated) strangeness with Mage_Persistent which makes me a little more willing to throw my hands up at this point and SO it.
I've done a bit of searching on SO and have found some questions related to the whole "headers already sent" thing in general, but not this specific case.
Any thoughts?
Oh, and the temporary workaround I have in place is simply disabling Mage_Persistent via the persistent/options/enable config data. I also did a little bit of digging as to whether it might be possible to observe an event in order to disable this module only for the WSDL controller (since that seems to be the only one having problems), but it looks like that module relies exclusively on this config flag to determine it's enabled status.
UPDATE: Bug has been reported: http://www.magentocommerce.com/bug-tracking/issue?issue=13370
I'd report this is a bug to the Magento team. The Magento API controllers all route through standard Magento action controller objects, and all these objects inherit from the Mage_Api_Controller_Action class. This class has a preDispatch method
class Mage_Api_Controller_Action extends Mage_Core_Controller_Front_Action
{
public function preDispatch()
{
$this->getLayout()->setArea('adminhtml');
Mage::app()->setCurrentStore('admin');
$this->setFlag('', self::FLAG_NO_START_SESSION, 1); // Do not start standart session
parent::preDispatch();
return $this;
}
//...
}
which includes setting a flag to ensure normal session handling doesn't start for API methods.
$this->setFlag('', self::FLAG_NO_START_SESSION, 1);
So, it sounds like there's code in synchronizePersistentInf that assumes the existence of a session object, and when it uses it the session is initialized, resulting in the error you've seen. Normally, this isn't a problem as every other controller has initialized a session at this point, but the API controllers explicitly turns it off.
As far as fixes go, your best bet (and probably the quick answer you'll get from Magento support) will be to disable the persistant cart feature for the default configuration setting, but then enable it for specific stores that need it. This will let carts
Coming up with a fix on your own is going to be uncharted territory, and I can't think of a way to do it that isn't terribly hacky/unstable. The most straight forward way would be a class rewrite on the synchronizePersistentInf that calls it's parent method unless you've detected this is an API request.
This answer is not meant to replace the existing answer. But I wanted to drop some code in here in case someone runs into this issue, and comments don't really allow for code formatting.
I went with a simple local code pool override of Mage_Persistent_Model_Observer_Session to exit out of the function for any URL routes that are within /api/*
Not expecting this fix to need to be very long-lived or upgrade-friendly, b/c I'm expecting them to fix this in the next release or so.
public function synchronizePersistentInfo(Varien_Event_Observer $observer)
{
...
if ($request->getRouteName() == 'api') {
return;
}
...
}