I am trying to do web scraping using PhantomJS. The script performs a recurring task, so the same code runs around 10k times with different parameters in a single execution. The whole process takes around 3 hours to complete.
The issue is that it suddenly stops at a random point, with a "killed" status written on the screen.
I tried a few tricks to solve it, such as calling localStorage.clear() inside the page.evaluate() function and reinstalling PhantomJS, but nothing worked.
So I need to know why this is happening and what I can do to fix it.
I'm trying to rapidly develop my frontend, but every time I change my code I find myself refreshing my browser and running some macro to test whether the changes in my code solved the problem.
I tried changing the process to headless PySelenium, but it takes so long for the driver to launch every time I change my code.
I also tried Cypress.io, but after following the tutorial, the directory just didn't load.
I'm looking for a headless option that runs as fast as possible.
Using PhantomJS and bash, I'm working on a little piece of anti-malware that reads a web page, grabs all the domains that are delivering assets to the browser, then prints each server's country of origin. It works fine except for one site that has a... uh... 'suboptimal' piece of JavaScript that calls an external server every 5 seconds. PhantomJS just loads the resource over and over and over, page.open() never finishes, and page.onLoadFinished() is never called.
Is there a way around this? Can I set a time limit on page.open()? Or, as a workaround, can I set a time limit on the Linux process?
Thanks in advance, and if anyone is interested in a copy of this script let me know and I'll post it somewhere public.
I solved this problem using the solutions given here to set an execution time limit on the phantomjs command and kill it if needed:
Command line command to auto-kill a command after a certain amount of time
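For anyone who would rather drive this from a script than from the shell, here is a minimal sketch of the same idea (run the command, kill it if it overruns) using only Node's built-in child_process module. The helper name runWithTimeout, the script name domains.js and the 30 second limit are my own placeholders, not anything from the original script.

```javascript
const { spawnSync } = require('node:child_process');

// Run a command and force-kill it if it is still running after timeoutMs.
// Hypothetical helper, not part of PhantomJS or of the original script.
function runWithTimeout(command, args, timeoutMs) {
  const result = spawnSync(command, args, {
    timeout: timeoutMs,     // Node kills the child once this elapses
    killSignal: 'SIGKILL',  // a hung page load may ignore a polite SIGTERM
    encoding: 'utf8',
  });
  const timedOut = Boolean(result.error && result.error.code === 'ETIMEDOUT');
  return {
    timedOut,
    status: result.status,  // exit code, or null if killed by a signal
    stdout: result.stdout || '',
  };
}

// Example (script name and URL are placeholders):
//   const r = runWithTimeout('phantomjs', ['domains.js', 'http://example.com/'], 30000);
//   if (r.timedOut) console.error('page never finished loading, gave up');
```

Anything the PhantomJS script printed before it was killed is still available in stdout, so a partial list of domains is not lost.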
So I am trying to keep my Node server on an embedded computer running when it is out in the field. This led me to leverage inittab's respawn action. Here is the line I added to inittab:
node:5:respawn:node /path/to/node/files &
I know for a fact that when I start up this Node application from the command line, it does not get to the bottom of the main body and console.log "done" until a good 2-3 seconds after I issue the command.
So I suspect that in that 2-3 second window the OS just keeps firing off respawns of the node app. In fact, I can see in the error logs that the kernel ends up killing a bunch of node processes because it's running out of memory, and I also get the "'node' process respawning too fast, will suspend for 5 minutes" message.
I tried wrapping this in a script, but that didn't work. I know I can use crontab, but that only runs every minute... Am I doing something wrong, or should I take a different approach altogether?
Any and all advice is welcome!
TIA
Surely too late for you, but in case someone else finds such a problem: try removing the & from the command invocation.
What happens is that when the command goes to the background (thanks to the &), the parent (init) sees that it exited, and respawns it. Result: a storm of new instantiations of your command.
Worse, you mention embedded, so I guess you are using busybox, whose init won't rate-limit the respawning, unlike other implementations. So the respawning will only end when the system is out of memory.
inittab is overkill for this. I found out what I need is a process monitor. I found one that is lightweight and effective; it has some good reports of working great out in the field. http://en.wikipedia.org/wiki/Process_control_daemon
Using this would entail configuring this daemon to start and monitor your Node.js application for you.
That is a solution that works from the OS side.
Another way to do it is as follows. If you are trying to keep Node.js running like I was, there are several modules meant to keep other Node.js apps running, for example forever and respawn. I chose to use respawn.
This method entails starting one app written in Node.js that uses the respawn module to start and monitor the actual Node.js app you were interested in keeping running anyway.
Of course the downside of this is that if the Node.js engine (V8) goes down altogether then both your monitoring and monitored process will go down with it :-(. But it's better than nothing!
PCD would be the ideal option. It would probably go down only if the OS goes down, and if the OS goes down then hopefully one has a watchdog in place to reboot the device/hardware.
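To make the idea concrete, here is a rough sketch of such a monitoring app. Note that this is not the respawn module's API: it only uses Node's built-in child_process, and the names supervise and backoffDelay, the default delays and the server.js file name are all my own assumptions. The growing delay between restarts is there so that an app that crashes immediately does not cause the kind of respawn storm described above.

```javascript
const { spawn } = require('node:child_process');

// Delay before restart number `attempt` (0-based): doubles each time, capped at maxMs.
function backoffDelay(attempt, baseMs, maxMs) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Start `command` and restart it whenever it exits, until stop() is called.
function supervise(command, args, options = {}) {
  const { baseMs = 1000, maxMs = 60000, stableMs = 10000 } = options;
  let attempt = 0;
  let stopped = false;
  let child = null;
  let timer = null;

  function start() {
    const startedAt = Date.now();
    let finished = false;
    const onDone = (reason) => {
      if (finished || stopped) return; // 'error' and 'exit' can both fire
      finished = true;
      // A run that lasted long enough counts as healthy, so the backoff starts over.
      if (Date.now() - startedAt >= stableMs) attempt = 0;
      const delay = backoffDelay(attempt, baseMs, maxMs);
      attempt += 1;
      console.error(`child ended (${reason}); restarting in ${delay} ms`);
      timer = setTimeout(start, delay);
    };
    child = spawn(command, args, { stdio: 'inherit' });
    child.on('error', (err) => onDone(err.message));
    child.on('exit', (code, signal) => onDone(signal || `code ${code}`));
  }

  start();
  return {
    stop() {
      stopped = true;
      clearTimeout(timer);
      if (child) child.kill();
    },
  };
}

// Example (file name is a placeholder):
//   const monitor = supervise('node', ['server.js']);
//   process.on('SIGTERM', () => monitor.stop());
```

The monitoring app itself still has to be started by something (init, PCD, a boot script), and it should be started in the foreground, without a trailing &.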
Niko
I'm running CGI-LUA scripts with lighttpd on an embedded device. The web client tries to run three scripts via POST every 3 seconds.
Most of the time it works, but from time to time I get a 500 Internal Server Error, as if the server failed to run the script, even though nothing has changed and 'top' shows that the CPU is idle most of the time.
I'm new to web development; any ideas?
If I were trying to solve this problem I would start with:
1) Look in /var/log/lighttpd/error.log to see what lighttpd is reporting when the failure occurs.
2) Write a very simple CGI-LUA script that does something traceable, like touch a file with the current unixtime as its name, and hit it every 3 seconds instead of your script. This will help you figure out if the problem is in CGI-LUA or in your script.
3) Run your script outside CGI-LUA repeatedly in a loop to see if it ever fails.
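For step 3, a small harness along these lines could do the looping and keep the details of any run that fails. It is only a sketch: runRepeatedly is a made-up helper, it is written for Node purely as an example (a shell loop would do just as well), and the lua command and script path are placeholders for however you invoke your script on the device.

```javascript
const { spawnSync } = require('node:child_process');

// Run the same command `runs` times and collect every failing run.
function runRepeatedly(command, args, runs) {
  const failures = [];
  for (let i = 1; i <= runs; i++) {
    const result = spawnSync(command, args, { encoding: 'utf8' });
    // result.error is set when the command could not be started at all.
    if (result.error || result.status !== 0) {
      failures.push({
        run: i,
        status: result.status,
        signal: result.signal,
        stderr: result.error ? result.error.message : result.stderr,
      });
    }
  }
  return failures;
}

// Example (interpreter and script path are placeholders):
//   const failures = runRepeatedly('lua', ['/path/to/script.lua'], 1000);
//   console.log(failures.length + ' of 1000 runs failed', failures.slice(0, 5));
```

If the script never fails here but still fails behind lighttpd, that points at the CGI layer or the server configuration rather than at the script itself.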
This is driving me a bit nuts. I have rufus doing some scheduling to call a rules engine (ruleby), so most of the work I have running is inside the running engine, which in turn is inside the scheduler. As a result, when I get an error the information is a bit limited.
Fast forward: I'm still working on my code, but now I have this exception error:
'undefined method `+' for nil:NilClass'
It wasn't happening before, and I'm not sure exactly when it started, or whether it was caused by what I was doing with the code or by some events that come in via HTTP push. I comment out the code I think is causing it, and it stops happening; I put the code back in, and it's still not happening; I leave it for a while, and it starts happening again. I try running the engine manually outside the scheduler (so just once instead of every x minutes), and it doesn't happen.
I put it back on the scheduler to run a few times, and it starts happening again. I would google the above error, but Google doesn't love the + in the search. Does anyone have any ideas where to direct me for this? It's clearly something happening while the rules engine is running, but it was happily running for weeks before I got back to trying to finish it off. My best guess is that while the rules engine is running, events are passed into it one at a time, and something is missing that wasn't missing before.
I really want to know what the + method it refers to is, could be, or is supposed to be.