Wait for all the ajax requests to finish before returning html in Splash using Lua Script - scrapy

I am using Splash along with Scrapy to execute some script on the page before scraping it.
Basically, a few elements are loaded via AJAX on the click of a button.
There are multiple AJAX requests happening per page. Below is the Lua script I am using.
function main(splash)
  assert(splash:go(splash.args.url))
  splash:wait(1)
  local btns = splash:select_all('.buttonShowCo')
  for _, btn in ipairs(btns) do
    btn:mouse_click()
  end
  splash:wait(12)
  return splash:html()
end
The issue is that the script misses a few dynamic elements. I am assuming that the script returns before all the AJAX calls finish.
I added a wait time to let all the AJAX calls finish, but it is not working.
Is there a way to wait until all the AJAX calls finish?

You can create a variable for each AJAX request, initially set to "false"; when an AJAX request finishes, you set it to "true". In another function, you create a while loop that checks whether all the variables are "true" before doing what you want to do.
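A minimal sketch of that flag-per-request idea, in JavaScript since the flags have to live in the page itself. The names createRequestTracker, begin, allFinished and waitUntilAllFinished are hypothetical, not part of Splash or any library. It assumes the page's own AJAX code can be changed (or wrapped) to call begin() when a request starts and the returned callback when it completes. The polling uses a timer rather than a tight while loop, because a busy loop would block the page and the responses could never arrive.

```javascript
// Hypothetical helper: keeps one "finished" flag per tracked request.
function createRequestTracker() {
  const finished = new Map(); // request id -> true once that request is done
  let nextId = 0;
  return {
    // Call when a request starts; returns a callback to call when it finishes.
    begin() {
      const id = nextId++;
      finished.set(id, false);
      return () => finished.set(id, true);
    },
    // True only when every tracked request has finished.
    allFinished() {
      for (const done of finished.values()) {
        if (!done) return false;
      }
      return true;
    },
  };
}

// Polls the tracker on a timer until all requests are done, or rejects on timeout.
function waitUntilAllFinished(tracker, { intervalMs = 50, timeoutMs = 10000 } = {}) {
  return new Promise((resolve, reject) => {
    const startedAt = Date.now();
    const check = () => {
      if (tracker.allFinished()) return resolve();
      if (Date.now() - startedAt >= timeoutMs) {
        return reject(new Error('timed out waiting for requests'));
      }
      setTimeout(check, intervalMs);
    };
    check();
  });
}
```

From the Lua side, one option would be to read the same flag state in a loop, for example by evaluating allFinished() with splash:evaljs and sleeping with splash:wait between checks. That wiring is an assumption and is not shown here.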

Related

Can scrapy-splash Ignore 504 HTTP Status?

I want to scrape web pages that load their content with JavaScript, so I use scrapy-splash, but some pages take a very long time to load.
Like this:
I think the [processUser..] requests are what make it slower.
Is there any way to ignore those 504 pages? When I set the timeout to less than 90, I get a 504 gateway error in the Scrapy shell or in spiders.
And can I get the resulting HTML (only for responses with status 200) once the time I set is over?
There's a mechanism in Splash to abort a request before it starts loading the body, which you can leverage using the splash:on_response_headers hook. However, in your case this hook can only catch and abort the page once the status and the headers are in, and that is after it finishes waiting for the gateway timeout (504). So instead you might want the splash:on_request hook to abort the request before it's even sent, like so:
function main(splash, args)
  splash:on_request(function(request)
    if request.url:find('processUser') then
      request:abort()
    end
  end)
  assert(splash:go(args.url))
  assert(splash:wait(.5))
  return {
    har = splash:har(),
  }
end
UPD: Another and perhaps a better way to go about this is to set splash.resource_timeout before any requests take place:
function main(splash, args)
  splash.resource_timeout = 3
  ...
When you are using Splash to render a webpage you are basically using a web browser.
When you ask Splash to render http://example.com:
1. Splash goes to http://example.com
2. Splash executes all of the JavaScript
2.1 JavaScript makes some requests
2.2 some requests return 50x codes
3. Splash returns page data
Unfortunately, Splash right now does not support any custom rules for blocking JavaScript requests. It just takes the page and does everything your browser would do without any add-ons: load everything without question.
All that being said, it's highly unlikely that those 50x requests are slowing down your page load; if they are, it shouldn't be by a significant amount.

Splash get html before finish Lua Script

I have a web page with strictly AJAX pagination (there is only a button for the next page).
To get to page 5, for example, the script has to press the Next button 5 times.
But after the script clicks, the data for the current page is lost.
Is it possible to return HTML content from the Lua script to Scrapy, and then continue running the script?
For now I use a bad workaround: I merge the HTML of each page inside the Lua script and return it after the last page. But I don't think that's a good approach.

YII getFlashes() not deleting?

Before jumping in with an answer, please make sure you understand my scenario.
I have ajax calls that CREATE flashes.
I have other ajax calls that FETCH the flashes as JSON.
What is currently happening: I click a button which creates the flash. After that, I run an AJAX call that executes:
public function actionGetAllFlashesAsJSON() {
    $flashMessages = Yii::app()->user->getFlashes(true);
    $returnResult = array();
    foreach ($flashMessages as $key => $value) {
        $newItem = array();
        $newItem['message'] = $value;
        $newItem['kind'] = $key;
        $returnResult[] = $newItem;
    }
    print json_encode($returnResult);
    die();
}
My problem is, when I execute this function twice in a row, it still keeps returning the flashes. However, if I refresh the site, it shows the error, and then if I press refresh again, it's gone. My theory is that page refresh is causing some other kind of deletion of messages... but what? And how can I force the deletion of these messages after I receive the message in the above code?
More background info: I am using the flashes as ERROR messages, but I want them to appear at the top of my site AS THEY ARE CREATED. Flashes might get created via AJAX, so I have JavaScript running to check for new messages and display them. My problem is that it shows the messages several times, because they are not getting deleted after calling getFlashes.
The flash messages are controlled by SESSION variables, which Yii destroys when the page is loaded (probably somewhere quite deep in the framework). You will have to manually destroy all the previous flash messages at the start of the AJAX request.
You can use getFlashes() to get all the existing flash messages.
For the other flash message methods, have a look at the CWebUser docs.

Keeping browser from timing out in mvc3 app during long processing time

MVC3, VB.NET. In my app there is a point where 500+ emails with attachments are sent out using a For Each loop. Nothing is returned to the browser the entire time this is running, so eventually the browser thinks it has timed out. I tried having it redirect to another ActionResult function after every email, with that function just passing control back to the email function. This is not working, and I feel the reason is that nothing is actually being sent to the browser window itself. Is there a way to fix this issue?
If _keepAlive = 1 Then
    RedirectToAction("keepAlive", "Email")
End If
Function keepAlive() As ActionResult
    Return RedirectToAction("SendClassSchedules", "Email")
End Function
You can try an async action and set the timeout to a large value:
http://msdn.microsoft.com/en-us/library/system.web.mvc.asynctimeoutattribute.aspx
http://blogs.claritycon.com/blog/2011/04/12/roll-your-own-mvc-3-long-polling-chat-site/
In the past, we have successfully started a background thread to do the processing, then set the page to refresh itself once a second, with each refresh reporting progress. I suppose if I were doing something like that today, I would use a page method and a JavaScript AJAX call to update the page with the progress.
Are you talking about this? http://msdn.microsoft.com/en-us/library/ms525473(v=vs.90).aspx
I think you could change Session.Timeout to suit your needs better.

How can I block based on URL (from address bar) in a safari extension

I'm trying to write an extension that will block access to a (configurable) list of URLs if they are accessed more than N times per hour. From what I understand, I need to have a start script pass a "should I load this" message to a global HTML page (which can access the settings object to get the list of URLs), which will give a thumbs up/thumbs down message back to the start script to allow/deny loading.
That works out fine for me, but when I use the usual beforeLoad/canLoad handlers, I get messages for all the sub-items that need to be loaded (images, etc.), which throws off the accesses-per-hour limit I'm trying to enforce.
Is there a way to synchronously pass messages back and forth between the two sandboxes so I can tell the global HTML page, "this is the URL in the window bar and the timestamp for when this request came in", so I can limit duplicate requests?
Thanks!
You could use a different message for the function that checks whether to allow the page to load, rather than using the same message as for your beforeLoad handler. For example, in the injected script (which must be a "start" script), put:
safari.self.tab.dispatchMessage('pageIsLoading');
And in the global script:
function handleMessage(event) {
  if (event.name == 'pageIsLoading') {
    if (event.target.url.indexOf('forbidden.site.com') > -1) {
      console.log(event.timeStamp);
      event.target.url = 'about:blank';
    }
  }
}
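For the "more than N times per hour" part of the question, here is a rough sketch of how the global script might count page loads per URL in a sliding one-hour window. createAccessLimiter and shouldAllow are hypothetical names, not Safari APIs. The sketch assumes one call per 'pageIsLoading' message, and that blocked attempts should not count toward the limit.

```javascript
// Hypothetical helper: counts page loads per URL inside a sliding time window.
function createAccessLimiter(maxPerWindow, windowMs = 60 * 60 * 1000) {
  const history = new Map(); // url -> timestamps (ms) of recent allowed loads
  return {
    // Records an access at time `now` and returns true if it should be allowed.
    shouldAllow(url, now = Date.now()) {
      // Keep only the loads that still fall inside the window.
      const recent = (history.get(url) || []).filter((t) => now - t < windowMs);
      const allowed = recent.length < maxPerWindow;
      if (allowed) recent.push(now); // blocked attempts are not recorded
      history.set(url, recent);
      return allowed;
    },
  };
}
```

Inside handleMessage, the indexOf check could then be replaced by something like limiter.shouldAllow(event.target.url, event.timeStamp), redirecting to 'about:blank' when it returns false. Whether event.timeStamp is expressed in milliseconds here is an assumption; passing Date.now() instead avoids relying on it.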