Crawl requests in FIFO order with Scrapy

I want Scrapy to process the crawl in FIFO order. For example, I have a loop, and in that loop each element processes 3 DEPTH nodes; the second element should start only after the first one has completed its 3 DEPTH calls.

The way I would solve this is to put all the data that is needed to make a request in meta and have a parsing function that handles two cases.
The first case would handle the "3 levels deep" logic and everything related to it, and the second case would parse the main page.
After you have done that, simply return a request that calls the same function again.
The general idea is to keep all the "next step" information in meta and act on that information to separate the two cases, as in the sketch below.
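A minimal sketch of that idea, assuming hypothetical URLs and selectors; the depth counter and the remaining queue of elements travel in meta, and each callback yields at most one request, so the elements are processed strictly in order:

import scrapy


class FifoSpider(scrapy.Spider):
    name = "fifo_example"                      # hypothetical spider name
    start_urls = ["https://example.com/list"]  # hypothetical start URL

    def parse(self, response):
        depth = response.meta.get("depth", 0)
        queue = response.meta.get("queue")

        if queue is None:
            # Case 2: the main page -- collect the elements to crawl, in order.
            queue = response.css("a.item::attr(href)").getall()  # hypothetical selector
        elif depth < 3:
            # Case 1: inside an element, go one level deeper before moving on.
            next_url = response.css("a.next::attr(href)").get()  # hypothetical selector
            if next_url:
                yield response.follow(next_url, self.parse,
                                      meta={"depth": depth + 1, "queue": queue})
                return

        # The current element finished its 3 levels (or this was the main page),
        # so start the next element from the FIFO queue.
        if queue:
            yield response.follow(queue[0], self.parse,
                                  meta={"depth": 1, "queue": queue[1:]})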

Related

Too Many Child Contexts in Mule 4

Can anyone suggest the best way to page through an API in Mule 4? The examples I saw used a choice with a flow reference to call the flow again in a loop. This API doesn't return the total number of records or pages, so I need to loop through each page until it returns an empty payload.
But if I call the same flow recursively, it throws a "too many child contexts" error.
What is the ideal way to handle this scenario?
I faced the same kind of scenario recently. I made the recursive call through another flow that calls back into the parent flow; if you use a flow reference from within the same flow, Mule will not allow it.
In the child flow, I also incremented the next start index.
If you are looking for paging at the API level, you can accept the start index and the page size as the two parameters. The calls can continue until the number of returned records is less than the page size, or the API returns 0 records (see the sketch below).
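The stopping condition itself is independent of Mule; here is a minimal Python sketch of it, assuming a hypothetical endpoint that accepts startIndex and pageSize parameters and returns a JSON list of records:

import requests

BASE_URL = "https://api.example.com/records"  # hypothetical endpoint
PAGE_SIZE = 100

def fetch_all_pages():
    records, start = [], 0
    while True:
        page = requests.get(BASE_URL,
                            params={"startIndex": start, "pageSize": PAGE_SIZE}).json()
        records.extend(page)
        # Stop when the API returns an empty or short page,
        # since there is no total record count to rely on.
        if len(page) < PAGE_SIZE:
            break
        start += PAGE_SIZE
    return records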

Stop Scrapy spider when date from page is older than yesterday

This code is part of my Scrapy spider:
from datetime import datetime, timedelta
from scrapy.exceptions import CloseSpider

# scraping data from page has been done before this line (inside the parse loop)
publish_date_datetime_object = datetime.strptime(publish_date, '%d.%m.%Y.').date()
yesterday = (datetime.now() - timedelta(days=1)).date()
if publish_date_datetime_object > yesterday:
    # newer than yesterday: skip this ad and move on to the next one
    continue
if publish_date_datetime_object < yesterday:
    # older than yesterday: stop the whole crawl
    raise CloseSpider('---STOP---DATE IS OLDER THAN YESTERDAY')
# after this is ItemLoader and yield
This is working fine.
My question is: is the Scrapy spider the best place to have this code/logic?
I do not know how to implement it in another place.
Maybe it could be implemented in a pipeline, but AFAIK a pipeline is evaluated after the scraping has been done, which means I would need to scrape all ads, even those I do not need.
To give a sense of scale: 5 ads from yesterday versus 500 ads on the whole page.
I do not see any benefit in moving the code to a pipeline if that means processing (downloading and scraping) 500 ads when I only need 5 of them.
The spider is the right place if you need it to stop crawling once something indicates there is no more useful data to collect.
It is also the right way to do it: raising a CloseSpider exception with a verbose closing-reason message.
A pipeline would be more suitable only if there were items worth collecting after the threshold is detected; if they are ALL disposable, that would be a waste of resources (a pipeline-based variant is sketched below for contrast).
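For contrast, a pipeline that enforced the same rule would look roughly like this sketch (hypothetical pipeline name and item field publish_date); it can only drop items one by one after they have already been downloaded and parsed, which is exactly the wasted work described above:

from datetime import datetime, timedelta
from scrapy.exceptions import DropItem


class DropOldAdsPipeline:
    """Discards already-scraped items older than yesterday."""

    def process_item(self, item, spider):
        yesterday = (datetime.now() - timedelta(days=1)).date()
        publish_date = datetime.strptime(item['publish_date'], '%d.%m.%Y.').date()
        if publish_date < yesterday:
            # The download and parsing have already happened at this point.
            raise DropItem('older than yesterday')
        return item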

What is the best use of BusyIndicator?

I just want to understand the usage of a busy indicator: is it an alternative to a timeout / putting a wait, etc.?
For example, I have the following lines of code in mainfunct():
1. busy.show();
2. callcustom(); -- async function without a callback; this is calling XMLHttpRequest etc.
3. busy.hide();
4. callanothercustom(); -- async function without a callback
Now my questions are:
Will line 4 be executed only when busy.hide() completes, and line 3 only when line 2 has completed, whereas without the busy indicator both lines 2 and 4 would be called inside mainfunct() without waiting for line 2 to complete?
When busy.hide() is called, is there any timer set up that holds until line 2 finishes, and only then hides the indicator and calls line 4?
A BusyIndicator's show and hide functions only control when to display the indicator and when to hide it. They have no effect whatsoever on anything else going on in your code.
In other words your code is basically:
callcustom()
callanothercustom()
In your custom code you can still make sure that callanothercustom will only be called once callcustom has finished, by adding your own callback... I assume there is AJAX inside of it, so see: jQuery ajax success callback function definition
function callcustom() {
    $.ajax({
        url: 'example.com',
        type: 'GET',
        // run the second call only after the first request succeeds
        success: callanothercustom
    });
}
And then in callanothercustom you can call busy.hide()...
Or any other combination of business logic - it really depends on what's going on in your code.
In my opinion, the only major use case for a busy indicator is a long-running synchronous task that blocks the UI, say longer than 2 seconds. Hopefully, these are few and far between.
If you have async tasks, the UI is not blocked and the user can interact. If you are relying on the results for the next steps, as you imply above, then you must have a callback/promise to trigger those next steps. If you want the user to be blocked until the async task is complete, then treat it as a synchronous task and show the busy indicator.
Be aware that use of a busy indicator is now mostly seen as an anti-pattern. It's basically yelling at your user, "See how slow this app is!" Sometimes you cannot avoid dead time in your app, such as when fetching a large block of data to generate a view, but there are many ways to mitigate this. One example is to get something onto the view as fast as possible (< 1 sec) and then backfill with the larger data. Always ask yourself WHY you need this busy indicator, and whether you can work out a way to avoid it without leaving the user wondering what the app is doing.

Use Scrapy to combine data from multiple AJAX requests into a single item

What is the best way to crawl pages with content coming from multiple AJAX requests? It looks like I have the following options (given that AJAX URLs are already known):
1. Crawl AJAX URLs sequentially, passing the same item between requests
2. Crawl AJAX URLs concurrently and output each part as a separate item with a shared key (e.g. the source URL)
What is the most common practice? Is there a way to get a single item at the end, but allow some AJAX requests to fail without compromising the rest of the data?
Scrapy is built for concurrency and statelessness, so if option 2 is possible it is always preferred, from both a speed and a memory-consumption standpoint.
If the requests must be serialized, consider accumulating the partial item data in the request meta field, as in the sketch below.
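A minimal sketch of that chaining pattern, assuming two hypothetical AJAX URLs whose JSON fragments are merged into one item, with an errback so a failed part does not lose the rest:

import json

import scrapy


class CombineSpider(scrapy.Spider):
    name = "combine_ajax"                      # hypothetical spider name
    start_urls = ["https://example.com/page"]  # hypothetical page URL

    def parse(self, response):
        item = {"source_url": response.url}
        # First AJAX call; the partially built item travels in meta.
        yield scrapy.Request("https://example.com/ajax/part1",  # hypothetical AJAX URL
                             callback=self.parse_part1,
                             meta={"item": item})

    def parse_part1(self, response):
        item = response.meta["item"]
        item["part1"] = json.loads(response.text)
        # Second AJAX call, still carrying the same item.
        yield scrapy.Request("https://example.com/ajax/part2",  # hypothetical AJAX URL
                             callback=self.parse_part2,
                             errback=self.errback_part2,
                             meta={"item": item})

    def parse_part2(self, response):
        item = response.meta["item"]
        item["part2"] = json.loads(response.text)
        yield item

    def errback_part2(self, failure):
        # Yield what has been collected so far if this AJAX call fails.
        yield failure.request.meta["item"]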
Check scrapy-inline-requests. It allows you to smoothly process multiple nested requests in a single response handler.

Update HttpResponse Every Few Seconds

My Django application can create some very big SQL queries. I currently use an HttpRequest object for the data I need, then an HttpResponse to return what I want to show the user.
Obviously, I could let the user wait for a minute whilst these many sets of queries are executed and extracted from the database, and then return one monolithic HTML page.
Ideally, I'd like to update the page when I want, something like:
for i, e in enumerate(example):
    Table.objects.filter(someObjectForFilter[i])
    # Return the object to the page.
    # Then loop again, 'updating' the response after each iteration.
Is this possible?
I discovered recently that an HttpResponse can be given a generator:
from django.http import HttpResponse

def myview(request, params):
    return HttpResponse(mygenerator(params))

def mygenerator(params):
    for i, e in enumerate(params):
        yield '<li>%s</li>' % Table.objects.filter(someObjectForFilter[i])
This will progressively return the results of mygenerator to the page, wrapped in an HTML <li> for display.
Your approach is a bit flawed. You have a few different options.
The first is probably the easiest - use AJAX and HTTP requests. Have a series of these, each of which results in a single Table.objects.filter(someObjectForFilter[i]). As each one finishes, the script completes and returns the results to the client. The client updates the UI and initiates the next query via another AJAX call (a server-side sketch follows below).
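A minimal sketch of the server side of that first option, assuming a hypothetical URL pattern that passes the chunk index i and reusing Table and someObjectForFilter from the question:

from django.http import JsonResponse

def table_chunk(request, i):
    # One small query per AJAX call; the client asks for chunk i + 1
    # as soon as this response arrives.
    rows = Table.objects.filter(someObjectForFilter[int(i)])
    return JsonResponse({"chunk": int(i), "rows": list(rows.values())})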
Another method is to use a batch system. This is a bit heftier, but probably a better design if you're doing real "heavy lifting" in the database. You'll need to have a batch daemon running (a cron job works just fine for this) scanning for incoming tasks. The user wants to perform something, so their request submits that task (it could simply be a row in a database with their parameters). The daemon grabs it, processes it completely offline - perhaps even on a different machine - and updates the task row with the results when it's complete. The client can then refresh periodically to check the status of that row, via traditional or AJAX methods. A rough sketch of this pattern follows below.
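This is only a sketch under assumed names (a QueryTask model and a run_heavy_queries helper); the daemon would call process_pending() from cron while the client polls task_status():

from django.db import models
from django.http import JsonResponse


class QueryTask(models.Model):
    # Hypothetical task row: the web request only inserts one of these.
    params = models.TextField()                  # serialized query parameters
    status = models.CharField(max_length=10, default="pending")
    result = models.TextField(blank=True, default="")


def task_status(request, task_id):
    # Polled by the client, via a traditional refresh or AJAX.
    task = QueryTask.objects.get(pk=task_id)
    return JsonResponse({"status": task.status, "result": task.result})


def process_pending():
    # Run periodically by the batch daemon / cron job, outside the request cycle.
    for task in QueryTask.objects.filter(status="pending"):
        task.result = run_heavy_queries(task.params)  # hypothetical heavy-query helper
        task.status = "done"
        task.save()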