Response.css() is giving no results for pagination in scrapy crawler after login - scrapy

I want to read the 'title' of each project in a paginated list of roughly 335 records.
What I am trying to do is:
1) First I get the response in the Windows cmd shell with:
scrapy shell https://www.slingshotinsights.com/projects
2) It shows the rendered HTML in cmd, and right after that I run
response.css('a.grey-link').extract()
and press Enter, but it gives me [] (an empty list).
The question is: how do I get data from a Scrapy script for URLs that only appear after login? https://www.slingshotinsights.com/projects is the link the user lands on after logging in successfully, so maybe Scrapy cannot match the
response.css('a.grey-link').extract()
CSS selector because that content is not loaded in the logged-out view.
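For reference, this is roughly the login-then-crawl flow I expect to need (a minimal sketch; the sign-in URL and the form field names are guesses that would have to be checked against the real login form):

import scrapy


class ProjectsSpider(scrapy.Spider):
    name = "projects"
    # Assumed login page; replace with the site's real sign-in URL.
    start_urls = ["https://www.slingshotinsights.com/users/sign_in"]

    def parse(self, response):
        # Placeholder field names; copy the real input names from the login <form>.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"user[email]": "you@example.com", "user[password]": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy keeps the session cookies, so this request is made while logged in.
        yield scrapy.Request(
            "https://www.slingshotinsights.com/projects",
            callback=self.parse_projects,
        )

    def parse_projects(self, response):
        for title in response.css("a.grey-link::text").extract():
            yield {"title": title}

If the project list is filled in by JavaScript after login, even this would return an empty list and I would need to hit the underlying XHR endpoint instead.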

Related

How to use Scrapy-Splash

I am trying to use Scrapy to crawl a site whose URL does not change as you page through it. I used Splash to simulate a click, but it only takes me to the second page. How can I keep getting the next pages, and how do I crawl a site like that?

Trying to log into site with scrapy and response shows login page

I'm new to Scrapy and I'm trying to get a login working, starting in the shell. This is the site I'm trying to log into:
https://www.acdd.com/customer/account/login/
First I did
from scrapy.http import FormRequest
and then I did
token = response.xpath('//*[@name="form_key"]/@value').extract_first() to get the token and the output looks correct. I then did
FormRequest.from_response(response,formdata={'form_key': token,'login[customerid]': '12345','login[username]': 'myaddress@email.com','login[password]': 'mysecret'})
It outputs
<GET https://www.acdd.com/catalogsearch/result/?q=&login%5Bcustomerid%5D=12345&login%5Busername%5D=myaddress%40email.com&login%5Bpassword%5D=mysecret&form_key=abcdef12345>
If I do view(response) it just shows the login page and not the user page like it should. I've been following tutorials and examples, but I think there is just something different about this site compared to the simple examples I've used. I logged in with Firefox and looked in the developer tools to see what form data it POSTs, and I have all the elements. It also looks like while the form is on https://www.acdd.com/customer/account/login/, it actually posts to https://www.acdd.com/customer/account/login/Post. I've tried to just post to that page in the shell, but there are no form elements there. This is outside the basic examples I've worked with. Any help is appreciated.
You didn't select the target form, so Scrapy used the first one on the page (the search form):
FormRequest.from_response(
    response=response,
    formid="login-form",
    formdata={
        'login[customerid]': '12345',
        'login[username]': 'myaddress@email.com',
        'login[password]': 'mysecret',
        'send': "",
    },
)
Also, you don't need form_key here because Scrapy will pick it up from the form for you.
UPDATE: Try adding the send key.
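One more note for the shell workflow: building the request only prints it; it still has to be sent before view(response) shows anything new. A minimal sketch (assuming a Scrapy version whose shell fetch() accepts a Request object):

from scrapy.http import FormRequest

request = FormRequest.from_response(
    response,
    formid="login-form",
    formdata={
        'login[customerid]': '12345',
        'login[username]': 'myaddress@email.com',
        'login[password]': 'mysecret',
        'send': "",
    },
)
fetch(request)   # actually sends it; `response` is rebound to the result
view(response)   # should now open the logged-in page if the login succeeded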

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy: it's a tidy JSON object containing the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code, and that is what shows up in the Network tab.
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.
https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL. As you scroll the page, each request returns 8 records.
So grab the total record count with an XPath and divide it by 8; that gives you the number of XHR requests to make. I ran into the same issue, and the logic below resolved it:
pagination_count = response.xpath('...').get()  # XPath of the displayed total count
value = int(pagination_count) // 8
for pagination_value in range(value + 1):
    url = 'https://www.bahiablancapropiedades.com/buscar/resultados/' + str(pagination_value)
    # pass each url to your Scrapy parsing function, e.g. yield scrapy.Request(url, callback=self.parse_page)
It is not corrupted HTML; it is escaped so it doesn't break the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
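If you want to do the same thing from a spider rather than the shell, a rough sketch might look like this (the page range and the item-level selectors are placeholders; Scrapy's own Selector is used, which wraps parsel underneath):

import json

import scrapy


class LotesSpider(scrapy.Spider):
    name = "lotes"
    # One request per results page; adjust the range to the real page count.
    start_urls = [
        "https://www.bahiablancapropiedades.com/buscar/resultados/%d" % n
        for n in range(4)
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # "view" holds the escaped HTML fragment returned by the endpoint.
        sel = scrapy.Selector(text=data["view"])
        for href in sel.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Placeholder extraction; replace with the fields you actually need.
        yield {"url": response.url, "title": response.css("title::text").get()}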

Splash: get HTML before the Lua script finishes

I have a web page with heavy AJAX pagination (only a button for the next page).
To go to, say, page 5, the script has to press the Next button 5 times.
But after the script clicks, the data for the current page is lost.
Is it possible to return HTML content from the Lua script to Scrapy and then let the script continue running?
Right now I use a bad workaround: I merge the HTML for each page inside the Lua script and return it all after the last page. But I don't think that's a good approach.
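For context, the per-page alternative I am considering is one SplashRequest per target page, passing the number of clicks as an argument so that each request returns only that page's HTML. A rough sketch (the listing URL and the Next-button selector are placeholders, and it assumes scrapy-splash is already configured with its default settings, so the decoded JSON is available as response.data):

import scrapy
from scrapy_splash import SplashRequest

# Clicks the Next button args.clicks times, waiting for the AJAX content to
# load after each click, then returns only that page's HTML.
LUA_NEXT_PAGE = """
function main(splash, args)
    assert(splash:go(args.url))
    splash:wait(1)
    for i = 1, args.clicks do
        local next_button = splash:select('a.next')  -- placeholder selector
        assert(next_button)
        next_button:mouse_click()
        splash:wait(1)
    end
    return {html = splash:html()}
end
"""


class AjaxPagesSpider(scrapy.Spider):
    name = "ajax_pages"

    def start_requests(self):
        # Page N is reached by clicking Next N times; one Splash request per page.
        for page in range(5):
            yield SplashRequest(
                "https://example.com/listing",  # placeholder listing URL
                endpoint="execute",
                args={"lua_source": LUA_NEXT_PAGE, "clicks": page},
                callback=self.parse_page,
                dont_filter=True,  # same URL each time, different Lua arguments
            )

    def parse_page(self, response):
        # The Lua script returned {html = ...}; response.data holds the decoded JSON.
        sel = scrapy.Selector(text=response.data["html"])
        for title in sel.css("h2::text").getall():  # placeholder item selector
            yield {"title": title}

Each request re-clicks from the first page, so it is not fast, but every response maps to exactly one page instead of one merged blob.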

Selenium IDE - Assert that JavaScript redirect worked after clicking Ajax button

I have a button that executes an Ajax request and then it successfully redirects to another page.
How do I assert that the redirected page was successfully reached?
I have a clickAndWait on the button. But after that..?
You can use the verifyTextPresent command to verify a unique label or text on the redirected page. That way you can confirm you have successfully reached the redirected page.
Try it like this:
command : verifyTextPresent
Target : some unique text in the redirected page
I think your problem can be fixed by this.
The IDE has many assert commands. You can use any of them to achieve the goal of the test (here, whether the page was navigated to or not).
Example:
command : assertTitle
Target : the expected title of the page
The answer above used verifyTextPresent: it checks whether the text is present on the page and then continues to the next step. If you use assert commands, the test only moves on to the next step when the assert passes; otherwise it fails.
One thing to keep in mind: Selenium won't wait for Ajax-style loading, only for page loads, so you have to add explicit Wait commands to let the Ajax loading finish.
You can get more assertions when you convert the Selenese code into your preferred language and testing framework; you can see that option in the Options tab of the IDE.
For more info, see AssertCommands in Selenium IDE.
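If you do export the test to code, the explicit wait for the Ajax redirect might look roughly like this in Python (the start URL, button locator, and expected title are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://example.com/start")  # placeholder start page

# Click the button that fires the Ajax request and the redirect.
driver.find_element(By.ID, "submit-button").click()  # placeholder locator

# Selenium only waits for full page loads, so wait explicitly for evidence
# that the redirect finished - here, a word expected in the new page's title.
WebDriverWait(driver, 10).until(EC.title_contains("Dashboard"))  # placeholder title
assert "Dashboard" in driver.title

driver.quit()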