splash issue in scrapy

splash issue in scrapy - scrapy

Hi all I have seen lots of questions regarding this. I know that javascript dynamic page will rendered using scrapyjs or webdriver like selenium or phantomjs. webdriverkit is bit slow. I want somebody to guide me in this link
Price info before view deal button. I don't know which js is executing for this to use splash, scrapyjs can someone help me for this link.
thanks in advance.
EDIT
as per andres reply i have recreated XHR request. when we enter the XHR request url in browser window since it is a GET method if first hit i got partial json output. if we hit reload next time it loads more data that seems weired. can anyone help me in this. thanks in advance

When you request this URL:
http://ar.trivago.com/?iPathId=38715&iGeoDistanceItem=47160&aDateRange%5Barr%5D=2016-01-01&aDateRange%5Bdep%5D=2016-01-02&iRoomType=7&tgs=4716002&aHotelTestClassifier=&aPriceRange%5Bfrom%5D=0&aPriceRange%5Bto%5D=0&iIncludeAll=0&iGeoDistanceLimit=20000&aPartner=&iViewType=0&bIsSeoPage=false&bIsSitemap=false&
An XHR request is made to:
http://ar.trivago.com/search/region?iPathId=38715&bDispMoreFilter=false&iSlideOutItem=47160&aDateRange%5Barr%5D=2016-01-01&aDateRange%5Bdep%5D=2016-01-02&aCategoryRange=0%2C1%2C2%2C3%2C4%2C5&iRoomType=7&sOrderBy=relevance%20desc&aPartner=&aOverallLiking=1%2C2%2C3%2C4%2C5&iGeoDistanceLimit=20000&iOffset=0&iLimit=25&iIncludeAll=0&bTopDealsOnly=false&iViewType=0&aPriceRange%5Bfrom%5D=0&aPriceRange%5Bto%5D=0&iGeoDistanceItem=47160&aGeoCode%5Blng%5D=-0.1589&aGeoCode%5Blat%5D=51.513802&bIsSeoPage=false&mgo=false&bHotelTestContext=false&th=false&aHotelTestClassifier=&bSharedRooms=false&bIsSitemap=false&rp=&sSemKeywordInfo=&tgs=4716002&bRecommendedItem=false&iFilterTab=0&&_=1446673248317
Where you can find these values (in JSON format):
Which are the ones showed here:
So I think you don't need any ScrapyJS nor PhantomJS to scrape that information. Just understand where is it getting the information from and scrape the endpoint, directly.

Related

(Karate) How to intercept the XHR request response code?

I am testing a login functionality on a 3rd party website. I have this url example.com/login . When I copy and paste this into the browser (chrome), page sometimes load, but sometime does not (empty blank white page).
The problem is that I have to run a script on this page to click one of the elements (all the elements are embedded inside #shadow-root). If the page loads, no problem, script is evaluated successfully. But page sometimes does not load and it returns a 404 in response to an XHR request, and as a result, my * eval(scrip("script") step returns "js eval failed...".
So I found the solution to refresh the page, and to do that, I am considering to capture the xhr request response. If the status code is 404, then refresh the page. If not, continue with the following steps.
Now, I think this may work, but I do not know how to implement karate's Intercepting HTTP Requests. And firstly, is that something doable?
I have looked into documentation here, but could not understand the examples.
https://github.com/karatelabs/karate/tree/master/karate-netty
Meanwhile, if there is another way of refreshing the page conditionally, I will be more than happy to hear about it. Thanks anyone in advance.

First, using JavaScript you should be able to handle shadow roots: https://stackoverflow.com/a/60618233/143475
And the above answer links to advanced examples of executing JS in the context of the current page. I suggest you do some research into that, try to take the help of someone who knows JS, the DOM and HTML well - and you should be find a way to know if the XHR has been made successfully or not - for e.g. based on whether some element on the page has changed etc.
Finally here is how you can do interception: https://stackoverflow.com/a/61372471/143475

Trying to log into site with scrapy and response shows login page

I'm new to Scrapy and I'm trying to get a log in working, starting in the shell. This is the site I'm trying to log into:
https://www.acdd.com/customer/account/login/
First I did
from scrapy.http import FormRequest
and then I did
token = response.xpath('//*[#name="form_key"]/#value').extract_first() to get the token and the output looks correct. I then did
FormRequest.from_response(response,formdata={'form_key': token,'login[customerid]': '12345','login[username]': 'myaddress#email.com','login[password]': 'mysecret'})
It outputs
<GET https://www.acdd.com/catalogsearch/result/?q=&login%5Bcustomerid%5D=12345&login%5Busername%5D=myaddress%40email.com&login%5Bpassword%5D=mysecret&form_key=abcdef12345>
If I do view(response) it just shows the login page and not the user page like it should. I've been following tutorials and examples but I think maybe there is just something different about this site than the simple examples I've used. I logged in with Firefox and looked in the developer tools to see what form data it POST and I have all the elements. It also looks like while the form is on https://www.acdd.com/customer/account/login/, it actually posts to https://www.acdd.com/customer/account/login/Post. I've tried to just post to that page in the shell but there are no form elements. This is outside the basic examples I've worked with. Any help is appreciated.

You didn't select target form and Scrapy uses the first one on the page (search form):
FormRequest.from_response(
response=response,
formid="login-form",
formdata={
'login[customerid]': '12345',
'login[username]': 'myaddress#email.com',
'login[password]': 'mysecret',
'send': "",
}
)
Also you don't need form_key here because Scrapy will get it from a form for you.
UPDATE Try to add send key.

Response has nothing in it

I have been following the scrapy tutorial trying to create a very simple web scraper for warframe.market. I have about a year of coding experience from school, but no python experience. I simply want to get the price of an item from the website. I used the following to scrape the page:
scrapy shell "https://warframe.market/items/hydroid_prime_set"
then I inspected the web page to find the individual elements that I am trying to scrape. I used this command to try to view the results I wanted:
response.css("div.order-row.d-flex.col-12").extract()
This did not work, so I used view(response) to see what I had scraped, and my cmd just waits endlessly at this point.
Is HTTPS stopping me from scraping? Am I selecting the wrong css in my response? Is the webpage too big? Could someone please show me where I went wrong?
Thanks

The response isn't empty, but it's rendered using javascript (you can validate it inspecting the response.body), for example try this in the shell:
import json
data = json.loads(response.css('#application-state::text').extract_first())
for order in data.get('payload',{}).get('orders', []):
print '"{}" price: {}'.format(order.get('platinum'),
order.get('user',{}).get('ingame_name'))

scrape the reponse which would be loaded from ajax event

I am using scrapy tool to scrape content from website, i need help from you guys how to scrape the reponse which is dynamically loaded from ajax.
when content loading from ajax at that mean time url not changing it keep remains same but content would be changed so on that event i need to crawl.
thank you,
G.kavirajan

yield FormRequest('http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php',
formdata={'type':'new','ajax':'1'},
callback=self.your_callback_method)

bellow are the urls that you can easily catch using fiddler or firebug
this is for featured tab http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=random
this is for new tab http://addons.prestashop.com/en/modules/featureproduct/ajax-homefeatured.php?ajax=1&type=new
you can request on these url directly to get results you required, although website is using POST request to get data for these url, but i tried with parameter GET request is also working properly

POST a HTML Form programmatically?

I need to POST a HTML form to a 3rd party website (a Mass-SMS texting system).
In the past I've done this by forwarding to a page containing a form I've pre-populated and hidden (using display:none), then I've ran a javascript function at the end of the page to automatically submit this form.
However I'm hoping theres someway I can do all this programmatically (as I don't care about the response, and the user doesn't need to see the page the form is being posted to).
How can I do this? Cheers

You could use a WebClient.UploadValues method to send an HTTP POST request to a remote server from your code behind. Just fill up the name/value collection with the values coming from the hidden fields.

If you're willing to get into PHP, you can very easily use cURL for this.
Otherwise it's going to be quite difficult using just Javascript.
See here for a detailed tutorial.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

splash issue in scrapy - scrapy

Related

(Karate) How to intercept the XHR request response code?

Trying to log into site with scrapy and response shows login page

Response has nothing in it

scrape the reponse which would be loaded from ajax event

POST a HTML Form programmatically?

Categories

Resources