Scrapy is not returning any data after a certain level of div

I am trying to crawl this website: https://www.firstpost.com/search/sachin-tendulkar
Steps followed:
a. fetch("https://www.firstpost.com/search/sachin-tendulkar")
b. view(response) --> everything works as expected up to this point.
Once I start extracting the data with the syntax below, I can only get divs up to a certain level:
response.xpath('//div[@id="results"]').extract()
Below this div I cannot access any other divs or their content.
I haven't faced this kind of issue before when developing crawlers for other websites. Is the issue site-specific?
Can you please suggest a way to crawl the inner divs?

Can you elaborate on "not able to access any other divs and its content"? Do you get an error?
I can access all the divs and their content. For example, the main content of the search results is inside the div gsc-expansionArea, which can be accessed via
//div[@class="gsc-expansionArea"]
and this gives you an iterable to work with.
Only the first result is outside this div; it can be accessed via another div:
//div[@class="gsc-webResult gsc-result"]
The last sibling, //div[@class="gcsc-branding"], has no search results in it.
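
As a rough illustration of the attribute predicates above, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree as a stand-in for response.xpath, and the SAMPLE markup and the hrefs_in helper are assumptions made up for this example, not the real page. The point is only that attributes are matched with @class or @id inside the brackets.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the search results markup.
SAMPLE = """
<html><body>
  <div id="results">
    <div class="gsc-webResult gsc-result"><a href="/first">First</a></div>
    <div class="gsc-expansionArea">
      <div class="gsc-webResult gsc-result"><a href="/second">Second</a></div>
      <div class="gsc-webResult gsc-result"><a href="/third">Third</a></div>
    </div>
    <div class="gcsc-branding">branding</div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())


def hrefs_in(tree, class_name):
    """Collect link targets inside every div whose class attribute equals class_name."""
    hrefs = []
    # Note the "@": [@class='...'] tests the attribute, [class='...'] would not.
    for div in tree.findall(f".//div[@class='{class_name}']"):
        hrefs.extend(a.get("href") for a in div.iter("a"))
    return hrefs


expansion_links = hrefs_in(root, "gsc-expansionArea")
all_result_links = hrefs_in(root, "gsc-webResult gsc-result")
```

In a Scrapy shell the equivalent call would be along the lines of response.xpath('//div[@class="gsc-expansionArea"]//a/@href').getall(), assuming the divs are present in the downloaded HTML.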

Related

How to get data in dashboard with Scrapy?

I'm scraping car rental data from getaround.com. I recently saw that it is possible to get car availability with scrapy-splash from a calendar rendered with JavaScript. An example is given at this URL:
https://fr.getaround.com/location-voiture/liege/ford-fiesta-533656
The information I need is contained in the div tag with class owner_calendar_month. However, I saw that some data seems to be accessible in the div tag with class js_car_calendar calendar_large, whose data-path attribute specifies /dashboard/cars/533656/calendar. Do you know how to access this path, and how to scrape the data within it using Scrapy?

If you visit https://fr.getaround.com/dashboard/cars/533656/calendar you get an error saying you have to be logged in to view the data. So first of all you would have to create a method in Scrapy to sign in to the website if you want to be able to scrape that data.

XHR request pulls a lot of HTML content, how can I scrape it/crawl it?

So, I'm trying to scrape a website with infinite scrolling.
I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016
But the example given looks pretty easy, it's an orderly JSON object with the data you want.
I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000
The XHR response for each page is weird; it looks like corrupted HTML code.
This is how the Network tab looks
I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information for every one.
In the past I've successfully done this with normal pagination and rules guided by XPaths.

https://www.bahiablancapropiedades.com/buscar/resultados/0
This is the XHR URL.
While scrolling, the page loads 8 records per request.
So get the total number of records via XPath and divide it by 8; that gives you the number of XHR requests to make.
I had the same issue, and the logic below resolved it:
import math
pagination_count = response.xpath("<XPath of the displayed record count>").get()
pages = math.ceil(int(pagination_count) / 8)  # 8 records per XHR request
for page in range(pages):  # page numbers start at 0
    url = "https://www.bahiablancapropiedades.com/buscar/resultados/" + str(page)
    # pass this url to your Scrapy callback

It is not corrupted HTML, it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).
You can try the following in scrapy shell to get all the links in one of the "next" pages:
scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3
import json
import parsel
json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view']) # view contains the HTML
sel.css('a::attr(href)').getall()
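
To see why the response only looks corrupted, here is a small self-contained sketch that does not need Scrapy or a network connection. The RAW payload is a made-up example shaped like the XHR response (a "view" key holding escaped HTML), and it uses the standard library's html.parser in place of parsel, so treat it as an illustration rather than the site's real data.

```python
import json
from html.parser import HTMLParser

# Hypothetical payload: quotes and slashes in the HTML are escaped for JSON.
RAW = r'{"view": "<div class=\"item\"><a href=\"/propiedad/1\">Lote 1<\/a><\/div><div class=\"item\"><a href=\"/propiedad/2\">Lote 2<\/a><\/div>"}'


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_from_view(raw_json):
    # json.loads undoes the escaping, leaving ordinary HTML in "view".
    html = json.loads(raw_json)["view"]
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


links = links_from_view(RAW)
```

Inside a spider, each extracted link could then be followed with response.follow so that the item pages are parsed by a separate callback.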

Modifying photosphere on website thing

What I am trying to do is use a photosphere on my website so that it shows up full screen as the site's cover page. The problem is that the code to embed a photosphere in a webpage, given here by Google
https://developers.google.com/photo-sphere/web/
only lets the photosphere size be hardcoded as
displaysize="600,400"
Whatever the values, the size is still hardcoded. What I want is for it to adjust to the user's screen and fill the whole browser window. Does anyone have an idea how to pull this off? I didn't find anything about 'photosphere on web' other than the Google link I gave above.

Indeed the API is currently designed to take static values. I think it's a good point that users might want to set the dimensions to 100% and let it resize dynamically.
I put it on the TODO list and will try to get to it shortly.
In the meantime, one workaround is the following: after the viewer loads, you will find an iframe on the page that contains it. You can change its dimensions dynamically to your liking and the viewer should adapt.

The API provided by Google wraps the whole photosphere in layers of iFrames.
You can use the API to request a certain photosphere but only use the response to parse it for the values you need. Then you create your own request and the result can be shown fullscreen.
An example link is this
I created this link dynamically from the JSON response, using the elements media$group > media$content > 0 > url.
Hope it helps.

Can't you take the raw image and just use webgl to project it on the inside of a sphere?

Drupal - Simple Edit can't find

Ok, I have been searching for days on how to fix the vimeo urls on this page: https://www.createjobsforusa.org
Basically, I just got an SSL certificate and I'd like to change the Vimeo video links from http://{the vimeo url} to https://{the vimeo url}. A simple edit is all I need, but I can't find where the videos are located.
Content Blocks? All I get are settings for this. Pages? So, I go to "Content" and I see a huge list of pages in there, I see a page called "Home", so I click on the Edit link and the body of the page is blank? Ok, so this has to be coming from someplace else, but where?
Can someone please help me with how to find the vimeo video URLs and change them to "https://" instead of "http://"
I think the View is called: A-Spots... here are pics of what I get when I click on the Edit A-Spot View:
What exactly am I supposed to do here? Seems like so much to do, but every option I seem to choose still doesn't give me the option to change the vimeo URLs.

A view just selects nodes (or other entities) to show them. If you edit a view, you only change the way those nodes are selected or shown; you don't edit the nodes the view selects.
If you look at the preview of that view, you will notice it shows some numbers; those numbers are node IDs. Just edit the node from https://www.createjobsforusa.org/node/55291246/edit; replace 55291246 with the other IDs shown, and you will be able to edit all the nodes used from the view.
If that doesn't work, https://www.createjobsforusa.org/admin/content lists all the content in the site. Just look for the nodes whose ID is the one shown in that preview, and edit them.

Google+ : Multiple +1 on same page, different content

I've tried to find an answer to this (both in the dev docs and here), but with no luck.
The "+1 button" works fine on normal pages (where there's just the single +1). But I have a page with multiple entities (to use the terms of Drupal: A View displaying multiple nodes) where I'd like to add "share buttons". So far I've added Twitter and Facebook.
Twitter is the simplest, as it just takes the string you give it.
Facebook takes a URL, but you can specify your own URL.
When I try to specify my own url for +1 I get this Error:
Unsafe JavaScript attempt to access frame with URL http://one80.seasites.se/whats-up from frame with URL https://plusone.google.com/_/+1/hover?hl=sv&url=http%3A%2F%2Fone80.seasites.se%2Fwhats-up%2Fl%25C3%25B6rdag&t=1342724634133&source=widget&isSet=false&referer=http%3A%2F%2Fone80.seasites.se%2Fwhats-up&jsh=m%3B%2F_%2Fapps-static%2F_%2Fjs%2Fgapi%2F__features__%2Frt%3Dj%2Fver%3Dr4LFRxx-_oY.sv.%2Fsv%3D1%2Fam%3D!ZCfx2q5v6YmYvWjcTQ%2Fd%3D1%2Frs%3DAItRSTNI50TT3SY8R9klRLc_1sBJ5_Rp3g#id=I3_1342724634541&parent=http%3A%2F%2Fone80.seasites.se&rpctoken=619983104&_methods=mouseEvent%2CtrackingEvent%2ConVisibilityChanged%2C_onopen%2C_ready%2C_onclose%2CcloseOrHideThisBubble%2C_close%2C_open%2C_resizeMe%2C_renderstart. Domains, protocols and ports must match.
rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:173
ec.a.v rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:173
xh rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:203
q.get rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:211
ec.w rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:173
Rh rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:208
q.w rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:220
Rb rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:30
Xg rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:187
(anonymous function) rs=AItRSTOQ10u7fGwgD-LqzsOa-fsgdlhDCg:226
To explain why I want to use separate URL:
Every node is something like an event, and every node has its own URL (whose page contains an image and text/info). So when you click Like (for FB) it gets the title, info and image and includes them in the post (so it says "What's up - Gathering" instead of a generic "What's up" with no image or the same image).
I'd like to accomplish the same with G+.
Is there a way to accomplish this for G+?? Have I missed something??
I guess one way to do this is by using an iframe for each of the nodes and pull in a special version of the "node page" with just the g+-button. But that's a pretty nasty hack (and not that fun to set up).
Any ideas are welcome!

The error you're seeing is actually due to an issue in Chrome. The +1 button should automatically recover.
You can explicitly specify target pages by using the href attribute. Your markup will look like this in practice:
<g:plusone href="http://example.com/targeturl"></g:plusone>
Or like this with HTML5 syntax:
<div class="g-plusone" data-href="http://example.com/targeturl"></div>
If these don't work, can you share a link to a page where you're seeing it not work? I can take a look :)
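
For a page that lists several nodes, the same markup can be generated once per node. The following is only a sketch in Python of that idea: plusone_markup and the example.com URLs are hypothetical, and on a Drupal site the equivalent would live in the view or node template rather than in Python.

```python
from html import escape


def plusone_markup(node_urls):
    """Return one HTML5-style +1 button per node URL, each with its own data-href."""
    return [
        f'<div class="g-plusone" data-href="{escape(url, quote=True)}"></div>'
        for url in node_urls
    ]


# Hypothetical node URLs, one per entity shown on the listing page.
buttons = plusone_markup([
    "http://example.com/whats-up/gathering",
    "http://example.com/whats-up/concert?day=1&slot=2",
])
```

Escaping the URL keeps characters such as & and quotes from breaking the attribute value.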
