Scrapy LinkExtractor with svg element as Next button - scrapy

I am using a CrawlSpider that recursively follow links calling the next page using a link extraction like:
rules = (
    Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=('//a[contains(.,"anextpage")]')),
        callback='parse_method',
        follow=True),
)
I have applied this strategy to recursively crawl different websites, and as long as there was text inside the <a> tag (e.g. <a>sometext</a>), everything worked fine.
I am now trying to scrape a website that has an
<div class="bui-pagination__item bui-pagination__next-arrow">
<a class="pagenext" href="/url.html" aria-label="Pagina successiva">
<svg class="bk-icon -iconset-navarrow_right bui-pagination__icon" height="18" role="presentation" width="18" viewBox="0 0 128 128">
<path d="M54.3 96a4 4 0 0 1-2.8-6.8L76.7 64 51.5 38.8a4 4 0 0 1 5.7-5.6L88 64 57.2 94.8a4 4 0 0 1-2.9 1.2z"></path>
</svg>
</a>
</div>
as a 'next' button instead of simple text, and my LinkExtractor rule does not seem to apply anymore, and the spider stops after the first page.
I have tried to look for the svg element, but that doesn't seem to trigger the extraction:
restrict_xpaths=('//a[contains(.,name()=svg) and contains(@class,"nextpageclass")]'))
Is there anything I am missing?

That's most probably because the site uses JavaScript. You may need to use Splash to simulate clicks, navigate, and get back pre-rendered pages. This is a good place to start:
https://docs.scrapy.org/en/latest/topics/dynamic-content.html
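If the pagination anchor does turn out to be present in the raw HTML (i.e. it is not injected by JavaScript), another option is to make the rule match the <a> by its class or by the <svg> it wraps, instead of by its text. A minimal sketch, assuming the markup shown in the question (spider name, URL and callback body are placeholders):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NextArrowSpider(CrawlSpider):
    name = 'next_arrow'
    start_urls = ['https://example.com/']  # placeholder

    rules = (
        Rule(LinkExtractor(
                # match the anchor by class or by its <svg> child,
                # rather than by its (empty) text content
                restrict_xpaths=('//a[contains(@class, "pagenext") or svg]',)),
            callback='parse_method',
            follow=True),
    )

    def parse_method(self, response):
        pass  # extraction logic goes here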

Related

eclipse xpath element button location

I cannot locate this element by XPath or by any other means.
<button tabindex="0" class="jss65 jss59" type="button">
<span class="jss64">
<svg class="jss68" focusable="false" viewBox="0 0 24 24" aria-hidden="true" role="presentation">
<path d="M15.41 7.41L14 6l-6 6 6 6 1.41-1.41L10.83 12z"></path>
<path fill="none" d="M0 0h24v24H0z"></path>
</svg>
</span>
<span class="jss77"></span>
</button>
You can locate this button using relative XPaths. For example, the following relative XPath worked for me when locating this element:
//button[@class="jss65 jss59"]
For further information on how to customize your relative XPath, I would encourage reading up on this article: https://www.guru99.com/xpath-selenium.html
Personally, however, if you have the option of changing this element, I would highly encourage the use of id within the <button> tag. This way, you have the option of locating the element by id, rather than by relying on xPath. Since id is designed to always be unique per page, it can often be a much easier and reliable solution over xPath when locating an element.
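For reference, here is a minimal sketch of the same lookup with the Selenium WebDriver Python bindings (the class names come from the snippet above; the URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL

# locate the button via the relative XPath suggested above, then click it
button = driver.find_element(By.XPATH, '//button[@class="jss65 jss59"]')
button.click()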

Why does nth-child equivalent work for css selector but not xpath?

I am trying to get the XPath equivalent of this CSS selector on this website.
To get 6 elements I add div:nth-child(1). For XPath it would be //div[1], yet this makes no difference. I want all 6 numbers under the left result tab.
Css:
div:nth-child(1) > proposition-return > div > animate-odds-change > div > div
Returns 6 elements
xpath (quite similar):
//div[#class='propositions-wrapper'][1]//div[contains(#class, 'proposition-return')]//animate-odds-change//div//div
Returns 18
I desire 6.
<div ng-repeat="odd in odds" class="animate-odd ng-binding ng-scope" ng-class="{
'no-animation': oddsShouldChangeWithoutAnimation
}" style="">2.05</div>
Those are totally different selectors:
CSS: div:nth-child(1) > proposition-return > div > animate-odds-change > div > div
Find every div that is the first child of its parent, then get its direct proposition-return child (I assume you meant to use it as .proposition-return), then its direct div child, and so on.
XPATH: //div[@class='propositions-wrapper'][1]//div[contains(@class, 'proposition-return')]//animate-odds-change//div//div
Find all div elements with propositions-wrapper as class, then take only the first. After that, find all div elements with the proposition-return class that are descendants of the previous element, and so on.
[1] is very different from :nth-child(1) to begin with, and // is not the same as >; / is.
For getting the specific elements that you want, I would use this xpath:
//div[contains(#class, "template-item")]//div[#data-test-match-propositions]/div[1]//div[contains(#class, "animate-odd")]
That site is not available worldwide, but looking at the image you provided helps. As eLRuLL points out, your XPath is not equivalent to the CSS.
Try getting rid of some of the double slashes:
//div[@class='propositions-wrapper']//div[contains(@class, 'proposition-return')]/animate-odds-change/div/div[contains(@class, 'animate-odd')]

Selenium find all the elements which have two divs

I am trying to collect texts and images from a website to help collect missing people related tweets. Here is the problem:
Some tweets don't have images so the corresponding <div class='c' ....> has only one <div>...</div>.
Some tweets have images, so the corresponding <div class='c' ....> has two <div>...</div>, as shown in the following codes:
<div class='c' id="M_D*****">
<div>...</div>
and
<div class='c' id="M_D*****">
<div>...</div>
<div>...</div>
I intend to check whether a tweet has an image, i.e. find out whether the corresponding <div class='c' ....> has two <div>...</div>.
PS: The following code is used to collect all the texts and image URLs, but not all tweets have images, so I want to match them by solving the above problem.
tweets = browser.find_elements_by_xpath("//span[#class='ctt']")
graph_links = browser.find_elements_by_xpath("//img[#alt='img' and #class='ib']")
This is a public welfare program which aims to help missing people get back home.
By collecting the text and the images separately, I think that it's going to be impossible to match the text with the related image after the fact. I would suggest a different approach. I would search for the <div class='c'...> that contains both the text and the optional image. Once you have the "container" DIV, you can then get the text and see if an image exists and put them all together. Without all the relevant HTML, you may have to tweak the code below but it should give you an idea on how to approach this.
containers = browser.find_elements_by_css_selector("div.c")
for container in containers:
    # the tweet text
    print(container.find_element_by_css_selector("span.ctt").text)
    images = container.find_elements_by_css_selector("img.ib")
    if len(images) > 0:  # see if the image exists
        print(images[0].get_attribute("src"))  # the URL of the image
    print("-------------")  # separator between tweets
The HTML you provided is probably not enough, but based on it I suggest the XPath //div[@id='M_D*****' and ./div//img], which finds a div with the specified id that contains a div with an image.
But to answer your question directly:
//div[./div[2] and not(./div[3])] will find all divs with exactly 2 div children
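A small sketch of how that check could be combined with the container approach above, reusing the selectors from the question (untested against the real page, so treat it as an illustration only):
# containers with exactly two child <div> elements, i.e. tweets that carry an image
tweets_with_images = browser.find_elements_by_xpath(
    "//div[@class='c'][./div[2] and not(./div[3])]")
for tweet in tweets_with_images:
    text = tweet.find_element_by_xpath(".//span[@class='ctt']").text
    image_url = tweet.find_element_by_xpath(".//img[@class='ib']").get_attribute("src")
    print(text, image_url)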

Scrapy, javascript form, not crawling next page

I am having an issue. I am using Scrapy to extract data from HTML tables that are displayed after a form search. The problem is that it will not continue to crawl to the next page. I have tried multiple combinations of rules. I understand that it is not recommended to override the default parse logic in CrawlSpider. I have found many answers that fix other people's issues, but I have not been able to find a solution in which a form POST must occur first. Looking at my code, I see that it requests the allowed URLs, then POSTs to search.do, and the results are returned as an HTML-formatted results page, at which point the parsing begins. Here is my code (I have replaced the real URL with nourl.com):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request
from EMD.items import EmdItem

class EmdSpider(CrawlSpider):
    name = "emd"
    start_urls = ["https://nourl.com/methor"]

    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div//div//div//span[@class="pagelinks"]/a[@href]'))),
        Rule(SgmlLinkExtractor(allow=('')), callback='parse_item')
    )

    def parse_item(self, response):
        url = "https://nourl.com/methor-app/search.do"
        payload = {"county": "ANDERSON"}
        return FormRequest(url, formdata=payload, callback=self.parse_data)

    def parse_data(self, response):
        print(response)
        sel = Selector(response)
        items = sel.xpath('//td').extract()
        print(items)
I have left allow=('') blank because I have tried so many combinations of it. Also, my XPath leads to this:
<div align="center">
<div id="bg">
<!--
Main Container
-->
<div id="header2"></div>
<!--
Content
-->
<div id="content">
<!--
Hidden/Accessible Headers
-->
<h1 class="hide"></h1>
<!--
InstanceBeginEditable name="Content"
-->
<h2></h2>
<p align="left"></p>
<p id="printnow" align="center"></p>
<p align="left"></p>
<span class="pagebanner"></span>
<span class="pagelinks">
[First/Prev]
<strong></strong>
,
<a title="Go to page 2" href="/methor-app/results.jsp?d-49653-p=2"></a>
,
<a title="Go to page 3" href="/methor-app/results.jsp?d-49653-p=3"></a>
[
/
]
</span>
I have checked with multiple tools and my XPath is correctly pointing to the URLs of the next pages. My output in the command prompt is only grabbing data from the first page. I have seen a couple of tutorials where the code contains a yield statement, but I am not sure what that does other than "tell the function that it will be used again later without losing its data". Any ideas would be helpful. Thank you!!!
It may be because you need to select the actual URL in your rule, not just the <a> node. [...] in XPath is used to express a condition, not to select something. Try:
//span[#class="pagelinks"]/a/#href
Also a few comments:
How did you find this HTML? Beware of tools that find XPaths for you, as the HTML retrieved with browsers and with Scrapy may differ, because Scrapy doesn't handle JavaScript (which can be used to generate the page you're looking at), and some browsers also try to sanitize the HTML.
It may not be the case here, but the "javascript form" in a Scrapy question spooked me. You should always check that the content of response.body is what you expect.
//div//div//div is exactly the same as //div. The double slashes mean we don't care about the structure anymore, just select all the nodes named div among the descendants of the current node. That is also why //span[...] might do the trick here.
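If the rule still doesn't pick the links up, an alternative (a sketch of a common Scrapy pattern, not part of the answer above, and assuming a reasonably recent Scrapy where response.xpath() and response.urljoin() are available) is to follow the pagination links manually from the callback, using that same XPath:
def parse_data(self, response):
    # scrape the current results page
    for cell in response.xpath('//td/text()').extract():
        yield {'cell': cell}

    # follow each "Go to page N" link inside the pagelinks span
    for href in response.xpath('//span[@class="pagelinks"]/a/@href').extract():
        yield Request(response.urljoin(href), callback=self.parse_data)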

CSS locator for corresponding xpath for selenium

Some part of the HTML of the webpage which I'm testing looks like this:
<div id="twoWideCallouts">
<div class="callout">
<a target="_blank" href="http://facebook.com">Facebook</a>
</div>
<div class="callout last">
<a target="_blank" href="http://youtube.com">Youtube</a>
</div>
I have to check, using Selenium, that when I click on the text, the URL opened is the same as the one given in href and not an error page.
Using Xpath I've written the following command
//i is iterator
selenium.getAttribute("//div[contains(#class, 'callout')]["+i+"]/a/#href")
However, this is very slow and it doesn't work for some of the links. By reading many answers and comments on this site I've come to know that CSS locators are faster and cleaner to maintain, so I wrote it again as
css = div:contains(callout)
Firstly, I'm not able to reach the anchor tag.
Secondly, this page can have any number of divs where class = callout. Using getXpathCount I can get the count of these, and I'll be iterating over that count and performing the href check. How can something similar be done using a CSS locator?
Any help would be appreciated.
EDIT
I can click on the link using the locator css=div.callout a, but when I try to read the href value using String str = "css=div.callout a[href]"; and selenium.getAttribute(str); I get the error "element not found". The console output is given below.
19:12:33.968 INFO - Command request: getAttribute[css=div.callout a[href], ] on session
19:12:33.993 INFO - Got result: ERROR: Element css=div.callout a[href not found on session
I tried to get the href attribute using xpath like this
"xpath=(//div[contains(#class, 'callout')])["+1+"]/a/#href" and it worked fine.
Please tell me what should be the corresponding CSS locator for this.
It should be -
css = div:contains(callout)
Did you notice the ":" instead of the "." you used?
For CSSCount this might help -
http://www.eviltester.com/index.php/2010/03/13/a-simple-getcsscount-helper-method-for-use-with-selenium-rc/
#
On a different note, did you see the proposal for a new Selenium site on Area 51 - http://area51.stackexchange.com/proposals/4693/selenium.
#
To read the attribute I used css=div.callout a@href and it worked. The problem was with the use of square brackets around the attribute name.
For the first part of your question, anchor your identifier on the hyperlink:
css=a[href="http://youtube.com"]
For achieving a count of elements in the DOM, based on CSS selectors, here's an excellent article.
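For completeness, here is roughly the same loop with the Selenium WebDriver Python bindings rather than Selenium RC (a sketch; the URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL

# collect every callout link and read its href, mirroring the XPath loop above
for link in driver.find_elements(By.CSS_SELECTOR, 'div.callout a'):
    print(link.text, link.get_attribute('href'))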