Scrapy next href with rel="next" - scrapy

In the Scrapy example I found this line:
next_page = response.css('div.prev-post > a ::attr(href)').extract_first()
Now I want to select the first link that has rel="next" on it and extract its href.
I tried it with
next_page = response.css('div.prev-post > a[@rel="next"] ::attr(href)').extract_first()
but it doesn't work.
How can I do that?
Thanks,
Joni

You are combining CSS selectors with XPath selectors.
With CSS:
'a[rel="next"]::attr(href)'
With XPath:
'//a[@rel="next"]/@href'
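For instance, a minimal sketch of either selector inside a parse callback (the response.follow call assumes Scrapy 1.4+; the spider context is illustrative):
def parse(self, response):
    # CSS: attribute conditions go in square brackets, with no '@'
    next_page = response.css('a[rel="next"]::attr(href)').extract_first()
    # XPath equivalent:
    # next_page = response.xpath('//a[@rel="next"]/@href').extract_first()
    if next_page:
        # follow the (possibly relative) link and keep parsing
        yield response.follow(next_page, callback=self.parse)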

Related

How can I get the number of div inside a div in scrapy with css?

I would like to count the number of div elements (so 2) inside the div "thumb-container".
So far I have used CSS selectors, so I would like to keep using CSS and not XPath:
yield {
    'date': datetime.date.today(),
    'title': response.css('h1::text').extract()[-1],
    'rating': response.css('bl-rating::attr(rating)').get(),
}
getall() returns a list, so you can take its length.
With the code you provided (next time, post it as text and not as an image):
In [1]: len(response.css('div.thumb-container div').getall())
Out[1]: 2
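If you want the count as part of your item, you can fold it into the dict you already yield (a sketch reusing the field names from the snippet above; it assumes the same spider context, including the datetime import):
yield {
    'date': datetime.date.today(),
    'title': response.css('h1::text').extract()[-1],
    'rating': response.css('bl-rating::attr(rating)').get(),
    # number of div elements inside div.thumb-container
    'div_count': len(response.css('div.thumb-container div').getall()),
}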

Pandas web scraping (Beautiful Soup): find, inside a tag with a class, another tag with a link, then follow the link inside its href

I tried to find a 'td' tag with a specific attribute, and then find an 'a' tag inside of the 'td' tag:
for row in bs4.find_all('<td class="series-column"'):
    for link in bs4.find_all('a'):
        if link.has_attr('href') and (link.has_attr('class') == 'formatted-title external-link result-url'):
            print(link.attrs['href'])
On the screenshot you can see the HTML for this page.
Your bs4.find_all('<td class="series-column"') is wrong. You have to supply the tag name and the attributes you want to find, for example bs4.find_all('td', class_='series-column'). Or use a CSS selector:
from bs4 import BeautifulSoup

txt = '''
<td class="series-column">
    <a class="formatted-title external-link result-url" href="//knoema.com/...">link text</a>
</td>'''

soup = BeautifulSoup(txt, 'html.parser')
for link in soup.select('td.series-column a.formatted-title.external-link.result-url'):
    print(link['href'])
Prints:
//knoema.com/...
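For comparison, here is a sketch of the find_all version mentioned above, reusing the same soup object:
# find_all version: class_ matches when the tag has that class
# among its classes, so one class name from the list is enough
for td in soup.find_all('td', class_='series-column'):
    for link in td.find_all('a', class_='result-url'):
        print(link['href'])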

Scrapy, No Errors, Spider closes after crawling

for restaurant in response.xpath('//div[@class="listing"]'):
    restaurantItem = RestaurantItem()
    restaurantItem['name'] = response.css(".title::text").extract()
    yield restaurantItem
next_page = response.css(".next > a::attr('href')")
if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, self.parse)
I fixed all the errors it was giving me, and now I am getting none. The spider just closes after crawling the start_url; the for loop never gets executed.
When you try to find an element this way:
response.xpath('//div[@class="listing"]')
you are saying: I want to find a div whose class attribute is literally just "listing":
<div class="listing"></div>
But this doesn't exist anywhere in the DOM; what's actually there is something like this:
<div class="listing someOtherClass"></div>
To select the above element, you have to say that the attribute contains a certain value but may contain more, like this:
response.xpath('//div[contains(@class, "listing")]')
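Putting it together, a sketch of the corrected loop (it reuses the question's RestaurantItem; it also queries relative to each restaurant so every item gets its own title, and uses the lowercase response.urljoin() spelling; both are extra fixes beyond the contains() change):
def parse(self, response):
    for restaurant in response.xpath('//div[contains(@class, "listing")]'):
        restaurantItem = RestaurantItem()
        # query relative to the current restaurant, not the whole page
        restaurantItem['name'] = restaurant.css(".title::text").extract_first()
        yield restaurantItem
    next_page = response.css(".next > a::attr(href)").extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), self.parse)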

Scrapy Running Results

Just getting started with Scrapy, I'm hoping for a nudge in the right direction.
I want to scrape data from here:
https://www.sportstats.ca/display-results.xhtml?raceid=29360
This is what I have so far:
import scrapy
import re

class BlogSpider(scrapy.Spider):
    name = 'sportstats'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        headings = []
        results = []
        tables = response.xpath('//table')
        headings = list(tables[0].xpath('thead/tr/th/span/span/text()').extract())
        rows = tables[0].xpath('tbody/tr[contains(@class, "ui-widget-content ui-datatable")]')
        for row in rows:
            result = []
            tds = row.xpath('td')
            for td in enumerate(tds):
                if headings[td[0]].lower() == 'comp.':
                    content = None
                elif headings[td[0]].lower() == 'view':
                    content = None
                elif headings[td[0]].lower() == 'name':
                    content = td[1].xpath('span/a/text()').extract()[0]
                else:
                    try:
                        content = td[1].xpath('span/text()').extract()[0]
                    except:
                        content = None
                result.append(content)
            results.append(result)
        for result in results:
            print(result)
Now I need to move on to the next page, which I can do in a browser by clicking the "right arrow" at the bottom, which I believe is the following li:
<li><a id="mainForm:j_idt369" href="#" class="ui-commandlink ui-widget fa fa-angle-right" onclick="PrimeFaces.ab({s:"mainForm:j_idt369",p:"mainForm",u:"mainForm:result_table mainForm:pageNav mainForm:eventAthleteDetailsDialog",onco:function(xhr,status,args){hideDetails('athlete-popup');showDetails('event-popup');scrollToTopOfElement('mainForm\\:result_table');;}});return false;"></a>
How can I get scrapy to follow that?
If you open the URL in a browser with JavaScript disabled, you won't be able to move to the next page. As you can see inside the li tag, there is some JavaScript that has to be executed in order to get the next page.
To get around this, the first option is usually to try to identify the request generated by the JavaScript. In your case it should be easy: just analyze the JavaScript code and replicate it with Python in your spider. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use a package with JavaScript/browser emulation, something like ScrapyJS or Scrapy + Selenium.
You're going to need to perform a callback. Generate the URL from the XPath of the 'next page' button, so url = response.xpath(<xpath to next_page_button>), and when you're finished scraping the current page, do yield scrapy.Request(url, callback=self.parse_next_page). Finally, you create a new function, def parse_next_page(self, response):.
One final note: if the link happens to be generated by JavaScript (and you can't scrape it even when you're sure you're using the correct XPath), check out my repo on using Splash with Scrapy: https://github.com/Liamhanninen/Scrape
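If you do manage to recover a real next-page URL (e.g. by replicating the AJAX request described above), the callback pattern sketched here would look roughly like this; the selector and names are illustrative, and this alone will not paginate this JavaScript-driven page, since the anchor's href is just "#":
import scrapy

class SportstatsSpider(scrapy.Spider):
    name = 'sportstats_sketch'
    start_urls = ['https://www.sportstats.ca/display-results.xhtml?raceid=29360']

    def parse(self, response):
        # ... scrape the current page as in the question ...
        # Hypothetical: assumes a usable next-page URL was recovered;
        # on this site the href is "#", so the guard below skips it.
        next_href = response.xpath('//a[contains(@class, "fa-angle-right")]/@href').get()
        if next_href and next_href != '#':
            yield scrapy.Request(response.urljoin(next_href),
                                 callback=self.parse_next_page)

    def parse_next_page(self, response):
        # often this can simply delegate back to parse()
        return self.parse(response)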

How can i click on a nested anchor href tag using selenium webdriver?

This is my HTML code:
YIKKU, TFYTUR
I want to click on the link named YIKKU, TFYTUR. I have tried the following, but nothing worked:
driver.findElement(By.partialLinkText("YIKKU, TFYTUR")).click();
driver.findElement(By.cssSelector("a[href*='Y']")).click();
Can anyone please help me?
The only solution for these kinds of href tags is to find the nearest element with an "id"; in my case it was this:
<table id="resSearchResultsTBL">
Then find this element using a CSS selector:
WebElement guest = driver.findElement(By.cssSelector("table[id='resSearchResultsTBL']"));
and then find the "a href" sub-element within it:
guest.findElement(By.cssSelector("a[href*='guestProfile.do']")).click();
This worked perfectly for me. :)
Try:
WebElement link = driver.findElement(By.xpath("//a[@name=\"Y\"]"));
wait.until(ExpectedConditions.elementToBeClickable(link));
link.click();
or:
WebElement link = driver.findElement(By.xpath("//a[@target=\"sgr\"]"));
wait.until(ExpectedConditions.elementToBeClickable(link));
link.click();