Unable to select element using Scrapy shell - scrapy

I'm trying to print out all the titles of the products of this website using scrapy shell: 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
Once it is open I start fetching:
fetch('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas')
Then I try to print out the title of each product, but nothing is selected:
>>> response.css('.shelfProductTile-descriptionLink::text')
output: []
Also tried:
>>> response.css('a')
output: []
How can I fix this? Thanks

Your code is correct. What happens is that there are no a elements in the HTML retrieved by scrapy. When you visit the page with your browser, the product list is populated with JavaScript, on the browser side; the products are not in the HTML source.
In the docs you'll find techniques to pre-render JavaScript. Maybe you should try that.
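For example, with scrapy-splash installed and a Splash instance running on localhost (both assumptions here, and the spider name is made up for illustration), a spider could render the page before parsing, roughly like this:

import scrapy
from scrapy_splash import SplashRequest

class IcedTeaSpider(scrapy.Spider):
    name = 'icedteas'  # hypothetical spider name

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # Ask Splash to render the page, giving the JavaScript time to populate the product list
        yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        # The selector from the question should now match the rendered HTML
        for title in response.css('.shelfProductTile-descriptionLink::text').getall():
            yield {'title': title}

(This also needs the scrapy-splash middleware and SPLASH_URL configured in settings.py, per that project's README.)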

Related

How to find the XPath of any "city" in a dynamic dropdown?

I installed the Chrome addon "Selectors Hub".
I opened the site: spicejet.com
I chose some random city in the "from" dropdown.
With the help of the "Selectors Hub" Chrome addon, I grabbed the XPath of
that city:
//div[@class='css-1dbjc4n r-14lw9ot r-z2wwpe r-vgw6uq r-156q2ks r-urutk0 r-8uuktl r-136ojw6']//div[11]
While validating this XPath in the console, it shows 0 matches.
This website is built with ReactJS on the front end, if I am not wrong, and finding the elements of a ReactJS website is a bit challenging; adding to that, if you rely on a locator-finding tool, it gets more difficult. It's always better to build your own locator strategy than to rely on tools, especially for websites built with React, Vue, etc.
Having said that, the strategy here is to find a relatively narrowed-down relative locator, and then, since you are looking for a random selection of city, collect all the cities first and apply random to them. Here is what I figured:
I collected the cities, but along with them came some unwanted items (courtesy of my relative locator), so I check each pick and pass on the unwanted ones; only when an actual city is selected at random do I click on it. Check this code:
import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.spicejet.com/")
time.sleep(10)
# Open the "to" destination dropdown
driver.find_element(By.XPATH, "//div[@data-testid='to-testID-destination']").click()
time.sleep(2)
# Collect every focusable entry in the dropdown (cities plus a few header items)
cities = driver.find_elements(By.XPATH, "//div[@data-testid='to-testID-destination']//div[@data-focusable='true']")
print(len(cities))
x = random.choice(cities)
# Skip the non-city header entries that the locator also picks up
if x.text in ['To', 'India', 'International']:
    pass
else:
    print(x.text)
    x.click()
time.sleep(5)
driver.quit()
Output:
Pakyong
Pakyong Airport
PYG
Process finished with exit code 0
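As a side note, the fixed time.sleep calls could be replaced with explicit waits, which return as soon as the element is actually ready. A minimal sketch of that variant, reusing the same locators (the timeout values are arbitrary):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# Wait until the dropdown trigger is clickable instead of sleeping a fixed 10 seconds
wait.until(EC.element_to_be_clickable(
    (By.XPATH, "//div[@data-testid='to-testID-destination']"))).click()
# Wait until the city entries are present before collecting them
cities = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, "//div[@data-testid='to-testID-destination']//div[@data-focusable='true']")))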

Trying to resolve a scrapy python for loop

If possible I would like to ask for some assistance in scraping some details from a webpage.
https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&period=today&gf=13
The structure is as follows:
[Screenshot: webpage data structure]
[Screenshot: webpage data structure, expanded]
I am able to retrieve all songs using the following command:
response.css("div.trk-cell.title a").xpath("#href").extract()
or
response.xpath("//div[@class='trk-cell title']/a/@href").get()
I am able to retrieve all artists using the following command:
response.css("div.trk-cell.artists a").xpath("#href").extract()
or
response.xpath("//div[@class='trk-cell artists']/a/@href").get()
So now I am trying to write a loop which extracts all the titles and artists on the page and encapsulates each result together in either CSV or JSON. I am struggling to work out the for loop; I have been trying the following with no success.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        for track in response.css("div.trklist.v-.full.v5"):
            yield {
                'link': track.xpath("//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath("//div[@class='trk-cell artists']/a/@href").get()
            }
As far as I can tell, the "trklist" div appears to encapsulate the artist and title divs, so I'm unsure why this code doesn't work.
I have tried the following command in the scrapy shell and it doesn't return any results, which I suspect is the issue, but why not?
response.css("div.trklist.v-.full.v5")
A push in the correct direction would be a lot of help, thanks
You only select the table which contains the items, but not the items themselves, so you're not really looping through them.
The CSS selector for the table is a little different in the HTML scrapy receives, so we need to match that (no v5).
Inside the loop you're missing a dot at the start of track.xpath(...), which makes the XPath relative to the current track instead of the whole document.
Notice that in the code I excluded "hdr"; I did that in order to skip the table's header row.
I added both CSS and XPath versions of the for loop (they both work; choose one of them):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        # for track in response.css('div.trklist.v-.full div.trk-row:not(.hdr)'):
        for track in response.xpath('//div[@class="trklist v- full init-invis"]/div[not(contains(@class, "hdr"))]'):
            yield {
                'link': track.xpath(".//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath(".//div[@class='trk-cell artists']/a/@href").get()
            }
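Once the spider yields items like this, scrapy's built-in feed exports can write them to CSV or JSON straight from the command line, which covers the encapsulation part of the question; for example:

scrapy crawl traxsourcedeephouse -O tracks.json
scrapy crawl traxsourcedeephouse -O tracks.csv

(-O overwrites the output file; on scrapy versions before 2.1 use -o, which appends.)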
In scrapy shell, if you execute view(response) to view the response in a web browser, you will find that there is no data: the data is generated dynamically with JavaScript, which scrapy does not execute.
You should use Selenium or a similar tool.
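If you do go the Selenium route, one common pattern is to let the browser render the page and then hand the resulting HTML back to a scrapy Selector, so the XPath expressions above still apply. A rough sketch, assuming chromedriver is available:

from selenium import webdriver
from scrapy import Selector

driver = webdriver.Chrome()
driver.get('https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13')
# Feed the JavaScript-rendered HTML into scrapy's selector machinery
sel = Selector(text=driver.page_source)
links = sel.xpath("//div[@class='trk-cell title']/a/@href").getall()
driver.quit()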

FormRequest that renders JS content in scrapy shell

I'm trying to scrape content from this page with the following form data:
I need County: set to "Prince George's" and DateOfFilingFrom set to 01-01-2000, so I do the following:
% scrapy shell
In [1]: from scrapy.http import FormRequest
In [2]: request = FormRequest(url='https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx', formdata={'DateOfFilingFrom': '01-01-2000', 'County:': "Prince George's"})
In [3]: response
In [4]:
But it's not working (response is None). Also, the next page (shown in the inspection screenshot) is loaded dynamically, and I need to know how to access each of the links in it. As far as I know this might be done using Splash; however, I'm not sure how to combine a SplashRequest with a FormRequest and do it all from within scrapy shell for testing purposes. I need to know what I'm doing wrong and how to render the next page (the one that results from the FormRequest shown above).
The request you're sending is missing a couple of fields, which is probably why you don't get a response back. The fields you fill in also don't correspond to the fields the site expects in the request. A good way to deal with this is scrapy's FormRequest.from_response (doc), which can already populate some fields for you based on the information in the form.
For this website the following worked for me (using scrapy shell):
>>> url = "https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx"
>>> fetch(url)
>>> from scrapy import FormRequest
>>> req = FormRequest.from_response(
... response,
... formxpath="//form[@id='form1']", # specify the form on the current page
... formdata={
... 'cboCountyId': '16', # the county you select is converted to a number
... 'DateOfFilingFrom': '01-01-2001',
... 'cboPartyType': 'Decedent',
... 'cmdSearch': 'Search'
... },
... clickdata={'type': 'submit'},
... )
>>> fetch(req)
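After the final fetch(req), the shell's response is replaced by the result of the POST, so you can test selectors against the search results right away, e.g.:

>>> response.css('a::attr(href)').getall()  # inspect which hrefs are the estate links you need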

Scrapy returning empty lists when using css

I am trying to scrape Nordstrom product descriptions. I got all the item links (stored in a local MongoDB database) and now am iterating through them; here is an example link: https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732?origin=category-personalizedsort&breadcrumb=Home%2FWomen%2FClothing%2FDresses&color=001
My code for the spider is:
def parse(self, response):
    items = NordstromItem()
    description = response.css("div._26GPU").css("div::text").extract()
    items['description'] = description
    yield items
I also tried scrapy shell and the returned page is blank.
I am also using scrapy random agents.
I suggest you use a CSS or XPath selector to get the info you want. Here's more about it: https://docs.scrapy.org/en/latest/topics/selectors.html
You can also use a CSS/XPath checker to help verify that the selector gets the info you want, like this Chrome extension: https://autonomiq.io/chropath/
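It's also worth confirming in scrapy shell whether the description is in the downloaded HTML at all; if view(response) shows a blank page, as the question mentions, the content is most likely rendered by JavaScript and no CSS/XPath selector will find it. A quick check (the _26GPU class is taken from the question):

scrapy shell "https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732"
>>> view(response)                    # opens the downloaded HTML in your browser
>>> response.css('div._26GPU').get()  # None here means the node isn't in the raw HTML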

Why does this Python 2.7 code produce no output?

This is an example from a Python book. When I run it, I don't get any output. Can someone help me? Thanks!!!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)

jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only considered that the URL might be wrong, but ignored that the HTML information I wanted to get might no longer exist. Maybe this is why I get empty output.
If you open the page and inspect the html you'll notice there are no <h3> tags containing links. This is why you have no output.
So if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/ so the <h3> tags containing links on the page are no longer present.
If you point this code's URL to the new page, I'd suggest taking some time to familiarize yourself with the page source; for instance, it uses <h2> instead of <h3> tags for its links.
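For reference, here is a minimal Python 3 sketch of the same idea pointed at the new page, assuming the job links still sit inside <h2> tags as described above (the markup may have changed again since):

from urllib.request import urlopen
from bs4 import BeautifulSoup

text = urlopen('https://www.python.org/jobs/').read()
soup = BeautifulSoup(text, 'html.parser')
jobs = set()
for header in soup('h2'):
    links = header('a')
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
print('\n'.join(sorted(jobs, key=lambda s: s.lower())))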