Scrapy returning empty lists when using css

I am trying to scrape Nordstrom product descriptions. I got all the item links (stored in a local MongoDB database) and now I am iterating through them. Here is an example link: https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732?origin=category-personalizedsort&breadcrumb=Home%2FWomen%2FClothing%2FDresses&color=001
My code for the spider is:
def parse(self, response):
    items = NordstromItem()
    description = response.css("div._26GPU").css("div::text").extract()
    items['description'] = description
    yield items
I also tried scrapy shell and the returned page is blank.
I am also using random user agents with Scrapy.

I suggest you use a CSS or XPath selector to get the info you want. Here's more about it: https://docs.scrapy.org/en/latest/topics/selectors.html
You can also use a CSS/XPath checker to help verify that the selector gets the info you want, like this Chrome extension: https://autonomiq.io/chropath/
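For example, you can sanity-check a selector interactively in the Scrapy shell before wiring it into the spider. A minimal sketch (the div._26GPU class comes from the question; classes like that are often generated by JavaScript and may simply not exist in the raw HTML Scrapy receives, which would explain both the empty list and the blank page in the shell):

$ scrapy shell "https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732"
>>> # CSS route: all text nodes in divs under the target div
>>> response.css("div._26GPU div::text").getall()
>>> # equivalent XPath route
>>> response.xpath("//div[contains(@class, '_26GPU')]//div/text()").getall()

If both return [], the element is not in the downloaded HTML at all, and no selector will find it.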

Related

Selenium web scraping elements from tag

I'm looping through different URLs trying to get some information about some movies.
I'm trying to get the writers. I'm not extracting each CSS selector individually because another movie might not have the same number of scriptwriters, and that would give an error. For this reason I want to extract the elements that are bound to the tag. For example, I want to get all the elements of the "a" tag (image attached).
I have the following code, but it's not working:
driver.find_element(By.TAG_NAME, "a")
I don't know if there is any other way to do it without using the tag name.
Movie URL: https://www.imdb.com/title/tt7740496/?ref_=watch_fanfav_tt_t_4
I think you are using Python. Try one of these methods:
driver.find_elements_by_xpath('(//span[contains(text(),"Guión")])[1]/../div//a')
driver.find_elements(By.XPATH, '(//span[contains(text(),"Guión")])[1]/../div//a')
Check the Selenium documentation: Locating Elements
With the equivalent Java code, my result returns 3 elements, as you want.
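For completeness, a minimal Python sketch of looping over the matched elements and printing the writer names; the XPath is the one from the methods above, and the rest of the setup (ChromeDriver, the Spanish-language "Guión" label on the page) is assumed from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/title/tt7740496/?ref_=watch_fanfav_tt_t_4")

# find_elements (plural) returns a list of every match;
# find_element (singular) returns only the first one
writers = driver.find_elements(By.XPATH, '(//span[contains(text(),"Guión")])[1]/../div//a')
for writer in writers:
    print(writer.text)

driver.quit()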

Trying to resolve a scrapy python for loop

If possible I would like to ask for some assistance in scraping some details from a webpage.
https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&period=today&gf=13
The structure is as follows (screenshots attached: the webpage data structure, and the same structure expanded).
I am able to retrieve all songs using the following command:
response.css("div.trk-cell.title a").xpath("@href").extract()
or
response.xpath("//div[@class='trk-cell title']/a/@href").get()
I am able to retrieve all artists using the following command:
response.css("div.trk-cell.artists a").xpath("@href").extract()
or
response.xpath("//div[@class='trk-cell artists']/a/@href").get()
Now I am trying to write a loop which extracts all the titles and artists on the page and encapsulates each pair together in either CSV or JSON. I am struggling to work out the for loop; I have been trying the following with no success.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        for track in response.css("div.trklist.v-.full.v5"):
            yield {
                'link': track.xpath("//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath("//div[@class='trk-cell artists']/a/@href").get()
            }
As far as I can tell the "trklist" div appears to encapsulate the artist and title divs, so I'm unsure why this code doesn't work.
I have tried the following command in the scrapy shell and it doesn't return any results, which I suspect is the issue, but why not?
response.css("div.trklist.v-.full.v5")
A push in the correct direction would be a lot of help, thanks
You only select the table which contains the items, not the items themselves, so you're not really looping through them.
The CSS selector for the table is a little different in the page Scrapy receives, so we need to match that (no v5 class).
Inside the loop you're missing a leading dot in track.xpath(...); without it, the XPath searches the whole document instead of the current track.
Notice that in the code I added "hdr": I did that to skip the table's header row.
I added both a CSS and an XPath version of the for loop (they both work; choose one of them):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        # for track in response.css('div.trklist.v-.full div.trk-row:not(.hdr)'):
        for track in response.xpath('//div[@class="trklist v- full init-invis"]/div[not(contains(@class, "hdr"))]'):
            yield {
                'link': track.xpath(".//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath(".//div[@class='trk-cell artists']/a/@href").get()
            }
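Since you mentioned CSV or JSON: once the spider yields plain dicts like this, Scrapy's feed exports can serialize them with no extra code, e.g. scrapy crawl traxsourcedeephouse -O tracks.json or -O tracks.csv (use the lowercase -o flag to append on older Scrapy versions).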
If you execute view(response) in the Scrapy shell to open the response in a web browser, you will find that there is no data: it is generated dynamically with JavaScript, which Scrapy does not execute.
You should use Selenium or a similar tool.
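If you do go the Selenium route, one common pattern is to let the browser render the page and then feed the HTML into a Scrapy selector, so the XPath/CSS work above still applies. A minimal sketch, assuming ChromeDriver is installed (the selectors are borrowed from the other answer and may need adjusting against the live page):

from scrapy import Selector
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13")
# note: an explicit WebDriverWait may be needed for the track list to finish loading

# parse the browser-rendered HTML with Scrapy's selector API
sel = Selector(text=driver.page_source)
for track in sel.css("div.trk-row:not(.hdr)"):
    print({
        'link': track.xpath(".//div[@class='trk-cell title']/a/@href").get(),
        'artists': track.xpath(".//div[@class='trk-cell artists']/a/@href").get(),
    })

driver.quit()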

Selenium 4. find_element_by_css_selector not working

Based on the Selenium documentation, the find-element-by-CSS-selector syntax is element = driver.find_element_by_css_selector('#foo'), but the example shows a.nav before the # sign, (a.nav#home), which according to this website is an HTML tag.
In another part of the Selenium documentation the css_selector doesn't even have the # sign: ele = driver.find_element(By.CSS_SELECTOR, 'h1')
Questions:
Which syntax is correct? With or without the HTML tag? With or without the # sign?
In Visual Studio Code I used these syntaxes to find search boxes or sign-in boxes. It worked on one website but didn't work on another. Could you help me find the search box using css_selector on this website?
Here is an example of my script:
from selenium import webdriver
from selenium.webdriver.common.by import By

try:
    driver = webdriver.Chrome()
    driver.get("https://www.arizonarealestate.com")
    searchBox = driver.find_element(By.CSS_SELECTOR, "#input[placeholder='Enter city, address, neighborhood, zip, or MLS #']")
    searchBox = driver.find_element(By.CSS_SELECTOR, "input#input[placeholder='Enter city, address, neighborhood, zip, or MLS #']")
    searchBox.send_keys("Some text")
    searchBtn = driver.find_element(By.CSS_SELECTOR, "button#.btn.btn-primary.btn-lg.btn-block.js-qs-btn").click()
finally:
    # print("============ Done!")
    driver.quit()
Generally speaking, a CSS selector is just a string with some specific syntax; it is not really defined by the Selenium WebDriver itself.
You should have a look at the MDN description of CSS selectors.
In your question you specifically seem to be asking where to put the id selector, written with the # character. This selector should actually be used just by itself: all ids in a page should be unique, so no other information is needed.
In your example, the #input[placeholder='...'] selector would select an element with an id equal to input.
If you intended to select an input tag with a specific placeholder, you should omit the #.
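Concretely, dropping the # turns it into a tag-plus-attribute selector, which seems to be what you want. A hedged sketch (the placeholder text and button classes are copied from your script and may differ on the live site):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.arizonarealestate.com")

# no '#': match an <input> by its placeholder attribute, not by id
search_box = driver.find_element(
    By.CSS_SELECTOR, "input[placeholder='Enter city, address, neighborhood, zip, or MLS #']")
search_box.send_keys("Phoenix")  # any sample text

# class selectors chain with '.', again with no stray '#'
driver.find_element(By.CSS_SELECTOR, "button.btn.btn-primary.btn-lg.btn-block.js-qs-btn").click()

driver.quit()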

Unable to select element using Scrapy shell

I'm trying to print out all the titles of the products of this website using scrapy shell: 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
Once it is open I start fetching:
fetch('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas')
Then I try to print out the title of each product; as a result, nothing is selected:
>>> response.css('.shelfProductTile-descriptionLink::text')
output: []
Also tried:
>>> response.css('a')
output: []
How can I do this? Thanks
Your code is correct. What happens is that there is no a element in the HTML retrieved by Scrapy. When you visit the page with your browser, the product list is populated with JavaScript, on the browser side; the products are not in the HTML code itself.
In the Scrapy docs you'll find techniques to pre-render JavaScript. Maybe you should try that.
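A quick way to confirm this from the Scrapy shell is to search the raw HTML for the class you are selecting; if it is missing from response.text, the markup is built client-side and no selector will match:

>>> fetch('https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas')
>>> 'shelfProductTile' in response.text  # expect False here, per the above
False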

Why does this Python 2.7 code have no output?

This is an example from a Python book. When I run it I don't get any output. Can someone help me? Thanks!!!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)

jobs = set()
for header in soup('h3'):
    links = header('a', 'reference')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))
    print jobs.add('%s (%s)' % (link.string, link['href']))

print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
Edit:
At first I only considered that the URL might be wrong, and ignored the possibility that the HTML information I want to get does not exist. Maybe this is why I get empty output.
If you open the page and inspect the HTML, you'll notice there are no <h3> tags containing links. This is why you have no output: if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/, so the <h3> tags containing links are no longer present on the old page.
If you point this code's URL at the new page, I'd suggest taking some time to familiarize yourself with the page source. For instance, it uses <h2> instead of <h3> tags for its links.
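An untested sketch of the book's code pointed at the new page, per the notes above (Python 2 syntax is kept to match the book; the 'reference' class filter is dropped, since the new markup may not use it):

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

text = urlopen('https://www.python.org/jobs/').read()
soup = BeautifulSoup(text)

jobs = set()
for header in soup('h2'):  # the new page uses <h2> headings for its links
    links = header('a')
    if not links: continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))

print '\n'.join(sorted(jobs, key=lambda s: s.lower()))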