Here is my Scrapy spider:

class Spider(scrapy.Spider):
    name = "name"
    start_urls = ["https://www.aurl"]

    def parse(self, response):
        links_page_urls = response.css("a.dynamic-linkset::attr(href)").extract()
        for url in links_page_urls:
            yield response.follow(url, callback=self.parse_list_page)

        next_data_cursor = response.css("li.next").css("a::attr(href)").extract_first()
        if next_data_cursor:
            self.log("going to next page - {}".format(next_data_cursor))
            yield response.follow(next_data_cursor, callback=self.parse)

    def parse_list_page(self, response):
        links = response.css("div.row div.list-group a.list-group-item").css("a::attr(href)").extract()
        for url in links:
            self.log("url - {}".format(url))
            yield scrapy.Request(url=self.base_url + url, callback=self.parse_page)

    def parse_page(self, response):
        # Lots of code for parsing elements from a page;
        # build an item and return it.
My observation is that on my own machine at home, with no download delay set, the actual pages are visited in rapid succession and saved to Mongo. When I move this code to an EC2 instance and set the download delay to 60, the individual pages are no longer visited for scraping; instead the first page is visited, a next-data token is scraped, and that token's URL is visited. Then I see a lot of print-outs related to scraping the list pages rather than each individual page.
The desired behavior is to visit the initial URL, get the page list, visit each page and scrape it, then move on to the next data cursor and repeat the process.
You could try setting a much lower DOWNLOAD_DELAY (2, say), set CONCURRENT_REQUESTS = 1, and assign a higher priority to the requests coming from the page list, something like:
yield response.follow(url, callback=self.parse_list_page, priority=1)
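A minimal sketch of how those pieces might fit together in the spider (the selector and URL are the asker's; the setting values are illustrative, not tuned):

```python
class Spider(scrapy.Spider):
    name = "name"
    start_urls = ["https://www.aurl"]

    # Per-spider settings: throttle politely but keep the queue ordered.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,       # illustrative; tune to the site's tolerance
        "CONCURRENT_REQUESTS": 1,  # one request in flight at a time
    }

    def parse(self, response):
        for url in response.css("a.dynamic-linkset::attr(href)").extract():
            # priority=1 pushes list-page requests ahead of queued pagination requests
            yield response.follow(url, callback=self.parse_list_page, priority=1)
```

With CONCURRENT_REQUESTS = 1 the scheduler processes one request at a time, so the higher-priority list pages drain before the next pagination request is taken.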
I am working to scrape this website: "https://www.moglix.com/automotive/car-accessories/216110000?page=101" (note: 101 is the page number, and the site has 783 pages).
I wrote this code to get all the URLs of the products mentioned on the page using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

prod_url = []
for i in range(1, 400):
    r = requests.get(f'https://www.moglix.com/automotive/car-accessories/216110000?page={i}')
    soup = BeautifulSoup(r.content, 'lxml')
    for link in soup.find_all('a', {"class": "ng-tns-c100-0"}):
        prod_url.append(link.get('href'))
There are 40 products on each page, so this should give me 16,000 product URLs, but I am getting around 7,600. After checking, I can see that the class on the a tag changes between pages. How can I get this href for all the products on all the pages?
You can use the find_all method with the attrs argument to get all a tags, and then filter them with split and startswith to extract the exact product-link URLs:
import requests
from bs4 import BeautifulSoup

res = requests.get(f"https://www.moglix.com/automotive/car-accessories/216110000?page={i}")
soup = BeautifulSoup(res.text, "html.parser")
x = soup.find_all("a", attrs={"target": "_blank"})
lst = [a['href'] for a in x if len(a['href'].split("/")) > 2 and a['href'].startswith("/")]
Output:
['/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56',
'/allextreme-exh4hl2-2-pcs-36w-9000lm-h4-led-headlight-bulb-conversion-kit/mp/msnekpqpm0zw52',
'/love4ride-2-pcs-35-inch-fog-angel-eye-drl-led-light-set-for-car/mp/msne5n8l6q1ykl', ...]
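The filtering condition in that list comprehension can be factored into a small helper and checked on its own (the function name and the sample hrefs below are mine, for illustration only):

```python
def is_product_link(href):
    """Keep only site-relative paths with at least two segments,
    which is the shape of product URLs like '/name/mp/id'."""
    return href.startswith("/") and len(href.split("/")) > 2

links = [
    "/love4ride-steel-tubeless-tyre-puncture-repair-kit/mp/msnv5oo7vp8d56",
    "https://www.moglix.com/help",  # absolute URL: rejected
    "/login",                       # single path segment: rejected
]
product_links = [h for h in links if is_product_link(h)]
# product_links keeps only the first entry
```

Filtering on target="_blank" plus the path shape sidesteps the changing ng-tns-* classes entirely, which is why it recovers links the class-based selector misses.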
If possible, I would like some assistance scraping details from a webpage.
https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&period=today&gf=13
The structure is as follows (see the screenshots "Webpage data structure" and "Webpage data structure expanded").
I am able to retrieve all songs using the following command:

response.css("div.trk-cell.title a").xpath("@href").extract()

or

response.xpath("//div[@class='trk-cell title']/a/@href").get()

I am able to retrieve all artists using the following command:

response.css("div.trk-cell.artists a").xpath("@href").extract()

or

response.xpath("//div[@class='trk-cell artists']/a/@href").get()
So now I am trying to write a loop which extracts all the titles and artists on the page and encapsulates each pair together, in either CSV or JSON. I am struggling to work out the for loop; I have been trying the following with no success:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        for track in response.css("div.trklist.v-.full.v5"):
            yield {
                'link': track.xpath("//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath("//div[@class='trk-cell artists']/a/@href").get()
            }
As far as I can tell, the "trklist" div encapsulates both the artist and title divs, so I'm unsure why this code doesn't work.
I have tried the following command in the scrapy shell and it doesn't return any results, which I suspect is the issue, but why not?
response.css("div.trklist.v-.full.v5")
A push in the correct direction would be a lot of help, thanks
You only select the table which contains the items, not the items themselves, so you're not really looping through them.
The CSS selector for the table is slightly different in scrapy, so we need to match it (no v5).
Inside the loop you're missing a leading dot in track.xpath(...); the dot makes the expression relative to track instead of searching the whole document.
Notice that I added "hdr" in order to skip the table's header row.
I added both CSS and XPath versions of the loop (they both work, choose one of them):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']

    def parse(self, response):
        # for track in response.css('div.trklist.v-.full div.trk-row:not(.hdr)'):
        for track in response.xpath('//div[@class="trklist v- full init-invis"]/div[not(contains(@class, "hdr"))]'):
            yield {
                'link': track.xpath(".//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath(".//div[@class='trk-cell artists']/a/@href").get()
            }
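For the CSV or JSON part of the question, Scrapy's feed exports handle it directly (e.g. scrapy crawl traxsourcedeephouse -O tracks.csv). The row-pairing idea itself can be sketched with the stdlib; the link values below are hypothetical stand-ins for what the selectors would yield:

```python
import csv
import io

# Hypothetical scraped values standing in for the selector output.
links = ["/track/111/some-title", "/track/222/other-title"]
artists = ["/artist/1/some-artist", "/artist/2/other-artist"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["link", "artists"])
# zip pairs each title link with the artist link from the same row
writer.writerows(zip(links, artists))
rows = buf.getvalue().splitlines()
```

Yielding one dict per row inside the loop, as in the spider above, is what lets the feed exporter do this pairing for you.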
If you execute view(response) in the scrapy shell to open the response in a web browser, you will find there is no data: the data is generated dynamically with JavaScript, which Scrapy does not execute.
You should use Selenium or a similar browser-driven tool instead.
I have this code scraping each job title and company name from:
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt
This is for every job title
job_titles = browser.find_elements_by_css_selector("a.job-card-list__title")
c = []
for title in job_titles:
    c.append(title.text)
print(c)
print(len(c))
This is for every company name:
Company_Names = browser.find_elements_by_css_selector("a.job-card-container__company-name")
d = []
for name in Company_Names:
    d.append(name.text)
print(d)
print(len(d))
I provided the URL above; there are many, many pages!
How can I make Selenium auto-open each page and scrape all four thousand available results?
I have found a way to paginate to each page, but I don't yet know how to scrape each page.
So the URL is :
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start=25
The start parameter in the URL increments by 25 from one page to the next,
so we add this piece of code, which successfully navigates us to the other pages:
page = 25
pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))

for i in range(1, 40):
    page = i * 25
    pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
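The page-offset arithmetic can be isolated in a small helper, so that after each browser.get(...) the same title/company collection code runs again and extends c and d (the function name is mine; the offsets follow the 25-per-page pattern described above):

```python
BASE = "https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt"

def page_url(page_index):
    """Return the search URL for a zero-based page index (25 results per page)."""
    return f"{BASE}&start={page_index * 25}"

# One URL per results page; 160 pages would cover ~4,000 results.
urls = [page_url(i) for i in range(3)]
```

In the Selenium loop this becomes browser.get(page_url(i)) followed by the two find_elements calls, appending to c and d on every iteration.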
I am trying to scrape Nordstrom product descriptions. I got all the item links (stored in a local MongoDB database) and am now iterating through them; here is an example link: https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732?origin=category-personalizedsort&breadcrumb=Home%2FWomen%2FClothing%2FDresses&color=001
My code for the spider is:
def parse(self, response):
    items = NordstromItem()
    description = response.css("div._26GPU").css("div::text").extract()
    items['description'] = description
    yield items
I also tried the scrapy shell, and the returned page is blank.
I am also using random user agents with Scrapy.
I suggest you use a CSS or XPath selector to get the info you want. Here's more about it: https://docs.scrapy.org/en/latest/topics/selectors.html
You can also use a CSS/XPath checker to verify that the selector matches the info you want, such as the ChroPath Chrome extension: https://autonomiq.io/chropath/
My Scrapy script has rules specified as below:

rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=<xpath for next page>),
         callback='parse_website', follow=True),
)

The website itself has navigation, but each page only shows the link to the next page, i.e. as page 1 loads I can get the link to page 2, and so on.
How do I get my spider to navigate through all of the n pages?
Thank you!
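For reference, with follow=True a CrawlSpider re-applies its rules to every response it fetches, so a chain of next-page links is followed automatically through all n pages. A minimal sketch of such a rule set (the XPath here is a hypothetical placeholder, not the site's actual markup):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Extract the "next page" link from each response; follow=True means
    # this same rule is applied again to every page it fetches.
    Rule(
        LinkExtractor(restrict_xpaths="//a[@rel='next']"),  # hypothetical next-page XPath
        callback='parse_website',
        follow=True,
    ),
)
```

Note the callback is given as the string 'parse_website', the usual CrawlSpider convention, since the method is not yet defined at class-body evaluation time.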