HREF Class changing on every page - beautifulsoup

I am trying to scrape this website: "https://www.moglix.com/automotive/car-accessories/216110000?page=101". Note: 101 is the page number, and the site has 783 pages.
I wrote this code with BeautifulSoup to collect the URLs of all the products listed on each page:
import requests
from bs4 import BeautifulSoup

prod_url = []
for i in range(1, 400):
    r = requests.get(f'https://www.moglix.com/automotive/car-accessories/216110000?page={i}')
    soup = BeautifulSoup(r.content, 'lxml')
    # Collect the href of every <a> tag that carries this class
    for link in soup.find_all('a', {"class": "ng-tns-c100-0"}):
        prod_url.append(link.get('href'))
There are 40 products on each page, so this should give me about 16,000 product URLs, but I am getting only about 7,600.
After checking, I can see that the class of the a tag changes from page to page.
How can I get this href for all the products on all the pages?

You can use the find_all method with attrs to get all the a tags, then filter them further with split and startswith to keep only the product link URLs:
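# i is the page number from the loop in the question
res = requests.get(f"https://www.moglix.com/automotive/car-accessories/216110000?page={i}")
soup = BeautifulSoup(res.text, "html.parser")
# Select on target="_blank" rather than on the class, which changes between pages
x = soup.find_all("a", attrs={"target": "_blank"})
# Keep site-relative hrefs that have more than two "/"-separated parts
lst = [a['href'] for a in x if (len(a['href'].split("/")) > 2 and a['href'].startswith("/"))]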
Output:
['/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56',
'/allextreme-exh4hl2-2-pcs-36w-9000lm-h4-led-headlight-bulb-conversion-kit/mp/msnekpqpm0zw52',
'/love4ride-2-pcs-35-inch-fog-angel-eye-drl-led-light-set-for-car/mp/msne5n8l6q1ykl',..........]
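The filter in the answer can be checked without touching the network by isolating it as a plain function on href strings. The following is a minimal sketch of that idea, not the answerer's code: the names is_product_href and unique_product_links are hypothetical, and the duplicate removal and the None check (for a tags that have no href) are additions assumed to be useful when results from many pages are merged.

```python
def is_product_href(href):
    """Return True for site-relative links with more than two '/'-separated parts."""
    if not href:  # link.get('href') returns None when the tag has no href
        return False
    return href.startswith("/") and len(href.split("/")) > 2


def unique_product_links(hrefs):
    """Keep product-like hrefs, dropping duplicates while preserving order."""
    seen = set()
    result = []
    for href in hrefs:
        if is_product_href(href) and href not in seen:
            seen.add(href)
            result.append(href)
    return result


product = "/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56"
sample = [
    product,
    "/automotive",              # splits into only two parts, so it is rejected
    None,                       # an <a> tag without an href
    "https://example.com/a/b",  # absolute URL, does not start with "/"
    product,                    # duplicate of the first entry
]
links = unique_product_links(sample)
print(links)
```

In the scraping loop, the list passed in would be [a.get('href') for a in soup.find_all("a", attrs={"target": "_blank"})] for each page.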

Related

How can I get the link of all the posts in the Instagram profile with Selenium?

I'm trying to get the links of all the posts in an Instagram profile.
How can I get to the href="/p/CX067tNhZ8i/" in the photo?
What I'm trying to do is find the href value of every post.
All the posts are in class="v1Nh3 kIKUG _bz0w".
I tried to get the href value from this class with the get_attribute command, but it didn't work.
Thank you for your help.
browser.get("https://www.instagram.com/lightning.mcqueen34/")
links = []
elements = browser.find_element_by_xpath('//*[@id="react-root"]/div/div/section/main/div/div[4]/article/div[1]/div/div[1]/div[3]')
for i in elements:
    links.append(i.get_attribute('href'))
I thought this would work, but the elements value is not a list, so it gave an error.
This should work:
elements = browser.find_elements_by_tag_name('a')
The answer below will not work in all cases; it depends on how the DOM of the page is loaded.
Replace this line:
elements = browser.find_element_by_xpath('//*[@id="react-root"]/div/div/section/main/div/div[4]/article/div[1]/div/div[1]/div[3]')
With:
elements = browser.find_elements_by_xpath("//a[@href]")
This will let you retrieve all links with an href from the page. Note the plural find_elements, which returns a list you can iterate over.
Alternatively, change the XPath to target the DIV by its class or ID first, then use //a[@href] to get all the hrefs.
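Because //a[@href] returns every link on the page, the post links still have to be separated from the rest. A minimal sketch of that step follows, assuming post URLs have the path shape /p/<shortcode>/ seen in the question; the function name filter_post_links and the sample URLs are hypothetical. It works on plain strings, so it accepts both relative hrefs and the absolute URLs that get_attribute('href') typically returns.

```python
import re
from urllib.parse import urlparse

# Assumed shape of a post path, based on the single example in the question
POST_PATH = re.compile(r"^/p/[^/]+/?$")


def filter_post_links(hrefs):
    """Keep only hrefs whose URL path looks like /p/<shortcode>/."""
    links = []
    for href in hrefs:
        if href and POST_PATH.match(urlparse(href).path):
            links.append(href)
    return links


sample = [
    "https://www.instagram.com/p/CX067tNhZ8i/",
    "https://www.instagram.com/explore/",
    "/p/CX067tNhZ8i/",
    None,
]
posts = filter_post_links(sample)
print(posts)
```

With Selenium, the input would be [e.get_attribute('href') for e in elements], where elements is the list returned by the find_elements call above.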

Selenium web scraping elements from tag

I'm looping over different URLs, trying to get some information about some movies.
I'm trying to get the writers. I am not extracting each CSS selector individually because another movie may not have the same number of scriptwriters, which would cause an error. For this reason I want to extract the elements that belong to a tag. For example, I want to get all the elements with the tag "a" (image attached).
I have the following code but it's not working:
driver.find_element(By.TAG_NAME,"a")
I don't know if there is any other way without using tag
url movie = "https://www.imdb.com/title/tt7740496/?ref_=watch_fanfav_tt_t_4"
I think you are using Python. Try one of these methods:
driver.find_elements_by_xpath('(//span[contains(text(),"Guión")])[1]/../div//a')
driver.find_elements(By.XPATH,'(//span[contains(text(),"Guión")])[1]/../div//a')
Check the Selenium documentation: Locating Elements
With Java code, this returns 3 elements, as you want.

Make Selenium scroll LinkedIn to scrape jobs

I have this code scraping each job title and company name from :
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt
This is for every job title
job_titles = browser.find_elements_by_css_selector("a.job-card-list__title")
c = []
for title in job_titles:
    c.append(title.text)
print(c)
print(len(c))
This is for every company name
Company_Names = browser.find_elements_by_css_selector("a.job-card-container__company-name")
d = []
for name in Company_Names:
    d.append(name.text)
print(d)
print(len(d))
I provided the URL above; there are many, many pages!
How can I make Selenium automatically open each page and scrape each of the 4 thousand results available?
I have found a way to paginate to each page, but I don't yet know how to scrape each page.
So the URL is:
https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start=25
The start parameter in the URL increments by 25 from one page to the next.
So we add this piece of code, which navigates successfully to the other pages:
page = 25
pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
for i in range(1,40):
    page = i * 25
    pagination = browser.get('https://www.linkedin.com/jobs/search/?geoId=106155005&location=Egypt&start={}'.format(page))
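One way to connect the pagination to the scraping is to generate the page URLs first and then run the same extraction on each one. The sketch below covers only the URL generation, which can be verified without a browser; the function name page_urls and its parameters are hypothetical, and the page size of 25 is taken from the start increments described above.

```python
from urllib.parse import urlencode

BASE = "https://www.linkedin.com/jobs/search/"


def page_urls(pages, page_size=25):
    """Yield one search URL per page, with start = 0, 25, 50, ..."""
    for n in range(pages):
        query = urlencode({
            "geoId": "106155005",
            "location": "Egypt",
            "start": n * page_size,
        })
        yield f"{BASE}?{query}"


urls = list(page_urls(3))
for url in urls:
    print(url)

# Hedged usage with the browser object from the question (not run here):
# for url in page_urls(40):
#     browser.get(url)
#     titles = browser.find_elements_by_css_selector("a.job-card-list__title")
#     c.extend(t.text for t in titles)
```

Collecting into the same list c across pages, rather than resetting it per page, is what accumulates the results from every page.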

Scrapy returning empty lists when using css

I am trying to scrape Nordstrom product descriptions. I got all the item links (stored in a local MongoDB database) and am now iterating through them. Here is an example link: https://www.nordstrom.ca/s/leith-ruched-body-con-tank-dress/5420732?origin=category-personalizedsort&breadcrumb=Home%2FWomen%2FClothing%2FDresses&color=001
My code for the spider is:
def parse(self, response):
    items = NordstromItem()
    description = response.css("div._26GPU").css("div::text").extract()
    items['description'] = description
    yield items
I also tried scrapy shell and the returned page is blank.
I am also using scrapy random agents.
I suggest using a CSS or XPath selector to get the info you want. Here's more about it: https://docs.scrapy.org/en/latest/topics/selectors.html
You can also use a CSS/XPath checker to verify that the selector gets the info you want, such as this Chrome extension: https://autonomiq.io/chropath/

Scraping categories and subcategories using beautifulsoup

I am trying to retrieve all categories and subcategories within a website. I am able to use BeautifulSoup to pull every single product in the category once I am in it. However, I am struggling with the loop for categories. I'm using this as a test website: http://www.shophive.com.
How do I loop through each category as well as the subcategories on the left side of the website? I would like to extract all products within the category/subcategory and display on my page.
from bs4 import BeautifulSoup
import user_agent
import requests
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
    main_category_links.append(soup.find('a').get('href'))
print(main_category_links)
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
    subcategory_links.append(link.get('href'))
print(subcategory_links)
I'll break this down for you piece by piece.
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
Here we just make the request and store the HTML. I use a module called "user_agent" to generate a User Agent to use in the headers, just my preference.
<div class="parentMenu arrow">
<a href="http://www.shophive.com/year-end-clearance-sale">
<span>New Year's Clearance Sale</span>
</a>
</div>
The links for the main categories are stored like this, so to extract just the links we do this:
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
    main_category_links.append(div.find('a').get('href'))
We iterate over the results of soup.find_all('div', class_='parentMenu arrow') since the links we want are children of these elements. Then we append div.find('a').get('href') to our list of main category links. We call find on each div, not on soup, so that we get that div's own link rather than the first link on the whole page. We use find this time because we only want one result per div, then we get the contents of the href.
<a class="itemMenuName level1" href="http://www.shophive.com/apple/mac">
<span>Mac / Macbooks</span>
</a>
The subcategories are stored like this. Notice that the "a" tag has a class this time, which makes it a little easier for us to find.
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
    subcategory_links.append(link.get('href'))
Here we iterate over soup.find_all('a', class_='itemMenuName'). When you search by class in BeautifulSoup, class_ is matched against each of an element's individual classes, so you can search for just one of them. This is helpful to us in this case since the class attribute varies from itemMenuName level1 to itemMenuName level2. These elements have the link inside of them already, so we just extract the contents of the href that holds the URL with link.get('href') and append it to our list of subcategory links.
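The class matching described above can be tried on a small inline snippet instead of the live site. This is a minimal sketch built from the two HTML fragments shown earlier; the third anchor, with class itemMenuName level2 and its URL, is a hypothetical entry added only to show that both levels match. Note that the anchor is looked up on each div, so each iteration yields that div's own link.

```python
from bs4 import BeautifulSoup

html = """
<div class="parentMenu arrow">
  <a href="http://www.shophive.com/year-end-clearance-sale">
    <span>New Year's Clearance Sale</span>
  </a>
</div>
<a class="itemMenuName level1" href="http://www.shophive.com/apple/mac">
  <span>Mac / Macbooks</span>
</a>
<a class="itemMenuName level2" href="http://www.shophive.com/apple/mac/example">
  <span>Hypothetical level2 entry</span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# The full string "parentMenu arrow" matches the div's exact class attribute value
main_category_links = [
    div.find("a").get("href")
    for div in soup.find_all("div", class_="parentMenu arrow")
]

# A single class name matches any element that has it among its classes
subcategory_links = [
    link.get("href") for link in soup.find_all("a", class_="itemMenuName")
]

print(main_category_links)
print(subcategory_links)
```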