Scraping categories and subcategories using beautifulsoup - beautifulsoup

I am trying to retrieve all categories and subcategories within a website. I am able to use BeautifulSoup to pull every single product in the category once I am in it. However, I am struggling with the loop for categories. I'm using this as a test website: http://www.shophive.com.
How do I loop through each category as well as the subcategories on the left side of the website? I would like to extract all products within the category/subcategory and display on my page.

from bs4 import BeautifulSoup
import user_agent
import requests
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
main_category_links.append(soup.find('a').get('href'))
print(main_category_links)
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
subcategory_links.append(link.get('href'))
print(subcategory_links)
I'll break this down for you piece by piece.
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
Here we just make the request and store the HTML. I use a module called "user_agent" to generate a User Agent to use in the headers, just my preference.
<div class="parentMenu arrow">
<a href="http://www.shophive.com/year-end-clearance-sale">
<span>New Year's Clearance Sale</span>
</a>
</div>
The links for the main categories are stored like so, so in order to extract just the links we do this:
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
main_category_links.append(soup.find('a').get('href'))
We iterate over the results of soup.find_all('div', class_='parentMenu arrow') since elements the links we want are children of these elements. Then we append soup.find('a').get('href') to our list of main category links. We use soup.find this time because we only want one result, then we get the contents of the href.
<a class="itemMenuName level1" href="http://www.shophive.com/apple/mac">
<span>Mac / Macbooks</span>
</a>
The subcategories are stored like this, notice the "a" tag has a class this time, this makes it a little easier for us to find it.
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
subcategory_links.append(link.get('href'))
Here we iterate over soup.find_all('a', class_='itemMenuName'). When you search for classes in BeautifulSoup, you can just search for part of the class name. This is helpful to us in this case since the class name varies from itemMenuName level1 to itemMenuName level2. These elements have the link inside of them already, so we just extract the contents of the href that holds the URL with link.get('href') and append it to our list of subcategory links.

Related

How to scrape time tag using BeautifulSoup?

I am trying to get the date of the tweet using the following code
div_class="css-901oao r-18jsvk2 r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-bnwqim r-qvutc0
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")
contents = soup.find_all(class_=div_class)
for p in contents:
print(p.time)
but None is printed
Provided that the element is included in your soup - Classes look high dynamic so better change your selection strategy and use more static id, HTML structure, attriutes.
Following css selector selects all <time> that is an directly child of an <a>:
for t in soup.select('a>time'):
# tag / element
print(t)
# text
print(t.text)
# attribute value
print(t.get('datetime))

HREF Class changing on every page

I am working to scrape the website:- "https://www.moglix.com/automotive/car-accessories/216110000?page=101" NOTE: 101 is the page number and this site has 783 pages.
I wrote this code to get all the URL's of the product mentioned on the page using beautifulsoup:-
prod_url = []
for i in range(1,400):
r = requests.get(f'https://www.moglix.com/automotive/car-accessories/216110000?page={i}')
soup = BeautifulSoup(r.content,'lxml')
for link in soup.find_all('a',{"class":"ng-tns-c100-0"}):
prod_url.append(link.get('href'))
There are 40 products on each page, and this should give me 16000 URLs for the products but I am getting 7600(approx)
After checking I can see that the class for a tag is changing on pages. For Eg:-
How to get this href for all the products on all the pages.
You can use find_all method and specified attrs to get all a tags also further filter it by using split and startswith method to get exact product link URL's
res=requests.get(f"https://www.moglix.com/automotive/car-accessories/216110000?page={i}")
soup=BeautifulSoup(res.text,"html.parser")
x=soup.find_all("a",attrs={"target":"_blank"})
lst=[i['href'] for i in x if (len(i['href'].split("/"))>2 and i['href'].startswith("/"))]
Output:
['/love4ride-steel-tubeless-tyre-puncture-repair-kit-tyre-air-inflator-with-gauge/mp/msnv5oo7vp8d56',
'/allextreme-exh4hl2-2-pcs-36w-9000lm-h4-led-headlight-bulb-conversion-kit/mp/msnekpqpm0zw52',
'/love4ride-2-pcs-35-inch-fog-angel-eye-drl-led-light-set-for-car/mp/msne5n8l6q1ykl',..........]

Unable to pare the href tag in python

I get the following output in my beautiful soup.
[Search over 301,944 datasets\n]
I need to extract only the number 301,944 in this. Please guide me how this can be done. My code so far
import requests
import re
from bs4 import BeautifulSoup
source = requests.get('https://www.data.gov/').text
soup = BeautifulSoup (source , 'lxml')
#print soup.prettify()
images = soup.find_all('small')
print images
con = images.find_all('a') // I am unable to get anchor tag here. It says anchor tag not present
print con
#for con in images.find_all('a',href=True):
#print con
#content = images.split('metrics')
#print content[1]
#images = soup.find_all('a', {'href':re.compile('\d+')})
#print images
There is only one <small> tag on website.
Your images variable references it. But you use it in a wrong way to retrive anchor tag.
If you want to retrieve text from a tag you can get it with:
soup.find('small').a.text
where find method returns first small element it encounters on website. If you use find_all, you will get list of all small elements (but there's only one small tag here).

why the code of python2.7 have no any output?

This is an example from a python book. When I run it I don't get any output. Can someone help me? Thanks!!!
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
text = urlopen('https://python.org/community/jobs').read()
soup = BeautifulSoup(text)
jobs = set()
for header in soup('h3'):
links = header('a', 'reference')
if not links: continue
link = links[0]
jobs.add('%s (%s)' % (link.string, link['href']))
print jobs.add('%s (%s)' % (link.string, link['href']))
print '\n'.join(sorted(jobs, key=lambda s: s.lower()))
reedit--
firstly,i only considered the url is wrong but ignore the html infomation i wanna to get was not exist. May be this is why i get empty output.
If you open the page and inspect the html you'll notice there are no <h3> tags containing links. This is why you have no output.
So if not links: continue always continues.
This is probably because the page has moved to https://www.python.org/jobs/ so the <h3> tags containing links on the page are no longer present.
If you point this code's url to the new page. I'd suggest using taking some time to familiarize yourself with the page source. For instance it uses <h2> instead of <h3> tags for its links.

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
print(quest)
questSoup = BeautifulSoup(quest)
for floor in questSoup.find_all('location_id'):
print(floor)
What this is supposed to do is to get a part of a huge xml called "quests", based on tag - and its attribute - "id". Then it is supposed to make a new soup from that part and get all the tags from within the . For now, before I figure out which quest ids I want to choose (and how will I handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id":questid}):
print(quest)
# questSoup = BeautifulSoup(quest)
for floor in quest.find_all('location_id'):
print(floor)
No need to build a new soup object from tag object, you can use find_all on both of them, as both are navigable strings, so they behave in the same way and can be accessed in the same way.
In my opinion, soup object is special tag object which is named document
import requests, bs4
r =requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'