BeautifulSoup: How to get the document element from a tag? - beautifulsoup

It is easy to create a new tag using the Document object, but how do I create a new tag if I just have a tag?
def bold(tag):
    b = tag.new_tag('b')  # no new_tag method here
    tag.wrap(b)

Every element has a parents generator; take the last item it yields, which is the document object:
def bold(tag):
    b = list(tag.parents)[-1].new_tag('b')  # find root element
    tag.wrap(b)
See @jason-s comment below for caveats.
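One situation where the one-liner fails is a tag that is not attached to any document: tag.parents is then empty and list(tag.parents)[-1] raises IndexError. Below is a minimal sketch that checks for this, assuming bs4 with the built-in html.parser. The helper name find_root is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def find_root(tag):
    """Return the BeautifulSoup object that owns tag, or None if it is detached."""
    root = None
    for parent in tag.parents:
        root = parent  # the last parent yielded is the top of the tree
    return root if isinstance(root, BeautifulSoup) else None

def bold(tag):
    """Wrap tag in a <b> element created by its owning document."""
    root = find_root(tag)
    if root is None:
        raise ValueError("tag is not attached to a BeautifulSoup document")
    return tag.wrap(root.new_tag("b"))

soup = BeautifulSoup("<p>Hello <i>world</i></p>", "html.parser")
bold(soup.i)
print(soup)
```

Walking the generator instead of building a list avoids the IndexError and lets the function report a detached tag explicitly.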

Related

How to scrape time tag using BeautifulSoup?

I am trying to get the date of the tweet using the following code
div_class="css-901oao r-18jsvk2 r-37j5jr r-a023e6 r-16dba41 r-rjixqe r-bcqeeo r-bnwqim r-qvutc0"
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")
contents = soup.find_all(class_=div_class)
for p in contents:
    print(p.time)
but None is printed
Provided that the element is included in your soup: the classes look highly dynamic, so it is better to change your selection strategy and rely on something more static, such as an id, the HTML structure, or attributes.
The following CSS selector selects every <time> that is a direct child of an <a>:
for t in soup.select('a>time'):
    # tag / element
    print(t)
    # text
    print(t.text)
    # attribute value
    print(t.get('datetime'))
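To see what the child combinator does, here is a small sketch run against a made-up HTML fragment. The markup is an assumption for illustration only and is not Twitter's real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one <time> directly inside an <a>, one inside a <div>.
html = """
<article>
  <a href="/user/status/1"><time datetime="2022-01-15T10:30:00.000Z">Jan 15</time></a>
  <div><time datetime="2022-01-16T08:00:00.000Z">Jan 16</time></div>
</article>
"""
soup = BeautifulSoup(html, "html.parser")

# 'a > time' matches only <time> elements whose parent is an <a>.
stamps = [(t.get_text(), t.get("datetime")) for t in soup.select("a > time")]
print(stamps)
```

Only the first <time> is selected; the one inside the <div> is skipped because its parent is not an <a>.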

How to find the next link after an id with selenium?

I want to return the links to all posts from a specific subreddit on my Reddit homepage. My intuition is to do this by looking for the next link after it finds an href = r/whatever.
I was using https://www.reddit.com/r/programming/
I would recommend scrolling first, so that the infinite-scroll page loads more posts.
After that, use this to grab all the links:
links = [x.get_attribute("href") for x in driver.find_elements(By.XPATH, "//a[@href and @data-click-id='body']")]
You can find all <a> tags that have an href attribute and then iterate over that list. A Python implementation:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()  # Chrome is only an example; use whichever driver you have
links = driver.find_elements(By.XPATH, "//a[@href]")  # returns every <a> element that has an href

Trying to resolve a scrapy python for loop

If possible I would like to ask for some assistance in scraping some details from a webpage.
https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&period=today&gf=13
The structure is as follows
[Image: webpage data structure]
[Image: webpage data structure, expanded]
I am able to retrieve all songs using the following command:
response.css("div.trk-cell.title a").xpath("@href").extract()
or
response.xpath("//div[@class='trk-cell title']/a/@href").get()
I am able to retrieve all artists using the following command:
response.css("div.trk-cell.artists a").xpath("@href").extract()
or
response.xpath("//div[@class='trk-cell artists']/a/@href").get()
So now I am trying to write a loop that extracts all the titles and artists on the page and groups each result together in either CSV or JSON. I am struggling to work out the for loop; I have been trying the following with no success.
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']
    def parse(self, response):
        for track in response.css("div.trklist.v-.full.v5"):
            yield {
                'link': track.xpath("//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath("//div[@class='trk-cell artists']/a/@href").get()
            }
As far as I can tell the "trklist" div appears to encapsulate the artist and title divs, so I'm unsure why this code doesn't work.
I have tried the following command in the scrapy shell and it doesn't return any results which I suspect is the issue, but why not?
response.css("div.trklist.v-.full.v5")
A push in the correct direction would be a lot of help, thanks
You only select the table that contains the items, not the items themselves, so you're not really looping through them.
The class list on the table is slightly different in the response Scrapy receives (there is no v5), so the selector has to match that.
Inside the loop you're missing a leading dot in track.xpath(...). Without it, an XPath that starts with // searches the whole document instead of only the current track.
Notice that I added "hdr" in the code; I did this to skip the table's header row.
I added both a CSS and an XPath version of the for loop (they both work, choose one of them):
import scrapy
class QuotesSpider(scrapy.Spider):
    name = "traxsourcedeephouse"
    start_urls = ['https://www.traxsource.com/genre/13/deep-house/all?cn=tracks&ipp=50&gf=13']
    def parse(self, response):
        # for track in response.css('div.trklist.v-.full div.trk-row:not(.hdr)'):
        for track in response.xpath('//div[@class="trklist v- full init-invis"]/div[not(contains(@class, "hdr"))]'):
            yield {
                'link': track.xpath(".//div[@class='trk-cell title']/a/@href").get(),
                'artists': track.xpath(".//div[@class='trk-cell artists']/a/@href").get()
            }
In the Scrapy shell, run view(response) to open the response in a web browser. You will find that there is no data, because the data is generated dynamically with JavaScript, which Scrapy does not execute.
You should use Selenium or a similar tool.

Unable to parse the href tag in python

I get the following output in my beautiful soup.
[Search over 301,944 datasets\n]
I need to extract only the number 301,944 in this. Please guide me how this can be done. My code so far
import requests
import re
from bs4 import BeautifulSoup
source = requests.get('https://www.data.gov/').text
soup = BeautifulSoup (source , 'lxml')
#print soup.prettify()
images = soup.find_all('small')
print images
con = images.find_all('a')  # I am unable to get anchor tag here. It says anchor tag not present
print con
#for con in images.find_all('a',href=True):
#print con
#content = images.split('metrics')
#print content[1]
#images = soup.find_all('a', {'href':re.compile('\d+')})
#print images
There is only one <small> tag on the website.
Your images variable holds the result of find_all, which is a list containing that tag. A list has no find_all method, so you cannot retrieve the anchor tag from it that way.
If you want to retrieve the text from the tag, you can get it with:
soup.find('small').a.text
where the find method returns the first small element it encounters on the page. If you use find_all, you get a list of all small elements (even though there is only one small tag here).
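To get only the number out of that text, a regular expression is one option. This is a minimal sketch on a made-up snippet; the markup, including the href value, is an assumption, since the live page may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the <small> element on the page.
html = '<small><a href="/metrics">Search over 301,944 datasets</a>\n</small>'
soup = BeautifulSoup(html, "html.parser")

text = soup.find("small").a.text

# Match a run of digits that may contain thousands separators.
match = re.search(r"\d[\d,]*", text)
number = match.group(0) if match else None

# Strip the commas if an integer is needed.
count = int(number.replace(",", "")) if number else None
print(number, count)
```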

Scraping categories and subcategories using beautifulsoup

I am trying to retrieve all categories and subcategories within a website. I am able to use BeautifulSoup to pull every single product in the category once I am in it. However, I am struggling with the loop for categories. I'm using this as a test website: http://www.shophive.com.
How do I loop through each category as well as the subcategories on the left side of the website? I would like to extract all products within the category/subcategory and display on my page.
from bs4 import BeautifulSoup
import user_agent
import requests
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
    # search inside the current div; soup.find('a') would return the page's first link every time
    main_category_links.append(div.find('a').get('href'))
print(main_category_links)
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
    subcategory_links.append(link.get('href'))
print(subcategory_links)
I'll break this down for you piece by piece.
useragent = user_agent.generate_user_agent(device_type='desktop')
headers = {'User-Agent': useragent}
req = requests.get('http://www.shophive.com/', headers=headers)
html = req.text
Here we just make the request and store the HTML. I use a module called "user_agent" to generate a User Agent to use in the headers, just my preference.
<div class="parentMenu arrow">
<a href="http://www.shophive.com/year-end-clearance-sale">
<span>New Year's Clearance Sale</span>
</a>
</div>
The links for the main categories are stored like so, so in order to extract just the links we do this:
main_category_links = []
for div in soup.find_all('div', class_='parentMenu arrow'):
    main_category_links.append(div.find('a').get('href'))
We iterate over the results of soup.find_all('div', class_='parentMenu arrow') since the links we want are children of these elements. Then we append div.find('a').get('href') to our list of main category links. We call find on div rather than on soup so that the search is limited to the current element, and we use find rather than find_all because we only want one result; then we get the contents of the href.
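The search inside the loop should start from div rather than from soup. This small sketch shows why, using made-up markup; the fragment and its URLs are assumptions, not the shop's real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an unrelated link followed by two category blocks.
html = """
<a href="/logo">Logo</a>
<div class="parentMenu arrow"><a href="/sale"><span>Sale</span></a></div>
<div class="parentMenu arrow"><a href="/apple"><span>Apple</span></a></div>
"""
soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="parentMenu arrow")

# soup.find('a') always returns the first <a> in the whole document.
from_soup = [soup.find("a").get("href") for div in divs]

# div.find('a') returns the first <a> inside the current div.
from_div = [div.find("a").get("href") for div in divs]

print(from_soup)
print(from_div)
```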
<a class="itemMenuName level1" href="http://www.shophive.com/apple/mac">
<span>Mac / Macbooks</span>
</a>
The subcategories are stored like this. Notice that the "a" tag has a class this time, which makes it a little easier for us to find.
subcategory_links = []
for link in soup.find_all('a', class_='itemMenuName'):
    subcategory_links.append(link.get('href'))
Here we iterate over soup.find_all('a', class_='itemMenuName'). When you search by class in BeautifulSoup, class is treated as a multi-valued attribute, so you can match an element on any single one of its classes. This is helpful to us in this case since the class attribute varies from itemMenuName level1 to itemMenuName level2. These elements have the link inside of them already, so we just extract the contents of the href that holds the URL with link.get('href') and append it to our list of subcategory links.
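A short sketch of how the class search behaves, again on made-up markup (the third link and its class are hypothetical). It suggests that the match is on whole class names, not on arbitrary substrings.

```python
from bs4 import BeautifulSoup

# Hypothetical links; the last one has a different, shorter class name.
html = """
<a class="itemMenuName level1" href="/apple/mac"></a>
<a class="itemMenuName level2" href="/apple/mac/air"></a>
<a class="itemMenu" href="/other"></a>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches any <a> that has itemMenuName among its classes.
links = [a.get("href") for a in soup.find_all("a", class_="itemMenuName")]

# 'itemMenu' is a different class name, so it does not match 'itemMenuName'.
partial = [a.get("href") for a in soup.find_all("a", class_="itemMenu")]

print(links)
print(partial)
```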