Python in Beautiful Soup - beautifulsoup

I want to fetch the text that follows each <label> tag in:
<label>A</label> class A <br/><label>B</label> class B <br/> <label>C </label> class C <br />
Expected output, as a dictionary:
{'A':'class A','B':'class B','C':'class C'}

You can search for each <label> tag and then take the next text sibling that follows it.
For example:
from bs4 import BeautifulSoup
txt = '''<label>A</label> class A <br/><label>B</label> class B <br/> <label>C </label> class C <br />'''
soup = BeautifulSoup(txt, 'html.parser')
data = {label.get_text(strip=True): label.find_next_sibling(string=True).strip() for label in soup.select('label')}
print(data)
Prints:
{'A': 'class A', 'B': 'class B', 'C': 'class C'}
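If some labels might not be followed by any text, a slightly more defensive variant can help. The sketch below is only an illustration: it looks at each label's immediate next_sibling and skips the label when that sibling is a tag or is blank. The extra <label>D</label> in the sample is an assumption added to show the skipped case.

```python
from bs4 import BeautifulSoup, NavigableString

# Same markup as above, plus a hypothetical label "D" with no text after it.
txt = '''<label>A</label> class A <br/><label>B</label> class B <br/> <label>C </label> class C <br /><label>D</label><br/>'''

soup = BeautifulSoup(txt, 'html.parser')

data = {}
for label in soup.find_all('label'):
    sibling = label.next_sibling
    # Keep the pair only when the node right after the label is non-blank text.
    if isinstance(sibling, NavigableString) and sibling.strip():
        data[label.get_text(strip=True)] = sibling.strip()

print(data)
```

Unlike find_next_sibling(string=True), this never reaches past a <br/> to pick up text that belongs to a later label.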

Related

Dynamic dataframe concatenation takes garbage value in python-flask

The snippets below take a list of features dynamically from the HTML form; app.py computes each selected feature, appends the selected features together, and writes them to a CSV file. The problem is that, during concatenation, the dataframes of the features that were not selected take on garbage values. Please also suggest how to add the feature names to the header dynamically.
<input type="checkbox" id="meanT" name="tdf" value="meanT">
<label for="mean"> Mean</label><br>
<input type="checkbox" id="stdT" name="tdf" value="stdT">
<label for="std"> Standard Deviation</label><br>
<input type="checkbox" id="medianT" name="tdf" value="medianT">
<label for="median"> Median</label><br>
<input type="checkbox" id="madT" name="tdf" value="madT">
<label for="mad"> Mean Absolute Deviation </label><br>
<input type="checkbox" id="rmsT" name="tdf" value="rmsT">
<label for="rms"> Root Mean Square</label><br>
<input type="checkbox" id="covT" name="tdf" value="covT">
<label for="cov"> Covariance</label><br>
app.py
@app.route('/feature_selection', methods=['GET', 'POST'])
def feature_selection():
    if request.method == 'POST':
        features = request.form.getlist('tdf')
        import os
        ROOT_PATH = os.path.dirname(os.path.abspath(__file__))
        files = request.files['fs_file']
        files.save(os.path.join(ROOT_PATH, files.filename))
        import pandas
        raw_csv2 = pandas.read_csv(os.path.join(ROOT_PATH, files.filename))
        X = raw_csv2.iloc[:, :-1]
        print(X)
        print(len(X.columns))
        np.savetxt("D:/tool/feat_tobe_sel.csv", X, delimiter=',', fmt='%s')
        from scipy.fftpack import fft
        final = []
        final_mean = np.empty((1, len(X.columns)), np.float64)
        final_std = np.empty((1, len(X.columns)), np.float64)
        final_median = np.empty((1, len(X.columns)), np.float64)
        final_mad = np.empty((1, len(X.columns)), np.float64)
        final_rms = np.empty((1, len(X.columns)), np.float64)
        final_cov = np.empty((1, len(X.columns)), np.float64)
        for feature in features:
            print(feature)
            if feature == 'meanT':
                for chunk in pd.read_csv('D:/tool/feat_tobe_sel.csv', chunksize=250):
                    mean = np.array(chunk.mean())  # mean
                    final_mean = np.append(final_mean, [mean], axis=0)
                print("meanT")
            elif feature == 'stdT':
                for chunk in pd.read_csv('D:/tool/feat_tobe_sel.csv', chunksize=250):
                    std = np.array(chunk.std())  # standard deviation
                    final_std = np.append(final_std, [std], axis=0)
                print("stdT")
            elif feature == 'medianT':
                for chunk in pd.read_csv('D:/tool/feat_tobe_sel.csv', chunksize=250):
                    median = np.array(chunk.median())  # median
                    final_median = np.append(final_median, [median], axis=0)
                print("medianT")
            elif feature == 'madT':
                for chunk in pd.read_csv('D:/tool/feat_tobe_sel.csv', chunksize=250):
                    mad = np.array(chunk.mad())
                    final_mad = np.append(final_mad, [mad], axis=0)
            elif feature == 'rmsT':
                for chunk in pd.read_csv('D:/tool/feat_tobe_sel.csv', chunksize=250):
                    rms = np.array(np.sqrt(np.mean(chunk**2)))
                    final_rms = np.append(final_rms, [rms], axis=0)
                print("rmsT")
            elif feature == 'covT':
                for chunk in pd.read_csv('D:/tool/feat_tobe_sel.csv', chunksize=250):
                    cov = chunk.cov()
                    for covItem in cov:
                        final_cov = np.append(final_cov, [np.array(cov[covItem])], axis=0)

        df2 = pandas.DataFrame(final_mean)
        df3 = pandas.DataFrame(final_std)
        df4 = pandas.DataFrame(final_median)
        df5 = pandas.DataFrame(final_mad)
        df6 = pandas.DataFrame(final_rms)
        df7 = pandas.DataFrame(final_cov)
        dfs = [df2, df3, df4, df5, df6, df7]
        non_empty = [df for df in dfs if len(df) != 0]
        dfm = pd.concat(non_empty, axis=1)
        np.savetxt(r"D:/tool/features_selected.csv", dfm, delimiter=',', fmt='%s')
        return render_template("feat.html")
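One likely source of the garbage values is np.empty, which allocates an array without initializing it. The first row of every final_* array therefore holds arbitrary memory contents, and the arrays of unselected features consist of that row only, so they are never filtered out by the len(df) != 0 check. A minimal sketch of one way around this, assuming the data fits in memory: build a labelled block only for the features that were actually selected. The names FEATURE_FUNCS and compute_features are hypothetical and not part of the original app; the mad and cov features are left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from checkbox value to a per-chunk statistic.
FEATURE_FUNCS = {
    "meanT": lambda chunk: chunk.mean(),
    "stdT": lambda chunk: chunk.std(),
    "medianT": lambda chunk: chunk.median(),
    "rmsT": lambda chunk: np.sqrt((chunk ** 2).mean()),
}


def compute_features(X, selected, chunksize=250):
    """Return one labelled column block per selected feature."""
    blocks = []
    for name in selected:
        func = FEATURE_FUNCS.get(name)
        if func is None:
            continue  # unknown or unsupported checkbox value
        # One row of statistics per chunk of `chunksize` rows.
        rows = [func(X.iloc[start:start + chunksize])
                for start in range(0, len(X), chunksize)]
        block = pd.DataFrame(rows).reset_index(drop=True)
        # Prefix each column with the feature name, e.g. "meanT_a".
        block.columns = [f"{name}_{col}" for col in block.columns]
        blocks.append(block)
    return pd.concat(blocks, axis=1) if blocks else pd.DataFrame()


# Small made-up frame standing in for the uploaded CSV.
X = pd.DataFrame({"a": [1.0, 3.0, 5.0, 7.0], "b": [2.0, 2.0, 4.0, 4.0]})
result = compute_features(X, ["meanT", "rmsT"], chunksize=2)
print(result)
```

Writing the result with result.to_csv(path, index=False) would also put the generated column names in the header row, which np.savetxt does not do on its own.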

How can I get a value from an attribute inside a tag as a int

I have a soup object like:
class="js-product-discount-item product-discount__item ">
<p class="product-discount__price js-product-discount-price">
<span class="price">3 033 <span class="currency w500">₽<span class="currency_seo">руб.</span></span></span> </p>
I did
soup = BeautifulSoup(src, 'lxml')
price_2 = soup.find(class_='price-discount-value').find(class_='price').text.strip()
x = 2
Result :
3 033 ₽руб.
I'd like to make:
price_3 = price_2/x
But I get: TypeError: unsupported operand type(s) for /: 'str' and 'int'
What happens?
You are extracting a string with .text, but the / operator needs a number (for example an int) on both sides.
How to fix?
First of all, strip the non-digit characters from your string:
...find(class_='price').text.split('₽')[0].replace(' ','')
Then convert it to an integer with int() before calculating:
int(price_2)/x
Example
Note: I changed the find() call for this example, because your question does not provide complete HTML.
from bs4 import BeautifulSoup
html = '''
<p class="product-discount__price js-product-discount-price">
<span class="price">3 033 <span class="currency w500">₽<span class="currency_seo">руб.</span></span></span>
</p>'''
soup = BeautifulSoup(html, 'lxml')
price_2 = soup.find(class_='product-discount__price').find(class_='price').text.split('₽')[0].replace(' ','')
x = 2
price_3 = int(price_2)/x
print(price_3)
Output
1516.5
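Splitting on the currency sign works for this page, but it depends on that exact character being present. As a hedged alternative, the sketch below reads only the first text node of the price span and removes every non-digit character with a regular expression. It assumes prices are whole numbers, and it uses html.parser only so that the example has no lxml dependency.

```python
import re
from bs4 import BeautifulSoup

html = '''
<p class="product-discount__price js-product-discount-price">
<span class="price">3 033 <span class="currency w500">₽<span class="currency_seo">руб.</span></span></span>
</p>'''

soup = BeautifulSoup(html, 'html.parser')

price_span = soup.find(class_='price')
# First text node inside the span: the digits, before the nested currency spans.
first_text = price_span.find(string=True)
# \D matches any non-digit, including regular and non-breaking spaces.
price = int(re.sub(r'\D', '', first_text))

x = 2
print(price / x)
```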

How to find_all(data-video-id) from html using beautiful soup

How can I get the data-video-id attribute from the below HTML using BeautifulSoup?
<a href="/watch/36242552" class="thumbnail video vod-show play-video-trigger user-can-watch" data-video-id="36242552" data-video-type="show">
The following prints an empty list.
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
ids = [tag['data-video-id'] for tag in soup.select('a href[data-video-id]')]
print(ids)
Output:
[]
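You are getting an empty list because soup.select('a href[data-video-id]') matches nothing: that selector looks for an <href> element nested inside an <a> element, but href is an attribute, not a tag. Filter the <a> tags on the attribute instead: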
from bs4 import BeautifulSoup
html = """<a href="/watch/36242552" class="thumbnail video vod-show play-video-trigger user-can-watch" data-video-id="36242552" data-video-type="show">"""
# html_content = requests.get(url).text
soup = BeautifulSoup(html, "lxml")
print(soup.select('a href[data-video-id]'))
ids = [tag['data-video-id'] for tag in soup.select('a') if tag.has_attr('data-video-id')]
print(ids)
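The attribute test can also go straight into the selector. A short sketch, using a static snippet instead of the live page; the second link without the attribute is an assumption added to show that it gets filtered out:

```python
from bs4 import BeautifulSoup

# Sample markup; the "/about" link is a made-up extra without data-video-id.
html = '''
<a href="/watch/36242552" class="thumbnail video" data-video-id="36242552" data-video-type="show">Watch</a>
<a href="/about">About</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# CSS attribute selector: <a> elements that have a data-video-id attribute.
ids = [tag['data-video-id'] for tag in soup.select('a[data-video-id]')]

# Equivalent filter with find_all: True means "the attribute is present".
ids_find_all = [tag['data-video-id']
                for tag in soup.find_all('a', attrs={'data-video-id': True})]

print(ids)
```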

Text between two h2 tags using BeautifulSoup

I am learning to scrape with Selenium, parsing the page_source with BeautifulSoup's "html.parser". I can find all the h2 tags with a given class name, but extracting the text inside them doesn't seem to work.
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup
opts = webdriver.ChromeOptions()
opts.binary_location = os.environ.get('GOOGLE_CHROME_BIN', None)
opts.add_argument("--headless")
opts.add_argument("--disable-dev-shm-usage")
opts.add_argument("--no-sandbox")
browser = webdriver.Chrome(executable_path="chromedriver", options=opts)
url1='https://www.animechrono.com/date-a-live-series-watch-order'
browser.get(url1)
req = browser.page_source
sou = soup(req, "html.parser")
h = sou.find_all('h2', class_='heading-5')
p = sou.find_all('div', class_='text-block-5')
for i in range(len(h)):
    h[i] == h[i].getText()
for j in range(len(p)):
    p[j] = p[j].getText()
print(h)
print(p)
browser.quit()
My output:
[<h2 class="heading-5">Season 1</h2>, <h2 class="heading-5">Date to Date OVA</h2>, <h2 class="heading-5">Season 2</h2>, <h2 class="heading-5">Kurumi Star Festival OVA</h2>, <h2 class="heading-5">Date A Live Movie: Mayuri Judgement</h2>, <h2 class="heading-5">Season 3</h2>, <h2 class="heading-5">Date A Bullet: Dead or Bullet Movie</h2>, <h2 class="heading-5">Date A Bullet: Nightmare or Queen Movie</h2>]
['Episodes 1-12', 'Date to Date OVA', 'Episodes 1-10', 'Kurumi Star Festival OVA', 'Date A Live Movie: Mayuri Judgement', 'Episodes 1-12', 'Date A Bullet: Dead or Bullet Movie', 'Date A Bullet: Nightmare or Queen Movie']
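In your first loop, h[i] == h[i].getText() is a comparison, not an assignment, so the list is never updated. Add these lines before browser.quit():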
h = [elem.text for elem in h]
print(h)
Full code:
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as soup
opts = webdriver.ChromeOptions()
opts.binary_location = os.environ.get('GOOGLE_CHROME_BIN', None)
opts.add_argument("--headless")
opts.add_argument("--disable-dev-shm-usage")
opts.add_argument("--no-sandbox")
browser = webdriver.Chrome(executable_path="chromedriver", options=opts)
url1='https://www.animechrono.com/date-a-live-series-watch-order'
browser.get(url1)
req = browser.page_source
sou = soup(req, "html.parser")
h = sou.find_all('h2', class_='heading-5')
p = sou.find_all('div', class_='text-block-5')
for j in range(len(p)):
    p[j] = p[j].getText()
h = [elem.text for elem in h]
print(h)
browser.quit()
Output:
['Season 1', 'Date to Date OVA', 'Season 2', 'Kurumi Star Festival OVA', 'Date A Live Movie: Mayuri Judgement', 'Season 3', 'Date A Bullet: Dead or Bullet Movie', 'Date A Bullet: Nightmare or Queen Movie']
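Once both lists hold plain text, they can be paired up. The sketch below is an illustration on a static snippet rather than the live page: it assumes each h2.heading-5 is matched, in order, by one div.text-block-5, which is how the output above looks but is not guaranteed by the site.

```python
from bs4 import BeautifulSoup

# Hand-written sample in the shape of the page's headings and text blocks.
html = '''
<div><h2 class="heading-5">Season 1</h2><div class="text-block-5">Episodes 1-12</div></div>
<div><h2 class="heading-5">Date to Date OVA</h2><div class="text-block-5">Date to Date OVA</div></div>
'''

sou = BeautifulSoup(html, 'html.parser')

headings = [h.get_text(strip=True) for h in sou.find_all('h2', class_='heading-5')]
blocks = [d.get_text(strip=True) for d in sou.find_all('div', class_='text-block-5')]

# zip pairs items by position and stops at the shorter list.
watch_order = list(zip(headings, blocks))
print(watch_order)
```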

getting the alt value in the div tag using beautifulsoup

I'm trying to get the value "4" from the HTML below, taken from this website. It is just one of the values on the product list page; I want to collect multiple values in a list so that I can put them in a dataframe.
<div class="review-stars-on-hover">
<div class="product-rating">
<div class="product-rating__meter" alt="4">
<div class="product-rating__meter-btm">★★★★★</div>
<div class="product-rating__meter-top" style="width:80%;">★★★★★</div>
</div>
<div class="product-rating__count edf-font-size--xsmall nsg-text--medium-grey" alt="95">(95)</div>
</div>
</div>...
I tried:
items = soup.select('.grid-item-content')
star = [item.find('div', {'class': 'review-stars-on-hover'}).get('alt') for item in items]
Output (there are 16 products in total on the page, but only None shows up):
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
Any advice please?
Try the following code. The class you mentioned returns 16 elements, but only 11 of them contain an element with the class product-rating__meter, so I check whether product-rating__meter is present before printing the alt value.
Hope this helps.
from bs4 import BeautifulSoup
import requests
data= requests.get('https://store.nike.com/us/en_us/pw/mens-walking-shoes/7puZ9ypZoi3').content
soup = BeautifulSoup(data, 'lxml')
print("Total element count : " + str(len(soup.find_all('div',class_='grid-item-content'))))
for item in soup.find_all('div',class_='grid-item-content'):
    if item.find('div',class_='product-rating__meter'):
        print("Alt value : " + item.find('div',class_='product-rating__meter')['alt'])
Output
Total element count : 16
Alt value : 4
Alt value : 4.3
Alt value : 4.6
Alt value : 4.8
Alt value : 4.4
Alt value : 4.7
Alt value : 4.7
Alt value : 3.8
Alt value : 4.5
Alt value : 3.3
Alt value : 4.5
Edited to collect the values in a list:
from bs4 import BeautifulSoup
import requests
data= requests.get('https://store.nike.com/us/en_us/pw/mens-walking-shoes/7puZ9ypZoi3').content
soup = BeautifulSoup(data, 'lxml')
print("Total element count : " + str(len(soup.find_all('div',class_='grid-item-content'))))
itemlist=[]
for item in soup.find_all('div',class_='grid-item-content'):
    if item.find('div',class_='product-rating__meter'):
        #print("Alt value : " + item.find('div',class_='product-rating__meter')['alt'])
        itemlist.append("Alt value : " + item.find('div',class_='product-rating__meter')['alt'])
print(itemlist)
Output:
Total element count : 16
['Alt value : 4', 'Alt value : 4.3', 'Alt value : 4.6', 'Alt value : 4.8', 'Alt value : 4.4', 'Alt value : 4.7', 'Alt value : 4.7', 'Alt value : 3.8', 'Alt value : 4.5', 'Alt value : 3.3', 'Alt value : 4.5']
Alternatively, select only the parent elements that contain the inner class, and take the first match within each:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://store.nike.com/us/en_us/pw/mens-walking-shoes/7puZ9ypZoi3')
soup = bs(r.content, 'lxml')
stars = [item.select_one('.product-rating__meter')['alt'] for item in soup.select('.grid-item-box:has(.product-rating__meter)')]
You can also retrieve all divs that have an "alt" attribute (soup here is your BeautifulSoup object):
divs = soup.find_all("div", {"alt": True})
And to retrieve the values:
for div in divs:
    print(div["alt"])
Or, if you only want the first "alt":
first_alt = soup.find("div", {"alt": True})["alt"]
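Since the goal is a list that goes into a dataframe, it may be useful to keep one entry per product and store None where a product has no rating, so the list stays aligned with other per-product columns. A minimal sketch on hand-written sample markup (the three products below are made up to mirror the structure in the question):

```python
from bs4 import BeautifulSoup

# Three hypothetical products: rated, unrated, rated.
html = '''
<div class="grid-item-content">
  <div class="review-stars-on-hover"><div class="product-rating">
    <div class="product-rating__meter" alt="4"></div>
  </div></div>
</div>
<div class="grid-item-content"></div>
<div class="grid-item-content">
  <div class="review-stars-on-hover"><div class="product-rating">
    <div class="product-rating__meter" alt="4.3"></div>
  </div></div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

stars = []
for item in soup.select('.grid-item-content'):
    # The alt attribute sits on the nested meter div, not on the hover wrapper.
    meter = item.select_one('.product-rating__meter')
    stars.append(float(meter['alt']) if meter else None)

print(stars)
```

This also shows why the original attempt returned only None: .get('alt') was called on review-stars-on-hover, which has no alt attribute of its own.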