When i am using soup. Find(p)['class'] its is saying literal['class'] cannot be assigns to the type 'SupportsIndex' slice - beautifulsoup

import requests
from bs4 import BeautifulSoup, NavigableString, Tag
url = "https://www.codewithharry.com"
r = requests.get(url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')
print(soup.find('p')['class'])

This code showed me a warning in vs code for the written code and the output that was given was correct so you can just write there as shown below:
print (soup. Find('p')['class']) # type: ignore
Sometimes it show the warning in the terminal in vs code so you can use this to stop the error

Related

bs4 'find() takes no keyword arguments' error

from bs4 import BeautifulSoup
import requests
link = requests.get('https://www.amazon.sg/s?k=monitor&ref=nb_sb_noss_2').text
soup = BeautifulSoup(link, 'lxml')
product = soup.find('span', class_='a-offscreen').text
product_name = product.find('a', class_='a-link-normal a-text-normal').text
print(f'''
price: {products}
name: {product_name}
''')
When I executed this I got this error
TypeError: find() takes no keyword arguments
I have tried changing the class, I have retyped the entire thing twice
please help
Your second find is invoked on a str (so you're calling the builtin str.find() function) since you extract the text of the first span with .text:
product = soup.find('span', class_='a-offscreen').text # <---
product_name = product.find('a', class_='a-link-normal a-text-normal').text

How to get data from IMDb using BeautifulSoup

I'm trying to get the names of top 250 IMDb movies using BeautifulSoup. The code does not execute properly and shows no errors.
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/chart/top"
response = requests.get(url)
rc = response.content
soup = BeautifulSoup(rc,"html.parser")
for i in soup.find_all("td",{"class:":"titleColumn"}):
print(i)
I'm expecting it show me all of the td tags with titleColumn classes but it is not working. Am I missing something? Thanks in advance!
Remove the : after the class:
{"class:":"titleColumn"}
to
{"class":"titleColumn"}
Example ++
import requests
from bs4 import BeautifulSoup
url = "https://www.imdb.com/chart/top"
response = requests.get(url)
rc = response.content
soup = BeautifulSoup(rc,"html.parser")
data = []
for i in soup.find_all("td",{"class":"titleColumn"}):
data.append({
'people':i.a['title'],
'title':i.a.get_text(),
'info':i.span.get_text()
})
data

Problem Replacing <br> Tags with Newline Using bs4

Problem: I cannot replace <br> tags with a newline character using Beautiful Soup 4.
Code: My program (the relevant portion of it) currently looks like
for br in board.select('br'):
br.replace_with('\n')
but I have also tried board.find_all() in place of board.select().
Results: When I use board.replace_with('\n') all <br> tags are replaced with the string literal \n. For example, <p>Hello<br>world</p> would end up becoming Hello\nworld. Using board.replace_with(\n) causes the error
File "<ipython-input-27-cdfade950fdf>", line 10
br.replace_with(\n)
^
SyntaxError: unexpected character after line continuation character
Other Information: I am using a Jupyter Notebook, if that is of any relevance. Here is my full program, as there may be some issue elsewhere that I have overlooked.
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get("https://boards.4chan.org/g/")
soup = BeautifulSoup(page.content, 'html.parser')
board = soup.find('div', class_='board')
for br in board.select('br'):
br.replace_with('\n')
message = [obj.get_text() for obj in board.select('.opContainer .postMessage')]
image = [obj['href'] for obj in board.select('.opContainer .fileThumb')]
pid = [obj.get_text() for obj in board.select('.opContainer .postInfo .postNum a[title="Reply to this post"]')]
time = [obj.get_text() for obj in board.select('.opContainer .postInfo .dateTime')]
for x in range(len(image)):
image[x] = "https:" + image[x]
post = pd.DataFrame({
"ID": pid,
"Time": time,
"Image": image,
"Message": message,
})
post
pd.options.display.max_rows
pd.set_option('display.max_colwidth', -1)
display(post)
Any advice would be appreciated. Thank you for reading.
Just tried and it works for me, my bs4 version is 4.8.0, and I am using Python 3.5.3,
example:
from bs4 import BeautifulSoup
soup = BeautifulSoup('hello<br>world')
for br in soup('br'):
br.replace_with('\n')
# <br> was replaced with \n successfully
assert str(soup) == '<html><body><p>hello\nworld</p></body></html>'
# get_text() also works as expected
assert soup.get_text() == 'hello\nworld'
# it is a \n not a \\n
assert soup.get_text() != 'hello\\nworld'
I am not used to work with Jupyter Notebook, but seems that your problem is that whatever you are using to visualize the data is showing you the string representation instead of actually printing the string,
hope this helps,
Regards,
adb
Instead of replacing after converting to soup, try replacing the <br> tags before converting. Like,
soup = BeautifulSoup(str(page.content).replace('<br>', '\n'), 'html.parser')
Hope this helps! Cheers!
P.S.: I did not get any logical reason why this is not working after changing into soup.
After experimenting with variations of
page = requests.get("https://boards.4chan.org/g/")
str_page = page.content.decode()
str_split = '\n<'.join(str_page.split('<'))
str_split = '>\n'.join(str_split.split('>'))
str_split = str_split.replace('\n', '')
str_split = str_split.replace('<br>', ' ')
soup = BeautifulSoup(str_split.encode(), 'html.parser')
for the better part of two hours, I have determined that the Panda data-frame prints the newline character as a string literal. Everything else indicates that the program is working as intended, so I assume this has been the problem all along.
for some reason direct replace with newline does not work with bs4 you have to first replace with some other unique character (character sequence preferably) and then replace that sequence in text with newline.
try this.
for br in soup.find_all('br'): br.replace_with('+++')
text=soup.get_text().replace('+++','\n)

How to include try and Exceptions tests in a thousands downloads program that uses selenium and requests?

I have a program to download photos on various websites. Each url is formed at the end of the address by codes, which are accessed in a dataframe. In a dataframe of 8,583 lines
The sites have javascript, so I use selenium to access the src of the photos. And I download it with urllib.request.urlretrieve
Example of a photo site: http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/PB/150000608817
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
from bs4 import BeautifulSoup
import time
import urllib.request, urllib.parse, urllib.error
# Root URL of the site that is accessed to fetch the photo link
url_raiz = 'http://divulgacandcontas.tse.jus.br/divulga/#/candidato/2018/2022802018/'
# Accesses the dataframe that has the "sequencial" type codes
candidatos = pd.read_excel('candidatos_2018.xlsx',sheet_name='Sheet1', converters={'sequencial': lambda x: str(x), 'cpf': lambda x: str(x),'numero_urna': lambda x: str(x)})
# Function that opens each page and takes the link from the photo
def pegalink(url):
profile = webdriver.FirefoxProfile()
browser = webdriver.Firefox(profile)
browser.get(url)
time.sleep(10)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
browser.close()
link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']
return link
# Function that downloads the photo and saves it with the code name "cpf"
def baixa_foto(nome, url):
urllib.request.urlretrieve(url, nome)
# Iteration in the dataframe
for num, row in candidatos.iterrows():
cpf = (row['cpf']).strip()
uf = (row['uf']).strip()
print(cpf)
print("-/-")
sequencial = (row['sequencial']).strip()
# Creates full page address
url = url_raiz + uf + '/' + sequencial
link_foto = pegalink(url)
baixa_foto(cpf, link_foto)
Please I look guidance for:
Put a try-Exception type to wait for the page to load (I'm having errors reading the src - after many hits the site takes more than ten seconds to load)
And I would like to record all possible errors - in a file or dataframe - to write down the "sequencial" code that gave error and continue the program
Would anyone know how to do it? The guidelines below were very useful, but I was unable to move forward
I put in a folder a part of the data I use and the program, if you want to look: https://drive.google.com/drive/folders/1lAnODBgC5ZUDINzGWMcvXKTzU7tVZXsj?usp=sharing
put your code within :
try:
WebDriverWait(browser, 30).until(wait_for(page_has_loaded))
# here goes your code
except: Exception
print "This is an unexpected condition!"
For waitForPageToLoad :
def page_has_loaded():
page_state = browser.execute_script(
'return document.readyState;'
)
return page_state == 'complete'
30 above is time in seconds. You can adjust it as per your need.
Approach 2 :
class wait_for_page_load(object):
def __init__(self, browser):
self.browser = browser
def __enter__(self):
self.old_page = self.browser.find_element_by_tag_name('html')
def page_has_loaded(self):
new_page = self.browser.find_element_by_tag_name('html')
return new_page.id != self.old_page.id
def __exit__(self, *_):
wait_for(self.page_has_loaded)
def pegalink(url):
profile = webdriver.FirefoxProfile()
browser = webdriver.Firefox(profile)
browser.get(url)
try:
with wait_for_page_load(browser):
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
browser.close()
link = soup.find("img", {"class": "img-thumbnail img-responsive dvg-cand-foto"})['src']
except Exception:
print ("This is an unexpected condition!")
print("Erro em: ", url)
link = "Erro"
return link

BeautifulSoup code Error

I cannot seem to find the problem in this code.
Help will be appreciated.
import requests
from bs4 import BeautifulSoup
url = 'http://nytimes.com'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html)
title = soup.find('span','articletitle').string
Code & Error Screenshot
The problem is http://nytimes.com does not have any articletitle span. To be safe, just check if soup.find('span','articletitle') is not None: before accessing it. Also, you don't need to access string property here. For example, the following would work fine.
import requests
from bs4 import BeautifulSoup
url = 'http://nytimes.com'
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, 'html.parser')
if soup.find('div', 'suggestions') is not None:
title = soup.find('div','suggestions')
print(title)
Put your code inside try and catch & then print exception which is occurring. using the exception occurred you can rectify the problem.
Hi use a parser as your second argument for the get method,
Ex:-
page_content = BeautifulSoup(r.content, "html.parser")