Beautiful Soup List of Words - beautifulsoup

I am trying to find certain words within a website. Right now my code can only check for one word, but I want it to check for multiple words (say, instead of just checking for 'dog', I want it to check for ["dog", "cat", "adult"]).
#Import Packages
import requests
from bs4 import BeautifulSoup

def count_words(url, the_word):
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content, 'lxml')
    words = soup.find(text=lambda text: text and the_word in text)
    print(words)

def main():
    url = 'https://patch.com/illinois/alsip-crestwood/pet-adoption-alsip-crestwood-area-see-latest-dogs-cats-more'
    word= 'dog'
    count = count_words(url, word)
    print(url, count, word)

if __name__ == '__main__':
    main()
Basically I do not know how to pass in a list of words instead of one singular string!

I believe you're making it more complicated than necessary. Try something like this:
url = "https://patch.com/illinois/alsip-crestwood/pet-adoption-alsip-crestwood-area-see-latest-dogs-cats-more"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
pets = ["dog","cat"]
for pet in pets:
    print(pet, len(soup.find_all(text=lambda text: text and pet in text)))
Output:
dog 13
cat 76
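
Note that the snippet above counts the text nodes that contain each substring, so "cat" also matches a node containing a word such as "location". If whole-word occurrence counts are wanted instead, a minimal sketch could look like the following. It assumes the page text has already been extracted (for example with soup.get_text()); count_words_in_text and the sample string are made up for illustration.

```python
import re
from collections import Counter


def count_words_in_text(text, words):
    """Count whole-word, case-insensitive occurrences of each word in text."""
    # Tokenize once so the text is not rescanned for every word.
    tokens = Counter(re.findall(r"[a-z']+", text.lower()))
    return {word: tokens[word.lower()] for word in words}


# Made-up sample text standing in for soup.get_text().
sample = "Adopt a dog today. This dog likes every cat at this location."
print(count_words_in_text(sample, ["dog", "cat", "adult"]))
```

With this approach plurals such as "dogs" are separate tokens, so they would need to be listed explicitly if they should be counted.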

Related

Extract part of string(/soup element) within a list of lists

I'm having some issues with scraping fish images off a website.
species_with_foto = ['/fangster/aborre-perca-fluviatilis/1',
                     '/fangster/almindelig-tangnaal-syngnathus-typhle/155',
                     '/fangster/ansjos-engraulis-encrasicholus/66',
                     '/fangster/atlantisk-tun-blaafinnet-tun-thunnus-thynnus-/137']
titles = []
species = []
for x in species_with_foto:
    specie_page = 'https://www.fiskefoto.dk'+x
    driver.get(specie_page)
    content = driver.page_source
    soup = BeautifulSoup(content)
    brutto = soup.find_all('img', attrs={'class':'rapportBillede'})
    species.append(brutto)
    #print(brutto)
    titles.append(x)
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('CLicked next', x)
    except NoSuchElementException:
        print('Succesfully finished - :', x)
    time.sleep(2)
This returns a list of lists with the sublist looking like this:
[<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg" style="width:50%;"/>,
<img alt="Aborre (Perca fluviatilis) aborrefiskeri, striber, rygfinne, regnorm, majs, spinner, " class="rapportBillede" src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" style="width:calc(50% - 6px);margin-bottom:7px;"/>]
How can I clean up the list and keep only the src="/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg" part? I have tried other arguments to soup.find_all but can't get it to work.
(The selenium part is also not functioning properly, but I'll get to that afterwards.)
EDIT:
This is my code now, and I'm getting really close :) One issue is that my photos are now saved in a flat list instead of a list of lists, and I can't figure out why this happens.
Help with fixing and understanding this would be greatly appreciated!
titles = []
fish_photos = []
for x in species_with_foto_mini:
    site = "https://www.fiskefoto.dk/"+x
    html = urlopen(site)
    bs = BeautifulSoup(html, 'html.parser')
    titles.append(x)
    try:
        images = bs.find_all('img', attrs={'class':'rapportBillede'})
        for img in images:
            if img.has_attr('src'):
                #print(img['src'])
                a = (img['src'])
                fish_photos.append(a)
    except KeyError:
        print('No src')
    #navigate pages
    try:
        driver.find_element(by=By.XPATH, value='/html/body/form/div[4]/div[1]/div/div[13]/div[2]/div/div').click()
        print('CLicked next', x)
    except NoSuchElementException:
        print('Succesfully finished -', x)
    time.sleep(2)
EDIT:
I need the end result to be a list of lists looking something like this:
fish_photos = [
    ['/images/400/aborre-perca-fluviatilis-medefiskeri-bundrig-0,220kg-24cm-striber-rygfinne-regnorm-majs-spinner-358-22-29-14-740-2013-21-4.jpg',
     '/images/400/aborre-perca-fluviatilis-medefiskeri-prop-flaad-med-levende-skalle-paa-enkeltkrog-1.6kg-46cm-6604-1724617.jpg'],
    ['/images/400/tungehvarre-arnoglossus-laterna-medefiskeri-6650-2523403.jpg',
     '/images/400/ulk-myoxocephalus-scorpius-medefiskeri-bundrig-koebenhavner-koebenhavner-torsk-mole-sild-boersteorm-pigge-351-18-48-9-680-2013-6-4.jpg'],
    ['/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-5.02kg-77cm-6436-7486.jpg',
     '/images/400/graeskarpe-ctenopharyngodon-idella-medefiskeri-bobleflaad-med-toastbroed-paa-enkeltkrog-10.38kg-96cm-6337-4823146.jpg'],
]
EDIT:
My output now is a list of identical lists. I need each species to go into its own list, like this: fish_photo_list = [[trout1, trout2, trout3], [other fish1, other fish 2, other], [salmon1, salmon2]]
My initial code did this, but the current version does not.

Here is an example that you can adapt:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "[insert name of the site]"
html = urlopen(site)
bs = BeautifulSoup(html, 'html.parser')
try:
    images = bs.find_all('img')
    for img in images:
        if img.has_attr('src'):
            print(img['src'])
except KeyError:
    print('No src')
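
On the list-of-lists part of the question: the photos end up in a flat list when every src is appended to the same outer list. One way to get one inner list per species is to build a fresh list inside the loop and append that list once per page. The sketch below shows only that grouping logic; group_photos_by_page and the fake_pages mapping are hypothetical stand-ins for fetching a page and reading each img['src'].

```python
def group_photos_by_page(paths, get_srcs):
    """Return one inner list of image sources per page path."""
    fish_photos = []
    for path in paths:
        # A fresh list per species keeps each page's photos separate.
        photos = [src for src in get_srcs(path) if src]
        fish_photos.append(photos)
    return fish_photos


# Hypothetical stand-in for the real page fetch and find_all('img') call.
fake_pages = {
    "/fangster/a/1": ["/images/400/a-1.jpg", "/images/400/a-2.jpg"],
    "/fangster/b/2": ["/images/400/b-1.jpg"],
}
result = group_photos_by_page(fake_pages, fake_pages.get)
print(result)
```

The key point is that the inner list is created inside the loop; creating it once outside the loop and appending it repeatedly would produce a list of identical lists.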

Pandas: Creating multiple new columns from function with multiple output values

I'm trying to scrape a website for multiple values for each book in a list. The links to the book pages are stored in a dataframe. Now I need a function that iterates over those links and adds the book values to new columns in the dataframe. I don't want to request the page again for every value I scrape, so I want to do it all in one function.
The problem is the function then returns multiple values (e.g. book_title and book_rating) which I don't know how to best add to the dataframe.
I tried the following, which I know can't work but I'm stuck:
import requests as rq
from bs4 import BeautifulSoup as bs
import pandas as pd

#function to get the book page
def get_book_page(page):
    # Construct the URL
    books_page_url = page
    # Get the HTML page content using requests
    response = rq.get(books_page_url, headers = headers)
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + books_page_url)
    # Construct a beautiful soup document
    doc = bs(response.content, "html.parser")
    return doc

#function to scrape the book title
def scrape_book_title(book_content):
    try:
        title_tag = book_content.find("h1", class_="bc-heading bc-color-base bc-size-large bc-text-bold").text.strip()
    except:
        title_tag = "fehlt"
    return title_tag

#function to scrape the book rating
def scrape_book_rating(book_content):
    star_tag = book_content.find("li", class_="bc-list-item ratingsLabel")
    try:
        rating_tag = star_tag.find("span", class_="bc-text bc-pub-offscreen").text.strip()
    except:
        rating_tag = "fehlt"
    return rating_tag

#function I'm trying to fix
def get_book_title(links):
    bs_page = get_book_page(links)
    bs_content = bs_page.find("ul", class_="bc-list bc-spacing-s2 bc-color-secondary bc-list-nostyle")
    book_title = scrape_book_title(bs_content)
    book_rating = scrape_book_rating(bs_content)
    return book_title, book_rating

#here I would like to add the columns "A_Titel" and "A_Rating" with the values of "book_title" and "book_rating"
df['A_Titel'], df['A_Rating'] = df.apply(lambda x: get_book_title(x.Link), axis=1)
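
The assignment on the last line fails because df.apply(..., axis=1) returns a single Series of (title, rating) tuples, one per row, which cannot be unpacked into two columns unless the frame happens to have exactly two rows. One common way around this is to call the function once per link and transpose the results with zip(*...). The sketch below shows the idea without pandas or network access; fake_get_book_title and the links list are placeholders for get_book_title and df['Link'].

```python
def fake_get_book_title(link):
    """Placeholder for get_book_title: returns a (title, rating) tuple."""
    return f"title of {link}", f"rating of {link}"


links = ["book-1", "book-2", "book-3"]

# One call per link, then transpose the rows of pairs into two columns.
pairs = [fake_get_book_title(link) for link in links]
titles, ratings = zip(*pairs)

print(titles)
print(ratings)
```

With the real dataframe, the results could then be assigned as df['A_Titel'] = list(titles) and df['A_Rating'] = list(ratings). pandas also offers df.apply(..., axis=1, result_type='expand'), which returns a dataframe with one column per returned value.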

Using BeautifulSoup to exploit a URL and its dependent pages and store results in csv?

This code does not crash, which is good. However, it generates an empty icao_publications.csv file. I want to populate icao_publications.csv with all records from all the pages reachable from the URL. The dataset should be about 10,000 rows in all.
I want to get these 10,000 or so rows into the CSV file.
import requests, csv
from bs4 import BeautifulSoup

url = 'https://www.icao.int/publications/DOC8643/Pages/Search.aspx'
with open('Test1_Aircraft_Type_Designators.csv', "w", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Manufacturers", "Model", "Type_Designator", "Description", "Engine_Type", "Engine_Count", "WTC"])
    while True:
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        for row in soup.select('table tbody tr'):
            writer.writerow([c.text if c.text else '' for c in row.select('td')])
        if soup.select_one('li.paginate_button.active + li a'):
            url = soup.select_one('li.paginate_button.active + li a')['href']
        else:
            break

Here you go:
import requests
import pandas as pd
url = 'https://www4.icao.int/doc8643/External/AircraftTypes'
resp = requests.post(url).json()
df = pd.DataFrame(resp)
df.to_csv('aircraft.csv',encoding='utf-8',index=False)
print('Saved to aircraft.csv')
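
If pandas is not available, the same records-to-CSV step can be sketched with the standard library's csv.DictWriter. This assumes the endpoint returns a JSON list of flat objects; the sample records and their keys below are invented for illustration and are not the real field names.

```python
import csv
import io


def records_to_csv(records, out):
    """Write a list of flat dicts to a file-like object as CSV."""
    if not records:
        return
    # Use the keys of the first record as the header row.
    fieldnames = list(records[0].keys())
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)


# Invented sample data standing in for requests.post(url).json().
sample = [
    {"Manufacturer": "ExampleCo", "Model": "X-1", "Designator": "EX1"},
    {"Manufacturer": "ExampleCo", "Model": "X-2", "Designator": "EX2"},
]
buffer = io.StringIO()
records_to_csv(sample, buffer)
print(buffer.getvalue())
```

For a real file, out would be a handle opened with newline="" and encoding="utf-8", as the csv module documentation recommends.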

How to get value of a cell in html page when click to a link in list link?

I have a list of about 5000 links.
For example, 2 of the 5000 links:
https://racevietnam.com/runner/buiducninh/ecopark-marathon-2019
https://racevietnam.com/runner/drtungnguyen83/ecopark-marathon-2019
...
For each link, I want to get the value in the "Time of Day" column of the "Finish" row.
Example:
09:51:07 AM - https://racevietnam.com/runner/buiducninh/ecopark-marathon-2019
07:50:55 AM - https://racevietnam.com/runner/ngocsondknb/ecopark-marathon-2019
I was able to get user info from another website because its elements have an id and class. But the table in https://racevietnam.com/runner/ngocsondknb/ecopark-marathon-2019 has no id or class, so I can't select it the same way.
#!/usr/bin/python
from urllib.request import urlopen
from bs4 import BeautifulSoup

list_user = []
for userID in range(1, 100000):
    link = "https://example.com/member.php?u=" + str(userID)
    html = urlopen(link)
    bsObj = BeautifulSoup(html, "lxml")
    user_name = bsObj.find("div", {"id":"main_userinfo"}).h1.get_text()
    list_user.append(user_name)
    print("username", userID, "is: ", user_name)
    with open("result.txt", "a") as myfile:
        myfile.write(user_name)
Please help me.
Thank you.

Using bs4 4.7.1.
There is only one table, and you want the second column (td) of the last row. You can use :last-child to select the last row; it should be used together with the tbody type selector and the > child combinator so that the header row is not matched. You can use :nth-of-type to specify which td cell to return.
Now you may wish to develop this in at least two ways:
Handle cases where the element is not found, e.g.
name = getattr(soup.select_one('title'), 'text', 'N/A')
timing = getattr(soup.select_one('tbody > tr:last-child td:nth-of-type(2)'), 'text', 'N/A')
Add the items to a list or other data structure, which can be output as a dataframe at the end and written out as CSV. Or you may wish to stick with your current method.
Python:
import requests
from bs4 import BeautifulSoup as bs

urls = ['https://racevietnam.com/runner/buiducninh/ecopark-marathon-2019', 'https://racevietnam.com/runner/drtungnguyen83/ecopark-marathon-2019']
with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        name = soup.select_one('title').text
        timing = soup.select_one('tbody > tr:last-child td:nth-of-type(2)').text
        print(name, timing)
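
The getattr fallback in this answer works because select_one returns None when nothing matches, and getattr(None, 'text', 'N/A') yields the default instead of raising AttributeError. A small sketch of that behaviour, using a hypothetical FakeTag class in place of a real bs4 tag:

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 tag with a .text attribute."""

    def __init__(self, text):
        self.text = text


def text_or_default(node, default="N/A"):
    # node is either a tag-like object or None (no match found).
    return getattr(node, "text", default)


print(text_or_default(FakeTag("09:51:07 AM")))
print(text_or_default(None))
```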

This is my code.
It works OK.
import requests
from bs4 import BeautifulSoup

f = open("input.ecopark","r")
f_content = f.readlines()
f.close()
for url in f_content:
    r = requests.get(url.rstrip())
    soup = BeautifulSoup(r.text, 'html.parser')
    result = soup.select("table tbody tr td")
    x = ""
    for i in result:
        if not x:
            if i.get_text() == "Finish":
                x = 1
                continue
        if x:
            print(url.rstrip()+ " "+i.get_text())
            break
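
The flag variable in the loop above can be avoided by pairing each cell with the one that follows it. A minimal sketch, assuming the cell texts have already been collected (for example with [td.get_text() for td in result]); value_after and the sample cells are illustrative only:

```python
def value_after(cells, label="Finish"):
    """Return the text of the cell that follows the first cell equal to label."""
    # zip pairs every cell with its right-hand neighbour.
    for current, following in zip(cells, cells[1:]):
        if current.strip() == label:
            return following.strip()
    return None  # label missing, or it was the very last cell


# Made-up cell texts standing in for one runner's table.
cells = ["Start", "07:00:00 AM", "Finish", "09:51:07 AM"]
print(value_after(cells))
```

Returning None when the label is absent lets the caller decide how to report pages without a "Finish" row.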

How to get ASINs XPATH from 2 different Amazon pages that have the same parent nodes?

I made a web scraping program using Python and WebDriver, and I want to extract the ASIN from 2 different pages. I would like one XPath to work for both of these links.
These are the Amazon pages: https://www.amazon.com/Hydro-Flask-Wide-Mouth-Flip/dp/B01ACATW7E/ref=sr_1_3?s=kitchen&ie=UTF8&qid=1520348607&sr=1-3&keywords=-gfds and
https://www.amazon.com/Ubbi-Saving-Special-Required-Locking/dp/B00821FLSU/ref=sr_1_1?s=baby-products&ie=UTF8&qid=1520265799&sr=1-1&keywords=-hgfd&th=1. They have the same parent nodes (id, classes). How can I make this program work for these 2 links at the same time?
The problem is on these lines of code: 36, 41
asin = driver.find_element_by_xpath('//div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[4]').text
and
asin = driver.find_element_by_xpath('//div[@id="detail-bullets_feature_div"]/div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[5]').text
I have to change these lines so that the CSV contains the ASINs for both products. For the first link it prints the wrong information, and for the second it prints the ASIN.
I attached the code. I will appreciate any help.
from selenium import webdriver
import csv
import io

# set the proxies to hide actual IP
proxies = {
    'http': 'http://5.189.133.231:80',
    'https': 'https://27.111.43.178:8080'
}
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))
driver = webdriver.Chrome(executable_path="C:\\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                          chrome_options=chrome_options)
header = ['Product title', 'ASIN']
with open('csv/bot_1.csv', "w") as output:
    writer = csv.writer(output)
    writer.writerow(header)
links=['https://www.amazon.com/Hydro-Flask-Wide-Mouth-Flip/dp/B01ACATW7E/ref=sr_1_3?s=kitchen&ie=UTF8&qid=1520348607&sr=1-3&keywords=-gfds',
       'https://www.amazon.com/Ubbi-Saving-Special-Required-Locking/dp/B00821FLSU/ref=sr_1_1?s=baby-products&ie=UTF8&qid=1520265799&sr=1-1&keywords=-hgfd&th=1'
]
for i in range(len(links)):
    driver.get(links[i])
    product_title = driver.find_elements_by_xpath('//*[@id="productTitle"][1]')
    prod_title = [x.text for x in product_title]
    try:
        asin = driver.find_element_by_xpath('//div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[4]').text
    except:
        print('no ASIN template one')
    try:
        asin = driver.find_element_by_xpath('//div[@id="detail-bullets_feature_div"]/div[@id="detail-bullets"]/table/tbody/tr/td/div/ul/li[5]').text
    except:
        print('no ASIN template two')
    try:
        data = [prod_title[0], asin]
    except:
        print('no items v3 ')
    with io.open('csv/bot_1.csv', "a", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(data)

You can simply use
//li[b="ASIN:"]
to get the required element on both pages.
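
To illustrate what that expression matches: it selects any li element that has a b child whose text is exactly "ASIN:", regardless of the li's position in the list, which is why it does not depend on li[4] versus li[5]. The sketch below evaluates the same predicate with the standard library's xml.etree.ElementTree, which supports this subset of XPath, on a simplified, made-up fragment (not Amazon's actual markup). find_asin is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, invented fragment; the real page structure may differ.
SNIPPET = """
<div id="detail-bullets">
  <ul>
    <li><b>Shipping Weight:</b> 1 pound</li>
    <li><b>ASIN:</b> B01ACATW7E</li>
  </ul>
</div>
"""


def find_asin(markup):
    """Return the text following <b>ASIN:</b>, or None if it is absent."""
    root = ET.fromstring(markup.strip())
    # Match by label text instead of by position in the list.
    li = root.find(".//li[b='ASIN:']")
    if li is None:
        return None
    tail = li.find("b").tail or ""
    return tail.strip()


print(find_asin(SNIPPET))
```

In Selenium the expression would be passed to the driver's XPath lookup instead, and the element's text would include the "ASIN:" label, which would need to be stripped.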