Web scraping using python throws empty array

import requests
from bs4 import BeautifulSoup as soup
my_url='http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty'
page=requests.get(my_url)
data=page.text
page_soup=soup(data,'html.parser')
cont=page_soup.select("div",{"class": "item-page"})
print(cont)
I am trying to scrape the faculty details (name, designation, profile) into a CSV file.
When I run the code above, it prints an empty list, [].
Any help is greatly appreciated.

The page checks for one of a defined set of valid user agents. For example:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty', headers = {'User-Agent': 'Chrome/80.0.3987.163'})
soup = bs(r.content, 'lxml')
print(soup.select('.item-page'))
Without that header, you get a 406 response, and the classes you are looking for are not present in the returned HTML.
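Separately from the user agent, the two ways of matching a class are easy to mix up: select() takes a CSS selector string, while the tag name plus attribute dictionary form used in the question belongs to find_all(). The following is a minimal sketch run against a made-up HTML snippet rather than the live page; the sample markup and the helper name find_item_pages are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the page body; the real page's markup may differ.
SAMPLE_HTML = """
<div class="header">Department</div>
<div class="item-page"><p>Faculty list goes here</p></div>
"""


def find_item_pages(html):
    """Return the <div class="item-page"> matches found two equivalent ways."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector form: tag name and class in a single string.
    by_css = soup.select("div.item-page")
    # find_all form: tag name plus an attribute dictionary.
    by_attrs = soup.find_all("div", {"class": "item-page"})
    return by_css, by_attrs


by_css, by_attrs = find_item_pages(SAMPLE_HTML)
print(len(by_css), len(by_attrs))
print(by_css[0].get_text(strip=True))
```

When fetching the real page with requests, calling r.raise_for_status() right after requests.get(...) makes a 406 surface as an exception instead of a silently empty result.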

Related

Scrape wikipedia table using BeautifulSoup

I would like to scrape the table titled "List of chemical elements" from the Wikipedia link below and display it using pandas:
https://en.wikipedia.org/wiki/List_of_chemical_elements
I am new to BeautifulSoup, and this is what I have so far.
from bs4 import BeautifulSoup
import requests as r
import pandas as pd
response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
wiki_text = response.text
soup = BeautifulSoup(wiki_text, 'html.parser')
table_soup = soup.find_all('table')

You can select the table with beautifulsoup in different ways:
By its "title":
soup.select_one('table:-soup-contains("List of chemical elements")')
By order in tree (it is the first one):
soup.select_one('table')
soup.select('table')[0]
By its class (there is no id in your case):
soup.select_one('table.wikitable')
Or simply with pandas
pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements')[0]
*To get the expected result, try it yourself and if you have difficulties, ask a new question.
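As a starting point, here is a rough sketch of turning a selected table into rows. It runs on a made-up two-row sample, not the real Wikipedia page, and it assumes a single header row with no merged cells, which the real table may not satisfy. The helper name table_to_records and the sample column names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample; the real Wikipedia table has more columns and rows.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Atomic number</th><th>Symbol</th><th>Name</th></tr>
  <tr><td>1</td><td>H</td><td>Hydrogen</td></tr>
  <tr><td>2</td><td>He</td><td>Helium</td></tr>
</table>
"""


def table_to_records(html):
    """Convert the first table.wikitable into a list of dicts, one per row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.wikitable")
    rows = table.select("tr")
    # Assumption: the first row holds the column headers.
    headers = [th.get_text(strip=True) for th in rows[0].select("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.select("td")]
        if cells:  # skip rows that contain no data cells
            records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(SAMPLE_HTML)
print(records)
```

A list of dictionaries like this can be passed directly to pd.DataFrame(records) for display.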

Using BS, I cannot "find" the ID of info, when I know it exists

I am a new user to Beautiful Soup and am trying to create a baby application that retrieves the view count from a YouTube url.
So, I looked at the BS docs and I saw that you could retrieve items by their id. So I attempted to retrieve the info id - but whenever I attempt to do this, it comes out as "None", so it must not be finding the id.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
divisions = soup.findAll("div")
print(divisions[0])
info = soup.find(id="info")
print(info)

You can try searching for the meta tag with itemprop="interactionCount", but this value is often not exact; the best approach is to use the official YouTube API. To read the meta tag:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.select_one('meta[itemprop="interactionCount"][content]')['content'])
Prints:
165011

BeautifulSoup findAll() not finding all, regardless of which parser I use

So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular html parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib' yet I can only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).

When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.
Initially there is only a single tag with the class name 'ships-listing', because that tag comes with the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements, and these are created by JavaScript.
So when you download the page using urllib, the downloaded content contains only the original source (you can see it with the view-source option in the browser).
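The effect can be illustrated offline. In this sketch the page source is a made-up string with one list in the HTML and a script that would add another list in a browser; the real site's markup and scripts differ. BeautifulSoup only parses the text it is given and never runs the script, so it sees one list.

```python
from bs4 import BeautifulSoup

# Made-up page source: one list in the HTML plus a script that would add
# another list when run in a browser.
SAMPLE_SOURCE = """
<html><body>
<ul class="ships-listing"><li>Ship A</li></ul>
<script>
  // In a browser this would append a second list to the page.
  var ul = document.createElement("ul");
  ul.className = "ships-listing";
  document.body.appendChild(ul);
</script>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_SOURCE, "html.parser")
# Only elements present in the downloaded source are found.
containers = soup.find_all("ul", {"class": "ships-listing"})
print(len(containers))  # prints 1
```

To get the elements that a script adds, the page has to be rendered by something that executes JavaScript, such as a real browser driven by Selenium.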

Web Scraping with Beautiful Soup in Python - JavaScript Table

I'm trying to scrape a table from a website, but I can't seem to figure it out with BeautifulSoup in Python. I'm not sure if it's because of the table format, but I basically want to turn this table into a CSV.
from bs4 import BeautifulSoup
import requests
page = requests.get("https://spotwx.com/products/grib_index.php?model=hrrr_wrfprsf&lat=41.03399&lon=-73.76291&tz=America/New_York&display=table")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Any advice on how to isolate this data table? I've checked many BeautifulSoup tutorials, but the HTML looks different from most references. Many thanks in advance for your help.

Try this. The table on that site is generated dynamically, so you can't get it using requests alone.
from selenium import webdriver
from bs4 import BeautifulSoup
import csv
link = "https://spotwx.com/products/grib_index.php?model=hrrr_wrfprsf&lat=41.03399&lon=-73.76291&tz=America/New_York&display=table"
with open("spotwx.csv", "w", newline='') as infile:
    writer = csv.writer(infile)
    writer.writerow(['DateTime','Tmp','Dpt','Rh','Wh','Wd','Wg','Apcp','Slp'])
    with webdriver.Chrome() as driver:
        driver.get(link)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        for item in soup.select("table#example tbody tr"):
            data = [elem.text for elem in item.select('td')]
            print(data)
            writer.writerow(data)

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    questSoup = BeautifulSoup(quest)
    for floor in questSoup.find_all('location_id'):
        print(floor)
What this is supposed to do is to get a part of a huge xml called "quests", based on tag - and its attribute - "id". Then it is supposed to make a new soup from that part and get all the tags from within the . For now, before I figure out which quest ids I want to choose (and how will I handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?

for quest in soup.find_all('quest', {"id":questid}):
    print(quest)
    # questSoup = BeautifulSoup(quest)
    for floor in quest.find_all('location_id'):
        print(floor)
There is no need to build a new soup object from a Tag object. You can call find_all on either one: BeautifulSoup is a subclass of Tag, so both behave the same way and can be searched the same way.
In effect, the soup object is a special Tag object whose name is '[document]':
import requests, bs4
r =requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
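The same idea can be tried without the original file. This is a small sketch on a made-up stand-in for quests.xml (the real file's structure is not shown in the question), parsed with 'html.parser' only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up stand-in for quests.xml; the real file's structure is not shown.
SAMPLE_XML = """
<quests>
  <quest id="1478"><location_id>12</location_id><location_id>13</location_id></quest>
  <quest id="2000"><location_id>99</location_id></quest>
</quests>
"""

soup = BeautifulSoup(SAMPLE_XML, "html.parser")

# The soup object is itself a Tag, so both offer the same search methods.
print(isinstance(soup, Tag))

floors = []
for quest in soup.find_all("quest", {"id": "1478"}):
    # Search inside the matched tag directly; no second soup is needed.
    for floor in quest.find_all("location_id"):
        floors.append(floor.get_text())

print(floors)
```

Only the location_id values inside the matching quest are collected; the second quest is never visited.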
