How can I extract a date from a site using Beautiful Soup? - beautifulsoup

I am trying to extract the date from this article, for example: https://www.ynet.co.il/articles/0,7340,L-5665851,00.html#autoplay
As you can see, it appears here:
The problem is that I don't know how to extract it, since it's plain text and not an attribute like datetime. Can someone help me?

You can do it using beautifulsoup and json:
import json
from bs4 import BeautifulSoup as bs
import requests
url = "https://www.ynet.co.il/articles/0,7340,L-5665851,00.html"
resp = requests.get(url)
soup = bs(resp.text,'lxml')
#soup receives the response and parses it
data = json.loads(soup.find('script', type='application/ld+json').text)
#the target is contained inside a script tag; soup now extracts the script and python converts it to text; the converted string is in json format; json.loads() loads it into a variable
print(data['datePublished']) # you can access the info in the variable using the key names (datePublished, in this case)
Or you can do it with lxml:
import lxml.html
doc = lxml.html.fromstring(resp.text)
targets = doc.xpath("//script[@type='application/ld+json']")
data = json.loads(targets[0].text)
print(data['datePublished'])
Output (in both cases):
2020-01-25T12:47:27z
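The same JSON-LD extraction can be sketched offline against a static snippet; the HTML below is a made-up stand-in for the article page, not the real ynet markup:

```python
import json
from bs4 import BeautifulSoup

# Made-up stand-in for the article page: a script tag holding JSON-LD metadata.
html = '''<html><head>
<script type="application/ld+json">
{"@type": "NewsArticle", "datePublished": "2020-01-25T12:47:27Z"}
</script>
</head></html>'''

soup = BeautifulSoup(html, 'html.parser')
# .string gives the raw JSON text inside the script tag; json.loads parses it.
data = json.loads(soup.find('script', type='application/ld+json').string)
print(data['datePublished'])  # 2020-01-25T12:47:27Z
```

The same pattern works for any page that embeds schema.org metadata in a ld+json script tag.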

Related

Web scraping using python throws empty array

import requests
from bs4 import BeautifulSoup as soup
my_url='http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty'
page=requests.get(my_url)
data=page.text
page_soup=soup(data,'html.parser')
cont=page_soup.select("div",{"class": "item-page"})
print(cont)
I am trying to scrape the faculty details (name, designation, profile) into a CSV file.
When I use the above code it returns an empty [].
Any help greatly appreciated.
The page is looking for any of a defined set of valid user agents. For example,
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty', headers = {'User-Agent': 'Chrome/80.0.3987.163'})
soup = bs(r.content, 'lxml')
print(soup.select('.item-page'))
Without that, you get a 406 response and the classes you are looking for are not present in the HTML.
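As a quick offline sanity check of what the header looks like, here is the standard-library equivalent of attaching that User-Agent, building the request object without actually sending it:

```python
from urllib.request import Request

# Build the request with a custom User-Agent, without sending it.
req = Request('http://cvr.ac.in/home4/index.php/eee-sp-870859316/eeefaculty',
              headers={'User-Agent': 'Chrome/80.0.3987.163'})

# urllib normalizes header names with .capitalize(), hence 'User-agent'.
print(req.get_header('User-agent'))  # Chrome/80.0.3987.163
```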

Using BS, I cannot "find" the ID of info, when I know it exists

I am a new user to Beautiful Soup and am trying to create a baby application that retrieves the view count from a YouTube url.
I looked at the BS docs and saw that you can retrieve items by their id, so I attempted to retrieve the info id. But whenever I try, it comes out as "None", so it must not be finding the id.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
divisions = soup.findAll("div")
print(divisions[0])
info = soup.find(id="info")
print(info)
You can try searching for the meta itemprop="interactionCount" tag, but this value is often not exact. The best way is to use the official YouTube API; failing that, you can read it from the page:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/watch?v=bUPvE5yv72I'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(soup.select_one('meta[itemprop="interactionCount"][content]')['content'])
Prints:
165011
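The selector itself can be verified against a static snippet; the fragment below is a made-up stand-in mimicking the meta tag on the watch page:

```python
from bs4 import BeautifulSoup

# Made-up fragment standing in for the YouTube watch page's metadata.
html = '<html><head><meta itemprop="interactionCount" content="165011"></head></html>'
soup = BeautifulSoup(html, 'html.parser')

# The [content] part of the selector requires the attribute to be present.
print(soup.select_one('meta[itemprop="interactionCount"][content]')['content'])  # 165011
```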

Unable to parse the href tag in Python

I get the following output from Beautiful Soup:
[Search over 301,944 datasets\n]
I need to extract only the number 301,944 from this. Please guide me on how this can be done. My code so far:
import requests
import re
from bs4 import BeautifulSoup
source = requests.get('https://www.data.gov/').text
soup = BeautifulSoup (source , 'lxml')
#print soup.prettify()
images = soup.find_all('small')
print images
con = images.find_all('a')  # I am unable to get the anchor tag here. It says anchor tag not present
print con
#for con in images.find_all('a',href=True):
#print con
#content = images.split('metrics')
#print content[1]
#images = soup.find_all('a', {'href':re.compile('\d+')})
#print images
There is only one <small> tag on the website.
Your images variable references it, but you are using it in the wrong way to retrieve the anchor tag.
If you want to retrieve the text from the tag, you can get it with:
soup.find('small').a.text
where the find method returns the first small element it encounters on the website. If you use find_all, you get a list of all small elements (but there is only one small tag here).
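Once you have the text, the number itself can be pulled out with a regular expression; a minimal sketch using the sample text from the question:

```python
import re

text = 'Search over 301,944 datasets\n'
# Match a digit followed by any run of digits and commas (the dataset count).
match = re.search(r'\d[\d,]*', text)
print(match.group())  # 301,944
```

If you need it as an integer, `int(match.group().replace(',', ''))` removes the thousands separators.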

BeautifulSoup findAll() not finding all, regardless of which parser I use

So I've read through all the questions about findAll() not working that I can find, and the answer always seems to be an issue with the particular html parser. I have run the following code using the default 'html.parser' along with 'lxml' and 'html5lib' yet I can only find one instance when I should be finding 14.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://robertsspaceindustries.com/pledge/ships'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features = "lxml")
containers = page_soup.findAll("ul", {"class":"ships-listing"})
len(containers)
I tried posting a picture of the HTML code, but I don't have enough reputation. Here is a link to the image (https://imgur.com/a/cOLO8GO).
When you download a page through urllib (or the requests HTTP library), you get the original HTML source file.
Initially there is only a single tag with the class name 'ships-listing', because that tag comes with the source page. Once you scroll down, the page generates additional <ul class='ships-listing'> elements; these are generated by JavaScript.
So when you download the page using urllib, the downloaded content contains only the original source page (you can see it with the view-source option in the browser).

Beautiful Soup NoneType error

I am trying to write a small script in Python to help me through some of the more tedious parts of my job. I wrote this:
from bs4 import BeautifulSoup
import lxml
import os
import sys
questid = "1478"
soup = BeautifulSoup(open("Data/xmls/quests.xml"), "lxml")
for quest in soup.find_all('quest', {"id": questid}):
    print(quest)
    questSoup = BeautifulSoup(quest)
    for floor in questSoup.find_all('location_id'):
        print(floor)
What this is supposed to do is get a part of a huge XML called "quests", based on the quest tag and its id attribute. Then it is supposed to make a new soup from that part and get all the location_id tags from within it. For now, before I figure out which quest ids I want to choose (and how I will handle input), I just hardcoded one ("1478").
The script so far prints the quest, but fails to create a new soup from it.
Is it possible that the quest variable is not a string? Or am I doing something wrong?
for quest in soup.find_all('quest', {"id": questid}):
    print(quest)
    # questSoup = BeautifulSoup(quest)
    for floor in quest.find_all('location_id'):
        print(floor)
No need to build a new soup object from the tag object; you can call find_all on both, since the soup and its tags expose the same search methods and can be accessed in the same way.
In fact, the soup object is just a special Tag object, named [document]:
import requests, bs4
r = requests.get('http://www.google.com')
soup = bs4.BeautifulSoup(r.text, 'lxml')
soup.name
out:
'[document]'
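The point about find_all working directly on a Tag can be sketched with a tiny stand-in for the quests XML; the structure below is assumed, based only on the tag names in the question:

```python
from bs4 import BeautifulSoup

# Assumed miniature version of the quests XML from the question.
xml = ('<quests><quest id="1478">'
       '<location_id>7</location_id><location_id>9</location_id>'
       '</quest></quests>')

soup = BeautifulSoup(xml, 'html.parser')
quest = soup.find('quest', {'id': '1478'})

# find_all works on the tag itself; no second soup is needed.
floors = [floor.text for floor in quest.find_all('location_id')]
print(floors)  # ['7', '9']
```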