How parsing works - beautifulsoup

I am trying the sample code for the piracy report.
The line of code:
for incident in soup('td', width="90%"):
searches the soup for a td element with the attribute width="90%", correct? It invokes the __init__ method of the BeautifulStoneSoup class, which eventually invokes SGMLParser.__init__(self).
Am I correct with the class flow above?
The soup looks like this in the report now:
<td class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" ><p>22.09.2010: 0236 UTC: Posn: 03:49.9N – 006:54.6E: Off Bonny River: Nigeria.<p/>
<p>About 21 armed pirates in three crafts boarded a pipe layer crane vessel undertow. All crew locked themselves in accommodations. Pirates were able to take one crewmember as hostage. Master called Nigerian naval vessel in vicinity. Later pirates released the crew and left the vessel. All crew safe.<p/></td>
There is no width markup in the text. I changed the line of code that is searching:
for incident in soup('td', class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"):
It appears that class is a reserved word, maybe?
How do I get the current example code to run, and has more changed in the application than just the HTML output?
The URL I am using:
urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")

There must be a better way....
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
soup = BeautifulSoup(page)
soup.find("table",{"class" : "fabrikTable"})
list1 = soup.table.findAll('p', limit=50)
i = 0
imax = 0
for item in list1 :
    imax = imax + 1
while i < imax:
    Itime = list1[i]
    i = i + 2
    Incident = list1[i]
    i = i + 1
    Inext = list1[i]
    print "Time ", Itime
    print "Incident", Incident
    print " "
    i = i + 1

class is a reserved word in Python, so it cannot be passed as a keyword argument to that method.
Passing the attribute in a dictionary works, but find returns only the first match, not the list:
soup.find("tr", { "class" : "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" })
And I confirmed the class flow for the parse.
The example will run, but the HTML must be parsed with different methods because the width='90%' is no longer in the HTML.
Still working on the proper methods; will post back when I get it working.
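A minimal sketch of the class-based search, assuming the newer bs4 package on Python 3 rather than the BeautifulSoup 3 import used above. The sample HTML is a simplified copy of the cell quoted in the question, with well-formed closing </p> tags. The live page used <p/>, so the paragraph boundaries you get there may differ depending on the parser. The helper name extract_incidents is hypothetical.

```python
from bs4 import BeautifulSoup

NARRATION_CLASS = "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"

# Simplified, well-formed stand-in for the table cell shown in the question.
SAMPLE_HTML = (
    '<table class="fabrikTable"><tr>'
    '<td class="' + NARRATION_CLASS + '">'
    "<p>22.09.2010: 0236 UTC: Posn: 03:49.9N – 006:54.6E: Off Bonny River: Nigeria.</p>"
    "<p>About 21 armed pirates in three crafts boarded a pipe layer crane vessel undertow.</p>"
    "</td></tr></table>"
)


def extract_incidents(html):
    """Return (time/position, narration) pairs, one per matching td cell."""
    soup = BeautifulSoup(html, "html.parser")
    incidents = []
    # 'class' is a Python keyword, so pass it through attrs (or use class_=...).
    for cell in soup.find_all("td", attrs={"class": NARRATION_CLASS}):
        texts = [p.get_text(strip=True) for p in cell.find_all("p")]
        texts = [t for t in texts if t]  # skip empty paragraphs
        if len(texts) >= 2:
            incidents.append((texts[0], texts[1]))
    return incidents


incidents = extract_incidents(SAMPLE_HTML)
for when_where, narration in incidents:
    print("Time    ", when_where)
    print("Incident", narration)
```

In bs4, soup.find_all("td", class_=NARRATION_CLASS) is an equivalent spelling. find_all returns every match, whereas find stops at the first one.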

Related

Can I use BeautifulSoup to find elements hidden by other wrapped elements?

I would like to extract the text data of the author affiliations on this page using Beautiful soup.
I know of a workaround that uses Selenium to click the 'show more' link and then scan the page again. I'm not sure what kind of elements these are (hidden, perhaps?), as they only appear in the inspector after clicking the button.
Is there a way to extract this info just using beautiful soup or do I need selenium or something equivalent to reveal the elements in the HTML code?
from bs4 import BeautifulSoup
import requests
url = 'https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596'
r = requests.get(url)
sp = BeautifulSoup(r.content, 'html.parser')
author_data = sp.find('div', id='author-group')
affiliations = author_data.find('dl', class_='affiliation').text
print(affiliations)
That info is within a script tag, though you need to map the affiliation letters to the actual affiliations. The code below extracts the JavaScript object housing the info you want and handles it with the json library.
There is then a series of steps to dynamically determine which indices hold the info of interest and then use a constructed mapping of the letters to affiliations to assign the correct affiliation to each author.
The author first and last names are also dynamically ascertained and joined together with a space.
The intention was to avoid hardcoding indices which might change over time.
import re
import json
import requests
r = requests.get('https://www.sciencedirect.com/science/article/abs/pii/S0920379621007596',
                 headers={'User-Agent': 'Mozilla/5.0'})
data = json.loads(re.search(r'(\{"abstracts".*})', r.text).group(1))
base = [i for i in data['authors']['content']
        if i.get('#name') == 'author-group'][0]['$$']
affiliation_data = [i for i in base if i['#name'] == 'affiliation']
author_data = [i for i in base if i['#name'] == 'author']
name_info = [i['_'] for author in author_data for i in author['$$']
             if i['#name'] in ['given-name', 'surname']]
affiliations = dict(zip([j['_'] for i in affiliation_data for j in i['$$'] if j['#name'] == 'label'], [
    j['_'] for i in affiliation_data for j in i['$$'] if isinstance(j, dict) and '_' in j and j['_'][0].isupper()]))
# print(affiliations)
author_affiliations = dict(zip([' '.join([i[0], i[1]]) for i in zip(name_info[0::2], name_info[1::2])], [
    affiliations[j['_']] for author in author_data for i in author['$$'] if i['#name'] == 'cross-ref' for j in i['$$'] if j['_'] != '⁎']))
print(author_affiliations)
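The same idea can be tried offline, step by step. This is only a sketch: the payload below is hypothetical and merely imitates the '#name', '$$' and '_' keys the answer navigates. The 'textfn' and 'sup' node names, the author name and the helper functions are assumptions for illustration, not the site's confirmed structure.

```python
import json
import re

# Hypothetical payload imitating the key layout used in the answer above.
payload = {
    "abstracts": {},
    "authors": {"content": [{"#name": "author-group", "$$": [
        {"#name": "author", "$$": [
            {"#name": "given-name", "_": "Ada"},
            {"#name": "surname", "_": "Lovelace"},
            {"#name": "cross-ref", "$$": [{"#name": "sup", "_": "a"}]},
        ]},
        {"#name": "affiliation", "$$": [
            {"#name": "label", "_": "a"},
            {"#name": "textfn", "_": "Example University"},
        ]},
    ]}]},
}
SAMPLE_PAGE = "<html><script>" + json.dumps(payload) + "</script></html>"


def extract_embedded_json(page_text):
    """Pull the JSON object that starts with the "abstracts" key out of the page."""
    match = re.search(r'\{"abstracts".*\}', page_text, re.DOTALL)
    if match is None:
        raise ValueError("embedded JSON object not found")
    return json.loads(match.group(0))


def children(node, name):
    """Child dicts of node whose '#name' equals name."""
    return [c for c in node.get("$$", [])
            if isinstance(c, dict) and c.get("#name") == name]


def map_author_affiliations(data):
    group = next(g for g in data["authors"]["content"]
                 if g.get("#name") == "author-group")
    # Build the label -> affiliation text mapping first.
    by_label = {}
    for aff in children(group, "affiliation"):
        labels, texts = children(aff, "label"), children(aff, "textfn")
        if labels and texts:
            by_label[labels[0]["_"]] = texts[0]["_"]
    # Then resolve each author's cross-reference labels through that mapping.
    result = {}
    for author in children(group, "author"):
        parts = children(author, "given-name") + children(author, "surname")
        name = " ".join(p["_"] for p in parts)
        labels = [s["_"] for ref in children(author, "cross-ref")
                  for s in children(ref, "sup")]
        result[name] = [by_label[l] for l in labels if l in by_label]
    return result


author_map = map_author_affiliations(extract_embedded_json(SAMPLE_PAGE))
print(author_map)
```

Looking labels up in a dictionary, instead of relying on list positions, keeps the result stable if the order of authors or affiliations changes.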

BeautifulSoup.find_all does not return the class shown by "Inspect" under "div"

Goal: Query all the job records as a collection as part of job website scraping.
Steps: There are 100 job records and using "Inspect" in Google Chrome, it shows up as follows, when "Inspect"ing a single job record.
<div class="coveo-list-layout CoveoResult">
<div class="coveo-result-frame item-wrap">
<div class="content-main">
<div class="coveo-result-cell content-wrap">
Problem: The following code does not return a count of 100; it returns 0. I tried every class listed above in find_all, but none of them returns 100 records. A snip of the "Inspect" output for a single job record was attached to show the classes associated with it.
response = requests.get(url)
print(response)
<Response [200]>
response.reason
'OK'
soup = BeautifulSoup(response.text, 'html.parser')
cards = soup.find_all('div','content-list-layout CoveoResult')
len(cards)
0
cards = soup.find_all('div')
len(cards)
86
Code tried, as follows; none of these work:
cards = soup.find_all('div','content-list-layout CoveoResult')
cards = soup.find_all('div','content-list-layout')
cards = soup.find_all('div','coveo-result-frame item-wrap')
cards = soup.find_all('div','coveo-result-frame')
cards = soup.find_all('div','content-main')
cards = soup.find_all('div','coveo-result-cell content-wrap')
cards = soup.find_all('div','coveo-result-cell')
Next Steps: Need help with finding the class associated with a single record. As a debug I have generated the output of "cards = soup.
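One part of this can be checked offline: how find_all matches class names against the markup that Inspect shows. The sketch below uses only the four div elements quoted above (the "Job record" text is a placeholder). Note that the attempts search for content-list-layout while Inspect shows coveo-list-layout. If correctly spelled class names still return nothing on the live page, the records are probably not present in the HTML that requests receives (for example, if they are added later by JavaScript). That is an assumption the question does not confirm.

```python
from bs4 import BeautifulSoup

# The nesting quoted from Inspect, closed so that it parses as a fragment.
SAMPLE_HTML = """
<div class="coveo-list-layout CoveoResult">
  <div class="coveo-result-frame item-wrap">
    <div class="content-main">
      <div class="coveo-result-cell content-wrap">Job record</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A single class name matches any element that carries that class.
single = soup.find_all("div", "coveo-result-frame")
# A multi-class string must equal the class attribute exactly, order included.
exact = soup.find_all("div", "coveo-list-layout CoveoResult")
# The spelling used in the attempts does not occur in the inspected markup.
misspelled = soup.find_all("div", "content-list-layout CoveoResult")
# A CSS selector matches both classes regardless of their order.
css = soup.select("div.CoveoResult.coveo-list-layout")

print(len(single), len(exact), len(misspelled), len(css))
```

Searching response.text for the string "CoveoResult" is a quick way to tell whether the downloaded HTML contains the records at all.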

Why text function of xpath doesn't show any data on scrapy selenium?

I am trying to scrape a website with scrapy-selenium. I am facing two problems:
I applied the XPath in the Chrome developer tools and it found all the elements, but when the code runs it returns only one Selector object.
The text() function of the XPath expression returns None.
This is the URL I am trying to scrape: http://www.atab.org.bd/Member/Dhaka_Zone
Here is a screenshot of inspector tool:
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys
class AtabDSpider(scrapy.Spider):
    name = 'atab_d'

    def start_requests(self):
        yield SeleniumRequest(
            url = "https://www.atab.org.bd/Member/Dhaka_Zone",
            #url = "https://www.bit2lead.com",
            #wait_time = 15,
            wait_time = 3,
            callback = self.parse
        )

    def parse(self, response):
        companies = response.xpath("//ul[@class='row']/li")
        print("Numbers Of Iterable Item: " + str(len(companies)))
        for company in companies:
            yield {
                "company": company.xpath(".//div[@class='card']/div[1]/div/a/h3[@data-bind='text: NameOfOrganization']/text()").get()
                #also tried
                #"company": company.xpath(".//div[@class='card']/div[1]/div/a/h3/text()").get()
            }
Here is a screenshot of my terminal:
This is the URL I practiced on before ( https://www.algoslab.com ); that worked well, although the site is simple enough.
Why not request the data directly, as follows, and get everything in one go:
import requests
link = 'http://123.253.36.205:8051/API/Member/GetMembersList?searchString=&zone=0&blacklisted=false'
r = requests.get(link)
for item in r.json():
    _name = item['NameOfOrganization']
    phone = item['Phone']
    print(_name,phone)
The output looks like this (it should produce 3160 lines of results):
Aqib Travels & Tours Ltd. +88-029101468, 58151369
4S Tours & Travels Ltd 8954750
5M Logistics And Tours Ltd +880 2 48810030
The XPath you want could be simplified to //h3[@data-bind='text: NameOfOrganization'] to select the element and then view its text.
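If the endpoint's JSON is processed in a spider or script, a slightly more defensive loop may help. A minimal sketch, assuming the response is a list of objects with the NameOfOrganization and Phone keys used above: the first three sample records are rebuilt from the output shown, the fourth is hypothetical, and members_from_json is a made-up helper name.

```python
import json

# Stand-in for r.text: three records taken from the output above plus one
# hypothetical record without a Phone field.
SAMPLE_RESPONSE = json.dumps([
    {"NameOfOrganization": "Aqib Travels & Tours Ltd.", "Phone": "+88-029101468, 58151369"},
    {"NameOfOrganization": "4S Tours & Travels Ltd", "Phone": "8954750"},
    {"NameOfOrganization": "5M Logistics And Tours Ltd", "Phone": "+880 2 48810030"},
    {"NameOfOrganization": "Hypothetical Travel Co"},
])


def members_from_json(text):
    """Turn the JSON list into dicts, tolerating missing or null fields."""
    rows = []
    for item in json.loads(text):
        name = (item.get("NameOfOrganization") or "").strip()
        phone = (item.get("Phone") or "").strip()
        if name:  # skip records that have no organization name
            rows.append({"company": name, "phone": phone})
    return rows


rows = members_from_json(SAMPLE_RESPONSE)
for row in rows:
    print(row["company"], row["phone"])
```

Using item.get avoids a KeyError if some records lack a field, which direct indexing with item['Phone'] would raise.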

Web page is showing weird unicode(?) letters: \u200e

How can I remove that? I tried so many things and I am exhausted from trying to defeat this error by myself. I spent the last 3 hours looking at this and trying to get through it, and I surrender to this code. Please help.
The first "for" statement grabs article titles from news.google.com
The second "for" statement grabs the time of submission of each article on news.google.com.
This is on django btw and this page shows the list of article titles and their time of submission in a list, going down. The weird unicode letters are popping up from the second "for" statement which is the time submissions. Here is my views.py:
def articles(request):
    """ Grabs the most recent articles from the main news page """
    import bs4, requests
    list = []
    list2 = []
    url = 'https://news.google.com/'
    r = requests.get(url)
    try:
        # raise_for_status() raises HTTPError on a 4xx/5xx response
        r.raise_for_status()
    except requests.exceptions.HTTPError:
        print('Something went wrong.')
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    for (listarticles) in soup.find_all('h2', 'esc-lead-article-title'):
        if listarticles is not None:
            a = listarticles.text
            list.append(a)
    for articles_times in soup.find_all('span','al-attribution-timestamp'):
        if articles_times is not None:
            b = articles_times.text
            list2.append(b)
    list = zip(list,list2)
    context = {'list':list}
    return render(request, 'newz/articles.html', context)
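\u200e is U+200E, the LEFT-TO-RIGHT MARK, an invisible formatting character. One way to get rid of it is to strip it from each timestamp string before appending it to the list. A minimal sketch on Python 3: the sample string is made up, and strip_format_chars is a hypothetical helper name.

```python
import unicodedata


def strip_format_chars(text):
    """Drop invisible Unicode format characters (category Cf), such as U+200E."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


raw = "\u200e2 hours ago"  # made-up timestamp with a leading left-to-right mark

# Targeted: remove only the left-to-right mark.
targeted = raw.replace("\u200e", "")
# Broader: remove every format character.
cleaned = strip_format_chars(raw)

print(repr(targeted))  # '2 hours ago'
print(repr(cleaned))   # '2 hours ago'
```

The broader variant also removes characters such as the zero-width joiner, which some scripts and emoji sequences rely on, so the targeted replace is the safer choice when only U+200E is the problem.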

Jsoup simple selector code help needed please

I'm having a hard time getting the wanted info from a very simple code.
For example, I have no problem collecting my data within this simple code:
<HTML>
<TABLE>
<TABLE WIDTH=100%><TR class=FSS-data-row-highlight>
<TD> Evgeni Malkin, Pit (C/RW)</TD>
<TD class=FSS-data-right> 6 pts in last 2 GP </TD>
</TR>
</TABLE>
</TABLE>
</HTML>
What I need is the 'Evgeni Malkin 6 pts in last 2' string, which I can extract fine from that code. But when I connect to the whole page, it returns nothing. I guess it is because there are tables within tables, but I can't figure out how to proceed. Here is my code:
Document doc = Jsoup.connect("http://forecaster.thehockeynews.com/hockeynews/hockey/statistics.cgi?mlb&mode=hotnot/").get();
Elements scanYearplace = doc.select("tr.FSS-data-row-highlight td");
String yearplace = scanYearplace.text();
In fact I need to grab the info on all the other players too, but it would be a start if I could get that one to work.
Any suggestions?
Thanks in advance!
--- see update below ---
Please realize this is fragile, in that any change to the site could potentially break it. You'll also want to do some error checking and whatnot. Also, I didn't see the "6 pts in last 2 GP" text like you have above, but you can grab whatever stat you want using this code. Just change the stats.get(4) to whatever you want.
Document doc = Jsoup.connect("http://forecaster.thehockeynews.com/hockeynews/hockey/statistics.cgi?mlb&mode=hotnot/").get();
for (Element e : doc.select(".FSS-data-row")) {
    Element td = e.select("td.FSS-data-left > a").first();
    String name = (td != null ? td.text() : null);
    Elements stats = e.select(".FSS-data-right");
    // Guard the index actually used, so short rows (e.g. the header) yield null
    String goals = (stats.size() > 4 ? stats.get(4).text() : null);
    System.out.println(name + ":" + goals);
}
Sample output:
null:null
J. Benn:13
P. Sharp:20
P. Marleau:17
T. Oshie:14
The first null:null is because that is like the header row on the page.
-----UPDATE-----
The URL you had in your post pointed to the wrong page. Here is updated code to get what I think you want.
Document doc = Jsoup.connect("http://forecaster.thehockeynews.com/hockeynews/hockey/statistics.cgi?&mode=hotnot").get();
for (Element e : doc.select("tr.FSS-data-row-highlight")) {
    Element tdname = e.select("td > a").first();
    String name = (tdname != null ? tdname.text() : null);
    Element tdstat = e.select("td.FSS-data-right").first();
    // first() returns null when nothing matches, so check before calling text()
    String stat = (tdstat != null ? tdstat.text() : null);
    System.out.println(name + ":" + stat);
}
Sample output:
Mathieu Perreault:5 pts in last 2 GP 
Mikhail Grabovski:5 pts in last 2 GP 
James Neal:4 pts in last 2 GP 
Kris Versteeg:4 pts in last 2 GP 
Evgeni Malkin:12 pts in last 6 GP