I've been trying to write a generic method so that I can pass in any URL and class. That part has been successful, but now I'd like to gather data from the link text instead of the title attribute, e.g. "Xiaomi 70Mai Pro".
I've tried referencing these two sources, but I'm still unsure:
WebScraper-Sample
Parse HTML Table for URL, Place into List
import requests
from bs4 import BeautifulSoup

links = 'SampleLink... with table cell'

def getURLData(url):  # scrape data from Link
    try:
        page = requests.get(url)
        content = page.content
        soup = BeautifulSoup(content, "html.parser")
        return soup
    except Exception as e:
        print('Error.getURLData:', e)
        return None

inputLink = getURLData(links)
def tableCheck():  # if there's a table cell
    data = []
    for table_tag in inputLink.find_all('td', {'class': 'row1'}):
        topic_title = table_tag.find('a', href=True)
        if topic_title:
            datum = {'topic_title': topic_title['title']}
            data.append(datum)
    return data

print(tableCheck())
This was the output:
{'topic_title': 'This topic was started: Dec 6 2018, 12:20 PM'},
{'topic_title': 'This topic was started: Nov 19 2018, 10:30 AM'},
{'topic_title': 'This topic was started: Nov 28 2018, 09:16 PM'},
{'topic_title': 'This topic was started: Oct 3 2018, 11:10 AM'},
This is the cell I'm trying to extract data from. I've tried topic_title = table_tag.find('a', href=True).text, but I really doubt that would work. I'm still not very experienced with BeautifulSoup and I'm stuck on how to get the data; do I need another for loop to extract the data within it?
<td class="row1" valign="middle">
    <div>
        <div style="float:left">
            <a href="/topic/4667583" title="This topic was started: Oct 3 2018, 11:10 AM">Xiaomi 70Mai Pro</a>
        </div>
        <br>
    </div>
</td>
To add to the existing answer, the only modification you need to make is to add the link text to your dictionary:
topic_title = table_tag.find('a', href=True)
if topic_title:
    datum = {
        'topic_title': topic_title['title'],
        'topic_text': topic_title.text
    }
    data.append(datum)
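If the anchor text carries surrounding whitespace (as in the cell shown above, where "Xiaomi 70Mai Pro" sits between spaces), get_text(strip=True) returns it trimmed. A minimal self-contained sketch, using a snippet that mirrors the cell from the question:

from bs4 import BeautifulSoup

html = '''<td class="row1"><div><div style="float:left">
<a href="/topic/4667583" title="This topic was started: Oct 3 2018, 11:10 AM">
Xiaomi 70Mai Pro </a></div><br></div></td>'''

soup = BeautifulSoup(html, "html.parser")
for table_tag in soup.find_all('td', {'class': 'row1'}):
    topic_title = table_tag.find('a', href=True)
    if topic_title:
        print({'topic_title': topic_title['title'],
               'topic_text': topic_title.get_text(strip=True)})  # -> 'Xiaomi 70Mai Pro'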
Related
I am extracting some text (HTML) from a txt file. There are about 9 rows that need to be extracted, with the data split into 4 columns (Field_01 to Field_04), which seems to work well in the VS terminal. However, when I export the data to a CSV, two issues arise. The first is that the data that should be in the 2nd column is split between the 2nd, 3rd, and 4th columns, while the data that should be in the 3rd and 4th columns is pushed to a 5th and 6th. The second issue is that in the terminal I get all 9 rows, but only one row is exported to the CSV file.
Here is my code...
import pandas as pd
from bs4 import BeautifulSoup
import schedule
import time
#import urllib.parse
#import requests

baseurl = 'https://hippeas.com'
data = open("/run/user/759001103/gvfs/smb-share:server=192.168.0.150,share=indexserver/code.txt", "r")
info = data.readlines()
#print(info)

for items in info:
    if items.startswith(" <img src="):
        reduce_imgurl = items.split('//')[-1]
    if items.startswith(" <h3 class="):
        reduce_name = items[39:-6]
    if items.startswith(" <a href="):
        reduce_link = items[11:-32]
    if items.startswith(" <span class="):
        reduce_price = items[55:-8]
        #print(reduce_imgurl, reduce_name, baseurl + reduce_link, reduce_price)
        dataset = {'Field_01':[reduce_imgurl],'Field_02':[reduce_name],'Field_03':[baseurl + reduce_link],'Field_04':[reduce_price]}
        #print(dataset)
        df = pd.DataFrame(dataset, columns=('Field_01','Field_02','Field_03','Field_04'))
        print(df)
        df.to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.150,share=indexserver/Testcode.csv', index = False)
Here is what it shows in the terminal vs. the results: [screenshot: terminal and CSV result]
The question should focus on only one issue, so this answer addresses just the CSV problem. (Also, since the content is HTML it should really be parsed with BeautifulSoup, which is tagged but not used at all here.)
The problem is that the CSV is overwritten over and over again inside the for-loop. One solution is to decouple the extracting part from the writing part:
...
dataset = []

for items in info:
    if items.startswith(" <img src="):
        reduce_imgurl = items.split('//')[-1]
    if items.startswith(" <h3 class="):
        reduce_name = items[39:-6]
    if items.startswith(" <a href="):
        reduce_link = items[11:-32]
    if items.startswith(" <span class="):
        reduce_price = items[55:-8]
        dataset.append({'Field_01':reduce_imgurl,'Field_02':reduce_name,'Field_03':baseurl + reduce_link,'Field_04':reduce_price})

pd.DataFrame(dataset).to_csv(r'PATH_TO_FILE', index = False)
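Since the lines are HTML anyway, parsing them with BeautifulSoup instead of slicing strings by position would make the extraction far less brittle. A rough sketch of that idea; the container selector div.product-card and the file path are assumptions, not taken from the question, and would need adjusting to the real markup:

from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://hippeas.com'
html = open("code.txt").read()  # hypothetical path
soup = BeautifulSoup(html, "html.parser")

dataset = []
for card in soup.select("div.product-card"):  # assumed container class
    dataset.append({
        'Field_01': card.img['src'].split('//')[-1],
        'Field_02': card.h3.get_text(strip=True),
        'Field_03': baseurl + card.a['href'],
        'Field_04': card.span.get_text(strip=True),
    })

pd.DataFrame(dataset).to_csv('Testcode.csv', index=False)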
I have partially good HTML and I need to create hyperlinks, like:
Superotto: risorse audiovisive per superare i pregiudizi e celebrare
l’otto marzo, in “Indire Informa”, 5 marzo 2021,
https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/;
Sezione Superotto in
https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto.
Has to become:
Superotto: risorse audiovisive per superare i pregiudizi e celebrare l’otto marzo, in “Indire Informa”, 5 marzo 2021, <a href="https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/">https://www.indire.it/2021/03/05/superotto-risorse-audiovisive-per-superare-i-pregiudizi-e-celebrare-lotto-marzo/</a>; Sezione Superotto in <a href="https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto">https://piccolescuole.indire.it/iniziative/la-scuola-allo-schermo/#superotto</a>.
BeautifulSoup does not seem to find the http links reliably, so I used this regex with plain-Python findall, but I cannot substitute or compose the text. Right now I have:
links = re.findall(r"(http|ftp|https:\/\/)([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])", str(soup))
link_to_replace = []
for l in links:
    link = ''.join(l)
    if link in soup.find("body").text:
        good_link = '<a href="' + link + '">' + link + '</a>'
        fixed_text = soup.replace(link, good_link)
        soup.replace_with(fixed_text)
I tried multiple solutions for the last two lines (this is just one); none worked.
Perhaps as follows, where I first identify the relevant anchor elements and strip out any attributes other than the href, then substitute each href link in the text with the anchor's HTML:
import re
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://rivista.clionet.it/vol5/giorgi-zoppi-la-ricerca-indire-tra-uso-didattico-del-patrimonio-storico-culturale-e-promozione-delle-buone-pratiche/')
soup = bs(r.text, 'lxml')

item = soup.select_one('p:has(a[id="ft-note-16"])')
text = item.text

for tag in item.select('a:not([id])'):
    href = tag['href']
    tag.attrs = {'href': href}  # keep only the href attribute
    text = re.sub(href, str(tag), text)

# drop the leading footnote anchor text from the paragraph
text = re.sub(item.a.text, '', text).strip()
print(text)
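One caveat: re.sub treats the href as a regular expression, so URLs containing metacharacters such as ? or + can misfire. Wrapping the patterns in re.escape (a small tweak of the two substitution lines above, not part of the original answer) avoids that:

text = re.sub(re.escape(href), str(tag), text)
...
text = re.sub(re.escape(item.a.text), '', text).strip()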
I have a soup object like:
<div class="list-card__select">
<div class="list-card__item-size">
Size:
75 м² </div>
I did
soup = BeautifulSoup(text, 'lxml')
number = item.find(class_='list-card__item-size').text
print(number)
Result: 'Size: 75 м²'
How can I get just: '75'
You can do this (a generic pattern for collecting the text of matching elements):
soup = BeautifulSoup(html,"html.parser")
data = soup.findAll("span", { "class":"comments" })
numbers = [d.text for d in data]
Provided that the pattern is always identical, a simple split() can be used.
item.find(class_='list-card__item-size').text.split(' ')[1]
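For instance, applied to the string from the question (calling split() with no argument splits on any whitespace, which is slightly more robust against repeated spaces or newlines):

string = "Size: 75 м²"
print(string.split()[1])  # -> '75'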
Alternatives are regex, or inspecting other elements, JavaScript, or an API that holds this information directly.
If the number is always positive then we can also use the re package.
import re
string = "Size: 75 м²"
print( re.findall(r'\d+', string)[0] )
Output: 75
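Putting it together with the original soup, a minimal sketch (the markup is copied from the question; html.parser is used here so the snippet runs without lxml installed):

import re
from bs4 import BeautifulSoup

html = '''<div class="list-card__select">
<div class="list-card__item-size">
Size:
75 м² </div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
text = soup.find(class_='list-card__item-size').get_text(strip=True)
match = re.search(r'\d+', text)
print(match.group() if match else None)  # -> 75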
Goal: Query all the job records as a collection, as part of scraping a job website.
Steps: There are 100 job records. Using "Inspect" in Google Chrome on a single job record, it shows up as follows:
<div class="coveo-list-layout CoveoResult">
<div class="coveo-result-frame item-wrap">
<div class="content-main">
<div class="coveo-result-cell content-wrap">
Problem: The following code does not return the count as 100; it is just 0. All of the classes mentioned above were used in find_all, but none of them return 100 records. (A snip of the "Inspect" output showing the class associated with a single job record is attached.)
response = requests.get(url)
print(response)
<Response [200]>
response.reason
'OK'
soup = BeautifulSoup(response.text, 'html.parser')
cards = soup.find_all('div','content-list-layout CoveoResult')
len(cards)
0
cards = soup.find_all('div')
len(cards)
86
Code tried as follows; none of them work:
cards = soup.find_all('div','content-list-layout CoveoResult')
cards = soup.find_all('div','content-list-layout')
cards = soup.find_all('div','coveo-result-frame item-wrap')
cards = soup.find_all('div','coveo-result-frame')
cards = soup.find_all('div','content-main')
cards = soup.find_all('div','coveo-result-cell content-wrap')
cards = soup.find_all('div','coveo-result-cell')
Next Steps: Need help with finding the class associated with a single record. As a debug I have generated the output of "cards = soup.
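No answer was posted, but two things stand out. First, the inspected HTML shows the class coveo-list-layout CoveoResult, while the first attempt searches for content-list-layout CoveoResult, so that selector can never match. Second, even the correctly spelled single class names (coveo-result-frame, content-main, coveo-result-cell) returned nothing, and the whole page contains only 86 divs, which strongly suggests the Coveo result list is rendered client-side by JavaScript; in that case requests only receives the page shell, and a browser-driving tool such as Selenium, or the site's underlying search API, is needed. A sketch of the selector fix, assuming the records really are present in the static HTML (the URL is hypothetical, since the question does not give it):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/jobs'  # hypothetical URL
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Match a single class name taken from the inspected HTML, not the full
# multi-class string (and note it is 'coveo-list-layout', not 'content-list-layout').
cards = soup.find_all('div', class_='coveo-list-layout')
print(len(cards))  # will still be 0 if the records are injected by JavaScript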
I am trying the sample code for the piracy report.
The line of code:
for incident in soup('td', width="90%"):
searches the soup for a td element with the attribute width="90%", correct? It invokes the __init__ method of the BeautifulStoneSoup class, which eventually invokes SGMLParser.__init__(self).
Am I correct about the class flow above?
The soup looks like this in the report now:
<td class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" ><p>22.09.2010: 0236 UTC: Posn: 03:49.9N – 006:54.6E: Off Bonny River: Nigeria.<p/>
<p>About 21 armed pirates in three crafts boarded a pipe layer crane vessel undertow. All crew locked themselves in accommodations. Pirates were able to take one crewmember as hostage. Master called Nigerian naval vessel in vicinity. Later pirates released the crew and left the vessel. All crew safe.<p/></td>
There is no width markup in the text. I changed the line of code that is searching:
for incident in soup('td', class="fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"):
It appears that class is a reserved word, maybe?
How do I get the current example code to run, and has more changed in the application than just the HTML output?
The URL I am using:
urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
There must be a better way....
import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
soup = BeautifulSoup(page)
soup.find("table", {"class": "fabrikTable"})

list1 = soup.table.findAll('p', limit=50)

i = 0
imax = 0
for item in list1:
    imax = imax + 1

while i < imax:
    Itime = list1[i]
    i = i + 2
    Incident = list1[i]
    i = i + 1
    Inext = list1[i]
    print "Time ", Itime
    print "Incident", Incident
    print " "
    i = i + 1
class is a reserved word and will not work with that method.
This method works but does not return the list:
soup.find("tr", { "class" : "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations" })
And I confirmed the class flow for the parse.
The example will run, but the HTML must be parsed with different methods because the width='90%' is no longer in the HTML.
Still working on the proper methods; will post back when I get it working.
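For reference, the usual workarounds are passing the class in an attrs dictionary, as above, since class is a reserved word in Python (BeautifulSoup 4 later added the class_ keyword for this), and using findAll instead of find to get every matching row rather than just the first. A minimal sketch in the same BeautifulSoup 3 / Python 2 style as the question, with the class name copied from the HTML shown above:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/index.php?option=com_fabrik&view=table&tableid=534&calculations=0&Itemid=82")
soup = BeautifulSoup(page)

# findAll returns all matching cells; find returns only the first one.
cells = soup.findAll("td", {"class": "fabrik_row___jos_fabrik_icc-ccs-piracymap2010___narrations"})
for cell in cells:
    paragraphs = cell.findAll('p')
    if len(paragraphs) >= 2:
        print "Time    ", paragraphs[0]   # date/position paragraph
        print "Incident", paragraphs[1]   # narration paragraph
        print ""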