I am trying to pull all of the text and links, grouped by date, from a table, and so far can only get one entry (and not correctly, since the link is not named properly). I think next_sibling might work here, but perhaps that's not the right solution.
Here's the HTML:
<ul class="indented">
<br>
<strong>May 15, 2019</strong>
<ul>
Sign up for more insight into FERC with our monthly news email, The FERC insight
Read More
</ul>
<br><br>
<strong>May 15, 2019</strong>
<ul>
FERC To Convene a Technical Conference regarding Columbia Gas Transmission, LLC on July 10, 2019
Notice <img src="/images/icon_pdf.gif" alt="PDF"> | Event Details
</ul>
<br><br>
Here's my code:
import requests
from bs4 import BeautifulSoup
url1 = 'https://www.ferc.gov/media/headlines.asp'
r = requests.get(url1)
# Create a BeautifulSoup object
soup = BeautifulSoup(r.content, 'lxml')
# Pull headline text from the ul class indented
headlines = soup.find_all("ul", class_="indented")
headline = headlines[0]
date = headline.select_one('strong').text.strip()
print(date)
headline_text = headline.select_one('ul').text.strip()
print(headline_text)
headline_link = headline.select_one('ul a')["href"]
headline_link = 'https://www.ferc.gov' + headline_link
print(headline_link)
I get the first date, text and link because I'm using select_one. I need to get all of the links and name them properly for each date. Would find_next work here, or find_next_sibling?
I believe this is what you are looking for; it gets the date, announcement and related links:
[start same as your code, through the soup declaration]
dates = soup.find_all("strong")
for date in dates:
    if '2019' in date.text:
        print(date.text)
        # two hops: the first next_sibling is the whitespace text node
        # after </strong>, the second is the <ul> holding the announcement
        print(date.next_sibling.next_sibling.text)
        for ref in date.next_sibling.next_sibling.find_all('a'):
            new_link = "https://www.ferc.gov" + ref['href']
            print(new_link)
        print('=============================')
Random part of the output:
May 15, 2019
FERC To Convene a Technical Conference regarding Columbia Gas Transmission, LLC on July 10, 2019
Notice
| Event Details
https://www.ferc.gov/CalendarFiles/20190515104556-RP19-763-000%20TC.pdf
https://www.ferc.gov/EventCalendar/EventDetails.aspx?ID=13414&CalType=%20&CalendarID=116&Date=07/10/2019&View=Listview
=============================
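To answer the original question directly: find_next_sibling would also work here, and it avoids the double sibling hop because it skips over whitespace text nodes. A minimal sketch of that variant, under the same assumptions about the page structure as the code above:

dates = soup.find_all("strong")
for date in dates:
    if '2019' in date.text:
        # find_next_sibling('ul') jumps straight to the next <ul> element,
        # skipping the whitespace text node after </strong>
        announcement = date.find_next_sibling('ul')
        print(date.text)
        print(announcement.text)
        for ref in announcement.find_all('a'):
            print("https://www.ferc.gov" + ref['href'])
        print('=============================')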
I am new to Python, and Beautiful Soup is something I am not familiar with. I am trying to scrape data with Beautiful Soup but got stuck: for a particular profile I am extracting the name, connections, location, company, etc. If any one of these pieces of information is missing, the code throws an IndexError.
Here's the part of the code where I am trying to scrape:
# experience section
exp_section = soup.find('section', {'id': 'experience-section'})
exp_section = exp_section.find('ul')
div_tag = exp_section.find('div')
a_tag = div_tag.find('a')
job_title = a_tag.find('h3').get_text().strip()
company_name = a_tag.find_all('p')[1].get_text().strip()
joining_date = a_tag.find_all('h4')[0].find_all('span')[1].get_text().strip()
exp = a_tag.find_all('h4')[1].find_all('span')[1].get_text().strip()
info.append(company_name)
info.append(job_title)
info.append(joining_date)
info.append(exp)
info
Now, I have checked the profile and a few details are indeed missing; the code cannot skip the missing info when it's blank, and it gives this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-33-dd2a258c45e4> in <module>
17 a_tag = div_tag.find('a')
18 job_title = a_tag.find('h3').get_text().strip()
---> 19 company_name = a_tag.find_all('p')[1].get_text().strip()
20 joining_date = a_tag.find_all('h4')[0].find_all('span')[1].get_text().strip()
21 exp = a_tag.find_all('h4')[1].find_all('span')[1].get_text().strip()
Logic: if any piece of information is missing, the code should skip it and continue scraping the other data instead of raising an error.
I want my output to be:
name
location
connection
company
position
duration
tenure
Tre Sayles
San Jose, California, United States
500+ connections
Jacobs
Nan
5 yrs 9 mos
Nan
here's the code I have uploaded in wetransfer - https://wetransfer.com/downloads/c507a3d20a16e536bb8bd7aae9fd8e6d20210322074309/c32af9
Please help me make the code more efficient and error-free. Thanks in advance!
EDIT: I have added how I want my output to look.
This is a possible duplicate question. You need an if/else or try/except around every column you extract:
BeautifulSoup error handling when find returns NoneType
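For example, a minimal sketch of that pattern applied to the question's code (the helper name safe_text is illustrative, not from the original):

def safe_text(tag):
    # Return stripped text if the tag was found, otherwise a placeholder
    return tag.get_text().strip() if tag else 'Nan'

a_tag = div_tag.find('a')
job_title = safe_text(a_tag.find('h3'))

# For indexed lookups, guard the list length before indexing
p_tags = a_tag.find_all('p')
company_name = p_tags[1].get_text().strip() if len(p_tags) > 1 else 'Nan'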
I'm trying to get the text inside a table on Wikipedia, and I will do it for many pages (books in this case). I want to get the book genres.
[Screenshot: HTML code for the page]
I need to extract the td containing the genre, i.e., the row whose header text is "Genre".
I did this:
page2 = urllib.request.urlopen(url2)
soup2 = BeautifulSoup(page2, 'html.parser')
for table in soup2.find_all('table', class_='infobox vcard'):
    for tr in table.findAll('tr')[5:6]:
        for td in tr.findAll('td'):
            print(td.getText(separator="\n"))
This gets me the genre, but only on some pages, because the row count differs between pages.
Example of page where this does not work
https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye (table on the right side)
Does anyone know how to search for the row by the string "Genre"? Thank you.
In this particular case, you don't need to bother with all that. Just try:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/The_Catcher_in_the_Rye')
print(tables[0])
Output:
0 1
0 First edition cover First edition cover
1 Author J. D. Salinger
2 Cover artist E. Michael Mitchell[1][2]
3 Country United States
4 Language English
5 Genre Realistic fictionComing-of-age fiction
6 Published July 16, 1951
7 Publisher Little, Brown and Company
8 Media type Print
9 Pages 234 (may vary)
10 OCLC 287628
11 Dewey Decimal 813.54
From here you can use standard pandas methods to extract whatever you need.
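For instance, a small sketch of pulling the genre out of that DataFrame (the integer column labels 0 and 1 come from the output above):

infobox = tables[0]
# Select column 1 of the row whose column-0 label is 'Genre'
genre = infobox.loc[infobox[0] == 'Genre', 1].iloc[0]
print(genre)  # Realistic fictionComing-of-age fiction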
I have got this far by using soup.findAll('span'):
<span data-reactid="12">Previous Close</span>,
<span class="Trsdu(0.3s) " data-reactid="14">5.52</span>,
<span data-reactid="17"></span>,
<span class="Trsdu(0.3s) " data-reactid="19">5.49</span>,
<span data-reactid="38">Volume</span>,
<span class="Trsdu(0.3s) " data-reactid="40">1,164,604</span>,
...
I want a table that shows me:
Open 5.49
Volume 1,164,604
...
I tried soup.findAll('span').text but it gives this error message:
ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
This is the source:
https://finance.yahoo.com/quote/gxl.ax?p=gxl.ax
Luckily the error gives us a hint:
You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
Try one of these:
soup.findAll('span')[0].text
soup.findAll('span')[i].text
soup.find('span').text
This is a generic problem when navigating many selector systems, CSS selectors included. To operate on an element, it must be a single element rather than a set. findAll() returns a set (a list-like ResultSet), so you can either index into it (e.g. [i]) or find the first match with find().
soup.findAll('span') will return the matching elements in a ResultSet. You'd have to iterate through those to print the text. So try:
spans = soup.findAll('span')
for ele in spans:
    data = ele.text
    print(data)
To take your output and put it into a DataFrame:
import pandas as pd

your_output = ['Previous Close', '5.52', 'Open', '5.49', 'Bid', 'Ask', "Day's Range", '52 Week Range', 'Volume', '1,164,604', 'Avg. Volume', '660,530']
headers = your_output[::2]   # every other item, starting at index 0: the labels
data = your_output[1::2]    # every other item, starting at index 1: the values
df = pd.DataFrame([data], columns=headers)
Additional
You certainly can use BeautifulSoup to parse the HTML and build a dataframe by iterating through the elements, but I would like to offer an alternative to BeautifulSoup.
Pandas does most of the work for you, if it can identify the tables within the HTML, via .read_html. You can achieve the dataframe-style table you are looking for with that.
import pandas as pd

tables = pd.read_html(url)
df = pd.concat(tables)
Output:
print(df)
0 1
0 Previous Close 5.50
1 Open 5.50
2 Bid 5.47 x 0
3 Ask 5.51 x 0
4 Day's Range 5.47 - 5.51
5 52 Week Range 3.58 - 6.49
6 Volume 634191
7 Avg. Volume 675718
0 Market Cap 660.137M
1 Beta (3Y Monthly) 0.10
2 PE Ratio (TTM) 31.49
3 EPS (TTM) 0.17
4 Earnings Date NaN
5 Forward Dividend & Yield 0.15 (2.82%)
6 Ex-Dividend Date 2019-02-12
7 1y Target Est 5.17
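From there, a quick sketch of turning that two-column table into a name-to-value lookup (the integer column labels 0 and 1 are as in the output above):

lookup = df.set_index(0)[1]
print(lookup['Volume'])      # 634191
print(lookup['Market Cap'])  # 660.137M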
I have a large number of these pages saved locally and am working to extract the content and put it in a CSV. I have two questions, and over two full days I've tried so many solutions that it would be difficult to list them here.
Here's the page hosted online for reference: source page
and code:
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/my/path/to/local/files/filename.html"), "lxml")
table = soup.find('table', attrs={"class": "report_column"})
headers = [header.text for header in table.find_all('th')]
rows = []
for row in table.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])
url = []
with open('output_file.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["", "License Num", "Status", "Type/ Dup.", "Expir. Date", "Primary Owner and Premises Addr.", "Mailing Address", "Action", "Conditions", "Escrow", "District Code", "Geo Code"])
    writer.writerows(row for row in rows if row)
The first question: I need to associate each row of the CSV with the date, which is shown at the top of the page but is also available in the href of all the column headers. How would I extract that href and either join it in or somehow add it as an independent column in the CSV?
The second question: for cells with multiple line breaks (like the Primary Owner and Mailing Address columns), I'm getting the cell content as one long string. Could you give me any tips on how to delineate the line breaks with pipes or, ideally, put them in separate columns, e.g. Owner1, Owner2, Owner3, Owner4, one for each (up to 4) lines in the cell?
Thanks for any help!
Desired Output:
Right now I'm getting this in the Primary Owner column:
DBA:MECCA DELICATESSEN RAWLINGS, LINDA MAE 215 RESERVATION RDMARINA, CA 93933
And, ideally, I could get four columns, one for each line (delineated by a <br> in the table):
col0 July 12, 2017 (date from page header)
col6 DBA:MECCA DELICATESSEN RAWLINGS
col7 LINDA MAE
col8 215 RESERVATION RD
col9 MARINA, CA 93933
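One possible approach to both questions, sketched as an untested outline: it assumes the date can be parsed out of a header link's href (the href format here is illustrative, so adjust the parsing to the real URLs) and that lines within a cell are separated by <br> tags, which get_text(separator='|') keeps apart.

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/my/path/to/local/files/filename.html"), "lxml")
table = soup.find('table', attrs={"class": "report_column"})

# The date is assumed to be embedded in the first header's link
first_link = table.find('th').find('a')
report_date = first_link['href'] if first_link else ''  # parse the date out of this href

rows = []
for row in table.find_all('tr'):
    cells = [val.get_text(separator='|').strip() for val in row.find_all('td')]
    if cells:
        # Prepend the date as an independent first column; the pipe-joined
        # lines can later be split into Owner1..Owner4 with str.split('|')
        rows.append([report_date] + cells)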
So I am calling the Twitter API:
openurl = urllib.urlopen("https://api.twitter.com/1/statuses/user_timeline.json?include_entities=true&contributor_details&include_rts=true&screen_name="+user+"&count=3600")
and it returns a long file like:
[{"entities":{"hashtags":[],"user_mentions":[],"urls":[{"url":"http:\/\/t.co\/Hd1ubDVX","indices":[115,135],"display_url":"amzn.to\/tPSKgf","expanded_url":"http:\/\/amzn.to\/tPSKgf"}]},"coordinates":null,"truncated":false,"place":null,"geo":null,"in_reply_to_user_id":null,"retweet_count":2,"favorited":false,"in_reply_to_status_id_str":null,"user":{"contributors_enabled":false,"lang":"en","profile_background_image_url_https":"https:\/\/si0.twimg.com\/profile_background_images\/151701304\/theme14.gif","favourites_count":0,"profile_text_color":"333333","protected":false,"location":"North America","is_translator":false,"profile_background_image_url":"http:\/\/a0.twimg.com\/profile_background_images\/151701304\/theme14.gif","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/1642783876\/idB005XNC8Z4_normal.png","name":"User Interface Books","profile_link_color":"009999","url":"http:\/\/twitter.com\/ReleasedBooks\/genres","utc_offset":-28800,"description":"All new user interface and graphic design book releases posted on their publication day","listed_count":11,"profile_background_color":"131516","statuses_count":1189,"following":false,"profile_background_tile":true,"followers_count":732,"profile_image_url":"http:\/\/a2.twimg.com\/profile_images\/1642783876\/idB005XNC8Z4_normal.png","default_profile":false,"geo_enabled":false,"created_at":"Mon Sep 20 21:28:15 +0000 2010","profile_sidebar_fill_color":"efefef","show_all_inline_media":false,"follow_request_sent":false,"notifications":false,"friends_count":1,"profile_sidebar_border_color":"eeeeee","screen_name":"User","id_str":"193056806","verified":false,"id":193056806,"default_profile_image":false,"profile_use_background_image":true,"time_zone":"Pacific Time (US & Canada)"},"possibly_sensitive":false,"in_reply_to_screen_name":null,"created_at":"Thu Nov 17 00:01:45 +0000 2011","in_reply_to_user_id_str":null,"retweeted":false,"source":"\u003Ca href=\"http:\/\/twitter.com\/ReleasedBooks\/genres\" rel=\"nofollow\"\u003EBook Releases\u003C\/a\u003E","id_str":"136957158075011072","in_reply_to_status_id":null,"id":136957158075011072,"contributors":null,"text":"Digital Media: Technological and Social Challenges of the Interactive World - by William Aspray - Scarecrow Press. http:\/\/t.co\/Hd1ubDVX"},{"entities":{"hashtags":[],"user_mentions":[],"urls":[{"url":"http:\/\/t.co\/GMCzTija","indices":[119,139],"display_u
Well, the different objects are split into lists and dictionaries, and I want to extract the different parts, but to do this I have to know how many objects the file has.
example:
[{1:info , 2:info}][{1:info , 2:info}][{1:info , 2:info}][{1:info , 2:info}]
So to extract the info from key 1 in the first table I would do:
[0]['1']
>>>>info
But to extract it from the last object in the table, I need to know how many objects the table has.
This is what my code looks like:
table_timeline = json.loads(twitter_timeline)
table_timeline_inner = table_timeline[x]
lines = 0
while lines < linesmax:
    in_reply_to_user_id = table_timeline_inner['in_reply_to_status_id_str']
    lines += 1
So how do I find the value of the last object in this table?
Thanks.
I'm not entirely sure this is what you're looking for, but to get the last item in a Python list, use an index of -1. For example:
>>> alist = [{'position': 'first'}, {'position': 'second'}, {'position': 'third'}]
>>> print alist[-1]
{'position': 'third'}
>>> print alist[-1]['position']
third
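And as a short sketch using the question's own variable names: len() gives the number of objects, and a plain for loop avoids tracking a counter by hand (the id_str field is taken from the JSON shown above).

table_timeline = json.loads(twitter_timeline)
print len(table_timeline)                # how many objects the list has
print table_timeline[-1]['id_str']       # a field from the last object
for tweet in table_timeline:             # or simply iterate over all of them
    print tweet['in_reply_to_status_id_str']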