Extract specific columns from a given webpage - pandas

I am trying to read a web page using Python and save the data in CSV format so it can be imported as a pandas DataFrame.
The following code extracts the fields from all the pages, but I only want to read certain column fields.
for i in range(10):
    url='https://pythonexpress.in/workshop/'+str(i).zfill(3)
    import urllib2
    from bs4 import BeautifulSoup
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        for anchor in soup.find_all('div', {'class':'col-xs-8'})[:9]:
            print i, anchor.text
    except:
        pass
Can I save these 9 columns as a pandas DataFrame?
df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']

This returns the correct results for the first 10 pages, but it takes a long time for 100 pages. Any suggestions to make it faster?
import urllib2
from bs4 import BeautifulSoup
finallist=list()
for i in range(10):
    url='https://pythonexpress.in/workshop/'+str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        mylist=list()
        for anchor in soup.find_all('div', {'class':'col-xs-8'})[:9]:
            mylist.append(anchor.text)
        finallist.append(mylist)
    except:
        pass
import pandas as pd
df=pd.DataFrame(finallist)
df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']
df['Date'] = pd.to_datetime(df['Date'],infer_datetime_format=True)
df['participants'] = df['participants'].astype(int)
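One common way to cut the waiting time is to download the pages concurrently, since most of the time is spent waiting on the network rather than parsing. The following is a minimal Python 3 sketch, not a drop-in replacement for the code above: `build_urls`, `fetch_all`, and the `fetch` argument are hypothetical names introduced here, and `fetch` stands for whatever function downloads one URL and returns its HTML (for example a thin wrapper around `urllib.request.urlopen`).

```python
from concurrent.futures import ThreadPoolExecutor


def build_urls(count, base='https://pythonexpress.in/workshop/'):
    """Return the workshop URLs for ids 0..count-1, zero-padded to 3 digits."""
    return [base + str(i).zfill(3) for i in range(count)]


def fetch_all(urls, fetch, max_workers=8):
    """Download every URL with a thread pool.

    `fetch` is any callable taking a URL and returning the page HTML.
    Pages that raise an exception are stored as None so one bad page
    does not abort the whole run. Results keep the order of `urls`.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_fetch, urls))
```

Each page that is not None can then be parsed with BeautifulSoup as in the loop above. Keeping `max_workers` small is a reasonable default so the site is not hit with too many requests at once.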

Related

bs4 can't get specific results

I am trying to get specific data from a website that is under a class that is used multiple times. So my thought was to search for the next biggest class and then use bs4 again to narrow my search results further. However, I get this error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
This is my code:
import requests
from bs4 import BeautifulSoup
def main():
    responce()
def responce():
    r = requests.get('https://robinhood.com/stocks/WISH')
    soup = BeautifulSoup(r.content, 'html.parser')
    responce = soup.find_all(class_="css-ktio0g")
    responce = responce.find_all(class_="css-6e9xj2")
    print(responce)
main()
import requests
from bs4 import BeautifulSoup
from pprint import pp
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = [x.text for x in soup.select('span.css-ktio0g')]
    pp(goal)
main('https://robinhood.com/stocks/WISH')
Output:
['Piotr Szulczewski',
'—',
'San Francisco, California',
'2010',
'7.25B',
'—',
'—',
'155.46M',
'$12.39',
'$11.44',
'$11.97',
'76.70M',
'$32.85',
'$7.52',
'— per share',
'Expected Aug 11, After Hours',
'Sign up for a Robinhood Account to buy or sell ContextLogic stock and '
'options commission-free.']
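As for the AttributeError in the question: `find_all` returns a `ResultSet`, which behaves like a list of tags, so a second `find_all` has to be called on each element rather than on the list itself. Here is a minimal sketch on a made-up HTML snippet (the markup is invented for illustration; only the class names come from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="css-ktio0g"><span class="css-6e9xj2">2010</span></div>
<div class="css-ktio0g"><span class="css-6e9xj2">7.25B</span></div>
"""

soup = BeautifulSoup(html, 'html.parser')

values = []
# Loop over the outer matches, then search inside each one.
for container in soup.find_all(class_='css-ktio0g'):
    for inner in container.find_all(class_='css-6e9xj2'):
        values.append(inner.get_text(strip=True))

# The same nesting expressed as a single CSS selector.
same = [tag.get_text(strip=True) for tag in soup.select('.css-ktio0g .css-6e9xj2')]

print(values)
```

Whether the nested classes exist on the live page is a separate matter; the sketch only shows how to search inside each result.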

Append href links into a dataframe list, getting all required info, but only links from last page appears

Using Beautiful Soup and pandas, I am trying to collect all the links on a site into a list with the following code. I can scrape the relevant information from the table on every page, and the code mostly works. The problem is that only the links from the last page appear, which is not what I expected. In the end, I'd like a table containing all 40 links (next to the required info) from 2 pages. I am scraping 2 pages first, although there are 618 pages in total. Do you have any advice on how to adjust the code so that each link is added to the table? Many thanks in advance.
import pandas as pd
import requests
from bs4 import BeautifulSoup
hdr={'User-Agent':'Chrome/84.0.4147.135'}
dfs=[]
for page_number in range(2):
    http= "http://example.com/&Page={}".format(page_number+1)
    print('Downloading page %s...' % http)
    url= requests.get(http,headers=hdr)
    soup = BeautifulSoup(url.text, 'html.parser')
    table = soup.find('table')
    df_list= pd.read_html(url.text)
    df = pd.concat(df_list)
    dfs.append(df)
links = []
for tr in table.findAll("tr"):
    trs = tr.findAll("td")
    for each in trs:
        try:
            link = each.find('a')['href']
            links.append(link)
        except:
            pass
df['Link'] = links
final_df = pd.concat(dfs)
final_df.to_csv('myfile.csv',index=False,encoding='utf-8-sig')
The problem is in your logic. You only add the Link column to the last df because that code is outside your loop. Collect the links inside the page loop, add them to df, and then append df to your dfs list:
import pandas as pd
import requests
from bs4 import BeautifulSoup
hdr={'User-Agent':'Chrome/84.0.4147.135'}
dfs=[]
for page_number in range(2):
    http= "http://example.com/&Page={}".format(page_number+1)
    print('Downloading page %s...' % http)
    url= requests.get(http,headers=hdr)
    soup = BeautifulSoup(url.text, 'html.parser')
    table = soup.find('table')
    df_list= pd.read_html(url.text)
    df = pd.concat(df_list)
    links = []
    for tr in table.findAll("tr"):
        trs = tr.findAll("td")
        for each in trs:
            try:
                link = each.find('a')['href']
                links.append(link)
            except:
                pass
    df['Link'] = links
    dfs.append(df)
final_df = pd.concat(dfs)
final_df.to_csv('myfile.csv',index=False,encoding='utf-8-sig')
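One thing to watch with this approach: `df['Link'] = links` only works when the number of collected links equals the number of rows in `df`. A row without a link, or with more than one, leads to a length-mismatch error. Below is a hedged sketch of one way around that, collecting exactly one value per table row (None when a row has no link). The HTML and the `rows_with_links` helper are invented for illustration and are not part of the original answer:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table: the second data row has no link.
html = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td><a href="/item/1">Alpha</a></td><td>Oslo</td></tr>
  <tr><td>Beta</td><td>Rome</td></tr>
</table>
"""


def rows_with_links(table):
    """Return one dict per data row: cell texts plus the first href (or None)."""
    header = [th.get_text(strip=True) for th in table.find_all('th')]
    records = []
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        if not cells:  # header row: it has no <td> cells
            continue
        record = dict(zip(header, (td.get_text(strip=True) for td in cells)))
        anchor = tr.find('a', href=True)
        record['Link'] = anchor['href'] if anchor else None
        records.append(record)
    return records


table = BeautifulSoup(html, 'html.parser').find('table')
df = pd.DataFrame(rows_with_links(table))
print(df)
```

Because the link is stored in the same record as the row's cells, the two can never drift out of step, at the cost of parsing the table by hand instead of using `pd.read_html`.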

KeyError: 0 when converting bs4 xml to pandas df

I am trying to import xml to pandas using bs4.
The bs4 import works, but getting pandas to recognise the xml is problematic.
import requests
import bs4
import pandas as pd
url = 'https://www.federalreserve.gov/data.xml'
geturl = requests.get(url).text
data = bs4.BeautifulSoup(geturl, 'lxml')
df = pd.DataFrame(data)
print(df.head())
I am expecting the df to show the first 5 rows of data, but instead I get the following error:
KeyError: 0
Why is pandas producing this KeyError: 0?
Many thanks!
There are five different charts in the xml file. Which one do you want? This is an example using the first chart:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# xml url
xml = 'https://www.federalreserve.gov/data.xml'
# GET request and create soup
r = requests.get(xml)
soup = BeautifulSoup(r.text, 'xml')
# list comprehension to create a list of all the charts in the xml file
charts = [chart for chart in soup.findAll('chart')]
# list comprehension to get the observation index and value of the first chart (i.e, charts[0])
data = [[ob['index'], ob['value']] for ob in charts[0].findAll('observation')]
# create DataFrame
df = pd.DataFrame(data, columns=['Date', 'Value'])
df.head()
Date Value
0 1-Aug-07 870261.00
1 8-Aug-07 865453.00
2 15-Aug-07 864931.00
3 22-Aug-07 862775.00
4 29-Aug-07 872873.00
Update
You can iterate through all the charts and append to a dict. You will then call each DataFrame by the title of the chart:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# xml url
xml = 'https://www.federalreserve.gov/data.xml'
# GET request and create soup
r = requests.get(xml)
soup = BeautifulSoup(r.text, 'xml')
# list comprehension to create a list of all the charts in the xml file
charts = [chart for chart in soup.findAll('chart')]
# empty dict
df_list = {}
for chart in charts:
    # list comprehension to get the observation index and value
    data = [[ob['index'], ob['value']] for ob in chart.findAll('observation')]
    # create DataFrame
    df = pd.DataFrame(data, columns=['Date', 'Value'])
    # create key from the chart title and append df
    df_list[chart['title']] = []
    df_list[chart['title']].append(df)
# calling the second chart
df_list['Selected Assets of the Federal Reserve'][0].head()
Date Value
0 1-Aug-07 870261.00
1 8-Aug-07 865453.00
2 15-Aug-07 864931.00
3 22-Aug-07 862775.00
4 29-Aug-07 872873.00
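Note that attribute values come out of the parser as strings, so both columns above hold text. If date or numeric operations are needed, the columns can be converted afterwards. Here is a small sketch, assuming the date strings follow the day, abbreviated month, two-digit year pattern seen in the output above; the sample rows are copied from that output:

```python
import pandas as pd

# First rows from the output above, as the strings returned by the parser.
data = [['1-Aug-07', '870261.00'],
        ['8-Aug-07', '865453.00'],
        ['15-Aug-07', '864931.00']]

df = pd.DataFrame(data, columns=['Date', 'Value'])

# Parse the dates with an explicit format and turn the values into floats.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df['Value'] = pd.to_numeric(df['Value'])

print(df.dtypes)
```

Passing an explicit `format` avoids any guessing about day and month order.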

How to use pd.DataFrame method to manually create a dataframe from info scraped using beautifulsoup4

I made it to the point where all the tr data has been scraped and I am able to get a nice printout. But when I go to build the DataFrame with pd.DataFrame, as in df = pd.DataFrame({"A": a}) etc., I get a syntax error.
Here is a list of my imported libraries in the Jupyter Notebook:
import pandas as pd
import numpy as np
import bs4 as bs
import requests
import urllib.request
import csv
import html5lib
from pandas.io.html import read_html
import re
Here is my code:
source = urllib.request.urlopen('https://www.zipcodestogo.com/Texas/').read()
soup = bs.BeautifulSoup(source,'html.parser')
table_rows = soup.find_all('tr')
table_rows
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
texas_info = pd.DataFrame({
"title": Texas
"Zip Code" : [Zip Code],
"City" :[City],
})
texas_info.head()
I expect to get a dataframe with two columns, one being the 'Zip Code' and the other the 'Cities'
If you want to create it manually, with bs4 4.7.1+ you can use the :not, :contains and :nth-of-type pseudo-classes to isolate the two columns of interest, build a dict, and convert it to a DataFrame. (Recent soupsieve releases deprecate :contains in favour of :-soup-contains.)
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup as bs
source = urllib.request.urlopen('https://www.zipcodestogo.com/Texas/').read()
soup = bs(source,'lxml')
zips = [item.text for item in soup.select('.inner_table:contains(Texas) td:nth-of-type(1):not([colspan])')]
cities = [item.text for item in soup.select('.inner_table:contains(Texas) td:nth-of-type(2):not([colspan])')]
d = {'Zips': zips,'Cities': cities}
df = pd.DataFrame(d)
df = df[1:].reset_index(drop = True)
You could combine selectors into one line:
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup as bs
source = urllib.request.urlopen('https://www.zipcodestogo.com/Texas/').read()
soup = bs(source,'lxml')
items = [item.text for item in soup.select('.inner_table:contains(Texas) td:nth-of-type(1):not([colspan]), .inner_table:contains(Texas) td:nth-of-type(2):not([colspan])')]
d = {'Zips': items[0::2],'Cities': items[1::2]}
df = pd.DataFrame(d)
df = df[1:].reset_index(drop = True)
print(df)
I note you want to create it manually, but it is worth knowing for future readers that you could just use pandas read_html:
import pandas as pd
table = pd.read_html('https://www.zipcodestogo.com/Texas/')[1]
table.columns = table.iloc[1]
table = table[2:]
table = table.drop(['Zip Code Map', 'County'], axis=1).reset_index(drop=True)
print(table)
Try collecting each table row in a list inside the for loop and building the DataFrame once at the end. (DataFrame.append was removed in pandas 2.0, so appending row by row to the DataFrame no longer works there.)
rows = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    if len(row) < 2: # skip header or spacer rows without two cells
        continue
    zipCode = row[0] # assuming first column
    city = row[1] # assuming second column
    rows.append({"Zip Code": zipCode, "City": city})
df = pd.DataFrame(rows)
If you only need these two columns, you should not include title in the DataFrame (that will create another column); that line also happened to be where the syntax error occurred because of the missing comma.

input custom text in youtube text field using selenium python

I'm making a text scraper for YouTube in which I want to enter a search term, search for videos, and collect data about them. I'm having trouble entering text in the search field. Can anyone suggest a method to do that?
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()
driver.get('https://freight.rivigo.com/dashboard/home')
# find_element_by_tag_name was removed in Selenium 4.3; use find_element with By
elem = driver.find_element(By.TAG_NAME, "body")
no_of_pagedowns = 120
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.5)
    no_of_pagedowns-=1
soup = BeautifulSoup(driver.page_source, 'lxml') # parse the page after scrolling
Somewhere in this code I want to enter custom text in the input field, let's say "comedy", and get the data for that search. I'm stuck on how to input the text, and I'm quite new to this, so any help would be appreciated.
That page is NOT pointing to YouTube. Check out the working code sample below for an idea of what you can do with the YouTube API.
# https://medium.com/greyatom/youtube-data-in-python-6147160c5833
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
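# youtube_search is assumed to be the helper from the linked article, saved locally as youtube_data.py
from youtube_data import youtube_search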
test = youtube_search("Nine Inch Nails")
test.keys()
test['commentCount'][:5]
df = pd.DataFrame(data=test)
df.head()
df1 = df[['title','viewCount','channelTitle','commentCount','likeCount','dislikeCount','tags','favoriteCount','videoId','channelId','categoryId']]
df1.columns = ['Title','viewCount','channelTitle','commentCount','likeCount','dislikeCount','tags','favoriteCount','videoId','channelId','categoryId']
df1.head()
#import numpy as np
#numeric_dtype = ['viewCount','commentCount','likeCount','dislikeCount','favoriteCount']
#for i in numeric_dtype:
# df1[i] = df[i].astype(int)
NIN = df1[df1['channelTitle']=='Nine Inch Nails']
NIN.head()
NIN = NIN.sort_values(ascending=False,by='viewCount')
plt.bar(range(NIN.shape[0]),NIN['viewCount'])
plt.xticks(range(NIN.shape[0]),NIN['Title'],rotation=90)
plt.ylabel('viewCount in 100 millions')
plt.show()
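Back to the original question of typing into the search box: with Selenium the usual pattern is to locate the input element and call `send_keys` on it, the same method the question already uses for `Keys.PAGE_DOWN`. The helper below is a rough sketch that has not been run against the live site. `type_and_submit` is a name made up here, and the locator `name="search_query"` is an assumption about YouTube's markup that should be checked in the browser's inspector. The string "name" is the value of `By.NAME` in Selenium 4.

```python
def type_and_submit(driver, text, by="name", locator="search_query"):
    """Type `text` into the element found by (by, locator) and submit its form.

    `driver` is a Selenium WebDriver that has already loaded the page.
    The default locator is an assumption, not a verified selector.
    """
    box = driver.find_element(by, locator)
    box.clear()          # remove any text already in the field
    box.send_keys(text)  # type the query
    box.submit()         # submit the form that contains the field
    return box
```

With a real browser this would be called as `type_and_submit(driver, "comedy")` after `driver.get('https://www.youtube.com')`, and the scrolling loop from the question can then run on the results page. If `submit()` does not trigger the search, sending `Keys.RETURN` after the text is a common alternative.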