Append href links into a dataframe list, getting all required info, but only links from the last page appear - pandas

Using Beautiful Soup and pandas, I am trying to append all the links on a site into a list with the following code. I am able to scrape all pages with the relevant information in the table, and the code mostly works. The small problem is that only the links from the last page appear in the output, which is not what I expected. In the end, I'd like to build a table containing all 40 links (next to the required info) from 2 pages. I am trying to scrape 2 pages first, although there are 618 pages in total. Do you have any advice on how to adjust the code so that each link is appended into the table? Many thanks in advance.
import pandas as pd
import requests
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}
dfs = []
for page_number in range(2):
    http = "http://example.com/&Page={}".format(page_number + 1)
    print('Downloading page %s...' % http)
    url = requests.get(http, headers=hdr)
    soup = BeautifulSoup(url.text, 'html.parser')
    table = soup.find('table')
    df_list = pd.read_html(url.text)
    df = pd.concat(df_list)
    dfs.append(df)

links = []
for tr in table.findAll("tr"):
    trs = tr.findAll("td")
    for each in trs:
        try:
            link = each.find('a')['href']
            links.append(link)
        except:
            pass

df['Link'] = links
final_df = pd.concat(dfs)
final_df.to_csv('myfile.csv', index=False, encoding='utf-8-sig')

The issue is with your logic. You only add the links column to the last df because that step sits outside your loop. Collect the links within the page loop, add them to df, and then append df to your dfs list:
import pandas as pd
import requests
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}
dfs = []
for page_number in range(2):
    http = "http://example.com/&Page={}".format(page_number + 1)
    print('Downloading page %s...' % http)
    url = requests.get(http, headers=hdr)
    soup = BeautifulSoup(url.text, 'html.parser')
    table = soup.find('table')
    df_list = pd.read_html(url.text)
    df = pd.concat(df_list)

    links = []
    for tr in table.findAll("tr"):
        trs = tr.findAll("td")
        for each in trs:
            try:
                link = each.find('a')['href']
                links.append(link)
            except:
                pass
    df['Link'] = links
    dfs.append(df)

final_df = pd.concat(dfs)
final_df.to_csv('myfile.csv', index=False, encoding='utf-8-sig')
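As a side note: if you are on pandas 1.5 or newer, read_html can return the hrefs for you via its extract_links argument, which avoids the manual BeautifulSoup pass entirely. A minimal sketch under that assumption (the example URL and single-table layout are carried over from the question):

import pandas as pd
import requests

hdr = {'User-Agent': 'Chrome/84.0.4147.135'}
dfs = []
for page_number in range(2):
    http = "http://example.com/&Page={}".format(page_number + 1)
    resp = requests.get(http, headers=hdr)
    # extract_links="body" (pandas >= 1.5) turns each body cell into a (text, href) tuple
    df = pd.concat(pd.read_html(resp.text, extract_links="body"))
    dfs.append(df)

final_df = pd.concat(dfs)

Each cell then carries its link alongside its text, so you can split the tuples into separate columns however you like.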

Related

Extract Title Tags BeautifulSoup

I need help because I want to write code that finds the title tags on a website. I used the code from another question and applied it to this scenario, but there are no title tags whenever I print Beschreibung.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import urllib.parse

webseite = 'https://www.entega.de/sitemap/'
response = requests.get(webseite)
response.status_code
soup = BeautifulSoup(response.content, 'html.parser')
result_container = soup.find_all('div', {'class': 'clearfix'})

url_part_1 = 'https://www.entega.de/sitemap/'
url_part_2 = []
for item in result_container:
    for link in item.find_all('a', {'class': 'modSitemap__lvl1Link ui-link'}):
        url_part_2.append(link.get('href'))

url_joined = []
for i in url_part_2:
    url_joined.append(urllib.parse.urljoin(url_part_1, i))

Überschrift = []
Beschreibung = []
Verlinkungen = []
for link in url_joined:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    Beschreibung.append(soup.find_all('a', title=True, class_='modSitemap__lvl1Link ui-link'))
You are getting nothing because these links don't have an <a class="modSitemap__lvl1Link ui-link"> tag. They do have classes that start with that string, though, so you could expand your match to cover that. Or you can simply get the <a> tags that have a title attribute.
So change your loop to either:
import re

for link in url_joined:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    Beschreibung.append(soup.find_all('a', {'class': re.compile("^modSitemap__lvl1Link ui-link")}, title=True))
or
for link in url_joined:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    Beschreibung.append(soup.find_all('a', title=True))
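Either way, Beschreibung ends up as a list of lists of Tag objects. If what you actually want are the title strings themselves, you can pull the attribute out as you collect them; a small sketch of that (flattening everything into one list is an assumption about the shape you want):

Beschreibung = []
for link in url_joined:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'html.parser')
    # collect the title attribute of every <a title="..."> on the page
    Beschreibung.extend(a['title'] for a in soup.find_all('a', title=True))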

Scrape url link in table by BS4

I tried to scrape the hyperlinks (the <a href> tags) in the table. However, it doesn't work. Can you help me improve this code?
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import pandas as pd
import time

dfs = pd.DataFrame()
for i in range(1, 11):
    driver = webdriver.Chrome()
    driver.get('https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2021/02/06&Racecourse=ST&RaceNo=' + str(i) + '')
    res = driver.execute_script('return document.documentElement.outerHTML')
    time.sleep(3)
    driver.quit()
    soup = BeautifulSoup(res, 'lxml')
    h_table = soup.find('table', {'class': 'table_bd f_tac f_fs13'})

    def tableDataText(h_table):
        rows = []
        trs = h_table.find_all('tr')
        headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')]  # header row
        if headerow:  # if there is a header row include it first
            rows.append(headerow)
            trs = trs[1:]
        for tr in trs:  # for every table row
            rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data row
        return rows

    result_table = tableDataText(h_table)
    df = pd.DataFrame(result_table[1:], columns=result_table[0])
    dfs = pd.concat([dfs, df], ignore_index=True)
Your question and the expected result are not that clear and should be improved. If you just want to grab all the URLs from the href attributes, you can go with:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

linkList = []
for i in range(1, 11):
    driver = webdriver.Chrome()
    driver.get('https://racing.hkjc.com/racing/information/English/racing/RaceCard.aspx?RaceDate=2021/02/06&Racecourse=ST&RaceNo=' + str(i) + '')
    time.sleep(6)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    driver.quit()

    for a in soup.select('table#racecardlist table a'):
        linkList.append('https://racing.hkjc.com' + a['href'])

linkList
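If you then want those links in a DataFrame or a CSV rather than a plain list, a minimal follow-up sketch (the column and file names are just examples):

import pandas as pd

# turn the collected hrefs into a one-column DataFrame for further processing
links_df = pd.DataFrame({'Link': linkList})
links_df.to_csv('links.csv', index=False)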

KeyError: 0 when converting bs4 xml to pandas df

I am trying to import XML into pandas using bs4.
The bs4 import works, but getting pandas to recognise the XML is problematic.
import requests
import bs4
import pandas as pd
url = 'https://www.federalreserve.gov/data.xml'
geturl = requests.get(url).text
data = bs4.BeautifulSoup(geturl, 'lxml')
df = pd.DataFrame(data)
print(df.head())
I am expecting the df to show the first 5 rows of data, but instead I get the following error:
KeyError: 0
Why is pandas producing this KeyError: 0?
Many thanks!
There are five different charts in the XML file. Which one do you want? Note that pandas can't build a DataFrame directly from a BeautifulSoup object, so you need to extract the rows yourself first. This is an example using the first chart:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# xml url
xml = 'https://www.federalreserve.gov/data.xml'
# GET request and create soup
r = requests.get(xml)
soup = BeautifulSoup(r.text, 'xml')
# list comprehension to create a list of all the charts in the xml file
charts = [chart for chart in soup.findAll('chart')]
# list comprehension to get the observation index and value of the first chart (i.e, charts[0])
data = [[ob['index'], ob['value']] for ob in charts[0].findAll('observation')]
# create DataFrame
df = pd.DataFrame(data, columns=['Date', 'Value'])
df.head()
Date Value
0 1-Aug-07 870261.00
1 8-Aug-07 865453.00
2 15-Aug-07 864931.00
3 22-Aug-07 862775.00
4 29-Aug-07 872873.00
Update
You can iterate through all the charts and append to a dict. You will then call each DataFrame by the title of the chart:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# xml url
xml = 'https://www.federalreserve.gov/data.xml'

# GET request and create soup
r = requests.get(xml)
soup = BeautifulSoup(r.text, 'xml')

# list comprehension to create a list of all the charts in the xml file
charts = [chart for chart in soup.findAll('chart')]

# empty dict
df_list = {}
for chart in charts:
    # list comprehension to get the observation index and value
    data = [[ob['index'], ob['value']] for ob in chart.findAll('observation')]
    # create DataFrame
    df = pd.DataFrame(data, columns=['Date', 'Value'])
    # create key from the chart title and append df
    df_list[chart['title']] = []
    df_list[chart['title']].append(df)

# calling the second chart
df_list['Selected Assets of the Federal Reserve'][0].head()
Date Value
0 1-Aug-07 870261.00
1 8-Aug-07 865453.00
2 15-Aug-07 864931.00
3 22-Aug-07 862775.00
4 29-Aug-07 872873.00
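A small simplification, if the one-element lists feel awkward: you can map each chart title straight to its DataFrame instead of wrapping it in a list. A sketch of the same loop under that change (reusing charts and the pandas import from the snippet above):

df_dict = {}
for chart in charts:
    data = [[ob['index'], ob['value']] for ob in chart.findAll('observation')]
    # map the chart title directly to its DataFrame, no wrapping list needed
    df_dict[chart['title']] = pd.DataFrame(data, columns=['Date', 'Value'])

df_dict['Selected Assets of the Federal Reserve'].head()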

Input custom text in a YouTube text field using Selenium in Python

I'm making a text scraper for YouTube in which I want to enter a query, search for videos, and collect data about them. I'm having trouble entering text into the text field. Can anyone suggest a method for doing that?
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
soup = BeautifulSoup(driver.page_source, 'lxml')  # use the page as source
page = driver.get('https://freight.rivigo.com/dashboard/home')

import sys
from importlib import reload
reload

elem = driver.find_element_by_tag_name("body")
no_of_pagedowns = 120
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.5)
    no_of_pagedowns -= 1

soup = BeautifulSoup(driver.page_source, 'lxml')
In between this code I want to type custom text into an input field, let's say "comedy", and get data for that. I'm stuck on how to input the text, and I'm quite new to this, so any help would be appreciated.
That page is NOT pointing to YouTube. Check out the working code sample below for an idea of what you can do with the YouTube API.
# https://medium.com/greyatom/youtube-data-in-python-6147160c5833
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#from youtube_data import youtube_search
test = youtube_search("Nine Inch Nails")
test.keys()
test['commentCount'][:5]
df = pd.DataFrame(data=test)
df.head()
df1 = df[['title','viewCount','channelTitle','commentCount','likeCount','dislikeCount','tags','favoriteCount','videoId','channelId','categoryId']]
df1.columns = ['Title','viewCount','channelTitle','commentCount','likeCount','dislikeCount','tags','favoriteCount','videoId','channelId','categoryId']
df1.head()
#import numpy as np
#numeric_dtype = ['viewCount','commentCount','likeCount','dislikeCount','favoriteCount']
#for i in numeric_dtype:
# df1[i] = df[i].astype(int)
NIN = df1[df1['channelTitle']=='Nine Inch Nails']
NIN.head()
NIN = NIN.sort_values(ascending=False,by='viewCount')
plt.bar(range(NIN.shape[0]),NIN['viewCount'])
plt.xticks(range(NIN.shape[0]),NIN['Title'],rotation=90)
plt.ylabel('viewCount in 100 millions')
plt.show()
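If you do want to type the query into YouTube's search box with Selenium rather than going through the API, a minimal sketch is below. The name="search_query" selector is an assumption about YouTube's markup, so verify it against the live page before relying on it.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('https://www.youtube.com')
time.sleep(3)  # crude wait for the page to render; WebDriverWait would be more robust

# the search box's name attribute is assumed to be "search_query"
search_box = driver.find_element(By.NAME, 'search_query')
search_box.send_keys('comedy')
search_box.send_keys(Keys.RETURN)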

Extract specific columns from a given webpage

I am trying to read a web page using Python and save the data in CSV format to be imported as a pandas DataFrame.
I have the following code, which extracts the links from all the pages; instead, I am trying to read certain column fields.
import urllib2
from bs4 import BeautifulSoup

for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            print i, anchor.text
    except:
        pass
Can I save these 9 columns as a pandas DataFrame?
df.columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']
This returns the correct results for the first 10 pages, but it takes a lot of time for 100 pages. Any suggestions to make it faster?
import urllib2
from bs4 import BeautifulSoup
import pandas as pd

finallist = list()
for i in range(10):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        page = urllib2.urlopen(url).read()
        soup = BeautifulSoup(page)
        mylist = list()
        for anchor in soup.find_all('div', {'class': 'col-xs-8'})[:9]:
            mylist.append(anchor.text)
        finallist.append(mylist)
    except:
        pass

df = pd.DataFrame(finallist)
df.columns = ['Organiser', 'Instructors', 'Date', 'Venue', 'Level', 'participants', 'Section', 'Status', 'Description']
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df['participants'] = df['participants'].astype(int)
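On the speed question: most of the time goes into the sequential HTTP requests, so fetching pages concurrently usually helps far more than any parsing tweak. A Python 3 sketch using requests and a thread pool (the URL pattern and column names are taken from the question; the worker count is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import pandas as pd

def fetch_row(i):
    url = 'https://pythonexpress.in/workshop/' + str(i).zfill(3)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
        return [div.text for div in soup.find_all('div', {'class': 'col-xs-8'})[:9]]
    except requests.RequestException:
        return None

# fetch 100 pages with up to 10 concurrent requests
with ThreadPoolExecutor(max_workers=10) as pool:
    # keep only pages that returned the full set of 9 fields
    rows = [r for r in pool.map(fetch_row, range(100)) if r and len(r) == 9]

df = pd.DataFrame(rows, columns=['Organiser', 'Instructors', 'Date', 'Venue', 'Level',
                                 'participants', 'Section', 'Status', 'Description'])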