Using a for loop to create a new column in pandas dataframe - pandas

I have been trying to create a web crawler to scrape data from a website called Baseball Reference. While defining my crawler I realized that each player has a unique id at the end of their URL, made up of the first 6 letters of their last name, three zeroes, and the first 3 letters of their first name.
I have a pandas dataframe that already contains columns 'first' and 'last' with each player's first and last names, along with a lot of other data that I downloaded from this same website.
My crawler function definition so far is:
def bbref_crawler(ID):
    url = 'https://www.baseball-reference.com/register/player.fcgi?id=' + str(ID)
    source_code = requests.get(url)
    page_soup = soup(source_code.text, features='lxml')
And the code that I have so far for trying to obtain the player ids is as follows:
for x in nwl_offense:
    while len(nwl_offense['last']) > 6:
        id_last = len(nwl_offense['last']) - 1
    while len(nwl_offense['first']) > 3:
        id_first = len(nwl_offense['first']) - 1
    nwl_offense['player_id'] = (str(id_first) + '000' + str(id_last))
When I run the for/while loop it just never stops running, and I am not sure how else to go about automating the player id into another column of that dataframe so I can easily use the crawler to obtain more information on the players I need for a project.
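A side note on why it never stops: len(nwl_offense['last']) is the number of rows in the whole column, not the length of any one name, and nothing inside the loop changes it, so the while condition stays true forever. If the ids really did follow the pattern described above, a vectorized sketch like the one below would build the column without any loop; it is only an illustration, since real Baseball Reference ids do not always follow that pattern (short last names get padded, and the middle digits are not always 000), as the answer below points out.
# Illustrative sketch only: assumes every id is exactly
# first-6-of-last-name + '000' + first-3-of-first-name.
nwl_offense['player_id'] = (
    nwl_offense['last'].str.lower().str[:6]
    + '000'
    + nwl_offense['first'].str.lower().str[:3]
)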
This is what the first 5 rows of the dataframe, nwl_offense look like:
print(nwl_offense.head())
Rk Name Age G ... WRC+ WRC WSB OWins
0 1.0 Brian Baker 20.0 14.0 ... 733.107636 2.007068 0.099775 0.189913
1 2.0 Drew Beazley 21.0 46.0 ... 112.669541 29.920766 -0.456988 2.655892
2 3.0 Jarrett Bickel 21.0 33.0 ... 85.017293 15.245547 1.419822 1.502232
3 4.0 Nate Boyle 23.0 21.0 ... 1127.591556 1.543534 0.000000 0.139136
4 5.0 Seth Brewer* 22.0 12.0 ... 243.655365 1.667671 0.099775 0.159319

As stated in the comments, I wouldn't try to create a function to make the ids, as there will likely be some "quirky" ones in there that might not follow that logic.
Instead, just go through each letter-search page the site divides the register into, and get the id directly from each player's url.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/register/player.fcgi'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

player_register_search = {}
searchLinks = soup.find('div', {'id':'div_players'}).find_all('li')
for each in searchLinks:
    links = each.find_all('a', href=True)
    for link in links:
        print(link)
        player_register_search[link.text] = 'https://www.baseball-reference.com/' + link['href']

tot = len(player_register_search)
playerIds = {}
for count, (k, link) in enumerate(player_register_search.items(), start=1):
    print(f'{count} of {tot} - {link}')
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')

    kLower = k.lower()
    playerSection = soup.find('div', {'id':f'all_players_{kLower}'})
    h2 = playerSection.find('h2').text
    #print('\t',h2)
    player_links = playerSection.find_all('a', href=True)
    for player in player_links:
        playerName = player.text.strip()
        playerId = player['href'].split('id=')[-1].strip()

        if playerName not in playerIds.keys():
            playerIds[playerName] = []
        #print(f'\t{playerName}: {playerId}')
        playerIds[playerName].append(playerId)

df = pd.DataFrame({'Player' : list(playerIds.keys()),
                   'id': list(playerIds.values())})
Output:
print(df)
Player id
0 Scott A'Hara [ahara-000sco]
1 A'Heasy [ahease001---]
2 Al Aaberg [aaberg001alf]
3 Kirk Aadland [aadlan001kir]
4 Zach Aaker [aaker-000zac]
... ...
323628 Mike Zywica [zywica001mic]
323629 Joseph Zywiciel [zywici000jos]
323630 Bobby Zywicki [zywick000bob]
323631 Brandon Zywicki [zywick000bra]
323632 Nate Zyzda [zyzda-000nat]
[323633 rows x 2 columns]
TO GET JUST THE PLAYERS FROM YOUR DATAFRAME:
THIS IS JUST AN EXAMPLE OF YOUR DATAFRAME. DO NOT INCLUDE THIS IN YOUR CODE
# Sample of the dataframe
nwl_offense = pd.DataFrame({'first':['Evan', 'Kelby'],
                            'last':['Albrecht', 'Golladay']})
Use this:
# YOUR DATAFRAME - GET LIST OF NAMES
player_interest_list = list(nwl_offense['first'] + ' ' + nwl_offense['last'])
nwl_players = df.loc[df['Player'].isin(player_interest_list)]
Output:
print(nwl_players)
Player id
3095 Evan Albrecht [albrec001eva, albrec000eva]
108083 Kelby Golladay [gollad000kel]
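Since the id column holds lists (a player can map to more than one register id, as Evan Albrecht does above), one optional follow-up, if a flat one-row-per-id table is preferred, is to explode that column:
# One row per (player, id) pair instead of a list-valued id column.
nwl_players_flat = nwl_players.explode('id').reset_index(drop=True)
print(nwl_players_flat)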

Related

Selenium/Webscrape this field

My code runs fine and prints the title for all rows except the rows with dropdowns.
For example, row 4 has a dropdown if clicked. I implemented a try block which, in theory, should expand the dropdown and then pull the titles.
But when I execute click() and try to print, the rows with these dropdowns are not printed.
Expected output: print all titles, including the ones in the dropdowns.
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')

productlist = soup.find_all('div', class_='card item-container session')
for property in productlist:
    sessiontitle = property.find('h4', class_='session-title card-title').text
    print(sessiontitle)
    try:
        ifDropdown = driver.find_elements_by_class_name('item-expand-action expand')
        ifDropdown.click()
        time.sleep(4)
        newTitle = driver.find_element_by_class_name('card-title').text
        print(newTitle)
    except:
        newTitle = 'none'
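The immediate reason those rows print nothing is that find_elements_by_class_name (plural) returns a list, so calling .click() on it raises an AttributeError that the bare except silently swallows; compound class names such as 'item-expand-action expand' also do not work with the by-class-name locators. Purely as a hedged sketch of the Selenium route (the CSS selector is guessed from the class names in the snippet above and has not been checked against the live page), clicking each expander before re-parsing might look like this:
from selenium.webdriver.common.by import By

# The selector is an assumption based on the classes used in the question.
for button in driver.find_elements(By.CSS_SELECTOR, '.item-expand-action.expand'):
    button.click()   # expand one dropdown
    time.sleep(1)    # crude wait for the expanded content to render

soup = BeautifulSoup(driver.page_source, 'html.parser')
titles = [t.text.strip() for t in soup.select('.card-title')]
The answer below sidesteps Selenium entirely and instead requests each session's detail page directly: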
import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_soup(content):
    return BeautifulSoup(content, 'lxml')


def my_filter(req, content):
    try:
        r = req.get(content['href'])
        soup = get_soup(r.text)
        return [x.text for x in soup.select('.card-title')[1:]]
    except TypeError:
        return 'N/A'


def main(url):
    with requests.Session() as req:
        for page in range(1, 2):
            print(f"Extracting Page# {page}\n")
            params = {
                "p": page
            }
            r = req.get(url, params=params)
            soup = get_soup(r.text)
            goal = {x.select_one('.session-title').text: my_filter(
                req, x.select_one('.item-expand-action')) for x in soup.select('.card')}
            df = pd.DataFrame(goal.items(), columns=['Title', 'Menu'])
            print(df)


main('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
Output:
Title Menu
0 Educational sessions on-demand N/A
1 Special Symposia on-demand N/A
2 Multidisciplinary sessions on-demand N/A
3 Illumina - Diagnosing Non-Small Cell Lung Canc... [Illumina gives an update on their IVD road ma...
4 MSD - Homologous Recombination Deficiency: BRC... [Welcome and Introductions, Homologous Recombi...
5 Servier - The clinical value of IDH inhibition... [Isocitric dehydrogenase: an actionable geneti...
6 AstraZeneca - Redefining Breast Cancer – Biolo... [Welcome and Opening, Redefining Breast Cancer...
7 ITM Isotopen Technologien München AG - A Globa... [Welcome & Introduction, Changes in the Incide...
8 MSD - The Role of Biomarkers in Patient Manage... [Welcome and Introductions, The Role of Pd-L1 ...
9 AstraZeneca - Re-evaluating the role of gBRCA ... [Welcome and introduction, What do we know abo...
10 Novartis - Unmet needs in oncogene-driven NSCL... [Welcome and introduction, Unmet needs in onco...
11 Opening session N/A

Creating a table off of 247sports website

I am trying to create a pandas dataframe of the top 1000 recruits from the 2022 football recruiting class on the 247sports website, in a Google Colab notebook. This is the code I am using so far:
#Importing all necessary packages
import pandas as pd
import time
import datetime as dt
import os
import re
import requests
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import twofourseven
from bs4 import BeautifulSoup
from splinter import Browser
from kora.selenium import wd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import requests
from geopy.geocoders import Nominatim
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

year = '2022'
url = 'https://247sports.com/Season/' + str(year) + '-Football/CompositeRecruitRankings?InstitutionGroup=HighSchool'

# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
response = requests.get(url, headers = headers).content
soup = BeautifulSoup(response, "html.parser")

data = []
for tag in soup.find_all("li", class_="rankings-page__list-item"):  # `[1:]` Since the first result is a table header
    # meta = tag.find_all("span", class_="meta")
    rank = tag.find_next("div", class_="primary").text
    TwoFourSeven_rank = tag.find_next("div", class_="other").text
    name = tag.find_next("a", class_="rankings-page__name-link").text
    school = tag.find_next("span", class_="meta").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    data.append(
        {
            "Rank": rank,
            "247 Rank": TwoFourSeven_rank,
            "Name": name,
            "School": school,
            "Class of": year,
            "Position": position,
            "Height & Weight": height_weight,
            "Rating": rating,
            "National Rank": nat_rank,
            "State Rank": state_rank,
            "Position Rank": pos_rank,
            # "School": ???,
        }
    )
    print(rank)

df = pd.DataFrame(data)
data
Ideally, I would also like to grab the school name the recruit chose from the logo on the table, but I am not sure how to go about that. For example, I would like to print out "Florida State" for the school column from this "row" of data.
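For what it's worth, the committed school is exposed as the title attribute of the logo image inside the status div, which is how the answer further down picks it up; a minimal hedged sketch inside the existing loop (the 'status' class and the title attribute are assumptions taken from that answer, not verified independently) could be:
# Hypothetical sketch: class name and attribute come from the answer below.
status = tag.find_next("div", class_="status")
logo = status.find("img") if status else None
school_committed = logo["title"] if logo and logo.has_attr("title") else ""  # e.g. "Florida State"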
Along with that, I do get an output of printing ranks, but afterwards, I get the following error that won't allow me to collect and/or print out additional data:
AttributeError Traceback (most recent call last)
<ipython-input-11-56f4779601f8> in <module>()
16 # meta = tag.find_all("span", class_="meta")
17
---> 18 rank = tag.find_next("div", class_="primary").text
19 # TwoFourSeven_rank = tag.find_next("div", class_="other").text
20 name = tag.find_next("a", class_="rankings-page__name-link").text
AttributeError: 'NoneType' object has no attribute 'text'
Lastly, I do understand that this webpage only displays 50 recruits without having my python code click the "Load more" tab via selenium, but I am not 100% sure how to incorporate that in the most efficient and legible way possible. If anyone knows a good way to do all this, I'd greatly appreciate it. Thanks in advance.
Use try/except as some of the elements will not be present. Also no need to use Selenium. Simple requests will do.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://247sports.com/Season/2022-Football/CompositeRecruitRankings/?InstitutionGroup=HighSchool'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Mobile Safari/537.36'}

rows = []
page = 0
while True:
    page += 1
    print('Page: %s' %page)
    payload = {'Page': '%s' %page}
    response = requests.get(url, headers=headers, params=payload)
    soup = BeautifulSoup(response.text, 'html.parser')
    athletes = soup.find_all('li',{'class':'rankings-page__list-item'})
    if len(athletes) == 0:
        break

    continue_loop = True
    while continue_loop == True:
        for athlete in athletes:
            if athlete.text.strip() == 'Load More':
                continue_loop = False
                continue

            primary_rank = athlete.find('div',{'class':'rank-column'}).find('div',{'class':'primary'}).text.strip()
            try:
                other_rank = athlete.find('div',{'class':'rank-column'}).find('div',{'class':'other'}).text.strip()
            except:
                other_rank = ''
            name = athlete.find('div',{'class':'recruit'}).find('a').text.strip()
            link = 'https://247sports.com' + athlete.find('div',{'class':'recruit'}).find('a')['href']
            highschool = ' '.join([x.strip() for x in athlete.find('div',{'class':'recruit'}).find('span',{'class':'meta'}).text.strip().split('\n')])
            pos = athlete.find('div',{'class':'position'}).text.strip()
            ht = athlete.find('div',{'class':'metrics'}).text.split('/')[0].strip()
            wt = athlete.find('div',{'class':'metrics'}).text.split('/')[1].strip()
            rating = athlete.find('span',{'class':'score'}).text.strip()
            nat_rank = athlete.find('a',{'class':'natrank'}).text.strip()
            pos_rank = athlete.find('a',{'class':'posrank'}).text.strip()
            st_rank = athlete.find('a',{'class':'sttrank'}).text.strip()
            try:
                team = athlete.find('div',{'class':'status'}).find('img')['title']
            except:
                team = ''

            row = {'Primary Rank':primary_rank,
                   'Other Rank':other_rank,
                   'Name':name,
                   'Link':link,
                   'Highschool':highschool,
                   'Position':pos,
                   'Height':ht,
                   'weight':wt,
                   'Rating':rating,
                   'National Rank':nat_rank,
                   'Position Rank':pos_rank,
                   'State Rank':st_rank,
                   'Team':team}
            rows.append(row)

df = pd.DataFrame(rows)
Output: first 10 rows of 1321 rows -
print(df.head(10).to_string())
Primary Rank Other Rank Name Link Highschool Position Height weight Rating National Rank Position Rank State Rank Team
0 1 1 Quinn Ewers https://247sports.com/Player/Quinn-Ewers-45572600 Southlake Carroll (Southlake, TX) QB 6-3 206 1.0000 1 1 1 Ohio State
1 2 3 Travis Hunter https://247sports.com/Player/Travis-Hunter-46084728 Collins Hill (Suwanee, GA) CB 6-1 165 0.9993 2 1 1 Florida State
2 3 2 Walter Nolen https://247sports.com/Player/Walter-Nolen-46083769 St. Benedict at Auburndale (Cordova, TN) DL 6-4 300 0.9991 3 1 1
3 4 14 Domani Jackson https://247sports.com/Player/Domani-Jackson-46057101 Mater Dei (Santa Ana, CA) CB 6-1 185 0.9966 4 2 1 USC
4 5 10 Zach Rice https://247sports.com/Player/Zach-Rice-46086346 Liberty Christian Academy (Lynchburg, VA) OT 6-6 282 0.9951 5 1 1
5 6 4 Gabriel Brownlow-Dindy https://247sports.com/Player/Gabriel-Brownlow-Dindy-46084792 Lakeland (Lakeland, FL) DL 6-3 275 0.9946 6 2 1
6 7 5 Shemar Stewart https://247sports.com/Player/Shemar-Stewart-46080267 Monsignor Pace (Opa Locka, FL) DL 6-5 260 0.9946 7 3 2
7 8 20 Denver Harris https://247sports.com/Player/Denver-Harris-46081216 North Shore (Houston, TX) CB 6-1 180 0.9944 8 3 2
8 9 33 Travis Shaw https://247sports.com/Player/Travis-Shaw-46057330 Grimsley (Greensboro, NC) DL 6-5 310 0.9939 9 4 1
9 10 23 Devon Campbell https://247sports.com/Player/Devon-Campbell-46093947 Bowie (Arlington, TX) IOL 6-3 310 0.9937 10 1 3

Pandas to mark both if cell value is a substring of another

I have a column with both short and full forms of people's names, and I want to unify them when one name is a prefix of the other. E.g. for "James.J" and "James.Jones", I want to tag them both as "James.J".
data = {'Name': ["Amelia.Smith",
                 "Lucas.M",
                 "James.J",
                 "Elijah.Brown",
                 "Amelia.S",
                 "James.Jones",
                 "Benjamin.Johnson"]}
df = pd.DataFrame(data)
I can't figure out how to do it in pandas, so for now I only have an xlrd way, using a similarity ratio from SequenceMatcher (and sorting manually in Excel):
import xlrd
from xlrd import open_workbook, cellname
import xlwt
from xlutils.copy import copy

workbook = xlrd.open_workbook("C:\\TEM\\input.xlsx")
old_sheet = workbook.sheet_by_name("Sheet1")

from difflib import SequenceMatcher

wb = copy(workbook)
sheet = wb.get_sheet(0)

for row_index in range(0, old_sheet.nrows):
    current = old_sheet.cell(row_index, 0).value
    previous = old_sheet.cell(row_index-1, 0).value
    sro = SequenceMatcher(None, current.lower(), previous.lower(), autojunk=True).ratio()
    if sro > 0.7:
        sheet.write(row_index, 1, previous)
        sheet.write(row_index-1, 1, previous)

wb.save("C:\\TEM\\output.xls")
What's the nice pandas way to do it? Thank you.
Using pandas, we can make use of str.split and .map with some boolean conditions to identify the dupes.
df1 = df['Name'].str.split('.', expand=True).rename(columns={0: 'FName', 1: 'LName'})

df2 = df1.loc[df1['FName'].duplicated(keep=False)]\
         .assign(ky=df['Name'].str.len())\
         .sort_values('ky')\
         .drop_duplicates(subset=['FName'], keep='first').drop(columns='ky')

df['NewName'] = df1['FName'].map(df2.assign(newName=df2.agg('.'.join, axis=1))
                                    .set_index('FName')['newName'])
print(df)
Name NewName
0 Amelia.Smith Amelia.S
1 Lucas.M NaN
2 James.J James.J
3 Elijah.Brown NaN
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson NaN
Here is an example of using apply with a custom function. For small dfs this should be fine; this will not scale well for large dfs. A more sophisticated data structure for memo would be an ok place to start to improve performance without degrading readability too much:
df = df.sort_values("Name")

def short_name(row, col="Name", memo=[]):
    name = row[col]
    for m_name in memo:
        if name.startswith(m_name):
            return m_name
    memo.append(name)
    return name

df["short_name"] = df.apply(short_name, axis=1)
df = df.sort_index()
Output:
Name short_name
0 Amelia.Smith Amelia.S
1 Lucas.M Lucas.M
2 James.J James.J
3 Elijah.Brown Elijah.Brown
4 Amelia.S Amelia.S
5 James.Jones James.J
6 Benjamin.Johnson Benjamin.Johnson
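One caveat: memo=[] is a mutable default argument, so the cache persists across calls within the same session, and re-running the apply would reuse names remembered from a previous run. Since apply forwards extra keyword arguments to the function, passing a fresh list each time avoids that:
# Pass a fresh memo so names cached by a previous run don't leak in.
df["short_name"] = df.apply(short_name, axis=1, memo=[])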

'int' object has no attribute 'replace' error in python3.x

I don't get why this error occurs. From my point of view the three columns 'WWBO', 'IBO' and 'DBO' have exactly the same structure, but when I apply 'replace' only WWBO works. Does it have something to do with fillna?
Need your help!
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

#Read url
URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"
data = requests.get(URL).text
#parse url
soup = bs(data, "html.parser")
#find the tables you want
table = soup.findAll("table")[1:]
#read it into pandas
df = pd.read_html(str(table))
#concat both the tables
df = pd.concat([df[0],df[1]])
df = df.rename(columns={'Rank':'Rank',
'Movie':'Title',
'Worldwide Box Office':'WWBO',
'Domestic Box Office':'DBO',
'International Box Office':'IBO',
'DomesticShare':'Share'})
#drop columns
market = df.drop(columns=['Rank','Share'])
market = market.fillna(0)
#replace $ -> ''
market['WWBO'] = market['WWBO'].map(lambda s: s.replace('$',''))
market['IBO'] = market['IBO'].map(lambda s: s.replace('$',''))
market['DBO'] = market['DBO'].map(lambda s: s.replace('$',''))
market
The error is:
AttributeError: 'int' object has no attribute 'replace'
This happens because fillna(0) fills the missing cells with the integer 0, so the columns that had missing values end up holding a mix of strings and ints (which is why only 'WWBO' works). The solutions are either to avoid filling with an integer or to cast the columns to string, as below:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
#Read url
URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"
data = requests.get(URL).text
#parse url
soup = bs(data, "html.parser")
#find the tables you want
table = soup.findAll("table")[1:]
#read it into pandas
df = pd.read_html(str(table))
#concat both the tables
df = pd.concat([df[0],df[1]])
df = df.rename(columns={'Rank':'Rank',
'Movie':'Title',
'Worldwide Box Office':'WWBO',
'Domestic Box Office':'DBO',
'International Box Office':'IBO',
'DomesticShare':'Share'})
#drop columns
market = df.drop(columns=['Rank','Share'])
market = market.fillna(0)
#replace $ -> ''
market['WWBO'] = market['WWBO'].map(lambda s: s.replace('$',''))
market['IBO']=market['IBO'].astype(str)
market['IBO'] = market['IBO'].map(lambda s: s.replace('$',''))
market['DBO']=market['DBO'].astype(str)
market['DBO'] = market['DBO'].map(lambda s: s.replace('$',''))
>> market[['WWBO','IBO','DBO']]
WWBO IBO DBO
0 2,622,240,021 1,842,814,023 779,425,998
1 1,121,905,659 696,535,598 425,370,061
2 692,163,684 692,163,684 0
3 518,883,574 358,491,094 160,392,480
4 402,976,036 317,265,826 85,710,210
5 358,234,705 220,034,625 138,200,080
6 342,904,508 231,276,537 111,627,971
7 326,150,303 326,150,303 0
8 293,766,097 192,548,368 101,217,729
9 255,832,826 255,832,826 0
10 253,940,650 79,203,380 174,737,270
11 245,303,505 134,268,500 111,035,005
12 190,454,964 84,648,456 105,806,508
13 155,313,390 98,312,634 57,000,756
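The other option mentioned above, avoiding the integer fill in the first place, is a one-line change; a hedged sketch on the same dataframe:
# Fill missing cells with the string '0' so every value stays a str and
# the '$' stripping works on all three columns without any casting.
market = market.fillna('0')
for col in ['WWBO', 'IBO', 'DBO']:
    market[col] = market[col].str.replace('$', '', regex=False)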
Clearly one or more of these fields (market['WWBO'], market['IBO'], market['DBO']) have integer values, and you are trying to perform a string operation, i.e. replace, on them; that is why it is throwing the error
AttributeError: 'int' object has no attribute 'replace'
You could first print those values and see what they are, or, if there are many, it is better to perform a type check first, like:
if market['WWBO'].dtype == object:
    market['WWBO'] = market['WWBO'].map(lambda s: s.replace('$',''))
else:
    pass
let me know if this works for you or not

Adding multiple dictionaries into a single Dataframe pandas

I have a set of Python dictionaries that I have obtained by means of a for loop. I am trying to have these added to a pandas DataFrame.
Output for a variable called output:
{'name':'Kevin','age':21}
{'name':'Steve','age':31}
{'name':'Mark','age':11}
I am trying to append each of these dictionaries into a single DataFrame. I tried to perform the below, but it just added the first row.
df = pd.DataFrame(output)
Could anyone advise where I am going wrong and how to have all the dictionaries added to the DataFrame?
Update on the loop statement
The below code reads xml and converts it to a dataframe. Right now I am able to loop through multiple xml files and create a dictionary for each xml file. I am trying to see how I could add each of these dictionaries to a single DataFrame:
def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
    return result

result = {}
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, result)
    print(result)
You can append each dictionary to a list and then call the DataFrame constructor at the end:
out = []
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, result)
    out.append(result)

df = pd.DataFrame(out)
We can add these dicts to a list:
ds = []
for ...:        # your loop
    ds += [d]   # where d is one of the dicts
When we have the list of dicts, we can simply use pd.DataFrame on that list:
ds = [
    {'name':'Kevin','age':21},
    {'name':'Steve','age':31},
    {'name':'Mark','age':11}
]
pd.DataFrame(ds)
Output:
name age
0 Kevin 21
1 Steve 31
2 Mark 11
Update:
And it's not a problem if different dicts have different keys, e.g.:
ds = [
    {'name':'Kevin','age':21},
    {'name':'Steve','age':31,'location': 'NY'},
    {'name':'Mark','age':11,'favorite_food': 'pizza'}
]
pd.DataFrame(ds)
Output:
age favorite_food location name
0 21 NaN NaN Kevin
1 31 NaN NY Steve
2 11 pizza NaN Mark
Update 2:
Building on our previous discussion in Python - Converting xml to csv using Python pandas, we can do:
results = []
for file in glob.glob('*.xml'):
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, {})
    result['filename'] = file  # added filename to our results
    results += [result]

pd.DataFrame(results)