NBA combine web parsing: "'NoneType' object has no attribute 'find_all'" - BeautifulSoup

Well, I wanted the table of players and their info, but I ran into a problem that I don't understand. The problem is on the line trs = table.find_all('tr'):
from bs4 import BeautifulSoup
import requests
url = 'https://www.nbadraft.net/2019-nba-draft-combine-measurements'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'html.parser')
table = soup.find('div', attrs={'class':'content'}).tbody
trs = table.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    row = [td.text for td in tds]
---EXPECTED---
[['player','info','info',etc]]
---ACTUAL---
trs = table.find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find_all'

The problem is that you are selecting the first <tbody>, which contains only the header; all the <tr> rows are in the next <tbody>. The script below selects all rows and prints them.
from bs4 import BeautifulSoup
import requests
url = 'https://www.nbadraft.net/2019-nba-draft-combine-measurements'
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'lxml')
rows = []
for tr in soup.select('.content table tr'):
    tds = tr.find_all('td')
    rows.append([td.text.strip() for td in tds])
for row in rows:
    print(''.join('{: <22}'.format(t) for t in row))
Prints:
Player Height w/o Shoes Height w/ Shoes Weight Wingspan Standing Reach Body Fat % Hand Length Hand Width
Nickeil Alexander Walker SG 6' 4.25'' 6' 5.5'' 203.8 6' 9.5'' 8' 6'' 5.90% 8.50 8.75
RJ Barrett SF - - - - - - - -
Charles Bassey C 6' 8.75'' 6' 10'' 239.0 7' 3.5'' 9' 1.5'' 8.50% 9.25 9.50
Darius Bazley PF 6' 7.75'' 6' 9'' 208.4 7' 0'' 8' 11'' 3.60% 9.00 9.75
Bol Bol C 7' 0.75'' 7' 2.5'' 208.0 7' 7'' 9' 7.5'' 7.10% 9.25 9.50
Jordan Bone SG 6' 1.5'' 6' 2.75'' 179.0 6' 3.25'' 7' 11'' 5.00% 7.50 9.25
Brian Bowen II SF 6' 6.25'' 6' 7.5'' 200.0 6' 10'' 8' 7'' 6.50% 8.50 9.75
Ky Bowman PG 6' 1'' 6' 2.25'' 181.2 6' 7'' 8' 2'' 4.90% 8.25 9.00
Ignas Brazdeikis SF 6' 5.75'' 6' 7.25'' 220.8 6' 9.25'' 8' 6'' 6.00% 8.75 9.50
Oshae Brissett SF-PF 6' 7'' 6' 8'' 203.2 7' 0'' 8' 8'' 2.90% 9.00 9.50
Moses Brown C 7' 1.25'' 7' 2.5'' 237.2 7' 4.75'' 9' 5'' 7.80% 9.50 10.25
Brandon Clarke PF/C 6' 7.25'' 6' 8.25'' 207.2 6' 8.25'' 8' 6'' 4.90% 8.25 9.50
Nicolas Claxton C 6' 10'' 6' 11.75'' 216.6 7' 2.5'' 9' 2'' 4.50% 9.25 9.50
... and so on.
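As an aside, pandas can often grab such tables in one call; a minimal sketch, assuming lxml (or html5lib) is installed and that the measurements table is the first <table> on the page:
import pandas as pd
import requests

url = 'https://www.nbadraft.net/2019-nba-draft-combine-measurements'
resp = requests.get(url)
tables = pd.read_html(resp.text)  # one DataFrame per <table> in the HTML
print(tables[0].head())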

Related

How to convert years to intervals in Pandas

I have the data in the following format:
(screenshot omitted; the dataframe has Year, Company and Revenue (in millions) columns)
I want to convert the year column to intervals (decades), so that I have a decade column in the format 1950-1959, 1960-1969 and so on (without removing the company name). Then I can find the companies with the highest revenue for each decade and plot the top 5 companies along with their revenues (for all intervals) using seaborn.
I tried the following script.
df_Fortune.groupby(['Year', 'Company']).sum().sort_values(['Year', 'Revenue (in millions)'], ascending=[1, 0])
The result is a multi-index (I guess) and I don't know how to convert Year into decades.
Create a sample dataframe for the demo (this answer uses 5-year bins; the same approach works for decades with 10-year bins):
import pandas as pd
import numpy as np
# create a dataframe with 100 rows, with year drawn randomly from 1950-2019
df = pd.DataFrame({'year': np.random.randint(1950, 2020, 100)})
df['revenue'] = np.random.randint(1000, 10000, 100)
df.sort_values(by='year', inplace=True)
df.reset_index(drop=True, inplace=True)
df['year_interval'] = pd.cut(df['year'], bins=range(1950, 2025, 5), labels=range(1950, 2020, 5), include_lowest=True)
df['year_interval'] = df['year_interval'].astype(str) + '-' + (df['year_interval'].astype(int) + 4).astype(str)
df['company'] =['Walmart', 'Amazon', 'Apple', 'CVS Health', 'UnitedHealth Group', 'Exxon Mobil', 'Berkshire Hathaway', 'Alphabet', 'McKesson', 'AmerisourceBergen', 'Costco Wholesale', 'Cigna', 'AT&T', 'Microsoft', 'Cardinal Health', 'Chevron', 'Home Depot', 'Walgreens Boots Alliance', 'Marathon Petroleum', 'Elevance Health', 'Kroger', 'Ford Motor', 'Verizon Communications', 'JPMorgan Chase', 'General Motors', 'Centene', 'Meta Platforms', 'Comcast', 'Phillips 66', 'Valero Energy', 'Dell Technologies', 'Target', 'Fannie Mae', 'United Parcel Service', 'Lowe\'s', 'Bank of America', 'Johnson & Johnson', 'Archer Daniels Midland', 'FedEx', 'Humana', 'Wells Fargo', 'State Farm Insurance', 'Pfizer', 'Citigroup', 'PepsiCo', 'Intel', 'Procter & Gamble', 'General Electric', 'IBM', 'MetLife', 'Prudential Financial', 'Albertsons', 'Walt Disney', 'Energy Transfer', 'Lockheed Martin', 'Freddie Mac', 'Goldman Sachs Group', 'Raytheon Technologies', 'HP', 'Boeing', 'Morgan Stanley', 'HCA Healthcare', 'AbbVie', 'Dow', 'Tesla', 'Allstate', 'American International Group', 'Best Buy', 'Charter Communications', 'Sysco', 'Merck', 'New York Life Insurance', 'Caterpillar', 'Cisco Systems', 'TJX', 'Publix Super Markets', 'ConocoPhillips', 'Liberty Mutual Insurance Group', 'Progressive', 'Nationwide', 'Tyson Foods', 'Bristol-Myers Squibb', 'Nike', 'Deere', 'American Express', 'Abbott Laboratories', 'StoneX Group', 'Plains GP Holdings', 'Enterprise Products Partners', 'TIAA', 'Oracle', 'Thermo Fisher Scientific', 'Coca-Cola', 'General Dynamics', 'CHS', 'USAA', 'Northwestern Mutual', 'Nucor', 'Exelon', 'Massachusetts Mutual Life Insurance']
df
###
year revenue year_interval company
0 1951 8951 1950-1954 Walmart
1 1954 7270 1950-1954 Amazon
2 1955 7148 1950-1954 Apple
3 1955 5661 1950-1954 CVS Health
4 1955 5179 1950-1954 UnitedHealth Group
.. ... ... ... ...
95 2016 4945 2015-2019 USAA
96 2016 6860 2015-2019 Northwestern Mutual
97 2017 6535 2015-2019 Nucor
98 2018 6235 2015-2019 Exelon
99 2019 8624 2015-2019 Massachusetts Mutual Life Insurance
[100 rows x 4 columns]
Find the company with the maximum revenue in each year_interval:
df_max = df.groupby('year_interval')['revenue'].max().reset_index()
df_result = df_max.merge(df, on=['year_interval', 'revenue'], how='left')
df_result
###
year_interval revenue year company
0 1950-1954 8951 1951 Walmart
1 1955-1959 8891 1959 McKesson
2 1960-1964 9643 1962 Cigna
3 1965-1969 9723 1970 Elevance Health
4 1970-1974 9396 1973 General Motors
5 1975-1979 7048 1978 Comcast
6 1980-1984 9776 1982 United Parcel Service
7 1985-1989 9216 1986 State Farm Insurance
8 1990-1994 8788 1994 Morgan Stanley
9 1995-1999 7339 1997 Best Buy
10 2000-2004 9750 2003 Liberty Mutual Insurance Group
11 2005-2009 9986 2008 Deere
12 2010-2014 9438 2014 Coca-Cola
13 2015-2019 8624 2019 Massachusetts Mutual Life Insurance
Plot:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
plt.gcf().set_size_inches(15, 6)
ax = sns.barplot(x="year_interval", y="revenue", hue="company", data=df_result, dodge=False)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.legend(bbox_to_anchor=(1.15, 1), loc=2, borderaxespad=0.)
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                xytext=(0, 10), textcoords='offset points')
plt.tight_layout()
plt.show()
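The question also asks for the top 5 companies per interval; a minimal sketch on the same df (sort by revenue, then keep the first 5 rows of each interval):
df_top5 = (df.sort_values('revenue', ascending=False)
             .groupby('year_interval')
             .head(5))
df_top5 can then be plotted the same way as df_result above.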
You can also save the extrema of the decade intervals separately:
df['decade1'] = df['year'] - df['year'] % 10
df['decade2'] = df['year'] + (10 - df['year'] % 10)
Also, you can save it as a string in the format you wanted (the upper bound is decade1 + 9, giving labels like 1950-1959):
df['decade'] = df['decade1'].astype(str).str.cat((df['decade1'] + 9).astype(str), sep='-')
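A compact one-step equivalent, as a sketch using floor division:
df['decade'] = (df['year'] // 10 * 10).map(lambda d: f'{d}-{d + 9}')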
Beyond this, you might find other ways of handling time series data in pandas that I am unaware of.

Unpivot multiple variables in a python pandas dataframe

I've got a big dataframe that shows the amount and the cost of each product. I want to transform (unpivot) it into a long dataframe with each product name as an ID and its amounts and costs in two separate columns. I've tried pd.melt and reshaping, but neither seems to work.
Here is an example of what I am trying to do.
Here is my table
df = pd.DataFrame({'Year': ['year1', 'year2', 'year3'],
                   'A': [200, 300, 400],
                   'B': [500, 600, 300],
                   'C': [450, 369, 235],
                   'A cost': [7000, 4000, 7000],
                   'B cost': [9000, 4000, 6000],
                   'C cost': [1100, 4300, 2320],
                   })
print(df)
(screenshots of the current and desired dataframes omitted; the desired long shape matches the output shown below)
Create a MultiIndex by splitting the column names and setting Year as the index, then replace the missing second-level values with 'amount' and reshape with DataFrame.stack:
df = pd.DataFrame({'Year': 'year1,year2,year3'.split(','),
                   'A': [200, 300, 400],
                   'B': [500, 600, 300],
                   'C': [450, 369, 235],
                   'A cost': [7000, 4000, 7000],
                   'B cost': [9000, 4000, 6000],
                   'C cost': [1100, 4300, 2320],
                   })
print(df)
df1 = df.set_index('Year')
# split 'A cost' -> ('A', 'cost'); a bare 'A' becomes ('A', NaN)
df1.columns = df1.columns.str.split(expand=True)
# name the missing second level 'amount'
f = lambda x: 'amount' if pd.isna(x) else x
df1 = df1.rename(columns=f).stack(0).rename_axis(['Year', 'product']).reset_index()
print(df1)
Year product amount cost
0 year1 A 200 7000
1 year1 B 500 9000
2 year1 C 450 1100
3 year2 A 300 4000
4 year2 B 600 4000
5 year2 C 369 4300
6 year3 A 400 7000
7 year3 B 300 6000
8 year3 C 235 2320
This can be achieved quite simply with janitor's pivot_longer:
# pip install pyjanitor
import janitor

out = (df
       .pivot_longer(index='Year', names_to=('product', '.value'),
                     names_pattern=r'(\S+)\s*(\S*)')
       .rename(columns={'': 'amount'})
       )
Output:
Year product amount cost
0 year1 A 200 7000
1 year2 A 300 4000
2 year3 A 400 7000
3 year1 B 500 9000
4 year2 B 600 4000
5 year3 B 300 6000
6 year1 C 450 1100
7 year2 C 369 4300
8 year3 C 235 2320
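Since the question mentions trying pd.melt, here is a plain-pandas sketch along those lines: melt everything, split the variable names, then pivot back.
long_df = df.melt(id_vars='Year', var_name='col', value_name='value')
# 'A cost' -> product 'A', measure 'cost'; a bare 'A' -> measure 'amount'
parts = long_df['col'].str.split(expand=True)
long_df['product'] = parts[0]
long_df['measure'] = parts[1].fillna('amount')
out = (long_df.pivot_table(index=['Year', 'product'], columns='measure',
                           values='value', aggfunc='first')
              .reset_index()
              .rename_axis(columns=None))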

Adding a ranked column [duplicate]

This question already has answers here:
Add a sequential counter column on groups to a pandas dataframe (4 answers). Closed 12 months ago.
Given this input:
query recommend
0 orange strawberry
1 orange pear
2 orange lemon
3 apple nothing
4 meat beer
5 meat juice
How do I add a rank column with respect to query, starting from 1? The expected output is:
query recommend rank
0 orange strawberry 1
1 orange pear 2
2 orange lemon 3
3 apple nothing 1
4 meat beer 1
5 meat juice 2
Here's the code for the input:
df_output = pd.DataFrame({
    'query': {0: 'orange', 1: 'orange', 2: 'orange', 3: 'apple', 4: 'meat', 5: 'meat'},
    'recommend': {0: 'strawberry', 1: 'pear', 2: 'lemon', 3: 'nothing', 4: 'beer', 5: 'juice'},
})
Use:
df = pd.DataFrame({'query': ['s', 'o', 'o', 'o', 'a'],
                   'recommend': ['f', 's', 'p', 'l', 't']})
# for each query, collect the original row indices and a 1-based rank
s = df.groupby('query').agg({'query': [lambda x: list(x.index),
                                       lambda x: range(1, len(x) + 1)]})
s = s.explode([s.columns[0], s.columns[1]])
s.columns = s.columns.droplevel(0)
s = s.reset_index()
# merge the ranks back onto the original rows via their index
df['temp'] = df.index
cols = df.columns.to_list()
cols.append('<lambda_1>')
cols.remove('temp')
cols[0] = 'query_x'
s.merge(df, left_on='<lambda_0>', right_on='temp')[cols]
Output:
  query_x recommend <lambda_1>
0       a         t          1
1       o         s          1
2       o         p          2
3       o         l          3
4       s         f          1
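For reference, the linked duplicate suggests a much shorter route via groupby().cumcount(), which works directly on the question's df_output:
df_output['rank'] = df_output.groupby('query').cumcount() + 1
This numbers the rows 1, 2, 3, ... within each query group, in their original order.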

Pandas groupby with no function and display each unique value only once

I'm not sure whether Pandas can do this or not.
Given a dataframe that looks like this:
I want to group by Id and then list sorted by descending order on Score, like this:
Id Name Score Index
-------------------------------------------
123 John Smith 1.0 AM
Johnny Smith 0.92 PP
345 John Smith 1.0 WL
789 John Smith 1.0 WL
Jonathan Smith 0.91 PP
011 Jon Smithson 0.80 AM
012 Jon Smythe 0.77 WL
One of the main requirements here is that each distinct Id is displayed only once. If there is a way to accomplish this in a format that is not exactly like the above but conveys the same information, that's OK. Note that there is no function (mean, sum, etc.) to be applied to any column.
Here is the code to reproduce the Dataframe:
import pandas as pd
arrays = [['123', '345', '789', '123', '789', '011', '012'],
          ['John Smith', 'John Smith', 'John Smith', 'Johnny Smith', 'Jonathan Smith', 'Jon Smithson', 'John Smythe'],
          [1.0, 1.0, 1.0, 0.92, 0.91, 0.80, 0.77],
          ['AM', 'WL', 'WL', 'PP', 'PP', 'AM', 'WL']]
index = pd.MultiIndex.from_arrays(arrays, names=('Id', 'Name', 'Score', 'Index'))
df = pd.DataFrame(index=index)
df
You could do this:
# the reproduce code puts everything into the index, so flatten it first
df = df.reset_index().sort_values('Id').set_index(['Id', 'Name'])
>>> print(df)
Score Index
Id Name
11 Jon-Smithson 0.80 AM
12 Jon-Smythe 0.77 WL
123 John-Smith 1.00 AM
Johnny-Smith 0.92 PP
345 John-Smith 1.00 WL
789 Jonathan-Smith 0.91 PP
John-Smith 1.00 WL
Or this:
df = df.reset_index().sort_values('Id').set_index('Id')
>>> print(df)
Name Score Index
Id
11 Jon-Smithson 0.80 AM
12 Jon-Smythe 0.77 WL
123 John-Smith 1.00 AM
123 Johnny-Smith 0.92 PP
345 John-Smith 1.00 WL
789 Jonathan-Smith 0.91 PP
789 John-Smith 1.00 WL
If the Id column is numeric (of dtype int) and you want it to be zero-padded, you can do the following, and then use one of the above solutions, substituting new_df for df:
df = df.reset_index()  # again, make Id a regular column first
i = df['Id'].astype(str)
i = i.str.rjust(i.str.len().max(), '0')
new_df = df.copy()
new_df['Id'] = i
e.g.
>>> new_df = new_df.sort_values('Id').set_index(['Id', 'Name'])
>>> print(new_df)
Score Index
Id Name
011 Jon-Smithson 0.80 AM
012 Jon-Smythe 0.77 WL
123 John-Smith 1.00 AM
Johnny-Smith 0.92 PP
345 John-Smith 1.00 WL
789 Jonathan-Smith 0.91 PP
John-Smith 1.00 WL
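Note that the desired output is ordered by Score descending rather than by Id; a sketch for that ordering (on the flattened df, ordering Ids by their best Score, then Scores descending within each Id):
df2 = df.reset_index()
df2['best'] = df2.groupby('Id')['Score'].transform('max')
out = (df2.sort_values(['best', 'Id', 'Score'], ascending=[False, True, False])
          .drop(columns='best')
          .set_index(['Id', 'Name']))
print(out)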

Dataframe count items per basket

df = pd.DataFrame({
    'Order Date': ['1-1-21', '1-1-21', '1-1-21', '1-1-21', '1-1-21', '1-2-21', '1-3-21', '1-3-21', '1-3-21'],
    'Invoice No': ['A1', 'A1', 'A2', 'A2', 'A2', 'B3', 'C1', 'C1', 'C2'],
    'Product': ['eggs', 'ice', 'candy', 'toy', 'paper', 'book', 'computer', 'mouse', 'shoe'],
    'Warehouse': ['LA', 'LA', 'LA', 'LA', 'LA', 'NY', 'NY', 'NY', 'LA']
})
Hi all, I would like to group the items by date and count the items per basket (invoice). On Jan 1 I sold 5 (2+3) items in LA, an average of 2.5 per basket; on Jan 2, 1 item per basket in NY; and on Jan 3, 2 items per basket in NY and 1 in LA.
Desired outcome:
LA NY
Jan 1 2.5 0
Jan 2 0 1
Jan 3 1 2
I have tried df.groupby(['Order Date', 'Warehouse']).count().unstack(), but I want something like nunique().
Thanks all.
# average items per basket = line-item count / distinct invoice count
g = df.groupby(["Order Date", "Warehouse"])["Invoice No"]
df_res = (
    (g.count() / g.nunique())
    .reset_index()
    .pivot(index="Order Date", columns="Warehouse")
    .fillna(0)
)
print(df_res)
Prints:
Invoice No
Warehouse LA NY
Order Date
2021-01-01 2.5 0.0
2021-01-02 0.0 1.0
2021-01-03 1.0 2.0
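The same result can be computed in one step with pivot_table; a sketch where the lambda divides the line-item count by the distinct invoice count in each cell:
df_res2 = df.pivot_table(index='Order Date', columns='Warehouse',
                         values='Invoice No',
                         aggfunc=lambda s: s.size / s.nunique(),
                         fill_value=0)
print(df_res2)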