Empty Dataframe after being populated from URL - pandas

html_data = requests.get('https://www.macrotrends.net/stocks/charts/GME/gamestop/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')
all_tables = soup.find_all('table', attrs={'class': 'historical_data_table table'})
gme_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for table in all_tables:
if table.find('th').getText().startswith("Gamestop Quarterly Revenue"):
for row in table.find_all("tr"):
col = row.find_all("td")
if len(col) == 2:
date = col[0].text
revenue = col[1].text.replace('$', '').replace(',', '')
gme_revenue = gme_revenue.append({"Date": date, "Revenue": revenue}, ignore_index=True)
however, when I try to make a table, it comes up empty as
Empty DataFrame
Columns: [Date, Revenue]
Index: []
and after I do a test, this appears:
gme_revenue.empty
>>>True
unsure on why my data frame is empty. I've even copied the code from another data frame and it still doesn't work.
Help is appreciated.

Change
if table.find('th').getText().startswith("Gamestop Quarterly Revenue"):
to
if 'Quarterly' in table.find('th').text:
and it should work
Output:
Date Revenue
0 2020-10-31 1005
1 2020-07-31 942
2 2020-04-30 1021
3 2020-01-31 2194
4 2019-10-31 1439
... ... ...
59 2006-01-31 1667
60 2005-10-31 534
61 2005-07-31 416
62 2005-04-30 475
63 2005-01-31 709
64 rows × 2 columns

Related

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that seperates all values within the group based on the quantile percentage. So if one group has q=0.8, I want the lowest 80% values given 1, and the upper 20% values given 0.
So, given the data like this:
I want object 1, 2 and 5 to get result 1 and the other 3 result 0. In total my data consists of 7.000.000 rows with 14.000 groups. I tried doing this with groupby.quantile but therefore I need a constant quantile measure, whereas my data has a different one for each group.
Setup:
num = 7_000_000
grp_num = 14_000
qua = np.around(np.random.uniform(size=grp_num), 2)
df = pd.DataFrame({
"Group": np.random.randint(low=0, high=grp_num, size=num),
"Quantile": 0.0,
"Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)
def func(grp):
grp["Quantile"] = qua[grp.Group]
return grp
df = df.groupby("Group").apply(func)
Answer: (This is basically a for loop, so for performance you can try to apply numba to this)
def func2(grp):
return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])
df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s

pandas dataframe how to shift rows based on date

I am trying to assess the impact of a promotional campaign on our customers. The goal is to assess revenue from the point the promotion was offered. However promotion was offered for different customers at different points. How do I rearrange the data to Month 0, Month 1, Month 2, Month 3. Month 0 being the month the customer first got the promotion.
With below self explanatory code you can get your desired output:
# Create DataFrame
import pandas as pd
df = pd.DataFrame({"Account":[1,2,3,4,5,6],\
"May-18":[181,166,221,158,210,159],\
"Jun-18":[178,222,230,189,219,200],\
"Jul-18":[184,207,175,167,201,204],\
"Aug-18":[161,174,178,233,223,204],\
"Sep-18":[218,209,165,165,204,225],\
"Oct-18":[199,206,205,196,212,205],\
"Nov-18":[231,196,189,218,234,235],\
"Dec-18":[173,178,189,218,234,205],\
"Promotion Month":["Sep-18","Aug-18","Jul-18","May-18","Aug-18","Jun-18"]})
df = df.set_index("Account")
cols = ["May-18","Jun-18","Jul-18","Aug-18","Sep-18","Oct-18","Nov-18","Dec-18","Promotion Month"]
df = df[cols]
# Define function to select the four months after promotion
def selectMonths(row):
cols = df.columns.to_list()
colMonth0 = cols.index(row["Promotion Month"])
colsOut = cols[colMonth0:colMonth0+4]
out = pd.Series(row[colsOut].to_list())
return out
# Apply the function and set the index and columns of output DataFrame
out = df.apply(selectMonths, axis=1)
out.index = df.index
out.columns=["Month 0","Month 1","Month 2","Month 3"]
Then the output you get is:
>>> out
Month 0 Month 1 Month 2 Month 3
Account
1 218 199 231 173
2 174 209 206 196
3 175 178 165 205
4 158 189 167 233
5 223 204 212 234
6 200 204 204 225

Summing columns and rows

How do I add up rows and columns.
The last column Sum needs to be the sum of the rows R0+R1+R2.
The last row needs to be the sum of these columns.
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
The result:
Type R0 R1 R2 Sum
0 AP 16 20 78 NaN
1 AP+ 10 14 55 NaN
2 SP 32 26 90 NaN
3 Total 0 0 0 NaN
Let us try .iloc position selection
df.iloc[-1,1:]=df.iloc[:-1,1:].sum()
df['Sum']=df.iloc[:,1:].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
In general it may be better practice to specify column names:
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
# List columns
cols_to_sum=['R0', 'R1', 'R2']
# Access last row and sum columns-wise
df.loc[df.index[-1], cols_to_sum] = df[cols_to_sum].sum(axis=0)
# Create 'Sum' column summing row-wise
df['Sum']=df[cols_to_sum].sum(axis=1)
df
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341

Column names after transposing a dataframe

I have a small dataframe - six rows (not counting the header) and 53 columns (a store name, and the rest weekly sales for the past year). Each row contains a particular store and each column the store's name and sales for each week. I need to transpose the data so that the weeks appear as rows, the stores appear as columns, and their sales appear as the rows.
To generate the input data:
df_store = pd.read_excel(SourcePath+SourceFile, sheet_name='StoreSales', header=0, usecols=['StoreName'])
# Number rows of all irrelevant stores.
row_numbers = [x+1 for x in df_stores[(df_store['StoreName'] != 'Store1') & (df_store['StoreName'] != 'Store2')
& (df_store['StoreName'] !='Store3')].index]
# Read in entire Excel file, skipping the rows of irrelevant stores.
df_store = pd.read_excel(SourcePath+SourceFile, sheet_name='StoreSales', header=0, usecols = "A:BE",
skiprows = row_numbers, converters = {'StoreName' : str})
# Transpose dataframe
df_store_t = df_store.transpose()
My output puts index numbers above each store name ( 0 to 5), and then each column starts out as StoreName (above the week), then each store name. Yet, I cannot manipulate them by their names.
Is there a way to clear those index numbers so that I can work directly with the resulting column names (e.g., rename "StoreName" to "WeekEnding" and make reference to each store columns ("Store1", "Store2", etc.?)
IIUC, you need to set_index first, then transpose, T:
See this example:
df = pd.DataFrame({'Store':[*'ABCDE'],
'Week 1':np.random.randint(50,200, 5),
'Week 2':np.random.randint(50,200, 5),
'Week 3':np.random.randint(50,200, 5)})
Input Dataframe:
Store Week 1 Week 2 Week 3
0 A 99 163 148
1 B 119 86 92
2 C 145 98 162
3 D 144 143 199
4 E 50 181 177
Now, set_index and transpose:
df_out = df.set_index('Store').T
df_out
Output:
Store A B C D E
Week 1 99 119 145 144 50
Week 2 163 86 98 143 181
Week 3 148 92 162 199 177

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first index goes from 0 to 23 and the second index goes from 0 to 55. In other words we have daily 5 minute increment data.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output creates a new column called '5min_ret' to the original dataframe in which each row corresponds to the correct hour/5minute pair from the smaller dataframe containing the 5min_ret
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to use merge on hour and minute. First create a column 'min' in ES from the datetimeindex such as:
ES['min'] = ES.index.minute
Now you can merge with your multiindex DF containing the column '5min_Ret' that I named df_multi such as:
ES = ES.merge(df_multi.reset_index(), left_on = ['hour','min'],
right_on = ['level_0','level_1'], how='left')
Here you merge on 'hour' and 'min' from ES with 'level_0' and 'level_1', which are created from your multiindex of df_multi when you do reset_index, and on the value of the left df (being ES)
You should get a new column in ES named '5min_Ret' with the value you are looking for. You can drop the colum 'min' if you don't need it anymore by ES = ES.drop('min',axis=1)