Empty Dataframe after being populated from URL - pandas

import pandas as pd
import requests
from bs4 import BeautifulSoup
html_data = requests.get('https://www.macrotrends.net/stocks/charts/GME/gamestop/revenue')
soup = BeautifulSoup(html_data.text, 'lxml')
all_tables = soup.find_all('table', attrs={'class': 'historical_data_table table'})
gme_revenue = pd.DataFrame(columns=["Date", "Revenue"])
for table in all_tables:
    if table.find('th').getText().startswith("Gamestop Quarterly Revenue"):
        for row in table.find_all("tr"):
            col = row.find_all("td")
            if len(col) == 2:
                date = col[0].text
                revenue = col[1].text.replace('$', '').replace(',', '')
                gme_revenue = gme_revenue.append({"Date": date, "Revenue": revenue}, ignore_index=True)
However, when I try to display the table, it comes up empty:
Empty DataFrame
Columns: [Date, Revenue]
Index: []
and after I do a test, this appears:
I'm unsure why my DataFrame is empty. I've even copied the code from another DataFrame and it still doesn't work.
Any help is appreciated.

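Replace
    if table.find('th').getText().startswith("Gamestop Quarterly Revenue"):
with
    if 'Quarterly' in table.find('th').text:
and it should work. The startswith test is case-sensitive and must match the beginning of the header text exactly, so it presumably never matches here and no rows get appended.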
Date Revenue
0 2020-10-31 1005
1 2020-07-31 942
2 2020-04-30 1021
3 2020-01-31 2194
4 2019-10-31 1439
... ... ...
59 2006-01-31 1667
60 2005-10-31 534
61 2005-07-31 416
62 2005-04-30 475
63 2005-01-31 709
64 rows × 2 columns
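A minimal sketch of the difference between the two checks. The header string and the two rows are hypothetical stand-ins for what the scraper might see, not data fetched from the page. The sketch also collects the rows in a list and builds the DataFrame once, because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical header text and rows standing in for the scraped table
header = "GameStop Quarterly Revenue (Millions of US $)"
scraped_rows = [("2020-10-31", "$1,005"), ("2020-07-31", "$942")]

# Case-sensitive prefix test: fails because of the capital "S" in "GameStop"
strict_match = header.startswith("Gamestop Quarterly Revenue")
# Looser, case-insensitive substring test
loose_match = "quarterly" in header.lower()

records = []
if loose_match:
    for date, revenue in scraped_rows:
        cleaned = revenue.replace("$", "").replace(",", "")
        records.append({"Date": date, "Revenue": cleaned})

# Build the DataFrame once from the collected rows
gme_revenue = pd.DataFrame(records, columns=["Date", "Revenue"])
print(gme_revenue)
```

Printing table.find('th').text for each table is a quick way to see which string the condition actually has to match.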


Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on the quantile percentage. So if one group has q=0.8, I want the lowest 80% of values to get 1 and the upper 20% to get 0.
So, given the data like this:
I want objects 1, 2 and 5 to get result 1 and the other three to get result 0. In total my data consists of 7,000,000 rows in 14,000 groups. I tried doing this with groupby.quantile, but that requires a single constant quantile, whereas my data has a different one for each group.
import numpy as np
import pandas as pd
num = 7_000_000
grp_num = 14_000
qua = np.around(np.random.uniform(size=grp_num), 2)
df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)
def func(grp):
    # Every row in a group gets that group's quantile
    grp["Quantile"] = qua[grp.Group]
    return grp
df = df.groupby("Group").apply(func)
Answer: (This is basically a for loop, so for performance you can try to apply numba to this)
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])
df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
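The same idea on a tiny, hand-checkable example may make the logic easier to follow. This is a sketch with made-up data: it computes one threshold per group in a plain loop, then maps the thresholds back onto the rows and compares. The group labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: two groups, each with its own quantile
df = pd.DataFrame({
    "Group": ["a"] * 5 + ["b"] * 5,
    "Quantile": [0.8] * 5 + [0.4] * 5,
    "Value": [10, 20, 30, 40, 50, 1, 2, 3, 4, 5],
})

# One threshold per group: the group's own quantile of its values
thresholds = {
    name: grp["Value"].quantile(grp["Quantile"].iloc[0])
    for name, grp in df.groupby("Group")
}

# 1 if the value lies below its group's threshold, else 0
df["result"] = (df["Value"] < df["Group"].map(thresholds)).astype(int)
print(df)
```

With pandas' default linear interpolation the thresholds are 42.0 for group "a" and 2.6 for group "b", so four values in "a" and two in "b" get 1.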

pandas dataframe how to shift rows based on date

I am trying to assess the impact of a promotional campaign on our customers. The goal is to assess revenue from the point the promotion was offered. However, the promotion was offered to different customers at different times. How do I rearrange the data into Month 0, Month 1, Month 2, Month 3, with Month 0 being the month the customer first got the promotion?
The self-explanatory code below gives the desired output:
# Create DataFrame
import pandas as pd
# Note: the monthly revenue columns ("May-18" to "Dec-18") are not shown in this
# constructor; they must also be present in df for the selection below to work
df = pd.DataFrame({"Account": [1, 2, 3, 4, 5, 6],
                   "Promotion Month": ["Sep-18", "Aug-18", "Jul-18", "May-18", "Aug-18", "Jun-18"]})
df = df.set_index("Account")
cols = ["May-18", "Jun-18", "Jul-18", "Aug-18", "Sep-18", "Oct-18", "Nov-18", "Dec-18", "Promotion Month"]
df = df[cols]
# Define function to select the four months starting at the promotion month
def selectMonths(row):
    cols = df.columns.to_list()
    colMonth0 = cols.index(row["Promotion Month"])
    colsOut = cols[colMonth0:colMonth0+4]
    out = pd.Series(row[colsOut].to_list())
    return out
# Apply the function and set the index and columns of output DataFrame
out = df.apply(selectMonths, axis=1)
out.index = df.index
out.columns = ["Month 0", "Month 1", "Month 2", "Month 3"]
Then the output you get is:
>>> out
Month 0 Month 1 Month 2 Month 3
1 218 199 231 173
2 174 209 206 196
3 175 178 165 205
4 158 189 167 233
5 223 204 212 234
6 200 204 204 225
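A self-contained sketch of the same approach that can be run as-is. The revenue figures and the three accounts are made up for illustration, and select_months and n_months are hypothetical names, not part of the answer above.

```python
import pandas as pd

months = ["May-18", "Jun-18", "Jul-18", "Aug-18", "Sep-18", "Oct-18"]

# Hypothetical revenue figures, for illustration only
df = pd.DataFrame(
    {
        "May-18": [10, 20, 30],
        "Jun-18": [11, 21, 31],
        "Jul-18": [12, 22, 32],
        "Aug-18": [13, 23, 33],
        "Sep-18": [14, 24, 34],
        "Oct-18": [15, 25, 35],
        "Promotion Month": ["Jul-18", "May-18", "Jun-18"],
    },
    index=pd.Index([1, 2, 3], name="Account"),
)

def select_months(row, n_months=4):
    # Position of the promotion month among the month columns
    start = months.index(row["Promotion Month"])
    # Revenue for the promotion month and the following months
    values = row[months[start:start + n_months]].to_list()
    return pd.Series(values)

out = df.apply(select_months, axis=1)
out.columns = [f"Month {i}" for i in range(out.shape[1])]
print(out)
```

In this toy data every account has at least four months available from its promotion month onward, so each row yields exactly four values.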

Summing columns and rows

How do I add up rows and columns?
The last column Sum needs to be the sum of the rows R0+R1+R2.
The last row needs to be the sum of these columns.
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
The result:
Type R0 R1 R2 Sum
0 AP 16 20 78 NaN
1 AP+ 10 14 55 NaN
2 SP 32 26 90 NaN
3 Total 0 0 0 NaN
Let us try .iloc position selection
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341
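One way the positional selection could look, as a sketch rather than the answerer's exact code. It assumes the numeric columns start at position 1 and that the 'Total' row is the last row.

```python
import pandas as pd

data = [['AP', 16, 20, 78], ['AP+', 10, 14, 55], ['SP', 32, 26, 90], ['Total', 0, 0, 0]]
df = pd.DataFrame(data, columns=['Type', 'R0', 'R1', 'R2'])

# Last row, numeric columns: column sums of all rows above it
df.iloc[-1, 1:] = df.iloc[:-1, 1:].sum().to_numpy()

# New 'Sum' column: row-wise sum of the numeric columns
df['Sum'] = df.iloc[:, 1:].sum(axis=1)
print(df)
```

The order matters here: the 'Total' row is filled first, so its own row-wise sum (341) comes out right when the 'Sum' column is created.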
In general it may be better practice to specify column names:
import pandas as pd
# initialize list of lists
data = [['AP',16,20,78], ['AP+', 10,14,55], ['SP',32,26,90],['Total',0, 0, 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Type', 'R0', 'R1', 'R2'])
# List columns
cols_to_sum=['R0', 'R1', 'R2']
# Access last row and sum columns-wise
df.loc[df.index[-1], cols_to_sum] = df[cols_to_sum].sum(axis=0)
# Create 'Sum' column summing row-wise
df['Sum'] = df[cols_to_sum].sum(axis=1)
Type R0 R1 R2 Sum
0 AP 16 20 78 114
1 AP+ 10 14 55 79
2 SP 32 26 90 148
3 Total 58 60 223 341

Column names after transposing a dataframe

I have a small dataframe: six rows (not counting the header) and 53 columns (a store name, and the rest weekly sales for the past year). Each row contains a particular store, and the columns hold the store's name and its sales for each week. I need to transpose the data so that the weeks appear as rows, the stores appear as columns, and the sales appear as the values.
To generate the input data:
df_store = pd.read_excel(SourcePath+SourceFile, sheet_name='StoreSales', header=0, usecols=['StoreName'])
# Number rows of all irrelevant stores.
row_numbers = [x+1 for x in df_store[(df_store['StoreName'] != 'Store1') & (df_store['StoreName'] != 'Store2')
                                     & (df_store['StoreName'] != 'Store3')].index]
# Read in entire Excel file, skipping the rows of irrelevant stores.
df_store = pd.read_excel(SourcePath+SourceFile, sheet_name='StoreSales', header=0, usecols="A:BE",
                         skiprows=row_numbers, converters={'StoreName': str})
# Transpose dataframe
df_store_t = df_store.transpose()
My output puts index numbers (0 to 5) above each store name, and the first row then holds StoreName (above the weeks) followed by each store name. Yet I cannot manipulate the columns by their names.
Is there a way to clear those index numbers so that I can work directly with the resulting column names (e.g., rename "StoreName" to "WeekEnding" and refer to each store column as "Store1", "Store2", etc.)?
IIUC, you need to set_index first, then transpose with T.
See this example:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Store': [*'ABCDE'],
                   'Week 1': np.random.randint(50, 200, 5),
                   'Week 2': np.random.randint(50, 200, 5),
                   'Week 3': np.random.randint(50, 200, 5)})
Input Dataframe:
Store Week 1 Week 2 Week 3
0 A 99 163 148
1 B 119 86 92
2 C 145 98 162
3 D 144 143 199
4 E 50 181 177
Now, set_index and transpose:
df_out = df.set_index('Store').T
Store A B C D E
Week 1 99 119 145 144 50
Week 2 163 86 98 143 181
Week 3 148 92 162 199 177
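To also get the "WeekEnding" column the question asks about, one option is to name the transposed index and turn it back into a regular column. This is a sketch with hypothetical store names, dates and sales figures.

```python
import pandas as pd

# Hypothetical input: one row per store, one column per week
df = pd.DataFrame({
    "StoreName": ["Store1", "Store2"],
    "2021-01-03": [100, 200],
    "2021-01-10": [110, 210],
})

# Stores become the columns, weeks become the index
df_t = df.set_index("StoreName").T

# Name the week index, drop the leftover "StoreName" label on the columns,
# and move the weeks into an ordinary column
df_t.index.name = "WeekEnding"
df_t.columns.name = None
df_t = df_t.reset_index()
print(df_t)
```

After this, each store can be referenced directly, for example df_t["Store1"].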

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first index goes from 0 to 23 and the second index goes from 0 to 55. In other words we have daily 5 minute increment data.
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output creates a new column called '5min_ret' to the original dataframe in which each row corresponds to the correct hour/5minute pair from the smaller dataframe containing the 5min_ret
Price Day Hour num_obs med abs_med Ret 5min_ret
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to merge on hour and minute. First create a column 'min' in ES from the DatetimeIndex:
ES['min'] = ES.index.minute
Now you can merge with your MultiIndex DataFrame containing the column '5min_Ret', which I named df_multi:
ES = ES.merge(df_multi.reset_index(), left_on=['Hour', 'min'],
              right_on=['level_0', 'level_1'], how='left')
Here you merge 'Hour' and 'min' from ES with 'level_0' and 'level_1', which are created from the MultiIndex of df_multi when you call reset_index, and how='left' keeps every row of the left DataFrame (ES).
You should get a new column in ES named '5min_Ret' with the values you are looking for. You can drop the column 'min' if you don't need it anymore with ES = ES.drop('min', axis=1).
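Because merging on columns returns a frame with a fresh integer index, a variant that keeps the DatetimeIndex of ES may be handy. The sketch below uses small hypothetical stand-ins for ES and df_multi. It builds the hour/minute keys in a separate frame, merges that with the lookup table, and writes the result back by position.

```python
import pandas as pd

# Hypothetical stand-ins for ES and df_multi
idx = pd.to_datetime([
    "2006-01-03 08:30:00",
    "2006-01-03 08:35:00",
    "2006-01-04 08:30:00",
])
ES = pd.DataFrame({"Price": [1260.58, 1261.29, 1262.00]}, index=idx)

df_multi = pd.DataFrame(
    {"5min_Ret": [1e-6, 2e-6]},
    index=pd.MultiIndex.from_tuples([(8, 30), (8, 35)]),
)

# Flatten the MultiIndex into ordinary columns and give them readable names
lookup = df_multi.reset_index()
lookup.columns = ["Hour", "min", "5min_Ret"]

# Hour/minute keys taken from the DatetimeIndex, one row per row of ES
keys = pd.DataFrame({"Hour": ES.index.hour, "min": ES.index.minute})

# A left merge keeps the row order of `keys`, so the result lines up with ES
merged = keys.merge(lookup, on=["Hour", "min"], how="left")
ES["5min_Ret"] = merged["5min_Ret"].to_numpy()
print(ES)
```

This relies on each hour/minute pair appearing only once in df_multi; with duplicate pairs the merge would return more rows than ES has.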