I have a correlation matrix of stock returns in a Pandas DataFrame and I want to extract the top/bottom 10 correlated pairs from the matrix.
Sample DataFrame:
import pandas as pd
import numpy as np
data = np.random.randint(5,30,size=500)
df = pd.DataFrame(data.reshape((50,10)))
corr = df.corr()
This is my function to get the top/bottom 10 correlated pairs by 1) first returning a multi-indexed series (high) for highest correlated pairs, and then 2) unstacking back into a DataFrame (high_df):
def get_rankings(corr_matrix):
    # The matrix is symmetric, so keep only the upper triangle without the diagonal (k=1)
    ranked_corr = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                              .stack()
                              .sort_values(ascending=False))
    high = ranked_corr[:10]
    high_df = high.unstack().fillna("")
    return high_df
get_rankings(corr)
My current DF output looks something like this:
6 4 5 7 8 3 9
3 0.359 0.198
1 0.275
4 0.257
2 0.176 0.154
0 0.153 0.164
5 0.156
But I want it to look like this, in either two or three columns:
ID1 ID2 Corr
0 9 0.304471
2 8 0.271009
2 3 0.147702
7 9 0.146176
0 7 0.144549
7 8 0.111888
4 6 0.098619
1 7 0.092338
1 4 0.09091
3 6 0.079688
It needs to be in a DataFrame so I can pass it to a grid widget, which only accepts DataFrames. Can anyone help me rehash the shape of the unstacked DF?
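One way to get that shape (a sketch reusing the same upper-triangle masking as get_rankings, with dummy data standing in for the real returns): skip unstack entirely and call reset_index on the ranked series, which turns the pair MultiIndex into two ordinary columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(5, 30, size=(50, 10)))
corr = df.corr()

# Same upper-triangle mask as in get_rankings, but instead of unstacking,
# reset_index turns the pair MultiIndex into two regular columns.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (corr.where(mask)
             .stack()
             .sort_values(ascending=False)
             .head(10)
             .reset_index())
pairs.columns = ['ID1', 'ID2', 'Corr']
print(pairs)
```

The result is a plain three-column DataFrame, so it can be passed straight to a grid widget; use `.head(10)` / `.tail(10)` for the top and bottom pairs respectively.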
I have 3 DataFrames like below.
A =
ops lat
0 9,453 13,536
1 8,666 14,768
2 8,377 15,278
3 8,236 15,536
4 8,167 15,668
5 8,099 15,799
6 8,066 15,867
7 8,029 15,936
8 7,997 16,004
9 7,969 16,058
10 7,962 16,073
B =
ops lat
0 9,865 12,967
1 8,908 14,366
2 8,546 14,976
3 8,368 15,294
4 8,289 15,439
5 8,217 15,571
6 8,171 15,662
7 8,130 15,741
8 8,093 15,809
9 8,072 15,855
10 8,058 15,882
C =
ops lat
0 9,594 13,332
1 8,718 14,670
2 8,396 15,242
3 8,229 15,553
4 8,137 15,725
5 8,062 15,875
6 8,008 15,982
7 7,963 16,070
8 7,919 16,159
9 7,892 16,218
10 7,874 16,255
How do I merge them into a single dataframe where the ops column is the sum and the lat column is the average of these three dataframes?
pd.concat() seems to just append the dataframes.
There are likely many ways, but to stay on the same line of thinking as your pd.concat attempt, the following will work.
First, concatenate the dataframes; then calculate .sum() and .mean() on the combined dataframe and construct the final table from those two fields.
Dummy Data and Example Below:
import pandas as pd

data = {'Name': ['node1', 'node1', 'node1', 'node2', 'node2', 'node3'],
        'Value': [1000, 20000, 40000, 30000, 589, 682],
        'Value2': [303, 2084, 494, 2028, 4049, 112]}
df1 = pd.DataFrame(data)

data2 = {'Name': ['node1', 'node1', 'node1', 'node2', 'node2', 'node3'],
         'Value': [1000, 20000, 40000, 30000, 589, 682],
         'Value2': [8, 234, 75, 123, 689, 1256]}
df2 = pd.DataFrame(data2)

joined = pd.concat([df1, df2])
final = pd.DataFrame({'Sum_Col': [joined["Value"].sum()],
                      'Mean_Col': [joined["Value2"].mean()]})
display(final)
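For the original ops/lat question specifically, a hedged sketch of the row-wise version (assuming the three frames A, B and C share the same index, as in the tables above): group the concatenated frame by its original index and aggregate each column differently.

```python
import pandas as pd

# Shortened stand-ins for the A, B, C tables in the question
A = pd.DataFrame({'ops': [9453, 8666], 'lat': [13536, 14768]})
B = pd.DataFrame({'ops': [9865, 8908], 'lat': [12967, 14366]})
C = pd.DataFrame({'ops': [9594, 8718], 'lat': [13332, 14670]})

# Stack the frames, then group rows that share the same original index label
# and aggregate: ops is summed, lat is averaged, row by row.
merged = (pd.concat([A, B, C])
            .groupby(level=0)
            .agg(ops=('ops', 'sum'), lat=('lat', 'mean')))
print(merged)
```

`groupby(level=0)` groups on the index labels that pd.concat preserves, so each output row combines the matching rows of the three inputs.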
I am trying to understand how the method apply() can be used with series and dataframes.
As shown below, when the np.max() function is used with the apply() method with the dataframe it is returning the max value for each column. But when used with the series, it is just returning the series. My expectation was that it would return the max value of the series. That is, the result would be similar to series.max(). Why is apply() performing differently on series and on dataframes?
import pandas as pd
import numpy as np
my_df = pd.DataFrame(np.random.randint(10, size=(4,3)), columns = list('ABC'))
my_df
Output:
A B C
0 2 4 7
1 9 6 6
2 4 4 8
3 8 8 1
df_max = my_df.apply(np.max)
df_max
Output:
A 9
B 8
C 8
dtype: int32
se_max = my_df['A'].apply(np.max)
se_max
Output:
0 2
1 9
2 4
3 8
Name: A, dtype: int32
By default, apply on a DataFrame works along the first axis: each column is passed to the function as a Series, so np.max reduces every column to a scalar. On a Series, apply is element-wise: the function is called once per value, and np.max of a single scalar is just that scalar, which is why you get the series back unchanged. To reduce a Series, call the function on the Series itself, e.g. np.max(my_df['A']) or my_df['A'].max().
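A minimal sketch of the difference, using a small series with the same values as column A above:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 9, 4, 8], name='A')

# Series.apply is element-wise: np.max(2), np.max(9), ... so the
# series comes back unchanged.
elementwise = s.apply(np.max)

# To reduce, pass the whole series to the function (or call s.max()).
reduced = np.max(s)
print(elementwise)
print(reduced)
```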
Based on the simplified sample dataframe
import pandas as pd
import numpy as np

timestamps = pd.date_range(start='2017-01-01', end='2017-01-05', inclusive='left')
values = np.arange(0, len(timestamps))
df = pd.DataFrame({'A': values, 'B': values * 2}, index=timestamps)
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3  # two input timesteps plus one target column
new_df = pd.concat(
    [
        df.iloc[i : i + window_size, :]
        .T.reset_index()
        .assign(other_index=i)
        .set_index(["other_index", "index"])
        .set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
        for i in range(df.shape[0] - window_size + 1)
    ]
)
new_df.index.names = ["", ""]
print(new_df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
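An alternative sketch that avoids the Python-level loop over slices: NumPy's sliding_window_view (available since NumPy 1.20) builds all windows at once, and a MultiIndex restores the (window, column) layout. This is my own variant, not the answer above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

window_size = 2  # input timesteps; one extra value per window is the target
values = np.arange(4)
df = pd.DataFrame({'A': values, 'B': values * 2},
                  index=pd.date_range('2017-01-01', periods=4))

# One window per starting row: shape (n_windows, n_cols, window_size + 1)
w = sliding_window_view(df.to_numpy(), window_size + 1, axis=0)
n_windows, n_cols = w.shape[0], w.shape[1]

# Flatten to one row per (window, column) pair, matching the desired layout
flat = w.reshape(n_windows * n_cols, window_size + 1)
idx = pd.MultiIndex.from_product([range(n_windows), df.columns])
out = pd.DataFrame(flat, index=idx,
                   columns=[f"timestep_{j + 1}" for j in range(window_size)] + ["target"])
print(out)
```

Because sliding_window_view returns a view, no window data is copied until the final reshape, which can matter for long time series.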
I have a pandas dataframe that looks like the below, and I am trying to obtain the decile ranking for each column's row and then create a new column for each feature within the dataframe:
I'm not sure if I'm explaining this well, but I ultimately want to produce a dataframe that looks as follows:
You can use qcut - https://pandas.pydata.org/docs/reference/api/pandas.qcut.html
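A minimal sketch of the default column-wise case (assuming integer decile labels per column; with only five rows, duplicates='drop' guards against repeated bin edges):

```python
import pandas as pd

test = pd.DataFrame({"a": [-0.1095, 0.1801, 0.0623, 0.1003, -0.0725],
                     "b": [-0.1895, 0.2001, 0.0523, 0.1203, -0.0225]})

# qcut on each column independently: every value gets the label of its
# quantile bin within that column.
deciles = test.apply(lambda col: pd.qcut(col, 10, labels=False, duplicates='drop'))
result = pd.concat([test, deciles.add_suffix('_decile')], axis=1)
print(result)
```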
EDIT: If you want to get results relative to the row (as specified in comment below), you can use apply (and add suffix to rename the columns), for example:
test = pd.DataFrame({"a": [-0.1095, 0.1801, 0.0623, 0.1003, -0.0725],
                     "b": [-0.1895, 0.2001, 0.0523, 0.1203, -0.0225],
                     "c": [-0.0695, 0.2121, 0.1023, 0.2023, -0.0325],
                     "d": [-0.0495, 0.2401, 0.1223, 0.1603, -0.0125]},
                    index=["11/30/1984", "12/31/1984", "1/31/1985", "2/26/1985", "3/31/1985"])

test2 = test.apply(lambda x: pd.qcut(x, 10, duplicates='drop', labels=False), axis=1)\
            .add_suffix('_decile_row')
pd.concat([test, test2], axis=1)
Which will produce:
a b c d a_decile_row b_decile_row c_decile_row d_decile_row
11/30/1984 -0.110 -0.190 -0.070 -0.050 3 0 6 9
12/31/1984 0.180 0.200 0.212 0.240 0 3 6 9
1/31/1985 0.062 0.052 0.102 0.122 3 0 6 9
2/26/1985 0.100 0.120 0.202 0.160 0 3 9 6
3/31/1985 -0.072 -0.022 -0.033 -0.013 0 6 3 9
Given a pandas crosstab, how do you convert that into a stacked dataframe?
Assume you start with a stacked dataframe and convert it into a crosstab. Now I would like to revert back to the original stacked dataframe. I searched for a problem statement that addresses this requirement, but could not find any that hits it bang on. In case I have missed one, please leave a note to it in the comment section.
I would like to document the best practice here. So, thank you for your support.
I know that pandas.DataFrame.stack() would be the best approach, but one needs to be careful of the "level" the stacking is applied to.
Input: Crosstab:
Label a b c d r
ID
1 0 1 0 0 0
2 1 1 0 1 1
3 1 0 0 0 1
4 1 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
8 1 0 1 0 0
9 0 1 0 0 0
Output: Stacked DataFrame:
ID Label
0 1 b
1 2 a
2 2 b
3 2 d
4 2 r
5 3 a
6 3 r
7 4 a
8 4 d
9 6 a
10 7 c
11 8 a
12 8 c
13 9 b
Step-by-step Explanation:
First, let's make a function to create our data. Note that it randomly generates the stacked dataframe, so the final output may differ from what is shown below.
Helper Function: Make the Stacked And Crosstab DataFrames
import numpy as np
import pandas as pd

# Make stacked dataframe
def _create_df():
    """
    This dataframe will be used to create a crosstab
    """
    B = np.array(list('abracadabra'))
    A = np.arange(len(B))
    AB = list()
    for i in range(20):
        a = np.random.randint(1, 10)
        b = np.random.randint(1, 10)
        AB += [(a, b)]
    AB = np.unique(np.array(AB), axis=0)
    AB = np.unique(np.array(list(zip(A[AB[:, 0]], B[AB[:, 1]]))), axis=0)
    AB_df = pd.DataFrame({'ID': AB[:, 0], 'Label': AB[:, 1]})
    return AB_df

original_stacked_df = _create_df()

# Make crosstab
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label']).reindex()
What to expect?
You would expect a function to regenerate the stacked dataframe from the crosstab. I would provide my own solution to this in the answer section. If you could suggest something better that would be great.
Other References:
Closest stackoverflow discussion: pandas stacking a dataframe
Misleading stackoverflow question-topic: change pandas crossstab dataframe into plain table format:
You can just do stack
df[df.astype(bool)].stack().reset_index().drop(columns=0)
The following produces the desired outcome.
def crosstab2stacked(crosstab):
    stacked = crosstab.stack(dropna=True).reset_index()
    # Keep only the nonzero counts, drop the count column, and renumber the rows
    stacked = stacked[stacked.replace(0, np.nan)[0].notnull()].drop(columns=[0])
    return stacked.reset_index(drop=True)
# Make original dataframe
original_stacked_df = _create_df()
# Make crosstab dataframe
crosstab_df = pd.crosstab(original_stacked_df['ID'],
original_stacked_df['Label']).reindex()
# Recontruct stacked dataframe
recon_stacked_df = crosstab2stacked(crosstab = crosstab_df)
Check if original == reconstructed:
np.all(original_stacked_df == recon_stacked_df)
Output: True