Converting a pandas crosstab into a stacked dataframe (a regular table) - pandas

Given a pandas crosstab, how do you convert that into a stacked dataframe?
Assume you have a stacked dataframe. First we convert it into a crosstab. Now I would like to revert back to the original stacked dataframe. I searched a problem statement that addresses this requirement, but could not find any that hits bang on. In case I have missed any, please leave a note to it in the comment section.
I would like to document the best practice here. So, thank you for your support.
I know that pandas.DataFrame.stack() would be the best approach. But one needs to be careful of the the "level" stacking is applied to.
Input: Crosstab:
Label a b c d r
ID
1 0 1 0 0 0
2 1 1 0 1 1
3 1 0 0 0 1
4 1 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
8 1 0 1 0 0
9 0 1 0 0 0
Output: Stacked DataFrame:
ID Label
0 1 b
1 2 a
2 2 b
3 2 d
4 2 r
5 3 a
6 3 r
7 4 a
8 4 d
9 6 a
10 7 c
11 8 a
12 8 c
13 9 b
Step-by-step Explanation:
First, let's make a function that would create our data. Note that it randomly generates the stacked dataframe, and so, the final output may differ from what I have given below.
Helper Function: Make the Stacked And Crosstab DataFrames
import numpy as np
import pandas as pd
# Make stacked dataframe
def _create_df():
"""
This dataframe will be used to create a crosstab
"""
B = np.array(list('abracadabra'))
A = np.arange(len(B))
AB = list()
for i in range(20):
a = np.random.randint(1,10)
b = np.random.randint(1,10)
AB += [(a,b)]
AB = np.unique(np.array(AB), axis=0)
AB = np.unique(np.array(list(zip(A[AB[:,0]], B[AB[:,1]]))), axis=0)
AB_df = pd.DataFrame({'ID': AB[:,0], 'Label': AB[:,1]})
return AB_df
original_stacked_df = _create_df()
# Make crosstab
crosstab_df = pd.crosstab(original_stacked_df['ID'],
original_stacked_df['Label']).reindex()
What to expect?
You would expect a function to regenerate the stacked dataframe from the crosstab. I would provide my own solution to this in the answer section. If you could suggest something better that would be great.
Other References:
Closest stackoverflow discussion: pandas stacking a dataframe
Misleading stackoverflow question-topic: change pandas crossstab dataframe into plain table format:

You can just do stack
df[df.astype(bool)].stack().reset_index().drop(0,1)

The following produces the desired outcome.
def crosstab2stacked(crosstab):
stacked = crosstab.stack(dropna=True).reset_index()
stacked = stacked[stacked.replace(0,np.nan)[0].notnull()].drop(columns=[0])
return stacked
# Make original dataframe
original_stacked_df = _create_df()
# Make crosstab dataframe
crosstab_df = pd.crosstab(original_stacked_df['ID'],
original_stacked_df['Label']).reindex()
# Recontruct stacked dataframe
recon_stacked_df = crosstab2stacked(crosstab = crosstab_df)
Check if original == reconstructed:
np.alltrue(original_stacked_df == recon_stacked_df)
Output: True

Related

Creating a dataframe using roll-forward window on multivariate time series

Based on the simplifed sample dataframe
import pandas as pd
import numpy as np
timestamps = pd.date_range(start='2017-01-01', end='2017-01-5', inclusive='left')
values = np.arange(0,len(timestamps))
df = pd.DataFrame({'A': values ,'B' : values*2},
index = timestamps )
print(df)
A B
2017-01-01 0 0
2017-01-02 1 2
2017-01-03 2 4
2017-01-04 3 6
I want to use a roll-forward window of size 2 with a stride of 1 to create a resulting dataframe like
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6
I.e., each window step should create a data item with the two values of A and B in this window and the A and B values immediately to the right of the window as target values.
My first idea was to use pandas
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
But that seems to only work in combination with aggregate functions such as sum, which is a different use case.
Any ideas on how to implement this rolling-window-based sampling approach?
Here is one way to do it:
window_size = 3
new_df = pd.concat(
[
df.iloc[i : i + window_size, :]
.T.reset_index()
.assign(other_index=i)
.set_index(["other_index", "index"])
.set_axis([f"timestep_{j}" for j in range(1, window_size)] + ["target"], axis=1)
for i in range(df.shape[0] - window_size + 1)
]
)
new_df.index.names = ["", ""]
print(df)
# Output
timestep_1 timestep_2 target
0 A 0 1 2
B 0 2 4
1 A 1 2 3
B 2 4 6

Trying to convert column to be row indexes, set_index error

data_new. set_index('Usual Mode of Transport to Work')
jupyter notebook
Trying to convert column to be row indexes, however, it shows up as NaN? How do i resolve it? Thanks. Im a beginner in python.
Lets start with a toy dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(5, 4)), columns=list('ABCD'))
print(df)
A B C D
0 3 1 2 1
1 2 2 3 4
2 2 4 4 1
3 1 0 3 2
4 1 2 4 0
Now, let's set column A as the index
df.set_index('A')
B C D
A
3 1 2 1
2 2 3 4
2 4 4 1
1 0 3 2
1 2 4 0
This sets the index correctly but doesn't save this newly indexed dataframe in the original data frame variable, i.e., df. So when you check the value of df you see find the original dataframe.
To save the new indexing, you can do one of the following
df = df.set_index('A)
or
df.set_index('A', inplace=True)
Coming to the NaN values, I believe it has got something to do with using Jupyter notebook. Since Jupyter allows jumping between cells, it does not necessarily follow the linear execution order like traditional coding. This can get confusing. You can use the "Variable View" in Jupyter to cross-check if you are passing the value you intend to. I hope this can help you figure out the NaN issue.

pandas finding duplicate rows with different label

I have the case where I want to sanity check labeled data. I have hundreds of features and want to find points which have the same features but different label. These found cluster of disagreeing labels should then be numbered and put into a new dataframe.
This isn't hard but I am wondering what the most elegant solution for this is.
Here an example:
import pandas as pd
df = pd.DataFrame({
"feature_1" : [0,0,0,4,4,2],
"feature_2" : [0,5,5,1,1,3],
"label" : ["A","A","B","B","D","A"]
})
result_df = pd.DataFrame({
"cluster_index" : [0,0,1,1],
"feature_1" : [0,0,4,4],
"feature_2" : [5,5,1,1],
"label" : ["A","B","B","D"]
})
In order to get the output you want (both de-duplication and cluster_index), you can use a groupby approach:
g = df.groupby(['feature_1', 'feature_2'])['label']
(df.assign(cluster_index=g.ngroup()) # get group name
.loc[g.transform('size').gt(1)] # filter the non-duplicates
# line below only to have a nice cluster_index range (0,1…)
.assign(cluster_index= lambda d: d['cluster_index'].factorize()[0])
)
output:
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1
First get all duplicated values per feature columns and then if necessary remove duplciated by all columns (here in sample data not necessary), last add GroupBy.ngroup for groups indices:
df = df[df.duplicated(['feature_1','feature_2'],keep=False)].drop_duplicates()
df['cluster_index'] = df.groupby(['feature_1', 'feature_2'])['label'].ngroup()
print (df)
feature_1 feature_2 label cluster_index
1 0 5 A 0
2 0 5 B 0
3 4 1 B 1
4 4 1 D 1

python - List of Lists into pandas dataframe including name of columns

I would like to transfer a list of lists into a dataframe with columns based on the lists in the list.
This is still easy.
list = [[....],[....],[...]]
df = pd.DataFrame(list)
df = df.transpose()
The problem is: I would like to give the columns a column-name based on entries I have in another list:
list_two = [A,B,C,...]
This is my issue Im still struggling with.
Is there any approach to solve this problem?
Thanks a lot in advance for your help.
Best regards
Sascha
Use zip with dict for dictionary of lists and pass to DataFrame:
L= [[1,2,3,5],[4,8,9,8],[1,2,5,3]]
list_two = list('ABC')
df = pd.DataFrame(dict(zip(list_two, L)))
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3
Or if pass index parameter after transpose get columns names by this list:
df = pd.DataFrame(L, index=list_two).T
print (df)
A B C
0 1 4 1
1 2 8 2
2 3 9 5
3 5 8 3

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to
filter dataframe groups in Pandas, based on multiple (any) conditions.
but I cannot seem to get to a fast Pandas 'native' one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random
n = 100
lst = range(0, n)
df = pd.DataFrame(
{'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
'C': random.choices(list(range(100)), k=2*n*n),
'D': random.choices(list(range(100)), k=2*n*n)
})
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to
select groups grouped by A and B,
filtered groups down to where any values in the group are greater than 50 in both columns C and D,
A "native" Pandas one-liner would be the following:
test.groupby([test.A, test.B]).filter(lambda x: ((x.C>50).any() & (x.D>50).any()) )
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20).
But this solution takes quite long (for example, 4.58 s when n = 100) for large dataframes.
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
test_g = test.assign(key_C = test.C>50, key_D = test.D>50).groupby([test.A, test.B])
test_C_bool = test_g.key_C.transform('any')
test_D_bool = test_g.key_D.transform('any')
test[test_C_bool & test_D_bool]
but arguably a bit more ugly. My questions are:
Is there a better "native" Pandas solution for this task? , and
Is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact I only want to extract the groups and not together with their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together:
mask = (df[['C','D']].gt(50) # in the case you have different thresholds for `C`, `D` [50, 60]
.all(axis=1) # check for both True on the rows
.groupby([df['A'],df['B']]) # normal groupby
.transform('max') # 'any' instead of 'max' also works
)
df.loc[mask]
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])