How to create a heatmap kind of visualization for data - pandas

I have a pandas DataFrame with the following structure:
BID  EID
B1   1001,1002
B2   1001,1003,1006
B3   1004,1006,1008,1005
B4   1001,1002,1003,10004,1005,1008
I want to report the COUNT of common EIDs between each pair of BIDs.
I want a visualization in the following format:
     B1   B2   B3   B4
B1  n/a    1    0    2
B2    1  n/a    2    3
B3    0    2  n/a    4
B4    2    3    4  n/a
How can I achieve this? Also, the higher the number in a cell, the darker I want it highlighted, as in a heat map. Appreciate your help.
My logic is:
Create a pandas DataFrame with BID as the index.
Loop through each BID and compare it with the other BIDs.
Create a new column for each comparison; each new column will hold the count of shared EIDs.
How do I convert this DataFrame to a heat map?
Or is there any easier logic I can implement?
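For reference, the pairwise-count matrix described above can also be built directly with set intersections, without a DataFrame of lists as an intermediate step. This is a minimal sketch: the `data` dict simply restates the question's table, and the names are illustrative.

```python
import pandas as pd

# Hypothetical input mirroring the question's table
data = {
    "B1": {1001, 1002},
    "B2": {1001, 1003, 1006},
    "B3": {1004, 1006, 1008, 1005},
    "B4": {1001, 1002, 1003, 10004, 1005, 1008},
}

bids = list(data)
# Pairwise count of shared EIDs; the diagonal stays None ("n/a")
counts = pd.DataFrame(
    [[len(data[a] & data[b]) if a != b else None for b in bids] for a in bids],
    index=bids,
    columns=bids,
)
print(counts)
```

The resulting square frame can then be fed to any heatmap plotting routine.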

OK, I'm new here but here is my solution :)
First of all, for each BID create a column with its data:
tmp = df["EID"].apply(pd.Series).set_index(df["BID"].values).T
Next, build the structure of the matrix with corr():
corr_df = tmp.corr()
Then overwrite the values of corr_df with the pairwise counts:
import itertools
import numpy as np

for a, b in itertools.combinations_with_replacement(tmp.columns, 2):
    corr_df.loc[[a], [b]] = len(set(tmp[a]) & set(tmp[b]))  # number of shared EIDs
    corr_df.loc[[b], [a]] = len(set(tmp[a]) & set(tmp[b]))
    corr_df.loc[[a], [a]] = np.nan  # blank out the diagonal
print(corr_df)
output:
     B1   B2   B3   B4
B1  NaN  1.0  0.0  2.0
B2  1.0  NaN  1.0  2.0
B3  0.0  1.0  NaN  3.0
B4  2.0  2.0  3.0  NaN
heatmap code:
import plotly.graph_objects as go  # for visualization

fig = go.Figure(data=go.Heatmap(
    z=corr_df,
    x=corr_df.columns,
    y=corr_df.columns,
    hoverongaps=False,
    colorscale="Greys",
))
fig.update_layout(
    title="HeatMap",
    width=600,
)
fig.show()
output: (heatmap image)


How to merge common indices when creating MultiIndex DataFrame

I have a DataFrame that looks like this:
  Method Dataset  foo  bar
0     A1      B1   10   20
1     A1      B2   10   20
2     A1      B2   10   20
3     A2      B1   10   20
4     A3      B1   10   20
5     A1      B1   10   20
6     A2      B2   10   20
7     A3      B2   10   20
I'd like to use Method and Dataset columns to turn this into a MultiIndex DataFrame. So I tried doing:
df.set_index(["Method", "Dataset"], inplace=True)
df.sort_index(inplace=True)
Which gives:
                foo  bar
Method Dataset
A1     B1        10   20
       B1        10   20
       B2        10   20
       B2        10   20
A2     B1        10   20
       B2        10   20
A3     B1        10   20
       B2        10   20
This is almost what I want, but I was expecting common values in the Dataset index to also be merged under one value, similar to the Method index:
                foo  bar
Method Dataset
A1     B1        10   20
                 10   20
       B2        10   20
                 10   20
A2     B1        10   20
       B2        10   20
A3     B1        10   20
       B2        10   20
How can I achieve that?
(This might not make a big difference to how you'd use a DataFrame, but I'm trying to use the to_latex() method, which is sensitive to these things.)
I suggest you do this at the very end, right before you write the DataFrame with to_latex; otherwise you can run into issues during data processing.
We will replace the duplicated entries in the last level with the empty string and reconstruct the entire MultiIndex.
import pandas as pd
import numpy as np

df.index = pd.MultiIndex.from_arrays([
    df.index.get_level_values('Method'),
    np.where(df.index.duplicated(), '', df.index.get_level_values('Dataset'))
], names=['Method', 'Dataset'])
                foo  bar
Method Dataset
A1     B1        10   20
                 10   20
       B2        10   20
                 10   20
A2     B1        10   20
       B2        10   20
A3     B1        10   20
       B2        10   20
If you want to make this a bit more flexible for any number of levels (even a plain Index), we can use this function, which blanks out duplicates in the last level:
def white_out_index(idx):
    """idx : pd.MultiIndex or pd.Index"""
    i0 = [idx.get_level_values(i) for i in range(idx.nlevels - 1)]
    i0.append(np.where(idx.duplicated(), '', idx.get_level_values(-1)))
    return pd.MultiIndex.from_arrays(i0, names=idx.names)

df.index = white_out_index(df.index)
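To sanity-check the helper, here is an end-to-end run on a small frame; the sample values are made up, but the mechanics are the same as above (idx.duplicated() flags rows whose full index tuple repeats, and those get a blank Dataset label):

```python
import pandas as pd
import numpy as np

def white_out_index(idx):
    """Blank out entries in the last level of idx whose full tuple is a repeat."""
    i0 = [idx.get_level_values(i) for i in range(idx.nlevels - 1)]
    i0.append(np.where(idx.duplicated(), '', idx.get_level_values(-1)))
    return pd.MultiIndex.from_arrays(i0, names=idx.names)

df = pd.DataFrame({
    "Method": ["A1", "A1", "A2"],
    "Dataset": ["B1", "B1", "B2"],
    "foo": [10, 10, 10],
})
df = df.set_index(["Method", "Dataset"]).sort_index()
df.index = white_out_index(df.index)
print(df.index.get_level_values("Dataset").tolist())  # ['B1', '', 'B2']
```

The second (A1, B1) row now shows an empty Dataset label, which is exactly what to_latex needs.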

append lists of different length to dataframe pandas

Say I have multiple lists:
A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3', 'B4']
C = ['acc_num=3', 'C1']
How do I put them in a DataFrame to export to Excel as:
   acc_num  _1  _2  _3  _4
_1       1  A1  A2
_2       2  B1  B2  B3  B4
_3       3  C1
Hi, here is a solution for you in 3 basic steps:
Create a DataFrame by passing it a list of your lists.
Strip the leading "acc_num=" string from the acc_num column; this is done with a vectorized string method on the column.
Rename the column headers as you wish by passing a dictionary to df.rename.
The code:
# Create a DataFrame from your lists
df = pd.DataFrame([A, B, C])
# Remove the initial "acc_num=" string from column 0
df[0] = df[0].str.replace('acc_num=', '')
# Rename column 0
df.rename(columns={0: "acc_num"}, inplace=True)
Final result:
Out[26]:
  acc_num   1     2     3     4
0       1  A1    A2  None  None
1       2  B1    B2    B3    B4
2       3  C1  None  None  None
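The question's final goal was an Excel export, so for completeness here are the three steps end-to-end; the to_excel call is left commented out, and the filename is just an example (writing .xlsx also requires openpyxl to be installed):

```python
import pandas as pd

A = ['acc_num=1', 'A1', 'A2']
B = ['acc_num=2', 'B1', 'B2', 'B3', 'B4']
C = ['acc_num=3', 'C1']

# Ragged lists are padded with None to the longest length
df = pd.DataFrame([A, B, C])
# Strip the "acc_num=" prefix (literal match, not a regex)
df[0] = df[0].str.replace('acc_num=', '', regex=False)
df = df.rename(columns={0: 'acc_num'})
# df.to_excel('accounts.xlsx', index=False)  # example path; needs openpyxl
print(df)
```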

is it possible to obtain 'groupby-transform-apply' style results with the function returning a series rather than a scalar?

I want to achieve the following behavior:
res = df.groupby(['dimension'], as_index=False)['metric'].transform(lambda x: foo(x))
where foo(x) returns a series the same size as its input, df['metric'].
However, this throws the following error:
ValueError: transform must return a scalar value for each group
I know I can use a for-loop style, but how can I achieve this in a groupby manner?
e.g. df:
  col1 col2  col3
0   A1   B1     1
1   A1   B1     2
2   A2   B2     3
and I want to achieve:
  col1 col2  col3
0   A1   B1     1 - (1+2)/2
1   A1   B1     2 - (1+2)/2
2   A2   B2     3 - 3
If you want to return a Series, you should use apply instead of transform:
res = df.groupby(['dimension'], as_index=False)['metric'].apply(lambda x: foo(x))
transform, as the error states, must return a scalar value to be broadcast into every row of each group. apply, however, works with a Series returned for each group.
If this doesn't work, provide input and expected output so we can understand your problem better.
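A small runnable illustration of the apply route; `demean` here stands in for the question's `foo` and is a made-up example of a function returning a Series the same length as its input:

```python
import pandas as pd

df = pd.DataFrame({
    "dimension": ["a", "a", "b"],
    "metric": [1.0, 2.0, 3.0],
})

def demean(s):
    # Returns a Series the same length as its input group
    return s - s.mean()

# apply accepts a Series-returning function; group_keys=False keeps
# the original index so the result aligns with df
res = df.groupby("dimension", group_keys=False)["metric"].apply(demean)
print(res.tolist())  # [-0.5, 0.5, 0.0]
```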
You can do this using transform:
df['col3'] = (df.col3 - df.groupby(['col1', 'col2'])['col3'].transform('sum')) / 2
Or using apply (slower):
df['col3'] = df.groupby(['col1', 'col2'])['col3'].apply(lambda x: (x - x.sum()) / 2)
  col1 col2  col3
0   A1   B1  -1.0
1   A1   B1  -0.5
2   A2   B2   0.0

Python - reshape, pivot, unstack - multiindex

I have a DataFrame below that I am trying to reshape. I've looked up how to do it, but I'm getting back multiple answers, and when trying to implement them I get errors about having a duplicate index, or I end up with just one wide row. The options I have been trying are unstack, pivot, and ravel. What would be the best and easiest way to reshape without iterating over rows? I know I could work that out, but I also know there is a better way.
For the sake of clarity: rows that share the same Customer, Week, and Type should be moved onto one single row.
EDIT: As asked below, here is a quick sample of the data set. I should have provided it from the start.
import pandas as pd

d = {'Customer': ['Store_A'] * 12,
     'Class': ['1A', '1A', '2B', '2B', '3C', '3C'] * 2,
     'Week': ['08/19/2018', '08/26/2018'] * 6,
     'Type': ['Food'] * 6 + ['Beverage'] * 6,
     'Value': [None, None, 1, 1.5, 1.1, 1.2, None, None, 0.96, 0.70, 0.96, 0.96]}
test_df = pd.DataFrame(data=d)
Duplicated column names should be avoided in pandas, so I recommend adding a counter to disambiguate them:
g = test_df.groupby(['Customer', 'Week', 'Type']).cumcount().astype(str)
df = test_df.set_index(['Customer', 'Week', 'Type', g]).unstack().sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print(df)
Customer Week Type Class_0 Value_0 Class_1 Value_1 Class_2 \
0 Store_A 08/19/2018 Beverage 1A NaN 2B 0.96 3C
1 Store_A 08/19/2018 Food 1A NaN 2B 1.00 3C
2 Store_A 08/26/2018 Beverage 1A NaN 2B 0.70 3C
3 Store_A 08/26/2018 Food 1A NaN 2B 1.50 3C
Value_2
0 0.96
1 1.10
2 0.96
3 1.20
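The cumcount step above exists only to number the repeated (Customer, Week, Type) rows so that unstack gets a unique column key per repeat. A minimal sketch of that mechanic on a toy frame (the names and values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "key": ["x", "x", "y", "x"],
    "val": [1, 2, 3, 4],
})

# cumcount numbers the rows 0, 1, 2, ... within each group
g = toy.groupby("key").cumcount().astype(str)
print(g.tolist())  # ['0', '1', '0', '2']

# With the counter in the index, every (key, counter) pair is unique,
# so unstack can spread the repeats into columns without collisions
wide = toy.set_index(["key", g])["val"].unstack()
print(wide)
```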

pandas dataframe group by and agg

I am new to IPython and I am trying to do something with DataFrame grouping. I have a DataFrame like below:
import pandas as pd
import numpy as np

df_test = pd.DataFrame({"A": range(4), "B": ["B1", "B2", "B1", "B2"], "C": ["C1", "C1", np.nan, "C2"]})
df_test
   A   B    C
0  0  B1   C1
1  1  B2   C1
2  2  B1  NaN
3  3  B2   C2
I would like to achieve the following:
1) group by B, but create multilevel columns instead of grouping to rows, with the counts of B1 and B2 as columns
2) apply agg functions to columns A and C, with something like {'C': ['count'], 'A': ['sum']}
        B
   A   B1  B2  C
0  6    2   2  3
How? Thanks.
You are applying separate actions to each column. You can hack this by aggregating A and C, taking the value counts of B separately, and then combining the results:
ac = df_test.agg({'A': 'sum', 'C': 'count'})
b = df_test['B'].value_counts()
pd.concat([ac, b]).sort_index().to_frame().T
   A  B1  B2  C
0  6   2   2  3
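For completeness, the answer runs end-to-end like this, reusing df_test exactly as defined in the question:

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    "A": range(4),
    "B": ["B1", "B2", "B1", "B2"],
    "C": ["C1", "C1", np.nan, "C2"],
})

ac = df_test.agg({"A": "sum", "C": "count"})  # sum of A, non-null count of C
b = df_test["B"].value_counts()               # row counts for B1 and B2
# Stack the two result Series and flip into a single-row frame
out = pd.concat([ac, b]).sort_index().to_frame().T
print(out)
```

Note that the B1/B2 counts end up as plain top-level columns rather than under a "B" multilevel header; building the MultiIndex would take an extra columns assignment on top of this.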