create multi-indexed dataframe - pandas

I do not know how to create a multi-indexed df in which each first-level key has a different number of second-level entries. Here is a sample:
data = [{'caterpillar': [('Сatérpillar',
{'fuzz': 0.82,
'levenshtein': 0.98,
'jaro_winkler': 0.9192,
'hamming': 0.98}),
('caterpiⅼⅼaʀ',
{'fuzz': 0.73,
'levenshtein': 0.97,
'jaro_winkler': 0.9114,
'hamming': 0.97}),
('cÂteԻpillÂr',
{'fuzz': 0.73,
'levenshtein': 0.97,
'jaro_winkler': 0.881,
'hamming': 0.97})]},
{'elementis': [('elEmENtis',
{'fuzz': 1.0, 'levenshtein': 1.0, 'jaro_winkler': 1.0, 'hamming': 1.0}),
('ÊlemĚntis',
{'fuzz': 0.78,
'levenshtein': 0.98,
'jaro_winkler': 0.863,
'hamming': 0.98}),
('еlÈmÈntis',
{'fuzz': 0.67,
'levenshtein': 0.97,
'jaro_winkler': 0.8333,
'hamming': 0.97})]},
{'gibson': [('giBᏚon',
{'fuzz': 0.83,
'levenshtein': 0.99,
'jaro_winkler': 0.9319,
'hamming': 0.99}),
('ɡibsoN',
{'fuzz': 0.83,
'levenshtein': 0.99,
'jaro_winkler': 0.9206,
'hamming': 0.99}),
('giЬႽon',
{'fuzz': 0.67,
'levenshtein': 0.98,
'jaro_winkler': 0.84,
'hamming': 0.98}),
('glbsՕn',
{'fuzz': 0.67,
'levenshtein': 0.98,
'jaro_winkler': 0.8333,
'hamming': 0.98})]}]
I want a df like this (note: 'Other Name' has a differing number of values for each 'Orig Name'):
Orig Name   | Other Name  | fuzz | levenshtein | Jaro-Winkler | Hamming
------------------------------------------------------------------------
caterpillar   Сatérpillar   0.82   0.98          0.9192         0.98
              caterpiⅼⅼaʀ   0.73   0.97          0.9114         0.97
              cÂteԻpillÂr   0.73   0.97          0.881          0.97
gibson        giBᏚon        0.83   0.99          0.9319         0.99
              ɡibsoN        0.83   0.99          0.9206         0.99
              giЬႽon        0.67   0.98          0.84           0.98
              glbsՕn        0.67   0.98          0.8333         0.98
elementis     .........
------------------------------------------------------------------------
I tried:
orig_name_list = [x for d in data for x, v in d.items()]
value_list = [v for d in data for x, v in d.items()]
other_names = [tup[0] for tup_list in value_list for tup in tup_list]
algos = ['fuzz', 'levenshtein', 'jaro_winkler', 'hamming']
Not sure how to proceed from there. Suggestions are appreciated.

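Let's try concat, building one single-row frame per match (chain .set_index(['OrigName', 'OtherName']) onto the result if you want the two name columns as a MultiIndex):
pd.concat([pd.DataFrame([x[1]]).assign(OrigName=k, OtherName=x[0])
           for df in data for k, d in df.items() for x in d])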
Output:
fuzz levenshtein jaro_winkler hamming OrigName OtherName
0 0.82 0.98 0.9192 0.98 caterpillar Сatérpillar
0 0.73 0.97 0.9114 0.97 caterpillar caterpiⅼⅼaʀ
0 0.73 0.97 0.8810 0.97 caterpillar cÂteԻpillÂr
0 1.00 1.00 1.0000 1.00 elementis elEmENtis
0 0.78 0.98 0.8630 0.98 elementis ÊlemĚntis
0 0.67 0.97 0.8333 0.97 elementis еlÈmÈntis
0 0.83 0.99 0.9319 0.99 gibson giBᏚon
0 0.83 0.99 0.9206 0.99 gibson ɡibsoN
0 0.67 0.98 0.8400 0.98 gibson giЬႽon
0 0.67 0.98 0.8333 0.98 gibson glbsՕn

One way to do this is to reshape your data into records that the pd.json_normalize function can consume. As it stands, the structure (a list of single-key dicts holding tuples) does not map onto a dataframe directly:
import pandas as pd

new_data = []
for entry in data:
    for name, matches in entry.items():
        # One record per original name, holding a list of flat match dicts
        new_entry = {"name": name, "matches": []}
        for other_name, scores in matches:
            # Copy the scores so the original data is not mutated
            new_entry["matches"].append({**scores, "match": other_name})
        new_data.append(new_entry)
df = pd.json_normalize(new_data, "matches", ["name"]).set_index(["name", "match"])
print(df)
fuzz levenshtein jaro_winkler hamming
name match
caterpillar Сatérpillar 0.82 0.98 0.9192 0.98
caterpiⅼⅼaʀ 0.73 0.97 0.9114 0.97
cÂteԻpillÂr 0.73 0.97 0.8810 0.97
elementis elEmENtis 1.00 1.00 1.0000 1.00
ÊlemĚntis 0.78 0.98 0.8630 0.98
еlÈmÈntis 0.67 0.97 0.8333 0.97
gibson giBᏚon 0.83 0.99 0.9319 0.99
ɡibsoN 0.83 0.99 0.9206 0.99
giЬႽon 0.67 0.98 0.8400 0.98
glbsՕn 0.67 0.98 0.8333 0.98
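For comparison, the same two-level index can also be built directly, without reshaping the data for json_normalize. This is only a minimal sketch: it assumes a shortened, ASCII-only stand-in for the question's data, and the level names orig_name and other_name are made up for the example.

```python
import pandas as pd

# Shortened stand-in for the question's data (ASCII names, two scores only)
data = [
    {"caterpillar": [("Caterpillar", {"fuzz": 0.82, "hamming": 0.98}),
                     ("caterpillaR", {"fuzz": 0.73, "hamming": 0.97})]},
    {"gibson": [("giBson", {"fuzz": 0.83, "hamming": 0.99}),
                ("gibsoN", {"fuzz": 0.83, "hamming": 0.99}),
                ("glbson", {"fuzz": 0.67, "hamming": 0.98})]},
]

# Collect one (orig_name, other_name) key and one score dict per match
keys, rows = [], []
for entry in data:
    for orig_name, matches in entry.items():
        for other_name, scores in matches:
            keys.append((orig_name, other_name))
            rows.append(scores)

# Hypothetical level names, chosen for this example only
index = pd.MultiIndex.from_tuples(keys, names=["orig_name", "other_name"])
df = pd.DataFrame(rows, index=index)

print(df)
print(df.loc["gibson"])  # all matches for one original name
```

Because the index is built from a flat list of tuples, it does not matter that each original name has a different number of matches.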


Plot secondary x_axis in ggplot

Dear all,
I hope you are doing well. I have a data set for which I would like to plot a secondary x-axis in ggplot. I have not been able to get it to work after four hours of trying. Below is my data set.
Pathway ES NES p_value q_value Group
1 HALLMARK_HYPOXIA 0.49 2.25 0.000 0.000 Top
2 HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 0.44 2.00 0.000 0.000 Top
3 HALLMARK_UV_RESPONSE_DN 0.45 1.98 0.000 0.000 Top
4 HALLMARK_TGF_BETA_SIGNALING 0.48 1.77 0.003 0.004 Top
5 HALLMARK_HEDGEHOG_SIGNALING 0.52 1.76 0.003 0.003 Top
6 HALLMARK_ESTROGEN_RESPONSE_EARLY 0.38 1.73 0.000 0.004 Top
7 HALLMARK_KRAS_SIGNALING_DN 0.37 1.69 0.000 0.005 Top
8 HALLMARK_INTERFERON_ALPHA_RESPONSE 0.37 1.54 0.009 0.021 Top
9 HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.32 1.45 0.005 0.048 Top
10 HALLMARK_NOTCH_SIGNALING 0.42 1.42 0.070 0.059 Top
11 HALLMARK_COAGULATION 0.32 1.39 0.031 0.067 Top
12 HALLMARK_MITOTIC_SPINDLE 0.30 1.37 0.025 0.078 Top
13 HALLMARK_ANGIOGENESIS 0.40 1.37 0.088 0.074 Top
14 HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.35 1.23 0.173 0.216 Top
15 HALLMARK_OXIDATIVE_PHOSPHORYLATION -0.65 -3.43 0.000 0.000 Bottom
16 HALLMARK_MYC_TARGETS_V1 -0.49 -2.56 0.000 0.000 Bottom
17 HALLMARK_E2F_TARGETS -0.45 -2.37 0.000 0.000 Bottom
18 HALLMARK_DNA_REPAIR -0.46 -2.33 0.000 0.000 Bottom
19 HALLMARK_ADIPOGENESIS -0.42 -2.26 0.000 0.000 Bottom
20 HALLMARK_FATTY_ACID_METABOLISM -0.41 -2.06 0.000 0.000 Bottom
21 HALLMARK_PEROXISOME -0.43 -2.01 0.000 0.000 Bottom
22 HALLMARK_MYC_TARGETS_V2 -0.43 -1.84 0.003 0.001 Bottom
23 HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.42 -1.83 0.003 0.001 Bottom
24 HALLMARK_ALLOGRAFT_REJECTION -0.34 -1.78 0.000 0.003 Bottom
25 HALLMARK_MTORC1_SIGNALING -0.32 -1.67 0.000 0.004 Bottom
26 HALLMARK_P53_PATHWAY -0.29 -1.52 0.000 0.015 Bottom
27 HALLMARK_UV_RESPONSE_UP -0.28 -1.41 0.013 0.036 Bottom
28 HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.35 -1.39 0.057 0.040 Bottom
29 HALLMARK_HEME_METABOLISM -0.26 -1.34 0.014 0.061 Bottom
30 HALLMARK_G2M_CHECKPOINT -0.23 -1.20 0.080 0.172 Bottom
I would like to produce a plot like the following one (plot #1).
Here is my current code:
ggplot(data, aes(reorder(Pathway, NES), NES, fill = Group)) +
  theme_classic() +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8),
        axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(face = "bold", size = 8),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Pathway", y = "Normalized Enrichment Score",
       title = "2Gy_5f vs. 0Gy") +
  coord_flip()
This code produces the following plot (plot #2).
So I would like to generate a plot that has a secondary axis showing q_value (as in the first bar plot I attached). Any help is greatly appreciated. Note: I used coord_flip, so the x and y axes are swapped.
Kind Regards,
synat
[1]: https://i.stack.imgur.com/dBFIS.jpg
[2]: https://i.stack.imgur.com/yDbC5.jpg
Maybe you don't need a secondary axis per se to get the plot style you seek.
library(tidyverse)
ggplot(data, aes(x = NES, y = reorder(Pathway, NES), fill = Group)) +
  theme_classic() +
  geom_col() +
  # Print each q_value as text to the right of the bars
  geom_text(aes(x = 2.5, y = reorder(Pathway, NES), label = q_value), hjust = 0) +
  # Header for the q_value text, one row above the top bar
  annotate("text", x = 2.5, y = length(data$Pathway) + 1, hjust = 0, fontface = "bold", label = "q_value") +
  coord_cartesian(xlim = c(NA, 3),
                  ylim = c(NA, length(data$Pathway) + 1),
                  clip = "off") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8),
        axis.title = element_text(face = "bold", size = 12),
        axis.text = element_text(face = "bold", size = 8),
        plot.title = element_text(hjust = 0.5)) +
  # NES is mapped to x and Pathway to y here (no coord_flip), so the labels are swapped accordingly
  labs(x = "Normalized Enrichment Score", y = "Pathway",
       title = "2Gy_5f vs. 0Gy")
And for future reference you can read in data in the format you pasted like so:
data <- read_table(
"
Pathway ES NES p_value q_value Group
HALLMARK_HYPOXIA 0.49 2.25 0.000 0.000 Top
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 0.44 2.00 0.000 0.000 Top
HALLMARK_UV_RESPONSE_DN 0.45 1.98 0.000 0.000 Top
HALLMARK_TGF_BETA_SIGNALING 0.48 1.77 0.003 0.004 Top
HALLMARK_HEDGEHOG_SIGNALING 0.52 1.76 0.003 0.003 Top
HALLMARK_ESTROGEN_RESPONSE_EARLY 0.38 1.73 0.000 0.004 Top
HALLMARK_KRAS_SIGNALING_DN 0.37 1.69 0.000 0.005 Top
HALLMARK_INTERFERON_ALPHA_RESPONSE 0.37 1.54 0.009 0.021 Top
HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.32 1.45 0.005 0.048 Top
HALLMARK_NOTCH_SIGNALING 0.42 1.42 0.070 0.059 Top
HALLMARK_COAGULATION 0.32 1.39 0.031 0.067 Top
HALLMARK_MITOTIC_SPINDLE 0.30 1.37 0.025 0.078 Top
HALLMARK_ANGIOGENESIS 0.40 1.37 0.088 0.074 Top
HALLMARK_WNT_BETA_CATENIN_SIGNALING 0.35 1.23 0.173 0.216 Top
HALLMARK_OXIDATIVE_PHOSPHORYLATION -0.65 -3.43 0.000 0.000 Bottom
HALLMARK_MYC_TARGETS_V1 -0.49 -2.56 0.000 0.000 Bottom
HALLMARK_E2F_TARGETS -0.45 -2.37 0.000 0.000 Bottom
HALLMARK_DNA_REPAIR -0.46 -2.33 0.000 0.000 Bottom
HALLMARK_ADIPOGENESIS -0.42 -2.26 0.000 0.000 Bottom
HALLMARK_FATTY_ACID_METABOLISM -0.41 -2.06 0.000 0.000 Bottom
HALLMARK_PEROXISOME -0.43 -2.01 0.000 0.000 Bottom
HALLMARK_MYC_TARGETS_V2 -0.43 -1.84 0.003 0.001 Bottom
HALLMARK_CHOLESTEROL_HOMEOSTASIS -0.42 -1.83 0.003 0.001 Bottom
HALLMARK_ALLOGRAFT_REJECTION -0.34 -1.78 0.000 0.003 Bottom
HALLMARK_MTORC1_SIGNALING -0.32 -1.67 0.000 0.004 Bottom
HALLMARK_P53_PATHWAY -0.29 -1.52 0.000 0.015 Bottom
HALLMARK_UV_RESPONSE_UP -0.28 -1.41 0.013 0.036 Bottom
HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY -0.35 -1.39 0.057 0.040 Bottom
HALLMARK_HEME_METABOLISM -0.26 -1.34 0.014 0.061 Bottom
HALLMARK_G2M_CHECKPOINT -0.23 -1.20 0.080 0.172 Bottom")
Created on 2021-11-23 by the reprex package (v2.0.1)

Apply transformation to masked dataframe

I have this matrix df.head():
0 1 2 3 4 5 6 7 8 9 ... 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857
0 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 30.88689 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.00000 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 42.43819 0.0 0.0 0.0 0.00000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 rows × 1858 columns
I need to apply a transformation to it: every value other than 0.0 should be divided by 0.32.
So far I have the mask, like so:
normalize = 0.32
mask = (df>=0.0)
df = df.where(mask)
How do I apply such a transformation on a very large dataframe, after masking it?
You don't need a mask; just divide your dataframe by 0.32. Zeros stay zero, because 0 / 0.32 is still 0.
df / 0.32
>>> df
A B
0 0 3
1 5 0
>>> df / 0.32
A B
0 0.000 9.375
1 15.625 0.000
If you do need to use a mask, try this (values where the mask is True are kept, all others are replaced by df/0.32):
mask = df.eq(0)
df.where(mask, df/0.32)
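To check that the two approaches above agree, here is a small self-contained sketch. The two-column frame is a toy stand-in for the large matrix in the question, and the names plain and masked are chosen for this example only.

```python
import pandas as pd

normalize = 0.32

# Toy stand-in for the large, mostly-zero matrix in the question
df = pd.DataFrame({"A": [0.0, 5.0], "B": [3.0, 0.0]})

# Plain division: zeros are unaffected because 0 / 0.32 == 0
plain = df / normalize

# Masked version: keep cells equal to 0, replace the rest with df / normalize
masked = df.where(df.eq(0), df / normalize)

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(plain, masked)
print(masked)
```

Both expressions are vectorized, so they should behave the same way on a frame with thousands of columns; the mask only becomes necessary when the untouched cells must keep a value that plain division would change.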

How to create a pandas dataframe from csv where one column contains nested dictionary?

I have a CSV file and in one column there is a nested dictionary with the values of classification report, in a format like this one:
{'A': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 60},
'B': {'precision': 0.42, 'recall': 0.09, 'f1-score': 0.14, 'support': 150},
'micro avg': {'precision': 0.31, 'recall': 0.31, 'f1-score': 0.31, 'support': 1710},
'macro avg': {'precision': 0.13, 'recall': 0.08, 'f1-score': 0.071, 'support': 1710},
'weighted avg': {'precision': 0.29, 'recall': 0.31, 'f1-score': 0.26, 'support': 1710}}
I would like each combination of outer key and inner key (for example A_precision) to become a column in a data frame. Is it possible to get the following result?
A_precision A_recall ...weighted_avg_precision weighted_avg_recall weighted_avg_f1-score weighted avg_support
0.0 0.0 0.29 0.31 0.26 1710
Thank you
You can use pd.json_normalize on that dictionary:
import pandas as pd

dct = {
    "A": {"precision": 0.0, "recall": 0.0, "f1-score": 0.0, "support": 60},
    "B": {"precision": 0.42, "recall": 0.09, "f1-score": 0.14, "support": 150},
    "micro avg": {
        "precision": 0.31,
        "recall": 0.31,
        "f1-score": 0.31,
        "support": 1710,
    },
    "macro avg": {
        "precision": 0.13,
        "recall": 0.08,
        "f1-score": 0.071,
        "support": 1710,
    },
    "weighted avg": {
        "precision": 0.29,
        "recall": 0.31,
        "f1-score": 0.26,
        "support": 1710,
    },
}
# sep="_" joins the outer and inner keys, e.g. "A" + "precision" -> "A_precision"
df = pd.json_normalize(dct, sep="_")
print(df)
Prints:
A_precision A_recall A_f1-score A_support B_precision B_recall B_f1-score B_support micro avg_precision micro avg_recall micro avg_f1-score micro avg_support macro avg_precision macro avg_recall macro avg_f1-score macro avg_support weighted avg_precision weighted avg_recall weighted avg_f1-score weighted avg_support
0 0.0 0.0 0.0 60 0.42 0.09 0.14 150 0.31 0.31 0.31 1710 0.13 0.08 0.071 1710 0.29 0.31 0.26 1710
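When the dictionary comes from a CSV column, it arrives as a string and has to be parsed before json_normalize can flatten it. The sketch below assumes the column holds Python-literal text (single-quoted keys, as in the question), so ast.literal_eval can read it. The column names model and report and the in-memory CSV are invented for the example.

```python
import ast
import io

import pandas as pd

# Build a small CSV in memory; "model" and "report" are hypothetical column names
source = pd.DataFrame({
    "model": ["m1", "m2"],
    "report": [
        str({"A": {"precision": 0.0, "recall": 0.0},
             "weighted avg": {"precision": 0.29, "support": 1710}}),
        str({"A": {"precision": 0.5, "recall": 0.25},
             "weighted avg": {"precision": 0.4, "support": 1710}}),
    ],
})
buffer = io.StringIO()
source.to_csv(buffer, index=False)
buffer.seek(0)

raw = pd.read_csv(buffer)  # the report column is plain text at this point

# Parse each cell back into a nested dict, then flatten all rows at once
parsed = raw["report"].apply(ast.literal_eval)
flat = pd.json_normalize(parsed.tolist(), sep="_")

# Put the flattened columns next to the remaining CSV columns
result = raw.drop(columns="report").join(flat)
print(result)
```

If the column contained strict JSON (double-quoted keys) instead, json.loads would be the parser to use in place of ast.literal_eval.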

Calculate ratio using groupby

I'm new to Python.
I created this dataframe:
d2 = {'id': ['x2', 'x2', 'x2', 'x2', 'x3', 'x3', 'x3'],
      'cod': [101001, 101001, 101001, 101001, 101002, 101002, 101002],
      'flag': ['IN', 'IN', 'IN', 'CMP', 'IN', 'OUT', 'CMP'],
      'col': [100, 100, 100, 300, 100, 300, 100]}
df2 = pd.DataFrame(data=d2)
I want to calculate a ratio, sum(col where flag is 'IN') / sum(all col), grouped by id and cod.
The expected output should be:
d2 = {'id': ['x2', 'x2', 'x2', 'x2', 'x3', 'x3', 'x3'],
      'cod': [101001, 101001, 101001, 101001, 101002, 101002, 101002],
      'flag': ['IN', 'IN', 'IN', 'CMP', 'IN', 'OUT', 'CMP'],
      'col': [0.5, 0.5, 0.5, 0.5, 0.2, 0.2, 0.2]}
df2 = pd.DataFrame(data=d2)
Please tell me if I'm not clear. Thank you.
First replace the non-matching values with 0 using Series.where, then aggregate with sum via transform, and finally divide the columns:
df3 = (df2.assign(new = df2['col'].where(df2['flag'].eq('IN'), 0))
          .groupby(['id','cod'])[['col', 'new']]
          .transform('sum'))
df2['rat'] = df3['new'].div(df3['col'])
print (df2)
id cod flag col rat
0 x2 101001 IN 100 0.5
1 x2 101001 IN 100 0.5
2 x2 101001 IN 100 0.5
3 x2 101001 CMP 300 0.5
4 x3 101002 IN 100 0.2
5 x3 101002 OUT 300 0.2
6 x3 101002 CMP 100 0.2
You could create a temporary column (new) and combine it with groupby and transform to get the ratio for each row (this assumes numpy is imported as np):
(df2
 .assign(
     new = np.where(df2.flag == "IN", df2.col, 0),
     ratio = lambda df: df.groupby(['id', 'cod'])
                          .pipe(lambda g: g['new']
                                          .transform('sum')
                                          .div(g['col'].transform('sum'))
                               )
 )
)
id cod flag col new ratio
0 x2 101001 IN 100 100 0.5
1 x2 101001 IN 100 100 0.5
2 x2 101001 IN 100 100 0.5
3 x2 101001 CMP 300 0 0.5
4 x3 101002 IN 100 100 0.2
5 x3 101002 OUT 300 0 0.2
6 x3 101002 CMP 100 0 0.2
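Another option is to overwrite col directly with a transform that looks up the flag for the rows of each group:
df2["col"] = df2.groupby(["id", "cod"], as_index=False)["col"].transform(
    lambda x: x[df2.loc[x.index, "flag"] == "IN"].sum() / x.sum(),
)
print(df2)
Prints: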
id cod flag col
0 x2 101001 IN 0.5
1 x2 101001 IN 0.5
2 x2 101001 IN 0.5
3 x2 101001 CMP 0.5
4 x3 101002 IN 0.2
5 x3 101002 OUT 0.2
6 x3 101002 CMP 0.2
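A different design is to compute one row per (id, cod) group first and then merge the ratio back onto the original rows. This is only a sketch of that alternative, not a replacement for the answers above. The helper column in_col and the names summary and out are made up for the example.

```python
import pandas as pd

df2 = pd.DataFrame({
    "id": ["x2", "x2", "x2", "x2", "x3", "x3", "x3"],
    "cod": [101001, 101001, 101001, 101001, 101002, 101002, 101002],
    "flag": ["IN", "IN", "IN", "CMP", "IN", "OUT", "CMP"],
    "col": [100, 100, 100, 300, 100, 300, 100],
})

# One row per group: the sum of the 'IN' values and the sum of all values
summary = (
    df2.assign(in_col=df2["col"].where(df2["flag"].eq("IN"), 0))
       .groupby(["id", "cod"], as_index=False)
       .agg(in_sum=("in_col", "sum"), total=("col", "sum"))
)
summary["ratio"] = summary["in_sum"] / summary["total"]

# Broadcast the per-group ratio back to every original row
out = df2.merge(summary[["id", "cod", "ratio"]], on=["id", "cod"], how="left")
print(summary)
print(out)
```

The intermediate summary table is handy when the per-group numbers are also needed on their own; when only the per-row ratio matters, the transform-based versions avoid the merge.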

How to use if conditions in Pandas?

I am working with pandas and I have four columns:
Name Sensex_index Start_Date End_Date
AAA 0.5 20/08/2016 25/09/2016
AAA 0.8 26/08/2016 29/08/2016
AAA 0.4 30/08/2016 31/08/2016
AAA 0.9 01/09/2016 05/09/2016
AAA 0.5 12/09/2016 22/09/2016
AAA 0.3 24/09/2016 29/09/2016
ABC 0.9 01/01/2017 15/01/2017
ABC 0.5 23/01/2017 30/01/2017
ABC 0.7 02/02/2017 15/03/2017
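Within each name, when the Sensex index drops from a higher value to a lower one and then rises again, the current period terminates at the End_Date of that low row, and a new period starts on the next row. A period also ends at the last row of each name. For example, I am looking for the following output: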
Name Sensex_index Actual_Start Termination_Date
AAA 0.5 20/08/2016 31/08/2016
AAA 0.8 20/08/2016 31/08/2016
AAA 0.4 20/08/2016 31/08/2016 [high to low; low to high,terminate]
AAA 0.9 01/09/2016 29/09/2016
AAA 0.5 01/09/2016 29/09/2016
AAA 0.3 01/09/2016 29/09/2016 [end of AAA]
ABC 0.9 01/01/2017 30/01/2017
ABC 0.5 01/01/2017 30/01/2017 [high to low; low to high,terminate]
ABC 0.7 02/02/2017 15/03/2017 [end of ABC]
import numpy as np
import pandas as pd

# Setup
df = pd.DataFrame(data=[['AAA', 0.5, '20/08/2016', '25/09/2016'],
                        ['AAA', 0.8, '26/08/2016', '29/08/2016'],
                        ['AAA', 0.4, '30/08/2016', '31/08/2016'],
                        ['AAA', 0.9, '01/09/2016', '05/09/2016'],
                        ['AAA', 0.5, '12/09/2016', '22/09/2016'],
                        ['AAA', 0.3, '24/09/2016', '29/09/2016'],
                        ['ABC', 0.9, '01/01/2017', '15/01/2017'],
                        ['ABC', 0.5, '23/01/2017', '30/01/2017'],
                        ['ABC', 0.7, '02/02/2017', '15/03/2017']],
                  columns=['Name', 'Sensex_index', 'Start_Date', 'End_Date'])
# Find the rows where the index changes from high to low and then back to high.
# raw=True passes a NumPy array, so y[0], y[1], y[2] are positional.
df['change'] = df.groupby('Name')['Sensex_index'].transform(
    lambda x: x.rolling(3, center=True).apply(
        lambda y: float(y[1] < y[0] and y[1] < y[2]), raw=True))
# Also flag the last row for each name
df.loc[df.groupby('Name').tail(1).index, 'change'] = 1.0
# Set End_Date as Termination_Date for those changing points
df['Termination_Date'] = df['End_Date'].where(df['change'] > 0)
# Set Actual_Start (relies on the default 0..n-1 index)
df['Actual_Start'] = df.apply(lambda x: x.Start_Date if (x.name == 0
                                                         or x.Name != df.iloc[x.name-1]['Name']
                                                         or df.iloc[x.name-1]['change'] > 0)
                              else np.nan, axis=1)
# Back fill the Termination_Date for other rows.
df['Termination_Date'] = df['Termination_Date'].bfill()
# Forward fill the Actual_Start for other rows.
df['Actual_Start'] = df['Actual_Start'].ffill()
print(df)
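The same idea can also be sketched without row-wise apply, by giving every row a period number and taking the first Start_Date and last End_Date per period. This is a hedged alternative, not the method above. It assumes rows are already ordered within each name, and the helper names is_low, is_last, period_end and period_id are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    [["AAA", 0.5, "20/08/2016", "25/09/2016"],
     ["AAA", 0.8, "26/08/2016", "29/08/2016"],
     ["AAA", 0.4, "30/08/2016", "31/08/2016"],
     ["AAA", 0.9, "01/09/2016", "05/09/2016"],
     ["AAA", 0.5, "12/09/2016", "22/09/2016"],
     ["AAA", 0.3, "24/09/2016", "29/09/2016"],
     ["ABC", 0.9, "01/01/2017", "15/01/2017"],
     ["ABC", 0.5, "23/01/2017", "30/01/2017"],
     ["ABC", 0.7, "02/02/2017", "15/03/2017"]],
    columns=["Name", "Sensex_index", "Start_Date", "End_Date"],
)

s = df["Sensex_index"]
by_name = df.groupby("Name")["Sensex_index"]

# A row is a low point if it is below both neighbours within the same name
# (comparisons against the NaN at group edges evaluate to False)
is_low = (s < by_name.shift(1)) & (s < by_name.shift(-1))
# The last row of each name also closes a period
is_last = ~df.duplicated(subset="Name", keep="last")
period_end = is_low | is_last

# A new period starts on the row after each period end
period_id = period_end.shift(fill_value=False).cumsum()

df["Actual_Start"] = df.groupby(period_id)["Start_Date"].transform("first")
df["Termination_Date"] = df.groupby(period_id)["End_Date"].transform("last")
print(df[["Name", "Sensex_index", "Actual_Start", "Termination_Date"]])
```

Numbering the periods once means both output columns come from ordinary groupby transforms, which avoids the back-fill and forward-fill steps.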