I am trying to assign values to a MultiIndex-column dataframe from another, ordinary dataframe. The two dataframes share the same index; however, when I assign all columns of the ordinary dataframe to a slice of the MultiIndex dataframe, NaN values appear.
MWE
import pandas as pd
df = pd.DataFrame.from_dict(
    {
        ("old", "mean"): {"high": 0.0, "med": 0.0, "low": 0.0},
        ("old", "std"): {"high": 0.0, "med": 0.0, "low": 0.0},
        ("new", "mean"): {"high": 0.0, "med": 0.0, "low": 0.0},
        ("new", "std"): {"high": 0.0, "med": 0.0, "low": 0.0},
    }
)
temp = pd.DataFrame.from_dict(
    {
        "old": {
            "high": 2.6798302797288174,
            "med": 10.546654056177656,
            "low": 16.46382603916123,
        },
        "new": {
            "high": 15.91881231611413,
            "med": 16.671967271277495,
            "low": 26.17872356316402,
        },
    }
)
df.loc[:, (slice(None), "mean")] = temp
print(df)
Output:
old new
mean std mean std
high NaN 0.0 NaN 0.0
med NaN 0.0 NaN 0.0
low NaN 0.0 NaN 0.0
Is this expected behaviour, or am I doing something horrible that I am not supposed to?
Create a MultiIndex in temp so the data aligns, and then you can set the new values with DataFrame.update:
temp.columns = pd.MultiIndex.from_product([temp.columns, ['mean']])
print (temp)
old new
mean mean
high 2.679830 15.918812
med 10.546654 16.671967
low 16.463826 26.178724
df.update(temp)
print(df)
old new
mean std mean std
high 2.679830 0.0 15.918812 0.0
med 10.546654 0.0 16.671967 0.0
low 16.463826 0.0 26.178724 0.0
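An alternative sketch, if you prefer to assign by position rather than align by label: pulling the values out of temp with to_numpy() bypasses column alignment entirely. This assumes the selected slice has the same shape and column order as temp ("old" before "new", index high/med/low):
# Sketch: positional assignment, no label alignment and therefore no NaNs.
df.loc[:, (slice(None), "mean")] = temp.to_numpy()
print(df)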
I have a CSV file, and one of its columns contains a nested dictionary with the values of a classification report, in a format like this one:
{'A': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 60},
'B': {'precision': 0.42, 'recall': 0.09, 'f1-score': 0.14, 'support': 150},
'micro avg': {'precision': 0.31, 'recall': 0.31, 'f1-score': 0.31, 'support': 1710},
'macro avg': {'precision': 0.13, 'recall': 0.08, 'f1-score': 0.071, 'support': 1710},
'weighted avg': {'precision': 0.29, 'recall': 0.31, 'f1-score': 0.26, 'support': 1710}}
I would like to get each outer key combined with its nested key (key_nestedkey) as a column in a data frame. So, is it possible to get the following result?
A_precision A_recall ... weighted_avg_precision weighted_avg_recall weighted_avg_f1-score weighted_avg_support
0.0 0.0 0.29 0.31 0.26 1710
Thank you
You can use pd.json_normalize on that dictionary:
dct = {
    "A": {"precision": 0.0, "recall": 0.0, "f1-score": 0.0, "support": 60},
    "B": {"precision": 0.42, "recall": 0.09, "f1-score": 0.14, "support": 150},
    "micro avg": {
        "precision": 0.31,
        "recall": 0.31,
        "f1-score": 0.31,
        "support": 1710,
    },
    "macro avg": {
        "precision": 0.13,
        "recall": 0.08,
        "f1-score": 0.071,
        "support": 1710,
    },
    "weighted avg": {
        "precision": 0.29,
        "recall": 0.31,
        "f1-score": 0.26,
        "support": 1710,
    },
}
df = pd.json_normalize(dct, sep="_")
print(df)
Prints:
A_precision A_recall A_f1-score A_support B_precision B_recall B_f1-score B_support micro avg_precision micro avg_recall micro avg_f1-score micro avg_support macro avg_precision macro avg_recall macro avg_f1-score macro avg_support weighted avg_precision weighted avg_recall weighted avg_f1-score weighted avg_support
0 0.0 0.0 0.0 60 0.42 0.09 0.14 150 0.31 0.31 0.31 1710 0.13 0.08 0.071 1710 0.29 0.31 0.26 1710
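If, as in the original CSV, the dictionaries arrive as strings in one column, a minimal sketch is to parse each string and normalize the whole column at once; the file name "results.csv" and column name "report" below are assumptions for illustration:
import ast
import pandas as pd

raw = pd.read_csv("results.csv")                    # hypothetical file
parsed = raw["report"].apply(ast.literal_eval)      # hypothetical column holding dict strings
flat = pd.json_normalize(parsed.tolist(), sep="_")  # one flattened row per original row
out = pd.concat([raw.drop(columns="report"), flat], axis=1)
print(out.head())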
I have a Pandas DataFrame that I need to:
group by the ID column (not in index)
forward fill rows to the right with the previous value (multiple columns) only if it's not a NaN (np.nan)
For each ID value and each metric column (see the aX columns in the examples below) there is only one value; the remaining rows for that ID are NaN (np.nan).
Take this as an example:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: my_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
...: {"id": 1, "a1": np.nan, "a2": np.nan, "a3": 80.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
...: ])
In [4]: my_df.head(len(my_df))
Out[4]:
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 NaN NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
I have many more columns like a1 to a4.
I would like to:
treat np.nan as zero (0.0) when the same column, in a different row with the same ID, holds a number, so the rows can be summed together with groupby and an aggregation function
forward fill to the right within the same (now unique) row per ID, but only if a column further to the left held a number
So basically in the example this means that:
for ID 1, "a2" = 100.0
for ID 20, "a1" and "a2" both stay np.nan
See here:
In [5]: wanted_df = pd.DataFrame([
...: {"id": 1, "a1": 100.0, "a2": 100.0, "a3": 80.0, "a4": 90.0},
...: {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": 30.0},
...: ])
In [6]: wanted_df.head(len(wanted_df))
Out[6]:
id a1 a2 a3 a4
0 1 100.0 100.0 80.0 90.0
1 20 NaN NaN 100.0 30.0
In [7]:
The forward filling to the right should apply to multiple columns in the same row, not only to the closest column to the right.
When I use my_df.interpolate(method='pad', axis=1, limit=None, limit_direction='forward', limit_area=None, downcast=None), I still get multiple rows for the same ID.
When I use my_df.groupby("id").sum(), I see 0.0 everywhere rather than NaN being retained in the scenarios defined above.
When I use my_df.groupby("id").apply(np.sum), the ID column is summed as well, which is wrong, as it should be retained.
How do I do this?
One idea is to use sum with min_count=1, so that a group whose values are all missing stays NaN instead of becoming 0.0:
df = my_df.groupby("id").sum(min_count=1)
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
Or, if you need the first non-missing value, you can use GroupBy.first:
df = my_df.groupby("id").first()
print (df)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
It is more problematic if there are multiple non-missing values per group and you need all of them:
#added 20 to a1
my_df = pd.DataFrame([
    {"id": 1, "a1": 100.0, "a2": np.nan, "a3": np.nan, "a4": 90.0},
    {"id": 1, "a1": 20, "a2": np.nan, "a3": 80.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": 100.0, "a4": np.nan},
    {"id": 20, "a1": np.nan, "a2": np.nan, "a3": np.nan, "a4": 30.0},
])
print (my_df)
id a1 a2 a3 a4
0 1 100.0 NaN NaN 90.0
1 1 20.0 NaN 80.0 NaN
2 20 NaN NaN 100.0 NaN
3 20 NaN NaN NaN 30.0
def f(x):
    return x.apply(lambda x: pd.Series(x.dropna().to_numpy()))

df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0
The first and second solutions behave differently here:
df2 = my_df.groupby("id").sum(min_count=1)
print (df2)
a1 a2 a3 a4
id
1 120.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
df3 = my_df.groupby("id").first()
print (df3)
a1 a2 a3 a4
id
1 100.0 NaN 80.0 90.0
20 NaN NaN 100.0 30.0
If all the values are of the same type (here, numbers), it is also possible to use the following:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
f = lambda x: (pd.DataFrame(justify(x.to_numpy(),
                                    invalid_val=np.nan,
                                    axis=0,
                                    side='up'),
                            columns=my_df.columns.drop('id'))
               .dropna(how='all'))
df1 = (my_df.set_index('id')
            .groupby("id")
            .apply(f)
            .reset_index(level=1, drop=True)
            .reset_index())
print (df1)
id a1 a2 a3 a4
0 1 100.0 NaN 80.0 90.0
1 1 20.0 NaN NaN NaN
2 20 NaN NaN 100.0 30.0
I think I'm following the instructions to a T, but I still get an error I don't understand. I have a DataFrame and a Series, both with the same MultiIndex consisting of the levels "Woche" and "cluster":
DataFrame "weekly":
Cat Base Major
Woche cluster
18w46 0 9.0 NaN
D 5.0 NaN
E 35.0 NaN
F 7.0 50.0
G 80.0 15.0
Series "df2":
Woche cluster
18w46 0 9
D 4
E 1
F 5
G 94
Name: Bruch, dtype: int64
weekly = weekly.join(df2)
gives this error: TypeError: cannot append a non-category item to a CategoricalIndex.
I don't get it. weekly.index.is_categorical() and df2.index.is_categorical() both yield False.
What am I doing wrong?
The problem might be that weekly.columns -- the columns, not the index -- is a CategoricalIndex.
For example,
import numpy as np
import pandas as pd
nan = np.nan
weekly = pd.DataFrame(
    {
        "Woche": ["18w46", "18w46", "18w46", "18w46", "18w46"],
        "cluster": ["0", "D", "E", "F", "G"],
        "Base": [9.0, 5.0, 35.0, 7.0, 80.0],
        "Major": [nan, nan, nan, 50.0, 15.0],
    }
).set_index(["Woche", "cluster"])
weekly.columns = pd.CategoricalIndex(weekly.columns)

df2 = pd.DataFrame(
    {
        "Woche": ["18w46", "18w46", "18w46", "18w46", "18w46"],
        "cluster": ["0", "D", "E", "F", "G"],
        "Bruch": [9, 4, 1, 5, 94],
    }
).set_index(["Woche", "cluster"])["Bruch"]
weekly.join(df2)
raises TypeError: cannot append a non-category item to a CategoricalIndex.
If weekly.columns.is_categorical() is True, the problem could be avoided by making weekly.columns a regular pd.Index:
weekly.columns = weekly.columns.tolist()
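As a quick check on the reproduction above, converting the columns back to a plain Index and re-running the join goes through without the TypeError (a sketch using the weekly and df2 built above):
weekly.columns = weekly.columns.tolist()  # plain Index instead of CategoricalIndex
print(weekly.join(df2))                   # Bruch is added as a new column, no TypeError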
import pandas as pd
import numpy as np
d = {
    'Fruit':['Guava','Orange','Lemon'],
    'ID1':[1,2,11],
    'ID2':[3,4,12],
    'ID3':[5,6,np.nan],
    'ID4':[7,8,14],
    'ID5':[9,10,np.nan],
    'ID6':[11,np.nan,np.nan],
    'ID7':[13,np.nan,np.nan],
    'ID8':[15,np.nan,np.nan],
    'ID9':[17,np.nan,np.nan],
    'Category':['Myrtaceae','Citrus','Citrus']
}
df = pd.DataFrame(data = d)
df
How can I convert the above dataframe to the following dictionary?
Expected Output:
{
 'Myrtaceae': {'Guava': {1, 3, 5, 7, 9, 11, 13, 15, 17}},
 'Citrus': {'Orange': {2, 4, 6, 8, 10, np.nan, np.nan, np.nan, np.nan},
            'Lemon': {11, 12, np.nan, 14, np.nan, np.nan, np.nan, np.nan, np.nan}},
}
And how can I convert the dictionary back to a dataframe?
Use a dictionary comprehension with groupby:
d = {k: v.set_index('Fruit').T.to_dict('list')
for k, v in df.set_index('Category').groupby(level=0)}
print (d)
{'Citrus': {'Orange': [2.0, 4.0, 6.0, 8.0, 10.0, nan, nan, nan, nan],
'Lemon': [11.0, 12.0, nan, 14.0, nan, nan, nan, nan, nan]},
'Myrtaceae': {'Guava': [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0]}}
Or:
d = {k: v.drop('Category', axis=1).set_index('Fruit').T.to_dict('list')
for k, v in df.groupby('Category')}
And then:
df = (pd.concat({k: pd.DataFrame(v) for k, v in d.items()}, axis=1)
        .T
        .rename_axis(('Category','Fruit'))
        .rename(columns=lambda x: f'ID{x+1}')
        .reset_index())
print (df)
Category Fruit ID1 ID2 ID3 ID4 ID5 ID6 ID7 ID8 ID9
0 Citrus Orange 2.0 4.0 6.0 8.0 10.0 NaN NaN NaN NaN
1 Citrus Lemon 11.0 12.0 NaN 14.0 NaN NaN NaN NaN NaN
2 Myrtaceae Guava 1.0 3.0 5.0 7.0 9.0 11.0 13.0 15.0 17.0
I am trying to do sentiment analysis on tweets using SentimentIntensityAnalyzer() from nltk.sentiment.vader:
sid = SentimentIntensityAnalyzer()
listy = []
for index, row in data.iterrows():
    ss = sid.polarity_scores(row["Tweets"])
    listy.append(ss)
se = pd.Series(listy)
data['polarity'] = se.values
display(data.head(100))
This is the resulting DataFrame:
Tweets polarity
0 RT #spectatorindex: Facebook controls:\n\n- Wh... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
1 RT #YAATeamWest: Today we're at #BradfordUniSU... {'neg': 0.0, 'neu': 0.902, 'pos': 0.098, 'comp...
2 #SachinTendulkar launches India’s first Multip... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
3 How To Create a 360 Render (And How to Improv... {'neg': 0.0, 'neu': 0.722, 'pos': 0.278, 'comp...
4 The Most Disturbing Virtual Reality You Will E... {'neg': 0.174, 'neu': 0.826, 'pos': 0.0, 'comp...
5 VR Training for Troops 🎮\n\n... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
6 RT #DefenceHQ: The #BritishArmy has awarded a ... {'neg': 0.0, 'neu': 0.847, 'pos': 0.153, 'comp...
7 RT #UofGHumanities: #UofGCSPE Humanities Lectu... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
8 RT #OyezServices: Ever wanted a tour of Machu ... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
9 RT #ProjectDastaan: We are an Oxford Universit... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
10 RT #Paula_Piccard: Virtual reality will change... {'neg': 0.0, 'neu': 0.878, 'pos': 0.122, 'comp...
In order to do statistical analysis on the 'neg', 'pos', 'neu' and 'compound' entries in the polarity column, I wanted to split the data into four different columns. To achieve this I used:
list_pos= []
list_neg = []
list_comp = []
list_neu = []
for index, row in data.iterrows():
    list_pos.append(row['polarity']['pos'])
    list_neg.append(row['polarity']['neg'])
    list_comp.append(row['polarity']['compound'])
    list_neu.append(row['polarity']['neu'])
se_pos = pd.Series(list_pos)
se_neg = pd.Series(list_neg)
se_comp = pd.Series(list_comp)
se_neu = pd.Series(list_neu)
data['positive'] = se_pos.values
data['negative'] = se_neg.values
data['compound'] = se_comp.values
data['neutral'] = se_neu.values
The resulting DataFrame:
Tweets polarity positive negative compound neutral
0 RT #spectatorindex: Facebook controls:\n\n- Wh... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... 0.000 0.000 0.0000 1.000
1 RT #YAATeamWest: Today we're at #BradfordUniSU... {'neg': 0.0, 'neu': 0.902, 'pos': 0.098, 'comp... 0.098 0.000 0.3612 0.902
2 #SachinTendulkar launches India’s first Multip... {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound... 0.000 0.000 0.0000 1.000
Is there a more concise way of achieving a similar DataFrame? Using a lambda function perhaps? Thanks for the help!
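A minimal sketch of a more concise approach, assuming data['Tweets'] and sid are defined as in the question: build the score dicts with apply instead of iterrows, then expand the dicts into their own columns in one step.
# Sketch: one dict of scores per tweet, without an explicit loop.
data['polarity'] = data['Tweets'].apply(sid.polarity_scores)
# Expand the dicts into columns, keeping the original index alignment.
scores = pd.DataFrame(data['polarity'].tolist(), index=data.index)
scores = scores.rename(columns={'pos': 'positive', 'neg': 'negative', 'neu': 'neutral'})
data = data.join(scores)
print(data.head())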