Suppose I have a simple pd.DataFrame like so:
d = {'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]}
df = pd.DataFrame(data=d)
df
   col1  col2  col3
0     1     3     5
1    20    40    50
Is there a way to convert this to a nested pandas DataFrame (df_new), so that when I call df_new.values[0] I get as output:
array([0     1
1     3
2     5
Length: 3, dtype: int64], dtype=object)
I'm still not sure I understand the exact requirement, but one way of getting the desired output is this:
>>> pd.Series(df.T[0].values)
0 1
1 3
2 5
dtype: int64
If you want to have these as 2d arrays:
>>> np.array(pd.DataFrame(df.T[0].values).reset_index())
array([[0, 1],
[1, 3],
[2, 5]])
>>> np.array(pd.DataFrame(df.T[1].values).reset_index())
array([[ 0, 20],
[ 1, 40],
[ 2, 50]])
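To build the whole nested frame at once rather than one row at a time, each row can be collected into a Series and those Series stored as cells of a one-column frame. A sketch; `df_new` and the `'rows'` column name are my own choices:

```python
import pandas as pd

d = {'col1': [1, 20], 'col2': [3, 40], 'col3': [5, 50]}
df = pd.DataFrame(data=d)

# Collect each row into a Series and store those Series as cells of a
# one-column frame, so df_new.values[0] yields an object array whose
# single element is the first row as a Series.
df_new = pd.DataFrame({'rows': [pd.Series(row.values) for _, row in df.iterrows()]})
print(df_new.values[0])
```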
I have a dataframe containing some embeddings in column D. I would like to first group the data by column A and then apply KMeans to each group. Each group might contain NaN values, so in the apply function I take the number of clusters to be half the number of non-NaN values in column D (n_clusters = int(not_na_mask.sum()/2)).
In the apply function I return df['cluster'].values.tolist(). I printed these values and they are correct for each group, but after running the whole script df_test['clusters'] contains only NaN in every row.
Sample DataFrame:
df_test = pd.DataFrame({
    'A': ['aa', 'bb', 'aa', 'bb', 'aa', 'bb', 'aa', 'cc', 'aa', 'aa', 'bb', 'bb', 'bb', 'cc', 'bb', 'aa', 'cc', 'aa'],
    'B': [1, 2, np.nan, 4, 6, np.nan, 7, 8, np.nan, 1, 4, 3, 4, 7, 5, 7, 9, np.nan],
    'D': [[2, 0, 1, 5, 4, 0], np.nan, [4, 7, 0, 1, 0, 2], [1., 1, 1, 2, 0, 5], np.nan, [1, 6, 3, 2, 1, 9],
          [4, 2, 1, 0, 0, 0], [3, 5, 6, 8, 8, 0], np.nan, np.nan, [2, 5, 1, 7, 4, 0], [4, 2, 0, 4, 0, 0],
          [1., 0, 1, 8, 0, 9], [1, 0, 7, 2, 1, 0], np.nan, [1, 1, 5, 0, 8, 0], [4, 1, 6, 1, 1, 0], np.nan]})
df_test:
A B D
0 aa 1.0 [2, 0, 1, 5, 4, 0]
1 bb 2.0 NaN
2 aa NaN [4, 7, 0, 1, 0, 2]
3 bb 4.0 [1.0, 1, 1, 2, 0, 5]
4 aa 6.0 NaN
5 bb NaN [1, 6, 3, 2, 1, 9]
6 aa 7.0 [4, 2, 1, 0, 0, 0]
7 cc 8.0 [3, 5, 6, 8, 8, 0]
8 aa NaN NaN
9 aa 1.0 NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0]
11 bb 3.0 [4, 2, 0, 4, 0, 0]
12 bb 4.0 [1.0, 0, 1, 8, 0, 9]
13 cc 7.0 [1, 0, 7, 2, 1, 0]
14 bb 5.0 NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0]
16 cc 9.0 [4, 1, 6, 1, 1, 0]
17 aa NaN NaN
My approach for calculating kmeans:
def apply_kmeans_on_each_category(df):
    not_na_mask = df['D'].notna()
    embedding = df[not_na_mask]['D']
    n_clusters = int(not_na_mask.sum() / 2)
    if n_clusters > 1:
        df['cluster'] = np.nan
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        df.loc[not_na_mask, 'cluster'] = kmeans.labels_
        return df['cluster'].values.tolist()
    else:
        return [np.nan] * len(df)
df_test['clusters'] = df_test.groupby('A').apply(apply_kmeans_on_each_category)
result:
df_test['clusters']:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
Name: clusters, dtype: object
Made some slight changes. The meat of the change is to use transform instead of apply. Also, there is no need to pass the entire grouped DataFrame; you can pass column D directly, as that's the only column you are using:
def apply_kmeans_on_each_category(s):
    # s is the 'D' column of one group, not the whole frame
    not_na_mask = s.notna()
    embedding = s.loc[not_na_mask]
    n_clusters = int(not_na_mask.sum() / 2)
    op = pd.Series([np.nan] * len(s), index=s.index)
    if n_clusters > 1:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        op.loc[not_na_mask] = kmeans.labels_.tolist()
    return op
df_test['clusters'] = df_test.groupby('A')['D'].transform(apply_kmeans_on_each_category)
Output
A B D clusters
0 aa 1.0 [2, 0, 1, 5, 4, 0] 0.0
1 bb 2.0 NaN NaN
2 aa NaN [4, 7, 0, 1, 0, 2] 1.0
3 bb 4.0 [1.0, 1, 1, 2, 0, 5] 0.0
4 aa 6.0 NaN NaN
5 bb NaN [1, 6, 3, 2, 1, 9] 0.0
6 aa 7.0 [4, 2, 1, 0, 0, 0] 1.0
7 cc 8.0 [3, 5, 6, 8, 8, 0] NaN
8 aa NaN NaN NaN
9 aa 1.0 NaN NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0] 1.0
11 bb 3.0 [4, 2, 0, 4, 0, 0] 1.0
12 bb 4.0 [1.0, 0, 1, 8, 0, 9] 0.0
13 cc 7.0 [1, 0, 7, 2, 1, 0] NaN
14 bb 5.0 NaN NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0] 0.0
16 cc 9.0 [4, 1, 6, 1, 1, 0] NaN
17 aa NaN NaN NaN
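For what it's worth, the reason the original apply version came back as all NaN is index alignment, which a small sketch (with a made-up mini frame of my own) makes visible:

```python
import pandas as pd

df_small = pd.DataFrame({'A': ['aa', 'bb', 'aa'], 'B': [1, 2, 3]})

# apply returns one value per group, indexed by the group keys, so
# assigning the result to a column of the original frame (indexed 0..2)
# aligns on nothing and fills the column with NaN; transform instead
# returns a result aligned to the original row index.
by_group = df_small.groupby('A')['B'].apply(lambda s: s.tolist())
aligned = df_small.groupby('A')['B'].transform('size')

print(by_group.index.tolist())  # group keys, not row labels
print(aligned.index.tolist())   # the original row labels
```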
I have:
df=pd.DataFrame({'a':[1,1,2],'b':[[1,2,3],[2,5],[3]],'c':['f','df','ere']})
df
a b c
0 1 [1, 2, 3] f
1 1 [2, 5] df
2 2 [3] ere
I want to group by a, concatenating the lists in b and collecting the elements of c into lists:
pd.DataFrame({'a':[1,2],'b':[[1,2,3,2,5],[3]],'c':[['f', 'df'],['ere']]})
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]
I tried:
df.groupby('a').agg({'b': 'sum', 'c': lambda x: list(''.join(x))})
a b c
1 [1, 2, 3, 2, 5] [f, d, f]
2 [3] [e, r, e]
But it is not quite right.
Any suggestions?
You almost got it right:
df.groupby('a', as_index=False).agg({
'b': 'sum',
'c': list # no join needed
})
Output:
a b c
0 1 [1, 2, 3, 2, 5] [f, df]
1 2 [3] [ere]
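If the 'sum' trick feels too magic, the same list concatenation can be spelled out with an explicit lambda; a sketch that should be equivalent:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2],
                   'b': [[1, 2, 3], [2, 5], [3]],
                   'c': ['f', 'df', 'ere']})

# 'sum' on a column of lists concatenates them; writing it out as
# sum(x, []) makes that behavior visible
out = df.groupby('a', as_index=False).agg({
    'b': lambda x: sum(x, []),  # explicit list concatenation
    'c': list,
})
print(out)
```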
I have a dataframe with 1000 columns. I want to replace every -9 value in every column with that row's df['a'] value.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
What I want is
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
I have tried
df.replace(-9, df['a'], inplace = True)
And
df.replace(-9, np.nan, inplace = True)
df.fillna(df.a, inplace = True)
But they don't change the df.
My solution right now is to use a for loop:
df.replace(-9, np.nan, inplace=True)
col_list = list(df)
for i in col_list:
    df[i].fillna(df['a'], inplace=True)
This solution works, but it also replaces any original np.nan values. Any ideas how I can replace just the -9 values without first converting them to np.nan? Thanks.
I think you need mask:
df = df.mask(df == -9, df['a'], axis=0)
print (df)
a b c
0 1 6.0 1
1 2 2.0 19
2 3 8.0 3
3 4 NaN 4
4 5 5.0 5
Or:
df = pd.DataFrame(np.where(df == -9, df['a'].values[:, None], df), columns=df.columns)
print (df)
a b c
0 1.0 6.0 1.0
1 2.0 2.0 19.0
2 3.0 8.0 3.0
3 4.0 NaN 4.0
4 5.0 5.0 5.0
You can also do something like this:
import numpy as np
import pandas as pd
df_tar = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, 2, 8, np.nan, 5], 'c': [1, 19, 3, 4, 5]})
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [6, -9, 8, np.nan, -9], 'c': [-9, 19, -9, -9, -9]})
df.loc[df['b']==-9,'b']=df.loc[df['b']==-9,'a']
df.loc[df['c']==-9,'c']=df.loc[df['c']==-9,'a']
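With 1000 columns you would not want to write one such line per column; the same per-column assignment generalizes with a loop (a sketch; `is_neg9` is my own name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [6, -9, 8, np.nan, -9],
                   'c': [-9, 19, -9, -9, -9]})

# the same assignment as above, generalized over every column except 'a';
# NaN values compare unequal to -9, so they are left untouched
for col in df.columns.drop('a'):
    is_neg9 = df[col] == -9
    df.loc[is_neg9, col] = df.loc[is_neg9, 'a']
print(df)
```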
Suppose we start with
import numpy as np
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
How can this be efficiently be made into a pandas DataFrame equivalent to
import pandas as pd
>>> pd.DataFrame({'a': [0, 0, 1, 1], 'b': [1, 3, 5, 7], 'c': [2, 4, 6, 8]})
a b c
0 0 1 2
1 0 3 4
2 1 5 6
3 1 7 8
The idea is to have the a column have the index in the first dimension in the original array, and the rest of the columns be a vertical concatenation of the 2d arrays in the latter two dimensions in the original array.
(This is easy to do with loops; the question is how to do it without them.)
Longer Example
Using @Divakar's excellent suggestion:
>>> np.random.randint(0,9,(4,3,2))
array([[[0, 6],
[6, 4],
[3, 4]],
[[5, 1],
[1, 3],
[6, 4]],
[[8, 0],
[2, 3],
[3, 1]],
[[2, 2],
[0, 0],
[6, 3]]])
Should be made to something like:
>>> pd.DataFrame({
'a': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
'b': [0, 6, 3, 5, 1, 6, 8, 2, 3, 2, 0, 6],
'c': [6, 4, 4, 1, 3, 4, 0, 3, 1, 2, 0, 3]})
a b c
0 0 0 6
1 0 6 4
2 0 3 4
3 1 5 1
4 1 1 3
5 1 6 4
6 2 8 0
7 2 2 3
8 2 3 1
9 3 2 2
10 3 0 0
11 3 6 3
Here's one approach that does most of the processing on NumPy before finally putting it out as a DataFrame, like so -
m,n,r = a.shape
out_arr = np.column_stack((np.repeat(np.arange(m),n),a.reshape(m*n,-1)))
out_df = pd.DataFrame(out_arr)
If you precisely know that the number of columns would be 2, such that we would have b and c as the last two columns and a as the first one, you can add column names like so -
out_df = pd.DataFrame(out_arr,columns=['a', 'b', 'c'])
Sample run -
>>> a
array([[[2, 0],
[1, 7],
[3, 8]],
[[5, 0],
[0, 7],
[8, 0]],
[[2, 5],
[8, 2],
[1, 2]],
[[5, 3],
[1, 6],
[3, 2]]])
>>> out_df
a b c
0 0 2 0
1 0 1 7
2 0 3 8
3 1 5 0
4 1 0 7
5 1 8 0
6 2 2 5
7 2 8 2
8 2 1 2
9 3 5 3
10 3 1 6
11 3 3 2
Using Panel:
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
b = pd.Panel(np.rollaxis(a, 2)).to_frame()
c = b.set_index(b.index.labels[0]).reset_index()
c.columns = list('abc')
then a is :
[[[1 2]
[3 4]]
[[5 6]
[7 8]]]
b is :
0 1
major minor
0 0 1 2
1 3 4
1 0 5 6
1 7 8
and c is :
a b c
0 0 1 2
1 0 3 4
2 1 5 6
3 1 7 8
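Note that pd.Panel was removed in pandas 1.0, so on a modern install the same frame has to be built without it; a sketch along the lines of the NumPy answer above:

```python
import numpy as np
import pandas as pd

a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# collapse the last two axes into rows, then prepend the first-axis index
m, n, r = a.shape
c = pd.DataFrame(a.reshape(m * n, r), columns=['b', 'c'])
c.insert(0, 'a', np.repeat(np.arange(m), n))
print(c)
```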