I am looking for a way to do a COUNTIF across rows in pandas. An example would be:
df = pd.DataFrame(data={'A': ['x', 'y', 'z'], 'B': ['z', 'y', 'x'], 'C': ['y', 'x', 'z']})
I want to count the number of repetitions on each row and add it to new columns based on specific criteria:
Criteria
C1 = x
C2 = y
C3 = z
In the example above, C3 will be [1, 0, 2], as there is one 'z' in row 0, no 'z' in row 1, and two 'z' in row 2.
The end table would look like:
A B C | C1 C2 C3
x z y | 1 1 1
y y x | 1 2 0
z x z | 1 0 2
How can I do this in Pandas?
Thanks a lot!
Do you mean:
df.join(df.apply(pd.Series.value_counts, axis=1).fillna(0))
Output:
A B C x y z
0 x z y 1.0 1.0 1.0
1 y y x 1.0 2.0 0.0
2 z x z 1.0 0.0 2.0
You can iterate through the values and sum the matches across axis 1:
pd.concat([df.eq(val).sum(1) for val in ['x', 'y', 'z']], axis=1)
0 1 2
0 1 1 1
1 1 2 0
2 1 0 2
Then rename your column names accordingly.
For a more general solution, consider np.unique and using the pd.Series.name attr.
pd.concat([df.eq(val).sum(1).rename(val) for val in np.unique(df)], axis=1)
x y z
0 1 1 1
1 1 2 0
2 1 0 2
And with some trivial tweaks, you can have your end table
map_ = {'x':'C1', 'y':'C2', 'z':'C3'}
df.join(pd.concat([df.eq(i).sum(1).rename(map_[i]) for i in np.unique(df)], axis=1))
A B C C1 C2 C3
0 x z y 1 1 1
1 y y x 1 2 0
2 z x z 1 0 2
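The counting idea above can be sketched end to end as follows. This is a minimal sketch that assumes the cells are strings and that the criteria are known up front; the `criteria` dict and the `counts` and `result` names are illustrative and not part of the original answers.

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'z'],
                   'B': ['z', 'y', 'x'],
                   'C': ['y', 'x', 'z']})

# Hypothetical mapping: new column name -> value to count in each row.
criteria = {'C1': 'x', 'C2': 'y', 'C3': 'z'}

# df.eq(value) is a boolean frame; summing along axis=1 counts matches per row.
counts = pd.DataFrame({name: df.eq(value).sum(axis=1)
                       for name, value in criteria.items()})

# Append the count columns to the original frame.
result = df.join(counts)
print(result)
```

Because the counts come from summing booleans, the new columns are integers, unlike the float columns produced by the value_counts approach with fillna(0).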
Related
For the 3 columns below, I would like to create a 4th column based on unique values from the 3 columns.
Col 1 Col 2 Col 3
A X Y
X B Y
C X X
4th Column should have values of only A, B or C, as shown below. Please let me know how this can be done.
Col 1 Col 2 Col 3 Col 4
A X Y A
X B Y B
C X X C
If unique means values that occur only once across all the columns, with multiple such values in a row joined together, use DataFrame.stack with Series.drop_duplicates and aggregate with join:
c = ['Col 1','Col 2','Col 3']
df['Col 4'] = df[c].stack().drop_duplicates(keep=False).groupby(level=0).agg(','.join)
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X X C
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X D C
c = ['Col 1','Col 2','Col 3']
df['Col 4'] = df[c].stack().drop_duplicates(keep=False).groupby(level=0).agg(','.join)
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X D C,D
EDIT: If you need to extract only the values A, B, C defined in a list, use:
L = ['A','B','C']
c = ['Col 1','Col 2','Col 3']
s = df[c].stack()
df['Col 4'] = s[s.isin(L)].groupby(level=0).agg(','.join)
print (df)
Col 1 Col 2 Col 3 Col 4
0 A X Y A
1 X B Y B
2 C X X C
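As an alternative to stacking, the list-based filter can also be sketched row by row with DataFrame.where. This is a different technique from the answer above, shown only for comparison; the `allowed` name is illustrative, and note that a row with no allowed value ends up with an empty string here rather than NaN.

```python
import pandas as pd

df = pd.DataFrame({'Col 1': ['A', 'X', 'C'],
                   'Col 2': ['X', 'B', 'X'],
                   'Col 3': ['Y', 'Y', 'X']})

allowed = ['A', 'B', 'C']            # hypothetical whitelist of values to keep
cols = ['Col 1', 'Col 2', 'Col 3']

# Keep only allowed values; every other cell becomes NaN.
kept = df[cols].where(df[cols].isin(allowed))

# Join whatever survived in each row into a single string.
df['Col 4'] = kept.apply(lambda row: ','.join(row.dropna()), axis=1)
print(df)
```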
I have two DataFrames like these:
df1 a b c
0 1 2 3
1 2 3 4
2 3 4 5
df2 x y z
0 T T F
1 F T T
2 F T F
I want to merge them by interleaving their columns, taking one column from each in turn, like this:
df a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
What would you suggest? How can I merge, append, or concatenate them?
I used this code, which works dynamically:
df = pd.DataFrame()
for i in range(6):
    # even positions take the next column of df1, odd positions the next column of df2
    if i % 2 == 0:
        df.loc[:, i] = df1.iloc[:, i // 2]
    else:
        df.loc[:, i] = df2.iloc[:, (i - 1) // 2]
And it works correctly!
Try:
df = pd.concat([df1, df2], axis=1)
df = df[['a','x','b','y','c','z']]
Prints:
a x b y c z
0 1 T 2 T 3 F
1 2 F 3 T 4 T
2 3 F 4 T 5 F
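The column order does not have to be typed out by hand. Here is a minimal sketch that builds the interleaved order from the two frames, assuming both have the same number of columns, the same index, and no column names in common; the `order` name is illustrative.

```python
from itertools import chain

import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5]})
df2 = pd.DataFrame({'x': ['T', 'F', 'F'],
                    'y': ['T', 'T', 'T'],
                    'z': ['F', 'T', 'F']})

# zip pairs the columns position by position; chain flattens the pairs
# into a single interleaved list: a, x, b, y, c, z.
order = list(chain.from_iterable(zip(df1.columns, df2.columns)))

# Concatenate side by side, then reorder the columns.
df = pd.concat([df1, df2], axis=1)[order]
print(df)
```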
I have a DataFrame with 2 columns, and I want to select N rows of column B for each value in column A.
A B
0 A
0 B
0 I
0 D
1 A
1 F
1 K
1 L
2 R
For each unique number in column A, give me N random rows from column B. If a value in column A has fewer than N rows, return all of its rows. With N == 2, the resulting DataFrame would look like:
A B
0 A
0 D
1 F
1 K
2 R
Use DataFrame.sample per group in GroupBy.apply, testing the length of each group with if-else:
N = 2
df1 = df.groupby('A').apply(lambda x: x.sample(N) if len(x) >=N else x).reset_index(drop=True)
print (df1)
A B
0 0 I
1 0 D
2 1 A
3 1 K
4 2 R
Or:
N = 2
df1 = df.groupby('A', group_keys=False).apply(lambda x: x.sample(N) if len(x) >=N else x)
print (df1)
A B
0 0 A
3 0 D
5 1 F
6 1 K
8 2 R
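The same idea can be sketched without GroupBy.apply by sampling each group in a loop and capping the sample size at the group length. This is a minimal sketch under assumptions: `random_state=0` is added only to make the draw repeatable, and the `parts` and `sampled` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1, 1, 1, 2],
                   'B': list('ABIDAFKLR')})
N = 2

# Sample every group separately; min() returns the whole group
# when it has fewer than N rows.
parts = [group.sample(n=min(N, len(group)), random_state=0)
         for _, group in df.groupby('A')]

# Stitch the pieces together and restore the original row order.
sampled = pd.concat(parts).sort_index()
print(sampled)
```

Using min(N, len(group)) replaces the if-else test, since asking for as many rows as the group contains simply returns the whole group.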
I am new to pandas and can compute a row-wise cumulative sum with the call below (input first, then result):
df.cumsum(axis=1)
y0 y1 y2
0 2 3 4
1 2 2 3
2 0 0 0
3 1 2 3
y0 y1 y2
0 2 5 9
1 2 4 7
2 0 0 0
3 1 3 6
But is there a way to perform it on only the first 2 columns, i.e. skip y2?
You need to exclude y2, find cumsum and concat y2 back.
pd.concat([df[['y0', 'y1']].cumsum(axis=1),df['y2']], axis=1)
Output:
y0 y1 y2
0 2 5 4
1 2 4 3
2 0 0 0
3 1 3 3
You can also use .loc to select only the columns you care about.
cols = ['y0', 'y1']
df.loc[:, cols] = df.loc[:, cols].cumsum(axis=1)
Output
y0 y1 y2
0 2 5 4
1 2 4 3
2 0 0 0
3 1 3 3
loc is a flexible way to slice a DataFrame and in general follows the format:
.loc[row_labels, column_labels]
where a : can be used to indicate all rows or all columns.
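When the columns to skip are easier to name than the columns to keep, the selection can be built by exclusion. Here is a minimal sketch using the data from the question; the `skip`, `cols`, and `out` names are illustrative, and a list comprehension is used so the original column order is preserved, which matters for a cumulative sum across columns.

```python
import pandas as pd

df = pd.DataFrame({'y0': [2, 2, 0, 1],
                   'y1': [3, 2, 0, 2],
                   'y2': [4, 3, 0, 3]})

skip = ['y2']                                   # columns to leave untouched
cols = [c for c in df.columns if c not in skip]

# Work on a copy so the original frame stays available for comparison.
out = df.copy()
out[cols] = df[cols].cumsum(axis=1)
print(out)
```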
I have a key-value dataframe:
pd.DataFrame(columns=['X','Y','val'],data= [['a','z',5],['b','g',3],['b','y',6],['e','r',9]])
  X Y val
0 a z 5
1 b g 3
2 b y 6
3 e r 9
Which I'd like to convert into a denser dataframe:
X z g y r
0 a 5 0 0 0
1 b 0 3 6 0
2 e 0 0 0 9
Before I resort to pure Python, I was wondering if there was a simple way to do this with pandas.
You can use get_dummies:
In [11]: dummies = pd.get_dummies(df['Y'])
In [12]: dummies
Out[12]:
g r y z
0 0 0 0 1
1 1 0 0 0
2 0 0 1 0
3 0 1 0 0
and then multiply by the val column:
In [13]: res = dummies.mul(df['val'], axis=0)
In [14]: res
Out[14]:
g r y z
0 0 0 0 5
1 3 0 0 0
2 0 0 6 0
3 0 9 0 0
To fix the index, you can add X to the index by first applying set_index:
In [21]: df1 = df.set_index('X', append=True)
In [22]: df1
Out[22]:
Y val
X
0 a z 5
1 b g 3
2 b y 6
3 e r 9
In [23]: dummies = pd.get_dummies(df1['Y'])
In [24]: dummies.mul(df1['val'], axis=0)
Out[24]:
g r y z
X
0 a 0 0 0 5
1 b 3 0 0 0
2 b 0 0 6 0
3 e 0 9 0 0
If you want to do this as a pivot (you can also use pivot_table):
In [31]: df.pivot(index='X', columns='Y').fillna(0)
Out[31]:
val
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
Perhaps you want to reset_index, to make X a column (I'm not sure whether that makes sense):
In [32]: df.pivot(index='X', columns='Y').fillna(0).reset_index()
Out[32]:
X val
Y g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
For completeness, the pivot_table:
In [33]: df.pivot_table('val', 'X', 'Y', fill_value=0)
Out[33]:
Y g r y z
X
a 0 0 0 5
b 3 0 6 0
e 0 9 0 0
In [34]: df.pivot_table('val', 'X', 'Y', fill_value=0).reset_index()
Out[34]:
Y X g r y z
0 a 0 0 0 5
1 b 3 0 6 0
2 e 0 9 0 0
Note: after resetting the index, the columns axis is still named Y, which may not make sense (it is easy to rectify via res.columns.name = None).
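The pivot_table route can be sketched as one self-contained chain that also clears the leftover axis name and restores the column order from the question. Two assumptions are made here: `aggfunc='sum'` is passed explicitly (pivot_table averages duplicates by default), and the `dense` name is illustrative.

```python
import pandas as pd

df = pd.DataFrame(columns=['X', 'Y', 'val'],
                  data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

dense = (df.pivot_table(values='val', index='X', columns='Y',
                        aggfunc='sum', fill_value=0)
           .rename_axis(columns=None)   # drop the leftover "Y" axis name
           .reset_index())              # turn X back into a regular column

# pivot_table sorts the new columns; reorder them by first appearance in Y.
dense = dense[['X', *df['Y'].unique()]]
print(dense)
```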
If you want something that feels more direct, something akin to DataFrame.lookup but for np.put might make sense.
def lookup_index(self, row_labels, col_labels):
    values = self.values
    # translate labels into integer positions; -1 marks a missing label
    ridx = self.index.get_indexer(row_labels)
    cidx = self.columns.get_indexer(col_labels)
    if (ridx == -1).any():
        raise ValueError('One or more row labels was not found')
    if (cidx == -1).any():
        raise ValueError('One or more column labels was not found')
    # position of each (row, column) pair in the flattened, row-major array
    flat_index = ridx * len(self.columns) + cidx
    return flat_index
flat_index = lookup_index(df, vals.X, vals.Y)
np.put(df.values, flat_index, vals.val.values)
This assumes that df has the appropriate columns and index to hold the X/Y values. Here's an ipython notebook http://nbviewer.ipython.org/6454120
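The flat-index approach can be sketched on the data from the question as follows. This is a minimal sketch with one deliberate difference from the snippet above: it fills a fresh NumPy array and wraps it in a DataFrame afterwards, rather than writing into df.values, which may be a copy. The `rows`, `cols`, `grid`, and `dense` names are illustrative.

```python
import numpy as np
import pandas as pd

vals = pd.DataFrame(columns=['X', 'Y', 'val'],
                    data=[['a', 'z', 5], ['b', 'g', 3], ['b', 'y', 6], ['e', 'r', 9]])

# Row and column labels of the dense result, in order of first appearance.
rows = pd.Index(vals['X'].unique())
cols = pd.Index(vals['Y'].unique())

# Integer positions of every (X, Y) pair.
ridx = rows.get_indexer(vals['X'])
cidx = cols.get_indexer(vals['Y'])

# Fill a zero grid through its flattened (row-major) index.
grid = np.zeros((len(rows), len(cols)), dtype=vals['val'].dtype)
np.put(grid, ridx * len(cols) + cidx, vals['val'].to_numpy())

dense = pd.DataFrame(grid, index=rows, columns=cols)
print(dense)
```

If an (X, Y) pair occurred more than once, the last value would overwrite the earlier ones, so this sketch assumes the pairs are unique.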