I have:
pd.DataFrame({'col1':['A','A','B','F']})
col1
0 A
1 A
2 B
3 F
I want:
pd.DataFrame({'col1':['A','A','B','F'],'col2':['1A:0B:0C:0D:0E:0F','2A:0B:0C:0D:0E:0F','2A:1B:0C:0D:0E:0F','2A:1B:0C:0D:0E:1F']})
col1 col2
0 A 1A:0B:0C:0D:0E:0F
1 A 2A:0B:0C:0D:0E:0F
2 B 2A:1B:0C:0D:0E:0F
3 F 2A:1B:0C:0D:0E:1F
Requirements:
I have a column that can take one of six values (A through F). I want to create a new column that shows the running count of each value for that row and all rows above it.
Any suggestions?
You can use get_dummies + cumsum. That output is generally easier to work with, but if you need that single string output, you can join the columns with the counts. The .reindex and .fillna ensure everything is ordered and includes exactly the categories you want.
import pandas as pd
df = pd.DataFrame({'col1':['A','A','B','F']})
df = (pd.get_dummies(df['col1'])
        .reindex(list('ABCDEF'), axis=1)
        .fillna(0, downcast='infer')
        .cumsum())
# A B C D E F
#0 1 0 0 0 0 0
#1 2 0 0 0 0 0
#2 2 1 0 0 0 0
#3 2 1 0 0 0 1
df['res'] = [':'.join(x) for x in (df.astype(str)+df.columns).to_numpy()]
# A B C D E F res
#0 1 0 0 0 0 0 1A:0B:0C:0D:0E:0F
#1 2 0 0 0 0 0 2A:0B:0C:0D:0E:0F
#2 2 1 0 0 0 0 2A:1B:0C:0D:0E:0F
#3 2 1 0 0 0 1 2A:1B:0C:0D:0E:1F
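A slightly shorter variant is possible (a sketch, assuming the full A-F category set is known up front and starting again from the original frame): declaring col1 as a Categorical makes get_dummies emit all six columns in order, so the reindex/fillna step is not needed.
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'F']})
# get_dummies emits one column per declared category, already in A-F order.
counts = pd.get_dummies(df['col1'].astype(pd.CategoricalDtype(list('ABCDEF'))), dtype=int).cumsum()
df['col2'] = [':'.join(x) for x in (counts.astype(str) + counts.columns).to_numpy()]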
Related
With reference to Pandas groupby with categories with redundant nan
import pandas as pd
df = pd.DataFrame({"TEAM":[1,1,1,1,2,2,2], "ID":[1,1,2,2,8,4,5], "TYPE":["A","B","A","B","A","A","A"], "VALUE":[1,1,1,1,1,1,1]})
df["TYPE"] = df["TYPE"].astype("category")
df = df.groupby(["TEAM", "ID", "TYPE"]).sum()
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
     4  A         0
        B         0
     5  A         0
        B         0
     8  A         0
        B         0
2    1  A         0
        B         0
     2  A         0
        B         0
     4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
Expected output:
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
I tried using astype("category") for TYPE, but the groupby then outputs the full Cartesian product of the group keys, including combinations that never occur in the data.
What you want is a little unusual, but you can get there with a pivot table:
out = df.pivot_table(index=['TEAM', 'ID'],
                     columns=['TYPE'],
                     values=['VALUE'],
                     aggfunc='sum',
                     observed=True,   # This is the key when working with categoricals;
                                      # also worth trying with the groupby from the post you linked.
                     fill_value=0).stack()
print(out)
Output:
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
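For reference, the observed=True hint in the comment also applies directly to the groupby from the linked post, but note that it keeps only the observed TYPE combinations, so the zero-count B rows for TEAM 2 would disappear (a minimal sketch):
# Keeps only (TEAM, ID, TYPE) combinations that actually occur in the data;
# the zero-filled rows from the Cartesian product are dropped.
df.groupby(["TEAM", "ID", "TYPE"], observed=True).sum()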
Here is one way to do it, based on the data you shared: reset the index, then use groupby/transform to keep only the groups whose sum is greater than 0 (i.e. at least one of the categories A or B is non-zero), and finally set the index back.
df.reset_index(inplace=True)
(df[df.groupby(['TEAM', 'ID'])['VALUE']
      .transform(lambda x: x.sum() > 0)]
   .set_index(['TEAM', 'ID', 'TYPE']))
              VALUE
TEAM ID TYPE
1    1  A         1
        B         1
     2  A         1
        B         1
2    4  A         1
        B         0
     5  A         1
        B         0
     8  A         1
        B         0
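The same filtering can also be written with groupby().filter, which keeps or drops whole groups at once (a sketch, assuming df still holds the Cartesian-product result from the question):
out = (df.reset_index()
         .groupby(['TEAM', 'ID'])
         .filter(lambda g: g['VALUE'].sum() > 0)   # keep (TEAM, ID) groups with any non-zero VALUE
         .set_index(['TEAM', 'ID', 'TYPE']))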
This is my dataframe:
0 1 0 1 1
1 0 1 0 1
I generate the sum for each column as below:
data.iloc[:,1:] = data.iloc[:,1:].sum(axis=0)
The result is:
0 1 1 1 2
1 1 1 1 2
But I only want to update values that are not zero:
0 1 0 1 2
1 0 1 0 2
As it is a large dataframe and I don't know which columns will contain zeros, I am having trouble getting the condition to work together with iloc.
Assuming the following input:
0 1 2 3 4
0 0 1 0 1 1
1 1 0 1 0 1
You can use the underlying NumPy array and numpy.where:
import numpy as np
a = data.values[:, 1:]  # numeric part as a NumPy array
data.iloc[:, 1:] = np.where(a != 0, a.sum(0), a)  # non-zero cells become their column sum
output:
0 1 2 3 4
0 0 1 0 1 2
1 1 0 1 0 2
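If you'd rather stay in pandas, here is a minimal sketch of the same idea; it relies on the fact that zero cells must stay zero, so it is enough to multiply the non-zero mask by the column totals:
sub = data.iloc[:, 1:]
# Non-zero cells become their column sum; cells that were zero stay zero.
data.iloc[:, 1:] = sub.ne(0).mul(sub.sum(axis=0), axis=1)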
I have a data frame as follows:
Proxyid  A  B  C  D
123      1  0  0  0
456      1  1  1  1
789      0  0  0  0
This is the idea of the data frame. Now I want to duplicate the rows that contain more than one 1, so that each duplicated row keeps exactly one of the 1s, and assign values as follows.
Proxyid  A  B  C  D
123      1  0  0  0
456      1  0  0  0
456      0  1  0  0
456      0  0  1  0
456      0  0  0  1
789      0  0  0  0
I would really appreciate any input. Thank you.
One option via pd.get_dummies:
import numpy as np

df1 = (
    pd.get_dummies(
        df.set_index('Proxyid')
          .mul(df.columns[1:])
          .replace('', np.nan)
          .stack()
    )
    .reset_index()
    .drop(columns='level_1')
)
result = pd.concat([df1, df[~df.Proxyid.isin(df1.Proxyid)]])
OUTPUT:
   Proxyid  A  B  C  D
0      123  1  0  0  0
1      456  1  0  0  0
2      456  0  1  0  0
3      456  0  0  1  0
4      456  0  0  0  1
2      789  0  0  0  0
If you have extra columns, just add them in set_index and use:
df1 = df.set_index(['Proxyid', 'test'])
df1 = pd.get_dummies(df1.mul(df1.columns).replace('', np.nan).stack()).reset_index()
result = pd.concat([df1, df[~df.Proxyid.isin(df1.Proxyid)]])
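A different route to the same result (a sketch on the example data, assuming every category column contains at least one 1 somewhere; the helper names long, ones and wide are just illustrative): melt to long form, keep the 1s, one-hot back to wide form, and re-append the Proxyids that had no 1s at all.
long = df.melt(id_vars='Proxyid', var_name='cat', value_name='flag')
ones = long[long['flag'].eq(1)]
wide = pd.get_dummies(ones.set_index('Proxyid')['cat'], dtype=int).reset_index()
# Rows whose Proxyid never had a 1 (e.g. 789) are carried over unchanged.
result = pd.concat([wide, df[~df['Proxyid'].isin(wide['Proxyid'])]], ignore_index=True)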
I have a DataFrame with 75 columns.
How can I select rows based on a condition in a specific array of columns? If I want to do this on all columns I can just use
df[(df.values > 1.5).any(1)]
But let's say I just want to do this on columns 3:45.
Use iloc to slice the columns by ordinal position:
In [31]:
df = pd.DataFrame(np.random.randn(5,10), columns=list('abcdefghij'))
df
Out[31]:
a b c d e f g \
0 -0.362353 0.302614 -1.007816 -0.360570 0.317197 1.131796 0.351454
1 1.008945 0.831101 -0.438534 -0.653173 0.234772 -1.179667 0.172774
2 0.900610 0.409017 -0.257744 0.167611 1.041648 -0.054558 -0.056346
3 0.335052 0.195865 0.085661 0.090096 2.098490 0.074971 0.083902
4 -0.023429 -1.046709 0.607154 2.219594 0.381031 -2.047858 -0.725303
h i j
0 0.533436 -0.374395 0.633296
1 2.018426 -0.406507 -0.834638
2 -0.079477 0.506729 1.372538
3 -0.791867 0.220786 -1.275269
4 -0.584407 0.008437 -0.046714
So to slice the 4th to 5th columns inclusive:
In [32]:
df.iloc[:, 3:5]
Out[32]:
d e
0 -0.360570 0.317197
1 -0.653173 0.234772
2 0.167611 1.041648
3 0.090096 2.098490
4 2.219594 0.381031
So in your case
df[(df.iloc[:, 2:45].values > 1.5).any(1)]
should work.
Indexing is 0-based and the start of the slice is included but the end is not, so here the 3rd column is included and we slice up to the 46th column, which is not included in the slice.
Another solution with iloc; values can be omitted:
# if you need the 3rd to 45th columns
print(df[(df.iloc[:, 2:45] > 1.5).any(axis=1)])
Sample:
np.random.seed(1)
df = pd.DataFrame(np.random.randint(3, size=(5,10)), columns=list('abcdefghij'))
print (df)
a b c d e f g h i j
0 1 0 0 1 1 0 0 1 0 1
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
3 2 1 1 1 1 2 1 1 0 0
4 1 0 0 1 2 1 0 2 2 1
print(df[(df.iloc[:, 2:5] > 1.5).any(axis=1)])
a b c d e f g h i j
1 0 2 1 2 0 2 1 2 0 0
2 2 0 1 2 2 0 1 1 2 0
4 1 0 0 1 2 1 0 2 2 1
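If you know the relevant columns by label rather than position, the same check can be written with loc (a small sketch on the sample above; the 'c':'e' label range is equivalent to iloc[:, 2:5] and is inclusive on both ends):
print(df[(df.loc[:, 'c':'e'] > 1.5).any(axis=1)])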
What is the best way to create a set of new columns based on two other columns? (similar to a crosstab or SQL case statement)
This works but performance is very slow on large dataframes:
for label in labels:
    df[label + '_amt'] = df.apply(lambda row: row['amount'] if row['product'] == label else 0, axis=1)
You can use pivot_table
>>> df
amount product
0 6 b
1 3 c
2 3 a
3 7 a
4 7 a
>>> df.pivot_table(index=df.index, values='amount',
... columns='product', fill_value=0)
product a b c
0 0 6 0
1 0 0 3
2 3 0 0
3 7 0 0
4 7 0 0
or,
>>> for label in df['product'].unique():
...     df[label + '_amt'] = (df['product'] == label) * df['amount']
...
>>> df
amount product b_amt c_amt a_amt
0 6 b 6 0 0
1 3 c 0 3 0
2 3 a 0 0 3
3 7 a 0 0 7
4 7 a 0 0 7
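Another vectorised option along the same lines (a sketch using the example frame above): one-hot encode 'product' with get_dummies and scale each indicator column by 'amount'.
amt = pd.get_dummies(df['product'], dtype=int).mul(df['amount'], axis=0).add_suffix('_amt')
df = df.join(amt)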