I would like to get a Series whose values come from the column whose name is stored, row by row, in the 'COL' column.
import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [100, 200, 300, 400], 'COL': ['A', 'B', 'A', 'B']})
df
A B COL
0 10 100 A
1 20 200 B
2 30 300 A
3 40 400 B
What I need as a result:
X
0 10
1 200
2 30
3 400
Use DataFrame.lookup; the only requirement is that every value in COL exists among the column names:
df['X'] = df.lookup(df.index, df['COL'])
print (df)
A B COL X
0 10 100 A 10
1 20 200 B 200
2 30 300 A 30
3 40 400 B 400
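Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current versions, a minimal sketch of the factorize-based replacement suggested in the deprecation notes:
import numpy as np

# map each label in COL to a column position, then pick one cell per row
idx, cols = pd.factorize(df['COL'])
df['X'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]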
This will also solve your problem, though apply with axis=1 loops over the rows in Python:
df['X'] = df.apply(lambda row: row[row['COL']], axis=1)
Let's say I have a dataframe called df:
A 10
A 20
15
20
B 10
B 10
The result I want is
A 30
35
B 20
I imagine your blanks are actually NaNs; in that case, use dropna=False:
df.groupby('col1', dropna=False).sum()
If they really are empty strings, then it should work with the default.
Example:
df = pd.DataFrame({'col1': ['A', 'A', float('nan'), float('nan'), 'B', 'B'],
                   'col2': [10, 20, 15, 20, 10, 10]})
df.groupby('col1', dropna=False).sum()
output:
col2
col1
A 30
B 20
NaN 35
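If they really were empty strings, a quick check with the same data ('' instead of NaN) shows the default grouping already keeps them, since dropna only affects NaN keys:
df = pd.DataFrame({'col1': ['A', 'A', '', '', 'B', 'B'],
                   'col2': [10, 20, 15, 20, 10, 10]})
print(df.groupby('col1').sum())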
Group by a custom grouper and aggregate the columns.
Suppose your dataframe has 2 columns, 'col1' and 'col2':
>>> df
col1 col2
0 A 10 # <- group 1
1 A 20 # <- group 1
2 15 # <- group 2
3 20 # <- group 2
4 B 10 # <- group 3
5 B 10 # <- group 3
# start a new group whenever the value in col1 changes from the previous row
grp = df.iloc[:, 0].ne(df.iloc[:, 0].shift()).cumsum()
out = df.groupby(grp, as_index=False).agg({'col1': 'first', 'col2': 'sum'})
Output result:
>>> out
col1 col2
0 A 30
1 35
2 B 20
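For reference, the intermediate grouper increments at each change of value in col1, so every run of equal values gets its own label (this assumes the blanks are empty strings, which compare equal; consecutive NaNs would each start a new group, since NaN != NaN):
>>> grp
0    1
1    1
2    2
3    2
4    3
5    3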
I have a dataset in this format:
df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Type': ['A', 'B', 'C', 'D'],
                   'Value': [100, 200, 201, 120]})
I want to replicate one of the rows of this dataframe several times and give each copy a new, incremental ID. The new dataset should look like the following:
df = pd.DataFrame({'ID': [1, 2, 3, 4, 10, 11, 12, 13],
                   'Type': ['A', 'B', 'C', 'D', 'B', 'B', 'B', 'B'],
                   'Value': [100, 200, 201, 120, 200, 200, 200, 200]})
I could replicate the rows in the following way:
df_B = df[df['Type'] == 'B']
df_B = pd.concat([df_B]*4)
But I'm having trouble updating the ID column incrementally. Could someone guide me? Thanks!
Do you mean something like this?
out = (df.loc[df['Type'].eq('B')]
         .reset_index(drop=True)
         .iloc[[0] * 4]
         .reset_index(drop=True)
         .assign(ID=df['ID'] + 9))
print(df.append(out, ignore_index=True))
Output:
ID Type Value
0 1 A 100
1 2 B 200
2 3 C 201
3 4 D 120
4 10 B 200
5 11 B 200
6 12 B 200
7 13 B 200
You can specify a starting value for the new incremental ID, build the column with DataFrame.assign, and add the original rows back with DataFrame.append:
N = 4
first = 10
df_B = df[df['Type'] == 'B']
df_B = df.append(pd.concat([df_B]*N).assign(ID=range(first, first + N)), ignore_index=True)
print (df_B)
ID Type Value
0 1 A 100
1 2 B 200
2 3 C 201
3 4 D 120
4 10 B 200
5 11 B 200
6 12 B 200
7 13 B 200
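Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas both answers above need pd.concat instead. A minimal sketch of the same idea:
N = 4
first = 10
reps = pd.concat([df[df['Type'] == 'B']] * N).assign(ID=range(first, first + N))
df_B = pd.concat([df, reps], ignore_index=True)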
df2 has more columns and rows than df1. For each row in df2, I want to look up the corresponding row in df1 based on matching values in one of their columns, and subtract that row's B value from df2's B. I tried set_index and subtracting the dataframes directly, but that resulted in a lot of NaN.
df1 = pd.DataFrame([[1, 10], [2, 20], [3, 30]],
                   columns=['A', 'B'])
df2 = pd.DataFrame([[1, 100, 15], [1, 200, 20],
                    [2, 100, 30], [2, 200, 35],
                    [3, 100, 50], [3, 200, 55]],
                   columns=['A', 'X', 'B'])
# For each row in df2, look up the matching row in df1 based on column A,
# and produce the difference of the values in column B.
expected = pd.DataFrame([[1, 100, 5], [1, 200, 10],
                         [2, 100, 10], [2, 200, 15],
                         [3, 100, 20], [3, 200, 25]],
                        columns=['A', 'X', 'B'])
DataFrames:
df1
A B
0 1 10
1 2 20
2 3 30
df2
A X B
0 1 100 15
1 1 200 20
2 2 100 30
3 2 200 35
4 3 100 50
5 3 200 55
expected
A X B
0 1 100 5
1 1 200 10
2 2 100 10
3 2 200 15
4 3 100 20
5 3 200 25
Set df1's index to 'A', map it back onto df2.A, and then do the subtraction:
# align df1's B to df2's rows through the shared key A, then subtract in place
df2['B'] -= df2.A.map(df1.set_index('A').B)
Out[216]:
A X B
0 1 100 5
1 1 200 10
2 2 100 10
3 2 200 15
4 3 100 20
5 3 200 25
Note: where df2.A has values that don't exist in df1.A, the map returns NaN for that row. I leave it that way because your sample data doesn't specify how to handle it. If you want to keep B unchanged in that case, chain .fillna(0) onto the end of the map, or call the subtract method with the fill_value=0 option:
df2['B'] -= df2.A.map(df1.set_index('A').B).fillna(0)
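The subtract-method variant would be a sketch along these lines:
df2['B'] = df2['B'].sub(df2.A.map(df1.set_index('A').B), fill_value=0)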
You can use merge also:
df2.merge(df1, on='A').eval('B = B_x - B_y').drop(['B_x','B_y'], axis=1)
Output:
A X B
0 1 100 5
1 1 200 10
2 2 100 10
3 2 200 15
4 3 100 20
5 3 200 25
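The same merge without eval, if you prefer explicit column arithmetic (a sketch; the '_ref' suffix is my own naming):
m = df2.merge(df1, on='A', suffixes=('', '_ref'))
m['B'] = m['B'] - m.pop('B_ref')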
I have a pandas dataframe
x = pd.DataFrame.from_dict({'row': [1, 1, 2, 2, 3, 3, 3],
                            'add': [1, 2, 3, 4, 5, 6, 7],
                            'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                            'take2': ['11', '22', '33', '44', '55', '66', '77'],
                            'range': [100, 200, 300, 400, 500, 600, 700]})
add range row take1 take2
0 1 100 1 a 11
1 2 200 1 b 22
2 3 300 2 c 33
3 4 400 2 d 44
4 5 500 3 e 55
5 6 600 3 f 66
6 7 700 3 g 77
I want to group it by the row column, then add up entries in add column, but take the first entry from take1 and take2, and select the min and max from range:
add row take1 take2 min_range max_range
0 3 1 a 11 100 200
1 7 2 c 33 300 400
2 18 3 e 55 500 700
Use DataFrameGroupBy.agg with a dict, but some cleanup is necessary afterwards, because the result has a MultiIndex in the columns:
#create a dictionary of column names and functions to apply to that column
d = {'add':'sum', 'take1':'first', 'take2':'first', 'range':['min','max']}
#group by the row column and apply the corresponding aggregation to each
#column as specified in the dictionary d
df = x.groupby('row', as_index=False).agg(d)
#rename some columns
df = df.rename(columns={'first':'', 'sum':''})
df.columns = ['{0[0]}_{0[1]}'.format(x).strip('_') for x in df.columns]
print (df)
row take1 range_min range_max take2 add
0 1 a 100 200 11 3
1 2 c 300 400 33 7
2 3 e 500 700 55 18
Details: aggregate the columns by the functions specified in the dictionary:
df = x.groupby('row', as_index=False).agg(d)
row range take2 take1 add
min max first first sum
0 1 100 200 11 a 3
1 2 300 400 33 c 7
2 3 500 700 55 e 18
Replacing the column names sum and first with '' leads to
row range take2 take1 add
min max
0 1 100 200 11 a 3
1 2 300 400 33 c 7
2 3 500 700 55 e 18
A list comprehension over the columns, joining each tuple with a string formatter, produces the desired column names; assigning the result to df.columns gives the desired output.
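Since pandas 0.25 you can sidestep the MultiIndex cleanup entirely with named aggregation; a sketch of the equivalent:
df = x.groupby('row', as_index=False).agg(
    add=('add', 'sum'),
    take1=('take1', 'first'),
    take2=('take2', 'first'),
    range_min=('range', 'min'),
    range_max=('range', 'max'))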
Here's what I had, without column renaming/sorting.
import numpy as np

x = pd.DataFrame.from_dict({'row': [1, 1, 2, 2, 3, 3, 3],
                            'add': [1, 2, 3, 4, 5, 6, 7],
                            'take1': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                            'take2': ['11', '22', '33', '44', '55', '66', '77'],
                            'range': [100, 200, 300, 400, 500, 600, 700]})
x.reset_index(inplace=True)
# .ix was removed from pandas; .loc does the same label-based lookup here
min_cols = x.loc[x.groupby(['row'])['index'].idxmin().values][['row', 'take1', 'take2']]
x_grouped = x.groupby(['row']).agg({'add': 'sum', 'range': [np.min, np.max]})
x_out = pd.merge(x_grouped, min_cols, how='left', left_index=True, right_on=['row'])
print(x_out)
(add, sum) (range, amin) (range, amax) row take1 take2
0 3 100 200 1 a 11
2 7 300 400 2 c 33
4 18 500 700 3 e 55
Suppose that I have this dataframe:
import pandas as pd
def creatingDataFrame():
    raw_data = {'Region1': ['A', 'A', 'C', 'B', 'A', 'B'],
                'Region2': ['B', 'C', 'A', 'A', 'B', 'A'],
                'var-1': [20, 30, 40, 50, 10, 20],
                'var-2': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['Region1', 'Region2', 'var-1', 'var-2'])
    return df
I want to generate this column:
df['segment']=['A-B','A-C','A-C','A-B','A-B','A-B']
Note that it uses columns 'Region1' and 'Region2', but in sorted order. I have no clue how to do that with pandas. The only solution I have in mind uses a list as an intermediate step:
Regions = df[['Region1', 'Region2']].values.tolist()
segments = []
for i in range(len(Regions)):
    auxRegions = sorted(Regions[i][:])
    segments.append(auxRegions[0] + '-' + auxRegions[1])
df['segments'] = segments
To get:
>>> df['segments']
0 A-B
1 A-C
2 A-C
3 A-B
4 A-B
5 A-B
You need:
df['segments'] = ['-'.join(sorted(tup)) for tup in zip(df['Region1'], df['Region2'])]
Output:
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
np.sort
import numpy as np

v = np.sort(df.iloc[:, :2], axis=1).T
df['segments'] = [f'{i}-{j}' for i, j in zip(v[0], v[1])]  # or '{}-{}'.format(i, j)
df
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
DataFrame.agg + str.join
df['segments'] = pd.DataFrame(
    np.sort(df.iloc[:, :2], axis=1)).agg('-'.join, axis=1)
df
Region1 Region2 var-1 var-2 segments
0 A B 20 3 A-B
1 A C 30 4 A-C
2 C A 40 5 A-C
3 B A 50 1 A-B
4 A B 10 2 A-B
5 B A 20 3 A-B
(The first approach above is faster.)