Pandas loop groupby

I have a DataFrame df1:
Col1  Col2  Col3  Col4
12    10    R1    0.1
12    10    R2    0.1
12    8     R3    0.6
11    4     R4    0.2
12    10    R5    0.4
11    4     R6    0.1
df2 is a subset of df1:
Col1  Col2  Count
12    10    3
12    8     1
11    4     2
I want to select the rows of df1 that match df2 on Col1 and Col2, and to automate this for every combination in df2.
For the combination (12, 10) in df2, I want the matching rows of df1:
Col1  Col2  Col3  Col4
12    10    R1    0.1
12    10    R2    0.1
12    10    R5    0.4
Similarly, for the next combination in df2, (12, 8):
Col1  Col2  Col3  Col4
12    8     R3    0.6
And likewise for the next combination, (11, 4):
Col1  Col2  Col3  Col4
11    4     R4    0.2
11    4     R6    0.1
I have tried df3 = df1[(df1.Col1 == 12.0) & (df1.Col2 == 10)], but I want to automate this without hard-coding the combination.

I think your second DataFrame is not necessary; just loop over each combination of unique values in the Col1 and Col2 columns:
for i, g in df1.groupby(['Col1','Col2']):
    print (i)
    print (g)
If you want a more dynamic solution, build a dictionary of DataFrames:
d = {f'{i[0]}_{i[1]}':g for i, g in df1.groupby(['Col1','Col2'])}
print (d)
{'11_4': Col1 Col2 Col3 Col4
3 11 4 R4 0.2
5 11 4 R6 0.1, '12_8': Col1 Col2 Col3 Col4
2 12 8 R3 0.6, '12_10': Col1 Col2 Col3 Col4
0 12 10 R1 0.1
1 12 10 R2 0.1
4 12 10 R5 0.4}
print (d['11_4'])
Col1 Col2 Col3 Col4
3 11 4 R4 0.2
5 11 4 R6 0.1
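If you do want to drive the selection from df2 explicitly, a minimal sketch (assuming df2 holds exactly the Col1/Col2 combinations of interest) is an inner merge on both key columns, which keeps only the df1 rows whose (Col1, Col2) pair appears in df2:
# all matching rows at once: inner merge on the two key columns
df3 = df1.merge(df2[['Col1', 'Col2']], on=['Col1', 'Col2'])
# or one combination at a time, looping over df2's rows
for _, row in df2.iterrows():
    subset = df1[(df1.Col1 == row.Col1) & (df1.Col2 == row.Col2)]
    print (subset)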

Related

How to split a Series of arrays into multiple Series?

One column of my dataset contains numpy arrays as elements. I want to split it into multiple columns, each holding one value of the array.
The data currently looks like this:
   column1  column2              column3
0  1        np.array([1,2,3,4])  4.5
1  2        np.array([5,6,7,8])  3
I want to convert it into:
   column1  col1  col2  col3  col4  column3
0  1        1     2     3     4     4.5
1  2        5     6     7     8     3
Another possible solution, based on pandas.DataFrame.from_records (here the array column is named 'col'):
out = pd.DataFrame.from_records(
    df['col'], columns=[f'col{i+1}' for i in range(len(df.loc[0, 'col']))])
Output:
   col1  col2  col3  col4
0  1     2     3     4
1  5     6     7     8
As an alternative:
df = pd.DataFrame(data={'col': [np.array([1,2,3,4]), np.array([5,6,7,8])]})
new_df = pd.DataFrame(df.col.tolist(), index=df.index)  # explode the column to a new dataframe, keeping the old index
new_df.columns = ["col_{}".format(i) for i in range(1, len(new_df.columns) + 1)]
'''
   col_1  col_2  col_3  col_4
0  1      2      3      4
1  5      6      7      8
'''
I hope I've understood your question well. You can leverage result_type="expand" of the .apply method (this assumes df contains only the array column):
df = df.apply(
    lambda x: {f"col{k}": vv for v in x for k, vv in enumerate(v, 1)},
    result_type="expand",
    axis=1,
)
print(df)
Prints:
   col1  col2  col3  col4
0  1     2     3     4
1  5     6     7     8
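To recover the exact layout asked for (column1, col1..col4, column3), one possible sketch, assuming the array column is named column2 as in the question, expands that column and concatenates it back between the untouched ones:
# expand the array column into its own frame, then stitch everything together
expanded = pd.DataFrame(df['column2'].tolist(), index=df.index,
                        columns=[f'col{i+1}' for i in range(len(df['column2'].iloc[0]))])
out = pd.concat([df[['column1']], expanded, df[['column3']]], axis=1)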

Is there a way to count the number of values in a row that are greater than a "variable" value in Pandas?

I have two separate DataFrames:
df1:
Col1  Col2  Col3  Col4  Col5
ID1   2     3     5     0
ID2   7     6     11    5
ID3   9     16    20    12
df2:
Col1  ColB
ID1   2
ID2   7
ID3   9
Is there a way to count how many values in the first row of df1 are greater than the first value in column ColB of df2? I need this count for each row, appended at the end of df1, so that df1 looks like this:
df1:
Col1  Col2  Col3  Col4  Col5  COUNT
ID1   2     3     5     0     2
ID2   7     6     11    5     1
ID3   9     16    20    12    3
Thank you for any suggestion!
This assumes 'Col1' is the index; if it is not, add .set_index('Col1') after df1/df2 on the right-hand side of the commands.
You can use the underlying numpy array:
df1['COUNT'] = (df1.values > df2.values).sum(axis=1)
# if "Col1" is not the index
df1['COUNT'] = (df1.set_index('Col1').values > df2.set_index('Col1').values).sum(axis=1)
or:
df1['COUNT'] = df1.gt(df2['ColB'].values[:, None]).sum(axis=1)
# if "Col1" is not the index
df1['COUNT'] = df1.set_index('Col1').gt(df2['ColB'].values[:, None]).sum(axis=1)
Output:
      Col2  Col3  Col4  Col5  COUNT
Col1
ID1   2     3     5     0     2
ID2   7     6     11    5     1
ID3   9     16    20    12    3
Try this:
df1 = df1.set_index('Col1')
df1.assign(COUNT=df1.gt(df2.set_index('Col1').squeeze(), axis=0).sum(axis=1))
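For completeness, a minimal runnable sketch of the same idea, rebuilding the toy frames from the question (only names from the question are used):
import pandas as pd

df1 = pd.DataFrame({'Col1': ['ID1', 'ID2', 'ID3'],
                    'Col2': [2, 7, 9], 'Col3': [3, 6, 16],
                    'Col4': [5, 11, 20], 'Col5': [0, 5, 12]})
df2 = pd.DataFrame({'Col1': ['ID1', 'ID2', 'ID3'], 'ColB': [2, 7, 9]})
# compare each row against its ColB value (aligned on Col1), then count per row
counts = df1.set_index('Col1').gt(df2.set_index('Col1')['ColB'], axis=0).sum(axis=1)
df1['COUNT'] = counts.values  # gives 2, 1, 3 as in the expected output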

Pandas read csv with repeating header rows

I have a csv file where the data is as follows:
    Col1  Col2  Col3
v1  5     9     5
v2  6     10    6
    Col1  Col2  Col3
x1  2     4     6
x2  1     2     10
x3  10    2     1
    Col1  Col2  Col3
y1  9     2     7
i.e. there are 3 different tables with the same headers stacked on top of each other. I am trying to pythonically get rid of the repeated header rows and obtain the following result:
    Col1  Col2  Col3
v1  5     9     5
v2  6     10    6
x1  2     4     6
x2  1     2     10
x3  10    2     1
y1  9     2     7
I am not sure how to proceed.
You can read the data and remove the rows that are identical to the column names:
df = pd.read_csv('file.csv')
df = df[df.ne(df.columns).any(axis=1)]
Output:
    Col1  Col2  Col3
v1  5     9     5
v2  6     10    6
x1  2     4     6
x2  1     2     10
x3  10    2     1
y1  9     2     7
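One caveat: because the repeated header rows are initially read in as data, every column comes back with object (string) dtype even after filtering. A small follow-up sketch, assuming all remaining values are numeric, converts them afterwards:
# the filtered frame still holds strings; convert each column to numbers
df = df.apply(pd.to_numeric)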
An alternative solution is to detect the repeated header rows first, and then use the skiprows=... argument in read_csv().
This has the downside of reading the data twice, but has the advantage that it allows read_csv() to automatically parse the correct datatypes, and you won't have to cast them afterwards using astype().
This example uses hard-coded column name for the first column, but a more advanced version could determine the header from the first row, and then detect repeats of that.
# read the file once to detect the repeated header rows
header_rows = []
header_start = "Col1"
with open('file.csv') as f:
    for i, line in enumerate(f):
        if line.startswith(header_start):
            header_rows.append(i)
# the first (real) header row should always be detected
assert header_rows[0] == 0
# skip all header rows except for the first one (the real one)
df = pd.read_csv('file.csv', skiprows=header_rows[1:])
Output:
    Col1  Col2  Col3
v1  5     9     5
v2  6     10    6
x1  2     4     6
x2  1     2     10
x3  10    2     1
y1  9     2     7

Updating pandas dataframe values assigns NaN

I have a dataframe with 3 columns: Col1, Col2 and Col3.
Toy example:
d = {'Col1': ['hello', 'k', 'hello', 'we', 'r'],
     'Col2': [10, 20, 30, 40, 50],
     'Col3': [1, 2, 3, 4, 5]}
df = pd.DataFrame(d)
This gives:
   Col1   Col2  Col3
0  hello  10    1
1  k      20    2
2  hello  30    3
3  we     40    4
4  r      50    5
I select the values of Col2 where the value in Col1 is 'hello':
my_values = df.loc[df['Col1']=='hello']['Col2']
This returns a Series with the values of Col2 as well as their index:
0 10
2 30
Name: Col2, dtype: int64
Now suppose I want to assign these values to Col3, replacing only those values (at index 0 and 2) and keeping the other values in Col3 unmodified. I tried:
df['Col3'] = my_values
but this assigns NaN to the other rows (the ones where Col1 is not 'hello'):
   Col1   Col2  Col3
0  hello  10    10
1  k      20    NaN
2  hello  30    30
3  we     40    NaN
4  r      50    NaN
How can I update certain values in Col3 leaving the others untouched?
   Col1   Col2  Col3
0  hello  10    10
1  k      20    2
2  hello  30    30
3  we     40    4
4  r      50    5
So, in short: having my_values, I want to put those values into Col3.
You can base this on np.where:
df['Col3'] = np.where(df['Col1'] == 'hello', df.Col2, df.Col3)
Or, based on your my_values (note the capital C in 'Col3'):
df.loc[my_values.index, 'Col3'] = my_values
Or you can just use update:
df['Col3'].update(my_values)
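Equivalently, a small sketch of the most direct idiom: assign through the boolean mask itself, so no intermediate Series is needed at all:
# select rows and target column in one .loc call, then assign from Col2
mask = df['Col1'] == 'hello'
df.loc[mask, 'Col3'] = df.loc[mask, 'Col2']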

Append Pandas Series to DataFrame as a column [duplicate]

I have a pandas DataFrame (df) with columns ['key', 'col1', 'col2', 'col3'], and a pandas Series (sr) whose index matches the 'key' values in the DataFrame. I want to append the series to the dataframe as a new column called col4, matched on 'key'. I have the following code:
for index, row in segmention.iterrows():
    df[df['key'] == row['key']]['col4'] = sr.loc[row['key']]
The code is very slow. I assume there is a more efficient way to do this; could you please help?
If I don't misunderstand, you can simply do:
df['col4'] = sr
Use map, as EdChum mentioned:
df['col4'] = df['key'].map(sr)
print (df)
   col1  col2  col3 key  col4
0  4     7     1    A    2
1  5     8     3    B    4
2  6     9     5    C    1
Or assign with set_index:
df = df.set_index('key')
df['col4'] = sr
print (df)
     col1  col2  col3  col4
key
A    4     7     1     2
B    5     8     3     4
C    6     9     5     1
If you don't need the Series data aligned by key, use the raw values (note the difference: 2,4,1 vs 4,1,2):
df['col4'] = sr.values
print (df)
   col1  col2  col3 key  col4
0  4     7     1    A    4
1  5     8     3    B    1
2  6     9     5    C    2
Sample:
df = pd.DataFrame({'col1': [4, 5, 6],
                   'col2': [7, 8, 9],
                   'col3': [1, 3, 5],
                   'key': list('ABC')})
print (df)
   col1  col2  col3 key
0  4     7     1    A
1  5     8     3    B
2  6     9     5    C
sr = pd.Series([4,1,2], index=list('BCA'))
print (sr)
B 4
C 1
A 2
dtype: int64
df['col4'] = df['key'].map(sr)
print (df)
   col1  col2  col3 key  col4
0  4     7     1    A    2
1  5     8     3    B    4
2  6     9     5    C    1
df = df.set_index('key')
df['col4'] = sr
print (df)
     col1  col2  col3  col4
key
A    4     7     1     2
B    5     8     3     4
C    6     9     5     1
This is really a good use case for join, where the left dataframe aligns a column with the index of the right dataframe/series. You have to make sure your Series has a name for it to work:
sr.name = 'some name'
df.join(sr, on='key')
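As a small usage note: Series.rename with a scalar returns a copy with that name, so the two steps collapse into one without mutating sr (the name 'col4' here is just the desired column label):
df = df.join(sr.rename('col4'), on='key')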