Append Pandas Series to DataFrame as a column [duplicate]

This question already has answers here: vlookup in Pandas using join (3 answers). Closed 6 years ago.
I have a pandas DataFrame (df) with columns ['key','col1','col2','col3'], and a pandas Series (sr) whose index is the same as 'key' in the DataFrame. I want to append the Series to the DataFrame as a new column called col4, matched on 'key'. I have the following code:
for index, row in segmention.iterrows():
    df[df['key']==row['key']]['col4'] = sr.loc[row['key']]
The code is very slow. I assume there is a more efficient and better way to do this. Could you please help?

You can simply do:
df['col4'] = sr
If I don't misunderstand.
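Note that this relies on index alignment: it only gives the right result if df's index is already the key values, matching sr's index. A minimal sketch of what the assignment does:

import pandas as pd

# works only because df is indexed by the key values, matching sr's index
df = pd.DataFrame({'col1': [4, 5, 6]}, index=list('ABC'))
sr = pd.Series([4, 1, 2], index=list('BCA'))

df['col4'] = sr  # aligned on index: A -> 2, B -> 4, C -> 1
print(df)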

Use map, as mentioned by EdChum:
df['col4'] = df['key'].map(sr)
print (df)
col1 col2 col3 key col4
0 4 7 1 A 2
1 5 8 3 B 4
2 6 9 5 C 1
Or assign with set_index:
df = df.set_index('key')
df['col4'] = sr
print (df)
col1 col2 col3 col4
key
A 4 7 1 2
B 5 8 3 4
C 6 9 5 1
If you don't need to align the data in the Series by key, use the following (see the difference: 2,4,1 vs 4,1,2):
df['col4'] = sr.values
print (df)
col1 col2 col3 key col4
0 4 7 1 A 4
1 5 8 3 B 1
2 6 9 5 C 2
Sample:
df = pd.DataFrame({'key':['A','B','C'],
                   'col1':[4,5,6],
                   'col2':[7,8,9],
                   'col3':[1,3,5]})
print (df)
col1 col2 col3 key
0 4 7 1 A
1 5 8 3 B
2 6 9 5 C
sr = pd.Series([4,1,2], index=list('BCA'))
print (sr)
B 4
C 1
A 2
dtype: int64
df['col4'] = df['key'].map(sr)
print (df)
col1 col2 col3 key col4
0 4 7 1 A 2
1 5 8 3 B 4
2 6 9 5 C 1
df = df.set_index('key')
df['col4'] = sr
print (df)
col1 col2 col3 col4
key
A 4 7 1 2
B 5 8 3 4
C 6 9 5 1

This is really a good use case for join, where the left DataFrame aligns a column with the index of the right DataFrame/Series. You have to make sure your Series has a name for it to work:
sr.name = 'some name'
df.join(sr, on='key')
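For example, with sample data shaped like the question's (rename is an alternative to assigning .name directly):

import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C'], 'col1': [4, 5, 6]})
sr = pd.Series([4, 1, 2], index=list('BCA'))

# join aligns df['key'] against sr's index; the Series must be named
out = df.join(sr.rename('col4'), on='key')
print(out)
#   key  col1  col4
# 0   A     4     2
# 1   B     5     4
# 2   C     6     1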

Related

How to split a series with array to multiple series?

One column of my dataset contains numpy arrays as elements. I want to split it into multiple columns, each holding one value of the array.
The data currently looks like this:
column1 column2 column3
0 1 np.array([1,2,3,4]) 4.5
1 2 np.array([5,6,7,8]) 3
I want to convert it into:
column1 col1 col2 col3 col4 column3
0 1 1 2 3 4 4.5
1 2 5 6 7 8 3
Another possible solution, based on pandas.DataFrame.from_records:
out = pd.DataFrame.from_records(
    df['col'], columns=[f'col{i+1}' for i in range(len(df.loc[0, 'col']))])
Output:
col1 col2 col3 col4
0 1 2 3 4
1 5 6 7 8
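For a self-contained run, this assumes a frame like the one defined in the alternative below, with a 'col' column holding equal-length arrays:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8])]})
out = pd.DataFrame.from_records(
    df['col'], columns=[f'col{i+1}' for i in range(len(df.loc[0, 'col']))])
print(out)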
As an alternative:
df = pd.DataFrame(data={'col': [np.array([1,2,3,4]), np.array([5,6,7,8])]})
# explode the column to a new dataframe, keeping the index from the old df
new_df = pd.DataFrame(df.col.tolist(), index=df.index)
new_df.columns = ["col_{}".format(i) for i in range(1, len(new_df.columns) + 1)]
'''
col_1 col_2 col_3 col_4
0 1 2 3 4
1 5 6 7 8
'''
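To reproduce the questioner's full layout (column1, col1..col4, column3), the expanded frame can be concatenated back onto the remaining columns, for example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [1, 2],
                   'column2': [np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8])],
                   'column3': [4.5, 3]})

# expand the array column, then stitch the frame back together in order
new_df = pd.DataFrame(df['column2'].tolist(), index=df.index)
new_df.columns = [f'col{i}' for i in range(1, len(new_df.columns) + 1)]
out = pd.concat([df[['column1']], new_df, df[['column3']]], axis=1)
print(out)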
I hope I've understood your question well. You can leverage the result_type="expand" option of the .apply method:
df = df.apply(
    lambda x: {f"col{k}": vv for v in x for k, vv in enumerate(v, 1)},
    result_type="expand",
    axis=1,
)
print(df)
Prints:
col1 col2 col3 col4
0 1 2 3 4
1 5 6 7 8

Pandas read csv with repeating header rows

I have a csv file where the data is as follows:
Col1 Col2 Col3
v1 5 9 5
v2 6 10 6
Col1 Col2 Col3
x1 2 4 6
x2 1 2 10
x3 10 2 1
Col1 Col2 Col3
y1 9 2 7
i.e. there are 3 different tables with the same headers stacked on top of each other. I am trying to get rid of the repeating header rows pythonically and arrive at the following result:
Col1 Col2 Col3
v1 5 9 5
v2 6 10 6
x1 2 4 6
x2 1 2 10
x3 10 2 1
y1 9 2 7
I am not sure how to proceed.
You can read the data and remove the rows that are identical to the column names:
df = pd.read_csv('file.csv')
df = df[df.ne(df.columns).any(axis=1)]
Output:
Col1 Col2 Col3
v1 5 9 5
v2 6 10 6
x1 2 4 6
x2 1 2 10
x3 10 2 1
y1 9 2 7
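One caveat: because the extra header lines were read in as data, every column comes back as object (string) dtype, as the alternative answer below points out. One way to convert them afterwards:

# the repeated headers forced object dtype; cast the values back to numbers
df = df.apply(pd.to_numeric)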
An alternative solution is to detect the repeated header rows first, and then use the skiprows=... argument in read_csv().
This has the downside of reading the data twice, but has the advantage that it allows read_csv() to automatically parse the correct datatypes, and you won't have to cast them afterwards using astype().
This example uses a hard-coded column name for the first column, but a more advanced version could determine the header from the first row and then detect repeats of it (see the sketch after the output below).
# read the file once to detect the repeated header rows
header_rows = []
header_start = "Col1"
with open('file.csv') as f:
    for i, line in enumerate(f):
        if line.startswith(header_start):
            header_rows.append(i)

# the first (real) header row should always be detected
assert header_rows[0] == 0

# skip all header rows except for the first one (the real one)
df = pd.read_csv('file.csv', skiprows=header_rows[1:])
Output:
Col1 Col2 Col3
v1 5 9 5
v2 6 10 6
x1 2 4 6
x2 1 2 10
x3 10 2 1
y1 9 2 7
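Following up on the remark above, a sketch of the "more advanced version" that detects repeats of the first line instead of hard-coding "Col1":

import pandas as pd

# treat any later line identical to the first line as a repeated header
header_rows = []
with open('file.csv') as f:
    first_line = None
    for i, line in enumerate(f):
        if first_line is None:
            first_line = line
        elif line == first_line:
            header_rows.append(i)

df = pd.read_csv('file.csv', skiprows=header_rows)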

Updating pandas dataframe values assigns NaN

I have a dataframe with 3 columns: Col1, Col2 and Col3.
Toy example
d = {'Col1':['hello','k','hello','we','r'],
     'Col2':[10,20,30,40,50],
     'Col3':[1,2,3,4,5]}
df = pd.DataFrame(d)
Which gets:
Col1 Col2 Col3
0 hello 10 1
1 k 20 2
2 hello 30 3
3 we 40 4
4 r 50 5
I am selecting the values of Col2 such that the value in Col1 is 'hello'
my_values = df.loc[df['Col1']=='hello']['Col2']
This returns a Series where I can see the values of Col2 as well as the index:
0 10
2 30
Name: Col2, dtype: int64
Now suppose I want to assign these values to Col3.
I only want to replace those values (index 0 and 2), keeping the other values in Col3 unmodified.
I tried:
df['Col3'] = my_values
But this assigns NaN to the other values (the ones where Col1 is not 'hello'):
Col1 Col2 Col3
0 hello 10 10
1 k 20 NaN
2 hello 30 30
3 we 40 NaN
4 r 50 NaN
How can I update certain values in Col3 leaving the others untouched?
Col1 Col2 Col3
0 hello 10 10
1 k 20 2
2 hello 30 30
3 we 40 4
4 r 50 5
So, in short: having my_values, I want to put them into Col3.
Or just use np.where:
df['Col3'] = np.where(df['Col1'] == 'hello', df.Col2, df.Col3)
Or, based on your my_values:
df.loc[my_values.index, 'Col3'] = my_values
Or you can just use update:
df['Col3'].update(my_values)
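For reference, a minimal end-to-end sketch of the update approach, which aligns my_values on index and only overwrites rows 0 and 2 (note that under pandas' copy-on-write mode, updating through a column view like this may no longer propagate to the frame):

import pandas as pd

df = pd.DataFrame({'Col1': ['hello', 'k', 'hello', 'we', 'r'],
                   'Col2': [10, 20, 30, 40, 50],
                   'Col3': [1, 2, 3, 4, 5]})
my_values = df.loc[df['Col1'] == 'hello', 'Col2']

# update aligns on index, leaving rows 1, 3 and 4 untouched
df['Col3'].update(my_values)
print(df)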

Pandas count occurrence within column on condition being satisfied

I am trying to do a count by grouping; see the input and output below.
input:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['col3'] = [1,1,1,1,1,1,1]
desired output:
col4
0 2
1 2
2 2
3 2
4 1
5 1
6 1
I tried playing around with groupby and count, doing:
s = df.groupby(['col1','col2'])['col3'].sum()
and the output I got was
a 4 2
5 2
b 6 1
7 1
8 1
How do I add it as a column on the main df?
Thanks very much!
Use transform with len or 'size':
df['count'] = df.groupby(['col1','col2'])['col3'].transform(len)
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
df['count'] = df.groupby(['col1','col2'])['col3'].transform('size')
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
But column col3 is not necessary; you can use col1 or col2:
df = pd.DataFrame()
df['col1'] = ['a','a','a','a','b','b','b']
df['col2'] = [4,4,5,5,6,7,8]
df['count'] = df.groupby(['col1','col2'])['col1'].transform(len)
df['count1'] = df.groupby(['col1','col2'])['col2'].transform(len)
print (df)
col1 col2 count count1
0 a 4 2 2
1 a 4 2 2
2 a 5 2 2
3 a 5 2 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1
Try this (it works here only because col3 is all ones, so the group sum equals the group count):
df['count'] = df.groupby(['col1','col2'])['col3'].transform(sum)
print (df)
col1 col2 col3 count
0 a 4 1 2
1 a 4 1 2
2 a 5 1 2
3 a 5 1 2
4 b 6 1 1
5 b 7 1 1
6 b 8 1 1

Select some columns based on WHERE in a dataframe

So, I am working with Blaze and wanted to perform this query on a dataframe:
SELECT col1,col2 FROM table WHERE col1 > 0
For SELECT *, this works: d[d.col1 > 0]. But I want col1 and col2 only rather than all columns. How should I go about it?
Thanks in advance!
Edit: Here I create d as: d = Data('postgresql://uri')
This also works: d[d.col1 > 0][['col1','col2']]
I think you can first select the column subset and then apply boolean indexing:
print (d)
col1 col2 col3
0 -1 4 7
1 2 5 8
2 3 6 9
d = d[['col1','col2']]
print (d)
col1 col2
0 -1 4
1 2 5
2 3 6
print (d[d.col1 > 0])
col1 col2
1 2 5
2 3 6
This is the same as:
print (d[['col1','col2']][d.col1 > 0])
col1 col2
1 2 5
2 3 6
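For comparison, in plain pandas (rather than Blaze) the same SELECT ... WHERE can be written in a single step with .loc, which takes a row mask and a column list at once:

import pandas as pd

d = pd.DataFrame({'col1': [-1, 2, 3], 'col2': [4, 5, 6], 'col3': [7, 8, 9]})

# rows where col1 > 0, restricted to col1 and col2
print(d.loc[d.col1 > 0, ['col1', 'col2']])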