Add new columns using a pandas Series

I have a pandas DataFrame and a pandas Series. I want to add new constant columns to the DataFrame, one per Series entry, named by the Series label and filled with its value. For example:
In [1]: import pandas as pd
df1 = pd.DataFrame({'a': [1,2,3,4,5], 'b': [2,2,3,2,5]})
In [2]: df1
Out[2]:
a b
0 1 2
1 2 2
2 3 3
3 4 2
4 5 5
In [3]: s1 = pd.Series({'c':2, 'd':3})
In [4]: s1
Out[4]:
c 2
d 3
dtype: int64
In [5]: for key, value in s1.to_dict().items():
   ...:     df1[key] = value
My ugly loop does what I want, but there must be a better solution, maybe using a merge or group operation:
In [6]: df1
Out[6]:
a b c d
0 1 2 2 3
1 2 2 2 3
2 3 3 2 3
3 4 2 2 3
4 5 5 2 3
Any suggestions?

Use DataFrame.assign and unpack the Series with **:
df1 = df1.assign(**s1)
print (df1)
a b c d
0 1 2 2 3
1 2 2 2 3
2 3 3 2 3
3 4 2 2 3
4 5 5 2 3
NumPy solution: build a new DataFrame with numpy.broadcast_to, then join it:
import numpy as np
df = pd.DataFrame(np.broadcast_to(s1.values, (len(df1), len(s1))),
                  index=df1.index,
                  columns=s1.index)
df1 = df1.join(df)
print (df1)
a b c d
0 1 2 2 3
1 2 2 2 3
2 3 3 2 3
3 4 2 2 3
4 5 5 2 3
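
For comparison, here is a minimal self-contained sketch that runs both approaches from the answer above on the question's data. The names via_assign, const and via_join are only illustrative. The ** unpacking relies on the Series labels being strings that are valid keyword names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 2, 3, 2, 5]})
s1 = pd.Series({'c': 2, 'd': 3})

# A Series offers keys() and label-based item access, so ** can unpack it:
# each label becomes a keyword argument, i.e. a new constant column.
via_assign = df1.assign(**s1)

# Broadcast the Series values to one row per row of df1, then join on the index.
const = pd.DataFrame(np.broadcast_to(s1.to_numpy(), (len(df1), len(s1))),
                     index=df1.index, columns=s1.index)
via_join = df1.join(const)

print(via_assign)
print(via_join)
```

Both calls return a new DataFrame and leave df1 itself unchanged.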

Related

How to concatenate a dictionary of pandas DataFrames into a single DataFrame?

I have three DataFrames, each containing a single row:
dfA = pd.DataFrame( {'A':[3], 'B':[2], 'C':[1], 'D':[0]} )
dfB = pd.DataFrame( {'A':[9], 'B':[3], 'C':[5], 'D':[1]} )
dfC = pd.DataFrame( {'A':[3], 'B':[4], 'C':[7], 'D':[8]} )
For instance, dfA is
A B C D
0 3 2 1 0
I organize them in a dictionary:
data = {'row_1': dfA, 'row_2': dfB, 'row_3': dfC}
I want to concatenate them into a single DataFrame
ans = pd.concat(data)
which returns
A B C D
row_1 0 3 2 1 0
row_2 0 9 3 5 1
row_3 0 3 4 7 8
whereas I want to obtain this
A B C D
row_1 3 2 1 0
row_2 9 3 5 1
row_3 3 4 7 8
That is to say I want to "drop" an index column.
How do I do this?
Use DataFrame.reset_index on the second index level with drop=True:
df = ans.reset_index(level=1, drop=True)
print (df)
A B C D
row_1 3 2 1 0
row_2 9 3 5 1
row_3 3 4 7 8
You can reset index:
pd.concat(data).reset_index(level=-1,drop=True)
Output:
A B C D
row_1 3 2 1 0
row_2 9 3 5 1
row_3 3 4 7 8
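
As a possible alternative to reset_index, DataFrame.droplevel removes an index level directly. This is a sketch on the question's data, not one of the posted answers; the name flat is only illustrative.

```python
import pandas as pd

dfA = pd.DataFrame({'A': [3], 'B': [2], 'C': [1], 'D': [0]})
dfB = pd.DataFrame({'A': [9], 'B': [3], 'C': [5], 'D': [1]})
dfC = pd.DataFrame({'A': [3], 'B': [4], 'C': [7], 'D': [8]})
data = {'row_1': dfA, 'row_2': dfB, 'row_3': dfC}

# pd.concat on a dict uses the keys as the outer index level and keeps
# each frame's own index as the inner level.
ans = pd.concat(data)

# Drop the inner level (position 1) so only the dict keys remain.
flat = ans.droplevel(1)
print(flat)
```

This only gives unique row labels when every frame contributes a single row, as in the question.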

Get group counts of level 1 after doing a group by on two columns

I am grouping by two columns and need, for each first-level group, the number of second-level values it contains.
I tried the following:
>>> import pandas as pd
>>> df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'], 'B': [1, 2, 0, 4, 3, 4], 'C': [3,3,3,3,4,8]})
>>> print(df)
A B C
0 one 1 3
1 one 2 3
2 two 0 3
3 three 4 3
4 three 3 4
5 one 4 8
>>> aggregator = {'C': {'sC' : 'sum','cC':'count'}}
>>> df.groupby(["A", "B"]).agg(aggregator)
/envs/pandas/lib/python3.7/site-packages/pandas/core/groupby/generic.py:1315: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
         C
        sC cC
A     B
one   1  3  1
      2  3  1
      4  8  1
three 3  4  1
      4  3  1
two   0  3  1
I want an output something like this where the last column tC gives me the count corresponding to group one, two and three.
         C
        sC cC tC
A     B
one   1  3  1  3
      2  3  1
      4  8  1
three 3  4  1  2
      4  3  1
two   0  3  1  1
If there is only one column for aggregation pass list of tuples:
aggregator = [('sC' , 'sum'),('cC', 'count')]
df = df.groupby(["A", "B"])['C'].agg(aggregator)
For the last column, convert the first level of the MultiIndex to a Series, get the group sizes with GroupBy.transform('size'), and use numpy.where to keep the count only on the first row of each group:
import numpy as np
s = df.index.get_level_values(0).to_series()
df['tC'] = np.where(s.duplicated(), np.nan, s.groupby(s).transform('size'))
print(df)
         sC  cC   tC
A     B
one   1   3   1  3.0
      2   3   1  NaN
      4   8   1  NaN
three 3   4   1  2.0
      4   3   1  NaN
two   0   3   1  1.0
You can also set the duplicated rows to an empty string in the tC column, but then any later numeric operation on that column fails, because it no longer holds purely numeric values:
df['tC'] = np.where(s.duplicated(), '', s.groupby(s).transform('size'))
print(df)
         sC  cC tC
A     B
one   1   3   1  3
      2   3   1
      4   8   1
three 3   4   1  2
      4   3   1
two   0   3   1  1
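
The dict-with-renaming form in the question triggers a deprecation warning. A minimal sketch using named aggregation instead (available in pandas 0.25 and later) could look like the following. Unlike the answer above, it repeats the group size on every row of the group rather than blanking the duplicates; the name out is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three', 'three', 'one'],
                   'B': [1, 2, 0, 4, 3, 4],
                   'C': [3, 3, 3, 3, 4, 8]})

# Named aggregation: output_name=(input_column, function)
out = df.groupby(['A', 'B']).agg(sC=('C', 'sum'), cC=('C', 'count'))

# Number of (A, B) rows inside each A group, repeated on every row of the group.
out['tC'] = out.groupby(level='A')['sC'].transform('size')
print(out)
```

Keeping tC numeric on every row means later arithmetic on the column still works.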

Need to loop over pandas series to find indices of variable

I have a dataframe and a list. I would like to iterate over the elements of the list, find the location of each one in the dataframe, and store the result in a new dataframe.
my_list = ['1','2','3','4','5']
df1 = pd.DataFrame(my_list, columns=['Num'])
dataframe : df1
Num
0 1
1 2
2 3
3 4
4 5
dataframe : df2
0 1 2 3 4
0 9 12 8 6 7
1 11 1 4 10 13
2 5 14 2 0 3
I've tried something like this, but it doesn't work:
for x in my_list:
    i, j = np.array(np.where(df == x)).tolist()
    df2['X'] = df.append(i)
    df2['Y'] = df.append(j)
So I'm looking for a result like this:
dataframe : df1 updated
Num X Y
0 1 1 1
1 2 2 2
2 3 2 4
3 4 1 2
4 5 2 0
Any hints or ideas would be appreciated.
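Instead of searching df2 for each value, why not flatten df2 into long form? Keep the original row label as a column before melting, so that it is not lost:
df2 = df2.reset_index().melt(id_vars='index')
df2.columns = ['X', 'Y', 'Num']
Now df2 has one row per cell, where X is the original row, Y the original column and Num the value. The first rows look like this:
   X  Y  Num
0  0  0    9
1  1  0   11
2  2  0    5
3  0  1   12
4  1  1    1
5  2  1   14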
You can of course sort by Num and if you just want the values from your list you can further filter df2:
df2 = df2[df2.Num.isin(my_list)]
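
A self-contained sketch of the same idea, using stack() to get (row, column) pairs and a merge to attach them to the list. It assumes the values in df2 are integers, so my_list is written with integers here instead of the strings in the question; long_df and found are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame([[9, 12, 8, 6, 7],
                    [11, 1, 4, 10, 13],
                    [5, 14, 2, 0, 3]])
my_list = [1, 2, 3, 4, 5]  # assumption: integers, matching df2's dtype

# stack() yields a Series indexed by (row label, column label).
long_df = df2.stack().rename_axis(['X', 'Y']).reset_index(name='Num')

# Keep only the wanted values, then attach X and Y in the order of my_list.
found = long_df[long_df['Num'].isin(my_list)]
df1 = pd.DataFrame({'Num': my_list}).merge(found, on='Num', how='left')
print(df1)
```

If a value occurred more than once in df2, the merge would return one row per occurrence.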

In pandas, how to set_index with using column index instead of referring to column names?

For example:
We have a pandas DataFrame foo with 2 columns ['A', 'B'].
I want to do something like
foo.set_index([0,1])
instead of
foo.set_index(['A', 'B'])
I also tried foo.set_index([[0,1]]), but got this error:
Length mismatch: Expected axis has 9 elements, new values have 2 elements
If the column index is unique you could use:
df.set_index(list(df.columns[cols]))
where cols is a list of ordinal indices.
For example,
In [77]: np.random.seed(2016)
In [79]: df = pd.DataFrame(np.random.randint(10, size=(5,4)), columns=list('ABCD'))
In [80]: df
Out[80]:
A B C D
0 3 7 2 3
1 8 4 8 7
2 9 2 6 3
3 4 1 9 1
4 2 2 8 9
In [81]: df.set_index(list(df.columns[[0,2]]))
Out[81]:
B D
A C
3 2 7 3
8 8 4 7
9 6 2 3
4 9 1 1
2 8 2 9
If the DataFrame's column index is not unique, then setting the index by label
is impossible and by ordinals more complicated:
import numpy as np
import pandas as pd
np.random.seed(2016)
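def set_ordinal_index(df, cols):
    # Temporarily replace the (non-unique) labels with ordinal positions
    columns, df.columns = df.columns, np.arange(len(df.columns))
    mask = df.columns.isin(cols)
    df = df.set_index(cols)
    # Restore the original labels on the remaining columns and the index
    df.columns = columns[~mask]
    df.index.names = columns[mask]
    return df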
df = pd.DataFrame(np.random.randint(10, size=(5,4)), columns=list('AAAA'))
print(set_ordinal_index(df, [0,2]))
yields
A A
A A
3 2 7 3
8 8 4 7
9 6 2 3
4 9 1 1
2 8 2 9
This worked for me, the other answer didn't.
# single column
df.set_index(df.columns[1])
# multi column
df.set_index(df.columns[[1, 0]].tolist())
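
Wrapped up as a small helper, the position-to-label translation might look like this. set_index_by_position is a hypothetical name, not a pandas function, and the sketch assumes the column labels are unique.

```python
import pandas as pd

def set_index_by_position(df, positions):
    """Hypothetical helper: set the index from column positions (unique labels assumed)."""
    labels = df.columns[positions].tolist()  # translate positions to labels
    return df.set_index(labels)

foo = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z'], 'C': [0.1, 0.2, 0.3]})
indexed = set_index_by_position(foo, [0, 1])
print(indexed)
```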

How to prepend pandas data frames

How can I prepend a dataframe to another dataframe? Consider dataframe A:
b c d
2 3 4
6 7 8
and dataframe B:
a
1
5
I want to prepend B to A to get:
a b c d
1 2 3 4
5 6 7 8
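Two methods (the session assumes import pandas as pd, from pandas import DataFrame, and from numpy.random import randint):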
In [1]: df1 = DataFrame(randint(0,10,size=(12)).reshape(4,3),columns=list('bcd'))
In [2]: df1
Out[2]:
b c d
0 5 9 5
1 8 4 0
2 8 4 5
3 4 9 2
In [3]: df2 = DataFrame(randint(0,10,size=(4)).reshape(4,1),columns=list('a'))
In [4]: df2
Out[4]:
a
0 4
1 9
2 2
3 0
Concatenating (returns a new frame):
In [6]: pd.concat([df2,df1],axis=1)
Out[6]:
a b c d
0 4 5 9 5
1 9 8 4 0
2 2 8 4 5
3 0 4 9 2
Inserting (puts a Series into an existing frame, in place):
In [8]: df1.insert(0,'a',df2['a'])
In [9]: df1
Out[9]:
a b c d
0 4 5 9 5
1 9 8 4 0
2 2 8 4 5
3 0 4 9 2
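
Applied to the frames A and B from the question, the two methods can be sketched as follows. The loop over B.columns is an assumption added here so that the insert variant also covers a B with several columns; prepended and A_inplace are illustrative names.

```python
import pandas as pd

A = pd.DataFrame({'b': [2, 6], 'c': [3, 7], 'd': [4, 8]})
B = pd.DataFrame({'a': [1, 5]})

# Method 1: concatenate column-wise with B first; returns a new frame.
prepended = pd.concat([B, A], axis=1)

# Method 2: insert B's columns at the front of a copy of A, in place.
A_inplace = A.copy()
for pos, col in enumerate(B.columns):
    A_inplace.insert(pos, col, B[col])

print(prepended)
print(A_inplace)
```

Both variants align rows on the index, so A and B should share the same index.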
Achieved by doing
A[B.columns]=B