Let's say I have a DataFrame:
df = pd.DataFrame({'A': [6,5,9,6,2]})
I also have an array/Series:
ser = pd.Series([5,6,7])
How can I insert this Series into the existing df as a new column, but starting at a specific index, while "padding" the missing indexes with NaN (I think pandas does this automatically)?
I.e., pseudocode:
insert ser into df at index 2 as column 'B'
Example output:
   A |  B
----------
1| 6 | NaN
2| 5 | 5
3| 9 | 6
4| 6 | 7
5| 2 | NaN
Assuming that the start index value is in the startInd variable:
startInd = 2
use the following code:
df['B'] = pd.Series(data=ser.values,
                    index=df.index[df.index >= startInd].to_series().iloc[:ser.size])
Details (a quick sketch of the intermediate values follows this list):
df.index[df.index >= startInd] - returns the fragment of df.index starting from the "start value" (for now, up to the end).
.to_series() - converts it to a Series (in order to be able to "slice" it using iloc in a moment).
.iloc[:ser.size] - takes only as many values as needed.
index=... - uses what we got in the previous step as the index of the created Series.
pd.Series(data=ser.values, ... - creates a Series, the source of the data that will be saved in the new column of df (in a moment).
df['B'] = - saves the above data in a new column (only in rows with index values matching the above index; other rows are set to NaN).
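A quick sketch of the intermediate values, assuming df's index runs 1-5 as in the tables here (e.g. df.index = range(1, 6)) and startInd = 2; the step1/step2 names are only for illustration:
step1 = df.index[df.index >= startInd]       # the index labels 2, 3, 4, 5
step2 = step1.to_series().iloc[:ser.size]    # keep only as many labels as ser has values: 2, 3, 4
pd.Series(data=ser.values, index=step2)
# 2    5
# 3    6
# 4    7
# dtype: int64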
There is a subtle but unavoidable difference from your expected result:
As some values are NaN, the type of the new column is coerced to float.
So the result is:
A B
1 6 NaN
2 5 5.0
3 9 6.0
4 6 7.0
5 2 NaN
Related
I have the following DataFrame
df
I count the values this way.
I want to have the category values in the following order:
1.0 1
4.0 1
7.0 2
10.0 1
and so on ...
That is, in ascending order of the category value, each with its respective count.
You can sort by index using sort_index()
df['col_1'].value_counts().sort_index()
You can sort on the index after calling value_counts. Here's an example:
df = pd.DataFrame({'x':[1,2,2,2,1,4,5,5,4,3,5,6,3]})
df['x'].value_counts().sort_index()
Output:
1 2
2 3
3 2
4 2
5 3
6 1
Name: x, dtype: int64
I created a DataFrame with the values shown, and then sorted it with the code below:
df1.groupby(by=['Cat']).count().sort_values(by='col1')
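Since the original data and output are not shown here, a minimal self-contained sketch of this approach (the Cat/col1 names and values are only assumed for illustration); note that sort_values(by='col1') orders by the count rather than by the category value:
import pandas as pd

df1 = pd.DataFrame({'Cat': [1.0, 4.0, 7.0, 7.0, 10.0],   # hypothetical data
                    'col1': ['a', 'b', 'c', 'd', 'e']})

df1.groupby(by=['Cat']).count().sort_values(by='col1')
#       col1
# Cat
# 1.0      1
# 4.0      1
# 10.0     1
# 7.0      2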
I am performing a groupby and apply over a dataframe and it is returning some strange results. I am using pandas 1.3.1.
Here is the code:
ddf = pd.DataFrame({
    "id": [1, 1, 1, 1, 2]
})

def do_something(df):
    return "x"

ddf["title"] = ddf.groupby("id").apply(do_something)
ddf
I would expect every row in the title column to be assigned the value "x", but instead I get this data:
id title
0 1 NaN
1 1 x
2 1 x
3 1 NaN
4 2 NaN
Is this expected?
The result is not strange; it's the right behavior. apply returns one value per group, here for groups 1 and 2, and those group labels become the index of the aggregation:
>>> list(ddf.groupby("id"))
[(1,       # the group name (the future index of the grouped df)
     id    # the subset dataframe of group 1
  0   1
  1   1
  2   1
  3   1),
 (2,       # the group name (the future index of the grouped df)
     id    # the subset dataframe of group 2
  4   2)]
Why do you get any values at all? Because the group labels happen to match some of your dataframe's index values:
>>> ddf.groupby("id").apply(do_something)
id
1 x
2 x
dtype: object
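Column assignment then aligns this result on ddf's index, which is what produces the NaN pattern above; a quick sketch of that alignment:
result = ddf.groupby("id").apply(do_something)   # Series indexed by the group labels 1 and 2
result.reindex(ddf.index)                        # aligned back onto ddf's index 0..4
# 0    NaN
# 1      x
# 2      x
# 3    NaN
# 4    NaN
# dtype: object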
Now change the id like this:
ddf['id'] += 10
# id
# 0 11
# 1 11
# 2 11
# 3 11
# 4 12
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 0 11 NaN
# 1 11 NaN
# 2 11 NaN
# 3 11 NaN
# 4 12 NaN
Or change the index:
ddf.index += 10
# id
# 10 1
# 11 1
# 12 1
# 13 1
# 14 2
ddf["title"] = ddf.groupby("id").apply(do_something)
# id title
# 10 1 NaN
# 11 1 NaN
# 12 1 NaN
# 13 1 NaN
# 14 2 NaN
Yes, it is expected.
First of all, the apply(do_something) part works like a charm; it is the groupby right before it that causes the problem.
A groupby returns a GroupBy object, which is a little different from a normal dataframe. If you debug and inspect what the groupby returns, you can see that you need some form of summary function to use it (mean, max or sum). If you run one of them as an example, like this:
df = ddf.groupby("id")
df.mean()
it leads to this result:
Empty DataFrame
Columns: []
Index: [1, 2]
After that, do_something is applied to index 1 and 2 only, and the result is then integrated into your original df. This is why you only have index 1 and 2 with x.
For now I would recommend leaving out the groupby, since it is not clear why you want to use it here anyway.
And have a deeper look into the groupby object.
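For example, a quick way to see what the groupby object holds (a sketch using the ddf from the question):
g = ddf.groupby("id")
g.groups       # maps each group label to its row labels: {1: [0, 1, 2, 3], 2: [4]}
g.size()       # how many rows fall into each group
# id
# 1    4
# 2    1
# dtype: int64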
If you need a new column from an aggregating function, use GroupBy.transform; it is necessary to specify the column used for processing after the groupby, here id:
ddf["title"] = ddf.groupby("id")['id'].transform(do_something)
Or assign the new column inside the function:
def do_something(x):
    x['title'] = 'x'
    return x

ddf = ddf.groupby("id").apply(do_something)
An explanation of why your original approach does not work is in the other answers.
I'm working with a pandas dataframe that has multiple groups:
date | group | brand | calculated_value
---------------------------------------
  5  |   1   |   x   |        1
  6  |   1   |   x   |       NaN
  7  |   1   |   x   |       NaN
  5  |   2   |   y   |        1
  6  |   2   |   y   |       NaN
Within each date, group, and brand, I have initialized the first instance with a calculated_value. I am iterating through these with nested for loops so that I can update and assign the next sequential date occurrence of calculated_value (within date-group-brand).
The groupby()/apply() paradigm doesn't work for me, because in e.g. the third row above, the function being passed to apply() looks above and finds NaN. It is not a sequential update.
After calculating the value, I am attempting to assign it to the cell in question, using the right syntax to avoid the SettingWithCopy problem:
df.loc[ (df.date == 5) & (df.group == 1) & (df.brand == 'x'), "calculated_value" ] = calc_value
However, this fails to set the cell, and it remains NaN. Why is that? I've tried searching many terms, but I was not able to find an answer relevant to my case.
I have confirmed that each of the for loops is incrementing properly, and that I'm addressing the correct row in each iteration.
EDIT: I discovered the problem. When I pass the cells to calculate_function as individual arguments, they each pass as a single-value series, and the function returns a single-value series, which cannot be assigned to the NaN cell. No error was thrown on the mismatched assignment, and the for loop didn't terminate.
I fixed this by passing
calculate_function(arg1.values[0], arg2.values[0], ...)
Extracting the value array and taking its first index seems inelegant and brittle, but the default is a quirky behavior compared to what I'm used to in R.
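A slightly tidier sketch of the same fix is Series.item(), which returns the lone scalar and raises if the Series does not hold exactly one value (the value below is made up):
import pandas as pd

arg1 = pd.Series([1.0])   # a single-value Series, like the arguments described above
arg1.item()               # 1.0 -- equivalent to arg1.values[0], but errors if len(arg1) != 1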
You can use groupby().idxmin() to identify the first date in each group of group, brand:
s = df.groupby(['group', 'brand']).date.idxmin()
df.loc[s,'calculated_value'] = 1
Output:
date group brand calculated_value
0 5 1 x 1.0
1 6 1 x NaN
2 7 1 x NaN
3 5 2 y 1.0
4 6 2 y NaN
I would do transform with min:
s=df.groupby(['group','brand']).date.transform('min')
df['calculated_value']=df.date.eq(s).astype(int)
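For the sample frame above, this marks the first date of each group/brand pair with 1 and everything else with 0 (integers rather than NaN), so the result should look like:
df
#    date  group brand  calculated_value
# 0     5      1     x                 1
# 1     6      1     x                 0
# 2     7      1     x                 0
# 3     5      2     y                 1
# 4     6      2     y                 0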
Objective: to look up a value from one dataframe (conditionally) and place the results in a different dataframe with a new column name.
df_1 = pd.DataFrame({'user_id': [1,2,1,4,5],
'name': ['abc','def','ghi','abc','abc'],
'rank': [6,7,8,9,10]})
df_2 = pd.DataFrame({'user_id': [1,2,3,4,5]})
df_1 # original data
df_2 # new dataframe
In this general example, I am trying to create a new column named "priority_rank" and only fill "priority_rank" based on the conditional lookup against df_1, namely the following:
user_id must match between df_1 and df_2
I am interested only in rows where df_1['name'] == 'abc'; all else should be blank
df_2 should end up looking like this:
|user_id|priority_rank|
|   1   |      6      |
|   2   |             |
|   3   |             |
|   4   |      9      |
|   5   |     10      |
One way to do this:
In []:
import numpy as np

df_2['priority_rank'] = np.where((df_1.name=='abc') & (df_1.user_id==df_2.user_id), df_1['rank'], '')
df_2
Out[]:
user_id priority_rank
0 1 6
1 2
2 3
3 4 9
4 5 10
Note: In your example df_1.name=='abc' is a sufficient condition because all values for user_id are identical when df_1.name=='abc'. I'm assuming this is not always going to be the case.
Using merge
df_2.merge(df_1.loc[df_1.name=='abc',:], how='left').drop(columns='name')
Out[932]:
user_id rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
You're looking for map:
df_2.assign(priority_rank=df_2['user_id'].map(
df_1.query("name == 'abc'").set_index('user_id')['rank']))
user_id priority_rank
0 1 6.0
1 2 NaN
2 3 NaN
3 4 9.0
4 5 10.0
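If you literally want blanks rather than NaN, as in the expected output, one option is to fill the missing values afterwards (a sketch; note this turns the column into object dtype):
ranks = df_1.query("name == 'abc'").set_index('user_id')['rank']
df_2['priority_rank'] = df_2['user_id'].map(ranks).astype(object).fillna('')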
Consider the following dataframe, which has columns with the same name (apparently this does happen; currently I have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index, only the column with the respective index would be gone, but apparently this is not the case.
>>> df.drop(df.columns[-1],1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I chose misleading values for the first column; fixed now.
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
    a
0  10
1  11
2  12
3  13
4  14
So this indexes all rows and then uses the column mask generated from duplicated(), inverting the mask using ~.
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.1) you should do any of the following:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to @DavideFiocco for alerting me.
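A usage sketch with the sample frame from the question (assigning the result back keeps only the first occurrence of each column name):
df = df.loc[:, ~df.columns.duplicated()]
df
#     a
# 0  10
# 1  11
# 2  12
# 3  13
# 4  14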