Create new column using for loop Pandas

I have the following data set:
import pandas as pd

d = {'person': [1, 1, 1, 1, 1, 1],
     'id': ['-8', '-5', '-4', '-3', '-3', '-2'],
     'obs': ['A', 'B', 'C', 'D', 'E', 'F']}
df_start = pd.DataFrame(data=d)
I need to create an output dataset like:
d_end = {'id':[-8,-8,-5,-8,-5,-4,-5,-4,-3,-3,-5,-4,-3,-3],
'obs':['A','A','B','A','B','C','B','C','D','E','B','C','D','E'],
'id_group':[-8,-5,-5,-4,-4,-4,-3,-3,-3,-3,-2,-2,-2,-2]}
df_end = pd.DataFrame(data=d_end)
I am trying to group the rows using a new column called id_group that is created by comparing id values across rows. A single id will belong to its own id group. An id will belong to another id_group if (id+4) is greater than or equal to an id on another row.
I have not been able to get very far using a for loop to do this, so I am very open to suggestions.

To do it without looping, perform one large merge and then subset afterwards:
df_start['id'] = df_start['id'].astype(int)

df_end = (df_start.assign(dummy=1)
                  .merge(df_start[['id']].assign(dummy=1),
                         on='dummy',
                         suffixes=['', '_gp'])
                  .drop(columns='dummy'))

df_end = df_end[df_end.id.between(df_end.id_gp - 4, df_end.id_gp)]
Output:
person id obs id_gp
0 1 -8 A -8
1 1 -8 A -5
2 1 -8 A -4
7 1 -5 B -5
8 1 -5 B -4
9 1 -5 B -3
10 1 -5 B -3
11 1 -5 B -2
14 1 -4 C -4
15 1 -4 C -3
16 1 -4 C -3
17 1 -4 C -2
21 1 -3 D -3
22 1 -3 D -3
23 1 -3 D -2
27 1 -3 E -3
28 1 -3 E -3
29 1 -3 E -2
35 1 -2 F -2
With looping, create subsets for all values within 4 of the id. Concatenate them all together.
df_start['id'] = df_start['id'].astype(int)

l = []
for id_gp in df_start.id.unique():
    l.append(df_start[df_start.id.between(id_gp - 4, id_gp)].assign(id_gp=id_gp))

df_end = pd.concat(l, ignore_index=True)
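Either result can then be reshaped into the layout requested in d_end; the column selection and sort order below are assumptions about the final output wanted:
df_end = (df_end.rename(columns={'id_gp': 'id_group'})
                [['id', 'obs', 'id_group']]
                .sort_values(['id_group', 'id'])
                .reset_index(drop=True))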

Related

How to sort negative and positive data in SAS

I have the below data in variable NUM:
-3 1 0 1 3 2 -2 5 -5 -6 4 6 -4
I want the data in NUM in the below sorted order:
0 -1 1 -2 2 -3 3 -4 4 -5 5 -6 6
How can we sort negative and positive values together? Please help.
data have;
input NUM ##;
cards;
-3 1 0 1 3 2 -2 5 -5 -6 4 6 -4
;
run;
Sort by abs(num), then num, if you want the negative value to appear before the positive one within the same absolute value, as in the requested output.
data have;
input NUM ##;
cards;
-3 1 0 -1 3 2 -2 5 -5 -6 4 6 -4
;
run;
proc sql;
create table want as
select * from have
order by abs(num), num
;
quit;
Make a new variable with the absolute value and include it in the sort.
data want;
set have;
absolute=abs(num);
run;
proc sort data=want;
by absolute num;
run;

To prepare a dataframe with elements being repeated from a list in python

I have a list primary = ['A', 'B', 'C', 'D']
and a DataFrame
df2 = pd.DataFrame(data=dateRange, columns=['Date'])
which contains one date column running from 01-July-2020 to 31-Dec-2020.
I created another column 'DayNum' that contains the day number for the date; for example, 01-July-2020 is a Wednesday, so 'DayNum' is 2, and so on.
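A column like that can presumably be built with the dt.dayofweek accessor (Monday = 0, ..., Sunday = 6); the pd.to_datetime call is only needed if Date is not already a datetime column:
df2['DayNum'] = pd.to_datetime(df2['Date']).dt.dayofweek   # 01-Jul-2020 is a Wednesday -> 2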
Now, using the list, I want to create another column 'Primary' so that the DataFrame looks as follows.
In short, the elements of the list should repeat. Think of it as a weekly roster showing the name of the person on duty, where Monday is the start (day 0) and Sunday is the end (day 6).
The output should be like this:
Date DayNum Primary
0 01-Jul-20 2 A
1 02-Jul-20 3 A
2 03-Jul-20 4 A
3 04-Jul-20 5 A
4 05-Jul-20 6 A
5 06-Jul-20 0 B
6 07-Jul-20 1 B
7 08-Jul-20 2 B
8 09-Jul-20 3 B
9 10-Jul-20 4 B
10 11-Jul-20 5 B
11 12-Jul-20 6 B
12 13-Jul-20 0 C
13 14-Jul-20 1 C
14 15-Jul-20 2 C
15 16-Jul-20 3 C
16 17-Jul-20 4 C
17 18-Jul-20 5 C
18 19-Jul-20 6 C
19 20-Jul-20 0 D
20 21-Jul-20 1 D
21 22-Jul-20 2 D
22 23-Jul-20 3 D
23 24-Jul-20 4 D
24 25-Jul-20 5 D
25 26-Jul-20 6 D
26 27-Jul-20 0 A
27 28-Jul-20 1 A
28 29-Jul-20 2 A
29 30-Jul-20 3 A
30 31-Jul-20 4 A
First compare the DayNum column to 0 with Series.eq and take the cumulative sum with Series.cumsum to get a group number for each week; then take the modulo of that by the number of values in the list with Series.mod; last, map the group numbers to names with Series.map, using a dictionary created by enumerate over the list:
primary = ['A','B','C','D']
d = dict(enumerate(primary))
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)
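A minimal end-to-end sketch of the same idea, assuming a daily date range like the one described in the question (the construction of df and dateRange here is an assumption, not code from the post):
import pandas as pd

primary = ['A', 'B', 'C', 'D']

dateRange = pd.date_range('2020-07-01', '2020-12-31', freq='D')
df = pd.DataFrame({'Date': dateRange})
df['DayNum'] = df['Date'].dt.dayofweek                      # Monday=0 ... Sunday=6

d = dict(enumerate(primary))                                # {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)

print(df.head(7))                                           # 01-Jul..05-Jul -> A, 06-Jul/07-Jul -> B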

cumulative product for specific groups of observations in pandas

I have a dataset of the following type
Date ID window var
0 1998-01-28 X -5 8.500e-03
1 1998-01-28 Y -5 1.518e-02
2 1998-01-29 X -4 8.005e-03
3 1998-01-29 Y -4 7.905e-03
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
My aim is to calculate the cumulative product of the variable var according to the variable window. The idea is that for every date I have identified a window of 5 days around that date (the variable window goes from -5 to 5). Now I want to calculate the cumulative product within the window that belongs to a specific date. For example, the first date (1998-01-28) has a window value of -5 and thus represents the starting point for the calculation of the cumprod. I want a new variable called cumprod that equals var on the date where window is -5, then the cumulative product of var at -5 and -4, and so on until window equals 5. This defines cumprod for the first group of dates, where each group consists of consecutive dates whose window runs from -5 to 5. I then repeat this for every group of dates. I would therefore obtain something like:
Date ID window var cumprod
0 1998-01-28 X -5 8.500e-03 8.500e-03
1 1998-01-28 Y -5 1.518e-02 1.518e-02
2 1998-01-29 X -4 8.005e-03 6.80425e-05
3 1998-01-29 Y -4 7.905e-03 0.00011999790000000002
4 1998-01-30 X -3 -5.497e-03
... ... ... ...
3339 2016-12-19 Y 3 -4.365e-04
3340 2016-12-20 X 4 3.628e-03
3341 2016-12-20 Y 4 6.608e-03
3342 2016-12-21 X 5 -2.467e-03
3343 2016-12-21 Y 5 -2.651e-03
where I gave an example of cumprod for the first two dates.
How could I achieve this? I was thinking of finding a way to attach an identifier to every group of dates and then run some sort of cumprod() method using .groupby(group_identifier), but I can't work out how to do it. Would it be possible to simplify it by using a rolling function on window? Any other kind of approach is very welcome.
I suggest the following:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({"Date": pd.date_range("1998-01-28", freq="d", periods=22),
                   "window": np.concatenate([np.arange(-5, 6, 1), np.arange(-5, 6, 1)]),
                   "var": np.random.randint(1, 10, 22)})
My df is similar to yours:
Date window var
0 1998-01-28 -5 3
1 1998-01-29 -4 3
2 1998-01-30 -3 7
3 1998-01-31 -2 2
4 1998-02-01 -1 4
5 1998-02-02 0 7
6 1998-02-03 1 2
7 1998-02-04 2 1
8 1998-02-05 3 2
9 1998-02-06 4 1
10 1998-02-07 5 1
11 1998-02-08 -5 4
12 1998-02-09 -4 5
Then I create a grouping variable and transform var using cumprod:
df = df.sort_values("Date")  # my df is already sorted by Date given the way I created it,
                             # but I add this to make sure yours is sorted by date
df["group"] = (df["window"] == -5).cumsum()
df["cumprod"] = df.groupby("group")["var"].transform("cumprod")
And the result is:
Date window var group cumprod
0 1998-01-28 -5 3 1 3
1 1998-01-29 -4 3 1 9
2 1998-01-30 -3 7 1 63
3 1998-01-31 -2 2 1 126
4 1998-02-01 -1 4 1 504
5 1998-02-02 0 7 1 3528
6 1998-02-03 1 2 1 7056
7 1998-02-04 2 1 1 7056
8 1998-02-05 3 2 1 14112
9 1998-02-06 4 1 1 14112
10 1998-02-07 5 1 1 14112
11 1998-02-08 -5 4 2 4
12 1998-02-09 -4 5 2 20
13 1998-02-10 -3 1 2 20
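The frame in the question also has an ID column with X and Y interleaved on each date, so the rows presumably need to be made contiguous per ID before building the group counter. A hedged adaptation, assuming columns named Date, ID, window and var exactly as shown in the question:
# sort so each ID's dates are contiguous, then start a new group at every window == -5
df = df.sort_values(["ID", "Date"])
df["group"] = (df["window"] == -5).cumsum()
df["cumprod"] = df.groupby(["ID", "group"])["var"].transform("cumprod")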

SAS sum observations not in a group, by group

I have a data set:
data have;
input group $ value;
datalines;
A 4
A 3
A 2
A 1
B 1
C 1
D 2
D 1
E 1
F 1
G 2
G 1
H 1
;
run;
The first variable is a group identifier, the second a value.
For each group, I want a new variable "sum" with the sum of all values in the column, except for the group the observation is in.
My issue is having to do that on nearly 30 million observations, so efficiency matters.
I found that using a data step was more efficient than using procs.
The final database should look like:
data want;
input group $ value $ sum;
datalines;
A 4 11
A 3 11
A 2 11
A 1 11
B 1 20
C 1 20
D 2 18
D 1 18
E 1 20
F 1 20
G 2 18
G 1 18
H 1 20
;
run;
Any idea how to perform this please?
Edit: I don't know if this matters, but the example I gave is a simplified version of my issue. In the real case I have two other grouping variables, so taking the sum of the whole column and subtracting the sum within the group is not a viable solution.
The requirement
sum of all values in the column, except for the group the observation is in
indicates two passes over the data must occur:
1.) Compute allsum and each group's group_sum. A hash can store each group's sum, computed via a specified suminc: variable and .ref() method invocations, while a data step variable accumulates allsum.
2.) Compute allsum - group_sum for each row of a group. The group_sum is retrieved from the hash and subtracted from allsum.
Example:
data want;
if 0 then set have; * prep pdv;
declare hash sums (suminc:'value');
sums.defineKey('group');
sums.defineDone();
do while (not hash_loaded);
set have end=hash_loaded;
sums.ref(); * adds value to internal sum of hash data record;
allsum + value;
end;
do while (not last_have);
set have end=last_have;
sums.sum(sum:sum); * retrieve groups sum. Do you hear the Dragnet theme too?;
sum = allsum - sum; * subtract from allsum;
output;
end;
stop;
run;
What is wrong with a straightforward approach? You need to make two passes no matter what you do.
Like this. I included extra variables so you can see how the values are derived.
proc sql ;
create table want as
select a.*,b.grand,sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results:
Obs group value grand total sum
1 A 3 21 10 11
2 A 1 21 10 11
3 A 2 21 10 11
4 A 4 21 10 11
5 B 1 21 1 20
6 C 1 21 1 20
7 D 2 21 3 18
8 D 1 21 3 18
9 E 1 21 1 20
10 F 1 21 1 20
11 G 1 21 3 18
12 G 2 21 3 18
13 H 1 21 1 20
Note it does not matter what you have as your GROUP BY clause.
Do you really need to output all of the original observations? Why not just output the summary table?
proc sql ;
create table want as
select a.group, sum(value) as total, b.grand - sum(value) as sum
from have a
, (select sum(value) as grand from have) b
group by a.group
;
quit;
Results
Obs group total sum
1 A 10 11
2 B 1 20
3 C 1 20
4 D 3 18
5 E 1 20
6 F 1 20
7 G 3 18
8 H 1 20
I would break this out into two different segments:
1.) You could start by using PROC SQL to get the sums by the group
2.) Then use some IF/THEN statements to reassign the values by group

Pandas : Get a column value where another column is the minimum in a sub-grouping [duplicate]

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:
df1 = df.groupby("item", as_index=False)["diff"].min()
However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?
My data looks like:
item diff otherstuff
0 1 2 1
1 1 1 2
2 1 3 7
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
and should end up like:
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
but what I'm getting is:
item diff
0 1 1
1 2 -6
2 3 0
I've been looking through the documentation and can't find anything. I tried:
df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()
df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]
df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()
But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).
Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:
>>> df.loc[df.groupby("item")["diff"].idxmin()]
item diff otherstuff
1 1 1 2
6 2 -6 2
7 3 0 0
[3 rows x 3 columns]
Method #2: sort by diff, and then take the first element in each item group:
>>> df.sort_values("diff").groupby("item", as_index=False).first()
item diff otherstuff
0 1 1 2
1 2 -6 2
2 3 0 0
[3 rows x 3 columns]
Note that the resulting indices are different even though the row content is the same.
You can use DataFrame.sort_values with DataFrame.drop_duplicates:
df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
item diff otherstuff
6 2 -6 2
7 3 0 0
1 1 1 2
If there can be multiple minimal values per group and you want all of the min rows, use boolean indexing with transform to get the minimal value per group:
print (df)
item diff otherstuff
0 1 2 1
1 1 1 2 <-multiple min
2 1 1 7 <-multiple min
3 2 -1 0
4 2 1 3
5 2 4 9
6 2 -6 2
7 3 0 0
8 3 2 9
print (df.groupby("item")["diff"].transform('min'))
0 1
1 1
2 1
3 -6
4 -6
5 -6
6 -6
7 0
8 0
Name: diff, dtype: int64
df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
item diff otherstuff
1 1 1 2
2 1 1 7
6 2 -6 2
7 3 0 0
The above answer works great if there is (or you want) only one min. In my case there could be multiple mins and I wanted all rows equal to the min, which .idxmin() doesn't give you. This worked:
def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6,
                   'v1': (list(range(3)) + list(range(3))) * 2,
                   'v2': range(12)})
df.groupby('g', group_keys=False).apply(lambda x: filter_group(x, 'v1'))
As an aside, .filter() is also relevant to this question but didn't work for me.
I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.
df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)
For a little more explanation:
Sort items by the minimum value you want
Drop the duplicates of the column you want to sort with
Resort the data because the data is still sorted by the minimum values
If you know that all of your "items" have more than one record, you can sort and then use duplicated:
df.sort_values(by='diff').duplicated(subset='item', keep='first')
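That expression only returns the boolean mask, so presumably it still has to be inverted and used to index the sorted frame, for example:
df_sorted = df.sort_values(by='diff')
mask = df_sorted.duplicated(subset='item', keep='first')   # True for every row after the first within each item
result = df_sorted[~mask]                                  # keep the first (minimum diff) row per item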