Counting groups in columns in dataframe - pandas

I have a dataframe df:
prds
0 E01
1 E02
2 E03
3 E04
4 E01
5 E02
6 E03
7 E04
8 F01
9 F02
10 F03
11 F04
12 F05
I would like to have a running count within each group in the column 'prds', written to another column 'match', hence:
prds match
0 E01 1
1 E02 2
2 E03 3
3 E04 4
4 E01 1
5 E02 2
6 E03 3
7 E04 4
8 F01 1
9 F02 2
10 F03 3
11 F04 4
12 F05 5
Any help would be greatly appreciated. Thank you in advance.
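For reference, the example frame can be rebuilt with this minimal sketch (assuming 'prds' is the only column):
import pandas as pd

# Rebuild the example dataframe from the question
df = pd.DataFrame({'prds': ['E01', 'E02', 'E03', 'E04',
                            'E01', 'E02', 'E03', 'E04',
                            'F01', 'F02', 'F03', 'F04', 'F05']})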

If each group starts with a value ending in 1, you can use Series.str.endswith with Series.cumsum to label the groups, then pass the result to GroupBy.cumcount:
df['match'] = df.groupby(df['prds'].str.endswith('1').cumsum()).cumcount() + 1
print (df)
prds match
0 E01 1
1 E02 2
2 E03 3
3 E04 4
4 E01 1
5 E02 2
6 E03 3
7 E04 4
8 F01 1
9 F02 2
10 F03 3
11 F04 4
12 F05 5

You can simply extract digits:
df['match'] = df['prds'].str.extract(r'(\d+)').astype('int')
Output:
prds match
0 E01 1
1 E02 2
2 E03 3
3 E04 4
4 E01 1
5 E02 2
6 E03 3
7 E04 4
8 F01 1
9 F02 2
10 F03 3
11 F04 4
12 F05 5

Related

How can I add rows iteratively to a select result set in pl sql?

In the work_order table there is wo_no. When I query the work_order table I want 2 additional columns (task_no, task_step_no) in the result set, as follows:
This should iterate over all the wo_no values in the work_order table. task_no should go up to 5 and task_step_no should go up to 2000. (Please have a look at the attached image to see the result set if this is not clear.)
Any idea how to get such a result set in PL/SQL?
One option is to use 2 row generators cross joined to your current table.
with
  work_order (wo_no) as
    (select 1 from dual union all
     select 2 from dual),
  task (task_no) as
    (select level from dual connect by level <= 5),
  step (task_step_no) as
    (select level from dual connect by level <= 20)  -- you'd have 2000 here
select y.wo_no, t.task_no, s.task_step_no
from work_order y cross join task t cross join step s
order by 1, 2, 3;
Result:
WO_NO TASK_NO TASK_STEP_NO
---------- ---------- ------------
1 1 1
1 1 2
1 1 3
1 1 4
1 1 5
1 1 6
1 1 7
1 1 8
1 1 9
1 1 10
1 1 11
1 1 12
1 1 13
1 1 14
1 1 15
1 1 16
1 1 17
1 1 18
1 1 19
1 1 20
1 2 1
1 2 2
1 2 3
1 2 4
1 2 5
1 2 6
1 2 7
1 2 8
1 2 9
1 2 10
1 2 11
1 2 12
1 2 13
1 2 14
1 2 15
1 2 16
1 2 17
1 2 18
1 2 19
1 2 20
1 3 1
1 3 2
1 3 3
1 3 4
1 3 5
1 3 6
1 3 7
1 3 8
1 3 9
1 3 10
1 3 11
1 3 12
1 3 13
1 3 14
1 3 15
1 3 16
1 3 17
1 3 18
1 3 19
1 3 20
1 4 1
1 4 2
1 4 3
1 4 4
1 4 5
1 4 6
1 4 7
1 4 8
1 4 9
1 4 10
1 4 11
1 4 12
1 4 13
1 4 14
1 4 15
1 4 16
1 4 17
1 4 18
1 4 19
1 4 20
1 5 1
1 5 2
1 5 3
1 5 4
1 5 5
1 5 6
1 5 7
1 5 8
1 5 9
1 5 10
1 5 11
1 5 12
1 5 13
1 5 14
1 5 15
1 5 16
1 5 17
1 5 18
1 5 19
1 5 20
2 1 1
2 1 2
2 1 3
2 1 4
2 1 5
2 1 6
2 1 7
2 1 8
2 1 9
2 1 10
2 1 11
2 1 12
2 1 13
2 1 14
2 1 15
2 1 16
2 1 17
2 1 18
2 1 19
2 1 20
2 2 1
2 2 2
2 2 3
2 2 4
2 2 5
2 2 6
2 2 7
2 2 8
2 2 9
2 2 10
2 2 11
2 2 12
2 2 13
2 2 14
2 2 15
2 2 16
2 2 17
2 2 18
2 2 19
2 2 20
2 3 1
2 3 2
2 3 3
2 3 4
2 3 5
2 3 6
2 3 7
2 3 8
2 3 9
2 3 10
2 3 11
2 3 12
2 3 13
2 3 14
2 3 15
2 3 16
2 3 17
2 3 18
2 3 19
2 3 20
2 4 1
2 4 2
2 4 3
2 4 4
2 4 5
2 4 6
2 4 7
2 4 8
2 4 9
2 4 10
2 4 11
2 4 12
2 4 13
2 4 14
2 4 15
2 4 16
2 4 17
2 4 18
2 4 19
2 4 20
2 5 1
2 5 2
2 5 3
2 5 4
2 5 5
2 5 6
2 5 7
2 5 8
2 5 9
2 5 10
2 5 11
2 5 12
2 5 13
2 5 14
2 5 15
2 5 16
2 5 17
2 5 18
2 5 19
2 5 20
200 rows selected.
As you already have the work_order table, you'd just use it in the FROM clause (not as a CTE):
with
task (task_no) as
(select level from dual connect by level <= 5),
step (task_step_no) as
(select level from dual connect by level <= 20)
select y.wo_no, t.task_no, s.task_step_no
from work_order y cross join task t cross join step s
order by 1, 2, 3;

pandas join with tables that have the same columns, based on a single key

I have two dataframes:
df1 = K C1 C2 C3 ... Cn D1 D2 D3
k1 1 2 4 7 1 2 3
k2 3 5 6 1 2 3 4
df2 = K C1 C2 C3 ... Cn B1 P1
k1 1 2 4 7 0 0
k1 1 2 4 7 0 1
k1 1 2 4 7 1 0
k1 1 2 4 7 1 1
k2 3 5 6 1 0 0
k2 3 5 6 1 0 1
k2 3 5 6 1 1 0
k2 3 5 6 1 1 1
I want to join in order to get:
df_merged =
K C1 C2 C3 ... Cn B1 P1 D1 D2 D3
k1 1 2 4 7 0 0 1 2 3
k1 1 2 4 7 0 1 1 2 3
k1 1 2 4 7 1 0 1 2 3
k1 1 2 4 7 1 1 1 2 3
k2 3 5 6 1 0 0 2 3 4
k2 3 5 6 1 0 1 2 3 4
k2 3 5 6 1 1 0 2 3 4
k2 3 5 6 1 1 1 2 3 4
I don't want to do a left join on all the columns [K ... Cn] because it will be very heavy.
What is the best way to do so?
Just a thought: if we can find sub-arrays in the matrix, it could shrink the number of combinations.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array(np.meshgrid(['k1 1 2 4 7','k2 3 5 6 1'],[0,1],[0,1],['1 2 3', '2 3 4'])).T.reshape(-1,4), columns=['KC','B','P','D'])
df
###
front = pd.DataFrame(df['KC'].str.split(' ').values.tolist()).add_prefix('C_')
front
###
C_0 C_1 C_2 C_3 C_4
0 k1 1 2 4 7
1 k1 1 2 4 7
2 k2 3 5 6 1
3 k2 3 5 6 1
4 k1 1 2 4 7
5 k1 1 2 4 7
6 k2 3 5 6 1
7 k2 3 5 6 1
8 k1 1 2 4 7
9 k1 1 2 4 7
10 k2 3 5 6 1
11 k2 3 5 6 1
12 k1 1 2 4 7
13 k1 1 2 4 7
14 k2 3 5 6 1
15 k2 3 5 6 1
rear = pd.DataFrame(df['D'].str.split(' ').values.tolist()).add_prefix('D_')
rear
###
D_0 D_1 D_2
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
4 1 2 3
5 1 2 3
6 1 2 3
7 1 2 3
8 2 3 4
9 2 3 4
10 2 3 4
11 2 3 4
12 2 3 4
13 2 3 4
14 2 3 4
15 2 3 4
# output = front.join(df[['B','P']]).join(rear)
output = pd.concat([front,df[['B','P']],rear],axis=1)
output.rename(columns={'C_0':'K'},inplace=True)
output
###
K C_1 C_2 C_3 C_4 B P D_0 D_1 D_2
0 k1 1 2 4 7 0 0 1 2 3
1 k1 1 2 4 7 1 0 1 2 3
2 k2 3 5 6 1 0 0 1 2 3
3 k2 3 5 6 1 1 0 1 2 3
4 k1 1 2 4 7 0 1 1 2 3
5 k1 1 2 4 7 1 1 1 2 3
6 k2 3 5 6 1 0 1 1 2 3
7 k2 3 5 6 1 1 1 1 2 3
8 k1 1 2 4 7 0 0 2 3 4
9 k1 1 2 4 7 1 0 2 3 4
10 k2 3 5 6 1 0 0 2 3 4
11 k2 3 5 6 1 1 0 2 3 4
12 k1 1 2 4 7 0 1 2 3 4
13 k1 1 2 4 7 1 1 2 3 4
14 k2 3 5 6 1 0 1 2 3 4
15 k2 3 5 6 1 1 1 2 3 4
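A more direct alternative (a sketch, assuming df1 and df2 are the frames shown in the question and that the extra df1 columns are D1, D2, D3): merge on the key K only and bring over just the D columns, so the duplicated C columns never enter the join:
import pandas as pd

# Sketch: join only on the key K and pull the D columns from df1
df_merged = df2.merge(df1[['K', 'D1', 'D2', 'D3']], on='K', how='left')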

Remove duplicates from left join

Table A:
MSISDN A B C
990 1 2 3
992 1 2 3
995 1 2 3
993 1 2 3
991 1 2 3
994 1 2 3
Table B:
990 2 2 3
992 2 2 4
993 1 2 3
994 1 2 3
995 1 6 3
990 1 2 3
991 2 2 3
992 2 2 3
995 1 2 3
select msis1.msisdn,msis1.a,msis2.c from msis1 left join msis2 on msis1.msisdn=msis2.msisdn;
MSISDN A C
990 1 3
992 1 4
993 1 3
994 1 3
995 1 3
990 1 3
991 1 3
992 1 3
995 1 3
I want to modify the above query so that it does not return duplicate records.
Try SELECT DISTINCT; it returns only the unique rows. Applied to the query above:
select distinct msis1.msisdn, msis1.a, msis2.c from msis1 left join msis2 on msis1.msisdn = msis2.msisdn;

How to add aggregated rows based on other rows in Pandas dataframe

I have a dataframe similar to this:
index a b c d
0 1 1 1 3
1 1 1 2 1
2 1 2 1 4
3 1 2 2 1
4 2 2 1 5
5 2 2 2 9
6 2 2 1 2
7 2 3 2 6
I want to add new rows where c is 0 and d is replaced with the maximum value of d among the existing rows that have the same a and b:
index a b c d
8 1 1 0 3
9 1 2 0 4
10 2 2 0 9
11 2 3 0 6
What command can I use? Thanks!
It seems you can use sort_values chained with drop_duplicates, then append:
df.append(df.sort_values('d').drop_duplicates(['a','b'],keep='last').assign(c=0))
Out[77]:
a b c d
index
0 1 1 1 3
1 1 1 2 1
2 1 2 1 4
3 1 2 2 1
4 2 2 1 5
5 2 2 2 9
6 2 2 1 2
7 2 3 2 6
0 1 1 0 3
2 1 2 0 4
7 2 3 0 6
5 2 2 0 9
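Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same idea would be written with pd.concat (a sketch, using the same df):
import pandas as pd

# Equivalent of the append-based line above for pandas >= 2.0
out = pd.concat([df, df.sort_values('d').drop_duplicates(['a', 'b'], keep='last').assign(c=0)])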
I came up with a solution using groupby and pd.concat as follows:
pd.concat([df, df.groupby(['a', 'b'])['d'].max().reset_index().assign(c=0)], ignore_index=True)
Out[1668]:
a b c d
0 1 1 1 3
1 1 1 2 1
2 1 2 1 4
3 1 2 2 1
4 2 2 1 5
5 2 2 2 9
6 2 2 1 2
7 2 3 2 6
8 1 1 0 3
9 1 2 0 4
10 2 2 0 9
11 2 3 0 6

Pandas df row count

Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
I have a string column 'Date' and I would like to create the 'Ct' column as represented above to maintain a count of the rows for each date. Date needs to be a string in my application, there will not always be an equal number of rows for each date, and 'Ct' should always count in the order of the numerical index. An answer or a nudge in the right direction would be greatly appreciated.
OK, this is a little weird, but you can add a temp column and set its value to 1:
df['temp'] = 1
You can then perform a groupby on 'Date' and call transform on the 'temp' column to compute the running count:
In [80]:
df['Ct'] = df.groupby('Date')['temp'].transform(pd.Series.cumsum)
df
Out[80]:
Date temp Ct
0 2015-04-01 1 1
1 2015-04-01 1 2
2 2015-04-01 1 3
3 2015-04-01 1 4
4 2015-04-02 1 1
5 2015-04-02 1 2
6 2015-04-02 1 3
7 2015-04-02 1 4
8 2015-04-03 1 1
9 2015-04-03 1 2
10 2015-04-03 1 3
11 2015-04-03 1 4
12 2015-04-04 1 1
13 2015-04-04 1 2
14 2015-04-04 1 3
15 2015-04-04 1 4
In [81]:
df.drop('temp',axis=1,inplace=True)
df
Out[81]:
Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4
The reason we can't just perform the cumsum on the 'Date' column itself is that, for a string column, cumsum concatenates the date strings with each other, which is not what you want.
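For illustration only (not part of the original answer), a minimal sketch of what cumsum does to an object-dtype string column:
import pandas as pd

# cumsum on strings concatenates them rather than counting rows
s = pd.Series(['2015-04-01', '2015-04-01'])
print(s.cumsum())
# 0              2015-04-01
# 1    2015-04-012015-04-01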
EDIT
Thanks to the master @Jeff for pointing out that the temp column is unnecessary and you can just use cumcount:
In [150]:
df['Ct'] = df.groupby('Date').cumcount() + 1
df
Out[150]:
Date Ct
0 2015-04-01 1
1 2015-04-01 2
2 2015-04-01 3
3 2015-04-01 4
4 2015-04-02 1
5 2015-04-02 2
6 2015-04-02 3
7 2015-04-02 4
8 2015-04-03 1
9 2015-04-03 2
10 2015-04-03 3
11 2015-04-03 4
12 2015-04-04 1
13 2015-04-04 2
14 2015-04-04 3
15 2015-04-04 4