To prepare a DataFrame with elements repeated from a list in Python - pandas

I have a list primary = ['A', 'B', 'C', 'D']
and a DataFrame
df2 = pd.DataFrame(data=dateRange, columns=['Date'])
which contains a single date column running from 01-July-2020 to 31-Dec-2020.
I created another column 'DayNum' that holds the day-of-week number for each date: for example, 01-July-2020 is a Wednesday, so its 'DayNum' is 2, and so on.
Now, using the list, I want to create another column 'Primary'.
In short, the elements of the list should repeat. Think of it as a roster that shows, week by week, who is on duty, where Monday is the start (day 0) and Sunday is the end (day 6).
The output should be like this:
Date DayNum Primary
0 01-Jul-20 2 A
1 02-Jul-20 3 A
2 03-Jul-20 4 A
3 04-Jul-20 5 A
4 05-Jul-20 6 A
5 06-Jul-20 0 B
6 07-Jul-20 1 B
7 08-Jul-20 2 B
8 09-Jul-20 3 B
9 10-Jul-20 4 B
10 11-Jul-20 5 B
11 12-Jul-20 6 B
12 13-Jul-20 0 C
13 14-Jul-20 1 C
14 15-Jul-20 2 C
15 16-Jul-20 3 C
16 17-Jul-20 4 C
17 18-Jul-20 5 C
18 19-Jul-20 6 C
19 20-Jul-20 0 D
20 21-Jul-20 1 D
21 22-Jul-20 2 D
22 23-Jul-20 3 D
23 24-Jul-20 4 D
24 25-Jul-20 5 D
25 26-Jul-20 6 D
26 27-Jul-20 0 A
27 28-Jul-20 1 A
28 29-Jul-20 2 A
29 30-Jul-20 3 A
30 31-Jul-20 4 A

First compare the 'DayNum' column to 0 with Series.eq and take the cumulative sum with Series.cumsum to get a group number for each week, then take the modulo by the number of values in the list with Series.mod, and finally map the result with Series.map using a dictionary created from enumerate of the list:
primary = ['A','B','C','D']
d = dict(enumerate(primary))
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)
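
For context, a minimal end-to-end sketch (building dateRange and DayNum this way is an assumption; the frame is called df here to match the answer's code):

import pandas as pd

# Build the date range and the DayNum column described in the question
dateRange = pd.date_range('2020-07-01', '2020-12-31', freq='D')
df = pd.DataFrame(data=dateRange, columns=['Date'])
df['DayNum'] = df['Date'].dt.dayofweek          # Monday=0 ... Sunday=6

# Every Monday increments the cumulative count, so each week gets its own
# group number; mod wraps it back to 0 after 'D', which makes the roster repeat.
primary = ['A', 'B', 'C', 'D']
d = dict(enumerate(primary))                    # {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)

print(df.head(10))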

Related

the 'combine' of a split-apply-combine in pd.groupby() works brilliantly, but I'm not sure why

I have a fragment of code similar to the one below. It works perfectly, but I'm not sure why I am so lucky.
groupby() is a split-apply-combine operation, so I understand why qf.groupby(qf.g).mean() returns a result with two rows, the mean() for each of a and b.
What's brilliant is that the combine step of qf.groupby(qf.g).cumsum() reassembles all the rows in their original order, as found in the starting qf.
My question is: why am I able to count on this behavior? I'm glad I can, but I cannot articulate why it's possible.
# split-apply-combine
import pandas as pd

# DataFrame with a value and an arbitrary category
qf = pd.DataFrame(data=[x for x in "aaabbaaaab"], columns=['g'])
qf['val'] = [1, 2, 3, 1, 2, 3, 4, 5, 6, 9]

print("applying mean() to members in each group of a,b")
print(qf.groupby(qf.g).mean())

print("\n\napplying cumsum() to members in each group of a,b")
print(qf.groupby(qf.g).cumsum())  # this combines them in the original index order, thankfully
qf['running_totals'] = qf.groupby(qf.g).cumsum()
print(f"\n{qf}")
yields:
applying mean() to members in each group of a,b
val
g
a 3.428571
b 4.000000
applying cumsum() to members in each group of a,b
val
0 1
1 3
2 6
3 1
4 3
5 9
6 13
7 18
8 24
9 12
g val running_totals
0 a 1 1
1 a 2 3
2 a 3 6
3 b 1 1
4 b 2 3
5 a 3 9
6 a 4 13
7 a 5 18
8 a 6 24
9 b 9 12
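
A small check that illustrates why this order is reliable (using the column name 'g', which is equivalent to qf.g here): cumsum belongs to the family of groupby transformations, which return a result indexed exactly like the original frame, and column assignment in pandas aligns on that index, so the original row order is preserved by construction rather than by luck.

import pandas as pd

qf = pd.DataFrame({'g': list("aaabbaaaab"),
                   'val': [1, 2, 3, 1, 2, 3, 4, 5, 6, 9]})

# cumsum is a *transformation*: the result carries the original row labels,
# so assigning it back to qf aligns row by row, not by group order.
out = qf.groupby('g')['val'].cumsum()

print(out.index.equals(qf.index))                               # True
print(out.equals(qf.groupby('g')['val'].transform('cumsum')))   # True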

Print Pandas Unique Rows by Column Condition

I am trying to print the rows where a data condition is met in a pandas DataFrame, based on the unique values in the DataFrame. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need to print the rows where the maximum in the 'temp' column occurs for each site, like this:
A 22
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False).max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
df.groupby('site').max()
output:
temp month day
site
A 22 9 18
B 9 5 23
Let us do sort_values + drop_duplicates
df = df.sort_values('temp',ascending=False).drop_duplicates('site')
Out[190]:
site temp month day
2 A 22 9 3
3 B 9 4 23
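
If the goal is the complete row where each site's maximum temperature occurs (month and day included), another common pattern, sketched here for reference, is to select by the index label of the per-group maximum:

import pandas as pd

df = pd.DataFrame({'site':  ['A', 'A', 'A', 'B', 'B', 'B'],
                   'temp':  [15, 11, 22, 9, 3, -1],
                   'month': [7, 6, 9, 4, 2, 5],
                   'day':   [18, 12, 3, 23, 11, 18]})

# idxmax gives the row label of the maximum 'temp' within each site;
# .loc then pulls those full rows (ties keep the first occurrence).
print(df.loc[df.groupby('site')['temp'].idxmax()])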

Pick row with key GROUP_FILENAME and add a new column with column name

I have a data frame which looks like this
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_NAME:GROUP_OFFSET
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:GROUP_LENGTH
GROUP_FIELD_VALUE:0
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000018.pdf
GROUP_FIELD_NAME:BKR_ID
GROUP_FIELD_VALUE:T80
GROUP_FIELD_VALUE:P
GROUP_FIELD_NAME:FI_ID
GROUP_FIELD_VALUE:
GROUP_FIELD_NAME:RUN_DTE
GROUP_FIELD_VALUE:20220208
GROUP_FIELD_NAME:FIRM_ID
GROUP_FIELD_VALUE:KIZEM
GROUP_FILENAME:000000019.pdf
It has three keys: group field name, group field value and group file name.
I am expecting a DataFrame with three columns: GROUP_FIELD_NAME, GROUP_FIELD_VALUE and GROUP_FILENAME.
You can use:
(df['col'].str.extract('GROUP_FILENAME:(.*)|([^:]+):(.*)')
   .set_axis(['GROUP_FILENAME', 'var', 'val'], axis=1)
   .assign(GROUP_FILENAME=lambda d: d['GROUP_FILENAME'].bfill(),
           n=lambda d: d.groupby(['GROUP_FILENAME', 'var']).cumcount())
   .dropna(subset=['var'])
   .pivot(index=['GROUP_FILENAME', 'n'], columns='var', values='val')
   .droplevel(1).rename_axis(columns=None)
   .reset_index('GROUP_FILENAME')
)
Output:
GROUP_FILENAME GROUP_FIELD_NAME GROUP_FIELD_VALUE
0 000000018.pdf BKR_ID T80
1 000000018.pdf GROUP_OFFSET 0
2 000000018.pdf GROUP_LENGTH 0
3 000000018.pdf FIRM_ID KIZEM
4 000000019.pdf BKR_ID T80
5 000000019.pdf FI_ID P
6 000000019.pdf RUN_DTE
7 000000019.pdf FIRM_ID 20220208
8 000000019.pdf NaN KIZEM
Used input:
col
0 GROUP_FIELD_NAME:BKR_ID
1 GROUP_FIELD_VALUE:T80
2 GROUP_FIELD_NAME:GROUP_OFFSET
3 GROUP_FIELD_VALUE:0
4 GROUP_FIELD_NAME:GROUP_LENGTH
5 GROUP_FIELD_VALUE:0
6 GROUP_FIELD_NAME:FIRM_ID
7 GROUP_FIELD_VALUE:KIZEM
8 GROUP_FILENAME:000000018.pdf
9 GROUP_FIELD_NAME:BKR_ID
10 GROUP_FIELD_VALUE:T80
11 GROUP_FIELD_VALUE:P
12 GROUP_FIELD_NAME:FI_ID
13 GROUP_FIELD_VALUE:
14 GROUP_FIELD_NAME:RUN_DTE
15 GROUP_FIELD_VALUE:20220208
16 GROUP_FIELD_NAME:FIRM_ID
17 GROUP_FIELD_VALUE:KIZEM
18 GROUP_FILENAME:000000019.pdf
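
For reference, a minimal sketch of rebuilding that single-column frame before running the chain above (only the first file's lines are listed, to keep it short):

import pandas as pd

# Recreate the 'col' column from the raw "KEY:value" lines (first file only)
lines = [
    "GROUP_FIELD_NAME:BKR_ID",
    "GROUP_FIELD_VALUE:T80",
    "GROUP_FIELD_NAME:GROUP_OFFSET",
    "GROUP_FIELD_VALUE:0",
    "GROUP_FIELD_NAME:GROUP_LENGTH",
    "GROUP_FIELD_VALUE:0",
    "GROUP_FIELD_NAME:FIRM_ID",
    "GROUP_FIELD_VALUE:KIZEM",
    "GROUP_FILENAME:000000018.pdf",
]
df = pd.DataFrame({"col": lines})

Running the extract/pivot chain on this frame should reproduce the four 000000018.pdf rows of the output above.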

How to write an SQL query to generate a group number for each grouped record

Following is the scenario:
I have data in the following format:
entryid , ac_no, db/cr, amt
-----------------------------------------------
1 10 D 5
1 11 C 5
2 01 D 8
2 11 C 8
3 12 D 10
3 13 C 10
4 14 D 5
4 16 C 5
5 14 D 2
5 17 C 2
6 14 D 3
6 18 C 3
So far I have achieved the first three aggregated columns with this query:
select wm_concat(entryid),ac_no,db_cr,Sum(amt) from t1 group by ac_no,db_cr
I want the data in the following format:
wm_Concat(entryid),ac_no, db/cr, Sum(amt), set_id
------------------------------------------------
1 10 D 5 S1
2 01 D 8 S1
1,2 11 C 13 S1
3 12 D 10 S2
3 13 C 10 S2
4,5,6 14 D 10 S3
4 16 C 5 S3
5 17 C 2 S3
6 18 C 3 S3
I want an additional column `set_id` that shows either S1, S2, ... or any number 1, 2, ... so that the sets of debit and credit entries can be identified.
I am making sets of debit and credit entries based on their Ac_no values.
Any little help will be highly appreciated. Thanks
Create a new column, say set, and give each set a unique identifier. So, for example, the first three records will have set id S1, the next two will have S2, and so on.
To distinguish a single transaction from a set you can use the db/cr column along with the newly added set column. You can tell that the 3rd row closes a set since its transaction type is 'C' whereas the preceding transactions are of type 'D'.
Here I have assumed that your transactions are debit only; if not, please provide more details in the question. Hope this helps.

Using temporary extended table to make a sum

From a given table I want to be able to sum values having the same number (should be easy, right?)
Problem: a given value can be assigned to anywhere from 2 to n consecutive numbers.
For some reason, this information is stored in a single row describing the value, the starting number and the ending number, as below.
TABLE A
 id | starting_number | ending_number | value
----+-----------------+---------------+-------
  1 |               2 |             5 |     8
  2 |               0 |             3 |     5
  3 |               4 |             6 |     6
  4 |               7 |             8 |    10
For instance the first row means:
value '8' is assigned to numbers: 2, 3 and 4 (5 is excluded)
So, I would like the following intermediary result table:
TABLE B
 id | number | value
----+--------+-------
  1 |      2 |     8
  1 |      3 |     8
  1 |      4 |     8
  2 |      0 |     5
  2 |      1 |     5
  2 |      2 |     5
  3 |      4 |     6
  3 |      5 |     6
  4 |      7 |    10
So I can sum 'value' for elements having identical 'number'
SELECT number, sum(value)
FROM B
GROUP BY number
TABLE C
 number | sum(value)
--------+------------
      2 |         13
      3 |          8
      4 |         14
      0 |          5
      1 |          5
      5 |          6
      7 |         10
I don't know how to do this and didn't find an answer on the web (maybe I'm not searching with the appropriate keywords...).
Any idea?
You can do what you want with generate_series(). So, TableB is basically:
select id, generate_series(starting_number, ending_number - 1, 1) as n, value
from tableA;
Your aggregation is then:
select n, sum(value)
from (select id, generate_series(starting_number, ending_number - 1, 1) as n, value
      from tableA
     ) a
group by n;