How to keep categorical columns when doing groupby.median()? - pandas

I have credit loan data; the original df has many loan ids that can belong to one customer, so I need to group by client id in order to build a client profile.
the original df:

contract_id  product_id  client_id  bal   age  gender  pledge_amount  branche_region
RZ13/25      000345      98023432   2300  32   M       4500           west
clients = df.groupby(by=['client_id']).median().reset_index()
This line completely removes important categorical columns like gender and branche_region! It groups by client_id and calculates the median for numeric columns only; all the categorical columns are gone.
I wonder how to group by unique customers but also keep the categoricals.

They are removed because pandas drops nuisance (non-numeric) columns.
To avoid that, aggregate each column explicitly: here numeric columns are aggregated with the median, and for non-numeric columns the first value is returned:
import numpy as np

# median for numeric columns, first value otherwise
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.iat[0]
# another idea: join the non-numeric values instead
# f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else ','.join(x)
clients = df.groupby(by=['client_id']).agg(f)
If the values of the other non-numeric columns are the same within each group, you can instead add them to the list passed to the by parameter:
clients = df.groupby(by=['client_id', 'gender', 'branche_region']).median()
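To make this concrete, here is a minimal runnable sketch of both approaches on a made-up frame with the question's columns. Note that newer pandas versions (2.0+) raise a TypeError on non-numeric columns instead of silently dropping them, which makes the explicit per-column aggregation all the more necessary:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'client_id': [98023432, 98023432, 55511222],
    'bal': [2300, 2700, 1800],
    'age': [32, 32, 45],
    'gender': ['M', 'M', 'F'],
    'branche_region': ['west', 'west', 'east'],
})

# median for numeric columns, first value for categoricals
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.iat[0]
clients = df.groupby('client_id').agg(f).reset_index()

# equivalent here, because gender and branche_region are constant per client
clients2 = df.groupby(['client_id', 'gender', 'branche_region']).median().reset_index()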

Related

How do I do a sum per id?

SELECT distinct
A.PROPOLN, C.LIFCLNTNO, A.PROSASORG, sum (A.PROSASORG) as sum
FROM [FPRODUCTPF] A
join [FNBREQCPF] B on (B.IQCPLN=A.PROPOLN)
join [FLIFERATPF] C on (C.LIFPOLN=A.PROPOLN and C.LIFPRDCNT=A.PROPRDCNT and C.LIFBNFCNT=A.PROBNFCNT)
where C.LIFCLNTNO='2012042830507' and A.PROSASORG>0 and A.PROPRDSTS='10' and
A.PRORECSTS='1' and A.PROBNFLVL='M' and B.IQCODE='B10000' and B.IQAPDAT>20180101
group by C.LIFCLNTNO, A.PROPOLN, A.PROSASORG
This does not sum correctly; it returns two rows instead of one:
   PROPOLN    LIFCLNTNO      PROSASORG  sum
1  209814572  2012042830507  3881236    147486968
2  209814572  2012042830507  15461074   463832220
You are seeing two rows because A.PROSASORG has two different values for the "C.LIFCLNTNO, A.PROPOLN" grouping.
i.e.
C.LIFCLNTNO, A.PROPOLN, A.PROSASORG together give you two unique rows.
If you want a single row for C.LIFCLNTNO, A.PROPOLN, then you may want to use an aggregate on A.PROSASORG as well.
Your entire query is being filtered on your "C" table by the one LifClntNo,
so you can leave that out of your group by and just have it as a MAX() value
in your select since it will always be the same value.
As for summing the PROSASORG column (raised in a comment on the other answer): just sum it. Your column names don't make their purpose evident, so I don't know if it is just a number, a quantity, or something else. You might want to pull that column out of your query completely if you only want results for a single product id.
For performance, I would suggest the following indexes:
Table Index
FPRODUCTPF ( PROPRDSTS, PRORECSTS, PROBNFLVL, PROPOLN )
FNBREQCPF ( IQCODE, IQCPLN, IQAPDAT )
FLIFERATPF ( LIFPOLN, LIFPRDCNT, LIFBNFCNT, LIFCLNTNO )
I have rewritten your query to put each JOIN condition with the table it belongs to, rather than putting everything in the WHERE clause.
SELECT
P.PROPOLN,
max( L.LIFCLNTNO ) LIFCLNTNO,
sum (P.PROSASORG) as sum
FROM
[FPRODUCTPF] P
join [FNBREQCPF] N
on N.IQCODE = 'B10000'
and P.PROPOLN = N.IQCPLN
and N.IQAPDAT > 20180101
join [FLIFERATPF] L
on L.LIFCLNTNO='2012042830507'
and P.PROPOLN = L.LIFPOLN
and P.PROPRDCNT = L.LIFPRDCNT
and P.PROBNFCNT = L.LIFBNFCNT
where
P.PROPRDSTS = '10'
and P.PRORECSTS = '1'
and P.PROBNFLVL = 'M'
and P.PROSASORG > 0
group by
P.PROPOLN
Now, one additional issue you will PROBABLY run into. You are doing a query with multiple joins, and it appears that there may be multiple records in EACH of your FNBREQCPF and FLIFERATPF tables for the same FPRODUCTPF entry. If so, you will get a Cartesian result: the PROSASORG value will be counted once for every combination of matching rows in the two other tables.
Ex: FProductPF has ID = X with a Prosasorg value of 3
FNBreQCPF has matching records of Y1 and Y2
FLIFERATPF has matching records of Z1, Z2 and Z3.
So now your total will be equal to 3 times 6 = 18.
If you look at the combinations, Y1:Z1, Y1:Z2, Y1:Z3 and Y2:Z1, Y2:Z2, Y2:Z3 give you the 6 qualifying entries, times the original value of 3, thus bloating your numbers, IF such multiple records exist in each respective table. Now imagine your tables have 30 and 40 matching instances respectively: you have just bloated your totals by a factor of 1200.
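Since the rest of this page leans on pandas, here is a hypothetical pandas sketch of the same fan-out: one product row matched by 2 and 3 rows in the other tables yields 2 x 3 = 6 join combinations, so the summed value comes out 6 times too large.

import pandas as pd

product = pd.DataFrame({'id': ['X'], 'prosasorg': [3]})
fnbreqcpf = pd.DataFrame({'id': ['X', 'X']})        # matches Y1, Y2
fliferatpf = pd.DataFrame({'id': ['X', 'X', 'X']})  # matches Z1, Z2, Z3

joined = product.merge(fnbreqcpf, on='id').merge(fliferatpf, on='id')
print(len(joined))                # 6 combination rows
print(joined['prosasorg'].sum())  # 18 instead of 3: the bloated total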

How to add rows based on Unique values of other columns in pandas

Let's say I have a data frame where, for each unique store, there are 6 unique MRP values: 1999, 2499, 2699, 2799, 2999 and 4499. For some of the unique stores not all of the price values are present, which means their sales qty is zero. How can I find those stores and, for each missing MRP value, add a row with that store, the missed MRP value and a sales qty of zero?
Use a join. First build an index of all the combinations you expect, then left-join your dataframe to it. A left join leaves the index values that have no data as null; apply logic to fill in the nulls to suit your needs.
import itertools
import pandas as pd

store_ids = df['STORE'].drop_duplicates()
mrp_values = (1999, 2499, 2699, 2799, 2999, 4499)  # all six expected MRP values
keys = itertools.product(store_ids, mrp_values)
temp = pd.DataFrame.from_records(
    [{'store_id': store_id, 'mrp': mrp_value} for store_id, mrp_value in keys])
result = temp.merge(df, how="left",
                    left_on=["store_id", "mrp"], right_on=["STORE", "MRP"])
# apply null-handling logic to result, e.g. fill the missing sales qty with 0
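An alternative sketch with the same effect, using MultiIndex.from_product plus reindex instead of an explicit product; STORE, MRP and a sales column (assumed here to be named SALES_QTY) come from the question, and each (STORE, MRP) pair is assumed to appear at most once:

import pandas as pd

full_index = pd.MultiIndex.from_product(
    [df['STORE'].drop_duplicates(), [1999, 2499, 2699, 2799, 2999, 4499]],
    names=['STORE', 'MRP'])

result = (df.set_index(['STORE', 'MRP'])
            .reindex(full_index)       # missing (store, MRP) pairs become NaN rows
            .fillna({'SALES_QTY': 0})  # assumed column name: zero sales for added rows
            .reset_index())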

How to use aggregate with condition in pandas?

I have a dataframe.
Following code works
stat = working_data.groupby(by=['url', 'bucket_id'],
as_index=False).agg({'delta': 'max','id': 'count'})
Now I need to count ids with different statuses. The status can be "DOWNLOADED", "NOT_DOWNLOADED" or "DOWNLOADING".
I would like a df with the columns bucket_id, max, downloaded (how many have the "DOWNLOADED" status), not_downloaded (how many have the "NOT_DOWNLOADED" status) and downloading (how many have the "DOWNLOADING" status). How can I do that?
The input and output I currently get were shown as images in the original post. As you can see, the count isn't divided by status, but I want to know that there are x downloaded, y not_downloaded and z downloading for each bucket_id (they should be in separate columns, but the info for one bucket_id should be in one row).
One way is to use assign to create boolean indicator columns, then aggregate those new columns:
(working_data
 .assign(downloaded=working_data['status'] == 'DOWNLOADED',
         not_downloaded=working_data['status'] == 'NOT_DOWNLOADED',
         downloading=working_data['status'] == 'DOWNLOADING')
 .groupby(by=['url', 'bucket_id'], as_index=False)
 .agg({'delta': 'max',
       'id': 'count',
       'downloaded': 'sum',   # booleans sum to the per-group counts
       'not_downloaded': 'sum',
       'downloading': 'sum'}))
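A sketch of an alternative that avoids spelling out one boolean column per status: count the statuses with value_counts plus unstack, then join the counts onto the max/count aggregate (column names taken from the question):

# per-(url, bucket_id) count of each status value, one column per status
counts = (working_data.groupby(['url', 'bucket_id'])['status']
                      .value_counts()
                      .unstack(fill_value=0)
                      .rename(columns=str.lower))

base = working_data.groupby(['url', 'bucket_id']).agg({'delta': 'max', 'id': 'count'})
stat = base.join(counts).reset_index()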

How to group by two columns and get a sum down a third column

I have a dataframe where each row is a prescription event and contains a drug name, a postcode, and the quantity prescribed. I need to find the total quantity of every drug prescribed in every postcode.
I need to group by postcode, then by drug name, and find the sum of the cells in the "items" column for every group.
This is my function that I want to apply to every postcode group:
def count(group):
    sums = []
    for bnf_name in group['bnf_name']:
        sum_ = group[group['bnf_name'] == bnf_name]['items'].sum()
        sums.append(sum_)
    group['sum'] = sums

merged.groupby('post_code').apply(count).head()
merged.head()
Calling merged.head() returns the original merged dataframe without a new column for sums like I would expect. I think there is something I don't get about the apply() function on a groupby object...
You do not need to define your own function; apply discarded your changes because count() modifies a copy of each group and returns nothing. A plain groupby sum already gives the total per (post_code, bnf_name) pair:
merged.groupby(['post_code','bnf_name'])['items'].sum()
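That returns a Series with a (post_code, bnf_name) MultiIndex; pass as_index=False (or chain reset_index()) for a flat dataframe. A minimal sketch on made-up data:

import pandas as pd

merged = pd.DataFrame({
    'post_code': ['AB1', 'AB1', 'CD2'],
    'bnf_name': ['drug_a', 'drug_a', 'drug_b'],
    'items': [3, 5, 2],
})

totals = merged.groupby(['post_code', 'bnf_name'], as_index=False)['items'].sum()
#   post_code bnf_name  items
# 0       AB1   drug_a      8
# 1       CD2   drug_b      2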

Transform a column of type string to an array/record i.e. nesting a column

I am trying to calculate and retrieve some indicators from multiple tables I have in my dataset on BigQuery. I want to invoke nesting on sfam, which is a column of strings that can have values or be null, but I can't do that for now. The goal is to transform that column into an array/record; that's the idea that came to mind, and I have no idea how to go about doing it.
The product and cart are grouped by key_web, dat_log, univ, suniv, fam and sfam.
The data is broken down into universes referred to as univ, each composed of sub-universes referred to as suniv. Sub-universes contain families referred to as fam, which may or may not have sub-families referred to as sfam. I want to invoke nesting on prd.sfam to reduce the resulting columns.
The data is collected from Google Analytics for insight into website traffic and user activity.
I am trying to get information and indicators about each visitor, the amount of time he/she spent on particular pages, actions taken and so on. The resulting table gives me the sum of time spent on those pages, the sum of the total number of visits for a single day and a breakdown of the category it belongs to, thus the univ, suniv, fam and sfam columns, which are of type string (sfam can be null, since some sub-universes suniv only have families fam and don't go down to a sub-family level sfam).
dat_log: refers to the date
nrb_fp: number of views for a product page
tps_fp: total time spent on said page
I tried different methods that I found online but none worked, so I am posting my code and problem in hope of finding guidance and a solution!
A simpler query would be:
select
prd.key_web
, dat_log
, prd.nrb_fp
, prd.tps_fp
, prd.univ
, prd.suniv
, prd.fam
, prd.sfam
from product as prd
left join cart as cart
on prd.key_web = cart.key_web
and prd.dat_log = cart.dat_log
and prd.univ = cart.univ
and prd.suniv = cart.suniv
and prd.fam = cart.fam
and prd.sfam = cart.sfam
And this is a sample result of the query for the last 6 columns in text and images:
Again, I want to get an array column for sfam that holds all the string values of sfam, even nulls.
I limited the output to only the last 6 columns; the first 3 are the row, key_web and dat_log. Each fam is composed of several sfam or none (null), and I want to be able to do nesting on either fam or sfam.
I want to get a column of array as sfam where I have all the string values of sfam even nulls.
This is not possible in BigQuery. As the documentation explains:
Currently, BigQuery has the following limitations with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
That is, your result set cannot contain an array with NULL elements.
Obviously, in BigQuery you cannot output an ARRAY that holds NULLs, but if for some reason you need to preserve them somehow, the workaround is to create an array of STRUCTs as opposed to an array of single elements.
For example (BigQuery Standard SQL), if you try to execute the query below
SELECT ['a', 'b', NULL] arr1, ['x', NULL, NULL] arr2
you will get the error: Array cannot have a null element; error in writing field arr1
Whereas if you try the query below
SELECT ARRAY_AGG(STRUCT(val1, val2)) arr
FROM UNNEST(['a', 'b', NULL]) val1 WITH OFFSET
JOIN UNNEST(['x', NULL, NULL]) val2 WITH OFFSET
USING(OFFSET)
you get this result:

Row  arr.val1  arr.val2
1    a         x
     b         null
     null      null
As you can see, with this approach you can even have both elements NULL.