Find the sum of previous count occurrences per unique ID in pandas

I have a history of customer IDs and purchase IDs, where no customer has ever bought the same product twice. For each purchase ID (which is unique), how can I find out how many purchases the customer has made so far, including that one?
I have tried using groupby() and sort_values():
import pandas as pd

df = pd.DataFrame({'id_cust': [1, 2, 1, 3, 2, 4, 1],
                   'id_purchase': ['20A', '143C', '99B', '78R', '309D', '90J', '78J']})
df.sort_values(by='id_cust')
df.groupby('id_cust')['id_purchase'].cumcount()
This is what I expect:
id_cust  id_purchase  value
1        20A          1
2        143C         1
1        99B          2
3        78R          1
2        309D         2
4        90J          1
1        78J          3

You can just use cumcount() on the id_cust column, since id_purchase is unique:
df['value'] = df.groupby('id_cust')['id_cust'].cumcount() + 1
print(df)
id_cust id_purchase value
0 1 20A 1
1 2 143C 1
2 1 99B 2
3 3 78R 1
4 2 309D 2
5 4 90J 1
6 1 78J 3
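As an aside, cumcount() numbers the rows within each group in the order they appear, starting at 0, so the +1 above is what makes the count 1-based. Note also that df.sort_values(by='id_cust') in the attempt returns a new frame rather than sorting in place, and no sorting is needed here anyway. A minimal sanity check of the answer:

import pandas as pd

df = pd.DataFrame({'id_cust': [1, 2, 1, 3, 2, 4, 1],
                   'id_purchase': ['20A', '143C', '99B', '78R', '309D', '90J', '78J']})

# zero-based running count per customer, shifted to start at 1
df['value'] = df.groupby('id_cust')['id_cust'].cumcount() + 1

# '78J' is customer 1's third purchase, so its value should be 3
assert df.loc[df['id_purchase'] == '78J', 'value'].item() == 3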

Related

Assign value to a group in pandas based on the day difference between rows

I have a dataframe with ID and date (and a calculated day difference between consecutive rows for the same ID):
ID date day_difference
1 27/06/2019 0
1 28/06/2019 1
1 29/06/2019 1
1 01/07/2019 2
1 02/07/2019 1
1 03/07/2019 1
1 05/07/2019 2
2 27/06/2019 0
2 28/06/2019 1
2 29/06/2019 1
2 01/08/2019 33
2 02/08/2019 1
2 03/08/2019 1
2 04/08/2019 1
I would like to group by ID and calculate the total duration per group, with one condition: if the day difference is bigger than 30 days, reuse that ID to start a new group, counting the duration again from the first day after the 30-day gap.
Desired result
ID Duration
1 8
2 3
2 4
Thanks.
You can do:
(df.groupby(['ID', df.day_difference.gt(30).cumsum()])
.agg(ID=('ID','first'), Duration=('ID','count'))
.reset_index(drop=True)
)
Output:
ID Duration
0 1 7
1 2 3
2 2 4
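The trick is that df.day_difference.gt(30) is True exactly on the row where a large gap starts, and cumsum() turns those booleans into a running segment number, so grouping on ['ID', segment] splits each ID at every 30-plus-day gap. A minimal sketch of the same idea, using size() instead of aggregating the grouping column itself (which some pandas versions are picky about):

import pandas as pd

df = pd.DataFrame({
    'ID':             [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    'day_difference': [0, 1, 1, 2, 1, 1, 2, 0, 1, 1, 33, 1, 1, 1],
})

# 0 before any gap > 30 days, 1 after the first such gap, and so on
segment = df['day_difference'].gt(30).cumsum()

out = (df.groupby(['ID', segment])
         .size()
         .reset_index(level=1, drop=True)
         .reset_index(name='Duration'))
print(out)
#    ID  Duration
# 0   1         7
# 1   2         3
# 2   2         4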

Count number of rows before date per id

I'm not sure how else to explain it other than the title. I'm basically trying to get the number of rows per id before the date on that specific row. I've tried a bunch of things and scoured the internet to no avail. Please help!
Before
id date
1 3/3/2015
2 3/27/2015
2 4/15/2015
2 5/1/2015
3 3/7/2015
3 5/17/2015
3 7/9/2015
3 7/19/2015
After
id date count
1 3/3/2015 0
2 3/27/2015 0
2 4/15/2015 1
2 5/1/2015 2
3 3/7/2015 0
3 5/17/2015 1
3 7/9/2015 2
3 7/19/2015 3
-1 + row_number() over (partition by id order by date)
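That answer is a SQL window function. The pandas equivalent (a sketch, assuming the dates are strings in month/day/year format) is a zero-based cumulative count per id after sorting by date:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 2, 3, 3, 3, 3],
                   'date': ['3/3/2015', '3/27/2015', '4/15/2015', '5/1/2015',
                            '3/7/2015', '5/17/2015', '7/9/2015', '7/19/2015']})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# -1 + row_number() over (partition by id order by date)
df['count'] = df.sort_values('date').groupby('id').cumcount()
print(df)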

Pandas: keep the first three rows containing a value for each unique value [duplicate]

Suppose I have a pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after groupby:
dfN = df.groupby('id').apply(lambda x: x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
      id  value
id
1  0   1      1
   1   1      2
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
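Expanding on the sorting caveat above: if the rows are not already in the order you want, sort first so that head() picks a well-defined top 2 per group (a sketch, reusing the question's df):

# sort so that head(2) takes the two largest values per id
top2 = (df.sort_values(['id', 'value'], ascending=[True, False])
          .groupby('id')
          .head(2)
          .reset_index(drop=True))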
Since pandas 0.14.1, you can do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1  2    3
   1    2
2  6    4
   5    3
3  7    1
4  8    1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: from pandas 0.17.1 you'll be able to do this on a DataFrameGroupBy too, but for now it only works with Series and SeriesGroupBy.)
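Concretely, flattening that nlargest result back into two columns might look like this (a sketch of the step just described, reusing the question's df):

top2 = (df.groupby('id')['value'].nlargest(2)
          .reset_index(level=1, drop=True)  # drop the original row index
          .reset_index())                   # turn 'id' back into a column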
Sometimes sorting the whole dataset up front is very time-consuming. We can group first and take the top k rows within each group instead:
topk = 2  # number of rows to keep per group
g = df.groupby('id').apply(lambda x: x.nlargest(topk, 'value')).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest. The value passed to head() plays the same role as the argument of nlargest(): the number of rows to keep per group. The reset_index call is optional.
This works for duplicated values
If the top-n values contain duplicates and you want only unique values, you can do the following:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want de-duplicated salaries per department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the first 2 rows of each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8,000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
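Exact ratios will vary with machine and pandas version; a rough harness to reproduce the comparison on synthetic data (sizes chosen to match the description above) could look like:

import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'id': rng.integers(0, 8000, size=100_000),
                   'value': rng.random(100_000)})
N = 2

def rank_filter():
    # keep rows whose within-group rank (by descending value) is <= N
    msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
    return df[msk]

def apply_nlargest():
    return df.groupby('id').apply(lambda x: x.nlargest(N, 'value'))

for fn in (rank_filter, apply_nlargest):
    secs = timeit.timeit(fn, number=5) / 5
    print(f'{fn.__name__}: {secs:.3f} s per run')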
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

DB Query matching ids and sum data on columns

Here is the info I have on my tables. What I need is to create a report based on certain dates, summing every stock movement for the same id.
Table One (Items)     Table Two (Stocks)
-----------------     -----------------------------------
ID   NAME             items_id   altas   bajas   created_at
1    White            4          5       0       8/10/2016
2    Black            2          1       5       8/10/2016
3    Red              3          3       2       8/11/2016
4    Blue             4          1       4       8/11/2016
                      2          10      2       8/12/2016
So, based on the customer's choice of dates (in this case let's say it selects all the data available in the table), I need to group them by items_id and then SUM all altas and all bajas for that items_id, ending up with the following:
items_id altas bajas
1 0 0
2 11 7
3 3 2
4 6 4
Any help solving this?
Hope this will help:
Stock.select("items_id, sum(altas) as altas, sum(bajas) as bajas").group("items_id")
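For readers following along in pandas rather than ActiveRecord, the same roll-up is a one-line groupby sum (a sketch, with a hypothetical frame mirroring the Stocks table):

import pandas as pd

stocks = pd.DataFrame({'items_id': [4, 2, 3, 4, 2],
                       'altas':    [5, 1, 3, 1, 10],
                       'bajas':    [0, 5, 2, 4, 2]})

totals = stocks.groupby('items_id', as_index=False)[['altas', 'bajas']].sum()
print(totals)
#    items_id  altas  bajas
# 0         2     11      7
# 1         3      3      2
# 2         4      6      4

Note that item 1 does not appear because it has no stock movements; producing the "1 0 0" row from the desired output would additionally require a left join against the Items table.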

Get data for a given number of days by converting rows to column dynamically

This is a follow-up to my previous question: Get records for last 10 dates
I have to generate reports for all books of a store along with sold count (if any) for the last N dates, by passing storeId.
BOOK                BOOK SOLD                        STORE
---------------     -----------------------------    ----------
Id  Name  SID       Id  Bid  Count  Date             SID  Name
1   ABC   1         1   1    20     11/12/2015       1    MNA
2   DEF   1         2   1    30     12/12/2015       2    KLK
3   DF2   2         3   2    20     11/12/2015       3    KJH
4   DF3   3         4   3    10     13/12/2015
5   GHB   3         5   4    5      14/12/2015
The number of dates N is supplied by the user. This is the expected output for the last 4 dates for storeIds 1, 2 & 3.
BookName 11/12/2015 12/12/2015 13/12/2015 14/12/2015
ABC 20 30 -- --
DEF 20 -- -- --
DF2 -- -- 10 --
DF3 -- -- -- 5
GHB -- -- -- --
If the user passes 5, then data for the last 5 dates shall be generated, with 14/12/2015 as the most recent date.
I am using Postgres 9.3.
Cross table without the crosstab function (assuming the sales table is named sold):
SELECT
    book.Name,
    SUM(CASE WHEN sold.Date = '11/12/2015' THEN sold.Count ELSE 0 END) AS "11/12/2015",
    SUM(CASE WHEN sold.Date = '12/12/2015' THEN sold.Count ELSE 0 END) AS "12/12/2015",
    SUM(CASE WHEN sold.Date = '13/12/2015' THEN sold.Count ELSE 0 END) AS "13/12/2015",
    SUM(CASE WHEN sold.Date = '14/12/2015' THEN sold.Count ELSE 0 END) AS "14/12/2015"
FROM book
LEFT JOIN sold ON sold.Bid = book.Id
WHERE book.SID IN (1, 2, 3)
GROUP BY book.Id, book.Name
ORDER BY book.Id;
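Since N is supplied by the user, the date columns have to be generated at runtime rather than hard-coded. A sketch of a hypothetical helper (Python; table and column names as assumed in the query above) that builds the same pivot query for any list of dates:

def build_report_sql(dates, store_ids):
    """Build the pivot query for the given date strings and store ids.

    Values are interpolated directly for illustration only; use a
    parameterized query for any untrusted input.
    """
    cols = ',\n    '.join(
        f"SUM(CASE WHEN sold.Date = '{d}' THEN sold.Count ELSE 0 END) AS \"{d}\""
        for d in dates
    )
    ids = ', '.join(str(i) for i in store_ids)
    return (
        'SELECT\n'
        '    book.Name,\n'
        f'    {cols}\n'
        'FROM book\n'
        'LEFT JOIN sold ON sold.Bid = book.Id\n'
        f'WHERE book.SID IN ({ids})\n'
        'GROUP BY book.Id, book.Name\n'
        'ORDER BY book.Id;'
    )

print(build_report_sql(['11/12/2015', '12/12/2015', '13/12/2015', '14/12/2015'],
                       [1, 2, 3]))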