How to write this SQL in Pandas?

I have this SQL code and I want to write it in Pandas. Every example I saw uses groupby and order by outside of the window function, and that is not what I want. I don't want my data to look grouped; instead I just need a cumulative sum of my new column (reg_sum), ordered by hour for each article_id.
SELECT
    *,
    SUM(registrations) OVER (PARTITION BY article_id ORDER BY time) AS cumulative_regs
FROM table
Data example of what I need to get (reg_sum column):
article_id time registrations reg_sum
A 7 6 6
A 9 5 11
B 10 1 1
C 10 2 2
C 11 4 6
If anyone can say what is the equivalent of this in Pandas, that would be great. Thanks!

Using groupby and cumsum, this should work:
import pandas as pd
import numpy as np
# generate data
df = pd.DataFrame({'article_id': np.array(['A', 'A', 'B', 'C', 'C']),
                   'time': np.array([7, 9, 10, 10, 11]),
                   'registrations': np.array([6, 5, 1, 2, 4])})
# compute cumulative sum of registrations sorted by time and grouped by article_id
df['reg_sum'] = df.sort_values('time').groupby('article_id').registrations.cumsum()
Output:
article_id time registrations reg_sum
0 A 7 6 6
1 A 9 5 11
2 B 10 1 1
3 C 10 2 2
4 C 11 4 6
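This works even though the frame is sorted inside the expression, because pandas aligns the result back to the original index before assigning. If you also want the rows themselves ordered by article and time (as in the SQL output), a small variant of the same idea is:
df = df.sort_values(['article_id', 'time']).reset_index(drop=True)
df['reg_sum'] = df.groupby('article_id')['registrations'].cumsum()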

Python: obtaining the first observation according to its date

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
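A shorter chain-friendly equivalent, sketched here with GroupBy.head instead of apply, returns the same rows:
df.sort_values('B').groupby('A').head(1).reset_index(drop=True)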
I found an answer that is a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this Series result onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
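Note that with ties on the minimum B, drop_duplicates keeps only the first row per group, while the transform-based filter shown further below keeps all tied rows; a tiny sketch to illustrate:
tie = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [10, 20]})
print(tie.sort_values('B').drop_duplicates('A'))                # one row per group
print(tie[tie['B'] == tie.groupby('A')['B'].transform('min')])  # both tied rows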
The solution, as written before, is:
df.loc[df.groupby('A')['B'].idxmin()]
If you use this solution but get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
it is probably because of NaN values. In my case, there were NaN values in column B, so I used dropna() and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B holds the group-wise minimum value:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4

Combine two dataframes in Pandas to generate many to many relationship

I have two lists, say
customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]
I want to generate a Pandas dataframe so:
All customers and accounts are used
There is a many to many relationship between customers and accounts (one customer 'may' have multiple accounts and an account 'may' be owned by multiple customers)
I want the many to many relationship to be random. That is, some customers will have one account and some will have more than one. Similarly, some accounts will be owned by just one customers and others by more than one.
Something like,
Customer Account
a        1
a        2
b        2
c        3
a        4
b        4
c        4
b        5
b        6
b        7
b        8
a        9
Since I am generating random data, in the worst case scenario, I can generate way too many accounts and discard the unused ones if the code is easier (essentially relaxing the requirement 1 above).
I am using sample(n=20, replace=True) to generate 20 records in both dataframes and then merging them into one based on the index. Is there an out-of-the-box API or library to do this, or is my code below the recommended way?
import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

customers_df = pd.DataFrame(data=customers)
customers_df = customers_df.sample(n=20, replace=True)
customers_df['new_index'] = range(20)
customers_df.set_index('new_index', inplace=True)

accounts_df = pd.DataFrame(data=accounts)
accounts_df = accounts_df.sample(n=20, replace=True)
accounts_df['new_index'] = range(20)
accounts_df.set_index('new_index', inplace=True)

combined_df = pd.merge(customers_df, accounts_df, on='new_index')
print(combined_df)
Edit: Modified the question and added sample code I have tried.
One way to accomplish this is to collect the set of all possible relationships with a cartesian product, then select from that list before building your dataframe:
import itertools
import random

import pandas as pd

customers = ['a', 'b', 'c']
accounts = [1, 2, 3, 4, 5, 6, 7, 8, 9]

possible_associations = list(itertools.product(customers, accounts))

df = pd.DataFrame.from_records(random.choices(possible_associations, k=20),
                               columns=['customers', 'accounts']).sort_values(['customers', 'accounts'])
print(df)
Output
customers accounts
0 a 2
3 a 2
15 a 2
18 a 4
16 a 5
14 a 7
7 a 8
12 a 8
1 a 9
2 b 5
9 b 5
8 b 8
11 b 8
19 c 2
17 c 3
5 c 4
4 c 5
6 c 5
13 c 5
10 c 7
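Note that random.choices does not guarantee that every customer and every account shows up at least once (requirement 1 in the question). A simple, hedged way to enforce that is to redraw until the sample covers both lists:
picks = random.choices(possible_associations, k=20)
# redraw until every customer and every account is used at least once (assumes k is large enough)
while {c for c, _ in picks} != set(customers) or {a for _, a in picks} != set(accounts):
    picks = random.choices(possible_associations, k=20)
df = pd.DataFrame.from_records(picks, columns=['customers', 'accounts'])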
To get a repeatable test result, start with np.random.seed(1) (drop this line in your target version). The code below assumes import numpy as np and import pandas as pd.
Then proceed as follows:
Create a list of probabilities for how many customers an account can have, e.g.:
prob = [0.5, 0.25, 0.15, 0.09, 0.01]
Generate a Series stating how many owners each account shall have:
cnt = pd.Series(np.random.choice(range(1, len(prob) + 1), size=len(accounts), p=prob),
                name='Customer')
Its name is Customer because it will be the source for the Customer column.
For my sample probabilities and generator seeding the result is:
0 1
1 2
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: Customer, dtype: int32
(the left column is the index, the right one the actual values).
Because your data sample contains only 9 accounts, the result does not contain any of the larger owner counts. But in your target version, with more accounts, there will be accounts with greater numbers of owners.
Generate the result - cust_acct DataFrame, defining the assignment of customers
to accounts:
cust_acct = cnt.apply(lambda x: np.random.choice(customers, x, replace=False))\
    .explode().to_frame().join(pd.Series(accounts, name='Account')).reset_index(drop=True)
The result, for your sample data and my seeding and probabilities, is:
Customer Account
0 b 1
1 a 2
2 b 2
3 b 3
4 b 4
5 c 5
6 b 6
7 c 7
8 a 8
9 b 9
Of course, you can assume different probabilities in prob. You can also choose a different maximum number of owners (the number of entries in prob). No code change is needed in that case, because the range of values passed to the first np.random.choice is derived from the length of prob.
Note: Because your sample data contains only 3 customers, a different generator seed can raise ValueError: Cannot take a larger sample than population when 'replace=False'. This happens whenever the drawn number of owners for some account is greater than 3. With your target data, which has a greater number of customers, this error will not occur.

Check if list cell contains value

Having a dataframe like this:
month transactions_ids
0 1 [0, 5, 1]
1 2 [7, 4]
2 3 [8, 10, 9, 11]
3 6 [2]
4 9 [3]
For a given transaction_id, I would like to get the month when it took place. Notice that a transaction_id can only be related to one single month.
So for example, given transaction_id = 4, the month would be 2.
I know this can be done in a loop by checking, month by month, whether the related transactions_ids contain the given transaction_id, but I'm wondering if there is a more efficient way.
Cheers
The best way in my opinion is to explode your data frame and avoid having python lists in your cells.
df = df.explode('transactions_ids')
which outputs
month transactions_ids
0 1 0
0 1 5
0 1 1
1 2 7
1 2 4
2 3 8
2 3 10
2 3 9
2 3 11
3 6 2
4 9 3
Then, simply
id_to_find = 1 # example
df.loc[df.transactions_ids == id_to_find, 'month']
P.S: be aware of the duplicated indexes that explode outputs. In general, it is better to do explode(...).reset_index(drop=True) for most cases to avoid unwanted behavior.
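If you need to look up many transaction ids, you can also build a reverse mapping from the exploded frame once (a small sketch):
# reverse lookup Series: transaction id -> month
lookup = df.set_index('transactions_ids')['month']
print(lookup[4])  # -> 2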
You can use pandas string methods to find the id in the "list" (it's really just a string as far as pandas is concerned when read in using StringIO):
import pandas as pd
from io import StringIO
data = StringIO("""
month transactions_ids
1 [0,5,1]
2 [7,4]
3 [8,10,9,11]
6 [2]
9 [3]
""")
df = pd.read_csv(data, delim_whitespace=True)
df.loc[df['transactions_ids'].str.contains('4'), 'month']
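Note that a plain substring match like '4' would also hit ids such as 14 or 40 in larger data; a word-boundary regex (assuming the ids are written as plain integers) is safer:
df.loc[df['transactions_ids'].str.contains(r'\b4\b'), 'month']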
If your transactions_ids are real lists, you can use map to check for membership:
df['transactions_ids'].map(lambda x: 3 in x)
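The resulting boolean Series can then be used with .loc to pull the month, e.g.:
df.loc[df['transactions_ids'].map(lambda x: 4 in x), 'month'].iloc[0]  # -> 2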

Sequence of numbers per category given first entry (Python, Pandas)

Suppose I have 5 categories {A, B, C, D, E} and several date entries of PURCHASES with distinct dates (for instance, A may range from 01/01/1900 to 31/01/1901 and B from 02/02/1930 to 03/03/1933).
I want to create a new column 'day of occurrence' containing a sequence of numbers 1...N, starting from the first date on which the number of purchases is >= 5.
I want this in order to compare how similar the categories are from the day they achieved 5 purchases (dates are irrelevant here, but product lifetime is).
Thanks!
Here is how you can label rows from 1 to N depending on column value.
import pandas as pd
df = pd.DataFrame(data=[3, 6, 9, 3, 6], columns=['data'])
df['day of occurrence'] = 0
values_count = df.loc[df['data'] > 5].shape[0]
df.loc[df['data'] > 5, 'day of occurrence'] = range(1, values_count + 1)
The initial DataFrame:
data
0 3
1 6
2 9
3 3
4 6
Output DataFrame:
data day of occurrence
0 3 0
1 6 1
2 9 2
3 3 0
4 6 3
Your data should be sorted by date, for example, df = df.sort_values(by='your-datetime-column')
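The snippet above works on a single unlabeled column; for the per-category version described in the question, a hedged sketch (the column names category, date and purchases are assumptions) could look like this:
import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B'],
                   'date': pd.to_datetime(['1900-01-01', '1900-01-05', '1900-01-09',
                                           '1930-02-02', '1930-02-10']),
                   'purchases': [3, 6, 9, 2, 7]})

df = df.sort_values(['category', 'date'])
# True from the first date where purchases >= 5 onwards, separately per category
started = df.groupby('category')['purchases'].transform(lambda s: (s >= 5).cummax())
df['day of occurrence'] = 0
df.loc[started, 'day of occurrence'] = df[started].groupby('category').cumcount() + 1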

How to multiply iteratively down a column?

I am having a tough time with this one - not sure why...maybe it's the late hour.
I have a dataframe in pandas as follows:
1 10
2 11
3 20
4 5
5 10
I would like to calculate, for each row, the product of that row and all the rows above it. For example, at row 3, I would like to calculate 10*11*20, or 2,200.
How do I do this?
Use cumprod.
Example:
df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note: since your example is just a single column, a cumulative product can be computed succinctly on a Series:
import pandas as pd
s = pd.Series([10, 11, 20, 5, 10])
s
# Output
0 10
1 11
2 20
3 5
4 10
dtype: int64
s.cumprod()
# Output
0 10
1 110
2 2200
3 11000
4 110000
dtype: int64
Kudos to @bananafish for locating the built-in cumprod method.