Combinations in Pandas Python (more than 2 unique)

I have a dataframe where each row has a particular activity of a user:
UserID Purchased
A Laptop
A Food
A Car
B Laptop
B Food
C Food
D Car
Now I want to find all the unique combinations of purchased products and the number of unique users for each combination. My data set has around 8 different products, so doing this manually is very time-consuming. I want the end result to be something like:
Number of products  Products         Unique count of Users
1                   Food             1
1                   Car              1
2                   Laptop,Food      1
3                   Car,Laptop,Food  1

import pandas as pd

# updated sample data
d = {'UserID': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'C', 6: 'D', 7: 'C'},
     'Purchased': {0: 'Laptop',
                   1: 'Food',
                   2: 'Car',
                   3: 'Laptop',
                   4: 'Food',
                   5: 'Food',
                   6: 'Car',
                   7: 'Laptop'}}
df = pd.DataFrame(d)
# group by user id and combine the purchases into a tuple
new_df = df.groupby('UserID').agg(tuple)
# list comprehension to sort each user's grouped purchases
new_df['Purchased'] = [tuple(sorted(x)) for x in new_df['Purchased']]
# group by purchases and get the count, which is the number of users for each purchase combination
final_df = new_df.reset_index().groupby('Purchased').agg('count').reset_index()
# get the len of Purchased, which is the number of products in the tuple
final_df['num_of_prod'] = final_df['Purchased'].apply(len)
# rename the columns
final_df = final_df.rename(columns={'UserID': 'user_count'})
             Purchased  user_count  num_of_prod
0               (Car,)           1            1
1  (Car, Food, Laptop)           1            3
2       (Food, Laptop)           2            2
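Putting the steps together, here is a condensed, self-contained sketch of the same approach (same data and column names as the answer above; the pipeline style is just a compaction, not a change in logic):

```python
import pandas as pd

d = {'UserID': ['A', 'A', 'A', 'B', 'B', 'C', 'D', 'C'],
     'Purchased': ['Laptop', 'Food', 'Car', 'Laptop', 'Food', 'Food', 'Car', 'Laptop']}
df = pd.DataFrame(d)

# one sorted tuple of purchases per user
baskets = (df.groupby('UserID')['Purchased']
             .agg(lambda s: tuple(sorted(s)))
             .reset_index())
# count users per identical basket
final_df = (baskets.groupby('Purchased')
                   .agg(user_count=('UserID', 'count'))
                   .reset_index())
# number of products in each basket
final_df['num_of_prod'] = final_df['Purchased'].map(len)
print(final_df)
```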

Unsure how to use where clause with two columns

I just can’t figure this one out. I've been trying for hours.
I have a table like this…
ID  sample
1   A
1   B
1   C
1   D
2   A
2   B
3   A
4   A
4   B
4   C
5   B
I'm interested in getting all the samples that match 'A', 'B' and 'C' for a given ID. The ID must contain all 3 sample types. There are a lot more sample types in the table but I'm interested in just A, B and C.
Here's my desired output...
ID  sample
1   A
1   B
1   C
4   A
4   B
4   C
If I use this:
WHERE sample in ('A', 'B', 'C')
I get this result:
ID  sample
1   A
1   B
1   C
2   A
2   B
3   A
4   A
4   B
4   C
5   B
Any ideas on how I can get my desired output?
One ANSI-compliant way would be to aggregate using a distinct count:
select id, sample
from t
where sample in ('A', 'B', 'C')
  and id in (
    select id
    from t
    where sample in ('A', 'B', 'C')
    group by id
    having count(distinct sample) = 3
);
WHERE sample in ('A', 'B', 'C')
should eliminate any other samples such as 'D'. You could also write the same filter as:
WHERE sample = 'A' OR sample = 'B' OR sample = 'C'
Not sure what flavor of SQL is being used, but here's an example to work off of:
Postgres - db-fiddle
SELECT id
FROM t
GROUP BY id
HAVING array_agg(sample) @> array['A', 'B', 'C']::varchar[];
-- HAVING 'A' = ANY (array_agg(sample))
--    AND 'B' = ANY (array_agg(sample))
--    AND 'C' = ANY (array_agg(sample))
Presto
SELECT id
FROM t
GROUP BY id
HAVING contains(array_agg(sample), 'A')
AND contains(array_agg(sample), 'B')
AND contains(array_agg(sample), 'C')
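The same "keep only IDs that have all three samples" logic can be checked in pandas. This is a hedged sketch, not part of any posted answer; the variable `t` mirrors the SQL table name and the data comes from the question:

```python
import pandas as pd

t = pd.DataFrame({
    'ID':     [1, 1, 1, 1, 2, 2, 3, 4, 4, 4, 5],
    'sample': ['A', 'B', 'C', 'D', 'A', 'B', 'A', 'A', 'B', 'C', 'B'],
})

wanted = {'A', 'B', 'C'}
# first keep only the wanted samples (drops the 'D' row) ...
subset = t[t['sample'].isin(wanted)]
# ... then keep only IDs that have all three of them
complete = subset.groupby('ID')['sample'].transform('nunique') == len(wanted)
result = subset[complete]
print(result)
```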

Selecting rows based on the value of columns in other rows

For this problem I would be happy with a solution either in R (ideally with dplyr but other methods would also be OK) or pure SQL.
I have data consisting of individuals (ID) and email addresses, plus a binary indicator representing whether the email address is the individual's primary email address (1) or not (0):
all IDs have one and only one primary email address
IDs can have several non-primary email addresses (or none)
IDs can have the same email address as both primary and non-primary
For example:
   ID Email Primary
1   1     A       1
2   1     A       0
3   1     B       0
4   2     A       1
5   2     A       0
6   3     C       1
7   4     D       1
8   4     C       0
9   5     E       1
10  5     F       0
(The actual dataset has around half a million rows)
I wish to identify IDs where an email address is non-primary, but is primary for a different ID. That is, I want to select rows where:
Primary is 0
There exists another row with the same Email where Primary is 1, but for a different ID
Thus in the data above, I want to select row 5 (the email address is non-primary there, but primary in row 1 for a different ID), row 8 (non-primary, but primary in row 6 for a different ID), and row 2 (non-primary, but primary in row 4 for a different ID).
For R users, here is the toy dataframe above:
structure(list(ID = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5), Email = c("A", "A", "B", "A", "A", "C", "D", "C", "E", "F"), Primary = c(1, 0, 0, 1, 0, 1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA, -10L))
You can select rows where:
Primary = 0
the number of IDs for that Email is greater than 1
there is at least one Primary = 1 for that Email
Using dplyr, you can do this as :
library(dplyr)
df %>%
group_by(Email) %>%
filter(Primary == 0, n_distinct(ID) > 1, any(Primary == 1))
# ID Email Primary
# <dbl> <chr> <dbl>
#1 1 A 0
#2 2 A 0
#3 4 C 0
Since you have big data a data.table solution would be helpful :
library(data.table)
setDT(df)[, .SD[Primary == 0 & uniqueN(ID) > 1 & any(Primary == 1)], Email]
In SQL, you can use exists for this:
select t.*
from mytable t
where t.primary = 0
  and exists (
    select 1
    from mytable t1
    where t1.email = t.email
      and t1.id <> t.id
      and t1.primary = 1
  )
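Since the question also welcomed non-dplyr approaches, the same exists-style filter can be sketched in pandas with a self-merge. This is a hedged sketch under the question's column names; `ID_owner` is just the suffix the merge adds to the second copy of `ID`:

```python
import pandas as pd

df = pd.DataFrame({
    'ID':      [1, 1, 1, 2, 2, 3, 4, 4, 5, 5],
    'Email':   ['A', 'A', 'B', 'A', 'A', 'C', 'D', 'C', 'E', 'F'],
    'Primary': [1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
})

primaries = df.loc[df['Primary'] == 1, ['ID', 'Email']]
candidates = df[df['Primary'] == 0]
# join each non-primary row to the primary rows sharing its Email,
# then keep matches where the owning ID differs
merged = candidates.merge(primaries, on='Email', suffixes=('', '_owner'))
result = (merged[merged['ID'] != merged['ID_owner']]
          [['ID', 'Email', 'Primary']]
          .drop_duplicates())
print(result)
```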

Divide Total Values Into Three Parts - Bin Packing

I am trying to divide the second column into three contiguous parts whose totals are as equal as possible; if the overall total is not divisible by three, the first part's total should be greater than or equal to the second's, and the second's greater than or equal to the third's.
Here is the sample of my table:
ID | Values
-----------
1  | 1
2  | 1
3  | 2
4  | 1
5  | 2
6  | 1
7  | 3
My expected output should be (The ID column here is just a row number):
ID | Divided
------------
1  | 3
2  | 3
3  | 1
As you can see, the values total 11 if you add them all. Dividing by 3 gives part sums of 4, 4, and 3, so I just count the rows in each part to get the expected output:
(1, 1, 2) = 3 rows,
next is (1, 2, 1) = 3 rows,
and the last is (3) = 1 row.
Any idea how I should approach this problem? I can compute a third of the total, but counting rows to satisfy the specific business logic is difficult:
select @intSum = SUM([Values]) / 3 from #table
I am trying to do this in a stored procedure.
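No answer was posted here, but the contiguous split can be sketched greedily: give each part a target of the ceiling of the remaining total divided by the remaining parts, and consume rows until the target is met. This is a sketch of that one idea, not a general bin-packing solver, and `split_into_three` is a hypothetical helper name:

```python
from math import ceil

def split_into_three(values):
    """Split `values` (in order) into 3 contiguous parts with sums as equal
    as possible, earlier parts taking any surplus. Returns row counts."""
    counts = []
    i = 0
    remaining_sum = sum(values)
    for parts_left in (3, 2, 1):
        # ceiling keeps earlier parts >= later parts when the total
        # is not divisible by three
        target = ceil(remaining_sum / parts_left)
        acc = 0
        start = i
        while i < len(values) and (acc < target or parts_left == 1):
            acc += values[i]
            i += 1
        counts.append(i - start)
        remaining_sum -= acc
    return counts

print(split_into_three([1, 1, 2, 1, 2, 1, 3]))  # row counts per part
```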

Pandas: How to grab unique values from a group?

Suppose I have a data frame df1 with columns A, B, C, where C is integer-valued. I want to group by df1.A and grab the top 5 rows of each group based on df1.C. However, I want to grab the top 5 values based on df1.C BUT with unique df1.B values. So the 5 rows grabbed should all have different df1.B values.
What I have so far grabs me what I want, except the df1.B values are not unique. How can I rewrite this so that the 5 rows contain unique column B for each group?
df2 = df1.sort_values('C').groupby('A').tail(5)
Sample data:
A B C
1 'group1' 'apple' 3
2 'group1' 'apple' 2
3 'group2' 'apple' 1
4 'group1' 'orange' 2
5 'group3' 'pineapple' 3
...
The 5 output rows for df1.A == 'group1' should NOT include both rows 1 and 2; they should include only one of the two.
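No answer was posted here, but one hedged sketch under the question's column names: sort descending by C, drop duplicate (A, B) pairs so only the best-C row per pair survives, then take the first 5 rows per group (descending sort plus `head(5)` is equivalent to the ascending sort plus `tail(5)` in the question):

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['group1', 'group1', 'group2', 'group1', 'group3'],
    'B': ['apple', 'apple', 'apple', 'orange', 'pineapple'],
    'C': [3, 2, 1, 2, 3],
})

# highest C first, keep only the best row per (A, B), then top 5 rows per A
df2 = (df1.sort_values('C', ascending=False)
          .drop_duplicates(subset=['A', 'B'])
          .groupby('A')
          .head(5))
print(df2)
```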

reclassify fields in one column of a table depending on criteria

I have the following table:
person drug
------ -----
1 Y
2 Y
2 other
3 X
4 X
5 X
5 other
6 other
7 Z
However, if a person has a drug X, Y, or Z (it will only be one distinct choice) plus 'other', then I want to remove the row that contains 'other'.
This means that someone with an 'X' and 'other' would have the row containing 'other' removed, but anyone with only 'other' stays as 'other'. i.e.
person drug
------ -----
1 Y
2 Y
3 X
4 X
5 X
6 other
7 Z
Person 6 only has 'other', so stays that way, but persons 2 and 5 have their 'other' rows removed because they also have a named drug (X, Y or Z).
Thanks very much for any help.
It is unclear whether you want these rows removed in the results of a query or in the data itself. To return results without these rows, the query can be written like this:
select t.*
from t
where not (t.drug = 'other' and
           exists (select 1 from t t2 where t2.person = t.person and t2.drug = 'X'))
To handle any of 'X', 'Y', or 'Z', change the last condition to t2.drug in ('X', 'Y', 'Z').
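If the data lives in a DataFrame instead, the same filter translates to pandas. This is a hedged sketch mirroring the SQL above: an 'other' row is kept only when the person has no named drug.

```python
import pandas as pd

df = pd.DataFrame({
    'person': [1, 2, 2, 3, 4, 5, 5, 6, 7],
    'drug':   ['Y', 'Y', 'other', 'X', 'X', 'X', 'other', 'other', 'Z'],
})

# True for every row of a person who has at least one non-'other' drug
has_named = df.groupby('person')['drug'].transform(lambda s: (s != 'other').any())
# keep named-drug rows always; keep 'other' rows only when nothing named exists
result = df[(df['drug'] != 'other') | ~has_named]
print(result)
```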