Pandas: How to grab unique values from a group?

Suppose I have a data frame df1 with columns A, B, C, where C is integers. I want to group by df1.A and grab the top 5 rows of each group based on df1.C. However, I want to grab the top 5 values based on df1.C BUT with unique df1.B values. So the 5 rows grabbed should all have different df1.B values.
What I have so far grabs me what I want, except the df1.B values are not unique. How can I rewrite this so that the 5 rows contain unique column B for each group?
df2 = df1.sort_values('C').groupby('A').tail(5)
Sample data:
A B C
1 'group1' 'apple' 3
2 'group1' 'apple' 2
3 'group2' 'apple' 1
4 'group1' 'orange' 2
5 'group3' 'pineapple' 3
...
The output 5 rows for the df1.A == 'group1' should NOT include both 1 and 2. It should only include one of the two.
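One possible approach, sketched under two assumptions that the question does not state outright: "top" means the largest values of C, and when an (A, B) pair occurs more than once the row with the largest C is the one to keep. The sample frame below is made up to mirror the question's layout, and `top_n_unique` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample data, modelled on the layout in the question.
df1 = pd.DataFrame({
    "A": ["group1", "group1", "group2", "group1", "group3", "group1"],
    "B": ["apple", "apple", "apple", "orange", "pineapple", "kiwi"],
    "C": [3, 2, 1, 2, 3, 5],
})

def top_n_unique(df, n=5):
    """Return up to n rows per A with the largest C, one row per (A, B) pair."""
    return (
        df.sort_values("C")                                  # ascending: largest C ends up last
          .drop_duplicates(subset=["A", "B"], keep="last")   # keep the largest-C row per (A, B)
          .groupby("A")
          .tail(n)                                           # last n rows per group = n largest C
    )

df2 = top_n_unique(df1)
print(df2)
```

Dropping duplicates before taking the tail matters: doing it afterwards could leave a group with fewer than 5 rows even when enough distinct B values exist.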

Related

Unsure how to use where clause with two columns

I just can’t figure this one out. I've been trying for hours.
I have a table like this…
ID sample
1 A
1 B
1 C
1 D
2 A
2 B
3 A
4 A
4 B
4 C
5 B
I'm interested in getting all the samples that match 'A', 'B' and 'C' for a given ID. The ID must contain all 3 sample types. There are a lot more sample types in the table but I'm interested in just A, B and C.
Here's my desired output...
ID sample
1 A
1 B
1 C
4 A
4 B
4 C
If I use this:
WHERE sample in ('A', 'B', 'C')
I get this result:
ID sample
1 A
1 B
1 C
1 D
2 A
2 B
3 A
4 A
4 B
4 C
5 B
Any ideas on how I can get my desired output?

One ANSI-compliant way would be to aggregate using a distinct count:
select id, sample
from t
where sample in ('A','B','C')
  and id in (
    select id from t
    where sample in ('A','B','C')
    group by id
    having Count(distinct sample)=3
  );
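To check the distinct-count query against the question's sample table, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here, since the question does not say which database is in use, and the `ORDER BY` is added just to make the printed output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, sample TEXT)")

# Sample rows copied from the question.
rows = [(1, "A"), (1, "B"), (1, "C"), (1, "D"), (2, "A"), (2, "B"),
        (3, "A"), (4, "A"), (4, "B"), (4, "C"), (5, "B")]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
SELECT id, sample
FROM t
WHERE sample IN ('A', 'B', 'C')
  AND id IN (
      SELECT id FROM t
      WHERE sample IN ('A', 'B', 'C')
      GROUP BY id
      HAVING COUNT(DISTINCT sample) = 3   -- the id must have all three samples
  )
ORDER BY id, sample
"""
result = conn.execute(query).fetchall()
print(result)
```

Only IDs 1 and 4 come back, and the 'D' row for ID 1 is filtered out by the outer `WHERE`.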

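WHERE sample in ('A', 'B', 'C')
should eliminate any other samples, such as 'D'. On its own, though, it does not check that an ID has all three samples.
The same filter can also be written out with OR:
WHERE sample = 'A' OR sample = 'B' OR sample = 'C'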

Not sure what flavor of SQL is being used, but here are two examples to work from:
PostgreSQL
SELECT id
FROM t
GROUP BY id
HAVING array_agg(sample) @> array['A', 'B', 'C']::varchar[];
-- Alternative:
-- HAVING 'A' = ANY (array_agg(sample))
--    AND 'B' = ANY (array_agg(sample))
--    AND 'C' = ANY (array_agg(sample))
Presto
SELECT id
FROM t
GROUP BY id
HAVING contains(array_agg(sample), 'A')
   AND contains(array_agg(sample), 'B')
   AND contains(array_agg(sample), 'C')

How to check the value of any row in a group after a previous one fulfils a condition?

I have a dataset grouped by test subjects that is filled according to the actions they perform. I need to find which customer does A and then, at some point, does B; but it doesn't necessarily have to be in the next action/row. And it can't be first does B and then A, it has to be specifically in that order. For example, I have this table:
Subject ActionID ActionOrder
1 A 1
1 C 2
1 D 3
1 B 4
1 C 5
2 D 1
2 A 2
2 C 3
2 B 4
3 B 1
3 D 2
3 A 3
4 A 1
Here, subjects 1 and 2 are the ones that fulfil the order-of-actions condition. Subject 3 does not, because it performs the actions in reverse order, and subject 4 only does action A.
How can I get only subjects 1 and 2 as results? Thank you very much

Use conditional aggregation:
SELECT Subject
FROM tablename
WHERE ActionID IN ('A', 'B')
GROUP BY Subject
HAVING MAX(CASE WHEN ActionID = 'A' THEN ActionOrder END) <
MIN(CASE WHEN ActionID = 'B' THEN ActionOrder END)
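The conditional-aggregation query can be tried against the question's sample data with Python's built-in sqlite3 module. This is only a sketch: SQLite and the table name `tablename` stand in for whatever database is actually in use. Note that, as written, the condition requires the last A to come before the first B; swapping in `MIN` for A and `MAX` for B would instead accept any A that precedes any B.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (Subject INTEGER, ActionID TEXT, ActionOrder INTEGER)")

# Sample rows copied from the question.
rows = [(1, "A", 1), (1, "C", 2), (1, "D", 3), (1, "B", 4), (1, "C", 5),
        (2, "D", 1), (2, "A", 2), (2, "C", 3), (2, "B", 4),
        (3, "B", 1), (3, "D", 2), (3, "A", 3),
        (4, "A", 1)]
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?)", rows)

query = """
SELECT Subject
FROM tablename
WHERE ActionID IN ('A', 'B')
GROUP BY Subject
HAVING MAX(CASE WHEN ActionID = 'A' THEN ActionOrder END) <
       MIN(CASE WHEN ActionID = 'B' THEN ActionOrder END)
ORDER BY Subject
"""
# Subject 4 has no B, so MIN(...) is NULL and the comparison is not true.
result = [row[0] for row in conn.execute(query)]
print(result)
```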

Consider the option below:
select Subject
from (
  select Subject,
    regexp_replace(string_agg(ActionID, '' order by ActionOrder), r'[^AB]', '') check
  from `project.dataset.table`
  group by Subject
)
where not starts_with(check, 'B')
  and check like '%AB%'
The above assumes that a Subject can potentially perform the same action multiple times, which is why there are a few extra checks in the where clause. Otherwise it would be just check = 'AB'.

SQL - after sorting, return only rows with certain consecutive values in a column

I have columns name, timestamp, doing. I've already sorted by name, then by timestamp, and I expect that moving down the doing column within a group with the same name looks like A, A, A, B, B, A, A, ... - alternating series of A and B. I need to get only the rows which comprise the first B row after a transition from A to B within a group with the same name.
name timestamp doing
1 1 A
1 2 A
1 3 B
1 4 B
1 5 A
2 2 B
2 4 A
2 6 B
2 8 A
I would like to return
name timestamp doing
1 3 B
2 6 B
But not
2 2 B
because it is not a transition from A to B within name = 2.

I think you just want lag():
select t.*
from (select t.*,
             lag(doing) over (partition by name order by timestamp) as prev_doing
      from t
     ) t
where prev_doing = 'A' and doing = 'B';
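The lag() approach can be checked on the question's sample rows with Python's built-in sqlite3 module. This sketch assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added; the explicit column list and the `ORDER BY` are there only to make the output predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name INTEGER, timestamp INTEGER, doing TEXT)")

# Sample rows copied from the question.
rows = [(1, 1, "A"), (1, 2, "A"), (1, 3, "B"), (1, 4, "B"), (1, 5, "A"),
        (2, 2, "B"), (2, 4, "A"), (2, 6, "B"), (2, 8, "A")]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

query = """
SELECT name, timestamp, doing
FROM (
    SELECT name, timestamp, doing,
           LAG(doing) OVER (PARTITION BY name ORDER BY timestamp) AS prev_doing
    FROM t
) AS sub
WHERE prev_doing = 'A' AND doing = 'B'   -- first B right after an A
ORDER BY name, timestamp
"""
result = conn.execute(query).fetchall()
print(result)
```

The row (2, 2, 'B') is excluded because it is the first row for name = 2, so its previous value is NULL rather than 'A'.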

reclassify fields in one column of a table depending on criteria

I have the following table:
person drug
------ -----
1 Y
2 Y
2 other
3 X
4 X
5 X
5 other
6 other
7 Z
However, if a person has one of the drugs X, Y, or Z (it will only be one distinct choice) plus 'other', then I want to remove the row that contains 'other'.
This means that someone with an 'X' and 'other' would have the row containing 'other' removed, but anyone with only 'other' will stay as 'other', i.e.
person drug
------ -----
1 Y
2 Y
3 X
4 X
5 X
6 other
7 Z
where person 6 only has other, so stays that way, but persons 2 and 5 have the 'other' rows removed because they have other drug choices (x,y or z).
Thanks very much for any help.

It is unclear whether you want this removed in the results of a query or in the data itself. To return results without this row, the query can be written like this:
select t.*
from t
where not (t.drug = 'other' and
           exists (select 1 from t t2 where t2.person = t.person and t2.drug = 'X')
          )
To handle any of 'X', 'Y', or 'Z', change the last condition to t2.drug in ('X', 'Y', 'Z').
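Here is a small sketch of the NOT EXISTS query, with the three-drug variant applied, run against the question's sample table through Python's built-in sqlite3 module. SQLite is a stand-in for the unspecified database, and the drug names are written in upper case to match the sample data, since string comparison is case sensitive in SQLite by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (person INTEGER, drug TEXT)")

# Sample rows copied from the question.
rows = [(1, "Y"), (2, "Y"), (2, "other"), (3, "X"), (4, "X"),
        (5, "X"), (5, "other"), (6, "other"), (7, "Z")]
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
SELECT t.person, t.drug
FROM t
WHERE NOT (t.drug = 'other' AND
           EXISTS (SELECT 1 FROM t t2
                   WHERE t2.person = t.person
                     AND t2.drug IN ('X', 'Y', 'Z')))
ORDER BY t.person
"""
result = conn.execute(query).fetchall()
print(result)
```

Person 6 keeps the 'other' row because no X, Y, or Z row exists for that person.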

Multiple columns from a table into one, large column?

I don't know what in the world is the best way to go about this. I have a very large array of columns, each one with 1-25 rows associated with it. I need to be able to combine all into one large column, skipping blanks if at all possible. Is this something that Access can do?
a b c d e f g h
3 0 1 1 1 1 1 5
3 5 6 8 8 3 5
1 1 2 2 1 5
4 4 2 1 1 5
1 5
There are no blanks within each column, but each column has a different number of values in it. They need to be added from left to right: a, b, c, d, e, f, and so on. The 0 from b needs to be in the first blank cell after the second 3 in a, and the first 5 in h needs to be directly after the 1 in g, with no blanks.

So you want a result like:
3
3
0
5
1
4
1
6
1
4
etc?
Here is how I would approach the problem. Insert your array into a work table that has an autonumber column called id (important for retaining the order the data is in; databases do not guarantee an order unless you give them something to sort on) as well as the array columns.
Create a final table with an autonumber column (see the note above on why you need an autonumber) and the column you want in your final table.
Run a separate insert statement for each column in your work table, and run them in the order you want the data.
The inserts would look something like:
insert into table2 (colA)
select columnA from table1 order by id
insert into table2 (colA)
select columnB from table1 order by id
insert into table2 (colA)
select columnC from table1 order by id
Now when you do select colA from table2 order by id you should have the results you need.
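The work-table idea can be sketched with Python's built-in sqlite3 module. SQLite stands in for Access here, the three-column data is made up for illustration, and the `IS NOT NULL` filter is an addition (not part of the steps above) to skip the blank cells the question mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Work table: the autoincrement id preserves the original row order.
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY AUTOINCREMENT, "
             "a INTEGER, b INTEGER, c INTEGER)")
# Final table: one autoincrement id plus the single combined column.
conn.execute("CREATE TABLE table2 (id INTEGER PRIMARY KEY AUTOINCREMENT, colA INTEGER)")

# Hypothetical ragged data; None marks a blank cell.
conn.executemany("INSERT INTO table1 (a, b, c) VALUES (?, ?, ?)",
                 [(3, 0, 1), (3, 5, None), (1, None, None)])

# One insert per source column, run left to right.
# The column names come from this fixed tuple, not from user input.
for column in ("a", "b", "c"):
    conn.execute(
        f"INSERT INTO table2 (colA) "
        f"SELECT {column} FROM table1 WHERE {column} IS NOT NULL ORDER BY id"
    )

result = [row[0] for row in conn.execute("SELECT colA FROM table2 ORDER BY id")]
print(result)
```

All of column a comes first, then b, then c, with the blanks skipped.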
