Setting a minimum count before fetching data - SQL

I have a table of localities, each identified by a unique number. Each locality contains buildings with a status of Activated = 'Y' or 'N'. I want to pick the localities that have at least 15 buildings with Activated = 'Y'.
Sample data:
Locality ACTIVATED
1 Y
1 Y
1 N
1 N
1 N
1 N
2 Y
2 Y
2 Y
2 Y
2 Y
E.g.: with the sample data above, I need the localities that have a minimum of 5 'Y' values in the ACTIVATED column.

SELECT l.*
FROM Localities l
WHERE (SELECT COUNT(*)
       FROM Building b
       WHERE b.LocalityNumber = l.LocalityNumber
         AND b.Activated = 'Y') >= 15
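As a sanity check, here is a minimal pandas sketch of the same filter, run against the sample data above with the example threshold of 5 (the column names Locality and ACTIVATED are taken from the sample, not from the real table):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Locality':  [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'ACTIVATED': ['Y', 'Y', 'N', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y'],
})

# Count 'Y' rows per locality, keep localities with at least 5
counts = df[df['ACTIVATED'] == 'Y'].groupby('Locality').size()
print(counts[counts >= 5].index.tolist())  # [2]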


Trying to mark the first encounter in every group with pandas

I have a df with 5 columns; the relevant ones, group and type, are shown in the output below.
What I'm trying to do is mark the first customer interaction that follows a human message within each specific group.
The desired outcome is the First/Last marking shown in the output below.
What I have tried is shifting the type column so each row sits next to the previous row's type, to check whether a row is customer and the previous row is human. However, I can't figure out a grouping option to get the min index per group for each occurrence.
This works:
import pandas as pd

# For each group: index of the first and last 'customer' row preceded by a 'human' row
k = pd.DataFrame(df.groupby('group').apply(
        lambda g: (g['type'].eq('customer') & g['type'].shift(1).eq('human'))
                  .pipe(lambda x: [x.idxmax(), x[::-1].idxmax()])).tolist())
df['First'] = ''
df['Last'] = ''
df.loc[k[0], 'First'] = 'F'  # column 0 holds each group's first match
df.loc[k[1], 'Last'] = 'L'   # column 1 holds each group's last match
Output:
>>> df
group type First Last
0 x bot
1 x customer
2 x bot
3 x customer
4 x human
5 x customer F
6 x human
7 x customer L
8 y bot
9 y customer
10 y bot
11 y customer
12 y human
13 y customer F
14 y human
15 y customer L
16 z bot
17 z customer
18 z bot
19 z customer
20 z human
21 z customer F
22 z human
23 z customer L
24 z customer
25 z customer
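To make the answer reproducible, the df in the output above can be rebuilt like this (a sketch; only the group and type columns shown in the output are assumed, not the asker's full 5 columns):
import pandas as pd

pattern = ['bot', 'customer', 'bot', 'customer', 'human', 'customer', 'human', 'customer']
df = pd.DataFrame({
    'group': ['x'] * 8 + ['y'] * 8 + ['z'] * 10,
    'type':  pattern * 3 + ['customer', 'customer'],  # group z has two trailing customers
})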

Pandas - find a value in one column based on values from another column and place it in a different column

I have a df that looks like below:
ID Name Supervisor SupervisorID
1 X Y
2 Y C
3 Z Y
4 C Y
5 V X
What I need is to fill in SupervisorID. I can find a supervisor's ID by looking their name up in the Name column: if the Supervisor is Y, the row where Name is Y shows the ID is 2. The df should look like below:
ID Name Supervisor SupervisorID
1 X Y 2
2 Y C 4
3 Z Y 2
4 C Y 2
5 V X 1
Do you have any idea how to solve this?
Thanks for the help, and best regards.
Use Series.map with DataFrame.drop_duplicates to keep one row per Name, because the real data contains duplicates:
df['SupervisorID'] = df['Supervisor'].map(df.drop_duplicates('Name').set_index('Name')['ID'])
print(df)
ID Name Supervisor SupervisorID
0 1 X Y 2
1 2 Y C 4
2 3 Z Y 2
3 4 C Y 2
4 5 V X 1
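An equivalent self-join with DataFrame.merge (a sketch, shown only as an alternative to the map approach; with duplicate Names you would drop_duplicates on the lookup first, as in the answer):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'Name': list('XYZCV'),
                   'Supervisor': list('YCYYX')})

# Match each Supervisor against the Name column to pull in the matching ID
lookup = df[['Name', 'ID']].rename(columns={'Name': 'Supervisor', 'ID': 'SupervisorID'})
out = df.merge(lookup, on='Supervisor', how='left')
print(out)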

Conditional frequency of elements within lists in pandas data frame

I have a data frame in pandas like this:
STATUS FEATURES
A [x,y,z]
A [t, y]
B [x,p,t]
B [x,p]
I want to count the frequency of the elements in the lists of features conditional on the status.
The desired output would be:
STATUS FEATURES FREQUENCY
A x 1
A y 2
A z 1
A t 1
B x 2
B t 1
B p 2
Let us do explode, then groupby with size:
s = df.explode('FEATURES').groupby(['STATUS', 'FEATURES']).size().reset_index(name='FREQUENCY')
Use DataFrame.explode and SeriesGroupBy.value_counts:
new_df = (df.explode('FEATURES')
            .groupby('STATUS')['FEATURES']
            .value_counts()
            .reset_index(name='FREQUENCY'))
print(new_df)
Output
STATUS FEATURES FREQUENCY
0 A y 2
1 A t 1
2 A x 1
3 A z 1
4 B p 2
5 B x 2
6 B t 1
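For comparison, the same frequencies can be computed without explode by counting list elements per group with collections.Counter (a sketch, not part of the original answers):
from collections import Counter
import pandas as pd

df = pd.DataFrame({'STATUS': ['A', 'A', 'B', 'B'],
                   'FEATURES': [['x', 'y', 'z'], ['t', 'y'], ['x', 'p', 't'], ['x', 'p']]})

# Count the elements of all lists within each STATUS group
counts = {status: Counter(f for row in lists for f in row)
          for status, lists in df.groupby('STATUS')['FEATURES']}
out = pd.DataFrame([(s, f, n) for s, c in counts.items() for f, n in c.items()],
                   columns=['STATUS', 'FEATURES', 'FREQUENCY'])
print(out)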

SAS sum observations not in a group, by multiple groups

This post follows up on this one: SAS sum observations not in a group, by group
where my minimal example was sadly a bit too minimal, and I wasn't able to apply the solutions to my data.
Here is a more complete example. What I have is:
data have;
input group1 group2 group3 $ value;
datalines;
1 A X 2
1 A X 4
1 A Y 1
1 A Y 3
1 B Z 2
1 B Z 1
1 C Y 1
1 C Y 6
1 C Z 7
2 A Z 3
2 A Z 9
2 A Y 2
2 B X 8
2 B X 5
2 B X 5
2 B Z 7
2 C Y 2
2 C X 1
;
run;
For each observation, I want a new variable sum containing the sum of all values within the same subgroups (group1 and group2), except for the group (group3) the observation is in.
data want;
input group1 group2 group3 $ value sum;
datalines;
1 A X 2 8
1 A X 4 6
1 A Y 1 9
1 A Y 3 7
1 B Z 2 1
1 B Z 1 2
1 C Y 1 13
1 C Y 6 8
1 C Z 7 7
2 A Z 3 11
2 A Z 9 5
2 A Y 2 12
2 B X 8 17
2 B X 5 20
2 B X 5 20
2 B Z 7 18
2 C Y 2 1
2 C X 1 2
;
run;
My goal is to use either data steps or proc sql (this runs on around 30 million observations, and proc means and the like in SAS seemed slower than these on previous similar computations).
My issue with the solutions provided in the linked post is that they use the total value of the column, and I don't know how to change this to use the total within the subgroup.
Any ideas, please?
A SQL solution will join all data to an aggregating select:
proc sql;
  create table want as
  select have.group1, have.group2, have.group3, have.value
       , aggregate.sum - value as sum
  from have
  join
    (select group1, group2, sum(value) as sum
     from have
     group by group1, group2
    ) aggregate
  on  aggregate.group1 = have.group1
  and aggregate.group2 = have.group2
  ;
quit;
SQL can be slower than a hash solution, but SQL code is understood by more people than the SAS DATA step hash object (which can be faster than SQL).
data want2;
  if 0 then set have;                  * prep pdv;
  declare hash sums (suminc:'value');
  sums.defineKey('group1', 'group2');
  sums.defineDone();

  do while (not hash_loaded);
    set have end=hash_loaded;
    sums.ref();                        * adds value to the hash record's internal sum;
  end;

  do while (not last_have);
    set have end=last_have;
    sums.sum(sum:sum);                 * retrieve the group sum;
    sum = sum - value;                 * subtract this observation's value from the group sum;
    output;
  end;
  stop;
run;
The SAS documentation touches on SUMINC and has some examples.
The solution above subtracts only the observation's own value, which matches the want data. The question's stated rule is a different concept: for each row, compute the tier 2 (group1, group2) sum that excludes the tier 3 (group3) group the row is in.
A hash-based solution for that concept requires tracking both the two-level and the three-level sums:
data want2;
  if 0 then set have;                            * prep pdv;
  declare hash T2 (suminc:'value');              * hash for two (T)iers;
  T2.defineKey('group1', 'group2');              * one hash record per (group1, group2);
  T2.defineDone();
  declare hash T3 (suminc:'value');              * hash for three (T)iers;
  T3.defineKey('group1', 'group2', 'group3');    * one hash record per (group1, group2, group3);
  T3.defineDone();

  do while (not hash_loaded);
    set have end=hash_loaded;
    T2.ref();                                    * adds value to the hash record's internal sum;
    T3.ref();
  end;

  T2_cardinality = T2.num_items;
  T3_cardinality = T3.num_items;
  put 'NOTE: |T2| = ' T2_cardinality;
  put 'NOTE: |T3| = ' T3_cardinality;

  do while (not last_have);
    set have end=last_have;
    T2.sum(sum:t2_sum);                          * two-level sum for this row's (group1, group2);
    T3.sum(sum:t3_sum);                          * three-level sum for this row's (group1, group2, group3);
    sum = t2_sum - t3_sum;
    output;
  end;
  stop;
  drop t2_: t3_:;
run;
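For reference, the tier-2-minus-tier-3 logic is compact in pandas with groupby().transform('sum') (a sketch, independent of the SAS solutions above, using the have data from the question):
import pandas as pd

have = pd.DataFrame({
    'group1': [1] * 9 + [2] * 9,
    'group2': list('AAAABBCCC') + list('AAABBBBCC'),
    'group3': list('XXYYZZYYZ') + list('ZZYXXXZYX'),
    'value':  [2, 4, 1, 3, 2, 1, 1, 6, 7, 3, 9, 2, 8, 5, 5, 7, 2, 1],
})

t2 = have.groupby(['group1', 'group2'])['value'].transform('sum')
t3 = have.groupby(['group1', 'group2', 'group3'])['value'].transform('sum')
have['sum'] = t2 - t3  # tier 2 sum excluding this row's tier 3 group
# For the 'total minus own value' variant seen in the want data: have['sum'] = t2 - have['value']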

Hive/Impala query

Input:
Key  id  ind1  ind2
1    A   Y     N
1    B   N     N
1    C   Y     Y
2    A   N     N
2    B   Y     N
Output:
Key  ind1  ind2
1    Y     Y
2    Y     N
So basically, whenever an ind1..n column is 'Y' for the same key across different ids, the output for that key should be 'Y', else 'N'.
That is why for key 1 both indicators are 'Y', while for key 2 the indicators are 'Y' and 'N'.
You can use max() for this, since 'Y' sorts after 'N':
select key, max(ind1) as ind1, max(ind2) as ind2
from t
group by key;
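The same collapse, cross-checked in pandas against the sample input (a sketch):
import pandas as pd

df = pd.DataFrame({'Key': [1, 1, 1, 2, 2],
                   'id': list('ABCAB'),
                   'ind1': list('YNYNY'),
                   'ind2': list('NNYNN')})

# 'Y' > 'N' lexicographically, so max() per key mirrors the SQL
print(df.groupby('Key')[['ind1', 'ind2']].max())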