Wrong results with GROUP BY for distinct count - SQL

I have these two queries for calculating a distinct count from a table for a particular date range. In the first query I group by location, aRId (which is a rule) and date. In the second query I don't group by date.
I expected the same total distinct count in both results, but I get a total count of 6147 in the first result and 6359 in the second. What is wrong here? The only difference is the GROUP BY.
select
r.loc
,cast(r.date as DATE) as dateCol
,count(distinct r.dC) as dC_count
from table r
where r.date between '01-01-2018' and '06-02-2018'
and r.loc = 1
group by r.loc, r.aRId, cast(r.date as DATE)
select
r.loc
,count(distinct r.dC) as dC_count
from table r
where r.date between '01-01-2018' and '06-02-2018'
and r.loc = 1
group by r.loc, r.aRId
loc dateCol dC_count
1 2018-01-22 1
1 2018-03-09 2
1 2018-01-28 3
1 2018-01-05 1
1 2018-05-28 143
1 2018-02-17 1
1 2018-05-08 187
1 2018-05-31 146
1 2018-01-02 3
1 2018-02-14 1
1 2018-05-11 273
1 2018-01-14 1
1 2018-03-18 2
1 2018-02-03 1
1 2018-05-20 200
1 2018-05-14 230
1 2018-01-11 5
1 2018-01-31 1
1 2018-05-17 209
1 2018-01-20 2
1 2018-03-01 1
1 2018-01-03 3
1 2018-05-06 253
1 2018-05-26 187
1 2018-03-24 1
1 2018-02-09 1
1 2018-03-04 1
1 2018-05-03 269
1 2018-05-23 187
1 2018-05-29 133
1 2018-03-21 1
1 2018-03-27 1
1 2018-05-15 202
1 2018-03-07 1
1 2018-06-01 155
1 2018-02-21 1
1 2018-01-26 2
1 2018-02-15 2
1 2018-05-12 331
1 2018-03-10 1
1 2018-01-09 3
1 2018-02-18 1
1 2018-03-13 2
1 2018-05-09 184
1 2018-01-12 2
1 2018-03-16 1
1 2018-05-18 198
1 2018-02-07 1
1 2018-02-01 1
1 2018-01-15 3
1 2018-02-24 4
1 2018-03-19 1
1 2018-05-21 161
1 2018-02-10 1
1 2018-05-04 250
1 2018-05-30 148
1 2018-05-24 153
1 2018-01-24 1
1 2018-05-10 199
1 2018-03-08 1
1 2018-01-21 1
1 2018-05-27 151
1 2018-01-04 3
1 2018-05-07 236
1 2018-03-25 1
1 2018-03-11 2
1 2018-01-10 1
1 2018-01-30 1
1 2018-03-14 1
1 2018-02-19 1
1 2018-05-16 192
1 2018-01-13 5
1 2018-01-07 1
1 2018-03-17 3
1 2018-01-27 2
1 2018-02-22 1
1 2018-05-13 200
1 2018-02-08 2
1 2018-01-16 2
1 2018-03-03 1
1 2018-05-02 217
1 2018-05-22 163
1 2018-03-20 1
1 2018-02-05 2
1 2018-02-11 1
1 2018-01-19 2
1 2018-02-28 1
1 2018-05-05 332
1 2018-05-25 211
1 2018-03-23 1
1 2018-05-19 219
loc dC_count
1 6359

From "COUNT (Transact-SQL)"
COUNT(DISTINCT expression) evaluates expression for each row in a group, and returns the number of unique, nonnull values.
The DISTINCT is relative to the group, not to the whole table (or selected subset). I think this might be your misconception here.
To better understand what this means, take the following simplified example:
CREATE TABLE group_test
(a varchar(1),
 b varchar(1),
 c varchar(1));

INSERT INTO group_test (a, b, c)
VALUES ('a', 'r', 'x'),
       ('a', 's', 'x'),
       ('b', 'r', 'x'),
       ('b', 's', 'y');
If we GROUP BY a and select count(DISTINCT c)
SELECT a,
count(DISTINCT c) #
FROM group_test
GROUP BY a;
we get
a | #
----|----
a | 1
b | 2
As there is only c='x' for a='a', the distinct count is 1 for that group, but it is 2 for the other group, which has both 'x' and 'y' in c. The sum of the counts is 3 here.
Now if we GROUP BY a, b
SELECT a,
b,
count(DISTINCT c) #
FROM group_test
GROUP BY a,
b;
we get
a | b | #
----|----|----
a | r | 1
a | s | 1
b | r | 1
b | s | 1
We get 1 for every count here, as each value of c is the only one in its group. And all of a sudden the sum of the counts is 4.
And if we get the distinct count of c for the whole table
SELECT count(DISTINCT c) #
FROM group_test;
we get
#
----
2
which, being a single group, trivially totals 2.
The sum of the counts is different in each case, but each is correct nonetheless.
The more groups there are, the higher the chance for a value to be unique within its group, so your results are entirely plausible.
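To see the effect on your own data, here is a sketch (reusing the placeholder table and column names from your queries) that contrasts the two ways of totalling:

-- Sum of the per-group distinct counts: a dC that appears in several
-- (aRId, day) groups is counted once per group
select sum(dC_count) as summed_count
from (
    select r.aRId, cast(r.date as DATE) as dateCol, count(distinct r.dC) as dC_count
    from table r
    where r.date between '01-01-2018' and '06-02-2018' and r.loc = 1
    group by r.aRId, cast(r.date as DATE)
) g;

-- Distinct count over the whole range: each dC is counted exactly once
select count(distinct r.dC) as overall_count
from table r
where r.date between '01-01-2018' and '06-02-2018' and r.loc = 1;

Any dC value that occurs in more than one group inflates the first total but not the second, which is why the two totals cannot be compared directly.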
db<>fiddle

Related

Remove duplicates from left join

Table A:
MSISDN A B C
990 1 2 3
992 1 2 3
995 1 2 3
993 1 2 3
991 1 2 3
994 1 2 3
Table B:
MSISDN A B C
990 2 2 3
992 2 2 4
993 1 2 3
994 1 2 3
995 1 6 3
990 1 2 3
991 2 2 3
992 2 2 3
995 1 2 3
select msis1.msisdn, msis1.a, msis2.c from msis1 left join msis2 on msis1.msisdn = msis2.msisdn;
MSISDN A C
990 1 3
992 1 4
993 1 3
994 1 3
995 1 3
990 1 3
991 1 3
992 1 3
995 1 3
I want to modify the above query so that it does not return duplicate records.
Try SELECT DISTINCT; it returns only the unique rows of the result.
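Applied to the query from the question, it would look like this:

select distinct msis1.msisdn, msis1.a, msis2.c
from msis1
left join msis2 on msis1.msisdn = msis2.msisdn;

Note that DISTINCT applies to the whole select list, so rows are deduplicated only when every selected column matches; the repeated (990, 1, 3) and (995, 1, 3) rows in the output above each collapse into a single row.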

Groupby filter based on count, calculate duration, penultimate status

I have a dataframe as shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
21 3 M 2019-05-20 200
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
28 5 M 2018-10-10 200
29 5 F 2019-06-10 500
30 6 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
where
F = Failure
M = Maintenance
P = Planned
Step1 - Select the data for IDs that have at least two statuses (F, M, or P) before their last Failure.
Step2 - Ignore an ID's rows if its last row is not F. The expected output after this step is shown below.
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
Now, for each remaining ID, the last status is Failure.
Then, from the above df, I would like to prepare the DataFrame below:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
1 3 2 2 P 487 151
2 3 3 2 M 487 61
3 3 2 2 P 640 90
4 3 1 1 M 518 151
7 2 1 1 M 518 151
SLS = Second Last Status
LS = Last Status
I tried the following code to calculate the duration.
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
We can create a mask with groupby + bfill that allows us to perform both selections.
import numpy as np

# Rows up to and including each ID's last 'F' become True; rows after it stay NaN
m = df.Status.eq('F').replace(False, np.NaN).groupby(df.ID).bfill()
df = df.loc[m.groupby(df.ID).transform('sum').gt(2) & m]
ID Status Date Cost
0 1 F 2017-06-22 500
1 1 M 2017-07-22 100
2 1 P 2017-10-22 100
3 1 F 2018-06-22 600
4 1 M 2018-08-22 150
5 1 P 2018-10-22 120
6 1 F 2019-03-22 750
7 2 M 2017-06-29 200
8 2 F 2017-09-29 600
9 2 F 2018-01-29 500
10 2 M 2018-03-29 100
11 2 P 2018-08-29 100
12 2 M 2018-10-29 100
13 2 F 2018-12-29 500
14 3 M 2017-03-20 300
15 3 F 2018-06-20 700
16 3 P 2018-08-20 100
17 3 M 2018-10-20 250
18 3 F 2018-11-20 100
19 3 P 2018-12-20 100
20 3 F 2019-03-20 600
22 4 M 2017-08-10 800
23 4 F 2018-06-10 100
24 4 P 2018-08-10 120
25 4 F 2018-10-10 500
26 4 M 2019-01-10 200
27 4 F 2019-06-10 600
31 7 M 2017-08-10 800
32 7 F 2018-06-10 100
33 7 P 2018-08-10 120
34 7 M 2019-01-10 200
35 7 F 2019-06-10 600
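To see what the mask holds, here is a minimal sketch on a toy five-row frame (hypothetical data, not from the question): m is True up to and including each ID's last 'F' and NaN after it.

import numpy as np
import pandas as pd

toy = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                    'Status': ['M', 'F', 'P', 'M', 'F']})
# 'F' rows become True, everything else NaN; backfilling within each ID
# propagates True backwards from each 'F', so only rows after the last 'F'
# remain NaN.
m = toy.Status.eq('F').replace(False, np.NaN).groupby(toy.ID).bfill()
print(m.tolist())  # [True, True, nan, True, True]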
The second part is a bit more annoying. There's almost certainly a smarter way to do this, but here's the straightforward way:
s = df.Date.diff().dt.days
res = pd.concat(
    [df.groupby('ID').Status.value_counts().unstack().add_prefix('No_of_'),
     df.groupby('ID').Status.apply(lambda x: x.iloc[-2]).to_frame('SLS'),
     (s.where(s.gt(0)).groupby(df.ID).apply(lambda x: x.cumsum().iloc[-2])
        .to_frame('NoDays_to_SLS')),
     s.groupby(df.ID).apply(lambda x: x.iloc[-1]).to_frame('NoDays_SLS_to_LS')],
    axis=1)
Output:
No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_SLS_to_LS
ID
1 3 2 2 P 487.0 151.0
2 3 3 1 M 487.0 61.0
3 3 2 2 P 640.0 90.0
4 3 2 1 M 518.0 151.0
7 2 2 1 M 518.0 151.0
Here's my attempt (note: I am using pandas 0.25):
df = pd.read_clipboard()
df['Date'] = pd.to_datetime(df['Date'])
df_1 = df.groupby('ID', group_keys=False)\
         .apply(lambda x: x[(x['Status'] == 'F')[::-1].cumsum().astype(bool)])
df_2 = df_1[df_1.groupby('ID')['Status'].transform('count') > 2]
g = df_2.groupby('ID')
df_Counts = g['Status'].value_counts().unstack().add_prefix('No_of_')
df_SLS = g['Status'].agg(lambda x: x.iloc[-2]).rename('SLS')
df_dates = g['Date'].agg(NoDays_to_SLS=lambda x: x.iloc[-2] - x.iloc[0],
                         NoDays_to_SLS_LS=lambda x: x.iloc[-1] - x.iloc[-2])
pd.concat([df_Counts, df_SLS, df_dates], axis=1).reset_index()
Output:
ID No_of_F No_of_M No_of_P SLS NoDays_to_SLS NoDays_to_SLS_LS
0 1 3 2 2 P 487 days 151 days
1 2 3 3 1 M 487 days 61 days
2 3 3 2 2 P 640 days 90 days
3 4 3 2 1 M 518 days 151 days
4 7 2 2 1 M 518 days 151 days
There are some enhancements in 0.25 that this code uses, in particular named aggregation with several lambdas on the same column.
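For reference, a minimal sketch of that feature on hypothetical toy data; on versions before 0.25 the second lambda raises a SpecificationError:

import pandas as pd

toy = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'v': [1, 4, 2, 7]})
# Named aggregation with multiple lambdas on the same column (pandas >= 0.25)
out = toy.groupby('g')['v'].agg(spread=lambda x: x.max() - x.min(),
                                first=lambda x: x.iloc[0])
print(out)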

LAG / OVER / PARTITION / ORDER BY using conditions - SQL Server 2017

I have a table that looks like this:
Date AccountID Amount
2018-01-01 123 12
2018-01-06 123 150
2018-02-14 123 11
2018-05-06 123 16
2018-05-16 123 200
2018-06-01 123 18
2018-06-15 123 17
2018-06-18 123 110
2018-06-30 123 23
2018-07-01 123 45
2018-07-12 123 116
2018-07-18 123 60
This table has multiple dates and IDs, along with multiple Amounts. For each individual row, I want to grab the last Date where Amount was over a specific value for that specific AccountID. I have been trying to use LAG( Date, 1 ) in combination with several variations of CASE and OVER ( PARTITION BY AccountID ORDER BY Date ) statements, but I've had no luck. Ultimately, this is what I would like my SELECT statement to return.
Date AccountID Amount LastOverHundred
2018-01-01 123 12 NULL
2018-01-06 123 150 2018-01-06
2018-02-14 123 11 2018-01-06
2018-05-06 123 16 2018-01-06
2018-05-16 123 200 2018-05-16
2018-06-01 123 18 2018-05-16
2018-06-15 123 17 2018-05-16
2018-06-18 123 110 2018-06-18
2018-06-30 123 23 2018-06-18
2018-07-01 123 45 2018-06-18
2018-07-12 123 116 2018-07-12
2018-07-18 123 60 2018-07-12
Any help with this would be greatly appreciated.
Use a cumulative conditional max():
select t.*,
       max(case when amount > 100 then date end)
           over (partition by accountid order by date) as lastoverhundred
from t;
The CASE expression nulls out the date on rows at or below the threshold, and the running MAX over all rows up to the current one carries the latest qualifying date forward, since MAX ignores NULLs.
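A quick way to verify this against the sample data from the question (table name t assumed, as in the answer):

create table t (Date date, AccountID int, Amount int);

insert into t (Date, AccountID, Amount) values
('2018-01-01', 123, 12),  ('2018-01-06', 123, 150), ('2018-02-14', 123, 11),
('2018-05-06', 123, 16),  ('2018-05-16', 123, 200), ('2018-06-01', 123, 18),
('2018-06-15', 123, 17),  ('2018-06-18', 123, 110), ('2018-06-30', 123, 23),
('2018-07-01', 123, 45),  ('2018-07-12', 123, 116), ('2018-07-18', 123, 60);

select t.*,
       max(case when Amount > 100 then Date end)
           over (partition by AccountID order by Date) as LastOverHundred
from t;

This reproduces the LastOverHundred column shown in the expected output above.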

Groupby by different columns

I have a dataframe nf as follows:
StationID DateTime Channel Count
0 1 2017-10-01 00:00:00 1 1
1 1 2017-10-01 00:00:00 1 201
2 1 2017-10-01 00:00:00 1 8
3 1 2017-10-01 00:00:00 1 2
4 1 2017-10-01 00:00:00 1 0
5 1 2017-10-01 00:00:00 1 0
6 1 2017-10-01 00:00:00 1 0
7 1 2017-10-01 00:00:00 1 0
.......... and so on
I want to group the Count values by each hour, for each Channel and StationID.
Required output:
Station ID DateTime Channel Count
1 2017-10-01 00:00:00 1 232
1 2017-10-01 00:01:00 1 23
2 2017-10-01 00:00:00 1 244...
...... and so on
I think you need groupby with an aggregate sum; to bucket the datetimes by hour, add floor, which sets the minutes and seconds to 0:
print (df)
StationID DateTime Channel Count
0 1 2017-12-01 00:00:00 1 1
1 1 2017-12-01 00:00:00 1 201
2 1 2017-12-01 00:10:00 1 8
3 1 2017-12-01 10:00:00 1 2
4 1 2017-10-01 10:50:00 1 0
5 1 2017-10-01 10:20:00 1 5
6 1 2017-10-01 08:10:00 1 4
7 1 2017-10-01 08:00:00 1 1
df['DateTime'] = pd.to_datetime(df['DateTime'])
df1 = (df.groupby(['StationID', df['DateTime'].dt.floor('H'), 'Channel'])['Count']
.sum()
.reset_index()
)
print (df1)
StationID DateTime Channel Count
0 1 2017-10-01 08:00:00 1 5
1 1 2017-10-01 10:00:00 1 5
2 1 2017-12-01 00:00:00 1 210
3 1 2017-12-01 10:00:00 1 2
print (df['DateTime'].dt.floor('H'))
0 2017-12-01 00:00:00
1 2017-12-01 00:00:00
2 2017-12-01 00:00:00
3 2017-12-01 10:00:00
4 2017-10-01 10:00:00
5 2017-10-01 10:00:00
6 2017-10-01 08:00:00
7 2017-10-01 08:00:00
Name: DateTime, dtype: datetime64[ns]
But if the dates are not important, only the hours, use hour:
df2 = (df.groupby(['StationID', df['DateTime'].dt.hour, 'Channel'])['Count']
.sum()
.reset_index()
)
print (df2)
StationID DateTime Channel Count
0 1 0 1 210
1 1 8 1 5
2 1 10 1 7
Or you can use Grouper:
df.groupby([pd.Grouper(key='DateTime', freq='H'), 'Channel', 'StationID'])['Count'].sum()

Want SQL query for this scenario

tbl_employee
empid empname openingbal
2 jhon 400
3 smith 500
tbl_transection1
tid empid amount creditdebit date
1 2 100 1 2016-01-06 00:00:00.000
2 2 200 1 2016-01-08 00:00:00.000
3 2 100 2 2016-01-11 00:00:00.000
4 2 700 1 2016-01-15 00:00:00.000
5 3 100 1 2016-02-03 00:00:00.000
6 3 200 2 2016-02-06 00:00:00.000
7 3 400 1 2016-02-07 00:00:00.000
tbl_transection2
tid empid amount creditdebit date
1 2 100 1 2016-01-07 00:00:00.000
2 2 200 1 2016-01-08 00:00:00.000
3 2 100 2 2016-01-09 00:00:00.000
4 2 700 1 2016-01-14 00:00:00.000
5 3 100 1 2016-02-04 00:00:00.000
6 3 200 2 2016-02-05 00:00:00.000
7 3 400 1 2016-02-08 00:00:00.000
Here 1 stands for credit and 2 for debit.
I want output like this:
empid empname details debitamount creditamount balance Dr/Cr date
2 jhon opening Bal 400 Cr
2 jhon transection 1 100 500 Cr 2016-01-06 00:00:00.000
2 jhon transection 2 100 600 Cr 2016-01-07 00:00:00.000
2 jhon transection 1 200 800 Cr 2016-01-08 00:00:00.000
2 jhon transection 2 200 1000 Cr 2016-01-08 00:00:00.000
2 jhon transection 2 100 900 Dr 2016-01-09 00:00:00.000
2 jhon transection 1 100 800 Dr 2016-01-11 00:00:00.000
2 jhon transection 2 700 1500 Cr 2016-01-14 00:00:00.000
2 jhon transection 1 700 2200 Cr 2016-01-15 00:00:00.000
3 smith opening Bal 500 Cr
3 smith transection 1 100 600 Cr 2016-02-03 00:00:00.000
3 smith transection 2 100 700 Cr 2016-02-04 00:00:00.000
3 smith transection 2 200 500 Dr 2016-02-05 00:00:00.000
3 smith transection 1 200 300 Dr 2016-02-06 00:00:00.000
3 smith transection 1 400 700 Cr 2016-02-07 00:00:00.000
3 smith transection 2 400 1100 Cr 2016-02-08 00:00:00.000
You can do it with something like this:
select empid,
       sum(amount) over (partition by empid order by date) as balance,
       details
from (
    select empid,
           case creditdebit when 1 then amount else -amount end as amount,
           date,
           details
    from (
        select empid, openingbal as amount, 1 as creditdebit,
               '19000101' as date, 'opening Bal' as details
        from tbl_employee
        union all
        select empid, amount, creditdebit, date, 'transection 1'
        from tbl_transection1
        union all
        select empid, amount, creditdebit, date, 'transection 2'
        from tbl_transection2
    ) X
) Y
The innermost select gathers the data from the three tables, the next one turns each amount into +/- depending on credit/debit, and the outermost calculates the running balance.
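If you also need the separate debit/credit columns and the Dr/Cr flag from the expected output, here is a hedged sketch extending the same approach (the flag follows the sample output, where it reflects whether the row's own transaction was a credit or a debit; empname would need an extra join to tbl_employee, omitted here):

select empid,
       details,
       case creditdebit when 2 then amount end as debitamount,
       case creditdebit when 1 then amount end as creditamount,
       sum(case creditdebit when 1 then amount else -amount end)
           over (partition by empid order by date) as balance,
       case creditdebit when 1 then 'Cr' else 'Dr' end as [Dr/Cr],
       date
from (
    select empid, openingbal as amount, 1 as creditdebit, '19000101' as date, 'opening Bal' as details
    from tbl_employee
    union all
    select empid, amount, creditdebit, date, 'transection 1'
    from tbl_transection1
    union all
    select empid, amount, creditdebit, date, 'transection 2'
    from tbl_transection2
) X
order by empid, date;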
Example in SQL Fiddle