SELECT check the colum of the max row - sql

Here my row with my first select:
SELECT
user.id, analytic_youtube_demographic.age,
analytic_youtube_demographic.percent
FROM
`user`
INNER JOIN
analytic ON analytic.user_id = user.id
INNER JOIN
analytic_youtube_demographic ON analytic_youtube_demographic.analytic_id = analytic.id
Result:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |13-17| 19,6 |
| 1 |18-24| 38.4 |
| 1 |25-34| 22.5 |
| 1 |35-44| 11.5 |
| 1 |45-54| 5.3 |
| 1 |55-64| 1.6 |
| 1 |65+ | 1.2 |
| 2 |13-17| 10 |
| 2 |18-24| 10 |
| 2 |25-34| 25 |
| 2 |35-44| 5 |
| 2 |45-54| 25 |
| 2 |55-64| 5 |
| 1 |65+ | 20 |
---------------------------
The max value by user_id:
---------------------------
| id | Age | Percent |
|--------------------------
| 1 |18-24| 38.4 |
| 2 |45-54| 25 |
| 2 |25-34| 25 |
---------------------------
And I need to filter Age in ['25-34', '65+']
I must have at the end :
-----------
| id |
|----------
| 2 |
-----------
Thanks a lot for your help.
Have tried to use MAX(analytic_youtube_demographic.percent). But I don't know how to filter with the age too.
Thanks a lot for your help.

You can use the rank() function to identify the largest percentage values within each user's data set, and then a simple WHERE clause to get those entries that are both of the highest rank and belong to one of the specific demographics you're interested in. Since you can't use windowed functions like rank() in a WHERE clause, this is a two-step process with a subquery or a CTE. Something like this ought to do it:
-- Sample data from the question:
create table [user] (id bigint);
insert [user] values
(1), (2);
create table analytic (id bigint, [user_id] bigint);
insert analytic values
(1, 1), (2, 2);
create table analytic_youtube_demographic (analytic_id bigint, age varchar(32), [percent] decimal(5, 2));
insert analytic_youtube_demographic values
(1, '13-17', 19.6),
(1, '18-24', 38.4),
(1, '25-34', 22.5),
(1, '35-44', 11.5),
(1, '45-54', 5.3),
(1, '55-64', 1.6),
(1, '65+', 1.2),
(2, '13-17', 10),
(2, '18-24', 10),
(2, '25-34', 25),
(2, '35-44', 5),
(2, '45-54', 25),
(2, '55-64', 5),
(2, '65+', 20);
-- First, within the set of records for each user.id, use the rank() function to
-- identify the demographics with the highest percentage.
with RankedDataCTE as
(
select
[user].id,
youtube.age,
youtube.[percent],
[rank] = rank() over (partition by [user].id order by youtube.[percent] desc)
from
[user]
inner join analytic on analytic.[user_id] = [user].id
inner join analytic_youtube_demographic youtube on youtube.analytic_id = analytic.id
)
-- Now select only those records that are (a) of the highest rank within their
-- user.id and (b) either the '25-34' or the '65+' age group.
select
id,
age,
[percent]
from
RankedDataCTE
where
[rank] = 1 and
age in ('25-34', '65+');

Related

SQL Joins with NOT IN displays incorrect data

I have 3 tables as below, and I need data where Expense.Expense_Code Should not be availalbe in Income.Income_Code.
Table: Base
+----+-----------+----------------+
| ID | Reference | Reference_Name |
+----+-----------+----------------+
| 1 | 10000 | AAAA |
| 2 | 10001 | BBBB |
| 3 | 10002 | CCCC |
+----+-----------+----------------+
Table: Expense
+-----+---------+--------------+----------------+
| EID | BASE_ID | Expense_Code | Expense_Amount |
+-----+---------+--------------+----------------+
| 1 | 1 | I0001 | 25 |
| 2 | 1 | I0002 | 50 |
| 3 | 2 | I0003 | 75 |
+-----+---------+--------------+----------------+
Table: Income
+------+---------+-------------+------------+
| I_ID | BASE_ID | Income_Code | Income_Amt |
+------+---------+-------------+------------+
| 1 | 1 | I0001 | 10 |
| 2 | 1 | I0002 | 20 |
| 3 | 1 | I0003 | 30 |
+------+---------+-------------+------------+
SELECT DISTINCT Base.Reference,Expense.Expense_Code
FROM Base
JOIN Expense ON Base.ID = Expense.BASE_ID
JOIN Income ON Base.ID = Income.BASE_ID
WHERE Expense.Expense_Code IN ('I0001','I0002')
AND Income.Income _CODE NOT IN ('I0001','I0002')
I expect no data be retured.
However I am getting the result as below:
+-----------+--------------+
| REFERENCE | Expense_Code |
+-----------+--------------+
| 10000 | I0001 |
| 10000 | I0002 |
+-----------+--------------+
For Base.Reference (10000), Expense.Expense_Code='I0001','I0002' the same expense_code is availalbe in Income table therefore I should not get any data.
Am I trying to do something wrong with the joins.
Thanks in advance for your help!
You are not joining EXPENSE and INCOME tables in your query at all. There needs to be a condition to join these tables in order to get desired result. You can also use NOT EXISTS clause. Prefer using NOT EXISTS over NOT IN as it performs better in case there are NULLS allowed in the columns that you're joining on.
SELECT * FROM BASE B
JOIN EXPENSE E ON B.ID=E.BASE_ID
WHERE E.EXPENSE_CODE NOT EXISTS (SELECT I.INCOME_CODE FROM INCOME I WHERE I.I_ID=E.EID)
When the first join is performed, you end with two lines possessing the ID 1, because the relationship between the tables is not 1o1, hence every line of the first table will have joined to it a line coming from the second table. Like so:
Output of the first join statement
Then, when the second part of your statement is executed, the DBMS finds two ID's 1 from the first joined table(BASE+EXPENSE) and 3 from the third table(INCOME).
Again since it's non a 1o1 relationship between tables, every row from the first joined table will have a joined line coming from the second table, like so: Output of the second join statement
Finally, when it reads your where clause and outputs what you see. I highlighted the excluded rows from the where clause
Output of where statement
...I need data where Expense.Expense_Code Should not be availalbe in Income.Income_Code
The following query will retrieve this data:
select b.*, e.*
from base b
join expense e on e.base_id = b.id
left join income i on i.base_id = e.base_id
and e.expense_code = i.income_code
where i.i_id is null
For reference the data script (slightly modified) is:
create table base (
id number(6),
reference number(6),
reference_name varchar2(10)
);
insert into base (id, reference, reference_name) values (1, 10000, 'AAAA');
insert into base (id, reference, reference_name) values (2, 10001, 'BBBB');
insert into base (id, reference, reference_name) values (3, 10002, 'CCCC');
create table expense (
eid number(6),
base_id number(6),
expense_code varchar2(10),
expense_amount number(6)
);
insert into expense (eid, base_id, expense_code, expense_amount) values (1, 1, 'I0001', 25);
insert into expense (eid, base_id, expense_code, expense_amount) values (2, 1, 'I0002', 50);
insert into expense (eid, base_id, expense_code, expense_amount) values (3, 1, 'I0003', 75);
insert into expense (eid, base_id, expense_code, expense_amount) values (4, 2, 'I0004', 101);
create table income (
i_id number(6),
base_id number(6),
income_code varchar2(10),
income_amt number(6)
);
insert into income (i_id, base_id, income_code, income_amt) values (1, 1, 'I0001', 10);
insert into income (i_id, base_id, income_code, income_amt) values (2, 1, 'I0002', 20);
insert into income (i_id, base_id, income_code, income_amt) values (3, 1, 'I0003', 30);
Result:
ID REFERENCE REFERENCE_NAME EID BASE_ID EXPENSE_CODE EXPENSE_AMOUNT
-- --------- -------------- --- ------- ------------ --------------
2 10,001 BBBB 4 2 I0004 101

Get records having the same value in 2 columns but a different value in a 3rd column

I am having trouble writing a query that will return all records where 2 columns have the same value but a different value in a 3rd column. I am looking for the records where the Item_Type and Location_ID are the same, but the Sub_Location_ID is different.
The table looks like this:
+---------+-----------+-------------+-----------------+
| Item_ID | Item_Type | Location_ID | Sub_Location_ID |
+---------+-----------+-------------+-----------------+
| 1 | 00001 | 20 | 78 |
| 2 | 00001 | 110 | 124 |
| 3 | 00001 | 110 | 124 |
| 4 | 00002 | 3 | 18 |
| 5 | 00002 | 3 | 25 |
+---------+-----------+-------------+-----------------+
The result I am trying to get would look like this:
+---------+-----------+-------------+-----------------+
| Item_ID | Item_Type | Location_ID | Sub_Location_ID |
+---------+-----------+-------------+-----------------+
| 4 | 00002 | 3 | 18 |
| 5 | 00002 | 3 | 25 |
+---------+-----------+-------------+-----------------+
I have been trying to use the following query:
SELECT *
FROM Table1
WHERE Item_Type IN (
SELECT Item_Type
FROM Table1
GROUP BY Item_Type
HAVING COUNT (DISTINCT Sub_Location_ID) > 1
)
But it returns all records with the same Item_Type and a different Sub_Location_ID, not all records with the same Item_Type AND Location_ID but a different Sub_Location_ID.
This should do the trick...
-- some test data...
IF OBJECT_ID('tempdb..#TestData', 'U') IS NOT NULL
BEGIN DROP TABLE #TestData; END;
CREATE TABLE #TestData (
Item_ID INT NOT NULL PRIMARY KEY,
Item_Type CHAR(5) NOT NULL,
Location_ID INT NOT NULL,
Sub_Location_ID INT NOT NULL
);
INSERT #TestData (Item_ID, Item_Type, Location_ID, Sub_Location_ID) VALUES
(1, '00001', 20, 78),
(2, '00001', 110, 124),
(3, '00001', 110, 124),
(4, '00002', 3, 18),
(5, '00002', 3, 25);
-- adding a covering index will eliminate the sort operation...
CREATE NONCLUSTERED INDEX ix_indexname ON #TestData (Item_Type, Location_ID, Sub_Location_ID, Item_ID);
-- the actual solution...
WITH
cte_count_group AS (
SELECT
td.Item_ID,
td.Item_Type,
td.Location_ID,
td.Sub_Location_ID,
cnt_grp_2 = COUNT(1) OVER (PARTITION BY td.Item_Type, td.Location_ID),
cnt_grp_3 = COUNT(1) OVER (PARTITION BY td.Item_Type, td.Location_ID, td.Sub_Location_ID)
FROM
#TestData td
)
SELECT
cg.Item_ID,
cg.Item_Type,
cg.Location_ID,
cg.Sub_Location_ID
FROM
cte_count_group cg
WHERE
cg.cnt_grp_2 > 1
AND cg.cnt_grp_3 < cg.cnt_grp_2;
You can use exists :
select t.*
from table t
where exists (select 1
from table t1
where t.Item_Type = t1.Item_Type and
t.Location_ID = t1.Location_ID and
t.Sub_Location_ID <> t1.Sub_Location_ID
);
Sql server has no vector IN so you can emulate it with a little trick. Assuming '#' is illegal char for Item_Type
SELECT *
FROM Table1
WHERE Item_Type+'#'+Cast(Location_ID as varchar(20)) IN (
SELECT Item_Type+'#'+Cast(Location_ID as varchar(20))
FROM Table1
GROUP BY Item_Type, Location_ID
HAVING COUNT (DISTINCT Sub_Location_ID) > 1
);
The downsize is the expression in WHERE is non-sargable
I think you can use exists:
select t1.*
from table1 t1
where exists (select 1
from table1 tt1
where tt1.Item_Type = t1.Item_Type and
tt1.Location_ID = t1.Location_ID and
tt1.Sub_Location_ID <> t1.Sub_Location_ID
);

Selecting data from 3 tables, grouping by latest date value and another value

Okay, so I've been racking my brains on this for a while, and I think it's time to ask the collective!
I'm using SQLServer and I've got 3 tables, defined as such:
VolumeData
__________________________
| dataid | currentReading|
--------------------------
| 1 | 22 |
| 7 | 33 |
| 9 | 25 |
| 12 | 12 |
--------------------------
LatestData
________________________________________________________________
| dataid | unitNumber | unitLocation | dateTimeStamp |
----------------------------------------------------------------
| 1 | 2344454 | 2 | 2017-07-10 13:16:29.000 |
| 7 | 2344451 | 44 | 2017-07-10 13:22:29.000 |
| 9 | 2344456 | 92 | 2017-07-10 12:16:29.000 |
| 12 | 2344456 | 12 | 2017-07-10 12:13:23.000 |
----------------------------------------------------------------
unitData
____________________________________________________________________________________
| unitNumber | unitLocation | buildingNumber | officeNumber | officeName | country |
------------------------------------------------------------------------------------
| 2344454 | 2 | 44 | 1 | Telford | UK |
| 2344451 | 44 | 22 | 1 | Telford | UK |
| 2344456 | 92 | 12 | 2 | Hamburg | GER |
| 2344456 | 12 | 33 | 2 | Hamburg | GER |
------------------------------------------------------------------------------------
I need to retrieve just the latest currentReading (based on the dateTimeStamp field in LatestData) along with the following fields, grouped on the unitNumber:
currentReading, unitNumber, officeName, country, buildingNumber
One more thing to note is that records can arrive in any order.
The following is one example that I tried, I've tried many more but I've not kept them open unfortunately:
SELECT
a.currentReading
,MAX(b.dateTimeStamp)
,c.unitNumber
,c.country
,c.officeName
FROM [VolumeData] a INNER JOIN LatestData b ON a.dataid = b.dataid INNER JOIN
unitData c ON c.[unitNumber] = b.[unitNumber] AND c.[unitLocation] = b.[unitLocation];
This results in: Column 'VolumeData.currentReading' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Any advice would be much appreciated! Everything I try either results in retrieving far too many rows or results in logical SQL errors. I should also add that these tables contain millions of rows, and grow daily, so I'm looking for a really efficient way to do this.
Thanks!
You can use ROW_NUMBER() to order the date. Then you just take the first one, which correspond to the latest date.
SELECT *
FROM (
SELECT a.currentReading
, b.dateTimeStamp
, c.unitNumber
, c.country
, c.officeName
, ROW_NUMBER() OVER (PARTITION BY c.unitNumber ORDER BY b.dateTimeStamp DESC) AS rowNum
FROM [VolumeData] a
INNER JOIN LatestData b ON a.dataid = b.dataid
INNER JOIN unitData c ON c.[unitNumber] = b.[unitNumber] AND c.[unitLocation] = b.[unitLocation]
) a
WHERE rowNum = 1
Same logic as Eric's answer, probably a bit cleaner using CTE and joins lesser records.
DECLARE #VolumeData TABLE
(
dataid int,
currentReading int
);
INSERT INTO #VolumeData VALUES(1, 22);
INSERT INTO #VolumeData VALUES(7, 33);
INSERT INTO #VolumeData VALUES(9, 25);
INSERT INTO #VolumeData VALUES(12,12);
DECLARE #LatestData TABLE
(
dataid int,
unitNumber int,
unitLocation int,
dateTimeStamp datetime
);
INSERT INTO #LatestData VALUES(1, 2344454, 2, '2017-07-10 13:16:29.000');
INSERT INTO #LatestData VALUES(7, 2344451, 44, '2017-07-10 13:22:29.000');
INSERT INTO #LatestData VALUES(9, 2344456, 92, '2017-07-10 12:16:29.000');
INSERT INTO #LatestData VALUES(12, 2344456, 12, '2017-07-10 12:13:23.000');
DECLARE #UnitData TABLE
(
unitNumber int,
unitLocation int,
buildingNumber int,
officeNumber int,
officeName varchar(50),
country varchar(50)
);
INSERT INTO #UnitData VALUES(2344454, 2, 44, 1, 'Telford', 'UK');
INSERT INTO #UnitData VALUES(2344451, 44, 22, 1, 'Telford', 'UK');
INSERT INTO #UnitData VALUES(2344456, 92, 12, 2, 'Hamburg', 'GER');
INSERT INTO #UnitData VALUES(2344456, 12, 33, 2, 'Hamburg', 'GER');
WITH LatestData_CTE (dataid, unitNumber, unitLocation, dateTimeStamp, rowNum)
AS
(
SELECT dataid
, unitNumber
, unitLocation
, dateTimeStamp
, ROW_NUMBER() OVER (PARTITION BY unitNumber ORDER BY dateTimeStamp DESC) AS rowNum
FROM #LatestData
)
SELECT currentReading, l.unitNumber, officeName, country, buildingNumber
FROM LatestData_CTE l
INNER JOIN #VolumeData v ON v.dataid = l.dataid
INNER JOIN #UnitData u ON u.[unitNumber] = l.[unitNumber] AND u.[unitLocation] = l.[unitLocation]
WHERE l.rowNum = 1
Not complete code, but an advice - It can be implemented by ROW_NUMBER function in CTE
Similar to
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/597b876e-eb00-4013-a613-97c377408668/rownumber-and-cte?forum=transactsql
http://datachix.com/2010/02/10/use-a-common-table-expression-and-the-row_number-function-to-eliminate-duplicate-rows-3/
Just google CTE+ROW_NUMBER to get more examples.
So in CTE you join all required tables and you apply ROW_NUMBER over partition, ordered by dateTimestamp (DESC) and then you use WHERE CTE_name.Rank = 1 in the query that uses that CTE.

Average of Days between ordered dates per group

+-------+-------+-----------+
| EmpID | PerID | VisitDate |
+-------+-------+-----------+
| 1 | 22 | 2/24/2017 |
| 1 | 22 | 3/25/2017 |
| 1 | 22 | 4/5/2017 |
| 2 | 33 | 5/6/2017 |
| 2 | 33 | 8/9/2017 |
| 2 | 33 | 6/7/2017 |
+-------+-------+-----------+
I am trying to find the latest visit date and average days between visits per EmpID. For Avg, I'll first have to order the days consecutively and then find the average.
Eg: Avg. days for EmpID=1 and PerID=22 would be [29(Days between 3/25 and 2/24) + 11 (Days between 3/25 and 4/5)/2] = 20 Days.
Desired Output:
+-------+-------+----------+----------+
| EmpID | PerID | MaxVDate | AvgVDays |
+-------+-------+----------+----------+
| 1 | 22 | 4/5/2017 | 20 |
| 2 | 33 | 8/9/2017 | 47.5 |
+-------+-------+----------+----------+
Attempt:
SELECT
EmpID
,PerID
,MAX(VisitDate) AS MaxVDate
,--Dunno how to find average AS AvgVDays
FROM
T1
GROUP BY
EmpID
,PerID
You can use lag to get the previous date and compute the date difference. Then use avg window function to get the average days.
Select distinct empid,perid,maxVdate,avg(diff_with_prev) OVER(Partition by empid) as avgVDays
from (
SELECT EmpID,PerID
,MAX(VisitDate) OVER(Partition BY EmpID) AS MaxVDate
,DATEDIFF(DAY,LAG(VisitDate) OVER(Partition BY EmpID order by VisitDate), VisitDate) as diff_with_prev
FROM T1
) t
Here's an option...
IF OBJECT_ID('tempdb..#TestData', 'U') IS NOT NULL
DROP TABLE #TestData;
CREATE TABLE #TestData (
EmpID INT NOT NULL,
PerID INT NOT NULL,
VisitDate DATE NOT NULL
);
INSERT #TestData (EmpID, PerID, VisitDate) VALUES
(1, 22, '2/24/2017'),
(1, 22, '3/25/2017'),
(1, 22, '4/5/2017'),
(2, 33, '5/6/2017'),
(2, 33, '8/9/2017'),
(2, 33, '6/7/2017');
-- SELECT * FROM #TestData td;
SELECT
db.EmpID,
db.PerID,
AvgDays = AVG(db.DaysBetween * 1.0)
FROM (
SELECT
*,
DaysBetween = DATEDIFF(dd, LAG(td.VisitDate, 1) OVER (PARTITION BY td.EmpID, td.PerID ORDER BY td.VisitDate), td.VisitDate)
FROM
#TestData td
) db
GROUP BY
db.EmpID,
db.PerID;
Results...
EmpID PerID AvgDays
----------- ----------- ---------------------------------------
1 22 20.000000
2 33 47.500000
The task is much easier than you think. You get the average with (last visit - first visit) / (count visits - 1).
select
empid,
perid,
max(VisitDate) as MaxVDate,
datediff(day, min(VisitDate), max(VisitDate)) * 1.0 / (count(*) - 1) as avgvdays
from mytable
group by empid, perid
having count(*) > 1
order by empid, perid;
The multiplication with 1.0 is necessary in order to avoid integer division. (You could also cast to decimal instead.)
As the calcualtion only makes sense for empid/perid pairs with more than one entry (and in order to avoid division by zero), I have applied an according HAVING clause.
Here is a test: http://rextester.com/AIFPA62612

Get id of max value in group

I have a table and i would like to gather the id of the items from each group with the max value on a column but i have a problem.
SELECT group_id, MAX(time)
FROM mytable
GROUP BY group_id
This way i get the correct rows but i need the id:
SELECT id,group_id,MAX(time)
FROM mytable
GROUP BY id,group_id
This way i got all the rows. How could i achieve to get the ID of max value row for time from each group?
Sample Data
id = 1, group_id = 1, time = 2014.01.03
id = 2, group_id = 1, time = 2014.01.04
id = 3, group_id = 2, time = 2014.01.04
id = 4, group_id = 2, time = 2014.01.02
id = 5, group_id = 3, time = 2014.01.01
and from that i should get id: 2,3,5
Thanks!
Use your working query as a sub-query, like this:
SELECT `id`
FROM `mytable`
WHERE (`group_id`, `time`) IN (
SELECT `group_id`, MAX(`time`) as `time`
FROM `mytable`
GROUP BY `group_id`
)
Have a look at the below demo
DROP TABLE IF EXISTS mytable;
CREATE TABLE mytable(id INT , group_id INT , time_st DATE);
INSERT INTO mytable VALUES(1, 1, '2014-01-03'),(2, 1, '2014-01-04'),(3, 2, '2014-01-04'),(4, 2, '2014-01-02'),(5, 3, '2014-01-01');
/** Check all data **/
SELECT * FROM mytable;
+------+----------+------------+
| id | group_id | time_st |
+------+----------+------------+
| 1 | 1 | 2014-01-03 |
| 2 | 1 | 2014-01-04 |
| 3 | 2 | 2014-01-04 |
| 4 | 2 | 2014-01-02 |
| 5 | 3 | 2014-01-01 |
+------+----------+------------+
/** Query for Actual output**/
SELECT
id
FROM
mytable
JOIN
(
SELECT group_id, MAX(time_st) as max_time
FROM mytable GROUP BY group_id
) max_time_table
ON mytable.group_id = max_time_table.group_id AND mytable.time_st = max_time_table.max_time;
+------+
| id |
+------+
| 2 |
| 3 |
| 5 |
+------+
When multiple groups may contain the same value, you could use
SELECT subq.id
FROM (SELECT id,
value,
MAX(time) OVER (PARTITION BY group_id) as max_time
FROM mytable) as subq
WHERE subq.time = subq.max_time
The subquery here generates a new column (max_time) that contains the maximum time per group. We can then filter on time and max_time being identical. Note that this still returns multiple rows per group if the maximum value occurs multiple time within the same group.
Full example:
CREATE TABLE test (
id INT,
group_id INT,
value INT
);
INSERT INTO test (id, group_id, value) VALUES (1, 1, 100);
INSERT INTO test (id, group_id, value) VALUES (2, 1, 200);
INSERT INTO test (id, group_id, value) VALUES (3, 1, 300);
INSERT INTO test (id, group_id, value) VALUES (4, 2, 100);
INSERT INTO test (id, group_id, value) VALUES (5, 2, 300);
INSERT INTO test (id, group_id, value) VALUES (6, 2, 200);
INSERT INTO test (id, group_id, value) VALUES (7, 3, 300);
INSERT INTO test (id, group_id, value) VALUES (8, 3, 200);
INSERT INTO test (id, group_id, value) VALUES (9, 3, 100);
select * from test;
id | group_id | value
----+----------+-------
1 | 1 | 100
2 | 1 | 200
3 | 1 | 300
4 | 2 | 100
5 | 2 | 300
6 | 2 | 200
7 | 3 | 300
8 | 3 | 200
9 | 3 | 100
(9 rows)
SELECT subq.id
FROM (SELECT id,
value,
MAX(value) OVER (partition by group_id) as max_value
FROM test) as subq
WHERE subq.value = subq.max_value;
id
----
3
5
7
(3 rows)