Selecting Active roles from denormalized table with duplicates - sql

I have a bit of a garbage table that I need to extract data from.
Name | Person# | Assignment_Status | Group
--------------------------------------------------
Smith, John | 1234567 | NLE | G1
Smith, John | 1234567 | Active | G2
Jones, Jane | 7654321 | Active | G1
James, Jack | 9876541 | LOA | G3
Peep, Laura | 6549871 | ServiceLOA | G1
Some, One | 3219875 | NLE | G2
Every time a person moves groups, their current Assignment_Status gets set to NLE and a new record gets created with Assignment_Status set to Active for the new group. When a person leaves the company, their Assignment_Status is also set to NLE. This table has neither a unique row ID nor a date stamp.
I need a query that reduces the table to 1 record per employee and if the employee has multiple records I need the Assignment_Status that is not NLE. For example, John Smith should show as active for G2.
My first attempt was:
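SELECT *
INTO #TempAssignments
FROM
(SELECT
-- the alias aID cannot be referenced in the same SELECT list, so the CASE is repeated
ROW_NUMBER() OVER (ORDER BY CASE WHEN Assignment_Status='NLE' THEN 1 ELSE 0 END) AS ID,
Name,
Person#,
(CASE WHEN Assignment_Status='NLE' THEN 1 ELSE 0 END) AS aID,
[Group]
FROM
tblAssignments) AS a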
With the data in a temp table, I then created a second query that selects the MIN of ID and the MIN of aID, grouped by Name and Person#, and joined that back to the temp table to get the Group for the given ID.
This seems to work. However, this solution needs to be deployed in multiple reports, so I was wondering whether there is a more compact way of doing this.

The following query:
SELECT Name, Person#, [Group]
FROM (
SELECT Name, Person#, [Group],
ROW_NUMBER() OVER (PARTITION BY Person#, Name
ORDER BY CASE
WHEN Assignment_Status <> 'NLE' THEN 0
ELSE 1
END) AS rn
FROM tblAssignments ) t
WHERE t.rn = 1
will select one record for each employee, as identified by a Person#, Name value pair. If the employee has multiple records, then a record with Assignment_Status that is not NLE will be selected.
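To see the conditional ordering at work on the sample rows from the question, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) is only a stand-in for the real server, and the column names person_no and grp are assumptions made for the demo in place of Person# and [Group]. Assignment_Status is added to the output so the chosen row is visible.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblAssignments "
    "(name TEXT, person_no INTEGER, assignment_status TEXT, grp TEXT)"
)
conn.executemany(
    "INSERT INTO tblAssignments VALUES (?, ?, ?, ?)",
    [
        ("Smith, John", 1234567, "NLE", "G1"),
        ("Smith, John", 1234567, "Active", "G2"),
        ("Jones, Jane", 7654321, "Active", "G1"),
        ("James, Jack", 9876541, "LOA", "G3"),
        ("Peep, Laura", 6549871, "ServiceLOA", "G1"),
        ("Some, One", 3219875, "NLE", "G2"),
    ],
)

# Non-NLE rows sort first inside each person's partition, so rn = 1 is a
# non-NLE row whenever the person has one.
query = """
SELECT name, person_no, assignment_status, grp
FROM (
    SELECT name, person_no, assignment_status, grp,
           ROW_NUMBER() OVER (
               PARTITION BY person_no, name
               ORDER BY CASE WHEN assignment_status <> 'NLE' THEN 0 ELSE 1 END
           ) AS rn
    FROM tblAssignments
) t
WHERE rn = 1
ORDER BY name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

John Smith comes back once, as Active in G2, while an employee whose only record is NLE (Some, One) is still returned with that record.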

Related

Show column values as comma separated in grafana

I have a table with 2 columns
organization_id | user_name
1 | abc
1 | xyz
2 | bhi
2 | ipq
2 | sko
3 | ask
...
Each organization could have any number of users ranging from 1 to 100, 2000 and so on.
I wanted to show them in grafana in a table as following:
organization_id | user_name
1 | abc, xyz
2 | bhi, ipq, sko
3 | ask
Since there could be many users, I want to show any 10 users belonging to the same organization.
The database is TimescaleDB; the table is also a time-series table recording when each user was registered.
If I understand correctly that you want 10 users per organisation, you can use the query below.
I have added a GROUP BY in the CTE to avoid returning duplicate user_names.
In the test schema there are duplicate values of 'pqr' for organisation 2, but this user_name is returned only once, even though organisation 2 has fewer than 10 user_names.
With topTen as
(Select
Organisation_id,
User_name,
Rank() over (
partition by organisation_id
order by user_name) rn
From table_name
group by
Organisation_id,
user_name)
Select
Organisation_id,
String_agg(user_name,',') users
From topTen
Where rn <= 10
group by Organisation_id;
organisation_id | users
--------------: | :--------------------------------------
1 | abc,abk,def,ghi,jkl,mno,pqr,rst,ruk,stu
2 | abk,pqr,rst,ruk,stu,vwx
Another alternative which may be useful: if you remove the WHERE clause and put the following after From topTen, you will get all the distinct user_names, 10 per row. Because rn starts at 1, subtract 1 before dividing so that the first row also holds 10 names.
group by Organisation_id,(rn-1)/10
order by Organisation_id,(rn-1)/10;
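A quick way to check how a 1-based rank falls into rows of 10 is to redo the integer division outside the database. The sketch below is plain Python with hypothetical ranks 1 to 25 for a single organisation. It compares grouping by rn/10 with grouping by (rn-1)/10, using floor division to mimic the truncating integer division PostgreSQL applies to integer operands.

```python
from collections import defaultdict

# Hypothetical ranks 1..25, as RANK() would number 25 distinct user names
# inside one organisation.
ranks = range(1, 26)


def bucket_sizes(bucket_of):
    """Count how many ranks land in each bucket, in bucket order."""
    sizes = defaultdict(int)
    for rn in ranks:
        sizes[bucket_of(rn)] += 1
    return [sizes[b] for b in sorted(sizes)]


plain = bucket_sizes(lambda rn: rn // 10)          # GROUP BY rn/10
shifted = bucket_sizes(lambda rn: (rn - 1) // 10)  # GROUP BY (rn-1)/10

print(plain)    # [9, 10, 6]
print(shifted)  # [10, 10, 5]
```

With rn/10 the first row holds only ranks 1 to 9, while (rn-1)/10 gives full rows of 10 until the names run out.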

find duplicated record by first and last name

I have a table called beneficials. Some facts about it:
A beneficial belongs to one organization
An organization has many beneficials
Beneficials have first and last names and no other identification form.
Some sample data from the table
| id | firstname | lastname | organization_id |
|----|-----------|----------|-----------------|
| 1 | jan | kowalski | 1 |
| 2 | jan | kovalski | 3 |
| 3 | john | doe | 1 |
| 4 | jan | kowalski | 2 |
I want to find out whether a beneficial from an organization is also present in other organizations, matching on first and last name, and if so, I want to get the IDs of those organizations.
In the sample data above, given organization id 1, the query should return 2, because jan kowalski is also a beneficial of organization 2. It should not return 3: jan kovalski matches on first name but not on last name.
I came up with the following query:
with org_beneficials as (
select firstname, lastname from beneficials where organization_id = ? and deleted_at is null
)
select organization_id from beneficials
where firstname in (select firstname from org_beneficials)
and lastname in (select lastname from org_beneficials)
and deleted_at is null
and organization_id <> ?;
It kind of works, but it returns false positives when beneficials from different organizations share the same first name or the same last name. I need to match both first and last names on the same row and I can't figure out how.
I thought about joining the table to itself, but I'm not sure whether that would work, since an organization has many beneficials. Adding a column like fullname is not something I want to do here.
You can group by first and last names, then filter for duplicates
SELECT firstname, lastname
FROM beneficials
GROUP BY firstname, lastname
HAVING COUNT(*) > 1;
After your edits, it seems you want to select the records of people of a given organization that also appear in a different organization
SELECT *
FROM beneficials a
WHERE a.organization_id != 1
AND EXISTS (
SELECT 1
FROM beneficials b
WHERE a.firstname = b.firstname
AND a.lastname = b.lastname
AND b.organization_id = 1
);
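To make the difference between the two approaches concrete, here is a small sketch using Python's sqlite3 module as a stand-in database. It loads the sample rows plus one hypothetical extra row (john kowalski in organization 3) that triggers the false positive, and leaves out the deleted_at column for brevity. The first query checks first and last name independently, as in the question. The second uses a correlated EXISTS so both names must match on the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE beneficials "
    "(id INTEGER, firstname TEXT, lastname TEXT, organization_id INTEGER)"
)
conn.executemany(
    "INSERT INTO beneficials VALUES (?, ?, ?, ?)",
    [
        (1, "jan", "kowalski", 1),
        (2, "jan", "kovalski", 3),
        (3, "john", "doe", 1),
        (4, "jan", "kowalski", 2),
        (5, "john", "kowalski", 3),  # hypothetical row, not in the question
    ],
)

# Two independent IN lists: first name and last name may come from
# different people in the given organization.
naive_sql = """
SELECT DISTINCT organization_id
FROM beneficials
WHERE firstname IN (SELECT firstname FROM beneficials WHERE organization_id = :org)
  AND lastname IN (SELECT lastname FROM beneficials WHERE organization_id = :org)
  AND organization_id <> :org
ORDER BY organization_id
"""

# Correlated EXISTS: both names must match the same row of the given organization.
exists_sql = """
SELECT DISTINCT a.organization_id
FROM beneficials a
WHERE a.organization_id <> :org
  AND EXISTS (
      SELECT 1
      FROM beneficials b
      WHERE b.firstname = a.firstname
        AND b.lastname = a.lastname
        AND b.organization_id = :org
  )
ORDER BY a.organization_id
"""

naive = [r[0] for r in conn.execute(naive_sql, {"org": 1})]
matched = [r[0] for r in conn.execute(exists_sql, {"org": 1})]
print(naive)    # [2, 3]
print(matched)  # [2]
```

Organization 3 shows up in the first result only because "john" and "kowalski" each exist somewhere in organization 1, just not on the same person.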

Retrieve a row of data with the most recent date when another row <> 'X' in T-SQL

I have a database of customers who have an effective date and end date of their membership, both separate columns. The data is a bit dirty, however, and a customer can have multiple rows of data, only one of which is their most recent membership record. A member is considered "active" if they have an end date that = NULL.
The data looks somewhat like this:
Name ID Membership_Effective_Date Membership_End_Date
---------------------------------------------------------------------------
Bob 1 1/1/2020 NULL
Bob 1 1/1/2017 1/2/2017
Bob 1 1/1/2017 9/1/2018
Kim 2 1/1/2019 1/1/2020
Kim 2 1/1/2019 12/31/2019
Susan 3 1/1/2018 12/31/2018
Susan 3 1/1/2019 1/1/2019
Larry 4 1/1/2020 1/1/2020
I need to retrieve the most recent membership end date for a list of customers that are both inactive and active.
My desired results should look like this:
Name ID Membership_Effective_Date Membership_End_Date
Bob 1 1/1/2020 NULL
Kim 2 1/1/2019 1/1/2020
Susan 3 1/1/2018 12/31/2018
Larry 4 1/1/2020 1/1/2020
I have been able to do this without a problem for customers that have both a row with a Membership_End_Date date value and a Membership_End_Date row with a NULL value (Bob), and customers that have multiple rows with only date values (Kim).
The challenge I am having is with data like Susan and Larry. They both have rows where Membership_Effective_Date = Membership_End_Date. In Larry's case that is the only row of data he has. In Susan's case the dates in the row where Membership_Effective_Date = Membership_End_Date are greater than those in the other row, so my current query picks that row up automatically.
The problem is that I basically need to write a query that says: if a customer has multiple rows of data and one row where Membership_Effective_Date = Membership_End_Date, then choose the second most recent row. However, if a customer has only one row of data and that row has Membership_Effective_Date = Membership_End_Date, then choose that one.
I can't figure out how to do this without removing Larry from the data pull completely and I need to include him and similar customers.
Any help is appreciated!
You could do this with row_number() and a conditional sort:
select name, id, membership_effective_date, membership_end_date
from (
select
t.*,
row_number() over(
partition by id
order by
case when membership_end_date is null then 0 else 1 end,
case when membership_end_date <> membership_effective_date then 0 else 1 end,
membership_end_date desc
) rn
from mytable t
) t
where rn = 1
The trick lies in the order by clause of row_number(): it gives priority to rows whose end date is null, then to rows whose end date is not equal to the start date, then to the greatest end date. You can run the subquery separately to see how the row number is assigned.
With this information at hand, all that is left to do is filter on the top ranked record per group.
Demo on DB Fiddle:
name | id | membership_effective_date | membership_end_date
:---- | -: | :------------------------ | :------------------
Bob | 1 | 2020-01-01 | null
Kim | 2 | 2019-01-01 | 2020-01-01
Susan | 3 | 2018-01-01 | 2018-12-31
Larry | 4 | 2020-01-01 | 2020-01-01
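For anyone who wants to rerun this without the fiddle, here is a minimal sketch using Python's sqlite3 module (SQLite 3.25 or later) as a stand-in engine. The dates are stored as ISO strings, which is an assumption of the demo so that text comparison and sorting behave like date ordering. It runs the ranking subquery on its own first, then the filtered query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (name TEXT, id INTEGER,"
    " membership_effective_date TEXT, membership_end_date TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        ("Bob", 1, "2020-01-01", None),
        ("Bob", 1, "2017-01-01", "2017-01-02"),
        ("Bob", 1, "2017-01-01", "2018-09-01"),
        ("Kim", 2, "2019-01-01", "2020-01-01"),
        ("Kim", 2, "2019-01-01", "2019-12-31"),
        ("Susan", 3, "2018-01-01", "2018-12-31"),
        ("Susan", 3, "2019-01-01", "2019-01-01"),
        ("Larry", 4, "2020-01-01", "2020-01-01"),
    ],
)

# Sort keys, in priority order: open-ended rows, then rows whose end date
# differs from the start date, then the latest end date.
ranked_sql = """
SELECT t.*,
       ROW_NUMBER() OVER (
           PARTITION BY id
           ORDER BY
               CASE WHEN membership_end_date IS NULL THEN 0 ELSE 1 END,
               CASE WHEN membership_end_date <> membership_effective_date THEN 0 ELSE 1 END,
               membership_end_date DESC
       ) AS rn
FROM mytable t
"""

# The subquery on its own shows how the row numbers are assigned.
ranked = conn.execute(ranked_sql + " ORDER BY id, rn").fetchall()
for row in ranked:
    print(row)

# Keeping only rn = 1 gives one row per customer.
top = conn.execute(
    "SELECT name, id, membership_effective_date, membership_end_date "
    "FROM (" + ranked_sql + ") WHERE rn = 1 ORDER BY id"
).fetchall()
print(top)
```

Larry's single row still gets rn = 1 because the second sort key only demotes a row relative to other rows of the same customer.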
I wonder what makes you think that your code is better.
First of all, with due respect, no offence to anyone.
order by
case when membership_end_date is null then 0 else 1 end,
case when membership_end_date <> membership_effective_date then 0 else 1 end,
membership_end_date desc
I have no idea what the real data looks like.
I would avoid Row_Number and the inequality operator if I had a lot of rows to process.
An inequality operator often scans the complete table to check the inequality condition.
I am sure about it.
On top of that, the inequality operator here sits in an ORDER BY clause together with a CASE expression and Row_Number.
This may overwhelm the SQL optimizer.
I am not saying I always avoid Row_Number.
Also, you have not mentioned anything about Membership_Effective_Date.
Try the script below with various sample data:
create table customers1(Name varchar(40), ID int
, Membership_Effective_Date datetime, Membership_End_Date datetime)
insert into customers1 values
('Bob', 1 ,'2020-01-01' , NULL)
,('Bob', 1 ,'2017-01-01' , '1/2/2017')
,('Bob', 1 ,'2017-01-01' , '9/1/2018')
,('Kim', 2 ,'2019-01-01' , '1/1/2020')
,('Kim', 2 ,'2019-01-01' , '12/31/2019')
,('Susan', 3 ,'2018-01-01' , '12/31/2018')
,('Susan', 3 ,'2019-01-01' , '1/1/2019')
,('Larry', 4 ,'2020-01-01' , '1/1/2020')
SELECT ID
,NAME
,Membership_Effective_Date
,Membership_End_Date
INTO #temp
FROM customers1
WHERE Membership_End_Date IS NULL
OPTION (MAXDOP 1)
SELECT ID
,NAME
,Membership_Effective_Date
,Membership_End_Date
FROM #temp
UNION ALL
SELECT t.ID
,t.NAME
,min(t.Membership_Effective_Date) AS Membership_Effective_Date
,max(t.Membership_End_Date) AS Membership_End_Date
FROM customers1 t
WHERE Membership_End_Date IS NOT NULL
AND NOT EXISTS (
SELECT 1
FROM #temp ac
WHERE ac.ID = t.ID
)
GROUP BY t.ID
,t.NAME
OPTION (MAXDOP 1)
drop table #temp
drop table customers1
Yes, you are right: earlier, when I was using a CTE, it would have scanned at least twice.
Now I am using a #temp table, but the idea is the same as before.
More or less, I am sticking with this idea.

More efficient way to query shortest string value associated with each value in another column in Hive QL

I have a table in Hive containing store names, order IDs, and User IDs (as well as some other columns including item ID). There is a row in the table for every item purchased (so there can be more than one row per order if the order contains multiple items). Order IDs are unique within a store, but not across stores. A single order can have more than one user ID associated with it.
I'm trying to write a query that will return a list of all stores and order IDs and the shortest user ID associated with each order.
So, for example, if the data looks like this:
STORE | ORDERID | USERID | ITEMID
------+---------+--------+-------
a | 1 | bill | abc
a | 1 | susan | def
a | 2 | jane | abc
b | 1 | scott | ghi
b | 1 | tony | jkl
Then the output would look like this:
STORE | ORDERID | USERID
------+---------+-------
a | 1 | bill
a | 2 | jane
b | 1 | tony
I've written a query that will do this, but I feel like there must be a more efficient way to go about it. Does anybody know a better way to produce these results?
This is what I have so far:
select
users.store, users.orderid, users.userid
from
(select
store, orderid, userid, length(userid) as len
from
sales) users
join
(select distinct
store, orderid,
min(length(userid)) over (partition by store, orderid) as len
from
sales) len on users.store = len.store
and users.orderid = len.orderid
and users.len = len.len
This will probably work for you: it achieves your goal with a single SELECT and without the extra join.
select distinct
store, orderid,
first_value(userid) over(partition by store, orderid order by length(userid) asc) f_val
from
sales;
The result will be:
store orderid f_val
a 1 bill
a 2 jane
b 1 tony
rank() is probably the best way:
select s.*
from (select s.*, rank() over (partition by store, orderid order by length(userid)) as seqnum
from sales s
) s
where seqnum = 1;
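As a rough check of the ranking approach on the sample rows, here is a sketch that uses Python's sqlite3 module in place of Hive, which is an assumption of the demo. RANK() and LENGTH() behave the same way for this purpose. The window is partitioned by both store and orderid, since order IDs are only unique within a store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (store TEXT, orderid INTEGER, userid TEXT, itemid TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("a", 1, "bill", "abc"),
        ("a", 1, "susan", "def"),
        ("a", 2, "jane", "abc"),
        ("b", 1, "scott", "ghi"),
        ("b", 1, "tony", "jkl"),
    ],
)

# Rank user IDs by length inside each (store, orderid) pair; seqnum = 1
# marks the shortest. DISTINCT collapses repeated item rows for one user.
query = """
SELECT DISTINCT store, orderid, userid
FROM (
    SELECT s.*,
           RANK() OVER (
               PARTITION BY store, orderid
               ORDER BY LENGTH(userid)
           ) AS seqnum
    FROM sales s
) ranked
WHERE seqnum = 1
ORDER BY store, orderid
"""
shortest = conn.execute(query).fetchall()
print(shortest)
```

Note that RANK() keeps ties: if two different user IDs on the same order share the shortest length, both come back, whereas ROW_NUMBER() would pick one of them arbitrarily.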

sql, getting parents and child from one table

I'm trying to write a simple SQL statement to select a parent and their dependents from one table based on the parent's hire date. Because the hire date field in the dependents' rows is null, I'm only getting the parents. Can someone help?
PRIM KEY | RECORD | LAST | FIRST | HIRE DATE
12345 | 1 | JONES | MARY | 1/1/2017
12345 | 2 | JONES | TIM |
6789 | 1 | SMITH | CAROL | 5/12/2014
23456 | 1 | WHITAKE | REGINA | 5/14/2017
23456 | 2 | WHITAKE | JOE |
The table has a row for the parent and a row for each child. The parent has RECORD 1 and all dependents have 2. They share a primary key (the parent's SSN). I want to select all parents who were hired within a specific date range, along with their dependents' rows. The dependents' hire date column is null, so when I write the following, I only get the parent rows:
SELECT PRIMARY_KEY_VALUE, RECORD_ID, LAST_NAME, FIRST_NAME, HIRE_DATE
FROM CIGNA_ELIGIBILITY
WHERE (HIRE_DATE BETWEEN '20171101' AND '20171130');
If I understand your problem correctly, you want to return the records that fall in the given date range plus all of their dependents (given that parents and children share the same prim_key). One way is to use IN:
select *
from table1 t1
where t1.prim_key in
(
select t2.prim_key
from table1 t2
where t2.hire_date between '2017-01-01' AND '2017-01-30'
);
The subquery selects the PRIM_KEY values whose hire date falls in the specified range, and the main query then selects every record associated with those keys.
Result:
+---+----------+--------+-------+-------+---------------------+
| | prim_key | record | last | first | hire_date |
+---+----------+--------+-------+-------+---------------------+
| 1 | 12345 | 1 | JONES | MARY | 01.01.2017 00:00:00 |
| 2 | 12345 | 2 | JONES | TIM | NULL |
+---+----------+--------+-------+-------+---------------------+
Update:
Another option could be to use exists:
select *
from table1 t1
where exists
(
select 1
from table1 t2
where t1.prim_key = t2.prim_key
and t2.hire_date between '2017-01-01' AND '2017-01-30'
)
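Both variants can be tried on the sample rows with the sketch below, which uses Python's sqlite3 module as a stand-in database. The column names last_name and first_name and the ISO date strings are assumptions made for the demo. It runs the IN form and the EXISTS form over the same January 2017 range and compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (prim_key INTEGER, record INTEGER,"
    " last_name TEXT, first_name TEXT, hire_date TEXT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [
        (12345, 1, "JONES", "MARY", "2017-01-01"),
        (12345, 2, "JONES", "TIM", None),      # dependent: no hire date
        (6789, 1, "SMITH", "CAROL", "2014-05-12"),
        (23456, 1, "WHITAKE", "REGINA", "2017-05-14"),
        (23456, 2, "WHITAKE", "JOE", None),    # dependent: no hire date
    ],
)

# IN: collect the keys of parents hired in range, then fetch every row
# that carries one of those keys.
in_sql = """
SELECT * FROM table1 t1
WHERE t1.prim_key IN (
    SELECT t2.prim_key FROM table1 t2
    WHERE t2.hire_date BETWEEN '2017-01-01' AND '2017-01-30'
)
ORDER BY prim_key, record
"""

# EXISTS: keep a row if some row with the same key was hired in range.
exists_sql = """
SELECT * FROM table1 t1
WHERE EXISTS (
    SELECT 1 FROM table1 t2
    WHERE t1.prim_key = t2.prim_key
      AND t2.hire_date BETWEEN '2017-01-01' AND '2017-01-30'
)
ORDER BY prim_key, record
"""

via_in = conn.execute(in_sql).fetchall()
via_exists = conn.execute(exists_sql).fetchall()
print(via_in)
```

Both queries return Mary Jones and her dependent Tim, because the date filter is applied to the family key and not to each individual row.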