Separating Duplicate Rows into Separate Columns TSQL - sql

I have a block of data which has duplicate row ID's (CLIENT_DIWOR is the ROW ID) but relate to different groups. I can't just delete the duplicate row as they tie in to two different groups, so what I am trying to do is move the duplicate to the next column, so I can get the calculations correct at the end of my query. So for an example of what I am after
This is what I have
CLIENT_DIWOR  GROUP_NAME
-1            Priv Client Serv (Sector)
-1            Social Business (Sector)
This is what I want
CLIENT_DIWOR  GROUP_NAME                 Second Group Name
-1            Priv Client Serv (Sector)  Social Business (Sector)
I have tried using COUNT(*) with a GROUP BY, but that doesn't give the correct results: it just tells me there is 1 of everything. What I am after is a counter that increases by 1 each time a CLIENT_DIWOR repeats, as that would give me what I need to separate the rows out and rebuild them into a table. I just can't see how to count without grouping the numbers together. This is what I have so far, with the count removed as I know it is wrong.
SELECT A.CLIENT_DIWOR,B.GROUP_NAME
from CLIENT_GRP_MEMBER A
JOIN CLIENT_GROUP B on B.DIWOR = A.CLIENT_GRP_DIWOR
order by CLIENT_DIWOR

A more general solution, using ROW_NUMBER:
WITH cte AS (
    -- List the columns explicitly: SELECT * over a join can fail in a CTE
    -- if both tables share a column name.
    SELECT A.CLIENT_DIWOR,
           B.GROUP_NAME,
           ROW_NUMBER() OVER (PARTITION BY A.CLIENT_DIWOR ORDER BY B.GROUP_NAME) AS rn
    FROM CLIENT_GRP_MEMBER A
    INNER JOIN CLIENT_GROUP B
        ON B.DIWOR = A.CLIENT_GRP_DIWOR
)
SELECT
    CLIENT_DIWOR,
    MAX(CASE WHEN rn = 1 THEN GROUP_NAME END) AS GROUP_NAME,
    MAX(CASE WHEN rn = 2 THEN GROUP_NAME END) AS SECOND_GROUP_NAME
FROM cte
GROUP BY
    CLIENT_DIWOR;
The advantage of this approach is that if you ever need to cater to more than two columns in the output, you can easily extend the query.
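To make the "easily extend" point concrete, here is a minimal sketch that runs the same ROW_NUMBER pivot with a third output column. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (it needs SQLite 3.25 or later for window functions). The DIWOR key values, the extra client 5, and the group 'Wealth Mgmt (Sector)' are invented for the illustration and are not from the question.

```python
import sqlite3

# SQLite stands in for SQL Server here; the window-function syntax is the same.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CLIENT_GROUP (DIWOR INTEGER PRIMARY KEY, GROUP_NAME TEXT);
    CREATE TABLE CLIENT_GRP_MEMBER (CLIENT_DIWOR INTEGER, CLIENT_GRP_DIWOR INTEGER);
    -- Key values and the third group are made up for this sketch.
    INSERT INTO CLIENT_GROUP VALUES
        (10, 'Priv Client Serv (Sector)'),
        (20, 'Social Business (Sector)'),
        (30, 'Wealth Mgmt (Sector)');
    INSERT INTO CLIENT_GRP_MEMBER VALUES (-1, 10), (-1, 20), (-1, 30), (5, 20);
""")

pivot_rows = conn.execute("""
    WITH cte AS (
        SELECT A.CLIENT_DIWOR,
               B.GROUP_NAME,
               -- Number each client's groups 1, 2, 3, ... in name order.
               ROW_NUMBER() OVER (PARTITION BY A.CLIENT_DIWOR
                                  ORDER BY B.GROUP_NAME) AS rn
        FROM CLIENT_GRP_MEMBER A
        JOIN CLIENT_GROUP B ON B.DIWOR = A.CLIENT_GRP_DIWOR
    )
    SELECT CLIENT_DIWOR,
           MAX(CASE WHEN rn = 1 THEN GROUP_NAME END) AS GROUP_NAME,
           MAX(CASE WHEN rn = 2 THEN GROUP_NAME END) AS SECOND_GROUP_NAME,
           -- Extending the pivot is one more CASE per extra column.
           MAX(CASE WHEN rn = 3 THEN GROUP_NAME END) AS THIRD_GROUP_NAME
    FROM cte
    GROUP BY CLIENT_DIWOR
    ORDER BY CLIENT_DIWOR
""").fetchall()

for row in pivot_rows:
    print(row)
```

Clients with fewer groups than output columns simply get NULL (None in Python) in the unused columns.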

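If there are at most two groups for each CLIENT_DIWOR, you can use aggregation:
SELECT cgm.CLIENT_DIWOR,
       MIN(cg.GROUP_NAME) as GROUP_NAME,
       -- NULL when the client has only one distinct group
       NULLIF(MAX(cg.GROUP_NAME), MIN(cg.GROUP_NAME)) as GROUP_NAME_2
from CLIENT_GRP_MEMBER cgm
JOIN CLIENT_GROUP cg on cg.DIWOR = cgm.CLIENT_GRP_DIWOR
GROUP BY cgm.CLIENT_DIWOR;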

Related

Group by after a partition by in MS SQL Server

I am working on some car accident data and am stuck on how to get the data in the form I want.
select
sex_of_driver,
accident_severity,
count(accident_severity) over (partition by sex_of_driver, accident_severity)
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
This is my code, which counts the accidents for each sex and each severity. I know I can do this with group by but I wanted to use a partition by in order to work out % too.
However I get a very large table (I assume one row for each underlying row in each sex/severity combination). When I do the following:
select
sex_of_driver,
accident_severity,
count(accident_severity) over (partition by sex_of_driver, accident_severity)
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
group by
sex_of_driver,
accident_severity
I get this:
sex_of_driver  accident_severity  (No column name)
1              1                  1
1              2                  1
-1             2                  1
-1             1                  1
1              3                  1
I won't give you the whole table, but basically, the group by has caused the count to just be 1.
I can't figure out why group by isn't working. Is this an MS SQL-Server thing?
I want to get the same result as below (obv without the CASE etc)
select
accident.accident_severity,
count(accident.accident_severity) as num_accidents,
vehicle.sex_of_driver,
CASE vehicle.sex_of_driver WHEN '1' THEN 'Male' WHEN '2' THEN 'Female' end as sex_col,
CASE accident.accident_severity WHEN '1' THEN 'Fatal' WHEN '2' THEN 'Serious' WHEN '3' THEN 'Slight' end as serious_col
from
SQL.dbo.accident as accident
inner join SQL.dbo.vehicle as vehicle on
accident.accident_index = vehicle.accident_index
where
sex_of_driver != 3
and
sex_of_driver != -1
group by
accident.accident_severity,
vehicle.sex_of_driver
order by
accident.accident_severity
You seem to have a misunderstanding here.
GROUP BY will reduce your rows to a single row per grouping (i.e. per pair of sex_of_driver, accident_severity values). Any normal aggregates you use with this, such as COUNT(*), will return the aggregate value within that group.
OVER, on the other hand, gives you a windowed aggregate, and when it is combined with GROUP BY it is calculated after the rows have been reduced. Therefore when you write count(accident_severity) over (partition by sex_of_driver, accident_severity) the aggregate only receives a single row in each partition, because the rows have already been grouped.
You say "I know I can do this with group by but I wanted to use a partition by in order to work out % too." but you are misunderstanding how to do that. You don't need PARTITION BY to work out percentage. All you need to calculate a percentage over the whole resultset is COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (), in other words a windowed aggregate over a normal aggregate.
Note also that count(accident_severity) does not give you the number of distinct accident_severity values; it gives you the number of non-null values, which is probably not what you intend. You also have a very strange join predicate; you probably want something like a.vehicle_id = v.vehicle_id.
So you want something like this:
select
    sex_of_driver,
    accident_severity,
    count(*) as Count,
    count(*) * 1.0 /
        sum(count(*)) over (partition by sex_of_driver) as PercentOfSex,
    count(*) * 1.0 /
        sum(count(*)) over () as PercentOfTotal
from
    dbo.accident as a
    inner join dbo.vehicle as v on
        a.vehicle_id = v.vehicle_id
group by
    sex_of_driver,
    accident_severity;
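A small runnable sketch of the "windowed aggregate over a normal aggregate" idea, again using Python's sqlite3 (SQLite 3.25+) in place of SQL Server. To keep it short, the accident/vehicle join is replaced by a single hypothetical table called crash with invented rows; only the GROUP BY plus SUM(COUNT(*)) OVER (...) pattern is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical pre-joined table; the real data would come from the join.
    CREATE TABLE crash (sex_of_driver INTEGER, accident_severity INTEGER);
    INSERT INTO crash VALUES
        (1, 2),
        (1, 3), (1, 3), (1, 3),
        (2, 3), (2, 3), (2, 3), (2, 3);
""")

pct_rows = conn.execute("""
    SELECT sex_of_driver,
           accident_severity,
           COUNT(*) AS n,
           -- Group count divided by the total of the group counts per sex.
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER (PARTITION BY sex_of_driver) AS pct_of_sex,
           -- Group count divided by the total over the whole result set.
           COUNT(*) * 1.0 / SUM(COUNT(*)) OVER () AS pct_of_total
    FROM crash
    GROUP BY sex_of_driver, accident_severity
    ORDER BY sex_of_driver, accident_severity
""").fetchall()

for row in pct_rows:
    print(row)
```

The window functions here run over the already-grouped rows, which is exactly why a plain COUNT(...) OVER (...) in the question's grouped query returned 1 everywhere.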

SQL - Count new entries based on last date

I have a table with the following structure
ID ReportDate Object_id
What I need to know, is the count of new and count of old (Object id's)
For example: If I have the data below:
I want the following output grouped by ReportDate:
I thought of a way of doing it using a WHERE clause based on date; however, I need the data for all the dates I have in the table, to see the count of what already existed in the previous report and what is new in that report. Any ideas?
Edit: New/Old definition: New would be the records that never appeared before that report run date and appeared on this one, whereas old is the number of records that had at least one match in previous dates. I'll edit the post to include this info.
I managed to do it using a left join. Below is my solution in case it helps anyone in the future :)
SELECT table.ReportRunDate,
-1*sum(table.ReportRunDate = new_table.init_date) as count_new,
-1*sum(table.ReportRunDate <> new_table.init_date) as count_old,
count(*) as count_total
FROM table LEFT JOIN
((SELECT Object_ID, min(ReportRunDate) as init_date
FROM table
GROUP By OBJECT_ID) as new_table)
ON table.Object_ID = new_table.Object_ID
GROUP BY ReportRunDate
This would work in Oracle, not sure about ms-access:
SELECT ReportDate
,COUNT(CASE WHEN rnk = 1 THEN 1 ELSE NULL END) count_of_new
,COUNT(CASE WHEN rnk <> 1 THEN 1 ELSE NULL END)count_of_old
FROM (SELECT ID
,ReportDate
,Object_id
,RANK() OVER (PARTITION BY Object_id ORDER BY ReportDate) rnk
FROM table_name)
GROUP BY ReportDate
The inner query ranks each occurrence of object_id by ReportDate, so the 1st occurrence of a given object_id will have rank = 1, the next one rank = 2, etc.
Then the outer query counts how many records with rank equal/not equal to 1 there are within each group.
I assumed that an object_id can appear only once within each ReportDate.
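Since the sample data from the question did not survive, here is a hedged sketch of the RANK approach on made-up rows, run through Python's sqlite3 (SQLite 3.25+) rather than Oracle. The table name reports and all of its contents are assumptions for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table and rows; dates are ISO strings so they sort correctly.
    CREATE TABLE reports (ID INTEGER PRIMARY KEY, ReportDate TEXT, Object_id TEXT);
    INSERT INTO reports (ReportDate, Object_id) VALUES
        ('2020-01-01', 'A'), ('2020-01-01', 'B'),
        ('2020-01-02', 'A'), ('2020-01-02', 'C'),
        ('2020-01-03', 'A'), ('2020-01-03', 'B'),
        ('2020-01-03', 'C'), ('2020-01-03', 'D');
""")

new_old_rows = conn.execute("""
    SELECT ReportDate,
           COUNT(CASE WHEN rnk = 1 THEN 1 END)  AS count_of_new,
           COUNT(CASE WHEN rnk <> 1 THEN 1 END) AS count_of_old
    FROM (SELECT ReportDate,
                 Object_id,
                 -- rnk = 1 marks the first report an object ever appears in.
                 RANK() OVER (PARTITION BY Object_id ORDER BY ReportDate) AS rnk
          FROM reports) ranked
    GROUP BY ReportDate
    ORDER BY ReportDate
""").fetchall()

for row in new_old_rows:
    print(row)
```

With this data, object A is new only on the first date and counts as old afterwards, while D is the only new object on the last date.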

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
       MAX(claim_id) AS ClaimID
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
There are 2 rows that are currently giving me the problem: they're both remits for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim.
You can change your query like this:
SELECT
    p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
    Remittance AS p
    INNER JOIN
    -- No ORDER BY here: SQL Server rejects it in a derived table without TOP.
    (SELECT MAX(create_datetime) AS DATE_OF_LATEST_REMIT,
            claim_id
     FROM Claims_Group2
     WHERE BILLED_AMOUNT > 0
     GROUP BY claim_id) AS latest_remit
    ON latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that this does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
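As a rough illustration of picking exactly one remittance per claim when timestamps tie, the sketch below uses Python's sqlite3 (SQLite 3.25+) instead of SQL Server and works on claims_group2 alone. The row values are invented, and the extra remittance_uuid sort key is my own addition to make the choice among tied rows repeatable; the query above leaves that choice to the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE claims_group2 (
        remittance_uuid TEXT, claim_id INTEGER,
        billed_amount REAL, create_datetime TEXT
    );
    -- Invented rows: claim 1 has two remittances with the same timestamp.
    INSERT INTO claims_group2 VALUES
        ('u-a', 1, 100, '2018-08-15 13:07:50.933'),
        ('u-b', 1, 100, '2018-08-15 13:07:50.933'),
        ('u-c', 2,  50, '2017-12-31 10:00:00.000'),
        ('u-d', 2,  50, '2017-06-01 09:00:00.000');
""")

latest_rows = conn.execute("""
    SELECT claim_id, remittance_uuid
    FROM (SELECT cg2.*,
                 ROW_NUMBER() OVER (
                     PARTITION BY cg2.claim_id
                     -- Latest first; remittance_uuid is an assumed tiebreaker
                     -- so that tied timestamps resolve the same way every run.
                     ORDER BY cg2.create_datetime DESC, cg2.remittance_uuid
                 ) AS rn
          FROM claims_group2 cg2
          WHERE cg2.billed_amount > 0) t
    WHERE t.rn = 1
    ORDER BY claim_id
""").fetchall()

for row in latest_rows:
    print(row)
```

Unlike RANK, ROW_NUMBER never assigns the same number twice within a partition, which is what guarantees at most one row per claim.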
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
    drop table #t;
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
    select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as REMIT_UUID
    union select '2018-08-15 13:07:50.933', 1, NEWID()
    union select '2017-12-31 10:00:00.000', 2, NEWID()
) x;
select *
from #t
order by CLAIM_ID, ROW_NUM;
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE;

removing non-matching rows based on order in SQL within a WITH statement

I've got a many-to-many setup where there are items and item names (based on languageID).
I want to retrieve all names for a set id, where the name is replaced with an alternate name (same itemID, but different languageID) when the name is NULL.
I've set up a table that receives all combinations of itemids and itemnames, even the missing ones, and have the names ordered by a hasName flag that is set to 0, 1 or 2 depending on whether the name exists. 0 means languageId and name exist, 1 means only name exists, and 2 means neither. I then sort the results: ORDER BY itemId, hasName, languageId. This works well enough, because the top 1 row of every itemid meets the criteria, and I can just pull that.
However, I still need to process other queries using the result, so this doesn't work well: as soon as I use a WITH statement, the order cannot be used, so it breaks the functionality.
What I'm using instead is a join, where I select the top 1 matching row on the ordered table.
The problem there is that the time to execute goes up 10x.
Any ideas what else I could try?
Using SQL Server 10.50.
the slow query:
SELECT
*,
(SELECT top 1 ItemName FROM ItemNameMultiLang x WHERE x.ItemId = tc.ItemId ORDER BY ItemID, hasName, LangID) AS ItemName
FROM ItemCategories tc
ORDER BY ItemId
One way to approach this is with row_number(), so you can get the first row from itemNameMultiLang, which is what you want:
SELECT tc.*, inml.ItemName
FROM ItemCategories tc left outer join
     (select x.*, row_number() over (partition by x.ItemId order by x.hasName, x.LangID) as seqnum
      from ItemNameMultiLang x
     ) inml
     on tc.ItemId = inml.ItemId and
        inml.seqnum = 1
ORDER BY tc.ItemId;

filtering rows by checking a condition for group in one statement only

I have the following statement:
SELECT
(CONVERT(VARCHAR(10), f1, 120)) AS ff1,
CONVERT(VARCHAR(10), f2, 103) AS ff2,
...,
Bonus,
Malus,
ClientID
FROM
my_table
WHERE
<my_conditions>
ORDER BY
f1 ASC
This select returns several rows for each ClientID. I have to filter out all the rows with the Clients that don't have any row with non-empty Bonus or Malus.
How can I do it by changing this select by one statement only and without duplicating all this select?
I could store the result in a #temp_table, then group the data and use the result of the grouping to filter the temp table. - BUT I should do it by one statement only.
I could perform this select twice - one time grouping it and then I can filter the rows based on grouping result. BUT I don't want to select it twice.
Maybe a CTE (Common Table Expression) could be useful here, to perform the select only once and then use the result both for grouping and for selecting the desired rows based on the grouping result.
Any more elegant solution for this problem?
Thank you in advance!
Just to clarify what the SQL should do I add an example:
ClientID Bonus Malus
1 1
1
1 1
2
2
3 4
3 5
3 1
So in this case I don't want the ClientID=2 rows to appear (they are not interesting). The result should be:
ClientID Bonus Malus
1 1
1
1 1
3 4
3 5
3 1
SELECT Bonus,
Malus,
ClientID
FROM my_table
WHERE ClientID not in
(
select ClientID
from my_table
group by ClientID
having count(Bonus) = 0 and count(Malus) = 0
)
A CTE will work fine, but in effect its contents will be executed twice, because they are cloned into every place where the CTE is used. This can be a net performance win or loss compared to using a temp table. If the query is very expensive, the CTE might come out as a loss. If it is cheap, or if many rows are being returned, the temp table will lose the comparison.
Which solution is better? Look at the execution plans and measure the performance.
The CTE is the easier, more maintainable and less redundant alternative.
You haven't specified the data types of the Bonus and Malus columns. If they're integer (or can be converted to integer), then the query below should be helpful. It calculates the sum of both columns for each ClientID. These sums are the same for every detail line of the same client, so we can use them in a WHERE condition. SUM() OVER() is a window function and can't be used directly in a WHERE clause, so I had to wrap your select list in a parent query purely for syntax reasons.
SELECT *
FROM (
    SELECT
        f1,  -- exposed so the outer ORDER BY can reference it
        CONVERT(VARCHAR(10), f1, 120) AS ff1,
        CONVERT(VARCHAR(10), f2, 103) AS ff2,
        ...,
        Bonus,
        Malus,
        ClientID,
        SUM(Bonus) OVER (PARTITION BY ClientID) AS ClientBonusTotal,
        SUM(Malus) OVER (PARTITION BY ClientID) AS ClientMalusTotal
    FROM
        my_table
    WHERE
        <my_conditions>
) a
WHERE ISNULL(a.ClientBonusTotal, 0) <> 0 OR ISNULL(a.ClientMalusTotal, 0) <> 0
ORDER BY f1 ASC
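For anyone who wants to try the single-pass filter, here is a minimal sketch using Python's sqlite3 (SQLite 3.25+) as a stand-in for SQL Server. It is a variation rather than the exact query above: it uses a windowed COUNT instead of SUM, on the assumption that "empty" means NULL, so that a client whose values happen to sum to zero is still kept. The sample rows are invented, because the column layout of the example in the question is ambiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (ClientID INTEGER, Bonus INTEGER, Malus INTEGER);
    -- Invented rows; "empty" is assumed to mean NULL.
    INSERT INTO my_table VALUES
        (1, 1, NULL), (1, NULL, NULL), (1, NULL, 1),
        (2, NULL, NULL), (2, NULL, NULL),
        (3, 4, NULL), (3, 5, NULL), (3, NULL, 1);
""")

kept_rows = conn.execute("""
    SELECT ClientID, Bonus, Malus
    FROM (SELECT ClientID, Bonus, Malus,
                 -- COUNT(col) ignores NULLs, so these are the numbers of
                 -- non-empty Bonus / Malus values per client.
                 COUNT(Bonus) OVER (PARTITION BY ClientID) AS bonus_cnt,
                 COUNT(Malus) OVER (PARTITION BY ClientID) AS malus_cnt
          FROM my_table) a
    WHERE a.bonus_cnt > 0 OR a.malus_cnt > 0
    ORDER BY ClientID
""").fetchall()

kept_clients = {row[0] for row in kept_rows}
print(kept_clients, len(kept_rows))
```

The base table is read once, and every row of a client is kept or dropped together because the windowed counts are identical across that client's rows.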