Eliminating duplicate rows with null values using with clause - sql

How do we eliminate duplicates by only selecting those with values in a certain field using with clause statement?
Query is something like this:
with x as (--queries with multiple join tables, etc.)
select distinct * from x
Output below:
Com_no Company Loc Rewards
1 Mccin India 50
1 Mccin India
2 Rowle China 18
3 Draxel China 11
3 Draxel China
4 Robo UK
As you can see, I get duplicate records. I want to get rid of the null values that are NOT unique. That is to say, Robo is unique since it only has 1 record with a null value in Rewards, so I want to keep that.
I tried this:
with x as (--queries with multiple join tables, etc.)
select distinct * from x where Rewards is not null
And of course that wasn't right, since it also got rid of 4 Robo UK
Expected output should be:
1 Mccin India 50
2 Rowle China 18
3 Draxel China 11
4 Robo UK

The problem is you're calling those rows duplicates, but they're not duplicates. They're different. So what you want to do is exclude rows where Rewards is null UNLESS there aren't any rows with a not null value, and then select the distinct rows. So something like:
select distinct *
from x a
where Rewards is not null
or (Rewards is null and not exists (select 1 from x b where a.Com_no = b.Com_no
and b.Rewards is not null))
Now your Robo row will still be included as there isn't a row in x for Robo where Rewards is not null, but the rows for the other Companies with null Rewards will be excluded as there are not null rows for them.
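To check the NOT EXISTS filter against the sample rows from the question, here is a minimal sketch using Python's built-in sqlite3 module. It assumes x is a plain table rather than the CTE from the question, and the column types are guesses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: x is materialized as a table here; in the question it is a CTE.
conn.execute("create table x (Com_no int, Company text, Loc text, Rewards int)")
conn.executemany(
    "insert into x values (?, ?, ?, ?)",
    [
        (1, "Mccin", "India", 50),
        (1, "Mccin", "India", None),
        (2, "Rowle", "China", 18),
        (3, "Draxel", "China", 11),
        (3, "Draxel", "China", None),
        (4, "Robo", "UK", None),
    ],
)

# Keep non-null rows, plus null rows whose Com_no has no non-null row at all.
rows = conn.execute(
    """
    select distinct *
    from x a
    where Rewards is not null
       or (Rewards is null
           and not exists (select 1 from x b
                           where a.Com_no = b.Com_no
                             and b.Rewards is not null))
    order by Com_no
    """
).fetchall()

for row in rows:
    print(row)
```

Robo survives with a NULL in Rewards because no other row for Com_no 4 has a value.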

This is a prioritization query. One method is to use row_number(). If you want only one value per Com_no/Company/Loc, then:
select x.*
from (select x.*,
row_number() over (partition by Com_no, Company, Loc order by Rewards nulls last) as seqnum
from x
) x
where seqnum = 1;
Or even:
select Com_no, Company, Loc, max(Rewards)
from x
group by Com_no, Company, Loc;
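Both variants can be compared on the sample rows with a small sqlite3 sketch. Two assumptions: x is a plain table, and `order by Rewards is null, Rewards` stands in for `nulls last`, since older SQLite builds do not accept that keyword. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table x (Com_no int, Company text, Loc text, Rewards int)")
conn.executemany(
    "insert into x values (?, ?, ?, ?)",
    [
        (1, "Mccin", "India", 50),
        (1, "Mccin", "India", None),
        (2, "Rowle", "China", 18),
        (3, "Draxel", "China", 11),
        (3, "Draxel", "China", None),
        (4, "Robo", "UK", None),
    ],
)

# Prioritization: non-null Rewards sort first, so seqnum = 1 is a non-null row
# whenever the group has one.
by_row_number = conn.execute(
    """
    select Com_no, Company, Loc, Rewards
    from (select x.*,
                 row_number() over (partition by Com_no, Company, Loc
                                    order by Rewards is null, Rewards) as seqnum
          from x) t
    where seqnum = 1
    order by Com_no
    """
).fetchall()

# Aggregation: max() ignores NULLs and returns NULL only for all-null groups.
by_max = conn.execute(
    """
    select Com_no, Company, Loc, max(Rewards)
    from x
    group by Com_no, Company, Loc
    order by Com_no
    """
).fetchall()

print(by_row_number == by_max)
```

The two agree here because each group has at most one non-null Rewards value. If a group had several, the row_number() version as ordered above would keep the smallest and max() the largest.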

Related

How to add columns that shows the total number of rows in a table with condition in SQL Server

I've got this table and I wish to add columns that summarized it:
table now:
Name
PAT_ID
Has_T
Has_Y
Has_G
Brian
123
X
X
Brian
356
X
X
Brian
3546
X
X
Brian
987
X
What I wish is to add columns that counts stuff in the table and give a value in each row:
Desired output:
Name
PAT_ID
Has_T
Has_Y
Has_G
Total_T
Total_Y
Total_PATS
Brian
123
X
X
3
2
4
Brian
356
X
X
3
2
4
Brian
3546
X
X
3
2
4
Brian
987
X
3
2
4
Someone helped me with the last one (Total_PATS) by counting all rows with:
COUNT(*) OVER () AS [total] << for all rows.
How do I do it with conditions? I have 'X', so I want to count all the rows where has_T has an X...
As you are storing blank values, or values of spaces, COUNT will still count those values; COUNT counts non-NULL values. Ideally, you should be storing NULL, not '' or ' ' (or even longer runs of spaces), in such columns; it makes COUNTing the values much easier.
You could, however, NULL the values in the COUNT:
SELECT name,
pat_id,
has_t,
has_y,
has_g,
COUNT(NULLIF(has_t,'')) OVER() AS total_t,
COUNT(NULLIF(has_y,'')) OVER() AS total_y,
COUNT(NULLIF(has_g,'')) OVER() AS total_g,
COUNT(*) OVER() AS total
FROM dbo.Yourtable;
We can use conditional count with CASE:
SELECT
name, pat_id, has_t, has_y, has_g,
COUNT (CASE WHEN has_t = 'X' THEN 1 END) OVER() AS Total_T,
COUNT (CASE WHEN has_y = 'X' THEN 1 END) OVER() AS Total_Y,
COUNT (CASE WHEN has_g = 'X' THEN 1 END) OVER() AS Total_G,
COUNT (*) OVER() AS Total_PATS
FROM yourtable;
This will count X values only.
As already said in comments, using NULL rather than spaces/empty strings would be much better, for example because COUNT ignores NULL, so we could simply write COUNT(has_t) etc.
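A small sqlite3 sketch of the conditional count follows. The placement of the X marks is an assumption: the question only fixes the totals (3, 2 and 4), so the rows below are made up to match them, and blanks are stored as empty strings. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table yourtable (name text, pat_id int, has_t text, has_y text, has_g text)"
)
# Hypothetical data: which columns hold 'X' in each row is assumed.
conn.executemany(
    "insert into yourtable values (?, ?, ?, ?, ?)",
    [
        ("Brian", 123, "X", "X", ""),
        ("Brian", 356, "X", "", "X"),
        ("Brian", 3546, "X", "Y" and "X", ""),
        ("Brian", 987, "", "", "X"),
    ],
)

# CASE yields NULL for anything other than 'X', and COUNT skips NULLs.
rows = conn.execute(
    """
    select name, pat_id,
           count(case when has_t = 'X' then 1 end) over () as total_t,
           count(case when has_y = 'X' then 1 end) over () as total_y,
           count(case when has_g = 'X' then 1 end) over () as total_g,
           count(*) over () as total_pats
    from yourtable
    order by pat_id
    """
).fetchall()

for row in rows:
    print(row)
```

Every row carries the same four totals, because an empty OVER () makes the whole result set the window.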

JOIN on aggregate function

I have a table showing production steps (PosID) for a production order (OrderID) and which machine (MachID) they will be run on; I’m trying to reduce the table to show one record for each order – the lowest position (field “PosID”) that is still open (field “Open” = Y); i.e. the next production step for the order.
Example data I have:
OrderID PosID MachID Open
1 1 A N
1 2 B Y
1 3 C Y
2 4 C Y
2 5 D Y
2 6 E Y
Example result I want:
OrderID PosID MachID
1 2 B
2 4 C
I’ve tried two approaches, but I can’t seem to get either to work:
I don’t want to put “MachID” in the GROUP BY because that gives me all the records that are open, but I also don’t think there is an appropriate aggregate function for the “MachID” field to make this work.
SELECT "OrderID", MIN("PosID"), "MachID"
FROM Table T0
WHERE "Open" = 'Y'
GROUP BY "OrderID"
With this approach, I keep getting error messages that T1."PosID" (in the JOIN clause) is an invalid column. I've also tried T1.MIN("PosID") and MIN(T1."PosID").
SELECT T0."OrderID", T0."PosID", T0."MachID"
FROM Table T0
JOIN
(SELECT "OrderID", MIN("PosID")
FROM Table
WHERE "Open" = 'Y'
GROUP BY "OrderID") T1
ON T0."OrderID" = T1."OrderID"
AND T0."PosID" = T1."PosID"
Try this:
SELECT "OrderID","PosID","MachID" FROM (
SELECT
T0."OrderID",
T0."PosID",
T0."MachID",
ROW_NUMBER() OVER (PARTITION BY "OrderID" ORDER BY "PosID") RNK
FROM Table T0
WHERE "Open" = 'Y'
) AS A
WHERE RNK = 1
I've kept the double quotes around the column names, as you've written them in the question above, but in general they're not needed.
What it does is first filter the open rows, and then number the rows within each OrderID from 1 to X, ordered by PosID:
OrderID PosID MachID Open RNK
1 2 B Y 1
1 3 C Y 2
2 4 C Y 1
2 5 D Y 2
2 6 E Y 3
After that, it filters on RNK = 1, which marks the lowest open PosID per OrderID. ROW_NUMBER() in the select clause is called a window function, and there are many more which are quite useful.
P.S. Above solution should work for MSSQL
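For comparison, the JOIN attempt from the question most likely fails because MIN("PosID") in the derived table has no alias, so T1 has no column named "PosID" to join on. A sketch of that approach with an alias added, run through Python's sqlite3 module; the table name steps and the alias "MinPosID" are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the question only calls it "Table".
conn.execute(
    'create table steps ("OrderID" int, "PosID" int, "MachID" text, "Open" text)'
)
conn.executemany(
    "insert into steps values (?, ?, ?, ?)",
    [
        (1, 1, "A", "N"),
        (1, 2, "B", "Y"),
        (1, 3, "C", "Y"),
        (2, 4, "C", "Y"),
        (2, 5, "D", "Y"),
        (2, 6, "E", "Y"),
    ],
)

# The aggregate gets an alias so the outer query can refer to it by name.
rows = conn.execute(
    """
    select t0."OrderID", t0."PosID", t0."MachID"
    from steps t0
    join (select "OrderID", min("PosID") as "MinPosID"
          from steps
          where "Open" = 'Y'
          group by "OrderID") t1
      on t0."OrderID" = t1."OrderID"
     and t0."PosID" = t1."MinPosID"
    order by t0."OrderID"
    """
).fetchall()

for row in rows:
    print(row)
```

This assumes PosID is unique within an order; with ties the join would return more than one row per order, while ROW_NUMBER() always returns exactly one.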

How to consecutively count everything greater than or equal to itself in SQL?

Let's say I have a table that contains the Equipment ID of each piece of equipment, along with its Equipment Type and Equipment Age. How can I do a Count Distinct of Equipment IDs that have at least that Equipment Age?
For example, let's say this is all the data we have:
equipment_type equipment_id equipment_age
Screwdriver A123 1
Screwdriver A234 2
Screwdriver A345 2
Screwdriver A456 2
Screwdriver A567 3
I would like the output to be:
equipment_type equipment_age count_of_equipment_at_least_this_age
Screwdriver 1 5
Screwdriver 2 4
Screwdriver 3 1
Reason is there are 5 screwdrivers that are at least 1 day old, 4 screwdrivers at least 2 days old and only 1 screwdriver at least 3 days old.
So far I have only been able to count the equipment that falls within each equipment_age (with the query shown below), but not "at least that equipment_age".
SELECT
equipment_type,
equipment_age,
COUNT(DISTINCT equipment_id) as count_of_equipments
FROM equipment_table
GROUP BY 1, 2
Consider the join-less solution below:
select distinct
equipment_type,
equipment_age,
count(*) over equipment_at_least_this_age as count_of_equipment_at_least_this_age
from equipment_table
window equipment_at_least_this_age as (
partition by equipment_type
order by equipment_age
range between current row and unbounded following
)
If applied to the sample data in your question, the output matches the desired result shown above.
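Use a self join approach. Count distinct IDs from the joined side, because each row of e1 pairs with every matching row of e2 and a plain COUNT(*) would overcount ages shared by several rows:
SELECT
e1.equipment_type,
e1.equipment_age,
COUNT(DISTINCT e2.equipment_id) AS count_of_equipments
FROM equipment_table e1
INNER JOIN equipment_table e2
ON e2.equipment_type = e1.equipment_type AND
e2.equipment_age >= e1.equipment_age
GROUP BY 1, 2
ORDER BY 1, 2;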
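When several rows share an equipment_age, a self join pairs each of them with every row of equal or greater age, so counting joined rows counts pairs rather than pieces of equipment. A sqlite3 sketch on the sample data shows both numbers side by side; the column name joined_pairs is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table equipment_table "
    "(equipment_type text, equipment_id text, equipment_age int)"
)
conn.executemany(
    "insert into equipment_table values (?, ?, ?)",
    [
        ("Screwdriver", "A123", 1),
        ("Screwdriver", "A234", 2),
        ("Screwdriver", "A345", 2),
        ("Screwdriver", "A456", 2),
        ("Screwdriver", "A567", 3),
    ],
)

# count(*) counts (e1, e2) pairs; count(distinct ...) counts equipment.
rows = conn.execute(
    """
    select e1.equipment_type,
           e1.equipment_age,
           count(*) as joined_pairs,
           count(distinct e2.equipment_id) as count_of_equipment_at_least_this_age
    from equipment_table e1
    join equipment_table e2
      on e2.equipment_type = e1.equipment_type
     and e2.equipment_age >= e1.equipment_age
    group by e1.equipment_type, e1.equipment_age
    order by e1.equipment_type, e1.equipment_age
    """
).fetchall()

for row in rows:
    print(row)
```

For age 2 the three rows on the e1 side each match four rows on the e2 side, giving 12 pairs but only 4 distinct IDs.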
GROUP BY restricts the scope of COUNT to the rows in the group, i.e. it will not let you reach other rows (rows with equipment_age greater than that of the current group). So you need a subquery or windowing functions to get those. One way:
SELECT
equipment_type,
equipment_age,
(Select COUNT(*)
from equipment_table cnt
where cnt.equipment_type = a.equipment_type
AND cnt.equipment_age >= a.equipment_age
) as count_of_equipments
FROM equipment_table a
GROUP BY 1, 2, 3
I am not sure if your environment supports this syntax, though. If not, let us know and we will find another way.

duplication of rows in table

I have a table in which many rows are the same except for the id column. How can I show only one row for each set of duplicates?
id name roll_number
1 a 1
2 b 2
3 a 1
4 b 2
5 c 3
6 d 4
7 d 4
show output like this
id name roll_number
1 a 1
2 b 2
5 c 3
6 d 4
We can use DISTINCT ON here:
SELECT DISTINCT ON (name) id, name, roll_number
FROM yourTable
ORDER BY name, id;
This query is selecting one record with the lowest id from each group of records having the same name.
Simple aggregation using min
select Min(id), name,roll_number
from t
group by name, roll_number
If the values are already loaded into a NumPy array, you could use the numpy.unique function:
>>> import numpy as np
>>> np.unique(array)
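This problem requires filtering out tuples during the projection, based on groups. The solution is to use DISTINCT ON.
SELECT DISTINCT ON (name, roll_number) id, name, roll_number
FROM yourTable
ORDER BY name, roll_number, id;
It creates groups by the attributes within DISTINCT ON and outputs the first tuple of each group according to the ORDER BY, here the one with the lowest id. The DISTINCT ON expressions must match the leftmost ORDER BY expressions; without an ORDER BY, the tuple chosen from each group is unpredictable.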
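DISTINCT ON is specific to PostgreSQL. The MIN(id) aggregation shown earlier is portable, and a quick sqlite3 sketch on the sample rows confirms it returns the expected output. The table name t is taken from that answer and the column types are guesses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id int, name text, roll_number int)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        (1, "a", 1),
        (2, "b", 2),
        (3, "a", 1),
        (4, "b", 2),
        (5, "c", 3),
        (6, "d", 4),
        (7, "d", 4),
    ],
)

# One row per (name, roll_number) pair, keeping the smallest id of each group.
rows = conn.execute(
    """
    select min(id) as id, name, roll_number
    from t
    group by name, roll_number
    order by min(id)
    """
).fetchall()

for row in rows:
    print(row)
```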

Fill Users table with data using percentages from another table

I have a Table Users (it has millions of rows)
Id Name Country Product
+----+---------------+---------------+--------------+
1 John Canada
2 Kate Argentina
3 Mark China
4 Max Canada
5 Sam Argentina
6 Stacy China
...
1000 Ken Canada
I want to fill the Product column with A, B or C based on percentages.
I have another table called CountriesStats like the following
Id Country A B C
+-----+---------------+--------------+-------------+----------+
1 Canada 60 20 20
2 Argentina 35 45 20
3 China 40 10 50
This table holds the percentage of people with each product. For example in Canada 60% of people have product A, 20% have product B and 20% have product C.
I would like to fill the Users table with data based on the percentages in the second table. So, for example, if there are 1 million users in Canada, I would like to fill 600000 rows of the Product column in the Users table with A, 200000 with B and 200000 with C.
Thanks for any help on how to do that. I do not mind doing it in multiple steps; I just need hints on how I can achieve that in SQL.
The logic behind this is not too difficult. Assign a sequential counter to each person in each country. Then, using this value, assign the correct product based on this value. For instance, in your example, when the number is less than or equal to 600,000 then 'A' gets assigned. For 600,001 to 800,000 then 'B', and finally 'C' to the rest.
The following SQL accomplishes this:
with toupdate as (
select u.*,
row_number() over (partition by country order by newid()) as seqnum,
count(*) over (partition by country) as tot
from users u
)
update u
set product = (case when seqnum <= tot * A / 100 then 'A'
when seqnum <= tot * (A + B) / 100 then 'B'
else 'C'
end)
from toupdate u join
CountriesStats cs
on u.country = cs.country;
The with clause defines an updatable subquery with the sequence number and total for each country, on each row. This is a nice feature of SQL Server, but is not supported in all databases.
The from clause joins back to the CountriesStats table to get the needed values for each country. And the case expression does the necessary logic.
Note that the sequential number is assigned randomly, using newid(), so the products should be assigned randomly through the initial table.
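The same logic can be tried out on a small data set with Python's sqlite3 module. This is a sketch under stated assumptions, not the SQL Server statement itself: SQLite has neither newid() nor updatable CTEs, so random() and a temporary table (named assigned here) are swapped in, and the 20 users per country are made up. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Users (Id integer primary key, Name text, Country text, Product text)"
)
conn.execute("create table CountriesStats (Country text, A int, B int, C int)")
conn.executemany(
    "insert into CountriesStats values (?, ?, ?, ?)",
    [("Canada", 60, 20, 20), ("Argentina", 35, 45, 20), ("China", 40, 10, 50)],
)
# Hypothetical users: 20 per country.
conn.executemany(
    "insert into Users (Name, Country) values (?, ?)",
    [(f"user{i}", c) for c in ("Canada", "Argentina", "China") for i in range(20)],
)

# Step 1: number users randomly within each country and pick a product from
# the cumulative percentages. Materialized once so the random order is fixed.
conn.execute(
    """
    create temp table assigned as
    with numbered as (
        select Id, Country,
               row_number() over (partition by Country order by random()) as seqnum,
               count(*) over (partition by Country) as tot
        from Users
    )
    select n.Id,
           case when n.seqnum <= n.tot * cs.A / 100 then 'A'
                when n.seqnum <= n.tot * (cs.A + cs.B) / 100 then 'B'
                else 'C'
           end as Product
    from numbered n
    join CountriesStats cs on cs.Country = n.Country
    """
)

# Step 2: copy the assignment back onto Users.
conn.execute(
    "update Users set Product = "
    "(select a.Product from assigned a where a.Id = Users.Id)"
)

counts = conn.execute(
    """
    select Country, Product, count(*)
    from Users
    group by Country, Product
    order by Country, Product
    """
).fetchall()

for row in counts:
    print(row)
```

Which users receive which product changes from run to run, but the per-country counts are fixed by the percentages. Integer division rounds the thresholds down, so any remainder ends up in C.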