Using self join to find duplicates in SQL

I know that there are other questions like this. However, my question is about why the query I am using is not returning the expected results. To give context, I have a single table with 113 columns. Only 4 really matter: acct, year, qtr, and cnty (county). The table is a list of employers by establishment, so an employer can appear more than once; one person owning 12 Starbucks locations is the best example. What I am looking for is a query that shows when the same acct value has different cnty values. The query below runs without error but returns far too much: it also shows rows where the acct value is the same and the cnty value is the same as well. Any thoughts on why it returns too much?
select distinct t1.acct, t1.year, t1.qtr, t1.cnty
from dbo.table t1 join dbo.table t2 on t1.acct=t2.acct
where (t1.cnty <> t2.cnty)
order by t1.acct, t1.year, t1.qtr, t1.cnty
Intended result
acct year qtr cnty
1234567777 2007 4 7
1234567777 2008 1 9
1234567890 2006 4 31
1234567890 2007 1 3
2345678901 2006 4 7
2345678901 2007 2 1

Is this what you want?
select distinct t.acct, t.year, t.qtr, t.cnty
from (select t.*, min(cnty) over (partition by acct, year, qtr) as min_cnty,
max(cnty) over (partition by acct, year, qtr) as max_cnty
from dbo.table t
) t
where min_cnty <> max_cnty;
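To see what the min/max window comparison keeps, here is a minimal sketch that runs the same query shape in SQLite through Python's sqlite3 module. The table name employers and the sample rows are made up for illustration, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding only the four columns that matter.
conn.execute("CREATE TABLE employers (acct INTEGER, year INTEGER, qtr INTEGER, cnty INTEGER)")
conn.executemany(
    "INSERT INTO employers VALUES (?, ?, ?, ?)",
    [
        (1234567777, 2007, 4, 7),   # same acct and quarter, two counties
        (1234567777, 2007, 4, 9),
        (1234567890, 2006, 4, 31),  # same acct and quarter, one county
        (1234567890, 2006, 4, 31),
    ],
)

query = """
SELECT DISTINCT acct, year, qtr, cnty
FROM (SELECT e.*,
             MIN(cnty) OVER (PARTITION BY acct, year, qtr) AS min_cnty,
             MAX(cnty) OVER (PARTITION BY acct, year, qtr) AS max_cnty
      FROM employers e) AS w
WHERE min_cnty <> max_cnty
ORDER BY acct, year, qtr, cnty
"""
# Only partitions whose smallest and largest county differ survive the filter.
mixed_county_rows = conn.execute(query).fetchall()
for row in mixed_county_rows:
    print(row)
```

Partitioning by acct alone would instead compare counties across all quarters of an account; which of the two is wanted depends on how "different cnty values" is meant.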

Related

LAG function alternative. I need the results for the missing year in between

I have this table so far. However, I would like to obtain a result for 2019, for which there are no records, so its count should be 0. Are there any alternatives to the LAG function?
ID Year Year_Count
1 2018 10
1 2020 20
Whenever I use the LAG function in SQL it gives me the results for 2018. However, I would like to get 0 for 2019 and then 10 for 2018.
LAG(YEAR_COUNT) OVER (PARTITION BY ID ORDER BY YEAR) AS previous_year_count

untested notepad scribble
CASE
WHEN 1 = YEAR - LAG(YEAR) OVER (PARTITION BY ID ORDER BY YEAR)
THEN LAG(YEAR_COUNT) OVER (PARTITION BY ID ORDER BY YEAR)
ELSE 0
END AS previous_year_count
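As a quick check of the CASE expression above, the following sketch runs it in SQLite via Python's sqlite3 module. The table name records and the extra 2021 row are assumptions added for illustration, so that one row has a true previous year. Note that this only changes the value on rows that exist; it does not create a row for 2019.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, year INTEGER, year_count INTEGER)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(1, 2018, 10), (1, 2020, 20), (1, 2021, 5)],  # the 2021 row is hypothetical
)

query = """
SELECT id, year, year_count,
       CASE
           WHEN year - LAG(year) OVER (PARTITION BY id ORDER BY year) = 1
           THEN LAG(year_count) OVER (PARTITION BY id ORDER BY year)
           ELSE 0  -- first row, or a gap of more than one year
       END AS previous_year_count
FROM records
ORDER BY id, year
"""
lag_rows = conn.execute(query).fetchall()
for row in lag_rows:
    print(row)
```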

I'll add on to Nick's comment here with an example.
The YEARS CTE creates the table of years he suggested, and the RECORDS CTE matches the table posted above. The two are then joined, with COALESCE filling in the null values left by the LEFT JOIN (I filled ID with 0; not sure what your case would be).
You need to LEFT JOIN from the YEARS table and select YEAR from the YEARS table in the final query; otherwise you'd end up with only 2018 and 2020, or with those years plus some null values.
WITH
YEARS AS
(
SELECT 2016 AS YEAR UNION ALL
SELECT 2017 UNION ALL
SELECT 2018 UNION ALL
SELECT 2019 UNION ALL
SELECT 2020 UNION ALL
SELECT 2021 UNION ALL
SELECT 2022
)
,
RECORDS AS
(
SELECT 1 ID, 2018 YEAR, 10 YEAR_COUNT UNION ALL
SELECT 1, 2020, 20)
SELECT
COALESCE(ID, 0) AS ID,
Y.YEAR,
COALESCE(YEAR_COUNT, 0) AS YEAR_COUNT
FROM YEARS AS Y
LEFT JOIN RECORDS AS R
ON R.YEAR = Y.YEAR
Here is the dbfiddle so you can visualize: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=9e777ad925b09eb8ba299d610a78b999
Vertica SQL is not an available test environment there, so this may not work directly but should at least get you on the right track.
The LAG function would not work to get 2019 for a few reasons:
It's a window function and can only read from rows that exist in the data. The default offset for LAG is 1, so your call is equivalent to LAG(YEAR_COUNT, 1).
Expressions in the SELECT list typically can't add rows to a result; you need to bring in the missing rows with JOINs.
If 2019 does exist in an upstream table and you're using GROUP BY to get the year count, it's possible that a WHERE clause is excluding the data.
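Putting the two ideas together, one possible sketch is to fill the missing years with a join first and only then apply LAG, so the previous row really is the previous year. This version runs in SQLite through Python's sqlite3 module; it generates the year list with a recursive CTE rather than the hard-coded UNION ALL list, and the 2017 to 2021 range and the records table name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, year INTEGER, year_count INTEGER)")
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", [(1, 2018, 10), (1, 2020, 20)])

query = """
WITH RECURSIVE years(year) AS (
    SELECT 2017                      -- assumed start of the range
    UNION ALL
    SELECT year + 1 FROM years WHERE year < 2021
),
filled AS (
    -- one row per id and year, with 0 where no record exists
    SELECT i.id, y.year, COALESCE(r.year_count, 0) AS year_count
    FROM (SELECT DISTINCT id FROM records) AS i
    CROSS JOIN years AS y
    LEFT JOIN records AS r ON r.id = i.id AND r.year = y.year
)
SELECT id, year, year_count,
       LAG(year_count, 1, 0) OVER (PARTITION BY id ORDER BY year) AS previous_year_count
FROM filled
ORDER BY id, year
"""
filled_rows = conn.execute(query).fetchall()
for row in filled_rows:
    print(row)
```

With the gaps filled, 2019 appears with a count of 0 and a previous-year count of 10, and 2020 gets a previous-year count of 0.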

Excluding the data with two consecutive conditions in Sql

I have a table that looks like this.
Numbers No ZNo Place Year AId ABC
2 201905190611122 9208363 A/C/T/0/434 2019 4BBA17BB-01A9-41A6-BFA7-004CA0E6686F 1448
2 201802262493590 9208363 A/C/T/0/434 2018 4A895857-4E51-4ADC-836A-22D04E5D0B62 2008
1 20180119827875 9208364 A/C/T/0/435 2018 89BFD858-92AC-463B-91DF-54C22FDF7517 1150
1 20180119827875 9208365 A/C/T/0/436 2018 89BFD858-92AC-463B-91DF-54C22FDF7517 1150
2 201804273541023 9208366 A/C/T/0/437 2018 B01EFCA6-8397-4FA9-9EAD-13BE985D63DD 1348
2 201905197566364 9208366 A/C/T/0/437 2019 43E3D908-4AAD-4832-9981-115A5F9E9FC3 1466
2 201802084364285 9208367 A/C/T/0/438 2018 20BB4E90-6F59-484E-ADD3-5635F7CAACC3 1138
2 201802091458406 9208367 A/C/T/0/438 2018 E9085238-8437-4628-A125-09E5C811AB8D 1248
I want to write a query that first finds rows with the same "Place" value and then checks the "Year" column. If the year values are the same for the same place, the rows are kept. So the result should basically look like this:
Numbers No ZNo Place Year AId ABC
2 201802084364285 9208367 A/C/T/0/438 2018 20BB4E90-6F59-484E-ADD3-5635F7CAACC3 1138
2 201802091458406 9208367 A/C/T/0/438 2018 E9085238-8437-4628-A125-09E5C811AB8D 1248
Can you help me with this?

Have a derived table (the subquery) returning place/year combinations that exist at least twice. JOIN its result:
select t.*
from tablename t
join (select Place, Year
from tablename
group by Place, Year
having count(*) >= 2) dt
on t.place = dt.place and t.year = dt.year

You can use NOT EXISTS:
select t.*
from table t
where not exists (select 1 from table t1 where t1.Place = t.Place and t1.year <> t.year);
This way you will also get A/C/T/0/435 and A/C/T/0/436, because those places have no rows for other years either. To drop them, also require a second row for the same place and year, told apart by another column. In the sample data, No differs between rows that share a place:
select t.*
from table t
where not exists (select 1
from table t1
where t1.Place = t.Place and t1.year <> t.year
)
and exists (select 1
from table t1
where t1.Place = t.Place and t1.year = t.year and t1.no <> t.no
);

You can use window functions:
select t.*
from (select t.*, count(*) over (partition by place, year) as cnt
from t
) t
where cnt >= 2;
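For a concrete run of the counting approach, this sketch loads three columns of the sample data into SQLite (via Python's sqlite3 module) and applies both the window-count query and the GROUP BY/HAVING join. The table name places and the column name record_no (standing in for No) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (record_no TEXT, place TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [
        ("201905190611122", "A/C/T/0/434", 2019),
        ("201802262493590", "A/C/T/0/434", 2018),
        ("20180119827875", "A/C/T/0/435", 2018),
        ("20180119827875", "A/C/T/0/436", 2018),
        ("201804273541023", "A/C/T/0/437", 2018),
        ("201905197566364", "A/C/T/0/437", 2019),
        ("201802084364285", "A/C/T/0/438", 2018),
        ("201802091458406", "A/C/T/0/438", 2018),
    ],
)

window_query = """
SELECT record_no, place, year
FROM (SELECT p.*, COUNT(*) OVER (PARTITION BY place, year) AS cnt
      FROM places p) AS counted
WHERE cnt >= 2
ORDER BY record_no
"""
having_query = """
SELECT p.record_no, p.place, p.year
FROM places p
JOIN (SELECT place, year
      FROM places
      GROUP BY place, year
      HAVING COUNT(*) >= 2) AS dt
  ON p.place = dt.place AND p.year = dt.year
ORDER BY p.record_no
"""
window_rows = conn.execute(window_query).fetchall()
having_rows = conn.execute(having_query).fetchall()
print(window_rows)
print(having_rows)
```

Both queries keep only the two A/C/T/0/438 rows, the one place/year combination that occurs at least twice.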

Case Statement for multiple criteria

I would like to ignore some of the results of my query because, for all intents and purposes, they are duplicates. Because of the way the request was made we need to use this hierarchy, and although we are seeing different Company_Name values, we need to ignore one of the results.
Query:
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
2
ORDER BY
3 ASC, 2 ASC
This code omits half a dozen joins and WHERE clauses that are not germane to this question.
Results:
Customer_Name_Count Company_Name Total_Sales
-------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 6 Jimmy's Restaurant 1,500
4 9 Impala Hotel 2,000
5 12 Sports Drink 2,500
In the above set, we can see that numbers 2 & 3 have the same count and the same total_sales number and similar company names. Is there a way to create a case statement that takes these 3 factors into consideration and then drops one or the other for Jimmy's enterprises? The other issue is that this has to be variable as there are other instances where this happens. And I would only want this to happen if the count and sales number match each other with a similar name in the company name.
Desired result:
Customer_Name_Count Company_Name Total_Sales
--------------------------------------------------------------
1 3 Blockbuster 1,000
2 6 Jimmy's Bar 1,500
3 9 Impala Hotel 2,000
4 12 Sports Drink 2,500

It looks like the other answers are accurate, assuming the Company_IDs are the same for both.
If the Company_IDs are different for Jimmy's Bar and Jimmy's Restaurant, you can use something like this. I suggest you get functional users involved and do some data clean-up; otherwise you'll be maintaining this every time the issue arises:
SELECT
COUNT(DISTINCT CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END) AS Customer_Name_Count
,CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END AS Company_Name
,SUM(A12.Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY CASE
WHEN A12.Company_Name = 'Name2' THEN 'Name1'
ELSE A12.Company_Name
END

Your problem is that the joins you are using are multiplying the number of rows. Somewhere along the way, multiple names are associated with exactly the same entity (which is why the numbers are the same). You can fix this by aggregating by the right id:
SELECT COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
MAX(Company_Name) as Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM some_table AS A12
GROUP BY Company_id -- I'm guessing the column is something like this
ORDER BY 3 ASC, 2 ASC;
This might actually overstate the sales (I don't know). Better would be fixing the join so it only returned one name. One possibility is that it is a type-2 dimension, meaning that there is a time component for values that change over time. You may need to restrict the join to a single time period.

You need a function that returns a common name for the companies, and then use DISTINCT:
SELECT DISTINCT
Customer_Name_Count,
dbo.GetCommonName(Company_Name) as Company_Name,
Total_Sales
FROM dbo.theTable

You can use the ROW_NUMBER window function to number rows within each (Customer_Name_Count, Total_Sales) group, then keep only rn = 1:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Customer_Name_Count, Total_Sales ORDER BY Company_Name) rn
FROM (
SELECT
COUNT(DISTINCT A12.Company_name) AS Customer_Name_Count,
Company_Name,
SUM(Total_Sales) AS Total_Sales
FROM
some_table AS A12
GROUP BY
Company_Name
) t1
) t2
WHERE rn = 1
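Here is a small sketch of the ROW_NUMBER step on its own, run in SQLite through Python's sqlite3 module. It starts from the already aggregated figures shown in the question, stored in a hypothetical results table, since the joins that produce them are not part of the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (customer_name_count INTEGER, company_name TEXT, total_sales INTEGER)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        (3, "Blockbuster", 1000),
        (6, "Jimmy's Bar", 1500),
        (6, "Jimmy's Restaurant", 1500),
        (9, "Impala Hotel", 2000),
        (12, "Sports Drink", 2500),
    ],
)

query = """
SELECT customer_name_count, company_name, total_sales
FROM (SELECT r.*,
             ROW_NUMBER() OVER (PARTITION BY customer_name_count, total_sales
                                ORDER BY company_name) AS rn
      FROM results r) AS ranked
WHERE rn = 1          -- keep the alphabetically first name per (count, sales) pair
ORDER BY total_sales, company_name
"""
deduped = conn.execute(query).fetchall()
for row in deduped:
    print(row)
```

This keeps Jimmy's Bar and drops Jimmy's Restaurant, but it never compares the names themselves: two unrelated companies that happen to share a count and a sales total would be collapsed as well.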

How to make a single line query include multiple lines in Oracle

I would like to take a set of data and expand it by adding date rows based on an existing field. For instance, if I have the following table (TABLE1):
ID NAME YEAR
1 John 2001
2 Jim 2012
3 Sally 2005
I want to take this data and put it into another table but expand it to include a set of months (and from there I can add monthly information). If I just look at the first record (John) my result would be:
ID NAME YEAR MONTH
1 John 2001 01-JAN-2001
1 John 2001 01-FEB-2001
1 John 2001 01-MAR-2001
...
1 John 2001 01-DEC-2001
I have the mechanism to derive my monthly dates, but how do I extract the data from TABLE1 to make TABLE2? Here is a quick query but, of course, I get ORA-01427 (single-row subquery returns more than one row), as expected. I'm just not sure how to organize the query to put these two pieces together:
select id,
name,
year,
book_cd,
(SELECT ADD_MONTHS('01-JAN-'|| year, LEVEL - 1)
FROM DUAL CONNECT BY LEVEL <= 12) month
from table1 ;
I realize I can't do this, but I'm not sure how to put the two pieces together. I plan to bulk process records, so it won't be one ID at a time. Thanks for the help.

You can use a cross join:
select t.id,
t.name,
t.year,
t.book_cd,
ADD_MONTHS(to_date(t.year || '-01-01', 'YYYY-MM-DD'), m.rn) as mnth
from table1 t
cross join (select rownum - 1 as rn
from dual
connect by rownum <= 12) m
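The same cross-join idea can be tried outside Oracle. The sketch below is a SQLite translation run through Python's sqlite3 module, not the Oracle syntax: a recursive CTE stands in for CONNECT BY, SQLite's date() modifier stands in for ADD_MONTHS, and book_cd is left out because the sample TABLE1 does not show it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, "John", 2001), (2, "Jim", 2012), (3, "Sally", 2005)],
)

query = """
WITH RECURSIVE months(n) AS (
    SELECT 0                          -- month offsets 0..11
    UNION ALL
    SELECT n + 1 FROM months WHERE n < 11
)
SELECT t.id, t.name, t.year,
       date(t.year || '-01-01', '+' || m.n || ' months') AS month_start
FROM table1 AS t
CROSS JOIN months AS m                -- every row paired with every offset
ORDER BY t.id, m.n
"""
expanded = conn.execute(query).fetchall()
for row in expanded[:12]:
    print(row)
```

Each of the three source rows is paired with all twelve offsets, so the whole table expands in one statement rather than one ID at a time.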

Sum values in one column and add to another table

My table (BOB) looks like this:
Year Month Value
2010 1 100
2010 2 100
2010 3 100
2010 4 100
2010 5 100
I would like to add YTD values to another table (BOB2). More exactly, I want the BOB2 table to look like this:
Year Month Value
2010 1 100
2010 2 200
2010 3 300
2010 4 400
2010 5 500

See the query below. I have simplified it.
select
t1.year,
t1.month,
sum(t2.value) as value
from bob t1
join bob t2 on
t2.year = t1.year and t2.month <= t1.month
group by
t1.year, t1.month;
The query uses t1 as the base table and joins each row to every row of the same year up to and including its month, then sums the joined t2 values, so each period gets the amounts prior to it plus the current amount. I haven't added in all the edge cases, but this gives you something to start from. Because the join matches on year, the total restarts each year, which is what a YTD value needs.

Well, I think I understand what you are trying to do, but if not, please rephrase your question. You can accomplish what you asked with the following SQL.
--INSERT INTO BOB2 (Year, Month, Value)
SELECT a.Year, a.Month, (SELECT SUM(b.Value)
FROM BOB b
WHERE b.Year = a.Year AND b.Month <= a.Month) as RunningTotalValue
FROM BOB a
ORDER BY a.Year, a.Month;
EDIT: Changed the ID column to "Month" after seeing the edit to your post.
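A year-to-date total can be sketched two ways: with a self-join, and with a windowed SUM, which is an alternative not used in the answers above. The example runs both in SQLite through Python's sqlite3 module. The sample values are deliberately not all 100 and span two years (both assumptions), so a wrong sum or a missing yearly reset would show up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bob (year INTEGER, month INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO bob VALUES (?, ?, ?)",
    [(2010, 1, 100), (2010, 2, 150), (2010, 3, 50), (2011, 1, 70), (2011, 2, 30)],
)

self_join_query = """
SELECT t1.year, t1.month, SUM(t2.value) AS ytd_value
FROM bob AS t1
JOIN bob AS t2 ON t2.year = t1.year AND t2.month <= t1.month
GROUP BY t1.year, t1.month
ORDER BY t1.year, t1.month
"""
window_query = """
SELECT year, month,
       SUM(value) OVER (PARTITION BY year ORDER BY month) AS ytd_value
FROM bob
ORDER BY year, month
"""
ytd_self_join = conn.execute(self_join_query).fetchall()
ytd_window = conn.execute(window_query).fetchall()
for row in ytd_window:
    print(row)
```

Either SELECT could feed an INSERT INTO BOB2; the window version scans the table once, while the self-join pairs every row with all earlier rows of its year.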
