SQL query to remove these duplicates

I have no experience in SQL and no other resource to turn to for identifying duplicates.
The problem: A SER_NO(serial #) can show up on 2 different dates (EVENT_TS), and I only want to see the first occurrence the SER_NO shows up.
If I have the choice, I would keep the date that the SER_NO showed up on, and not any other date after, but at this point, I just don't want to see duplicate SER_NO
I went the SELECT DISTINCT route and that doesn't help... I need to identify if the SER_NO occurs more than once, and then if it does, I aim to keep the first occurrence (MIN DATE).
SELECT
EVENT_TS, EVENT_NO, FAC_PROD_FAM_CD, SER_NO, DISC_AREA_ID, DISC_AREA_DESC,
QUALITY_VELOCITY, CMPNT_SERIAL_NO, PROTOTYPE_IND, EXT_CPY_STAT
FROM ABUS_DW.V_BIQ_R8_QWB_EVENTS
WHERE
(FAC_PROD_FAM_CD='ACOM' OR FAC_PROD_FAM_CD='SCOM' OR FAC_PROD_FAM_CD='LAP' OR
FAC_PROD_FAM_CD='RM' OR FAC_PROD_FAM_CD='SCRD')
AND (DISC_AREA_ID='400' OR DISC_AREA_ID='450')
AND PROTOTYPE_IND<>'Y' AND EXT_CPY_STAT<>'D'
AND EVENT_TS>=<Parameters.Start Date> ORDER BY EVENT_TS
Also, I am doing this in Tableau's Custom SQL Query feature... which.. without knowing anything about SQL or the basic syntax... seems to not like any fancy tricks. Maybe it does... I don't know... But all I've gotten are errors using other people's scripts. It seems very specific on the syntax it wants to see.

Assuming that every other field you're selecting on (besides the date) is also duplicated, you can use some aggregation in your query to squish the output records together, using either the min(event_ts) or max(event_ts) depending on what you want to see:
SELECT MIN(EVENT_TS) as EVENT_TS
,EVENT_NO
,FAC_PROD_FAM_CD
,SER_NO
,DISC_AREA_ID
,DISC_AREA_DESC
,QUALITY_VELOCITY
,CMPNT_SERIAL_NO
,PROTOTYPE_IND
,EXT_CPY_STAT
FROM ABUS_DW.V_BIQ_R8_QWB_EVENTS
WHERE FAC_PROD_FAM_CD IN ('ACOM', 'SCOM', 'LAP', 'RM', 'SCRD')
AND DISC_AREA_ID IN ('400','450')
AND PROTOTYPE_IND <> 'Y'
AND EXT_CPY_STAT <> 'D'
AND EVENT_TS >= <Parameters.Start Date>
GROUP BY
EVENT_NO
,FAC_PROD_FAM_CD
,SER_NO
,DISC_AREA_ID
,DISC_AREA_DESC
,QUALITY_VELOCITY
,CMPNT_SERIAL_NO
,PROTOTYPE_IND
,EXT_CPY_STAT
ORDER BY EVENT_TS
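As a rough illustration of the aggregation approach, here is a minimal sketch using Python's built-in sqlite3 module in place of the real warehouse. The `events` table, its three columns, and the sample rows are made up for the example; only the MIN plus GROUP BY pattern comes from the query above.

```python
import sqlite3


def first_seen_per_serial(rows):
    """Collapse rows that differ only in event_ts, keeping the earliest one."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical cut-down table: only three of the real columns.
    conn.execute(
        "CREATE TABLE events (event_ts TEXT, ser_no TEXT, disc_area_id TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT MIN(event_ts) AS first_ts, ser_no, disc_area_id
        FROM events
        GROUP BY ser_no, disc_area_id
        ORDER BY first_ts
        """
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("2020-01-05", "A1", "400"),
    ("2020-01-02", "A1", "400"),  # same serial, earlier date
    ("2020-01-03", "B2", "450"),
]
deduplicated = first_seen_per_serial(sample_rows)
```

This only collapses the duplicates because every non-date column matches within a serial number; if any grouped column differed, both rows would survive.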
If the duplicate rows for a SER_NO have different values in the other columns (EVENT_NO, DISC_AREA_ID, etc.), the GROUP BY above will not collapse them and you'll have to get fancier. You could use a correlated subquery as one option:
SELECT EVENT_TS
,EVENT_NO
,FAC_PROD_FAM_CD
,SER_NO
,DISC_AREA_ID
,DISC_AREA_DESC
,QUALITY_VELOCITY
,CMPNT_SERIAL_NO
,PROTOTYPE_IND
,EXT_CPY_STAT
FROM ABUS_DW.V_BIQ_R8_QWB_EVENTS AS t1
WHERE FAC_PROD_FAM_CD IN ('ACOM', 'SCOM', 'LAP', 'RM', 'SCRD')
AND DISC_AREA_ID IN ('400','450')
AND PROTOTYPE_IND <> 'Y'
AND EXT_CPY_STAT <> 'D'
AND EVENT_TS =
(
SELECT MIN(EVENT_TS)
FROM ABUS_DW.V_BIQ_R8_QWB_EVENTS
WHERE FAC_PROD_FAM_CD IN ('ACOM', 'SCOM', 'LAP', 'RM', 'SCRD')
AND DISC_AREA_ID IN ('400','450')
AND PROTOTYPE_IND <> 'Y'
AND EXT_CPY_STAT <> 'D'
AND EVENT_TS >= <Parameters.Start Date>
AND SER_NO = t1.SER_NO
)
ORDER BY EVENT_TS;
Here, in a subquery, we get the min(event_ts) for each ser_no given all the same WHERE conditions, and then we restrict the main query to the row with that min(event_ts).
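A small sketch of the correlated-subquery idea, again run through Python's sqlite3 rather than the real database. It is keyed on the serial number, since that is the column the question wants de-duplicated; the `events` table and its rows are invented for illustration.

```python
import sqlite3


def earliest_row_per_serial(rows):
    """Keep, for each ser_no, the full row carrying its earliest event_ts."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical cut-down table; event_no differs between duplicates,
    # so a plain GROUP BY over all columns would not collapse them.
    conn.execute(
        "CREATE TABLE events (event_ts TEXT, event_no INTEGER, ser_no TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT event_ts, event_no, ser_no
        FROM events AS t1
        WHERE event_ts = (
            SELECT MIN(event_ts)
            FROM events AS t2
            WHERE t2.ser_no = t1.ser_no
        )
        ORDER BY event_ts
        """
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("2020-01-05", 11, "A1"),
    ("2020-01-02", 10, "A1"),  # first occurrence of A1
    ("2020-01-03", 12, "B2"),
]
first_rows = earliest_row_per_serial(sample_rows)
```

If two rows for the same serial number share the exact same earliest timestamp, both are returned, so a tie-breaker would still be needed in that case.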

Related

SQL BigQuery - Error that variable is not grouped by even though it is

SQL Code:
SELECT community_table.community_name,
community_table.id,
DATE(timestamp) as date,
ifnull(COUNT(distinct app_opened.user_id), 0) as num_opened_DAU,
lag(COUNT(distinct app_opened.user_id)) OVER
(ORDER BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
FROM *** app_opened
LEFT JOIN (
SELECT DISTINCT id, community_id_2, context_traits_first_name, context_traits_last_name
FROM (
SELECT *
FROM ***,
UNNEST (JSON_EXTRACT_ARRAY(context_traits_community_ids, "$")) as community_id_2
)
GROUP by community_id_2, id, context_traits_first_name, context_traits_last_name) as community_id_table
ON community_id_table.id = app_opened.user_id
LEFT JOIN (
SELECT DISTINCT id, name as community_name
FROM ***) as community_table
ON TO_JSON_STRING(community_table.id) = community_id_table.community_id_2
WHERE app_opened.user_id is not null AND
EXTRACT(DAYOFWEEK FROM DATE(timestamp)) = 2 AND
community_table.community_name is not null
GROUP BY community_table.community_name, community_table.id, DATE(timestamp)
Error Message:
I am quite confused on what could be going wrong here, as the error says that timestamp is not grouped, even though I have grouped it at the bottom. I tried including just timestamp rather than Date(timestamp), but that ruins the table data that I am trying to create, where I find the number of users on a single day. Does anyone have any other ideas? My goal is for a single row, get the previous row's data, but because I am grouping by specific metrics, I need to make sure they are ordered by them as well. Thank you so much!

I think you simply need to modify OVER part as:
OVER (PARTITION BY community_table.community_name, community_table.id, DATE(timestamp)) as pre_Value
UPDATE: It seems the problem was caused by using the DATE() function within OVER, so it can be solved by computing DATE(timestamp) inside a subquery and passing the alias to OVER.
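The subquery workaround could look roughly like the sketch below. It runs on SQLite through Python's sqlite3 (window functions need SQLite 3.25 or newer), not on BigQuery, and the `app_opened` table, its columns, and the choice to partition by community and order by day are assumptions made for the example.

```python
import sqlite3


def daily_users_with_previous(rows):
    """Count distinct users per community and day, plus the previous day's count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE app_opened (community TEXT, user_id INTEGER, ts TEXT)"
    )
    conn.executemany("INSERT INTO app_opened VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT community, day, num_opened,
               LAG(num_opened) OVER (
                   PARTITION BY community ORDER BY day
               ) AS prev_value
        FROM (
            -- The date is computed once here and exposed as the alias "day",
            -- so the outer OVER clause never calls DATE() itself.
            SELECT community,
                   DATE(ts) AS day,
                   COUNT(DISTINCT user_id) AS num_opened
            FROM app_opened
            GROUP BY community, DATE(ts)
        ) AS daily
        ORDER BY community, day
        """
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("x", 1, "2021-03-01 08:00:00"),
    ("x", 2, "2021-03-01 09:00:00"),
    ("x", 1, "2021-03-01 10:00:00"),  # repeat visit, counted once
    ("x", 1, "2021-03-08 08:00:00"),
    ("y", 3, "2021-03-01 12:00:00"),
]
daily = daily_users_with_previous(sample_rows)
```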

Getting a query result taken from the same data but with temporary var

I got a simple thing to do.
Well, maybe not, but someone somewhere surely can help me out : P
I got a simple data structure that contains
expedition date
delivery date
transaction type
I would need to create a query which could
order the rows by a date specific to the transaction type.
(ie : using the expedition date for transaction of type "selling", and delivery date for transaction of type "purchasing")
I was wondering if there was a more efficient way to do this than
by fetching the same data twice with different WHERE clauses (while adding a column, dateTemp, used to order them) and then wrapping these two queries in another SELECT, to which I would add the ORDER BY clause on dateTemp.
--> the initial fetching I would do 2 times works on many tables(many, many, many joins)
Basically my current solution is :
Select * from
(
Select ...
date_exp as dateTemp
from ...
where conditions* And dateRelatedCondition
UNION
Select ...
date_livraison as dateTemp
from ...
Where conditions* And NOT(dateRelatedCondition)
) as comboSelect
Order By MIN(comboSelect.dateTemp)
OVER(PARTITION BY(REF_product)),
(REF_product),
comboSelect.dateTemp asc;
*
->Those conditions are the same in both inner Select query
Thank you for your time.

Without the UNION:
dateRelatedCondition should be removed from WHERE and put to the SELECT like:
CASE WHEN dateRelatedCondition THEN date_exp ELSE date_livraison END as dateTemp
Without the subquery:
in ORDER BY you need the same expression in the window function:
Order By MIN(CASE WHEN dateRelatedCondition THEN date_exp ELSE date_livraison END)
OVER(PARTITION BY(REF_product)),
(REF_product),
dateTemp asc
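To make the single-query version concrete, here is a hedged sketch run on SQLite via Python's sqlite3 (window functions need SQLite 3.25 or newer). The `transactions` table, its columns, and the condition `tx_type = 'selling'` standing in for dateRelatedCondition are all assumptions for the example.

```python
import sqlite3


def ordered_transactions(rows):
    """Order rows by each product's earliest relevant date, then by that date."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions "
        "(ref_product TEXT, tx_type TEXT, date_exp TEXT, date_livraison TEXT)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT ref_product,
               tx_type,
               CASE WHEN tx_type = 'selling'
                    THEN date_exp ELSE date_livraison END AS date_temp
        FROM transactions
        ORDER BY MIN(CASE WHEN tx_type = 'selling'
                          THEN date_exp ELSE date_livraison END)
                     OVER (PARTITION BY ref_product),
                 ref_product,
                 date_temp
        """
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("P2", "selling", "2022-01-10", "2022-01-20"),
    ("P1", "purchasing", "2022-01-01", "2022-01-15"),
    ("P2", "purchasing", "2022-01-02", "2022-01-05"),
    ("P1", "selling", "2022-01-12", "2022-01-30"),
]
ordered = ordered_transactions(sample_rows)
```

The underlying tables are read once, and the CASE expression picks the date per row instead of splitting the rows across two UNION branches.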

You mean like this?:
ORDER BY CASE
WHEN TransactionType = 'Selling' THEN ExpeditionDate
WHEN TransactionType = 'purchasing' THEN DeliveryDate
END

Calculated column syntax when using a group by function Teradata

I'm trying to include a column calculated as a % of OTYPE.
IE
Order type | Status | volume of orders at each status | % of all orders at this status
SELECT
T.OTYPE,
STATUS_CD,
COUNT(STATUS_CD) AS STATVOL,
(STATVOL / COUNT(ROW_ID)) * 100
FROM Database.S_ORDER O
LEFT JOIN /* Finding definitions for status codes & attaching */
(
SELECT
ROW_ID AS TYPEJOIN,
"NAME" AS OTYPE
FROM database.S_ORDER_TYPE
) T
ON T.TYPEJOIN = ORDER_TYPE_ID
GROUP BY (T.OTYPE, STATUS_CD)
/*Excludes pending and pending online orders */
WHERE CAST(CREATED AS DATE) = '2018/09/21' AND STATUS_CD <> 'Pending'
AND STATUS_CD <> 'Pending-Online'
ORDER BY T.OTYPE, STATUS_CD DESC
OTYPE            STATUS_CD             STATVOL  TOTALPERC
Add New Service  Provisioning            2,740        100
Add New Service  In-transit                 13        100
Add New Service  Error - Provisioning      568        100
Add New Service  Error - Integration         1        100
Add New Service  Complete               14,387        100
Current output just puts 100 at every line, need it to be a % of total orders
Could anyone help out a Teradata & SQL student?
The complication making this difficult is my understanding of the group by and count syntax is tenuous. It took some fiddling to get it displayed as I have it, I'm not sure how to introduce a calculated column within this combo.
Thanks in advance

There are a couple of places the total could be done, but this is the way I would do it. I also cleaned up your other subquery, which was not required, and changed the date to a non-ambiguous format (change it back if it causes an issue in Teradata).
SELECT
T."NAME" as OTYPE,
STATUS_CD,
COUNT(STATUS_CD) AS STATVOL,
COUNT(STATUS_CD)*100/TotalVol as Pct
FROM database.S_ORDER O
LEFT JOIN EDWPRDR_VW40_SBLCPY.S_ORDER_TYPE T on T.ROW_ID = ORDER_TYPE_ID
cross join (select count(*) as TotalVol from database.S_ORDER) Tot
WHERE CAST(CREATED AS DATE) = '2018-09-21' AND STATUS_CD <> 'Pending' AND STATUS_CD <> 'Pending-Online'
GROUP BY T."NAME", STATUS_CD, TotalVol
ORDER BY T."NAME", STATUS_CD DESC

A where clause comes before a group by clause, so the query shown in the question isn't valid.
Always prefix every column reference with the relevant table alias. Below, where you did not use an alias, I have assumed that the column belongs to the orders table.
You probably do not need a subquery for this left join. While there are times when a subquery is needed or good for performance, this does not appear to be the case here.
Most modern SQL compliant databases provide "window functions", and Teradata does do this. They are extremely useful, and here when you combine count() with an over clause you can get the total of all rows without needing another subquery or join.
Because there is neither sample data nor expected result provided with the question I do not actually know which numbers you really need for your percentage calculation. Instead I have opted to show you different ways to count so that you can choose the right ones. I suspect you are getting 100 for each row because the count(status_cd) is equal to the count(row_id). You need to count status_cd differently to how you count row_id. nb: The count() function increases by 1 for every non-null value
I changed the way your date filter is applied. It is not efficient to change data on every row to suit constants in a where clause. Leave the data untouched and alter the way you apply the filter to suit the data, this is almost always more efficient (search sargable)
SELECT
t."NAME" AS OTYPE
, o.STATUS_CD
, COUNT(o.STATUS_CD) count_status
, COUNT(t.ROW_ID) count_row_id
, SUM(COUNT(t.ROW_ID)) OVER() count_row_id_over
FROM dbo.S_ORDER o
LEFT JOIN dbo.S_ORDER_TYPE t ON t.ROW_ID = o.ORDER_TYPE_ID
/*Excludes pending and pending online orders */
WHERE o.CREATED >= '2018-09-21' AND o.CREATED < '2018-09-22'
AND o.STATUS_CD <> 'Pending'
AND o.STATUS_CD <> 'Pending-Online'
GROUP BY
t."NAME"
, o.STATUS_CD
ORDER BY
t."NAME"
, o.STATUS_CD DESC
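The point about count() only counting non-null values can be checked with a tiny sketch. It uses Python's sqlite3 with invented `orders` and `order_types` tables, so the names and rows are assumptions rather than the real schema.

```python
import sqlite3


def count_variants(orders, order_types):
    """Return COUNT(*), COUNT(status_cd) and COUNT(t.row_id) over a LEFT JOIN."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_type_id INTEGER, status_cd TEXT)")
    conn.execute("CREATE TABLE order_types (row_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    conn.executemany("INSERT INTO order_types VALUES (?, ?)", order_types)
    result = conn.execute(
        """
        SELECT COUNT(*),            -- every joined row
               COUNT(o.status_cd),  -- rows where status_cd is not NULL
               COUNT(t.row_id)      -- rows where the join found a type
        FROM orders AS o
        LEFT JOIN order_types AS t ON t.row_id = o.order_type_id
        """
    ).fetchone()
    conn.close()
    return result


sample_orders = [
    (1, "Complete"),
    (1, "Error"),
    (99, "Complete"),  # type 99 has no match, so t.row_id is NULL
    (99, "Error"),
    (1, None),         # status_cd is NULL
]
sample_types = [(1, "Add New Service")]
counts = count_variants(sample_orders, sample_types)
```

When neither column contains NULLs, the two counts are identical in every group, which would explain a ratio of 100 on every row.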

As @TomC already noted, there's no need for the join to a Derived Table. The simplest way to get the percentage is based on a Group Sum. I also changed the date to a Standard SQL date literal and moved the WHERE before the GROUP BY.
SELECT
t."NAME" AS OTYPE,
o.STATUS_CD,
Count(o.STATUS_CD) AS STATVOL,
-- rule of thumb: multiply first then divide, otherwise you will get unexpected results
-- (Teradata rounds after each calculation)
100.00 * STATVOL / Sum(STATVOL) Over () AS TOTALPERC
FROM database.S_ORDER AS O
/* Finding definitions for status codes & attaching */
LEFT JOIN database.S_ORDER_TYPE AS t
ON t.ROW_ID = o.ORDER_TYPE_ID
/*Excludes pending and pending online orders */
-- if o.CREATED is a Timestamp there's no need to apply the CAST
WHERE Cast(o.CREATED AS DATE) = DATE '2018-09-21'
AND o.STATUS_CD NOT IN ('Pending', 'Pending-Online')
GROUP BY t."NAME", o.STATUS_CD
ORDER BY t."NAME", o.STATUS_CD DESC
Btw, you probably don't need an Outer Join, Inner should return the same result.
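For anyone wanting to try the group-sum idea outside Teradata, here is a minimal sketch on SQLite via Python's sqlite3 (window functions need SQLite 3.25 or newer). The single flattened `orders` table and its rows are made up, and since SQLite cannot reuse a select alias the way Teradata does, the count is written out as SUM(COUNT(*)) OVER ().

```python
import sqlite3


def status_percentages(rows):
    """Return each (otype, status) count and its share of all counted rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table with the order type name already attached.
    conn.execute("CREATE TABLE orders (otype TEXT, status_cd TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT otype,
               status_cd,
               COUNT(*) AS statvol,
               -- multiply first, then divide by the total of all group counts
               100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct
        FROM orders
        GROUP BY otype, status_cd
        ORDER BY otype, status_cd
        """
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("Add New Service", "Complete"),
    ("Add New Service", "Complete"),
    ("Add New Service", "Complete"),
    ("Add New Service", "Provisioning"),
]
percentages = status_percentages(sample_rows)
```

Adding PARTITION BY otype inside OVER () would instead give each status as a share of its own order type.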

SQL Filter by Excluding Specific String

I am looking to run my query (below) by displaying latest value for "DATA_POINT_UPLOAD_DATA"."VALUE" , except 'READY'. Currently, it displays all 'READY' values, however, I want to do the opposite by displaying any values up to the time of execution except 'READY'.
Here is my current query:
select "DATA_POINT_UPLOAD_DATA"."LAST_UPDATED_TIMESTAMP" as "TIMESTAMP",
"DATA_POINT_UPLOAD_DATA"."VALUE" as "COMMENTS"
from "DB"."COMPONENT" "COMPONENT",
"DB"."COMPONENT_DATA_POINT" "COMPONENT_DATA_POINT",
"DB"."DATA_POINT_UPLOAD_DATA" "DATA_POINT_UPLOAD_DATA"
where "COMPONENT_DATA_POINT"."ID"="DATA_POINT_UPLOAD_DATA"."COMPONENT_DATA_POINT_ID"
and "COMPONENT"."ID"="COMPONENT_DATA_POINT"."COMPONENT_ID"
and "DATA_POINT_UPLOAD_DATA"."VALUE" ='READY'
and "DATA_POINT_UPLOAD_DATA"."LAST_UPDATED_TIMESTAMP" between ('01-JUN-17') and ('30-JUN-17')
and "COMPONENT_DATA_POINT"."NAME" ='StateOfItem'
and "COMPONENT"."SITE_ID" in('abc123');
Any help would be greatly appreciated.

In your WHERE clause you have this: "DATA_POINT_UPLOAD_DATA"."VALUE" ='READY'. That means you want to display the rows where DATA_POINT_UPLOAD_DATA has the value 'READY'.
Change your query and instead of using = try using != or <>.
SELECT "DATA_POINT_UPLOAD_DATA"."LAST_UPDATED_TIMESTAMP" AS "TIMESTAMP",
"DATA_POINT_UPLOAD_DATA"."VALUE" AS "COMMENTS"
FROM "DB"."COMPONENT" "COMPONENT",
"DB"."COMPONENT_DATA_POINT" "COMPONENT_DATA_POINT",
"DB"."DATA_POINT_UPLOAD_DATA" "DATA_POINT_UPLOAD_DATA"
WHERE "COMPONENT_DATA_POINT"."ID" ="DATA_POINT_UPLOAD_DATA"."COMPONENT_DATA_POINT_ID"
AND "COMPONENT"."ID" ="COMPONENT_DATA_POINT"."COMPONENT_ID"
AND "DATA_POINT_UPLOAD_DATA"."VALUE" !='READY'
AND "DATA_POINT_UPLOAD_DATA"."LAST_UPDATED_TIMESTAMP" BETWEEN ('01-JUN-17') AND ('30-JUN-17')
AND "COMPONENT_DATA_POINT"."NAME" ='StateOfItem'
AND "COMPONENT"."SITE_ID" IN('abc123');

You are asking for the latest record per VALUE now. You are only selecting VALUE and LAST_UPDATED_TIMESTAMP, however. So what you are asking is merely the maximum LAST_UPDATED_TIMESTAMP per VALUE. In SQL this translates to MAX(last_updated_timestamp) with GROUP BY value.
Select
max(last_updated_timestamp) as "timestamp",
value as comments
From db.data_point_upload_data
Where value <> 'READY'
and last_updated_timestamp between '2017-06-01' and '2017-06-30'
and component_data_point_id in
(
select id
from db.component_data_point
where name = 'StateOfItem'
and component_id in (select id from db.component where site_id = 'abc123')
)
Group by value;

Sorry to say, but that is a horrible query. Almost only upper case so as to minimize readability, table alias names that are no alias names, a join syntax that was made redundant twenty-five years ago, date string literals that only work in certain language settings, and unnecessary joins.
Then you select records with value = 'READY' and say that you want records that are not 'READY'. Well, then: WHERE NOT value = 'READY' or simply WHERE value <> 'READY'.
Here is the altered query:
Select
last_updated_timestamp as "timestamp",
value as comments
From db.data_point_upload_data
Where value <> 'READY'
and last_updated_timestamp between '2017-06-01' and '2017-06-30'
and component_data_point_id in
(
select id
from db.component_data_point
where name = 'StateOfItem'
and component_id in (select id from db.component where site_id = 'abc123')
);
If you only want to see the latest n rows, then order by last_updated_timestamp desc limit <n>.
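A quick sketch of the "exclude READY, newest first, keep n rows" idea, run on SQLite through Python's sqlite3. The single `data_point_upload_data` table with two columns and the sample rows are simplifications for illustration, and LIMIT is not available in every database (some use FETCH FIRST or ROWNUM instead).

```python
import sqlite3


def latest_non_ready(rows, n):
    """Return the n most recent rows whose value is not 'READY'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data_point_upload_data "
        "(last_updated_timestamp TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO data_point_upload_data VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT last_updated_timestamp, value
        FROM data_point_upload_data
        WHERE value <> 'READY'
        ORDER BY last_updated_timestamp DESC
        LIMIT ?
        """,
        (n,),
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("2017-06-01 10:00:00", "READY"),
    ("2017-06-02 10:00:00", "FAULT"),
    ("2017-06-03 10:00:00", "IDLE"),
    ("2017-06-04 10:00:00", "READY"),
    ("2017-06-05 10:00:00", "BUSY"),
]
latest_two = latest_non_ready(sample_rows, 2)
```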

multiple count(distinct)

I get an error unless I remove one of the count(distinct ...). Can someone tell me why and how to fix it?
I'm in vfp. iif([condition],[if true],[else]) is equivalent to case when
SELECT * FROM dpgift where !nocalc AND rectype = "G" AND sol = "EM112" INTO CURSOR cGift
SELECT
list_code,
count(distinct iif(language != 'F' AND renew = '0' AND type = 'IN',donor,0)) as d_Count_E_New_Indiv,
count(distinct iif(language = 'F' AND renew = '0' AND type = 'IN',donor,0)) as d_Count_F_New_Indiv /*it works if i remove this*/
FROM cGift gift
LEFT JOIN
(select didnumb, language, type from dp) d
on cast(gift.donor as i) = cast(d.didnumb as i)
GROUP BY list_code
ORDER by list_code
edit:
apparently, you can't use multiple distinct commands on the same level. Any way around this?

VFP does NOT support two "DISTINCT" clauses in the same query... PERIOD... I've even tested on a simple table of my own, DIRECTLY from within VFP such as
select count( distinct Col1 ) as Cnt1, count( distinct col2 ) as Cnt2 from MyTable
causes a crash. I don't know why you are trying to do DISTINCT as you are just testing a condition... It more accurately appears you just want a COUNT of entries per each category of criteria instead of actually DISTINCT
Because you are not "alias.field" referencing your columns in your query, I don't know which column is the basis of what. However, to help handle your DISTINCT, and it appears you are running from WITHIN a VFP app as you are using the "INTO CURSOR" clause (which would not be associated with any OleDB .net development), I would pre-query and group those criteria, something like...
select list_code,
donor,
max( iif( language != 'F' and renew = '0' and type = 'IN', 1, 0 )) as EQualified,
max( iif( language = 'F' and renew = '0' and type = 'IN', 1, 0 )) as FQualified
from
cGift gift
left join (select didnumb, language, type from dp) d
on cast(gift.donor as i) = cast(d.didnumb as i)
group by
list_code,
donor
into
cursor cGroupedByDonor
so the above will ONLY get a count of 1 per donor per list code, no matter how many records qualify. In addition, if one record has an "F" and another does NOT, then you'll have a value of 1 in EACH of the columns... Then you can do something like...
select
list_code,
sum( EQualified ) as DistEQualified,
sum( FQualified ) as DistFQualified
from
cGroupedByDonor
group by
list_code
into
cursor cDistinctByListCode
then run from that...
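The two-step idea (flag each donor once per list code, then sum the flags) can be tried out with the sketch below. It uses Python's sqlite3 with a subquery in place of the VFP cursors and CASE in place of iif(); the flattened `gifts` table and its rows are invented for the example.

```python
import sqlite3


def distinct_donor_counts(rows):
    """Count qualifying donors per list_code without COUNT(DISTINCT ...)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table with the gift and donor columns already joined.
    conn.execute(
        "CREATE TABLE gifts "
        "(list_code TEXT, donor INTEGER, language TEXT, renew TEXT, type TEXT)"
    )
    conn.executemany("INSERT INTO gifts VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT list_code,
               SUM(e_qualified) AS dist_e_qualified,
               SUM(f_qualified) AS dist_f_qualified
        FROM (
            -- Step 1: one row per (list_code, donor) with 0/1 flags.
            SELECT list_code,
                   donor,
                   MAX(CASE WHEN language <> 'F' AND renew = '0' AND type = 'IN'
                            THEN 1 ELSE 0 END) AS e_qualified,
                   MAX(CASE WHEN language = 'F' AND renew = '0' AND type = 'IN'
                            THEN 1 ELSE 0 END) AS f_qualified
            FROM gifts
            GROUP BY list_code, donor
        ) AS grouped_by_donor
        -- Step 2: summing the flags counts each donor at most once.
        GROUP BY list_code
        ORDER BY list_code
        """
    ).fetchall()
    conn.close()
    return result


sample_rows = [
    ("L1", 100, "E", "0", "IN"),
    ("L1", 100, "E", "0", "IN"),  # same donor again, still counted once
    ("L1", 101, "F", "0", "IN"),
    ("L1", 102, "E", "1", "IN"),  # renew = '1', so not qualified
    ("L2", 100, "F", "0", "IN"),
    ("L2", 103, "F", "0", "IN"),
]
donor_counts = distinct_donor_counts(sample_rows)
```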

You can try using either another derived table or two to do the calculations you need, or using projections (queries in the field list). Without seeing the schema, it's hard to know which one will work for you.
