SQL query producing duplicate rows and I can't see why - sql

My query always produces duplicate results. How do I best go about troubleshooting this query against a database with more than 1 million rows?
Select segstart
,segment
,callid
,Interval
,dialed_num
,FiscalMonthYear
,SegStart_Date
,row_date
,Name
,Xferto
,TransferType
,Agent
,Sup
,Manager
,'MyCenter' = Case Center
When 'Livermore Call Center' Then 'LCC'
When 'Natomas Call Center' Then 'NCC'
When 'Concord Call Center' Then 'CCC'
When 'Virtual Call Center' Then 'VCC'
When 'Morgan Hill Call Center' Then 'MHCC'
Else Center
End
,Xferfrom
,talktime
,ANDREWSTABLE.transferred
,ANDREWSTABLE.disposition
,dispsplit
,callid
,hsplit.starttime
,CASE
WHEN hsplit.callsoffered > 0
THEN (CAST(hsplit.acceptable as DECIMAL)/hsplit.callsoffered)*100
ELSE '0'
END AS 'Service Level'
,hsplit.callsoffered
,hsplit.acceptable
FROM
(
Select segstart,
100*DATEPART(HOUR, segstart) + 30*(DATEPART(MINUTE, segstart)/30) as Interval,
FiscalMonthYear,
SegStart_Date,
dialed_num,
callid,
Name,
t.Queue AS 'Xferto',
TransferType,
RepLName+', '+RepFName AS Agent,
SupLName+', '+SupFName AS Sup,
MgrLName+', '+MgrFName AS Manager,
q.Center,
q.Queue AS 'Xferfrom',
e.anslogin,
e.origlogin,
t.Extension,
transferred,
disposition,
talktime,
dispsplit,
segment
From CMS_ECH.dbo.CaliforniaECH e
INNER JOIN Cal_RemReporting.dbo.TransferVDNs t on e.dialed_num = t.Extension
INNER JOIN InfoQuest.dbo.IQ_Employee_Profiles_v3_AvayaId q on e.origlogin = q.AvayaID
INNER JOIN Cal_RemReporting.dbo.udFiscalMonthTable f on e.SegStart_Date = f.Tdate
Where SegStart_Date between getdate()-90 and getdate()-1
And q.Center not in ('Collections Center',
'Cable Store',
'Business Services Center',
'Escalations')
And SegStart_Date between RepToSup_StartDate and RepToSup_EndDate
And SegStart_Date between SupToMgr_StartDate and SupToMgr_EndDate
And SegStart_Date between Avaya_StartDate and Avaya_EndDate
And SegStart_Date between RepQueue_StartDate and RepQueue_EndDate
AND (e.transferred like '1'
OR e.disposition like '4') order by segstart
) AS ANDREWSTABLE
--Left Join CMS_ECH.dbo.hsplit hsplit on hsplit.starttime = ANDREWSTABLE.Interval and hsplit.row_date = ANDREWSTABLE.SegStart_Date and ANDREWSTABLE.dispsplit = hsplit.split

There are two possibilities:
There are multiple records in your system which appear to produce duplicate rows in your result set, because your projection doesn't select sufficient columns to distinguish them or your WHERE clause doesn't filter them out.
Your joins are generating spurious duplicates because the ON clauses are not complete.
Both of these can only be solved by somebody with the requisite level of domain knowledge. So we are not going to fix that query for you. Sorry.
What you need to do is compare some duplicate results with some non-duplicate results and discover what the first group has in common which also distinguishes it from the second group.
I'm not saying it is easy, especially with millions of rows. But if it was easy it wouldn't be worth doing.
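A minimal sketch of how the duplicate groups could be isolated before comparing them, using Python's built-in sqlite3 module so it runs anywhere. The tables `calls` and `vdns` and their contents are hypothetical stand-ins, not the real schema; the idea is only to group on every projected column and keep the groups that occur more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical miniature tables; names and columns are illustrative only.
    CREATE TABLE calls (callid INTEGER, dialed_num TEXT);
    CREATE TABLE vdns  (extension TEXT, queue TEXT);
    INSERT INTO calls VALUES (1, '5001'), (2, '5002');
    -- Extension 5001 is listed twice, so every call to it joins twice.
    INSERT INTO vdns VALUES ('5001', 'Billing'), ('5001', 'Billing'), ('5002', 'Sales');
""")

# Group on every projected column; a group with COUNT(*) > 1 is a duplicate.
duplicates = conn.execute("""
    SELECT c.callid, c.dialed_num, v.queue, COUNT(*) AS copies
    FROM calls c
    JOIN vdns v ON c.dialed_num = v.extension
    GROUP BY c.callid, c.dialed_num, v.queue
    HAVING COUNT(*) > 1
""").fetchall()

print(duplicates)
```

Once a handful of duplicated key values is known, each source table can be queried for just those values to see which one holds the repeated rows.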

I have run into this a couple of times myself and it always ends up being one of my join statements. I would try removing your join statements one at a time and seeing whether removing one of them reduces the number of duplicates.
Your other option is to find a duplicate set of rows, query each table in the join on the join values, and see what you get back.
Also, what database are you running and what version?
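The two suggestions above can be sketched roughly as follows, again with sqlite3 and made-up tables (`calls`, `vdns`, `employees` are assumptions for illustration). The row count is taken as each join is added, and the table that makes the count jump is then checked for join keys that occur more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for a fact table and two lookup tables.
    CREATE TABLE calls     (callid INTEGER, dialed_num TEXT, login TEXT);
    CREATE TABLE vdns      (extension TEXT, queue TEXT);
    CREATE TABLE employees (login TEXT, sup TEXT, start_day INTEGER, end_day INTEGER);
    INSERT INTO calls VALUES (1, '5001', 'a1'), (2, '5002', 'b2');
    INSERT INTO vdns VALUES ('5001', 'Billing'), ('5002', 'Sales');
    -- Login a1 has two overlapping history rows, a common source of fan-out.
    INSERT INTO employees VALUES ('a1', 'Lee', 1, 10), ('a1', 'Kim', 5, 20), ('b2', 'Lee', 1, 20);
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Add one join at a time and watch where the row count grows.
base = scalar("SELECT COUNT(*) FROM calls")
with_vdns = scalar(
    "SELECT COUNT(*) FROM calls c JOIN vdns v ON c.dialed_num = v.extension"
)
with_both = scalar("""
    SELECT COUNT(*) FROM calls c
    JOIN vdns v ON c.dialed_num = v.extension
    JOIN employees e ON c.login = e.login
""")
print(base, with_vdns, with_both)

# The employees join added a row, so look for repeated keys in that table.
repeated = conn.execute("""
    SELECT login, COUNT(*) FROM employees GROUP BY login HAVING COUNT(*) > 1
""").fetchall()
print(repeated)
```

In this toy data the count stays at 2 after the first join and rises to 3 after the second, which points at `employees` as the table whose join condition is incomplete.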

Related

When doing a hive query, I do not get all the outputs when I scroll up in terminal

I have saved the table through a CREATE TABLE statement and even before I did a CREATE TABLE when I scroll up I can't see all the outputs in my hive query. I do not think it's my query statement but more something I pressed or did in the terminal. When I first started this project I could scroll and see all outputs but now I can only see some of them.
Any and all advice is helpful. (I can attach my code and sample output if people want.)
""" select P.gender, F.eoy_age, F.NumberOfChildren, F.Homeowner, F.HouseholdIncome,
Case when HouseholdIncome='G' then '$80K-$90K' when HouseholdIncome='H' then '$90K-$100K'
when HouseholdIncome='I' then '$100K-$110K'
when HouseholdIncome='J' then '$110K-$120K'
when HouseholdIncome='K' then '$120K-$130K'
when HouseholdIncome='L' then '$130K-$140K'
when HouseholdIncome='M' then '$140K-$150K'
when HouseholdIncome='O' then '$175K-$200K'
when HouseholdIncome='P' then '$200K-$225K'
when HouseholdIncome='Q' then '$225K-$250K'
when HouseholdIncome='R' then '$250K-$275K'
when HouseholdIncome='S' then '$275K-$300K'
when HouseholdIncome='T' then '$300K-$400K'
when HouseholdIncome='U' then '$400K-$500K'
when HouseholdIncome='V' then '$500K-$600K'
when HouseholdIncome='W' then '$600K-$750K'
when HouseholdIncome='X' then '$750K-$1000K'
when HouseholdIncome='Y' then '$1000K-$2000K'
when HouseholdIncome='Z' then '$2000K+' end AS HouseholdIncomeRange,
F.State, count(*), count(distinct P.Email)
from pi_table P
LEFT JOIN feature_table F ON P.ID=F.ID
where F.age<45 and F.NumberOfChildren >= 1 AND F.Homeowner in ('H','W') and F.HouseholdIncome not in ('1','2','A','B','C','D','E ','F') AND F.State in ('MA', 'RI', 'NH', 'ME', 'VT', 'FL', 'GA', 'NC') AND P.Email is not null
GROUP BY F.NumberOfChildren, P.gender, F.eoy_age,F.Homeowner,F.HouseholdIncome,F.State;

Joined tables returning correct value when selecting single row but incorrect when entire dataset

I have 3 tables which I have joined: campaign level, ad level, and keyword level, and I need certain things from each of these. All 3 contain the following identical columns: campaign_name, campaign_id, day. Two of them also contain 'ad_group_name'.
The query is functioning and returning all the right values for data coming from keyword and campaign, but the values I need from ad level (conversion_name and its values) are not. But the confusing part for me is that when I use the 'WHERE' clause and select only one row, the values are correct and match up to the source tables. Additionally, the values of a.conversion_name add up to 'conversion' total (+/- 1/2).
Results from single row (WHERE clause)
When I remove the WHERE clause and select the entire table, my numbers are significantly larger than they should be. The a.conversion_name values no longer add up to the 'conversion' total - in fact sometimes conversions = 0, and the a.conversion_name values return values.
Results when selecting entire table
Results when selecting entire table 2
I think I understand why this is happening, it's a grouping issue (?), and I have searched through lots of the existing threads, tried out sub queries and DISTINCTS, but my skill level at the moment means I am really struggling to figure this out.
Should I change how things are grouped? I have also tried adding the a.conversion_name as a dimension and then selecting it, but this doesn't help either.
WITH raw AS (SELECT
k.day,
k.campaign_name,
k.ad_group_name,
k.ad_group_type,
k.ad_group_id,
k.campaign_id,
k.keyword,
k.keyword_match_type,
AVG(CASE WHEN a.conversion_name = 'Verification Submitted' THEN a.conversions END) AS conv_verification_submitted,
AVG(CASE WHEN a.conversion_name = 'Email Confirmed' THEN a.conversions END) AS conv_email_confirmed,
AVG(CASE WHEN a.conversion_name = 'Account created' THEN a.conversions END) AS conv_account_created,
AVG(CASE WHEN a.conversion_name = 'Verification Started' THEN a.conversions END) AS conv_verification_started,
AVG(CASE WHEN a.conversion_name = 'Deposit Succeeded' THEN a.conversions END) AS conv_deposit_succeeded,
AVG(CASE WHEN a.conversion_name = 'Trade Completed' THEN a.conversions END) AS conv_trade_completed,
AVG(K.clicks) as clicks,
AVG(K.conversions) as conversions,
AVG(K.costs) as spend,
AVG(K.impressions) as impressions,
AVG(k.quality_score) as quality_score,
AVG(c.search_impression_share) as search_impression_share,
AVG(k.search_exact_match_impression_share) as search_exact_match_impression_share,
AVG(c.search_lost_impression_share_rank) as search_lost_impression_share_rank,
AVG(c.search_top_impression_share) as search_top_impression_share,
AVG(c.search_lost_impression_share_budget) as search_lost_impression_share_budget,
FROM `bigqpr.keyword-level-data` as k
LEFT JOIN `bigqpr.campaign-level-data` as c
ON c.campaign_name = k.campaign_name and c.day = k.day
LEFT JOIN `bigqpr.ad-level-data` as a
ON a.campaign_name = k.campaign_name and a.day = k.day and a.ad_group_name = k.ad_group_name
group by 1,2,3,4,5,6,7,8,a.conversion_name)
SELECT
day,
campaign_name,
ad_group_name,
ad_group_type,
ad_group_id,
campaign_id,
keyword,
keyword_match_type,
AVG(conv_verification_submitted) as conv_verification_submitted,
AVG(conv_email_confirmed) as conv_email_confirmed,
AVG(conv_account_created) as conv_account_created,
AVG(conv_verification_started) as conv_verification_started,
AVG(conv_deposit_succeeded) as conv_deposit_succeeded,
AVG(conv_trade_completed) as conv_trade_completed,
AVG(clicks) as clicks,
AVG(conversions) as conversions,
AVG(spend) as spend,
AVG(impressions) as impressions,
AVG(quality_score) as quality_score,
AVG(search_impression_share) as search_impression_share,
AVG(search_exact_match_impression_share) as search_exact_match_impression_share,
AVG(search_lost_impression_share_rank) as search_lost_impression_share_rank,
AVG(search_top_impression_share) as search_top_impression_share,
AVG(search_lost_impression_share_budget) as search_lost_impression_share_budget,
FROM raw
WHERE keyword = "specifickeyword" and day = "2022-05-22" and ad_group_name = "specificadgroup"
GROUP BY 1,2,3,4,5,6,7,8
I am also a beginner, but my first thought is to try an Inner Join instead of a Left Join. Sometimes this helps my result sets fit the number I'm looking for instead of being so large.
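Whether an inner join helps depends on where the extra rows come from. The sketch below, which uses sqlite3 and heavily simplified, hypothetical versions of the keyword-level and ad-level tables, compares row counts for both join types and then shows one possible alternative: collapsing the ad-level table to one row per join key before joining. Using SUM for that collapse is an assumption; the right aggregate depends on what the ad-level rows represent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified versions of the keyword-level and ad-level tables.
    CREATE TABLE keyword_level (day TEXT, ad_group_name TEXT, keyword TEXT, clicks INTEGER);
    CREATE TABLE ad_level (day TEXT, ad_group_name TEXT, conversion_name TEXT, conversions REAL);
    INSERT INTO keyword_level VALUES
        ('2022-05-22', 'group_a', 'kw1', 10),
        ('2022-05-22', 'group_b', 'kw2', 4);
    -- group_a has three ad-level rows; group_b has none.
    INSERT INTO ad_level VALUES
        ('2022-05-22', 'group_a', 'Email Confirmed', 1),
        ('2022-05-22', 'group_a', 'Email Confirmed', 2),
        ('2022-05-22', 'group_a', 'Account created', 3);
""")

def row_count(join_type):
    """Count joined rows for the given join type ('LEFT' or 'INNER')."""
    sql = f"""
        SELECT COUNT(*) FROM keyword_level k
        {join_type} JOIN ad_level a
          ON a.day = k.day AND a.ad_group_name = k.ad_group_name
    """
    return conn.execute(sql).fetchone()[0]

left_rows = row_count("LEFT")    # three matches for kw1 plus one unmatched row for kw2
inner_rows = row_count("INNER")  # three matches for kw1 only

# Collapse ad_level to one row per join key first, so each keyword stays one row.
pre_aggregated = conn.execute("""
    SELECT k.keyword, k.clicks, a.email_confirmed, a.account_created
    FROM keyword_level k
    LEFT JOIN (
        SELECT day, ad_group_name,
               SUM(CASE WHEN conversion_name = 'Email Confirmed' THEN conversions END) AS email_confirmed,
               SUM(CASE WHEN conversion_name = 'Account created' THEN conversions END) AS account_created
        FROM ad_level
        GROUP BY day, ad_group_name
    ) a ON a.day = k.day AND a.ad_group_name = k.ad_group_name
    ORDER BY k.keyword
""").fetchall()

print(left_rows, inner_rows)
print(pre_aggregated)
```

In this toy data the inner join only drops the unmatched keyword; the multiplication of `kw1` into three rows is the same for both join types, and only the pre-aggregated form avoids it.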

T-SQL Get Events after an order

So I have data that looks like this:
What I am trying to do in SQL is get all of the CollectedDT that are later than a date and less than a date, for example all of the values in yellow in both columns belong together, the records with just one column in yellow I don't care about and the ones in all green are keepers too. The idea is to try and implicitly say that one set of collection times belong to an order and another set belong to the other. There is no rule for hours difference between each, could be 1 hour, could be 100 hours.
While the query returned what it should have, it is decipherable that the CollectedDT of 11-04-2011 15:35 and newer most likely belongs to the 11-03-2011 21:12 order, there is no hard logic to dictate this, it is simply implied and needs to be treated as such.
Really no good starting point on how to go from here.
The query is as follows:
SELECT ORD.[episode_no],
ORD.[ord_no],
ORD.[pty_name] AS 'Ordering Provider',
ORD.[ent_dtime] AS 'Order Entered',
ASMT.[CollectedDT],
ORD.[str_dtime],
ORD.last_cng_dtime,
PMS.vst_start_dtime AS 'Admit Dt',
PMS.[vst_end_dtime] AS 'Discharge Dt',
ORD.[ord_qty],
CASE
WHEN ORD.[ord_sts] = 27
THEN 'Complete'
WHEN ORD.ord_sts = 34
THEN 'Discontinued'
ELSE ORD.ord_sts
END AS 'Order Status',
ORD.[desc_as_written],
ASMT.[FormUsage] AS 'Assessment',
ASMT.[AssessmentID],
datediff(minute, ORD.ent_dtime, ASMT.CollectedDT) AS [order_entry_to_collected_minutes],
datediff(hour, ORD.ent_dtime, ASMT.CollectedDT) AS [order_entry_to_collected_hours]
FROM [SMSPHDSSS0X0].[smsmir].[mir_sr_ord] AS ORD
LEFT OUTER JOIN [smsmir].[mir_sr_vst_pms] AS PMS ON PMS.episode_no = ORD.episode_no
LEFT OUTER JOIN [smsmir].[mir_sc_Assessment] AS ASMT ON ASMT.PatientVisit_oid = PMS.vst_no
WHERE (ORD.desc_as_written LIKE 'physical Therapy%')
AND (
ASMT.FormUsage IN ('Physical Therapy Initial Asmt', 'Physical Therapy Re-evaluation', 'PT Flowsheet')
AND ASMT.CollectedDT > ORD.ent_dtime
)
AND ORD.ord_sts = 27
AND ASMT.AssessmentStatus = 'Complete'
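One possible reading of the requirement is that each collection time belongs to the most recent order entered before it. The sketch below illustrates that reading with sqlite3; the tables `orders` and `assessments`, their columns, and every value except the 11-03-2011 21:12 order and the 11-04-2011 15:35 collection time are hypothetical, and the real rule may well be different since the question says there is no hard logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical simplified tables; dates are ISO strings so they sort correctly.
    CREATE TABLE orders (ord_no INTEGER, episode_no INTEGER, ent_dtime TEXT);
    CREATE TABLE assessments (assessment_id INTEGER, episode_no INTEGER, collected_dt TEXT);
    INSERT INTO orders VALUES
        (1, 100, '2011-11-01 09:00'),
        (2, 100, '2011-11-03 21:12');
    INSERT INTO assessments VALUES
        (10, 100, '2011-11-01 14:00'),
        (11, 100, '2011-11-02 10:30'),
        (12, 100, '2011-11-04 15:35');
""")

# For each assessment, join only the latest order entered before it was collected.
rows = conn.execute("""
    SELECT a.assessment_id, a.collected_dt, o.ord_no
    FROM assessments a
    JOIN orders o
      ON o.episode_no = a.episode_no
     AND o.ent_dtime = (
            SELECT MAX(o2.ent_dtime)
            FROM orders o2
            WHERE o2.episode_no = a.episode_no
              AND o2.ent_dtime < a.collected_dt
         )
    ORDER BY a.collected_dt
""").fetchall()

for row in rows:
    print(row)
```

The same correlated-subquery shape should carry over to T-SQL, where `CROSS APPLY` with `TOP 1 ... ORDER BY ent_dtime DESC` is another common way to pick the latest preceding row.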

Query using the results from another query

I'm trying to write a query that will produce set of results based on the results from another query. All the required data for the queries are from one table, just a different set of query criteria. I'm not sure if what I'm asking is even possible to create in a single query.
select OBJID, case_type, activity, TITLE, x_rev_esc_prim_reason, x_rev_esc_sec_reason, x_esc_third_reason, x_create_dt, x_update_dt, x_sales_rep
from table_case
The results from the above query capture basically everything I need from the table, but I need to filter them down.
The results from the query below are what I need to use to reduce the results of the query above. In addition, the x_create_dt from the query below needs to be greater than the x_create_dt from the results of the query above.
select OBJID, case_type, activity, TITLE, x_rev_esc_prim_reason, x_rev_esc_sec_reason, x_esc_third_reason, x_create_dt, x_update_dt, x_sales_rep
from table_case
where case_type = 'Sales'
and activity = 'DO'
IMAGE OF THE SAMPLE DATA AND RESULTS
DB is currently Oracle10g
You wanted to return all case_type = 'request' rows for each sales rep, where x_create_dt is earlier than the latest x_create_dt in a case_type = 'Sales' and activity = 'DO' row for that same sales rep.
First we create the query that returns the qualifying information:
SELECT x_sales_rep, MAX(x_create_dt) AS latest_date
FROM table_case
WHERE case_type = 'Sales'
AND activity = 'DO'
GROUP BY x_sales_rep
You can test this query by itself, and verify that it produces the desired results.
Then we embed this query in the outer query, and join the two, like this (Oracle does not accept the AS keyword before a table alias, so the aliases are written without it):
SELECT c.OBJID, c.case_type, c.activity, c.TITLE,
c.x_rev_esc_prim_reason, c.x_rev_esc_sec_reason, c.x_esc_third_reason,
c.x_create_dt, c.x_update_dt
FROM table_case c
INNER JOIN (
SELECT x_sales_rep, MAX(x_create_dt) AS latest_date
FROM table_case
WHERE case_type = 'Sales'
AND activity = 'DO'
GROUP BY x_sales_rep) q
ON c.x_sales_rep = q.x_sales_rep AND q.latest_date > c.x_create_dt
WHERE c.case_type = 'request'
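To check the derived-table join on a small data set, here is a minimal sketch using sqlite3. The column names `x_sales_rep` and `x_create_dt` follow the select list in the question; the rows themselves are invented for illustration, and SQLite stands in for Oracle only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical sample rows; only the column names come from the question.
    CREATE TABLE table_case (objid INTEGER, case_type TEXT, activity TEXT,
                             x_create_dt TEXT, x_sales_rep TEXT);
    INSERT INTO table_case VALUES
        (1, 'request', NULL, '2014-01-05', 'ann'),
        (2, 'request', NULL, '2014-03-01', 'ann'),
        (3, 'Sales',   'DO', '2014-02-01', 'ann'),
        (4, 'request', NULL, '2014-01-10', 'bob'),
        (5, 'Sales',   'XX', '2014-02-01', 'bob');
""")

# The inner query finds each rep's latest Sales/DO date; the outer query keeps
# only the request rows created before that date for the same rep.
rows = conn.execute("""
    SELECT c.objid, c.x_sales_rep, c.x_create_dt
    FROM table_case c
    INNER JOIN (
        SELECT x_sales_rep, MAX(x_create_dt) AS latest_date
        FROM table_case
        WHERE case_type = 'Sales' AND activity = 'DO'
        GROUP BY x_sales_rep
    ) q
      ON c.x_sales_rep = q.x_sales_rep
     AND q.latest_date > c.x_create_dt
    WHERE c.case_type = 'request'
    ORDER BY c.objid
""").fetchall()

print(rows)
```

Only request 1 survives: request 2 was created after ann's Sales/DO row, and bob has no Sales/DO row at all, so the inner join drops his request.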

Query taking too long - Optimization

I am having an issue with the following query returning results a bit too slow and I suspect I am missing something basic. My initial guess is the 'CASE' statement is taking too long to process its result on the underlying data. But it could be something in the derived tables as well.
The question is, how can I speed this up? Are there any glaring errors in the way I am pulling the data? Am I running into a sorting or looping issue somewhere? The query runs for about 40 seconds, which seems quite long. C# is my primary expertise, SQL is a work in progress.
Note I am not asking "write my code" or "fix my code". Just for a pointer in the right direction, I can't seem to figure out where the slowdown occurs. Each derived table runs very quickly (less than a second) by itself, the joins seem correct and the result set is returning exactly what I need. It's just too slow and I'm sure there are better SQL scripters out there ;) Any tips would be greatly appreciated!
SELECT
hdr.taker
, hdr.order_no
, hdr.po_no as display_po
, cust.customer_name
, hdr.customer_id
, 'INCORRECT-LARGE ORDER' + CASE
WHEN (ext_price_calc >= 600.01 and ext_price_calc <= 800) and fee_price.unit_price <> round(ext_price_calc * -.01,2)
THEN '-1%: $' + cast(cast(ext_price_calc * -.01 as decimal(18,2)) as varchar(255))
WHEN ext_price_calc >= 800.01 and ext_price_calc <= 1000 and fee_price.unit_price <> round(ext_price_calc * -.02,2)
THEN '-2%: $' + cast(cast(ext_price_calc * -.02 as decimal(18,2)) as varchar(255))
WHEN ext_price_calc > 1000 and fee_price.unit_price <> round(ext_price_calc * -.03,2)
THEN '-3%: $' + cast(cast(ext_price_calc * -.03 as decimal(18,2)) as varchar(255))
ELSE
'OK'
END AS Status
FROM
(myDb_view_oe_hdr hdr
LEFT OUTER JOIN myDb_view_customer cust
ON hdr.customer_id = cust.customer_id)
LEFT OUTER JOIN wpd_view_sales_territory_by_customer territory
ON cust.customer_id = territory.customer_id
LEFT OUTER JOIN
(select
order_no,
SUM(ext_price_calc) as ext_price_calc
from
(select
hdr.order_no,
line.item_id,
(line.qty_ordered - isnull(qty_canceled,0)) * unit_price as ext_price_calc
from myDb_view_oe_hdr hdr
left outer join myDb_view_oe_line line
on hdr.order_no = line.order_no
where
line.delete_flag = 'N'
AND line.cancel_flag = 'N'
AND hdr.projected_order = 'N'
AND hdr.delete_flag = 'N'
AND hdr.cancel_flag = 'N'
AND line.item_id not in ('LARGE-ORDER-1%','LARGE-ORDER-2%', 'LARGE-ORDER-3%', 'FUEL','NET-FUEL', 'CONVENIENCE-FEE')) as line
group by order_no) as order_total
on hdr.order_no = order_total.order_no
LEFT OUTER JOIN
(select
order_no,
count(order_no) as convenience_count
from oe_line with (nolock)
left outer join inv_mast inv with (nolock)
on oe_line.inv_mast_uid = inv.inv_mast_uid
where inv.item_id in ('LARGE-ORDER-1%','LARGE-ORDER-2%', 'LARGE-ORDER-3%')
and oe_line.delete_flag <> 'Y'
group by order_no) as fee_count
on hdr.order_no = fee_count.order_no
INNER JOIN
(select
order_no,
unit_price
from oe_line line with (nolock)
where line.inv_mast_uid in (select inv_mast_uid from inv_mast with (nolock) where item_id in ('LARGE-ORDER-1%','LARGE-ORDER-2%', 'LARGE-ORDER-3%'))) as fee_price
ON fee_count.order_no = fee_price.order_no
WHERE
hdr.projected_order = 'N'
AND hdr.cancel_flag = 'N'
AND hdr.delete_flag = 'N'
AND hdr.completed = 'N'
AND territory.territory_id = 'CUSTOMERTERRITORY'
AND ext_price_calc > 600.00
AND hdr.carrier_id <> '100004'
AND fee_count.convenience_count is not null
AND CASE
WHEN (ext_price_calc >= 600.01 and ext_price_calc <= 800) and fee_price.unit_price <> round(ext_price_calc * -.01,2)
THEN '-1%: $' + cast(cast(ext_price_calc * -.01 as decimal(18,2)) as varchar(255))
WHEN ext_price_calc >= 800.01 and ext_price_calc <= 1000 and fee_price.unit_price <> round(ext_price_calc * -.02,2)
THEN '-2%: $' + cast(cast(ext_price_calc * -.02 as decimal(18,2)) as varchar(255))
WHEN ext_price_calc > 1000 and fee_price.unit_price <> round(ext_price_calc * -.03,2)
THEN '-3%: $' + cast(cast(ext_price_calc * -.03 as decimal(18,2)) as varchar(255))
ELSE
'OK' END <> 'OK'
Just as a clue to the right direction for optimization:
When you do an OUTER JOIN to a query with calculated columns, you are guaranteeing not only a full table scan, but that those calculations must be performed against every row in the joined table. It appears that you can actually do your join to oe_line without the column calculations (i.e. by filtering ext_price_calc to a specific range).
You don't need to do most of the subqueries that are in your query; the master query can be recrafted to use regular table join syntax. Joins to subqueries containing subqueries present a challenge to the SQL optimizer that it may not be able to meet. By using regular joins, you give the optimizer a much better chance of identifying more efficient query strategies.
You don't tag which SQL engine you're using. Every database has proprietary extensions that may allow for speedier or more efficient queries. It would be easier to provide useful feedback if you indicated whether you were using MySQL, SQL Server, Oracle, etc.
Regardless of the database you're using, reviewing the query plan is always a good place to start. This will tell you where most of the I/O and time in your query is being spent.
Just on general principle, make sure your statistics are up-to-date.
It may not be solvable by any of us without the real stuff to test with.
If that's the case and nobody else posts the answer, I can still help. Here is how to troubleshoot it.
(1) Take joins and pieces out one by one.
(2) This will cause errors. Remove or fake the references to get rid of them.
(3) See how that works.
(4) Put items back before you try taking something else out.
(5) Keep track...
(6) Also be aware where a removal of something might drastically reduce the result set.
You might find you're missing an index or some other smoking gun.
I was having the same problem and I was able to solve it by indexing one of the tables and setting a primary key.
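As a rough illustration of what an index can change, the sketch below asks SQLite for its query plan before and after adding one. The table `oe_line` is reduced to three hypothetical columns and the index name `idx_line_order` is made up; `EXPLAIN QUERY PLAN` is SQLite's command, and other engines expose plans through their own tools.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down version of an order-line table.
conn.execute("CREATE TABLE oe_line (order_no INTEGER, item_id TEXT, unit_price REAL)")
conn.executemany(
    "INSERT INTO oe_line VALUES (?, ?, ?)",
    [(n, "ITEM", 1.0) for n in range(1000)],
)

def plan(sql):
    """Return the readable detail column of each EXPLAIN QUERY PLAN row."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT unit_price FROM oe_line WHERE order_no = 42"

plan_before = plan(query)  # no index yet, so the table is scanned
conn.execute("CREATE INDEX idx_line_order ON oe_line (order_no)")
plan_after = plan(query)   # the new index can now be used for the lookup

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan reports a scan of the table and the second names the index.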
I strongly suspect that the problem lies in the number of joins you're doing. A lot of databases do joins basically by systematically checking all possible combinations of the various tables as being valid - so if you're joining table A and B on column C, and A looks like:
Name:C
Fred:1
Alice:2
Betty:3
While B looks like:
C:Pet
1:Alligator
2:Lion
3:T-Rex
When you do the join, it checks all 9 possibilities:
Fred:1:1:Alligator
Fred:1:2:Lion
Fred:1:3:T-Rex
Alice:2:1:Alligator
Alice:2:2:Lion
Alice:2:3:T-Rex
Betty:3:1:Alligator
Betty:3:2:Lion
Betty:3:3:T-Rex
And goes through and deletes the non-matching ones:
Fred:1:1:Alligator
Alice:2:2:Lion
Betty:3:3:T-Rex
... which means with three entries in each table, it creates nine temporary records, sorts through them all, and deletes six of them ... all before it actually sorts through the results for what you're after (so if you are looking for Betty's Pet, you only want one row on that final result).
... and you're doing how many joins and sub-queries?
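The example above can be reproduced in a few lines of Python. This is only a sketch of the naive strategy described here (build every pairing, then keep the matches); it is not a claim about how any particular engine executes a join.

```python
from itertools import product

# The two tables from the example, as lists of tuples.
table_a = [("Fred", 1), ("Alice", 2), ("Betty", 3)]          # (Name, C)
table_b = [(1, "Alligator"), (2, "Lion"), (3, "T-Rex")]      # (C, Pet)

# Step 1: form every possible pairing of a row from A with a row from B.
candidates = list(product(table_a, table_b))

# Step 2: keep only the pairings whose C values match.
matches = [(name, pet) for (name, c_a), (c_b, pet) in candidates if c_a == c_b]

print(len(candidates))
print(matches)
```

With three rows per table there are nine candidate pairs for three matches; the candidate count is the product of the table sizes, so it grows quickly as more tables are joined this way.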