How to create multiple concatenations on multiple left joins in a query - SQL

I am trying to left join table HE to table S as shown in the code below.
To get a match, I have to concatenate client ID and market from both tables, as shown below in [Headendmarket&tbcode] and [SETClientID+Market].
But because some of the data in those columns is not clean, the concatenation above is not enough, and some rows from S come back with no match in the join.
My idea is to do a second lookup for the rows that are left NULL after running the query below: build another concatenation on S.[Market] & S.[Client] and left join it to HE.[HeadendMarket] & HE.[AdvertiserName]. That would give the remaining unmatched rows a second chance to find a match and bring back whatever HE data they can for the S table. A rough sketch of what I mean is shown after the query below.
What is the best way to do this within this query?
SELECT DISTINCT CONCAT(HE.[AdvertiserTBCode],HE.[HeadendMarket]) AS [Headendmarket&tbcode]
,CONCAT(S.[CLIENTID],S.[Market]) AS [SETClientID+Market]
,HE.[CurrentRegionName]
,HE.[CurrentMarketName]
,HE.[CurrentSalesTeamName]
,HE.[CurrentSalesOfficeName]
,HE.[CurrentCorpAEName]
,HE.[CurrentAEType]
,HE.[AdvertiserTBCode] --Has all the correct data--
,HE.[AdvertiserName]
,HE.[ParentAdvertiserName]
,HE.[HeadendRegion]
,HE.[HeadendMarket] --has all correct data--
,HE.[CorpCategoryGroup]
,S.[Actuals vs projections]
,S.[Year]
,S.[Month]
,S.[Area]
,CASE S.[Market] --some market names differ, so this maps them to the same names used in the HE table--
WHEN 'Twin Cities' THEN 'Minneapolis - St. Paul'
WHEN 'Fort Myers' THEN 'Ft. Myers - Naples'
WHEN 'Bowling Green' THEN 'Nashville'
WHEN 'North Miss' THEN 'TUPELO'
WHEN 'Monroe, LA' THEN 'Monroe'
WHEN 'Southern Miss-Hattiesburg/Laurel/Meridian' THEN 'SOUTHERN MISS'
WHEN 'Northern Miss-Columbus/Tupelo' THEN 'Tulepo'
WHEN 'Little Rock, AR' THEN 'Little Rock'
WHEN 'Fort Wayne' THEN 'Ft. Wayne'
WHEN 'Wheeling/Youngstown/Canfield' THEN 'WYC'
WHEN 'Johnstown/Altoona/State College' THEN 'Johnstown-Altoona'
WHEN 'Washington, D.C.' THEN 'Washington'
ELSE S.[Market] END AS [SET Market]
,S.[Zone Type]
,S.[Category]
,S.[Subcategory]
,S.[Event]
,S.[Network]
,S.[AE]
,S.[Client] -- some wrong fields--
,S.[ClientID] --some incorrect data entered; in a perfect world this, concatenated with the market, would match the data in the HE table--
,S.[# Spots]
,S.[Gross ($)]
FROM [REO].[dbo].[Sports] S
LEFT JOIN [REO].[dbo].[HF_Final] HE
ON CONCAT(S.[CLIENTID],CASE S.[Market]
WHEN 'Twin Cities' THEN 'Minneapolis - St. Paul'
WHEN 'Fort Myers' THEN 'Ft. Myers - Naples'
WHEN 'Bowling Green' THEN 'Nashville'
WHEN 'North Miss' THEN 'TUPELO'
WHEN 'Monroe, LA' THEN 'Monroe'
WHEN 'Southern Miss-Hattiesburg/Laurel/Meridian' THEN 'SOUTHERN MISS'
WHEN 'Northern Miss-Columbus/Tupelo' THEN 'Tulepo'
WHEN 'Little Rock, AR' THEN 'Little Rock'
WHEN 'Fort Wayne' THEN 'Ft. Wayne'
WHEN 'Wheeling/Youngstown/Canfield' THEN 'WYC'
WHEN 'Johnstown/Altoona/State College' THEN 'Johnstown-Altoona'
WHEN 'Washington, D.C.' THEN 'Washington'
ELSE S.[Market] END)
= CONCAT(HE.[AdvertiserTBCode],HE.[HeadendMarket])
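Something like the following is roughly what I have in mind (a sketch only, reusing the tables and columns above; the long market CASE is cut down to one branch, and the CROSS APPLY is just so it does not have to be repeated in both joins):
SELECT S.[Client]
      ,S.[ClientID]
      ,COALESCE(HE1.[AdvertiserName], HE2.[AdvertiserName]) AS [AdvertiserName]
      ,COALESCE(HE1.[HeadendMarket],  HE2.[HeadendMarket])  AS [HeadendMarket]
FROM [REO].[dbo].[Sports] S
CROSS APPLY (SELECT CASE S.[Market]
                      WHEN 'Twin Cities' THEN 'Minneapolis - St. Paul'
                      -- ...the remaining WHEN branches from the query above...
                      ELSE S.[Market] END AS [Market]) AS M
-- primary match: client id + normalised market
LEFT JOIN [REO].[dbo].[HF_Final] HE1
       ON CONCAT(S.[ClientID], M.[Market]) = CONCAT(HE1.[AdvertiserTBCode], HE1.[HeadendMarket])
-- fallback match: market + client name, only for the rows the first join missed
LEFT JOIN [REO].[dbo].[HF_Final] HE2
       ON HE1.[AdvertiserTBCode] IS NULL
      AND CONCAT(M.[Market], S.[Client]) = CONCAT(HE2.[HeadendMarket], HE2.[AdvertiserName])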

Please identify ALL of the tables and columns that do not have clean data, and give examples (e.g. Bad = ...; ShouldBe = ...). Those columns may affect how we proceed. Also, get rid of the extraneous columns that are not part of the problem.
In the meantime, you can identify the exact dirty data and add some more conditioning... Note: the SQL below will let you see NULLs on either side of the ON condition. That is, some HE data may not match because the S data is incorrect.
SELECT
     AllKeys.ClientID
    ,AllKeys.Market
    ,S.ClientID          AS Sports_ClientID
    ,S.Market            AS Sports_Market
    ,HE.AdvertiserTBCode AS HF_AdvertiserTBCode
    ,HE.HeadendMarket    AS HF_HeadendMarket
FROM
-- get all of the possible keys from each side
    (SELECT DISTINCT
         ClientID
        ,Market           -- or CASE Market...
     FROM [REO].[dbo].[Sports]
     UNION
     SELECT
         AdvertiserTBCode
        ,HeadendMarket    -- or CASE HeadendMarket...
     FROM [REO].[dbo].[HF_Final]
    ) AS AllKeys
-- join each side to its keys
LEFT JOIN [REO].[dbo].[Sports] S
       ON AllKeys.ClientID = S.ClientID
      AND AllKeys.Market   = S.Market
LEFT JOIN [REO].[dbo].[HF_Final] HE
       ON AllKeys.ClientID = HE.AdvertiserTBCode
      AND AllKeys.Market   = HE.HeadendMarket
ORDER BY AllKeys.ClientID, AllKeys.Market
-- now filter down to those that you want to pay attention to
-- (maybe the rows with NULLs in certain fields)
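For example (a sketch, keeping the corrected column names above), to list only the keys that failed to match on one side:
SELECT AllKeys.ClientID
      ,AllKeys.Market
      ,S.ClientID          AS Sports_ClientID
      ,HE.AdvertiserTBCode AS HF_AdvertiserTBCode
FROM (SELECT DISTINCT ClientID, Market FROM [REO].[dbo].[Sports]
      UNION
      SELECT AdvertiserTBCode, HeadendMarket FROM [REO].[dbo].[HF_Final]) AS AllKeys
LEFT JOIN [REO].[dbo].[Sports] S
       ON AllKeys.ClientID = S.ClientID AND AllKeys.Market = S.Market
LEFT JOIN [REO].[dbo].[HF_Final] HE
       ON AllKeys.ClientID = HE.AdvertiserTBCode AND AllKeys.Market = HE.HeadendMarket
WHERE S.ClientID IS NULL              -- key exists only in HF_Final
   OR HE.AdvertiserTBCode IS NULL     -- key exists only in Sports
ORDER BY AllKeys.ClientID, AllKeys.Market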

Related

When doing a Hive query, I do not see all of the output when I scroll up in the terminal

I have saved the table through a CREATE TABLE statement, and even before I used CREATE TABLE, I couldn't see all of the output of my Hive query when scrolling up. I do not think it's my query statement; it's more likely something I pressed or did in the terminal. When I first started this project I could scroll and see all of the output, but now I can only see some of it.
Any and all advice is helpful. (I can attach my code and sample output if people want.)
""" select P.gender, F.eoy_age, F.NumberOfChildren, F.Homeowner, F.HouseholdIncome,
Case when HouseholdIncome='G' then '$80K-$90K' when HouseholdIncome='H' then '$90K-$100K'
when HouseholdIncome='I' then '$100K-$110K'
when HouseholdIncome='J' then '$110K-$120K'
when HouseholdIncome='K' then '$120K-$130K'
when HouseholdIncome='L' then '$130K-$140K'
when HouseholdIncome='M' then '$140K-$150K'
when HouseholdIncome='O' then '$175K-$200K'
when HouseholdIncome='P' then '$200K-$225K'
when HouseholdIncome='Q' then '$225K-$250K'
when HouseholdIncome='R' then '$250K-$275K'
when HouseholdIncome='S' then '$275K-$300K'
when HouseholdIncome='T' then '$300K-$400K'
when HouseholdIncome='U' then '$400K-$500K'
when HouseholdIncome='V' then '$500K-$600K'
when HouseholdIncome='W' then '$600K-$750K'
when HouseholdIncome='X' then '$750K-$1000K'
when HouseholdIncome='Y' then '$1000K-$2000K'
when HouseholdIncome='Z' then '$2000K+' end AS HouseholdIncomeRange,
F.State, count(*), count(distinct P.Email)
from pi_table P
LEFT JOIN feature_table F ON P.ID=F.ID
where F.age<45 and F.NumberOfChildren >= 1 AND F.Homeowner in ('H','W') and F.HouseholdIncome not in ('1','2','A','B','C','D','E ','F') AND F.State in ('MA', 'RI', 'NH', 'ME', 'VT', 'FL', 'GA', 'NC') AND P.Email is not null
GROUP BY F.NumberOfChildren, P.gender, F.eoy_age,F.Homeowner,F.HouseholdIncome,F.State;

SQL query with many 'AND NOT CONTAINS' statements

I am trying to exclude time zones that contain certain substrings, so that I only keep records that are likely from the US.
The query works fine (e.g., the first line after the OR will remove local_timezones that include 'Africa/Abidjan'), but there's got to be a better way to write it.
It's too verbose, repetitive, and I suspect it's slower than it could be. Any advice greatly appreciated. (I'm using Snowflake's flavor of SQL but not sure that matters in this case).
NOTE: I'd like to keep a timezone such as America/Los_Angeles, but not America/El_Salvador, so for this reason I don't think wildcards are a good solution.
SELECT a_col
FROM a_table
WHERE
(country = 'United States')
OR
((country is NULL and not contains (local_timezone, 'Africa')
AND
country is NULL and not contains (local_timezone, 'Asia')
AND
country is NULL and not contains (local_timezone, 'Atlantic')
AND
country is NULL and not contains (local_timezone, 'Australia')
AND
country is NULL and not contains (local_timezone, 'Etc')
AND
country is NULL and not contains (local_timezone, 'Europe')
AND
country is NULL and not contains (local_timezone, 'Araguaina')
etc etc
If you have a known list of "good things", I would make a table and then just JOIN to it. Here I made you a list of good timezones:
CREATE TABLE acceptable_timezone (tz_name text) AS
SELECT * FROM VALUES
('Pacific/Auckland'),
('Pacific/Fiji'),
('Pacific/Tahiti');
I love me some Pacific... Now we have some important data in a CTE:
WITH data(id, timezone) AS (
SELECT * FROM VALUES
(1, 'Pacific/Auckland'),
(2, 'Pacific/Fiji'),
(3, 'America/El_Salvador')
)
SELECT d.*
FROM data AS d
JOIN acceptable_timezone AS a
ON a.tz_name = d.timezone
ORDER BY 1;
which, as intended, does not match the El Salvador row:
ID | TIMEZONE
---+------------------
 1 | Pacific/Auckland
 2 | Pacific/Fiji
You cannot get much faster than an equi-join, but if your data has the timezones as substrings, then the table can hold wildcard patterns (%) and you can use a LIKE, just as Felipe's answer does, but written as a join:
JOIN acceptable_timezone AS a
  ON d.timezone LIKE a.tz_name
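Put together, that variant might look like this (a sketch; the table name and pattern values are made up):
-- hypothetical table: patterns stored with % wildcards instead of exact names
CREATE OR REPLACE TABLE acceptable_timezone_patterns (tz_pattern text) AS
SELECT * FROM VALUES
    ('Pacific/%'),
    ('America/Los_Angeles');

WITH data(id, timezone) AS (
    SELECT * FROM VALUES
        (1, 'Pacific/Auckland'),
        (2, 'America/Los_Angeles'),
        (3, 'America/El_Salvador')
)
SELECT d.*
FROM data AS d
JOIN acceptable_timezone_patterns AS a
  ON d.timezone LIKE a.tz_pattern   -- wildcard match instead of equality
ORDER BY 1;
This keeps America/Los_Angeles but still drops America/El_Salvador.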
You can use LIKE ANY:
with data as
(select null country, 'something Australia maybe' local_timezone)
select *
from data
where country = 'United States'
or (
country is null
and not local_timezone like any ('%Australia%', '%Africa%', '%Atlantic%')
)
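Applied to the original query it would look roughly like this (same columns as the question; the pattern list is shortened):
SELECT a_col
FROM a_table
WHERE country = 'United States'
   OR (country IS NULL
       AND NOT local_timezone LIKE ANY ('%Africa%', '%Asia%', '%Atlantic%',
                                        '%Australia%', '%Etc%', '%Europe%'))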

CASE Statement - An expression services limit has been reached

I'm getting the following error:
An expression services limit has been reached. Please look for potentially complex expressions in your query, and try to simplify them.
I'm attempting to run the query below, but it appears there is one line too many in my CASE expression: when I remove the "London" line (or "Scotland", for example), it works perfectly.
I can't think of the best way to split this statement.
If I split it into two queries and UNION ALL them, it does work; however, the ELSE 'No Region' becomes a problem. Everything included in the first part of the query shows as "No Region" in the second part of the query, and vice versa.
(My end goal is essentially to create a list of customers per region, which I can then use as the foundation of a regional sales report.)
Many Thanks
Andy
SELECT T0.CardCode, T0.CardName, T0.PostCode,
CASE
WHEN T0.PostCodeABR IN ('DG','KW','IV','PH','AB','DD','PA','FK','KY','G','EH','ML','KA','TD') THEN 'Scotland'
WHEN T0.PostCodeABR IN ('BT') THEN 'Ireland'
WHEN T0.PostCodeABR IN ('CA','NE','DH','SR','TS','DL','LA','BD','HG','YO','HX','LS','FY','PR','BB','L','WN','BL','OL') THEN 'North M62'
WHEN T0.PostCodeABR IN ('CH','WA','CW','SK','M','HD','WF','DN','HU','DE','NG','LN','S') THEN 'South M62'
WHEN T0.PostCodeABR IN ('LL','SY','LD','SA','CF','NP') THEN 'Wales'
WHEN T0.PostCodeABR IN ('NR','IP','CB') THEN 'East Anglia'
WHEN T0.PostCodeABR IN ('SN','BS','BA','SP','BH','DT','TA','EX','TQ','PL','TR') THEN 'South West'
WHEN T0.PostCodeABR IN ('LU','AL','HP','SG','SL','RG','SO','GU','PO','BN','RH','TN','ME','CT','SS','CM','CO') THEN 'South East'
WHEN T0.PostCodeABR IN ('ST','TF','WV','WS','DY','B','WR','HR','GL','OX','CV','NN','MK','PE','LE') THEN 'Midlands'
WHEN T0.PostCodeABR IN ('WD','EN','HA','N','NW','UB','W','WC','EC','E','IG','RM','DA','BR','CR','SM','KT','TW','SW') THEN 'London'
ELSE 'No Region'
END AS 'Region'
FROM [dbo].[REPS-PostcodeABBR] T0
As I mentioned in the comment, I would suggest you create a "lookup" table for the post codes; then all you need to do is JOIN to that table instead of maintaining a "messy" and large CASE expression (T-SQL doesn't support CASE (switch) statements, only CASE expressions).
So your lookup table would look a little like this:
CREATE TABLE dbo.PostcodeRegion (Postcode varchar(2),
Region varchar(20));
GO
--Sample data
INSERT INTO dbo.PostcodeRegion (Postcode,Region)
VALUES('DG','Scotland'),
('BT','Ireland'),
('LL','Wales');
And then your query would just do a LEFT JOIN:
SELECT RPA.CardCode,
RPA.CardName,
RPA.PostCode,
COALESCE(PR.Region,'No Region') AS Region
FROM [dbo].[REPS-PostcodeABBR] RPA --T0 is a poor choice of alias; there is no T nor 0 in "REPS-PostcodeABBR"
LEFT JOIN dbo.PostcodeRegion PR ON RPA.PostCodeABR = PR.Postcode;
Note you would likely want to INDEX the table as well, and/or apply a UNIQUE CONSTRAINT or PRIMARY KEY to the PostCode column.
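For instance (a sketch; the constraint name is made up):
-- Postcode has to be NOT NULL before it can be a primary key
ALTER TABLE dbo.PostcodeRegion ALTER COLUMN Postcode varchar(2) NOT NULL;
-- the primary key gives you uniqueness and an index in one step
ALTER TABLE dbo.PostcodeRegion
    ADD CONSTRAINT PK_PostcodeRegion PRIMARY KEY (Postcode);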
Thanks for the help... I tried multiple ways mentioned above, and they all did work; however, the most efficient seemed to be this way.
I created a lookup table within SAP; this table included PostCodeFrom, PostCodeTo, PostCodeABR, and Region.
A row would look like: TS00, TS99, TS, North M62.
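Roughly like this (a sketch only; the column types are guesses, since the real table lives in SAP):
CREATE TABLE PCLOOKUP (PostCodeFrom varchar(10),
                       PostCodeTo   varchar(10),
                       PostCodeABR  varchar(5),
                       Region       varchar(20));

INSERT INTO PCLOOKUP (PostCodeFrom, PostCodeTo, PostCodeABR, Region)
VALUES ('TS00', 'TS99', 'TS', 'North M62');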
I then did:
SELECT OCRD.ZipCode, PCLOOKUP.Region, PCLOOKUP.PostCodeABR FROM OCRD LEFT OUTER JOIN PCLOOKUP ON OCRD.ZipCode >= PCLOOKUP.PostCodeFrom AND OCRD.ZipCode <= PCLOOKUP.PostCodeTo
Basically, if the postcode is between From and To, display the abbreviation and region.

SQL Join Problems

I'm having a problem that I assume is related to the Join in my SQL statement.
select s.customer as 'Customer',
s.store as 'Store',
s.item as 'Item',
d.dlvry_dt as 'Delivery',
i.item_description as 'Description',
mj.major_class_description as 'Major Description',
s.last_physical_inventory_dt as 'Last Physical Date',
s.qty_physical as 'Physical Qty',
s.avg_unit_cost as 'Unit Cost',
[qty_physical]*[avg_unit_cost] as Value
from argus.DELIVERY d
join argus.STORE_INVENTORY s
ON (s.store = d.store)
join argus.ITEM_MASTER i
ON (s.item = i.item)
join argus.MINOR_ITEM_CLASS mi
ON (i.minor_item_class = mi.minor_item_class)
join argus.MAJOR_ITEM_CLASS mj
ON (mi.major_item_class = mj.major_item_class)
where s.last_physical_inventory_dt between '6/29/2011' and '7/2/2012'
and s.customer = '20001'
and s.last_physical_inventory_dt IS NOT NULL
It comes back with a seemingly infinite number of copies of one record. Is there something wrong with the way I'm joining these tables?
join argus.MINOR_ITEM_CLASS mi
ON (i.minor_item_class = mi.minor_item_class)
join argus.MAJOR_ITEM_CLASS mj
ON (mi.major_item_class = mj.major_item_class)
My guess is that your error resides in one of these two joins. When you use the word JOIN on its own, it is treated as an INNER JOIN, which returns a row for every matching pair of records. I don't know what your data looks like, but I am assuming there is a many-to-many relationship between minor item class and major item class, so when you run this query you receive duplicated records where almost every field repeats but the major item class differs.
Look at the results: most of the columns will have repeating data that doesn't change, while one column will have a different value for every row. The column with differing data for each row is the one you should be joining differently.
Otherwise, I would say that your query is formatted correctly.
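If you want to test that guess directly, a quick check (a sketch using the table names from the question) is to count how many major-class rows each minor class matches; anything above one is the fan-out that duplicates your records:
SELECT mi.minor_item_class, COUNT(*) AS major_matches
FROM argus.MINOR_ITEM_CLASS mi
JOIN argus.MAJOR_ITEM_CLASS mj
  ON mi.major_item_class = mj.major_item_class
GROUP BY mi.minor_item_class
HAVING COUNT(*) > 1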

SQL query producing duplicate rows and I can't see why

My query always produces duplicate results. What is the best way to troubleshoot this query against a database of more than 1 million rows?
Select segstart
,segment
,callid
,Interval
,dialed_num
,FiscalMonthYear
,SegStart_Date
,row_date
,Name
,Xferto
,TransferType
,Agent
,Sup
,Manager
,'MyCenter' = Case Center
When 'Livermore Call Center' Then 'LCC'
When 'Natomas Call Center' Then 'NCC'
When 'Concord Call Center' Then 'CCC'
When 'Virtual Call Center' Then 'VCC'
When 'Morgan Hill Call Center' Then 'MHCC'
Else Center
End
,Xferfrom
,talktime
,ANDREWSTABLE.transferred
,ANDREWSTABLE.disposition
,dispsplit
,callid
,hsplit.starttime
,CASE
WHEN hsplit.callsoffered > 0
THEN (CAST(hsplit.acceptable as DECIMAL)/hsplit.callsoffered)*100
ELSE '0'
END AS 'Service Level'
,hsplit.callsoffered
,hsplit.acceptable
FROM
(
Select segstart,
100*DATEPART(HOUR, segstart) + 30*(DATEPART(MINUTE, segstart)/30) as Interval,
FiscalMonthYear,
SegStart_Date,
dialed_num,
callid,
Name,
t.Queue AS 'Xferto',
TransferType,
RepLName+', '+RepFName AS Agent,
SupLName+', '+SupFName AS Sup,
MgrLName+', '+MgrFName AS Manager,
q.Center,
q.Queue AS 'Xferfrom',
e.anslogin,
e.origlogin,
t.Extension,
transferred,
disposition,
talktime,
dispsplit,
segment
From CMS_ECH.dbo.CaliforniaECH e
INNER JOIN Cal_RemReporting.dbo.TransferVDNs t on e.dialed_num = t.Extension
INNER JOIN InfoQuest.dbo.IQ_Employee_Profiles_v3_AvayaId q on e.origlogin = q.AvayaID
INNER JOIN Cal_RemReporting.dbo.udFiscalMonthTable f on e.SegStart_Date = f.Tdate
Where SegStart_Date between getdate()-90 and getdate()-1
And q.Center not in ('Collections Center',
'Cable Store',
'Business Services Center',
'Escalations')
And SegStart_Date between RepToSup_StartDate and RepToSup_EndDate
And SegStart_Date between SupToMgr_StartDate and SupToMgr_EndDate
And SegStart_Date between Avaya_StartDate and Avaya_EndDate
And SegStart_Date between RepQueue_StartDate and RepQueue_EndDate
AND (e.transferred like '1'
OR e.disposition like '4') order by segstart
) AS ANDREWSTABLE
--Left Join CMS_ECH.dbo.hsplit hsplit on hsplit.starttime = ANDREWSTABLE.Interval and hsplit.row_date = ANDREWSTABLE.SegStart_Date and ANDREWSTABLE.dispsplit = hsplit.split
There are two possibilities:
1. There are multiple records in your system which appear to produce duplicate rows in your result set, because your projection doesn't select sufficient columns to distinguish them or your WHERE clause doesn't filter them out.
2. Your joins are generating spurious duplicates because the ON clauses are not complete.
Both of these can only be solved by somebody with the requisite level of domain knowledge. So we are not going to fix that query for you. Sorry.
What you need to do is compare some duplicate results with some non-duplicate results and discover what the first group has in common that also distinguishes it from the second group.
I'm not saying it is easy, especially with millions of rows. But if it were easy it wouldn't be worth doing.
I have run into this a couple of times myself, and it always ends up being one of my join statements. I would try removing your joins one at a time and seeing whether removing one of them reduces the number of duplicates.
Your other option is to find a duplicated set of rows and query each table in the join on its join values to see what you get back, as in the sketch below.
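For one duplicated row, something like this (a sketch; the table and column names come from the query above, and the literals are placeholders for values taken from that row):
-- any count greater than 1 points at the join that is multiplying your rows
SELECT COUNT(*) FROM Cal_RemReporting.dbo.TransferVDNs             WHERE Extension = '<dialed_num from the duplicated row>';
SELECT COUNT(*) FROM InfoQuest.dbo.IQ_Employee_Profiles_v3_AvayaId WHERE AvayaID   = '<origlogin from the duplicated row>';
SELECT COUNT(*) FROM Cal_RemReporting.dbo.udFiscalMonthTable       WHERE Tdate     = '<SegStart_Date from the duplicated row>';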
Also, what database are you running and what version?