LEFT JOIN is killing my query - how to speed up?

I have two tables:
A list of documents created in our system (offers and invoices), written when something is logged as "done":
BELEG_ART | BELEG_NR | DATUM | BELEG_TYPE
160 | 337691 | 11.01.2021 | Invoice
10 | 195475 | 04.01.2021 | Offer
20 | 195444 | 04.01.2021 | Confirmation
A list of transactions (sales etc.) with article information AND the document info:
ANUMMER | KDNR | NAME1 | REC_LIST
181557 | 59301205 | Fred | 332240
195973 | 59306391 | John | 338225
189661 | 59304599 | Steve | 335495
189718 | 49302475 | Ed |
196483 | 59303491 | Mark | 338204
190021 | 49302595 | Jones |
You can see that the Offers and Confirmations start with a "1", and the Invoices with a "3".
I need everything to be linked and identified by its ANUMMER starting with "1". Later on, I'll pull other tables based on this number, so it's the critical point for me.
The problem is that in the documents table, when you see an invoice, you don't see the ANUMMER. You only see the number starting with "3".
So, I have created a join as below to pull everything together.
SELECT
DAB700.BELEG_ART,DAB700.BELEG_NR,DAB700.DATUM,DAB700.BUCH_DATUM,
case // rename the documents to something more meaningful
when DAB700.BELEG_ART = 10 then 'Angebote'
when DAB700.BELEG_ART = 20 then 'Auftrag'
when DAB700.BELEG_ART = 60 then 'Lieferschein'
when DAB700.BELEG_ART = 160 then 'Rechnung'
else 'not defined'
end as "BELEG_TYPE",
DAB050.ANUMMER,
case // if the document is an offer, then copy it to order_number. If it's a invoice, copy that number to order_number.
when DAB700.BELEG_ART = 10 then DAB700.BELEG_NR
when DAB700.BELEG_ART = 160 then DAB050.ANUMMER
else 'NA'
end as "order_number"
FROM "DAB700.ADT" DAB700
// if the document is an invoice, then join to the table DAB050 and reference the same key field REC_LIST.
left join "DAB050.ADT" DAB050 on
Case
When DAB700.BELEG_ART = 160 then DAB700.BELEG_NR = DAB050.REC_LIST
End
WHERE (DAB700.DATUM={d '2021-01-12'})
So - to my problem and question: when running this join query, it's much slower than I'd like (even with small datasets). Is there a way to restructure this, so it's faster?
Long story short - to simplify:
I want to add some columns to table 1
when table 1-BELEG_NR starts with a "1", good - just move the number into a new column
when table 1-BELEG_NR starts with a "3", then I have to link to table 3, and pull in the ANUMMER
thanks for your help

The problem is that the join condition is not optimizable because (1) it is a complex expression, and (2) it involves a memo/CLOB column (REC_LIST is probably a memo, based on your other question). What that means is that the join condition has to be evaluated for each combination of rows in DAB700 and DAB050.
It seems to me that the join condition can be simplified to:
DAB700.BELEG_ART = 160 And DAB700.BELEG_NR = DAB050.REC_LIST
There is no need for the Case, because the default case is null/false. That will cut down the number of rows in DAB700 participating in the join. However, the second part of the join is still not optimal if REC_LIST is a MEMO or CLOB data type.
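Applied to the query in the question, the rewrite would look roughly like this (a sketch only; the SELECT list and CASE expressions stay exactly as they were):
SELECT
DAB700.BELEG_ART, DAB700.BELEG_NR, DAB700.DATUM, DAB700.BUCH_DATUM,
DAB050.ANUMMER
// ... plus the two CASE expressions from the original SELECT list, unchanged
FROM "DAB700.ADT" DAB700
left join "DAB050.ADT" DAB050
on DAB700.BELEG_ART = 160            // only invoices need the lookup
and DAB700.BELEG_NR = DAB050.REC_LIST
WHERE (DAB700.DATUM={d '2021-01-12'})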
In any case, I would also suggest checking whether REC_LIST should be a fixed- or variable-length string column (CHAR, NCHAR, VARCHAR or NVARCHAR) instead of MEMO, which cannot be indexed and is therefore not useful for optimization.

Related

Should I use an SQL full outer join for this?

Consider the following tables:
Table A: DOC_NUM, DOC_TYPE, RELATED_DOC_NUM, NEXT_STATUS, ...
Table B: DOC_NUM, DOC_TYPE, RELATED_DOC_NUM, NEXT_STATUS, ...
The DOC_TYPE and NEXT_STATUS columns have different meanings between the two tables, although a NEXT_STATUS = 999 means "closed" in both. Also, under certain conditions, there will be a record in each table, with a reference to a corresponding entry in the other table (i.e. the RELATED_DOC_NUM columns).
I am trying to create a query that will get data from both tables that meet the following conditions:
A.RELATED_DOC_NUM = B.DOC_NUM
A.DOC_TYPE = "ST"
B.DOC_TYPE = "OT"
A.NEXT_STATUS < 999 OR B.NEXT_STATUS < 999
A.DOC_TYPE = "ST" represents a transfer order to transfer inventory from one plant to another. B.DOC_TYPE = "OT" represents a corresponding receipt of the transferred inventory at the receiving plant.
We want to get records from either table where there is an ST/OT pair where either or both entries are not closed (i.e. NEXT_STATUS < 999).
I am assuming that I need to use a FULL OUTER join to accomplish this. If this is the wrong assumption, please let me know what I should be doing instead.
UPDATE (11/30/2021):
I believe that @Caius Jard is correct in that this does not need to be an outer join. There should always be an ST/OT pair.
With that I have written my query as follows:
SELECT <columns>
FROM A LEFT JOIN B
ON
A.RELATED_DOC_NUM = B.DOC_NUM
WHERE
A.DOC_TYPE IN ('ST') AND
B.DOC_TYPE IN ('OT') AND
(A.NEXT_STATUS < 999 OR B.NEXT_STATUS < 999)
Does this make sense?
UPDATE 2 (11/30/2021):
The reality is that these are DB2 database tables being used by the JD Edwards ERP application. The only way I know of to see the table definitions is by using the web site http://www.jdetables.com/, entering the table ID and hitting return to run the search. It comes back with a ton of information about the table and its columns.
Table A is really F4211 and table B is really F4311.
Right now, I've stripped the query down to keep variables to a minimum. This is what I have currently:
SELECT CAST(F4211.SDDOCO AS VARCHAR(8)) AS SO_NUM,
F4211.SDRORN AS RELATED_PO,
F4211.SDDCTO AS SO_DOC_TYPE,
F4211.SDNXTR AS SO_NEXT_STATUS,
CAST(F4311.PDDOCO AS VARCHAR(8)) AS PO_NUM,
F4311.PDRORN AS RELATED_SO,
F4311.PDDCTO AS PO_DOC_TYPE,
F4311.PDNXTR AS PO_NEXT_STATUS
FROM PROD2DTA.F4211 AS F4211
INNER JOIN PROD2DTA.F4311 AS F4311
ON F4211.SDRORN = CAST(F4311.PDDOCO AS VARCHAR(8))
WHERE F4211.SDDCTO IN ( 'ST' )
AND F4311.PDDCTO IN ( 'OT' )
The other part of the story is that I'm using a reporting package that allows you to define "virtual" views of the data. Virtual views allow the report developer to specify the SQL to use. This is the application where I am using the SQL. When I set up the SQL, there is a validation step that must be performed. It will return a limited set of results if the SQL is validated.
When I enter the query above and validate it, it says that there are no results, which makes no sense. I'm guessing the data casting is causing the issue, but not sure.
UPDATE 3 (11/30/2021):
One more twist to the story. The related doc number is not only defined as a string value, but it contains leading zeros. This is true in both tables. The main doc number (in both tables) is defined as a numeric value and therefore has no leading zeros. I have no idea why those who developed JDE would have done this, but that is what is there.
So, there are matching records between the two tables that meet the criteria, but I think I'm getting no results because when I convert the numeric to a string, it does not match, because one value is, say "12345", while the other is "00012345".
Can I pad the numeric -> string value with zeros before doing the equals check?
UPDATE 4 (12/2/2021):
I was finally able to get the query to work by converting the numeric doc num to a left-zero-padded string.
SELECT <columns>
FROM PROD2DTA.F4211 AS F4211
INNER JOIN PROD2DTA.F4311 AS F4311
ON F4211.SDRORN = RIGHT(CONCAT('00000000', CAST(F4311.PDDOCO AS VARCHAR(8))), 8)
WHERE F4211.SDDCTO IN ( 'ST' )
AND F4311.PDDCTO IN ( 'OT' )
AND ( F4211.SDNXTR < 999
OR F4311.PDNXTR < 999 )
You should write your query as follows:
SELECT <columns>
FROM A INNER JOIN B
ON
A.RELATED_DOC_NUM = B.DOC_NUM
WHERE
A.DOC_TYPE IN ('ST') AND
B.DOC_TYPE IN ('OT') AND
(A.NEXT_STATUS < 999 OR B.NEXT_STATUS < 999)
LEFT JOIN is a type of OUTER join (LEFT JOIN is typically a contraction of LEFT OUTER JOIN). OUTER means "one side might have nulls in every column because there was no match". Most critically, the code as posted in the question (a LEFT JOIN, but with a WHERE some_column_from_the_right_table = some_value clause) runs as an INNER join, because any NULLs introduced by the LEFT OUTER process are then quashed by the WHERE clause.
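For contrast, if the outer-join behaviour had actually been needed (keep every ST row even when no OT row matches), the filters on the right-hand table would have to move into the ON clause, something like this sketch:
SELECT <columns>
FROM A LEFT JOIN B
ON A.RELATED_DOC_NUM = B.DOC_NUM
AND B.DOC_TYPE = 'OT'                              -- right-table filter stays in the ON clause
WHERE
A.DOC_TYPE = 'ST' AND
(A.NEXT_STATUS < 999 OR B.NEXT_STATUS < 999)       -- unmatched B rows have NULL NEXT_STATUS, so the A side decides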
See Update 4 for details of how I resolved the "data conversion or mapping" error.

Defaulting missing data

I have a complex set of schema that I am trying to pull data out of for a report. The query for it joins a bunch of tables together and I am specifically looking to pull a subset of data where everything for it might be null. The original relations for the tables look as such.
Location.DeptFK
Dept.PK
Section.DeptFK
Subsection.SectionFK
Question.SubsectionFK
Answer.QuestionFK, SubmissionFK
Submission.PK, LocationFK
From here my problems begin to compound a little.
SELECT Section.StepNumber + '-' + Question.QuestionNumber AS QuestionNumberVar,
Question.Question,
Subsection.Name AS Subsection,
Section.Name AS Section,
SUM(CASE WHEN (Answer.Answer = 0) THEN 1 ELSE 0 END) AS NA,
SUM(CASE WHEN (Answer.Answer = 1) THEN 1 ELSE 0 END) AS AnsNo,
SUM(CASE WHEN (Answer.Answer = 2) THEN 1 ELSE 0 END) AS AnsYes,
(select count(distinct Location.Abbreviation) from Department inner join Plant on location.DepartmentFK = Department.PK WHERE(Department.Name = 'insertParameter'))
as total
FROM Department inner join
section on Department.PK = section.DepartmentFK inner JOIN
subsection on Subsection.SectionFK = Section.PK INNER JOIN
question on Question.SubsectionFK = Subsection.PK INNER JOIN
Answer on Answer.QuestionFK = question.PK inner JOIN
Submission on Submission.PK = Answer.SubmissionFK inner join
Location on Location.DepartmentFK = Department.PK AND Location.pk = Submission.PlantFK
WHERE (Department.Name = 'InsertParameter') AND (Submission.MonthTested = '1/1/2017')
GROUP BY Question.Question, QuestionNumberVar, Subsection.Name, Section.Name, Section.StepNumber
ORDER BY QuestionNumberVar;
There are 15 total locations; with this query I get 12. If I remove a relation in the join for Location I get 15 total locations, but my answer data gets multiplied by 15. My issue is that not all locations are required to test at the same time, so their answers should default to NA. They don't get records placed in the DB, so the relationship between Location/Submission is absent.
I have a workaround almost in place via the select count distinct but, The second part is a query for finding what each location answered instead of a sum which brings the problem right back around. It also has to be dynamic because the input parameters for a department won't bring a static number of locations back each time.
I am still learning my SQL so any additional material to look at for building this query would also be appreciated. So I guess the big question here is, How would I go about creating default data in this query for anytime the Location/Submission relation has a null value?
Edit: Dummy Data
QuestionNumberVar | Section | Subsection | Question | AnsYes | AnsNo | NA (expected)
1-1.1 | Math | Algebra | Did you do your homework? | 10 | 1 | 1 (4)
1-1.2 | Math | Algebra | Did your dog eat it? | 9 | 3 | 0 (3)
2-1.1 | English | Greek | Did you do your homework? | 8 | 0 | 4 (7)
I have tried making left joins at various applicable places in the code, to no avail; all attempts ended with no effect on the output. This query feeds the dataset for an SSRS report. There are a couple of workarounds for this particular section via an expression that takes the total Locations and subtracts AnsYes and AnsNo to get the true NA value, but as explained above that doesn't help with my next query.
Edit: SQL Server 2012 for those who asked
Edit: My attempt at an isnull() on the missing data returns nothing, I suspect because the query already eliminates the "null/missing" data. Left joining while doing this has also failed. The point of failure is on Submission: if we bind it to Location, there are locations missing, but if we don't bind it, there are multiplied duplicates, because Department has a one-to-many relationship with Location and not vice versa. I am unable to make any schema changes to improve this process.
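For what it's worth, the usual set-based pattern for this kind of gap is to build the full Location-by-Question grid with inner joins first, and only then LEFT JOIN Submission and Answer onto it, so a missing submission becomes a NULL row that can be counted as NA. A rough sketch reusing the table and column names from the query above (not tested against this schema; the question uses both LocationFK and PlantFK for the submission's location column, so adjust to whichever is real):
SELECT Section.StepNumber + '-' + Question.QuestionNumber AS QuestionNumberVar,
       Question.Question,
       SUM(CASE WHEN Answer.Answer = 2 THEN 1 ELSE 0 END) AS AnsYes,
       SUM(CASE WHEN Answer.Answer = 1 THEN 1 ELSE 0 END) AS AnsNo,
       -- a location with no Submission (or no Answer row) falls into NA by default
       SUM(CASE WHEN Answer.Answer = 0 OR Answer.QuestionFK IS NULL THEN 1 ELSE 0 END) AS NA
FROM Department
INNER JOIN Location   ON Location.DepartmentFK = Department.PK
INNER JOIN Section    ON Section.DepartmentFK = Department.PK
INNER JOIN Subsection ON Subsection.SectionFK = Section.PK
INNER JOIN Question   ON Question.SubsectionFK = Subsection.PK
LEFT JOIN Submission  ON Submission.LocationFK = Location.PK
                     AND Submission.MonthTested = '1/1/2017'
LEFT JOIN Answer      ON Answer.SubmissionFK = Submission.PK
                     AND Answer.QuestionFK = Question.PK
WHERE Department.Name = 'InsertParameter'
GROUP BY Section.StepNumber, Question.QuestionNumber, Question.Question
ORDER BY QuestionNumberVar;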
There is a previous report that I am trying to emulate/update. It used C# logic to process the data and ran multiple queries to obtain the same data. I don't have that luxury (the previous report exports directly to Excel instead of SSRS). Here is the previous logic used:
select PK from Department where Name = 'InsertParameter';
select PK from Submission where LocationFK = 'Location.PK_var' and MonthTested = '1/1/2017'
Then it runs those into a loop where it processes nulls into NA using C# logic
EDIT (Mediocre Solution): I ended up doing the workaround of making a calculated field that subtracts Yes and No from the total number of Locations that have that Department. This is a mediocre solution because I didn't solve my original problem and made three datasets that should have been displayed as a single dataset: one for the question info, one for each location's answer, and one for locations that didn't participate. If a true answer comes up I will check its validity, but for now the problem is pseudo-solved.

How to make a query to obtain only results that have N number within a range of values?

I'm trying to extract nutrient data in MS Access 2007 from the USDA food database, freely available at http://www.ars.usda.gov/Services/docs.htm?docid=24912
I need records that have ALL the nutrients from NUT_DATA.Nutr_No with values between '501' and '511', and I wish to exclude incomplete records that have missing values.
Currently, Baby food banana has every nutrient from 501 to 511, but Baby food Beverage has only 9 of the nutrients listed, and many others are like that.
As a last resort, I guess it would be acceptable to have all records, showing null for missing values, as long as each FOOD_DES.Long_Desc has exactly 11 records, one for each NUT_DATA.Nutr_No OR NUTR_DEF.NutrDesc (which correspond to each other).
SELECT
FOOD_DES.NDB_No, FOOD_DES.FdGrp_Cd, FOOD_DES.Long_Desc, NUT_DATA.Nutr_No, NUTR_DEF.NutrDesc, NUT_DATA.Nutr_Val, WEIGHT.Amount, WEIGHT.Msre_Desc, WEIGHT.Gm_Wgt, [WEIGHT]![Amount] & " " & [WEIGHT]![Msre_Desc] AS msre
FROM
NUTR_DEF inner JOIN ((FOOD_DES INNER JOIN NUT_DATA ON FOOD_DES.NDB_No=NUT_DATA.NDB_No) INNER JOIN WEIGHT ON FOOD_DES.NDB_No=WEIGHT.NDB_No) ON NUTR_DEF.Nutr_No=NUT_DATA.Nutr_No
WHERE
(NUT_DATA.Nutr_No between '501' and '511' ) and ((WEIGHT.Seq)="1") and NUT_DATA.Nutr_Val > '0' and
// this part is me out of ideas trying stuff, but didn't help
EXISTS (SELECT 1
FROM
NUTR_DEF inner JOIN ((FOOD_DES INNER JOIN NUT_DATA ON FOOD_DES.NDB_No=NUT_DATA.NDB_No) INNER JOIN WEIGHT ON FOOD_DES.NDB_No=WEIGHT.NDB_No) ON NUTR_DEF.Nutr_No=NUT_DATA.Nutr_No
WHERE count FOOD_DES.Long_Desc = "11" )
//end wild of experimentation
ORDER BY FOOD_DES.Long_Desc, NUTR_DEF.SR_Order;
This is a sample of the data; I just copied the most important columns. The rows in red are not what I'm looking for because they don't have all 11 nutrients. I can paste the whole table into the Google doc if someone thinks that would help.
https://docs.google.com/spreadsheets/d/1FghDD59wy2PYlpsqUlYVc3Ulwvy4MMLagpBUYtvLBfI/edit?usp=sharing
As your starting point, identify which food items have values > 0 for all 11 of those nutrients. Check whether this simpler GROUP BY query shows you the correct items:
SELECT ndat.NDB_No
FROM
NUT_DATA AS ndat
INNER JOIN WEIGHT AS wt
ON ndat.NDB_No = wt.NDB_No
WHERE
ndat.Nutr_Val>0
AND ndat.Nutr_No IN('501','502','503','504','505','506','507','508','509','510','511')
AND wt.Seq='1'
GROUP BY ndat.NDB_No
HAVING Count(ndat.Nutr_No)=11;
Note you could use Val(ndat.Nutr_No) Between 501 And 511 as the Nutr_No restriction, which would give you a more concise statement. However, evaluating Val() for every row of the table means that approach would forgo the performance benefit of indexed retrieval, so that version of the query should be noticeably slower.
Save that query and create a new query which joins it to the base tables for the additional data you need from other columns. Or use it as a subquery instead of a named query if you prefer.
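For example, if the GROUP BY query above were saved as qryCompleteFoods (a name made up here), the follow-up query could look something like this sketch:
SELECT FOOD_DES.NDB_No, FOOD_DES.FdGrp_Cd, FOOD_DES.Long_Desc,
       NUT_DATA.Nutr_No, NUTR_DEF.NutrDesc, NUT_DATA.Nutr_Val
FROM ((qryCompleteFoods INNER JOIN FOOD_DES ON qryCompleteFoods.NDB_No = FOOD_DES.NDB_No)
      INNER JOIN NUT_DATA ON FOOD_DES.NDB_No = NUT_DATA.NDB_No)
      INNER JOIN NUTR_DEF ON NUT_DATA.Nutr_No = NUTR_DEF.Nutr_No
WHERE NUT_DATA.Nutr_No BETWEEN '501' AND '511'
ORDER BY FOOD_DES.Long_Desc, NUTR_DEF.SR_Order;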

Ways to clean up messy records in SQL

I have the following sql data:
ID | Company Name | Customer | Address 1 | City | State | Zip | Date
0108500 | AAA Test | Mish~Sara | Newa Claims | Chtiana | CO | 123 | 06FE0046
0108500 | AAA.Test | Mish~Sara | Newa Claims | Chtiana | CO | 123 | 06FE0046
1802600 | AAA Test Company | Ban, Adj.~Gorge | PO Box 83 | MouLaurel | CA | 153 | 09JS0025
1210600 | AAA Test Company | Biwel~Brce | 97kehst ve | Jacn | CA | 153 | 04JS0190
AAA Test, AAA.Test and AAA Test Company are considered one company.
Since the data is messy, I'm considering one of the following:
Is there a way to search all the records in the DB for company names that are almost the same, then rename them to the longest name?
In this case, AAA Test and AAA.Test would become AAA Test Company.
Or is there a way to filter only the records with company names that are almost the same, so someone can have the option to change them?
If there's no way to do it via SQL query, what are your suggestions for cleaning up the records? There are almost 1 million records in the database, and it's hard to clean them up manually.
Thank you in advance.
You could use a string-matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate people's names that have been typed in differently. It can take a while, but it works well for the fuzzy matching you're looking for.
Something like a self-join? || is ANSI SQL concatenation; some products have a CONCAT function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on the data type, you may have to remove trailing blanks from t2.companyname; use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas and dots etc.
Use a case-insensitive collation. SOUNDEX can also be used, and so on.
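Putting those pieces together, the join condition might end up looking something like this (a sketch in ANSI-ish SQL; exact function names vary by product, and the table/column names follow the example above):
select t1.ID, t1.companyname, t2.ID, t2.companyname
from tablename t1
join tablename t2
  on upper(replace(replace(trim(t1.companyname), '.', ''), ',', ''))
     like '%' || upper(replace(replace(trim(t2.companyname), '.', ''), ',', '')) || '%'
 and t1.ID <> t2.ID   -- don't match a row against itself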
I think most database servers support full-text search, and if so, there are full-text functions that support proximity searches. For example, SQL Server has a NEAR operator; here is its documentation: https://msdn.microsoft.com/en-us/library/ms142568.aspx
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation and whitespace, then match on the first 6 to 10 characters (using a self-join). Assuming your table is called "vendor": add two columns, "status" and "dupstr", then update as follows:
/** Populate dupstr column for fuzzy match **/
/** plain REPLACE is used here: '.' as a regex pattern would strip every character **/
update vendor v
set v.dupstr = left(upper(replace(replace(v.companyname, '.', ''), ' ', '')), 6)
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep' --indicate keeper record
where
--dupes to clean up
exists ( select 1 from vendor v1
         where v.dupstr = v1.dupstr
           and v.id != v1.id )
and
( --keeper has longest name
  length(v.companyname) =
  ( select max(length(v2.companyname)) from vendor v2
    where v.dupstr = v2.dupstr
  )
  or
  --keeper has latest record (assuming ID is sequential)
  v.id =
  ( select max(v3.id) from vendor v3
    where v.dupstr = v3.dupstr
  )
)
;
The above SQL can be refined to add a "dupe" status to the other records, or you can do a separate update.
Clean Up Stragglers
Report any remaining partial matches to be reviewed by a human (i.e. dupe records without a keeper record)
You can use an SQL query with SOUNDEX or DIFFERENCE.
For example:
SELECT DIFFERENCE ('AAA Test','AAA Test Company')
DIFFERENCE returns 0 - 4 (4 = almost the same, 0 = totally different).
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017
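Applied across the whole table, DIFFERENCE could drive a self-join to surface candidate duplicates; a sketch in SQL Server syntax (the table name Companies and its columns are placeholders):
SELECT t1.ID, t1.CompanyName, t2.ID, t2.CompanyName
FROM Companies t1
JOIN Companies t2
  ON t1.ID < t2.ID                                      -- avoid self-pairs and mirrored pairs
 AND DIFFERENCE(t1.CompanyName, t2.CompanyName) >= 3    -- 3 or 4 = phonetically close
ORDER BY t1.CompanyName;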

JOINS and absentee values

INSERT INTO Shipments (Column1...Column200)
SELECT
O.Value1,...
CL.Value199,
isnull(P.PriceFactor1,1)
FROM Orders O
JOIN Clients CL on O.ClientNo = CL.ClientNo
JOIN Calc C on CL.CalcCode = C.CalcCode
JOIN Prices P on CL.PriceKey = P.PriceKey
WHERE O.PriceFactor1 = P.PriceFactor1
AND O.PriceFactor2 = P.PriceFactor2
AND O.PriceFactor3 = P.PriceFactor3
The above query (part of a new stored procedure meant to replace an old and nasty one that used a cursor...) fails to return some rows, because the rows in Orders do not have matching rows in Prices. In such cases, we want the last value in the INSERT list to be 1 by default. Instead, the row is never built; or, when we tried to fix it by changing the WHERE conditions, it brought PriceFactor1 from a different row, which is also no good.
How it's supposed to work:
A row is created in table Orders. A third-party program then executes a stored procedure (Asp_BuildShipments) and displays the results once they have been inserted into table Shipments. This SP is meant to populate the table Shipments by pulling values from Orders, Clients, Drivers, Vehicles, Prices, Routes, and others. It's a long SP, and the array of tables is big and varied.
In table Orders:
PriceFactor1 | PriceFactor2 | PriceFactor3
12 | 10 | 8
In table Prices:
PriceFactor1 | PriceFactor2 | PriceFactor3
18 | 12 | 10
In a case such as this, the SP needs to recognize that no such rows exist in Prices and use a default value of 1 rather than skipping the row or pulling the price from a different row.
We've tried isnull(), CASE statements, and WHERE EXISTS, but to no avail.
The new SP is set-based, and we want to leave it that way - the old one took minutes, the new one takes only a few seconds. But without going row by row, we aren't sure how to check each individual Order to see if it has a matching Price before building the row in Shipments.
I know there are details missing here, but I didn't want to write a 1,000-page question. If these details are insufficient, I'll post as much as I need to help get your brains storming. I've been stuck on this for a while now...
Thanks in advance :)
Could this be as simple as:
SELECT
O.Value1,...
CL.Value199,
isnull(P.PriceFactor1,1)
FROM Orders O
JOIN Clients CL on O.ClientNo = CL.ClientNo
JOIN Calc C on CL.CalcCode = C.CalcCode
LEFT JOIN Prices P on CL.PriceKey = P.PriceKey
AND O.PriceFactor1 = P.PriceFactor1
AND O.PriceFactor2 = P.PriceFactor2
AND O.PriceFactor3 = P.PriceFactor3
Richard Hansell's answer should work if the problem is, as you suggest, that there are no matching rows in Prices. Since it's not working, the problem is in the other joins.
One of these joins is filtering out data:
JOIN Clients CL on O.ClientNo = CL.ClientNo
JOIN Calc C on CL.CalcCode = C.CalcCode
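One way to confirm which join is dropping rows, still set-based, is to temporarily make every join a LEFT JOIN and look for NULLs on the right-hand side; a sketch reusing the tables and aliases from the question:
SELECT O.ClientNo, O.PriceFactor1, O.PriceFactor2, O.PriceFactor3,
       CL.ClientNo AS ClientMatch, C.CalcCode AS CalcMatch, P.PriceKey AS PriceMatch
FROM Orders O
LEFT JOIN Clients CL ON O.ClientNo = CL.ClientNo
LEFT JOIN Calc C     ON CL.CalcCode = C.CalcCode
LEFT JOIN Prices P   ON CL.PriceKey = P.PriceKey
                    AND O.PriceFactor1 = P.PriceFactor1
                    AND O.PriceFactor2 = P.PriceFactor2
                    AND O.PriceFactor3 = P.PriceFactor3
WHERE CL.ClientNo IS NULL
   OR C.CalcCode IS NULL
   OR P.PriceKey IS NULL   -- whichever column is NULL shows which join failed for that order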