SQL Determining differences in near-identical rows

I have a table of correct data that I need to check my actual table against, to make sure the actual data is correct. The check table has rows like the following:
Data_Check_Table
FRUIT  PRICE WEEKS_FRESH SUPPLIER
------ ----- ----------- ---------
Apple  $1    1           Big Co.
Banana $1    1           Super Co.
and the actual table with this info:
Data_Table
FRUIT  PRICE WEEKS_FRESH SUPPLIER
------ ----- ----------- ---------
Apple  $2    1           Big Co.
Banana $1    1           Super Co.
...and assume there are many other rows; some match up fine and others have inconsistencies in certain columns (maybe the wrong price, the wrong supplier, or even both). How would I write a SELECT to find the rows in the actual table that are inconsistent with the correct data?

SELECT dt.Fruit, dt.Price, dt.Weeks_Fresh, dtc.Fruit, dtc.Price, dtc.Weeks_Fresh, ...
FROM Data_Table dt
FULL OUTER JOIN Data_Check_Table dtc
ON dt.Fruit = dtc.Fruit
AND dt.Price = dtc.Price
.....
WHERE dt.Fruit IS NULL OR dtc.Fruit IS NULL
The full join includes records from each table regardless of whether there is a match, so if either side is null then you know there is a mismatch.
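As a rough illustration of the idea, here is a minimal sketch that runs the comparison on the two sample rows from the question. It uses Python's built-in sqlite3 module purely as a stand-in database, which is an assumption and not what the question uses. Because older SQLite versions lack FULL OUTER JOIN, the sketch emulates it with two LEFT JOINs glued together by UNION ALL; the `side` column is a hypothetical label added for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Data_Check_Table (Fruit TEXT, Price TEXT, Weeks_Fresh INTEGER, Supplier TEXT);
    CREATE TABLE Data_Table       (Fruit TEXT, Price TEXT, Weeks_Fresh INTEGER, Supplier TEXT);
    INSERT INTO Data_Check_Table VALUES ('Apple', '$1', 1, 'Big Co.'), ('Banana', '$1', 1, 'Super Co.');
    INSERT INTO Data_Table       VALUES ('Apple', '$2', 1, 'Big Co.'), ('Banana', '$1', 1, 'Super Co.');
""")

# First half: actual rows with no exact match among the correct rows.
# Second half: correct rows with no exact match among the actual rows.
mismatches = conn.execute("""
    SELECT 'actual' AS side, dt.Fruit, dt.Price, dt.Weeks_Fresh, dt.Supplier
    FROM Data_Table dt
    LEFT JOIN Data_Check_Table dtc
           ON dt.Fruit = dtc.Fruit AND dt.Price = dtc.Price
          AND dt.Weeks_Fresh = dtc.Weeks_Fresh AND dt.Supplier = dtc.Supplier
    WHERE dtc.Fruit IS NULL
    UNION ALL
    SELECT 'check', dtc.Fruit, dtc.Price, dtc.Weeks_Fresh, dtc.Supplier
    FROM Data_Check_Table dtc
    LEFT JOIN Data_Table dt
           ON dt.Fruit = dtc.Fruit AND dt.Price = dtc.Price
          AND dt.Weeks_Fresh = dtc.Weeks_Fresh AND dt.Supplier = dtc.Supplier
    WHERE dt.Fruit IS NULL
    ORDER BY 1, 2
""").fetchall()

for row in mismatches:
    print(row)
```

Only the Apple rows come back, once from each side, because the Banana rows match on every column.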

Use the following to find actual records that do not match the correct records. MINUS is Oracle syntax; SQL Server and PostgreSQL spell the same operator EXCEPT:
select *
from Data_Table
minus
select *
from Data_Check_Table
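A minimal sketch of the set-difference approach, using the sample rows from the question. SQLite (through Python's sqlite3 module) stands in for the real database here, which is an assumption; SQLite spells the operator EXCEPT rather than MINUS. Running the difference in both directions shows the wrong row and the row it should have been.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Data_Check_Table (Fruit TEXT, Price TEXT, Weeks_Fresh INTEGER, Supplier TEXT);
    CREATE TABLE Data_Table       (Fruit TEXT, Price TEXT, Weeks_Fresh INTEGER, Supplier TEXT);
    INSERT INTO Data_Check_Table VALUES ('Apple', '$1', 1, 'Big Co.'), ('Banana', '$1', 1, 'Super Co.');
    INSERT INTO Data_Table       VALUES ('Apple', '$2', 1, 'Big Co.'), ('Banana', '$1', 1, 'Super Co.');
""")

# Actual rows that do not appear in the check table.
wrong_rows = conn.execute(
    "SELECT * FROM Data_Table EXCEPT SELECT * FROM Data_Check_Table"
).fetchall()

# Correct rows that are missing from the actual table.
missing_rows = conn.execute(
    "SELECT * FROM Data_Check_Table EXCEPT SELECT * FROM Data_Table"
).fetchall()

print("wrong:  ", wrong_rows)
print("missing:", missing_rows)
```

Note that the set difference compares whole rows, so both tables need the same columns in the same order.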

Related

Getting Duplicates in Person ID and ASSIGNMENT_ID

This is the query I'm using:
select DISTINCT "HRG_GOAL_ACCESS"."PERSON_ID" as "PERSON_ID",
"HRG_GOAL_ACCESS"."BUSINESS_GROUP_ID" as "BUSINESS_GROUP_ID",
"HRG_GOALS"."GOAL_ID" as "GOAL_ID",
"HRG_GOALS"."ASSIGNMENT_ID" as "ASSIGNMENT_ID",
"HRG_GOALS"."GOAL_NAME" as "GOAL_NAME",
"HRG_MASS_REQ_RESULTS"."ORGANIZATION_ID" as "ORGANIZATION_ID",
"HRG_MASS_REQ_RESULTS"."RESULT_CODE" as "RESULT_CODE",
"HRG_GOAL_PLN_ASSIGNMENTS"."CREATED_BY" as "CREATED_BY"
from "FUSION"."HRG_GOAL_PLN_ASSIGNMENTS" "HRG_GOAL_PLN_ASSIGNMENTS",
"FUSION"."HRG_MASS_REQ_RESULTS" "HRG_MASS_REQ_RESULTS",
"FUSION"."HRG_GOALS" "HRG_GOALS",
"FUSION"."HRG_GOAL_ACCESS" "HRG_GOAL_ACCESS"
where "HRG_GOAL_ACCESS"."PERSON_ID"="HRG_GOALS"."PERSON_ID"
and "HRG_MASS_REQ_RESULTS"."PERSON_ID"="HRG_GOALS"."PERSON_ID"
and "HRG_GOAL_PLN_ASSIGNMENTS"."PERSON_ID"="HRG_MASS_REQ_RESULTS"."PERSON_ID"
Output
PERSON_ID BUSINESS_GROUP_ID GOAL_ID ASSIGNMENT_ID GOAL_NAME RESULT_CODE CREATED_BY
---------------- ----------------- --------------- --------------- ------------------ -------------------- -------------------
300000048030404 1 300000137711224 300000048033078 NANO_CLASS SUCCESS anonymous G_1
300000048030404 1 300000137637946 300000048033078 INCREASE SALES BY 40% SUCCESS REDDI.SAREDDY G_1
300000048030404 1 300000137637946 300000048033078 INCREASE SALES BY 40% SUCCESS CURTIS.FEITTY

Your output does not contain duplicates. You have more than one row for PERSON_ID (300000048030404) but that's because the master table (? HRG_GOAL_ACCESS ?) has multiple rows in its child tables.
Each row has different details, so the set is valid. There are different values of HRG_GOALS.GOAL_ID, HRG_GOALS.GOAL_NAME and HRG_GOAL_PLN_ASSIGNMENTS.CREATED_BY.
If this response does not make you happy, you need to explain more clearly what your desired output would look like. Alternatively, you need to figure out your data model and understand why your query returns the data it does. You probably have a missing join condition; the use of DISTINCT could be hindering you in finding that out.

SSRS query and WHERE with multiple

I'm new to SQL and SSRS. I can do many things already, but I think I must be missing some basics, and so I keep banging my head against the wall.
I have a report that is almost working, but it needs to include more results, based on conditions.
My working query so far is like this:
SELECT projects.project_number, project_phases.project_phase_id, project_phases.project_phase_number, project_phases.project_phase_header, project_phase_expensegroups.projectphase_expense_total, invoicerows.invoicerow_total
FROM projects INNER JOIN
project_phases ON projects.project_id = project_phases.project_id
LEFT OUTER JOIN
project_phase_expensegroups ON project_phases.project_phase_id = project_phase_expensegroups.project_phase_id
LEFT OUTER JOIN
invoicerows ON project_phases.project_phase_id = invoicerows.project_phase_id
WHERE ( projects.project_number = @iProjectNumber )
AND
( project_phase_expensegroups.projectphase_expense_total >0 )
The parameter comes from a selection list that is used to choose a project for the report.
How can I also include records where
( project_phase_expensegroups.projectphase_expense_total ) is 0 but there might be invoices for that project phase?
I already tried adding another condition like this:
WHERE ( projects.project_number = @iProjectNumber )
AND
( project_phase_expensegroups.projectphase_expense_total > 0 )
OR
( invoicerows.invoicerow_total > 0 )
but while it gives some results, including the rows where projectphase_expense_total is 0, the report is a total mess.
So my question is: what am I doing wrong here?

There is a core problem with your query in that you are left joining to two tables, implying that rows may not exist, but then putting conditions on those tables, which will eliminate NULLs. That means your query is internally inconsistent as is.
The next problem is that you're joining two tables to project_phases that may both have multiple rows per phase. Since these data are not related to each other (as shown by the fact that you have no join condition between project_phase_expensegroups and invoicerows), your query is not going to work correctly. For example, given a list of people, a list of those people's favorite foods, and a list of their favorite colors like so:
People
Person
------
Joe
Mary
FavoriteFoods
Person Food
------ ---------
Joe Broccoli
Joe Bananas
Mary Chocolate
Mary Cake
FavoriteColors
Person Color
------ ----------
Joe Red
Joe Blue
Mary Periwinkle
Mary Fuchsia
When you join these with links between Person <-> Food and Person <-> Color, you'll get a result like this:
Person Food Color
------ --------- ----------
Joe Broccoli Red
Joe Bananas Red
Joe Broccoli Blue
Joe Bananas Blue
Mary Chocolate Periwinkle
Mary Chocolate Fuchsia
Mary Cake Periwinkle
Mary Cake Fuchsia
This is essentially a cross-join, also known as a Cartesian product, between the Foods and the Colors, because they have a many-to-one relationship with each person, but no relationship with each other.
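The fan-out is easy to reproduce. The sketch below loads the People, FavoriteFoods and FavoriteColors rows from the example into an in-memory SQLite database (SQLite and Python's sqlite3 module are stand-ins chosen for illustration, not something the question uses) and runs the two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People         (Person TEXT);
    CREATE TABLE FavoriteFoods  (Person TEXT, Food TEXT);
    CREATE TABLE FavoriteColors (Person TEXT, Color TEXT);
    INSERT INTO People VALUES ('Joe'), ('Mary');
    INSERT INTO FavoriteFoods VALUES
        ('Joe', 'Broccoli'), ('Joe', 'Bananas'), ('Mary', 'Chocolate'), ('Mary', 'Cake');
    INSERT INTO FavoriteColors VALUES
        ('Joe', 'Red'), ('Joe', 'Blue'), ('Mary', 'Periwinkle'), ('Mary', 'Fuchsia');
""")

# Both child tables join to People only, never to each other,
# so every food of a person is paired with every color of that person.
rows = conn.execute("""
    SELECT p.Person, f.Food, c.Color
    FROM People p
    JOIN FavoriteFoods  f ON f.Person = p.Person
    JOIN FavoriteColors c ON c.Person = p.Person
    ORDER BY p.Person, c.Color, f.Food
""").fetchall()

joe_rows = [r for r in rows if r[0] == "Joe"]
print(len(rows), "rows in total,", len(joe_rows), "of them for Joe")
```

Two foods times two colors gives four rows per person, eight in total, even though each person only has four facts stored about them.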
There are a few ways to deal with this in the report.
Create ExpenseGroup and InvoiceRow subreports, that are called from the main report by a combination of project_id and project_phase_id parameters.
Summarize one or the other set of data into a single value. For example, you could sum the invoice rows. Or, you could concatenate the expense groups into a single string separated by commas.
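The second option, summarizing each child table before joining it, might look roughly like the sketch below. The table and column names follow the question, but the schema is cut down to the relevant columns and the sample amounts are invented for illustration. SQLite via Python's sqlite3 module is used as a stand-in database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project_phases (project_phase_id INTEGER PRIMARY KEY);
    CREATE TABLE project_phase_expensegroups (project_phase_id INTEGER, projectphase_expense_total INTEGER);
    CREATE TABLE invoicerows (project_phase_id INTEGER, invoicerow_total INTEGER);
    INSERT INTO project_phases VALUES (1), (2);
    -- Hypothetical data: phase 1 has 2 expense groups and 3 invoice rows, phase 2 has one invoice only.
    INSERT INTO project_phase_expensegroups VALUES (1, 100), (1, 50);
    INSERT INTO invoicerows VALUES (1, 30), (1, 20), (1, 10), (2, 40);
""")

# Naive version: joining both child tables directly multiplies the rows (2 x 3 for phase 1).
naive = conn.execute("""
    SELECT SUM(e.projectphase_expense_total), SUM(i.invoicerow_total)
    FROM project_phases ph
    JOIN project_phase_expensegroups e ON e.project_phase_id = ph.project_phase_id
    JOIN invoicerows i                 ON i.project_phase_id = ph.project_phase_id
    WHERE ph.project_phase_id = 1
""").fetchone()

# Pre-aggregated version: each child table is reduced to one row per phase before the join.
summary = conn.execute("""
    SELECT ph.project_phase_id,
           COALESCE(e.expense_total, 0) AS expense_total,
           COALESCE(i.invoice_total, 0) AS invoice_total
    FROM project_phases ph
    LEFT JOIN (SELECT project_phase_id, SUM(projectphase_expense_total) AS expense_total
               FROM project_phase_expensegroups GROUP BY project_phase_id) e
           ON e.project_phase_id = ph.project_phase_id
    LEFT JOIN (SELECT project_phase_id, SUM(invoicerow_total) AS invoice_total
               FROM invoicerows GROUP BY project_phase_id) i
           ON i.project_phase_id = ph.project_phase_id
    ORDER BY ph.project_phase_id
""").fetchall()

print("naive sums for phase 1:", naive)
print("pre-aggregated:", summary)
```

With this data the naive sums for phase 1 are inflated to 450 and 120, while the pre-aggregated query reports 150 and 60, and still lists phase 2 with an expense total of 0.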
Some notes:
Please, please format your query before posting it in a question. It is almost impossible to read when not formatted. It seems pretty clear that you're using a GUI to create the query, but do us the favor of not having to format it ourselves just to help you.
While formatting, please use aliases. Don't use full table names; they just make the query that much harder to understand.

You need an extra pair of parentheses in your WHERE clause in order to get the logic right, because AND binds more tightly than OR.
WHERE ( projects.project_number = @iProjectNumber )
AND (
(project_phase_expensegroups.projectphase_expense_total > 0)
OR
(invoicerows.invoicerow_total > 0)
)
Also, you're using a column in your WHERE clause from a table that is left joined without checking for NULLs. That basically makes it a (slow) inner join. If you want to include rows that have no match in that table, you also need to check for NULL. Comparing NULL with operators such as > or = never evaluates to true (the result is unknown), so the WHERE clause discards those rows; only IS NULL and IS NOT NULL test for NULL directly. See this page for more information about SQL's three-valued predicate logic: http://www.firstsql.com/idefend3.htm
To keep your LEFT JOINs working as you intended you would need to do this:
WHERE ( projects.project_number = @iProjectNumber )
AND (
project_phase_expensegroups.projectphase_expense_total > 0
OR project_phase_expensegroups.project_phase_id IS NULL
OR invoicerows.invoicerow_total > 0
OR invoicerows.project_phase_id IS NULL
)
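To see the effect of the NULL check in isolation, here is a small sketch with just the phases and expense groups tables. The names follow the question, the three sample phases are made up, and SQLite via Python's sqlite3 module is only a stand-in for the real server. Phase 3 has no expense group at all, so its joined columns are NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project_phases (project_phase_id INTEGER PRIMARY KEY);
    CREATE TABLE project_phase_expensegroups (project_phase_id INTEGER, projectphase_expense_total INTEGER);
    INSERT INTO project_phases VALUES (1), (2), (3);
    -- Hypothetical data: phase 2 has a zero total, phase 3 has no expense row.
    INSERT INTO project_phase_expensegroups VALUES (1, 100), (2, 0);
""")

def phases_where(condition):
    """Return the phase ids that survive the given WHERE condition after the LEFT JOIN."""
    sql = f"""
        SELECT ph.project_phase_id
        FROM project_phases ph
        LEFT JOIN project_phase_expensegroups e
               ON e.project_phase_id = ph.project_phase_id
        WHERE {condition}
        ORDER BY ph.project_phase_id
    """
    return [row[0] for row in conn.execute(sql)]

# NULL > 0 is neither true nor false; SQLite reports it as NULL (None in Python).
null_comparison = conn.execute("SELECT NULL > 0").fetchone()[0]

without_null_check = phases_where("e.projectphase_expense_total > 0")
with_null_check = phases_where(
    "e.projectphase_expense_total > 0 OR e.project_phase_id IS NULL"
)

print(null_comparison, without_null_check, with_null_check)
```

Without the IS NULL test only phase 1 survives, exactly as an inner join would behave; with it, the unmatched phase 3 is kept as well.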

I found the solution, and it was kind of easy after all. I changed only the second LEFT OUTER JOIN to an INNER JOIN and dropped the condition that restricted the results to values over zero. I also used SELECT DISTINCT.
Now my report is working perfectly.

How to specify row names in MS Access 2007

I have a cross tab query and it pulls only the row name if there is data associated with it in the database. For example, if I have three types of musical instruments:
Guitar
Piano
Drums
Other
My results will show up as:
Guitar 1
Drums 2
It doesn't list Piano because there is no ID associated with Piano in the DB. I know I can specify columns in the properties menu, i.e. "1, 2, 3, 4, 5" will put columns in the DB for each, regardless of whether or not there is data to populate them.
I am looking for a similar solution for rows. Any ideas?
Also, I need NULL values to show up as 0.
Here's the actual SQL (forget the instrument example above)
TRANSFORM Count(Research.Patient_ID) AS CountOfPatient_ID
SELECT
Switch(
[Age]<22,"21 and under",
[Age]>=22 And [AGE]<=24,"Between 22 And 24",
[Age]>=25 And [AGE]<=29,"Between 25 And 29",
[Age]>=30 And [AGE]<=34,"30-34",
[Age]>=35 And [AGE]<=39,"35-39",
[Age]>=40 And [AGE]<=44,"40-44",
[Age]>44,"Over 44"
) AS Age_Range
FROM (Research
INNER JOIN (
SELECT ID, DateDiff("yyyy",DOB,Date()) AS AGE FROM Demographics
) AS Demographics ON Research.Patient_ID=Demographics.ID)
INNER JOIN [Letter Status] ON Research.Patient_ID=[Letter Status].Patient_ID
WHERE ((([Letter Status].Letter_Count)=1))
GROUP BY Demographics.AGE, [Letter Status].Letter_Count
PIVOT Research.Site In (1,2,3,4,5,6,7,8,9,10);
In short, I need all of the rows to show up regardless of whether or not there is a value (for some reason the LEFT JOIN isn't working, so if you can, please use my code to form your answer), and I also need to replace NULL values with 0.
Thanks

I believe this has to do with the way you are joining the instruments table to the IDs table. If you use a left outer join from instruments to IDs, Piano should be included. It would be helpful to see your actual tables and queries though, as your question is kind of vague.
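A minimal sketch of that left outer join on the instrument example, assuming two hypothetical tables, Instruments (the full list of row names) and Players (the data being counted). It runs on SQLite through Python's sqlite3 module rather than Access, so the Access syntax will differ. COUNT over a column from the right-hand table skips NULLs, which is what turns the missing rows into 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: the lookup list of instruments and the rows that reference it.
    CREATE TABLE Instruments (name TEXT PRIMARY KEY);
    CREATE TABLE Players     (id INTEGER PRIMARY KEY, instrument TEXT);
    INSERT INTO Instruments VALUES ('Guitar'), ('Piano'), ('Drums'), ('Other');
    INSERT INTO Players (instrument) VALUES ('Guitar'), ('Drums'), ('Drums');
""")

# Start from the complete list so every instrument yields a row;
# COUNT(p.id) ignores the NULLs produced for instruments without players.
counts = conn.execute("""
    SELECT i.name, COUNT(p.id) AS players
    FROM Instruments i
    LEFT JOIN Players p ON p.instrument = i.name
    GROUP BY i.name
    ORDER BY i.name
""").fetchall()

for name, players in counts:
    print(name, players)
```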

What if you union the select with a hard-coded select that supplies one value for each age group?
select 1 as Guitar, 1 as Piano, 1 as Drums, 1 as Other
When you do the transform, each row will have a result that is +1 of the result you want.
foo barTmpCount
-------- ------------
Guitar 2
Piano 1
Drums 3
Other 1
You can then do a
select foo, barTmpCount - 1 as barCount from <query>
and get something like this
foo barCount
-------- ---------
Guitar 1
Piano 0
Drums 2
Other 0
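The padding trick can be sketched as follows. Note that for rows (rather than columns) the hard-coded part needs one SELECT per category, combined with UNION ALL. The Players table is hypothetical, and the sketch runs on SQLite via Python's sqlite3 module instead of Access, so treat it as an outline of the idea only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: one guitar player, two drummers, nobody else.
    CREATE TABLE Players (id INTEGER PRIMARY KEY, instrument TEXT);
    INSERT INTO Players (instrument) VALUES ('Guitar'), ('Drums'), ('Drums');
""")

# Append one dummy row per category so each category appears at least once,
# then subtract that dummy row from the count again.
padded_counts = conn.execute("""
    SELECT foo, COUNT(*) - 1 AS barCount
    FROM (
        SELECT instrument AS foo FROM Players
        UNION ALL SELECT 'Guitar'
        UNION ALL SELECT 'Piano'
        UNION ALL SELECT 'Drums'
        UNION ALL SELECT 'Other'
    ) AS padded
    GROUP BY foo
    ORDER BY foo
""").fetchall()

for foo, bar_count in padded_counts:
    print(foo, bar_count)
```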

SQL Alternative to performing an INNER JOIN on a single table

I have a large table (TokenFrequency) which has millions of rows in it. The TokenFrequency table is structured like this:
Table - TokenFrequency
id - int, primary key
source - int, foreign key
token - char
count - int
My goal is to select all of the rows in which two sources have the same token in it. For example if my table looked like this:
id --- source --- token --- count
1 ------ 1 --------- dog ------- 1
2 ------ 2 --------- cat -------- 2
3 ------ 3 --------- cat -------- 2
4 ------ 4 --------- pig -------- 5
5 ------ 5 --------- zoo ------- 1
6 ------ 5 --------- cat -------- 1
7 ------ 5 --------- pig -------- 1
I would want a SQL query to give me source 1, source 2, and the sum of the counts. For example:
source1 --- source2 --- token --- count
---- 2 ----------- 3 --------- cat -------- 4
---- 2 ----------- 5 --------- cat -------- 3
---- 3 ----------- 5 --------- cat -------- 3
---- 4 ----------- 5 --------- pig -------- 6
I have a query that looks like this:
SELECT F.source AS source1, S.source AS source2, F.token,
(F.count + S.count) AS sum
FROM TokenFrequency F
INNER JOIN TokenFrequency S ON F.token = S.token
WHERE F.source <> S.source
This query works fine but the problems that I have with it are that:
I have a TokenFrequency table that has millions of rows and therefore need a faster alternative to obtain this result.
The current query that I have is giving duplicates. For example, it's selecting:
source1=2, source2=3, token=cat, count=4
source1=3, source2=2, token=cat, count=4
Which isn't too much of a problem, but if there is a way to eliminate those and in turn obtain a speed increase, it would be very useful.
The main issue is the speed of the query: with my current query it takes hours to complete. I believe the INNER JOIN of the table to itself is the problem. I'm sure there has to be a way to eliminate the inner join and get similar results using just one instance of the TokenFrequency table. Fixing the second problem I mentioned might also speed up the query.
I need a way to restructure this query to provide the same results in a faster, more efficient manner.
Thanks.

I'd need a little more info to diagnose the speed issue, but to remove the dups, add this to the WHERE:
AND F.source<S.source
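Here is the question's query with that condition applied, run against the seven sample rows. SQLite through Python's sqlite3 module is used as a stand-in database, and the result column is renamed to `total` for the sketch. Since `F.source < S.source` already implies that the sources differ, it can replace the `<>` test outright.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TokenFrequency (id INTEGER PRIMARY KEY, source INTEGER, token TEXT, count INTEGER);
    INSERT INTO TokenFrequency VALUES
        (1, 1, 'dog', 1), (2, 2, 'cat', 2), (3, 3, 'cat', 2), (4, 4, 'pig', 5),
        (5, 5, 'zoo', 1), (6, 5, 'cat', 1), (7, 5, 'pig', 1);
""")

# Ordering the pair (source1 < source2) keeps exactly one of the two mirrored rows.
pairs = conn.execute("""
    SELECT F.source AS source1, S.source AS source2, F.token,
           (F.count + S.count) AS total
    FROM TokenFrequency F
    INNER JOIN TokenFrequency S ON F.token = S.token
    WHERE F.source < S.source
    ORDER BY F.token, F.source, S.source
""").fetchall()

for pair in pairs:
    print(pair)
```

The four rows printed match the desired output listed in the question.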

Try this:
SELECT token, GROUP_CONCAT(source), SUM(count)
FROM TokenFrequency
GROUP BY token;
This should run a lot faster and also eliminate the duplicates. But the sources will be returned in a comma-separated list, so you'll have to explode that in your application.
You might also try creating a compound index over the columns token, source, count (in that order) and analyze with EXPLAIN to see if MySQL is smart enough to use it as a covering index for this query.
update: I seem to have misunderstood your question. You don't want the sum of counts per token, you want the sum of counts for every pair of sources for a given token.
I believe the inner join is the best solution for this. An important guideline for SQL is that if you need to calculate an expression with respect to two different rows, then you need to do a join.
However, one optimization technique that I mentioned above is to use a covering index so that all the columns you need are included in an index data structure. The benefit is that all your lookups are O(log n), and the query doesn't need to do a second I/O to read the physical row to get other columns.
In this case, you should create the covering index over columns token, source, count as I mentioned above. Also try to allocate enough cache space so that the index can be cached in memory.
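As a rough sketch of how one might check that the index is actually picked up, the snippet below creates the compound index and inspects the query plan. It uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module, which is an assumption made for the sake of a runnable example; MySQL's EXPLAIN output looks different, and its optimizer may decide differently. The index name `idx_token_source_count` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TokenFrequency (id INTEGER PRIMARY KEY, source INTEGER, token TEXT, count INTEGER);
    INSERT INTO TokenFrequency VALUES
        (1, 1, 'dog', 1), (2, 2, 'cat', 2), (3, 3, 'cat', 2), (4, 4, 'pig', 5),
        (5, 5, 'zoo', 1), (6, 5, 'cat', 1), (7, 5, 'pig', 1);
    -- Compound index holding every column the self-join reads (hypothetical name).
    CREATE INDEX idx_token_source_count ON TokenFrequency (token, source, count);
""")

PAIR_QUERY = """
    SELECT F.source, S.source, F.token, F.count + S.count
    FROM TokenFrequency F
    INNER JOIN TokenFrequency S ON F.token = S.token
    WHERE F.source < S.source
"""

# Each plan row ends with a human-readable description of one step.
plan_details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + PAIR_QUERY)]

for detail in plan_details:
    print(detail)
```

If the plan steps mention the index by name, the lookups on token are served from the index; on a real table with millions of rows, the same check is worth repeating with the production database's own EXPLAIN.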

If token isn't indexed, it certainly should be.

SQL Query Advice - Most recent item

I have a table where I store customer sales (on periodicals, like newspaper) data. The product is stored by issue. Example
custid prodid issue qty datesold
1 123 2 12 01052008
2 234 1 5 01022008
1 123 1 5 01012008
2 444 2 3 02052008
How can I retrieve the last issue of every product for a specific customer, and what is the fastest way to do it? Can I have samples for both SQL Server 2000 and 2005? Please note that the table has over 500k rows.
Thanks

Assuming that "latest" is determined by date (rather than by issue number), this method is usually pretty fast, assuming decent indexes:
SELECT
T1.prodid,
T1.issue
FROM
Sales T1
LEFT OUTER JOIN dbo.Sales T2 ON
T2.custid = T1.custid AND
T2.prodid = T1.prodid AND
T2.datesold > T1.datesold
WHERE
T1.custid = @custid AND
T2.custid IS NULL
Handling 500k rows is something that a laptop can probably handle without trouble, let alone a real server, so I'd stay clear of denormalizing your database for "performance". Don't add extra maintenance, inaccuracy, and most of all headaches by tracking a "last sold" somewhere else.
EDIT: I forgot to mention... this doesn't specifically handle cases where two issues have the same exact datesold. You might need to tweak it based on your business rules for that situation.
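A small runnable sketch of this self-join, using the four sample rows from the question. Two assumptions are baked in: SQLite via Python's sqlite3 module stands in for SQL Server, and the datesold values are rewritten as ISO dates on the guess that the original strings are in DDMMYYYY form, so that plain comparison orders them correctly. `latest_issues` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sales (custid INTEGER, prodid INTEGER, issue INTEGER, qty INTEGER, datesold TEXT);
    -- Sample rows from the question; dates converted to ISO format (assumed DDMMYYYY originally).
    INSERT INTO Sales VALUES
        (1, 123, 2, 12, '2008-05-01'),
        (2, 234, 1,  5, '2008-02-01'),
        (1, 123, 1,  5, '2008-01-01'),
        (2, 444, 2,  3, '2008-05-02');
""")

def latest_issues(conn, custid):
    """Return (prodid, issue) of the most recently sold issue per product for one customer."""
    return conn.execute("""
        SELECT T1.prodid, T1.issue
        FROM Sales T1
        LEFT JOIN Sales T2
               ON T2.custid = T1.custid
              AND T2.prodid = T1.prodid
              AND T2.datesold > T1.datesold   -- a later sale of the same product
        WHERE T1.custid = ?
          AND T2.custid IS NULL               -- keep rows for which no later sale exists
        ORDER BY T1.prodid
    """, (custid,)).fetchall()

print(latest_issues(conn, 1))
print(latest_issues(conn, 2))
```

For customer 1 only issue 2 of product 123 remains, because the row for issue 1 finds a later sale to join to and is filtered out.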

Generic SQL; SQL Server's syntax shouldn't be much different:
SELECT prodid, max(issue) FROM sales WHERE custid = ? GROUP BY prodid;

Is this a new project? If so, I would be wary of setting up your database like this and read up a bit on normalization, so that you might end up with something like this:
CustID LastName FirstName
------ -------- ---------
1 Woman Test
2 Man Test
ProdID ProdName
------ --------
123 NY Times
234 Boston Globe
ProdID IssueID PublishDate
------ ------- -----------
123 1 12/05/2008
123 2 12/06/2008
CustID OrderID OrderDate
------ ------- ---------
1 1 12/04/2008
OrderID ProdID IssueID Quantity
------- ------ ------- --------
1 123 1 5
2 123 2 12
I'd have to know your database better to come up with a better schema, but it sounds like you're building too many things into a flat table, which will cause lots of issues down the road.

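If you're looking for the most recent sale by date, maybe this is what you need:
SELECT prodid, issue
FROM Sales
WHERE custid = @custid
-- The subquery must be parenthesized. It is correlated on prodid only,
-- so it returns the latest sale date per product rather than per issue.
AND datesold = (SELECT MAX(s.datesold)
FROM Sales s
WHERE s.prodid = Sales.prodid
AND s.custid = @custid)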

Querying an existing, growing historical table is way too slow!
I strongly suggest you create a new table, tblCustomerSalesLatest, which stores the last-issue data for each customer, and select from there.
