EXCEPT returning unpredictable results - sql

I have a table DB_Budget which has three columns: Business_Unit_Code, Ledger_Period, and Budget_GBP. For the sake of simplicity, I have left out the other columns.
The relevant data type here is Budget_GBP, which is a float.
This table is present in both my development and production environments.
While doing some quality checks I ran the below query:
select
Business_Unit_Code,
Ledger_Period,
Budget_GBP
from [SomeLinkedServer].[Database].dbo.DB_BUDGET
where business_unit_code = 'AV' and ledger_period = '200808'
and budget_gbp >= 32269
except
select
Business_Unit_Code,
Ledger_Period,
Budget_GBP
from [Database].dbo.DB_BUDGET
where business_unit_code = 'AV' and ledger_period = '200808'
and budget_gbp >= 32269
This returned one row.
If I remove the EXCEPT, each side returns what looks like exactly the same row.
Clearly, the data is the same in both tables! So why does EXCEPT give me one row?
Things get interesting: if I wrap Budget_GBP in LTRIM(RTRIM( ... )), the two sides match!
I did a bit of googling and found that LTRIM(RTRIM( ... )) forces an implicit float-to-varchar conversion, and the default float-to-string conversion keeps at most six significant digits, which rounds the value off to 32269.2. That is probably why the rows then match.
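A quick way to see this conversion in action (a hypothetical literal chosen to mirror the value in question):
-- LTRIM/RTRIM take character input, so SQL Server implicitly converts
-- the float to varchar first, and the default float-to-string
-- conversion keeps at most six significant digits.
select
    cast(32269.1899999999987 as float)               as raw_float,
    ltrim(rtrim(cast(32269.1899999999987 as float))) as trimmed_text  -- '32269.2'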
So, to summarize, my first question is: why does EXCEPT return a row when the records appear to match?
My second question might be simple. As you can see, I am restricted to using budget_gbp >= 32269 in the WHERE clause, because when I filter on the exact value (which I am copying from the SSMS results grid), I get no results. Please let me know what I am doing wrong here.
EDIT - Is there any way the data validation might still work? There are hundreds of tables in the database, and it is next to impossible for me to scavenge for float columns and wrap them in casts. Using EXCEPT is one way of validating the data in the development environment.
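Since the EDIT asks about bulk validation: one way to avoid hand-editing hundreds of tables is to generate the comparison select lists from metadata. A minimal sketch, assuming SQL Server, that only float/real columns need a cast, and an illustrative decimal(18, 2) target precision:
-- For every base table, build a select list in which float/real columns
-- are cast to decimal, ready to paste into an EXCEPT comparison
-- between the two environments.
select t.TABLE_NAME,
       'select ' +
       stuff((select ', ' +
                     case when c.DATA_TYPE in ('float', 'real')
                          then 'cast(' + quotename(c.COLUMN_NAME)
                               + ' as decimal(18, 2)) as ' + quotename(c.COLUMN_NAME)
                          else quotename(c.COLUMN_NAME)
                     end
              from INFORMATION_SCHEMA.COLUMNS c
              where c.TABLE_NAME = t.TABLE_NAME
              order by c.ORDINAL_POSITION
              for xml path('')), 1, 2, '')
       + ' from dbo.' + quotename(t.TABLE_NAME) as select_list
from INFORMATION_SCHEMA.TABLES t
where t.TABLE_TYPE = 'BASE TABLE';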

Cast the values to fixed-point types or to strings and re-run your code. The select list of the first query would look like:
select Business_Unit_Code, Ledger_Period, cast(Budget_GBP as decimal(18, 2)) as Budget_GBP
With floating-point numbers, what you see is not always what you get. SSMS may show you 32269.19, but the stored value might really be 32269.1899999999.
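Applied to both sides, the full check would look something like this (a sketch reusing the question's own query and an illustrative decimal(18, 2) precision):
select Business_Unit_Code, Ledger_Period,
       cast(Budget_GBP as decimal(18, 2)) as Budget_GBP
from [SomeLinkedServer].[Database].dbo.DB_BUDGET
where business_unit_code = 'AV' and ledger_period = '200808'
  and budget_gbp >= 32269
except
select Business_Unit_Code, Ledger_Period,
       cast(Budget_GBP as decimal(18, 2)) as Budget_GBP
from [Database].dbo.DB_BUDGET
where business_unit_code = 'AV' and ledger_period = '200808'
  and budget_gbp >= 32269
If the only difference was float noise below two decimal places, this should now return no rows.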

You could define an accuracy tolerance and check whether the values are within that tolerance of each other.
declare @accuracy float = 0.001

select *
from DB_BUDGET1 t1
full join DB_BUDGET2 t2
  on  t1.Business_Unit_Code = t2.Business_Unit_Code
  and t1.Ledger_Period = t2.Ledger_Period
  and t1.Budget_GBP between t2.Budget_GBP - @accuracy and t2.Budget_GBP + @accuracy
where t1.id is null
   or t2.id is null
http://www.sqlfiddle.com/#!3/41da0/2/0
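Applied to the question's tables, the same pattern would look like this (a sketch; the null checks use Business_Unit_Code because the question does not show a key column):
declare @accuracy float = 0.001

select *
from [SomeLinkedServer].[Database].dbo.DB_BUDGET t1
full join [Database].dbo.DB_BUDGET t2
  on  t1.Business_Unit_Code = t2.Business_Unit_Code
  and t1.Ledger_Period = t2.Ledger_Period
  and t1.Budget_GBP between t2.Budget_GBP - @accuracy and t2.Budget_GBP + @accuracy
where t1.Business_Unit_Code is null
   or t2.Business_Unit_Code is null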

Related

How can I compare two values in SQL to a given decimal precision when they are of different data types?

I'm trying to show that there are no significant differences between two tables using an EXCEPT query. I'm only including the fields that I care about comparing. The problem is that tons of differences are being picked up in the result set due to one of the tables having data in a float format and the other having Decimal(24,7). I want my EXCEPT query to only include differences that are greater than 1.
I tried casting the float to Decimal(24,7) as well as casting both to Decimal(24,2) but due to rounding there are still differences flagged. For example, one table might show 2.55 and the other 2.5499999. That gets flagged as a difference. If I truncate the values, I still get differences (2.55 vs 2.54). If I round them or cast as Decimal(24,2) this particular instance is fixed, but others show up (e.g. rounding 2.355 vs 2.35499999 causes 2.36 vs 2.35).
How can I cast or round the decimal values such that any differences less than 1 are not returned by my EXCEPT query?
Sample code:
SELECT name, weight FROM Table1
EXCEPT
SELECT name, weight FROM Table2
/* That returns thousands of differences. If I cast both weights as Decimal(24,2) I get far fewer differences, but I want to only show differences greater than 1. */
This is probably not appropriate for EXCEPT. But you can try using a smaller number of decimal places:
SELECT name, CAST(weight as DECIMAL(24, 3)) FROM Table1
EXCEPT
SELECT name, CAST(weight as DECIMAL(24, 3)) FROM Table2;
Alternatively, you can use NOT EXISTS:
SELECT name, weight
FROM Table1 t1
WHERE NOT EXISTS (SELECT 1
                  FROM Table2 t2
                  WHERE t2.name = t1.name AND
                        ABS(t2.weight - t1.weight) < 0.00001
                 );
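Note that the stated requirement was to ignore any difference smaller than 1, so the tolerance there would be 1 rather than 0.00001; a hedged variant of the same query:
SELECT name, weight
FROM Table1 t1
WHERE NOT EXISTS (SELECT 1
                  FROM Table2 t2
                  WHERE t2.name = t1.name AND
                        ABS(t2.weight - t1.weight) < 1.0  -- ignore sub-unit rounding noise
                 );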

Order by in subquery behaving differently than native sql query?

So I am honestly a little puzzled by this!
I have a query that returns a set of transactions containing both repair costs and the odometer reading at the time of repair, at the master level. To get an accurate cost-per-mile figure, I need a subquery to get both the first meter reading between a start date and an end date, and an ending meter reading.
(select top 1 wf2.ro_num
 from wotrans wotr2
 left join wofile wf2
   on wotr2.rop_ro_num = wf2.ro_num
   and wotr2.rop_fac = wf2.ro_fac
 where wotr.rop_veh_num = wotr2.rop_veh_num
   and wotr.rop_veh_facility = wotr2.rop_veh_facility
   AND ((@sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
        OR ([dbo].[udf_RTA_ConvertDateInt](@sdate) <= wotr2.rop_tran_date
            AND [dbo].[udf_RTA_ConvertDateInt](@edate) >= wotr2.rop_tran_date))
 order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is that those tables are also used in the main query, and I don't want the two levels to interact with each other except to pull the correct vehicle number and facility.
Basically, when I run the main query it returns a value that is not correct: it returns the second row (keep in mind that the first and second rows have the same date). But when I take the subquery, copy and paste it into its own query, and run it, it returns the correct value.
I do have a workaround for this, but I am just curious as to why this is happening. I have searched quite a bit and found not much (other than the fact that people don't like ORDER BY in subqueries). Talking it over with a friend who also does quite a bit of SQL scripting, it looks to us as if the subquery orders differently inside the main query than it does on its own when multiple rows share the same value in the ORDER BY column (e.g. ten dates of 08/05/2016).
Any ideas would be helpful!
Like I said, I have a workaround that works in this one case, but I don't know yet whether it will hold up on a larger dataset.
Let me know if you want more code.
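For reference, the behaviour described is expected: ORDER BY gives no guarantee about the relative order of rows that tie on rop_tran_date, so TOP 1 may legally return either row, and the main query and the stand-alone query are free to disagree. Appending a unique tie-breaker makes the pick deterministic; a sketch against the subquery above, assuming ro_num uniquely identifies a repair order:
-- Same subquery tail, with a tie-breaker appended so equal dates
-- no longer produce an arbitrary winner:
 order by wotr2.rop_tran_date asc, wf2.ro_num asc) as highMeter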

speed of SQL queries with variables

I am new to SQL and am querying a large database, so speed is an issue. I have been using a query of the form shown below (with line [1]), which has been working fine, but when I modify it to swap line [1] for line [2], using a constant for the cut rather than a value derived within the query itself, the query is significantly slower (running time of about 1 second with [1] versus a few minutes with [2]). I would actually have expected it to be quicker.
Can someone explain why this is happening or suggest how I might rewrite this query better?
Thanks
Query
with local_sample as
( SELECT b.mass, ...various other columns selected...
FROM table1 TAB, table2 b
WHERE ...a few clauses... )
SELECT min(prog.num), LTAB.mass, ...various other columns...
from local_sample LTAB, table2 prog
WHERE ...a few clauses...
[1] and prog.mass > LTAB.mass/2.0
[2] and prog.mass > 31.62
group by ...columns...
Information in the question is kind of scarce, so at a guess, it's an implicit conversion issue. My hunch is LTAB.mass is the same data type as prog.mass, so no conversion is necessary, but whatever that data type is doesn't play nicely with decimal.
Numbers in SQL come in many flavors, and most of the time we don't have to think about it because the conversions are very fast and happen in the background. Occasionally, though, you'll come across number types that don't play well with others (float, for instance), and that can become a performance pain point.
So here is a way to test whether that is the issue; run the query below (assuming Microsoft SQL Server is your RDBMS):
select SQL_VARIANT_PROPERTY(Mass,'BaseType') AS 'Base Type'
From table2
If my hunch is correct, it will return float as the base type. If that's the case and the implicit conversion is the issue, then the following should perform similarly to query [1]:
Declare @Mass float = 31.62;

with local_sample as
( SELECT b.mass, ...various other columns selected...
FROM table1 TAB, table2 b
WHERE ...a few clauses... )
SELECT min(prog.num), LTAB.mass, ...various other columns...
from local_sample LTAB, table2 prog
WHERE ...a few clauses...
and prog.mass > @Mass
group by ...columns...
Anyway let me know if that doesn't work for you.
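A lighter-weight variant of the same idea (still assuming mass turns out to be float) is to cast the literal in place, which avoids the implicit conversion without declaring a variable:
-- Same effect as the variable: the comparison now stays in float
and prog.mass > cast(31.62 as float)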

Multiple Joins + Lots of Data Optimization

I am working on a massive join at work and have very limited resources: I cannot add indexes or the like, and the environment restricts what I can do in the query itself (I can only select data; no variables or table creation allowed). I have read somewhere that a subquery will automatically index its result; is this true? Also, each of my three major join tables has ~140K rows, and I have to join two extra tables to ensure the filtering is correct. The query is listed below; I currently have the filter criteria in the JOIN clauses. Another question: if I move my criteria to a WHERE clause, either inside or outside the subquery, will that help?
SELECT *
FROM (SELECT NULL AS A1,
DFS_ROHEADER.TECHID,
DFS_ROHEADER.RONUMBER,
DFS_ROHEADER.CUSTOMERNUMBER,
DFS_CUSTOMER.BNAME,
DFS_ROHEADER.UNITNUMBER,
DFS_ROHEADER.MILEAGE,
DFS_ROHEADER.OPENEDDATE,
DFS_ROHEADER.CLOSEDDATE,
DFS_ROHEADER.STATUS,
DFS_ROHEADER.PONUMBER,
DFS_TECH.REGION,
DFS_TECH.RSM,
DFS_ROPART.PARTID,
CONVERT(NVARCHAR(max), DFS_RODETAIL.STORY) AS STORY
FROM DFS_ROHEADER
LEFT JOIN DFS_CUSTOMER
ON DFS_ROHEADER.CUSTOMERNUMBER = DFS_CUSTOMER.CUST_NO
LEFT JOIN DFS_TECH
ON DFS_ROHEADER.TECHID = DFS_TECH.TECHID
INNER JOIN DFS_RODETAIL
ON DFS_ROHEADER.RONUMBER = DFS_RODETAIL.RONUMBER
INNER JOIN DFS_ROPART
ON DFS_RODETAIL.RONUMBER = DFS_ROPART.RONUMBER
AND DFS_RODETAIL.LINENUMBER = DFS_ROPART.LINENUMBER
AND DFS_ROHEADER.RONUMBER LIKE '%$FF_RONumber%'
AND DFS_ROHEADER.UNITNUMBER LIKE '%$FF_UnitNumber%'
AND DFS_ROHEADER.PONUMBER LIKE '%$FF_PONumber%'
AND ( DFS_CUSTOMER.BNAME LIKE '%$FF_Customer%'
OR DFS_CUSTOMER.BNAME IS NULL )
AND DFS_ROHEADER.TECHID LIKE '%$FF_TechID%'
AND DFS_ROHEADER.CLOSEDDATE BETWEEN
'$FF_ClosedBegin' AND '$FF_ClosedEnd'
AND ( DFS_TECH.REGION LIKE '%$FilterRegion%'
OR DFS_TECH.REGION IS NULL )
AND ( DFS_TECH.RSM LIKE '%$FF_RSM%'
OR DFS_TECH.RSM IS NULL )
AND DFS_RODETAIL.STORY LIKE '%$FF_Story%'
AND DFS_ROPART.PARTID LIKE '%$FF_PartID%'
WHERE DFS_ROHEADER.DELETED_BY < 0
AND DFS_RODETAIL.DELETED_BY < 0
AND DFS_ROPART.DELETED_BY < 0) T
ORDER BY T.RONUMBER
This query works; however, it can take forever to run and can time out. I have other queries that also run in this environment, and I will take whatever suggestions you can give me and apply them to those as well. I am using SQL Server 2000. Thanks for the help.
EDIT:
Execution Plan:
https://dl.dropboxusercontent.com/u/99733863/ExecutionPlan.sqlplan
UPDATE:
I have come to the conclusion the environment I'm working in is the cause of the problem. My query works as intended and is not slow at all (1 sec. for 18,000 rows). As stated in the comments I have to fill grids with limited flexibility and I believe that these grids fill by first filling a temporary grid with the SQL statement and then copying row by row into the desired grid. There is a good chance that this is the cause of my issues. Thanks for the help.
My 2 cents here: in general, LIKE is not very well optimized, and in your case you are also using LIKE with '%value%'. With a leading wildcard, the query optimizer cannot seek; it has to scan the entire index. At a minimum I would see if there is a way to avoid that pattern.
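For example, if any of those filters can do without the leading wildcard, the pattern becomes sargable (hypothetical, reusing one of the query's own filters):
-- Non-sargable: the leading % forces a scan of the whole index
-- AND DFS_ROHEADER.RONUMBER LIKE '%$FF_RONumber%'
-- Sargable: a prefix-only pattern lets the optimizer seek the index
AND DFS_ROHEADER.RONUMBER LIKE '$FF_RONumber%'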

Filtering simultaneously on count of related objects and on count of related objects that satisfy a condition in Django

So I have models amounting to this (very simplified, obviously):
class Mystery(models.Model):
    name = models.CharField(max_length=100)

class Character(models.Model):
    mystery = models.ForeignKey(Mystery, related_name="characters")
    required = models.BooleanField(default=True)
Basically, in each mystery there are a number of characters, which can be essential to the story or not. The minimum number of actors that can stage a mystery is the number of required characters for that mystery; the maximum number is the number of characters total for the mystery.
Now I'm trying to query for mysteries that can be played by some given number of actors. It seemed straightforward enough given the way Django's filtering and annotation features work; after all, both of these queries work fine:
# Returns mystery objects with at least x characters in all
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(max_actors__gte=x)
# Returns mystery objects with no more than x required characters
Mystery.objects.filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x)
However, when I try to combine the two...
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x, max_actors__gte=x)
...it doesn't work. Both min_actors and max_actors come out containing the maximum number of actors. The relevant parts of the actual query being run look like this:
SELECT `mysteries_mystery`.`id`,
`mysteries_mystery`.`name`,
COUNT(DISTINCT `mysteries_character`.`id`) AS `max_actors`,
COUNT(DISTINCT `mysteries_character`.`id`) AS `min_actors`
FROM `mysteries_mystery`
LEFT OUTER JOIN `mysteries_character` ON (`mysteries_mystery`.`id` = `mysteries_character`.`mystery_id`)
INNER JOIN `mysteries_character` T5 ON (`mysteries_mystery`.`id` = T5.`mystery_id`)
WHERE T5.`required` = True
GROUP BY `mysteries_mystery`.`id`, `mysteries_mystery`.`name`
...which makes it clear that while Django is creating a second join on the character table just fine (the second copy of the table being aliased to T5), that table isn't actually being used anywhere and both of the counts are being selected from the non-aliased version, which obviously yields the same result both times.
Even when I try to use an extra clause to select from T5, I get told there is no such table as T5, even though examining the output query shows that it is still aliasing the second character table to T5. Another attempt to do this with extra clauses went like this:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).extra(select={'min_actors': "SELECT COUNT(*) FROM mysteries_character WHERE required = True AND mystery_id = mysteries_mystery.id"}).extra(where=["`min_actors` <= %s", "`max_actors` >= %s"], params=[x, x])
But that didn't work because I can't use a calculated field in the WHERE clause, at least on MySQL. If only I could use HAVING, but alas, Django's .extra() does not and will never allow you to set HAVING parameters.
Is there any way to get Django's ORM to do what I want?
How about combining your Count()s:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True),min_actors=Count('characters', distinct=True)).filter(characters__required=True).filter(min_actors__lte=x, max_actors__gte=x)
This seems to work for me but I didn't test it with your exact models.
It's been a couple of weeks with no suggested solutions, so here's how I ended up going about it, for anyone else who might be looking for an answer:
Mystery.objects.annotate(max_actors=Count('characters', distinct=True)).filter(max_actors__gte=x, id__in=Mystery.objects.filter(characters__required=True).annotate(min_actors=Count('characters', distinct=True)).filter(min_actors__lte=x).values('id'))
In other words, filter on the first count and on IDs that match those in an explicit subquery that filters on the second count. Kind of clunky, but it works well enough for my purposes.
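For reference, the single-pass SQL the ORM would ideally generate uses conditional aggregation; a sketch against the generated schema shown above, where x stands for the actor count (on Django 2.0+, Count('characters', distinct=True, filter=Q(characters__required=True)) produces something equivalent):
SELECT m.id,
       m.name,
       -- all characters, required or not
       COUNT(DISTINCT c.id) AS max_actors,
       -- only required characters, counted in the same pass
       COUNT(DISTINCT CASE WHEN c.required THEN c.id END) AS min_actors
FROM mysteries_mystery m
LEFT OUTER JOIN mysteries_character c ON c.mystery_id = m.id
GROUP BY m.id, m.name
HAVING COUNT(DISTINCT CASE WHEN c.required THEN c.id END) <= x
   AND COUNT(DISTINCT c.id) >= x;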