PostgreSQL: Using CASE with joined tables - sql

I'm a Postgresql newbie, so still struggling a little bit here. Please be gentle.
I'm left joining three tables, and would like to be able to use a case statement to introduce another column that brings across a desired value from one column based on another. I'm going to guess that my INNER JOIN and CASE statements are back to front, but I'm not sure how to rearrange them without breaking the intent.
Basically: where best_model_fit = 'SUNNY', I'd like a new column named applied_f_model_hours_above4k to carry the value from the column hoursabove4k_sunny.
Code sample:
SELECT *
FROM px_f_weathercell
INNER JOIN f_descriptions ON px_f_weathercell.px_id = f_descriptions.fuel_id
INNER JOIN dailywx ON px_f_weathercell.fid_new_wx_cells = dailywx.location
CASE best_model_fit
WHEN 'SUNNY' then hoursabove4k_sunny
END applied_f_model_hours_above4k
WHERE best_model_fit = 'SUNNY' /* limiting my test case here, clause will be removed later */
LIMIT 1000;
The error is as follows:
ERROR: syntax error at or near "CASE"
LINE 5: CASE best_model_fit
^
SQL state: 42601
Character: 210
Thank you for any help you can offer.
Bonus points: CASE seems slow. 45 seconds to run this query. dailywx has 400,000 rows, px_f_weathercell has 6,000,000 rows. Is there a faster way to do this?
EDIT:
Made the following edit; now getting a column full of nulls even though the desired source column has numbers (including 0) in it. Both columns are of type double.
EDIT2: Updated a couple of table names to indicate where columns are coming from. Also updated to show the left join. I've also used PGTune to set some recommended settings in order to address a situation where the process was disk bound, and I've set an index on px_f_weathercell.fid_new_wx_cells and px_f_weathercell.px_id. This has resulted in 100,000 records returning in approx 5-7 seconds. I'm still receiving null values from the CASE statement, however.
SELECT *,
CASE best_model_fit
WHEN 'SUNNY' then dailywx.hoursabove4k_sunny
END applied_f_model_hours_above4k
FROM px_f_weathercell
LEFT JOIN f_descriptions ON px_f_weathercell.px_id = f_descriptions.fuel_id
LEFT JOIN dailywx ON px_f_weathercell.fid_new_wx_cells = dailywx.location
WHERE f_descriptions.best_model_fit = 'SUNNY' /* limiting my test case here, clause will be removed later */
LIMIT 1000;

In a table, all rows have the same columns. You cannot have a column that exists only for some rows. Since a query result is essentially a table, that applies there as well.
So having NULL or 0 as a result for the rows where the information does not apply is your only choice.
The reason why the CASE expression is returning NULL is that you have no ELSE branch. If none of the WHEN conditions applies, the result is NULL.
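For example, a minimal sketch of the CASE with an explicit ELSE, reusing the table and column names from the question (the fallback value 0 is just an assumption; use whatever makes sense for the non-SUNNY rows):
SELECT *,
CASE f_descriptions.best_model_fit
WHEN 'SUNNY' THEN dailywx.hoursabove4k_sunny
ELSE 0 /* fallback when best_model_fit is not 'SUNNY' */
END AS applied_f_model_hours_above4k
FROM px_f_weathercell
LEFT JOIN f_descriptions ON px_f_weathercell.px_id = f_descriptions.fuel_id
LEFT JOIN dailywx ON px_f_weathercell.fid_new_wx_cells = dailywx.location
LIMIT 1000;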
The performance of the query is a different thing. You'd need to supply EXPLAIN (ANALYZE, BUFFERS) output to analyze that. But when joining big tables, it is often beneficial to set work_mem high enough.
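For example (the work_mem value below is only a placeholder; tune it to your hardware, and SET only affects the current session):
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM px_f_weathercell
LEFT JOIN f_descriptions ON px_f_weathercell.px_id = f_descriptions.fuel_id
LEFT JOIN dailywx ON px_f_weathercell.fid_new_wx_cells = dailywx.location
WHERE f_descriptions.best_model_fit = 'SUNNY'
LIMIT 1000;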

Related

Why would a subquery perform better than a literal value in a WHERE clause with multiple joins?

Take the following query:
SELECT *
FROM FactALSAppSnapshot AS LB
LEFT OUTER JOIN MLALSInfoStage AS LA ON LB.ProcessDate = LA.ProcessDate AND
LB.ALSAppID = LA.ALSNumber
LEFT OUTER JOIN MLMonthlyIncomeStage AS LC ON LB.ProcessDate = LC.ProcessDate AND
LB.ALSAppID = LC.ALSNumber
LEFT OUTER JOIN DimBranchCategory AS LI on LB.ALSAppBranchKey = LI.Branch
WHERE LB.ProcessDate=(SELECT TOP 1 LatestProcessDateKey
FROM DimDate)
Notice that the WHERE condition is a scalar subquery. The runtime for this is 0:54, resulting in 367,853 records.
However, if I switch the WHERE clause to the following:
WHERE LB.ProcessDate=20161116
This somehow causes the query runtime to jump up to 57:33, still resulting in 367,853 records. What is happening behind the scenes that would cause this huge jump in the runtime? I would have expected the subquery version to take longer, not the literal integer value.
The table aliased as LI (the last join in the list) seems to be the only table that isn't indexed on its key; if I remove that join and use the integer value instead of the subquery, the query performs much closer to the first version.
SQL Server 11
The real answer to your question lies in the execution plan for the query. You can see the actual plan in SSMS.
Without the plan, the rest is speculation. However, in my experience, what changes is the way the joins are processed: queries slow down considerably when the plan switches to nested loop joins. This is at the whim of the optimizer, which decides that nested loops are the best way to run the query when it sees the constant.
I'm not sure why this would be the case. Perhaps an index on FactALSAppSnapshot(ProcessDate, ALSAppID, ALSAppBranchKey) would speed up both versions of the query.
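If you want to test that idea, the index could be created along these lines (the index name is made up):
CREATE NONCLUSTERED INDEX IX_FactALSAppSnapshot_ProcessDate
ON FactALSAppSnapshot (ProcessDate, ALSAppID, ALSAppBranchKey);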

Mystery query fail: Why did this create a massive output?

I was attempting to do some basic Venn Diagram subtraction to compare a temp table to some live data, and see how they were different.
This query blew up to well north of 15 million returned rows, and I noticed it was duplicating (by 10,000x or more) a known unique field, a Globally Unique Identifier, which told me something had gone very wrong with my query. I was expecting at most 200 rows returned:
select a.*
from TableOfLiveData a
inner join #TempDataToBeSubtracted b
on a.GUID <> b.guidTemp --I suspect the issue is here
where {here was a limiting condition that should have reduced my live
data to a "pre-join" count(*) of 20,000 at most...}
After I hit Execute the query ran much longer than expected and I could see that millions of rows were being returned before I had to cancel out.
Let me know what the obvious thing is!?!?
edit: FYI: if the where clause were not included, I would expect a VAST number of rows returned...
Although your query is logically correct, the problem is that you have a "Cartesian product" (n x m rows) in your join. The WHERE clause is executed after the join is made, so it must be evaluated over a colossal number of rows, which is why the query is very, very slow.
A better approach is to do an outer join on the key columns, but discard all successful joins by filtering for missed joins:
select a.*
from TableOfLiveData a
left join #TempDataToBeSubtracted b on b.guidTemp = a.GUID
where a.field1 = 3
and a.field2 = 1515
and b.guidTemp is null -- only returns rows that *don't* match
This works because when an outer join is missed, you still get the row from the main table and all columns in the joined table are null.
Creating an index on (field1, field2) will improve performance.
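That index could look something like this (the index name is an assumption):
CREATE NONCLUSTERED INDEX IX_TableOfLiveData_field1_field2
ON TableOfLiveData (field1, field2);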
Thank you @Lamak and @MartinSmith for your comments that solved this problem.
By using a 'not equals' in my ON clause, I ensured that each row in the live table would be selected not just once, as I intended, but once for each entry in my #TempTable whose GUID it didn't match, multiplying my results by about 20,000 in this case (the cardinality of the #TempTable).
To fix this, I did a simple subquery on my #TempTable using a NOT IN condition, as recommended in the comments. This query finished in under a minute and returned under 100 rows, which was much more in line with my expectation:
select a.*
from TableOfLiveData a
where a.GUID not in (select b.guidTemp from #TempDataToBeSubtracted b)
and {subsequent constraint statement not relevant to question}
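One caution: NOT IN returns no rows at all if the subquery produces even one NULL guidTemp, so the equivalent NOT EXISTS anti-join is often the safer form:
select a.*
from TableOfLiveData a
where not exists (select 1
from #TempDataToBeSubtracted b
where b.guidTemp = a.GUID)
-- plus the same limiting conditions as in the query above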

Oracle optimization: weird execution plan when left joining an uncorrelated subquery

I wrote this query to tie product forecast data to historical shipment data in an Oracle star schema database, and the optimizer did not behave the way I expected, so I am curious about what is going on.
Essentially, I have a bunch of dimension tables that are consistent for both the forecast and the sales fact tables, but the fact tables are aggregated at a different level, so I set them up as two subqueries and roll them up so I can tie them together (query example below). In this case, I want all of the forecast data but only the sales data that matches.
The odd thing is that if I use either of the subqueries by itself, it behaves the way I would expect and returns in less than a second (using the same filters; I tested by just removing one or the other subquery and changing the alias).
Here is an example of the query structure -- I kept it as generic as I could, so there may be a few typos from changing it:
SELECT
TIME_DIMENSION.GREGORIAN_DATE,
SOURCE_DIMENSION.LOCATION_CODE,
DESTINATION_DIMENSION.REGION,
PRODUCT_DIMENSION.PRODUCT_CODE,
SUM(NVL(FIRST_SUBQUERY.VALUE,0)) VALUE1,
SUM(NVL(SECOND_SUBQUERY.VALUE,0)) VALUE2
FROM
TIME_DIMENSION,
LOCATION_DIMENSION SOURCE_DIMENSION,
LOCATION_DIMENSION DESTINATION_DIMENSION,
PRODUCT_DIMENSION,
(SELECT
FORECAST_FACT.TIME_KEY,
FORECAST_FACT.SOURCE_KEY,
FORECAST_FACT.DESTINATION_KEY,
FORECAST_FACT.PRODUCT_KEY,
SUM(FORECAST_FACT.VALUE) AS VALUE
FROM FORECAST_FACT
WHERE [FORECAST_FACT FILTERS HERE]
GROUP BY
FORECAST_FACT.TIME_KEY,
FORECAST_FACT.SOURCE_KEY,
FORECAST_FACT.DESTINATION_KEY,
FORECAST_FACT.PRODUCT_KEY) FIRST_SUBQUERY
LEFT JOIN
(SELECT
--This is just as an example offset
(LAST_YEAR_FACT.TIME_KEY + 52) TIME_KEY,
LAST_YEAR_FACT.SOURCE_KEY,
LAST_YEAR_FACT.DESTINATION_KEY,
LAST_YEAR_FACT.PRODUCT_KEY,
SUM(LAST_YEAR_FACT.VALUE) AS VALUE
FROM LAST_YEAR_FACT
WHERE [LAST_YEAR_FACT FILTERS HERE]
GROUP BY
LAST_YEAR_FACT.TIME_KEY,
LAST_YEAR_FACT.SOURCE_KEY,
LAST_YEAR_FACT.DESTINATION_KEY,
LAST_YEAR_FACT.PRODUCT_KEY) SECOND_SUBQUERY
ON
FIRST_SUBQUERY.TIME_KEY = SECOND_SUBQUERY.TIME_KEY
AND FIRST_SUBQUERY.SOURCE_KEY = SECOND_SUBQUERY.SOURCE_KEY
AND FIRST_SUBQUERY.DESTINATION_KEY = SECOND_SUBQUERY.DESTINATION_KEY
--I also tried to tie the last_year subquery to the dimension tables here
WHERE
FIRST_SUBQUERY.TIME_KEY = TIME_DIMENSION.TIME_KEY
AND FIRST_SUBQUERY.SOURCE_KEY = SOURCE_DIMENSION.LOCATION_KEY
AND FIRST_SUBQUERY.DESTINATION_KEY = DESTINATION_DIMENSION.LOCATION_KEY
AND FIRST_SUBQUERY.PRODUCT_KEY = PRODUCT_DIMENSION.PRODUCT_KEY
--I also tried, separately, to tie the last_year subquery to the dimension tables here
AND TIME_DIMENSION.WEEK = 'VALUE'
AND SOURCE_DIMENSION.SOURCE_CODE = 'VALUE'
AND DESTINATION_DIMENSION.REGION IN ('VALUE', 'VALUE')
AND PRODUCT_DIMENSION.CLASS_CODE = 'VALUE'
GROUP BY
TIME_DIMENSION.GREGORIAN_DATE,
SOURCE_DIMENSION.LOCATION_CODE,
DESTINATION_DIMENSION.REGION,
PRODUCT_DIMENSION.PRODUCT_CODE
Essentially, when I run either subquery independently it will use the indexes and search only a specific range of the relevant partition, whereas with the left join it always does a full table scan on one of the fact tables. What seems to be happening is that Oracle applies the dimension table filters only to the first subquery, and so, to do the left join, it first needs to scan the entire sales table, even if I explicitly tie and filter the values twice instead of relying on the implicit filtering (I tried that). Am I thinking about this wrong? To me, the optimizer should use the indexes on both fact tables to filter each by the values in the WHERE clause and then left join the resulting subsets.
I realize that I could simply add the filters to each of the subqueries, or set this up as a union of two independent queries, but I am curious as to what exactly is going on in terms of the optimization engine -- I can post the execution plan, if that would help.
Thanks!
Be sure that the tables are all analysed, and do it again: the optimizer uses those statistics to calculate its execution plan. In cases where Oracle really does choose a wrong plan, your workaround is to force the optimizer with hints (/*+ ... */), specifying the use of indexes, the join order, etc.
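As an illustration only (the schema owner defaults to the current user here, and LAST_YEAR_FACT_IX1 is a made-up index name, not something from the question):
-- Re-gather optimizer statistics on the fact tables
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'FORECAST_FACT');
DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'LAST_YEAR_FACT');
END;
/
-- Hypothetical hint forcing a named index on the sales fact table
SELECT /*+ INDEX(LAST_YEAR_FACT LAST_YEAR_FACT_IX1) */
LAST_YEAR_FACT.TIME_KEY, SUM(LAST_YEAR_FACT.VALUE) AS VALUE
FROM LAST_YEAR_FACT
GROUP BY LAST_YEAR_FACT.TIME_KEY;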

SQL query either runs endlessly or returns no values

Having trouble with a multi-table query today. I tried writing it myself and it didn't seem to work, so I selected all of the columns in the Management Studio Design view. The code SHOULD work, but alas it doesn't. If I run this query, it seems to just keep going and going. I left my desk for a minute, and when I came back and stopped the query, it had returned something like 2,000,000 rows (there are only about 120,000 in the PODetail table!):
SELECT PODetail.OrderNum, PODetail.VendorNum, vw_orderHistory.Weight, vw_orderHistory.StdSqft, vw_orderHistory.ReqDate, vw_orderHistory.City,
vw_orderHistory.State, FB_FreightVend.Miles, FB_FreightVend.RateperLoad
FROM PODetail CROSS JOIN
vw_orderHistory CROSS JOIN
FB_FreightVend
ORDER BY ReqDate
Not only that, but it seems that every record had an OrderNum of 0 which shouldn't be the case. So I tried to exclude it...
SELECT PODetail.OrderNum, PODetail.VendorNum, vw_orderHistory.Weight, vw_orderHistory.StdSqft, vw_orderHistory.ReqDate, vw_orderHistory.City,
vw_orderHistory.State, FB_FreightVend.Miles, FB_FreightVend.RateperLoad
FROM PODetail CROSS JOIN
vw_orderHistory CROSS JOIN
FB_FreightVend
WHERE PODetail.OrderNum <> 0
ORDER BY ReqDate
While it executes successfully (no errors), it also returns no records whatsoever. What's going on here? I'm also curious about the query's CROSS JOIN. When I tried writing this myself, I first used "WHERE PODetail.OrderNum = vw_orderHistory.OrderNum" to join those tables but I got the same no results issue. When I tried using JOIN, I got errors regarding "multi-part identifier could not be bound."
A cross join returns a huge number of records: the product of the number of records in each table. That might be 10,000 * 100,000 * 100, which is a very big number.
The one caveat is when a table is empty: then the row count for that table is 0, and 0 times anything is 0, so no rows are returned. And no rows can be returned quite quickly.
I think you need to learn what join really does in SQL. Then you need to reimplement this with the correct join conditions. Not only will the query run faster, but it will return accurate results.
Do not use cross joins, especially on large tables. The link below will help.
http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
Also, "multi-part identifier could not be bound" means the column might not exist as referenced. Verify that the column exists, its data type, and the name or alias you use for it in the join.
With the condition <> 0, all non-matching values from PODetail will be omitted.
Use (OrderNum <> 0 OR OrderNum IS NULL)
Avoid CROSS JOINS like the plague. Explicitly define your Order, PO and VendorFreight JOINS.
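A sketch of what that might look like; the join keys below are assumptions, since the question doesn't show which columns actually relate these tables:
SELECT PODetail.OrderNum, PODetail.VendorNum, vw_orderHistory.Weight, vw_orderHistory.StdSqft, vw_orderHistory.ReqDate, vw_orderHistory.City,
vw_orderHistory.State, FB_FreightVend.Miles, FB_FreightVend.RateperLoad
FROM PODetail
INNER JOIN vw_orderHistory ON vw_orderHistory.OrderNum = PODetail.OrderNum -- assumed key
INNER JOIN FB_FreightVend ON FB_FreightVend.VendorNum = PODetail.VendorNum -- assumed key
WHERE PODetail.OrderNum <> 0
ORDER BY vw_orderHistory.ReqDate;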

ORDER BY on a column that allows NULL is slow. Why?

So really my question is WHY this worked.
Anyway, I had this query that does a few inner joins, has a where clause, and does an order by on an nvarchar column. If I run the query WITHOUT the order by, the query takes less than a second. If I run the query WITH the order by, it takes 12 seconds.
Now I had a great idea and changed all the INNER JOINs to LEFT JOINs. And also included the ORDER BY clause. That took less than a second. So I remembered the difference between LEFT JOINs and INNER JOINs. INNER JOINs check for NULL and LEFT JOINs don't. So I went into the table design and unchecked "Allow Nulls". Now I run the query WITH INNER JOINs and a ORDER BY clause and the query takes less than a second. WHY?
From what I understand, the FROM, JOINS, WHERE, then SELECT clauses should run first and return a result set. Then the ORDER BY clause runs at the very end on the resultant record set. Therefore the query should have taken AT MOST a second, yes, even with the column allowing nulls. So why would the query take less than a second WITHOUT the order by clause, but take 12 seconds WITH order by clause? That doesn't make sense to me.
Query below:
SELECT PlanInfo.PlanId, PlanName, COALESCE(tResponsible, '') AS tResponsible, Processor, CustName, TaskCategoryId, MapId, tEnd,
CASE MapId WHEN 9 THEN 1 ELSE 2 END AS sor
FROM PlanInfo INNER JOIN [orders].dbo.BaanOrders_Ext ON PlanInfo.PlanName = [orders].dbo.BaanOrders_Ext.OrderNo
INNER JOIN [orders].dbo.BaanOrders ON PlanInfo.PlanName = [orders].dbo.BaanOrders.OrderNo
INNER JOIN Tasks ON PlanInfo.PlanId = Tasks.PlanId
INNER JOIN EngSchedToTimingMap ON Tasks.CatId = EngSchedToTimingMap.TaskCategoryId
WHERE (MapId = 9 OR MapId = 11 or MapId = 13 or MapId = 15)
AND([orders].dbo.BaanOrders_Ext.Processor = 'metest' OR tResponsible = 'metest')
ORDER BY PlanInfo.PlanId
I would have to guess that it is due to having an index on PlanInfo.PlanId, on which you are sorting.
SQL Server could stream the rows in index order and build the rest of the columns along the way. When the field is NULLable, the index cannot be used for sorting, because it will not contain the NULL values (which incidentally sort first), so the optimizer decides to take a different path.
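If no such index exists yet, the kind of index being described would be created like this (the name is an assumption; making the column non-nullable is done separately in the table design):
CREATE NONCLUSTERED INDEX IX_PlanInfo_PlanId
ON PlanInfo (PlanId);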
Showing the execution plan always helps. Either paste images of the plans, or just show the text-mode plans, i.e. add the following lines above the query (SET SHOWPLAN_TEXT must be the only statement in its batch), then execute it:
SET SHOWPLAN_TEXT ON;
GO
<the query>
When you use the ORDER BY clause, you force the database engine to sort the results. This takes some time (especially if the result contains many rows) - thus it is possible that a query that runs 1 second without an ORDER BY clause runs 12 seconds with it. Note that sorting takes at best O(N*log(N)) time where N is the number of rows.
The reason why NULLs are generally slow is the fact that they must be treated specially. Sorting with NULLs adds more complex comparison conditions and slows the sorting down.
If your question is "Why does the ORDER BY clause cause my query to run longer?" the answer is because sorting the results is added to the query execution plan.
If you use the "Show Estimated Query Execution Plan" tool in SQL Server Studio, it will show you exactly what it thinks the SQL Server engine will do.