I've got a query that consists of a cross join between 3 dimensions:
dim1 - has 15,000 members
dim2 - this is the organisation hierarchy
dim3 - has 30,000 members
this is the query:
TOPCOUNT(
  FILTER(
    NONEMPTY([DIM1].[Key].[All], [Measures].[Target])
    * [DIM2].CHILDREN
    * NONEMPTY([DIM3].[Key].[All], [Measures].[Target]),
    [Measures].[Target] > 0
  ),
  20,
  [Measures].[Gap]
)
As you can imagine, this is a huge cross join, but I have to do it...
I've tried functions like Filter and NonEmpty but they don't help... it takes more than 30 minutes.
How can I improve it so it runs in a short time?
thank you
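One approach that often helps (a sketch under assumptions, not a guaranteed fix, since the right form depends on your cube design): apply NonEmpty once over the entire cross join against the measure, so the storage engine can reduce the set in one pass before Filter and TopCount ever see it. The [Key].[Key].MEMBERS levels below are assumptions about your attribute hierarchies; adjust to match your dimensions:

```mdx
TOPCOUNT(
  FILTER(
    NONEMPTY(
      [DIM1].[Key].[Key].MEMBERS
      * [DIM2].CHILDREN
      * [DIM3].[Key].[Key].MEMBERS,
      [Measures].[Target]
    ),
    [Measures].[Target] > 0
  ),
  20,
  [Measures].[Gap]
)
```

Note that Filter is still needed for the strictly-positive check (NonEmpty only removes nulls, not negatives or zeros), but it now runs over the already-reduced set rather than the full 15,000 × n × 30,000 space.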
Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL query that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records and the CTE_patient_record_item CTE has about 200 rows per record for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.record_id = cpri002.record_id
    AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.record_id = cpri003.record_id
    AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.record_id = cpri004.record_id
    AND cpri004.record_item_name = 'medication_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Redshift UNLOAD command to write to disk pipe delimited.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just the 2 records.
You might ask me why I am joining on the item name which is a string. Basically there is no item id, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW; however, it takes 30 to 40 minutes (not surprisingly) to build the materialized view as well.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can create integer IDs to join on, plus indexes? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to Python, and process it with a pandas DataFrame, which I've done before successfully, but it would be nice if I could have an efficient query, just use Redshift UNLOAD, and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
EXPLAIN PLAN for the original solution is 2700 lines long, for the new solution using conditional aggregation is 40 lines long.
Thanks guys.
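For anyone landing here later, the conditional-aggregation rewrite described in the update can be sketched end to end. This is a minimal, runnable illustration using Python's sqlite3 (the sample rows are hypothetical; only the MAX(CASE ...) pattern itself comes from the thread). Each output column is picked out with a conditional aggregate in a single scan of the item table, instead of one self-join per item name:

```python
import sqlite3

# Tiny stand-in for fact_patient_record_item: one row per (record_id, item name).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_patient_record_item (
    record_id INTEGER,
    record_item_name TEXT,
    record_item_value TEXT
);
INSERT INTO fact_patient_record_item VALUES
    (100001, 'diagnosis_1',  '09'),
    (100001, 'diagnosis_2',  '9B'),
    (100001, 'medication_1', '88X'),
    (100002, 'diagnosis_1',  '11'),
    (100002, 'medication_1', '42A');
""")

# One pass over the table; MAX(CASE ...) pivots each item name into a column.
rows = conn.execute("""
SELECT
    record_id,
    MAX(CASE WHEN record_item_name = 'diagnosis_1'  THEN record_item_value END) AS diagnosis_1,
    MAX(CASE WHEN record_item_name = 'diagnosis_2'  THEN record_item_value END) AS diagnosis_2,
    MAX(CASE WHEN record_item_name = 'medication_1' THEN record_item_value END) AS medication_1
FROM fact_patient_record_item
GROUP BY record_id
ORDER BY record_id
""").fetchall()

for row in rows:
    print(row)
```

Note that a missing item simply comes back as NULL (record 100002 has no diagnosis_2 here), which is usually what you want for a fixed-width export.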
Without some more information it is impossible to know what is going on for sure, but what you are doing is likely not ideal. An explain plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down but shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also is record_id unique in fact_patient_record? If not then this will cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting your 40 min execution time is very plausible.
There is no need to join when all the data is in a single copy of the table. @PhilCoulson is correct that some sort of conditional aggregation could be applied, and the DECODE() syntax could save you space if you don't like CASE. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values of record_item_value for each record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.
Please propose an approach I should follow, since I am obviously missing the point. I am new to SQL and still think in terms of MS Access. Here's an example of what I'm trying to do. Like I said, don't worry about the detail; I just want to know how I would do this in SQL.
I have the following tables:
Hrs_Worked (staff, date, hrs) (200 000+ records)
Units_Done (team, date, type) (few thousand records)
Rate_Per_Unit (date, team, RatePerUnit) (few thousand records)
Staff_Effort (staff, team, timestamp) (eventually 3 - 4 million records)
SO I need to do the following:
1) Calculate what each team earned by multiplying their units by RatePerUnit, grouping on Team and Date. I create a view teamEarnPerDay:
CREATE VIEW teamEarnPerDay AS
SELECT
    Units_Done.Date,
    Units_Done.TeamID,
    Sum([Units_Done] * [Rate_Per_Unit.Rate]) AS Earn
FROM Units_Done INNER JOIN Rate_Per_Unit
    ON (Units_Done.quality = Rate_Per_Unit.quality)
    AND (Units_Done.type = Rate_Per_Unit.type)
    AND (Units_Done.TeamID = Rate_Per_Unit.TeamID)
    AND (Units_Done.Date = Rate_Per_Unit.Date)
GROUP BY
    Units_Done.Date,
    Units_Done.TeamID;
2) Count the TEAM's effort by Grouping Staff_Effort on Team and Date and counting records. This table has a few million records.
I have to cast the timestamp as a date....
CREATE VIEW team_effort AS
SELECT
    TeamID
    ,CAST([Timestamp] AS Date) AS TeamDate
    ,Count(Staff_EffortID) AS TeamEffort
FROM Staff_Effort
GROUP BY
    TeamID
    ,CAST([Timestamp] AS Date);
3) Calculate the Team's Rate_of_pay: (1) Team_earnings / (2) Team_effort
I use the 2 views I created above. This view's performance drops but is still acceptable to me.
CREATE VIEW team_rate_of_pay AS
SELECT
    tepd.Date
    ,tepd.TeamID
    ,tepd.Earn
    ,te.TeamEffort
    ,[Earn]/[TeamEffort] AS teamRate
FROM teamEarnPerDay tepd
INNER JOIN team_effort te
    ON (tepd.Date = te.TeamDate)
    AND (tepd.TeamID = te.TeamID);
4) Group Staff_Effort on Date and Staff and count records to get each individual's effort (share of the team effort).
I have to cast the Timestamp as a date....
CREATE VIEW staff_effort AS
SELECT
    TeamID
    ,StaffID
    ,CAST([Timestamp] AS Date) AS StaffDate
    ,Count(Staff_EffortID) AS StaffEffort
FROM Staff_Effort
GROUP BY
    TeamID
    ,StaffID
    ,CAST([Timestamp] AS Date);
5) Calculate Staff earnings by: (4) Staff_Effort x (3) team_rate_of_pay
Multiply the individual's effort by the team rate he worked at on the day.
This one is ridiculously slow. In fact, it's useless.
CREATE View staff_earnings AS
SELECT
staff_effort.StaffDate
,staff_effort.StaffID
,sum(staff_effort.StaffEffort) AS StaffEffort
,sum([StaffEffort]*[TeamRate]) AS StaffEarn
FROM staff_effort INNER JOIN team_rate_of_pay
ON (staff_effort.TeamID = team_rate_of_pay.TeamID)
AND (staff_effort.StaffDate = team_rate_of_pay.Date)
GROUP BY
staff_effort.StaffDate,
staff_effort.StaffID;
So you see what I mean.... I need various results and subsequent queries are dependent on those results.
What I tried to do is write a view for each of the above steps and then just use the view in the next step, and so on. They work fine, but view nr 3 runs slower than the rest, though still acceptably. View nr 5 is just ridiculously slow.
I actually have another view after nr 5 which brings hours worked into play as well, but that just takes forever to produce a few rows.
I want a single line for each staff member, showing what he earned each day calculated as set out above, with his hours worked each day.
I also tried to reduce the number of views by using sub-queries instead but that took even longer.
A little guidance / direction will be much appreciated.
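One common restructuring for a chain of dependent views is to fold steps 1-5 into a single statement built from CTEs, so the optimizer sees the whole plan at once and each aggregate is computed exactly once. The sketch below is runnable via Python's sqlite3; the column names Units and Rate, the sample rows, and the simplified join keys are assumptions for illustration, not the poster's exact schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Units_Done   (TeamID INTEGER, Date TEXT, type TEXT, Units INTEGER);
CREATE TABLE Rate_Per_Unit(TeamID INTEGER, Date TEXT, type TEXT, Rate REAL);
CREATE TABLE Staff_Effort (StaffID INTEGER, TeamID INTEGER, Timestamp TEXT);

INSERT INTO Units_Done    VALUES (1, '2023-01-02', 'A', 10);
INSERT INTO Rate_Per_Unit VALUES (1, '2023-01-02', 'A', 2.0);
INSERT INTO Staff_Effort  VALUES
    (7, 1, '2023-01-02 08:00:00'), (7, 1, '2023-01-02 09:00:00'),
    (8, 1, '2023-01-02 08:30:00'), (8, 1, '2023-01-02 10:00:00');
""")

# Steps 1-5 as CTEs in one statement: team earnings, team effort,
# staff effort, then staff earnings = staff effort * (earn / team effort).
rows = conn.execute("""
WITH team_earn AS (
    SELECT u.Date, u.TeamID, SUM(u.Units * r.Rate) AS Earn
    FROM Units_Done u
    JOIN Rate_Per_Unit r
      ON r.TeamID = u.TeamID AND r.Date = u.Date AND r.type = u.type
    GROUP BY u.Date, u.TeamID
),
team_eff AS (
    SELECT TeamID, DATE(Timestamp) AS TeamDate, COUNT(*) AS TeamEffort
    FROM Staff_Effort
    GROUP BY TeamID, DATE(Timestamp)
),
staff_eff AS (
    SELECT StaffID, TeamID, DATE(Timestamp) AS StaffDate, COUNT(*) AS StaffEffort
    FROM Staff_Effort
    GROUP BY StaffID, TeamID, DATE(Timestamp)
)
SELECT s.StaffDate, s.StaffID,
       s.StaffEffort * te.Earn / tf.TeamEffort AS StaffEarn
FROM staff_eff s
JOIN team_earn te ON te.TeamID = s.TeamID AND te.Date     = s.StaffDate
JOIN team_eff  tf ON tf.TeamID = s.TeamID AND tf.TeamDate = s.StaffDate
ORDER BY s.StaffID
""").fetchall()

print(rows)
```

Here the team earned 20.0 with 4 effort records, so each of the two staff members (2 records each) gets 10.0. Casting the timestamp to a date once per CTE, rather than once per view layer, also gives the optimizer a better shot at a sane join order.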
Thanks in advance.
--EDIT--
Taking the query posted in the comments: after some formatting, adding aliases, and a little cleanup, it would look like this.
SELECT epd.CompanyID
,epd.DATE
,epd.TeamID
,epd.Earn
,tb.TeamBags
,epd.Earn / tb.TeamBags AS RateperBag
FROM teamEarnPerDay epd
INNER JOIN teamBags tb ON epd.DATE = tb.TeamDate
AND epd.TeamID = tb.TeamID;
I eventually did 2 things:
1) Managed to reduce the number of nested views by using sub-queries. This did not improve performance by much, but it seems simpler with fewer views.
2) The actual improvement came from using a LEFT JOIN instead of an INNER JOIN.
The final view ran for 50 minutes with the Inner Join without producing a single row yet.
With LEFT JOIN, it produced all the results in 20 seconds!
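A caution for readers applying this fix: LEFT JOIN and INNER JOIN are not equivalent. The speedup suggests the engine chose a different join strategy, but any left-side row with no match now appears with NULLs instead of being dropped, so the result set can differ. A tiny sqlite3 illustration (table names are hypothetical simplifications of the views above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE effort (StaffID INTEGER, StaffDate TEXT, StaffEffort INTEGER);
CREATE TABLE rate   (TeamDate TEXT, teamRate REAL);
INSERT INTO effort VALUES (1, '2023-01-02', 5), (2, '2023-01-03', 4);
INSERT INTO rate   VALUES ('2023-01-02', 2.0);  -- no rate row for 2023-01-03
""")

# INNER JOIN silently drops staff 2 (no matching rate row).
inner = conn.execute("""
    SELECT e.StaffID, e.StaffEffort * r.teamRate
    FROM effort e JOIN rate r ON r.TeamDate = e.StaffDate
""").fetchall()

# LEFT JOIN keeps staff 2, with NULL earnings.
left = conn.execute("""
    SELECT e.StaffID, e.StaffEffort * r.teamRate
    FROM effort e LEFT JOIN rate r ON r.TeamDate = e.StaffDate
""").fetchall()

print(inner)
print(left)
```

If both joins really should return the same rows in your data, the extra NULL rows are harmless, but it is worth verifying rather than assuming.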
Hope this helps someone.
Our wishlist triggered emails are based on the entries in a huge table which is populated by a query (after a weekly import). Unfortunately this query started crashing some weeks ago due to exceeding the log size, and our whole wishlist email system is therefore broken.
The query joins the search terms and products of each customer with our product table, in order to pull all the information related to these deals. The query is now taking way over 30 minutes and therefore timing out. The source table has more than 13M rows and the products table has about 90,000 rows.
I've already tried reducing the complexity of said query by dividing the query and reducing the amount of fields it needs to pull, going as low as only trying to pull 2 fields. Still too much... I've added WITH (NOLOCK) and I've tried pulling only for 10% of our customers at a time. Still way too long. I've been trying many many variations and it doesn't seem to matter how small I make the query, it will still time out.
This is the query as a whole:
SELECT
w.SubscriberKey,
w.WishlistSearchTerm,
w.ProductID,
p.Title,
p.SaleType,
p.TravelType,
p.IsPackage,
p.MetaTags,
p.StartDate AS "SaleMetaTags",
p.EndDate,
p.DestinationName,
p.DestinationType,
p.CountryName AS "Country",
p.DivisionName,
p.CityName AS "City",
p.SaleURL AS "URL",
p.Summary AS "ShortDescription",
p.MainParagraph,
p.SecondOpinion AS "HighlightsDescription",
p.WeLike AS "AdditionalText",
p.LeadRate AS "BestRate",
p.Currency AS "BestRateCurrency",
p.Discount AS "TopDiscount",
p.LeadImage1,
p.LeadImage2,
p.LeadImage3,
p.LeadImage4,
p.LeadImage5,
p.FetchTimestamp AS "ModifiedTimestamp",
p.ReasonToLove,
p.Latitude,
p.Longitude
FROM EXT_SALE_WISHLIST_TEMP w WITH (NOLOCK)
JOIN DIM_SALES_V4 p ON p.ProductID=w.ProductID
WHERE w.SubscriberKey BETWEEN 0 AND 5000000
But as said, I've already tried to make it as small as:
SELECT
w.SubscriberKey,
w.WishlistSearchTerm,
w.ProductID,
p.Title,
p.SaleType,
p.TravelType
FROM EXT_SALE_WISHLIST_TEMP w WITH (NOLOCK)
JOIN DIM_SALES_V4 p ON p.ProductID=w.ProductID
WHERE w.SubscriberKey BETWEEN 0 AND 5000000
Regrettably, the person who built this left us some months ago, and this is not quite my field of expertise.
Some more information:
SubscriberKey and ProductID are primary keys
The source table is a temporary one and the target table is empty
The query is set to Overwrite
I added WHERE w.SubscriberKey BETWEEN 0 AND 5000000 last, but I've tried many times without this clause.
I just need to get this query under 30 minutes of runtime, or divide it into as many queries as needed; it doesn't matter if in total it takes a couple of hours to run, as it is a weekly process.
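Since dividing the work into many smaller queries is explicitly acceptable here, one pragmatic route is to generate one statement per fixed-size SubscriberKey range and run them sequentially. A sketch of the batching logic (the column list is abbreviated, the batch size is an arbitrary example, and the execution step is left out since it depends on your tooling):

```python
def batch_queries(min_key, max_key, batch_size):
    """Yield one query per SubscriberKey range so each statement stays small."""
    template = (
        "SELECT w.SubscriberKey, w.WishlistSearchTerm, w.ProductID, p.Title\n"
        "FROM EXT_SALE_WISHLIST_TEMP w WITH (NOLOCK)\n"
        "JOIN DIM_SALES_V4 p ON p.ProductID = w.ProductID\n"
        "WHERE w.SubscriberKey BETWEEN {lo} AND {hi}"
    )
    lo = min_key
    while lo <= max_key:
        hi = min(lo + batch_size - 1, max_key)
        yield template.format(lo=lo, hi=hi)
        lo = hi + 1

# Split the 0..5,000,000 key space into ranges of 1,000,000 keys.
queries = list(batch_queries(0, 5_000_000, 1_000_000))
print(len(queries))  # 6
print(queries[0])
```

That said, batching only bounds each statement's work; if the join itself is the bottleneck (e.g. no usable index on ProductID), it may be worth checking the execution plan before settling for a couple of hours of batches.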
I need to join 6 queries in MDX which fetch results from an OLAP cube.
The problem is that all the queries have different WHERE conditions, and I want to join them on the basis of rows. The query is:
WITH
MEMBER MEASURES.CONSTANTVALUE AS 0
SELECT
Union(MEASURES.CONSTANTVALUE,[Measures].[Totalresult]) on 0,
NON EMPTY {Hierarchize(Filter ({[keyword].[All keywords]},([Measures].[Totalresult]=0)))} ON 1
FROM [Advancedsearch]
WHERE {[Path].[/Search]}
In the above query, the filter will be changed in the different queries.
How can we join these?
I would think that a cross product between the list of filters and your existing set on the rows should either already give you what you want, or be a starting point for further refinement of requirements not stated so far in your question:
This would mean something like
NON EMPTY
{[Path].[/Search], [Path].[/Search2]}
*
{Hierarchize(Filter ({[keyword].[All keywords]}, ([Measures].[Totalresult]=0)))}
ON 1
(guessing your second filter would be [Path].[/Search2]) instead of your original
NON EMPTY
{Hierarchize(Filter ({[keyword].[All keywords]}, ([Measures].[Totalresult]=0)))}
ON 1
and omitting the WHERE.
I’m pretty new to the many-to-many dimensions but I have a scenario to solve, which raised a couple of questions that I can’t solve myself… So your help would be highly appreciated!
The scenario is:
There is a parent-child Categories dimension which has a recursive Categories hierarchy with NonLeafDataVisible set
There is a regular Products dimension, that slices the fact table
There is a bridge many-to-many ProductCategory table which defines the relation between the two. Important to note is that a product can belong to any level of the categories hierarchy – i.e. a particular category can have both – directly assigned products and sub-categories.
There is a fact Transactions table that holds a FK to the Product that has been sold, as well as a FK to its category. The FK is needed, because
I have all this modeled in BIDS, the dimension usage is set between each of the dimensions and the facts, and the many-to-many relation between the Categories and the Transactions table is in place. In other words, everything seems kind of OK.
I now need to write an MDX which I would use to create a report that shows something like that:
Lev1 Lev2 Lev3 Prod Count
-A
-AA 6
-AA 2
P6 1
P5 1
-AAA 2
P1 1
P2 1
-AAB 2
P3 1
P4 1
+BB
The following MDX almost returns what I need:
SELECT
[Measures].[SALES Count] ON COLUMNS,
NONEMPTYCROSSJOIN(
DESCENDANTS([Category].[PARENTCATEGORY].[Level 01].MEMBERS),
[Product].[Prod KEY].[Prod KEY].MEMBERS,
[Measures].[Bridge Distinct Count],
[Measures].[SALES Count],
2) ON ROWS
FROM [Sales]
The problem that I have is that for each of the non-leaf categories, the cross join returns a valid intersection with each of the products that’s been sold for it + all subcategories. Hence the result set contains way too much redundant data and besides I can’t find a way to filter out the redundancies in the SSRS report itself.
Any idea on how to rewrite the MDX so that it only returns the result set above?
Another problem is that if I create a role-playing Category dimension which I set to slice the transactions data directly, then the numbers that I get when browsing the cube are completely off... It seems as if SSAS is doing something during processing (but it's not the SQL statements it sends to the OLTP, as those remain exactly the same) that causes the problem, but I've no idea what. Any ideas?
Cheers,
Alex
I think I found a solution to the problem, using the following query:
WITH
MEMBER [Measures].[Visible] AS
IsLeaf([DIM Eco Res Category].[PARENTCATEGORY].CurrentMember)
MEMBER [Measures].[CurrentProd] AS
IIF
(
[Measures].[Visible]
,[DIM Eco Res Product].[Prod KEY].CurrentMember.Name
,""
)
SELECT
{
[Measures].[Visible]
,[Measures].[CurrentProd]
,[Measures].[FACT PRODSALES Count]
} ON COLUMNS
,NonEmptyCrossJoin
(
Descendants
(
[DIM Eco Res Product].[Prod KEY].[(All)]
,Leaves
)
,Descendants([DIM Eco Res Category].[PARENTCATEGORY].[(All)])
,[Measures].[FACT PRODSALES Count]
,2
)
DIMENSION PROPERTIES
MEMBER_CAPTION
,MEMBER_UNIQUE_NAME
,PARENT_UNIQUE_NAME
,LEVEL_NUMBER
ON ROWS
FROM [Sales];
In the report then I use the [Measures].[CurrentProd] as a source for the product column and that seems to work fine so far.