I've built a simple OLAP cube with two date dimensions, a single low-cardinality string dimension, and a single measure that just counts the rows in the fact table.
I'm trying to figure out the best way to filter on date dimensions. I have a query that works and produces the correct results, but it seems very inefficient. It looks like this:
SELECT
[measures].[user_count] on 0,
[gender].members on 1
FROM
profiles
WHERE
NonEmptyCrossJoin(
{ [birthday].[1960].[1].[1] : [birthday].[1989].[12].[31] },
{ [created_date].[2013].[1].[1] : [created_date].[2013].[12].[31] }
)
Mondrian performs more than 400 SQL queries that look like this:
select
"dates"."day" as "c0"
from
"samtest"."dates" as "dates"
where
("dates"."month" = 8 and "dates"."year" = 1989)
group by
"dates"."day"
order by
CASE WHEN "dates"."day" IS NULL THEN 1 ELSE 0 END, "dates"."day" ASC
Then around 60 queries that look like this:
select
"dates"."year" as "c0",
"dates"."month" as "c1",
"dates"."day" as "c2",
"dates_1"."year" as "c3",
"dates_1"."month" as "c4",
"dates_1"."day" as "c5",
"profiles"."gender" as "c6",
count("profiles"."profile_id") as "m0"
from
"samtest"."dates" as "dates",
"samtest"."profiles" as "profiles",
"samtest"."dates" as "dates_1"
where
"profiles"."created_date" = "dates"."date"
and
"dates"."year" = 2013
and
"profiles"."birthday" = "dates_1"."date"
and
"dates_1"."year" in (1978, 1979, 1980, 1981, 1982, 1983)
and
"profiles"."gender" = 'Unspecified'
group by
"dates"."year",
"dates"."month",
"dates"."day",
"dates_1"."year",
"dates_1"."month",
"dates_1"."day",
"profiles"."gender"
The first time I run this it takes around 12 minutes to complete. Most of that time seems to be spent issuing the SQL queries, but even when I run it again with everything cached, Mondrian still spends over 3 minutes computing the result. This seems strange to me because I can get the same result directly from the SQL database in less than a second.
Am I doing something completely wrong? Is this a bug? Is this just not a good use case for OLAP?
I'm using Mondrian 3.6.1. The SQL database is Redshift. If any more details about my configuration or schema would be useful, just let me know.
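For comparison, the direct query that returns the same gender breakdown in one pass looks roughly like this (a sketch; it assumes birthday and created_date on profiles are plain date columns, as the generated SQL above implies):
select
    "profiles"."gender" as "gender",
    count("profiles"."profile_id") as "user_count"
from
    "samtest"."profiles" as "profiles"
where
    "profiles"."birthday" between '1960-01-01' and '1989-12-31'
    and "profiles"."created_date" between '2013-01-01' and '2013-12-31'
group by
    "profiles"."gender";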
In the system I'm working on, I use the query below to fetch a portion of my table's data in order to implement pagination. The table contains only around 100 records right now, but it will grow to 1 million+ records later on.
SELECT
Id AS ActualMapping_Id,
BudgetPhase AS ActualMapping_BudgetPhase,
FromBH AS ActualMapping_FromBH,
ToBH AS ActualMapping_ToBH,
FromBI AS ActualMapping_FromBI,
ToBI AS ActualMapping_ToBI,
FromSI1 AS ActualMapping_FromSI1,
ToSI1 AS ActualMapping_ToSI1,
FromSI2 AS ActualMapping_FromSI2,
ToSI2 AS ActualMapping_ToSI2,
DataType AS ActualMapping_DataType,
Status AS ActualMapping_Status,
MappingType AS ActualMapping_MappingType,
LastMappedBy AS ActualMapping_LastMappedBy,
LastMappedDate AS ActualMapping_LastMappedDate
FROM
(
SELECT
Id,
BudgetPhase,
FromBH,
ToBH,
FromBI,
ToBI,
FromSI1,
ToSI1,
FromSI2,
ToSI2,
DataType,
Status,
MappingType,
LastMappedBy,
LastMappedDate,
ROW_NUMBER() OVER (ORDER BY FromBH) AS RowNumber
FROM
ActualMapping
WHERE
DataType = 'Cost' and
Status = 'Active' and
MappingType = 'Static' and
BudgetPhase like '%some_text%' and
ToBH like '%some_text%' and
ToBI like '%some_text%' and
ToSI1 like '%some_text%' and
ToSI2 like '%some_text%' and
FromBH like '%some_text%' and
FromBI like '%some_text%' and
FromSI1 like '%some_text%' and
FromSI2 like '%some_text%'
) AS NumberedTable
WHERE
RowNumber BETWEEN 1 AND 50
The above query works fine, at least on ~100 records, but after a while it starts getting blocked. I can't pin down the reason it gets blocked, but when that happens I can't even run a simple SELECT against the table unless I kill all the blocked queries that have piled up.
So my questions are:
Why does the query get blocked after a while?
Is this a good approach to implementing pagination on the SQL side for a large data set, possibly 1 million+ records?
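On the second question: ROW_NUMBER has to number the entire filtered set before the outer WHERE can discard everything outside rows 1-50, which gets expensive as the table grows. A keyset ("seek") alternative is sketched below, assuming SQL Server, an index on FromBH, and a hypothetical @lastFromBH parameter holding the last FromBH value of the previous page (a unique tiebreaker column such as Id would make the seek exact):
SELECT TOP (50)
    Id, BudgetPhase, FromBH, ToBH, FromBI, ToBI,
    FromSI1, ToSI1, FromSI2, ToSI2,
    DataType, Status, MappingType, LastMappedBy, LastMappedDate
FROM ActualMapping
WHERE DataType = 'Cost'
  AND Status = 'Active'
  AND MappingType = 'Static'
  -- the LIKE filters from the original WHERE clause carry over here unchanged
  AND FromBH > @lastFromBH
ORDER BY FromBH;
Each page then seeks directly to its starting point instead of renumbering everything, at the cost of making jumps to arbitrary page numbers harder to support.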
I need to store the result of the SELECT query below in a variable to cut down on computation time. The results are of the form 'X', 'Y', 'Z', ...
WHERE
PEL.kuerzel in (SELECT KL.kuerzel from ictq.KLE KL WHERE FachgruppeKuerzel=526)
Right now the SELECT query gets executed 3 times for each of ~2000 entries. If I were able to store the result locally, I would only have to run it once.
I'm working on a Sybase 11 database. How can I achieve this or anything similar?
The subquery the snippet was taken from, part of a query roughly 150 lines altogether:
SELECT list(PEL.kuerzel) from ictq.PEpisode PE
INNER JOIN ictq.PEpisodeLeistung PEL ON(PE.IDPATIENTEPISODE = PEL.IDPATIENTEPISODE)
WHERE
PE.IDPATIENTKLINIK = P.IDPATIENTKLINIK and
PEL.Datum between dateadd(month, -12, #startdatum) and #startdatum and
PEL.kuerzel in (SELECT kl.kuerzel from ictq.KLE kl where FachgruppeKuerzel=526)
I have no control over the structure and cannot add anything. The query in itself is legacy work and I'm happy it works as it is now. The slow computation, however, needs an overhaul.
It is often more efficient to use exists rather than in, or to move the in subquery to the from clause:
FROM PEL JOIN
(SELECT DISTINCT KL.kuerzel
FROM ictq.KLE KL
WHERE FachgruppeKuerzel = 526
) KL
ON PEL.kuerzel = KL.kuerzel;
For performance on this query, you want indexes on ictq.KLE(FachgruppeKuerzel, kuerzel) and PEL(kuerzel).
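If the goal is to materialize the list once, as the question asks, a temporary table is the usual Sybase idiom. A minimal sketch, assuming SELECT INTO is permitted on the server (#kuerzel and the index name are example names):
SELECT kl.kuerzel
INTO #kuerzel
FROM ictq.KLE kl
WHERE kl.FachgruppeKuerzel = 526

CREATE INDEX ix_kuerzel ON #kuerzel (kuerzel)
The large query can then filter with PEL.kuerzel in (SELECT kuerzel FROM #kuerzel), so the lookup against ictq.KLE runs once per session instead of once per row.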
I'm trying to avoid using straight up SQL in my Rails app, but need to do a quite large version of this:
SELECT ds.product_id,
( SELECT SUM(units) FROM daily_sales WHERE (date BETWEEN '2015-01-01' AND '2015-01-08') AND service_type = 1 ) as wk1,
( SELECT SUM(units) FROM daily_sales WHERE (date BETWEEN '2015-01-09' AND '2015-01-16') AND service_type = 1 ) as wk2
FROM daily_sales as ds group by ds.product_id
I'm sure it can be done, but I'm struggling to write this as an Active Record statement. Can anyone help?
If you must do this in a single query, you'll need to write some SQL for the CASE statements. The following is what you need:
ranges = [ # ordered array of all your date-ranges
Date.new(2015, 1, 1)..Date.new(2015, 1, 8),
Date.new(2015, 1, 9)..Date.new(2015, 1, 16)
]
overall_range = (ranges.first.min)..(ranges.last.max)
grouping_sub_str = \
ranges.map.with_index(1) do |range, i|
"WHEN (date BETWEEN '#{range.min}' AND '#{range.max}') THEN 'week#{i}'"
end.join(' ')
grouping_condition = "CASE #{grouping_sub_str} END"
grouping_columns = ['product_id', grouping_condition]
DailySale.where(date: overall_range).group(grouping_columns).sum(:units)
That will produce a hash with array keys and numeric values. A key will be of the form [product_id, 'week1'] and the value will be the corresponding sum of units for that week.
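For reference, the SQL this produces looks roughly like the following (a sketch; exact quoting depends on the database adapter):
SELECT product_id,
       CASE WHEN (date BETWEEN '2015-01-01' AND '2015-01-08') THEN 'week1'
            WHEN (date BETWEEN '2015-01-09' AND '2015-01-16') THEN 'week2' END,
       SUM(units)
FROM daily_sales
WHERE date BETWEEN '2015-01-01' AND '2015-01-16'
GROUP BY product_id,
         CASE WHEN (date BETWEEN '2015-01-01' AND '2015-01-08') THEN 'week1'
              WHEN (date BETWEEN '2015-01-09' AND '2015-01-16') THEN 'week2' END;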
Simplify your SQL to the following and try converting it:
SELECT ds.product_id
, SUM(CASE WHEN date BETWEEN '2015-01-01' AND '2015-01-08' AND service_type = 1
THEN units
END) WK1
, SUM(CASE WHEN date BETWEEN '2015-01-09' AND '2015-01-16' AND service_type = 1
THEN units
END) WK2
FROM daily_sales as ds
group by ds.product_id
Every Rails developer sooner or later hits a wall with the Active Record query interface, only to find the solution in Arel.
Arel gives you the flexibility you need to build this query without loops and the like. I'm not going to give runnable code, just some hints on how to do it yourself:
1. We are going to use Arel tables to create our query. For a model called, for example, Product, getting the Arel table is as easy as products = Product.arel_table
2. Getting the sum of a column looks like daily_sales.project(daily_sales[:units].sum).where(daily_sales[:date].gt(BEGIN_DATE)).where(daily_sales[:date].lt(END_DATE)). You can chain as many wheres as you want and they will be translated into SQL ANDs.
3. Since we need multiple sums in the end result, you need to make use of Common Table Expressions (CTEs). Take a look at the docs and this answer for more info on this.
4. Use those CTEs from step 3 in combination with group and you are done! (The rough shape of the resulting SQL is sketched below.)
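For orientation, the SQL such an Arel construction ultimately builds toward looks roughly like this (a sketch of the shape only, not Arel's exact output; table and column names are taken from the question):
WITH wk1 AS (
    SELECT product_id, SUM(units) AS units
    FROM daily_sales
    WHERE date BETWEEN '2015-01-01' AND '2015-01-08' AND service_type = 1
    GROUP BY product_id
), wk2 AS (
    SELECT product_id, SUM(units) AS units
    FROM daily_sales
    WHERE date BETWEEN '2015-01-09' AND '2015-01-16' AND service_type = 1
    GROUP BY product_id
)
SELECT ds.product_id, wk1.units AS wk1, wk2.units AS wk2
FROM (SELECT DISTINCT product_id FROM daily_sales) ds
LEFT JOIN wk1 ON wk1.product_id = ds.product_id
LEFT JOIN wk2 ON wk2.product_id = ds.product_id;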
So here is my project, using MS Access 2010.
I have developed 2 queries to select 2 different reading periods, called CycleStart and CycleEnd. When I run these 2 queries individually, I get the expected output. The 2 queries pull data from tables with a couple of lookup fields in them; the lookup fields use other tables that have only 2 columns. The next step uses SQL to create a UNION ALL query that brings the 2 cycle queries together for reporting purposes. The problem I run into is that the resulting union query does not output the same information as the 2 individual cycle queries.
Now the specific issues. My cycle queries have a couple of lookup fields referencing another table. For example, the Read_Cycle field comes from a table (Read_Cycles) that has only 2 columns: the unique identifier assigned by Access and the Cycle_ID column with the data I enter. When I run the cycle queries, the Read_Cycle field returns the Read_Cycle data as expected, but the union query does not. So here is some structure of my project:
Read_Cycles Table

ID | Cycle_ID
---|---------
 1 | Spring
 2 | Fall
 3 | Winter
The data tables behind CycleStart and CycleEnd have fields that are lookup values referencing the Read_Cycles table described above.
Queries CycleStart and CycleEnd correctly return Spring, Fall, or Winter, whichever value is associated with the record.
However, the UNION SQL query returns the ID instead of the value, so instead of getting Fall, I get 2.
Here is my UNION ALL SQL:
SELECT "CycleEnd" AS source,
[CycleEnd].[Recloser_SerialNo],
[CycleEnd].[Read_Date],
[CycleEnd].[3_Phase_Reading],
[CycleEnd].[A_Phase_Reading],
[CycleEnd].[B_Phase_Reading],
[CycleEnd].[C_Phase_Reading],
[CycleEnd].[Read_Cycle],
[CycleEnd].[PoleNo],
[CycleEnd].[Substation],
[CycleEnd].[Feeder],
[CycleEnd].[Feeder_Description],
[CycleEnd].[Recloser_Location]
FROM [CycleEnd]
UNION ALL
SELECT "CycleStart" AS source,
[CycleStart].[Recloser_SerialNo],
[CycleStart].[Read_Date],
[CycleStart].[3_Phase_Reading] * - 1,
[CycleStart].[A_Phase_Reading] * - 1,
[CycleStart].[B_Phase_Reading] * - 1,
[CycleStart].[C_Phase_Reading] * - 1,
[CycleStart].[Read_Cycle],
[CycleStart].[PoleNo],
[CycleStart].[Substation],
[CycleStart].[Feeder],
[CycleStart].[Feeder_Description],
[CycleStart].[Recloser_Location]
FROM [CycleStart];
All other fields are coming across just fine and as expected, I have narrowed it down to only fields that are a lookup in the original tables.
Any help would be greatly appreciated. Also my SQL experience is really limited so example code would help greatly.
UPDATE:
Here is the SQL from CycleEnd that works. I got it by building the query and then switching to SQL view:
SELECT Recloser_Readings.Recloser_SerialNo,
Recloser_Readings.Read_Date,
Recloser_Readings.[3_Phase_Reading],
Recloser_Readings.A_Phase_Reading,
Recloser_Readings.B_Phase_Reading,
Recloser_Readings.C_Phase_Reading,
Recloser_Locations.PoleNo,
Recloser_Locations.Substation,
Recloser_Locations.Feeder,
Recloser_Locations.Feeder_Description,
Recloser_Locations.Recloser_Location,
Recloser_Readings.Read_Cycle
FROM (
Recloser_Inventory LEFT JOIN Recloser_Locations
ON Recloser_Inventory.PoleNo = Recloser_Locations.PoleNo
)
RIGHT JOIN Recloser_Readings
ON Recloser_Inventory.Serial_No = Recloser_Readings.Recloser_SerialNo
WHERE (((Recloser_Readings.Read_Cycle) = "8"));
UPDATE #2:
I noticed I grabbed the wrong code referencing the Read_Cycles table. Here it is:
SELECT Read_Cycles.Cycle_ID, Read_Cycles.ID
FROM Read_Cycles
ORDER BY Read_Cycles.Cycle_ID DESC;
UPDATE: I get a syntax error from the following code:
SELECT "CycleEnd" as source,
[CycleEnd].[Recloser_SerialNo],
[CycleEnd].[Read_Date],
[CycleEnd].[3_Phase_Reading],
[CycleEnd].[A_Phase_Reading],
[CycleEnd].[B_Phase_Reading],
[CycleEnd].[C_Phase_Reading],
[CycleEnd].[Read_Cycle],
[CycleEnd].[PoleNo],
[CycleEnd].[Substation],
[CycleEnd].[Feeder],
[CycleEnd].[Feeder_Description],
[CycleEnd].[Recloser_Location]
FROM [CycleEnd] JOIN [Read_Cycles] ON [CycleEnd].[Read_Cycle] = [Read_Cycles].[ID]
UNION ALL SELECT "CycleStart" as source,
[CycleStart].[Recloser_SerialNo],
[CycleStart].[Read_Date],
[CycleStart].[3_Phase_Reading]*-1,
[CycleStart].[A_Phase_Reading]*-1,
[CycleStart].[B_Phase_Reading]*-1,
[CycleStart].[C_Phase_Reading]*-1,
[CycleStart].[Read_Cycle],
[CycleStart].[PoleNo],
[CycleStart].[Substation],
[CycleStart].[Feeder],
[CycleStart].[Feeder_Description],
[CycleStart].[Recloser_Location]
FROM [CycleStart] JOIN [Read_Cycles] ON [CycleStart].[Read_Cycle] = [Read_Cycles].[ID];
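Access SQL does not accept a bare JOIN; the join type has to be spelled out, which is the likely cause of the syntax error. A sketch of the corrected query follows. It also selects the looked-up Cycle_ID text instead of the stored ID ([Read_Cycle_Value] is just an example alias, and this assumes the value stored behind the lookup is the Read_Cycles ID, which is what the union has been showing):
SELECT "CycleEnd" AS source,
    [CycleEnd].[Recloser_SerialNo],
    [CycleEnd].[Read_Date],
    [CycleEnd].[3_Phase_Reading],
    [CycleEnd].[A_Phase_Reading],
    [CycleEnd].[B_Phase_Reading],
    [CycleEnd].[C_Phase_Reading],
    [Read_Cycles].[Cycle_ID] AS [Read_Cycle_Value],
    [CycleEnd].[PoleNo],
    [CycleEnd].[Substation],
    [CycleEnd].[Feeder],
    [CycleEnd].[Feeder_Description],
    [CycleEnd].[Recloser_Location]
FROM [CycleEnd] INNER JOIN [Read_Cycles]
    ON [CycleEnd].[Read_Cycle] = [Read_Cycles].[ID]
UNION ALL
SELECT "CycleStart" AS source,
    [CycleStart].[Recloser_SerialNo],
    [CycleStart].[Read_Date],
    [CycleStart].[3_Phase_Reading] * -1,
    [CycleStart].[A_Phase_Reading] * -1,
    [CycleStart].[B_Phase_Reading] * -1,
    [CycleStart].[C_Phase_Reading] * -1,
    [Read_Cycles].[Cycle_ID] AS [Read_Cycle_Value],
    [CycleStart].[PoleNo],
    [CycleStart].[Substation],
    [CycleStart].[Feeder],
    [CycleStart].[Feeder_Description],
    [CycleStart].[Recloser_Location]
FROM [CycleStart] INNER JOIN [Read_Cycles]
    ON [CycleStart].[Read_Cycle] = [Read_Cycles].[ID];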
I'm trying to modify SQL I created in Microsoft Query so that it returns only the most recent production test conducted on a well. Right now the query below returns a list of tests for a given well from [Date Parameter = 01/01/08] to now for [Well Parameter = Well#1]. To get all the data I need, I join two tables on their shared unique well #: one with well names and one with production data.
SELECT
P.TEST_DT, P.BOPD, P.BWPD, P.MCFD
FROM
WELL_TABLE C, PRODTEST_TABLE P
WHERE
C.UNIQUE_WELL_ID = P.UNIQUE_WELL_ID
AND ((C.WELL_NAME=?)
AND (P.TEST_DT>=?))
ORDER BY P.TEST_DT DESC
Right now my output looks like this:
TEST_DT    BOPD  BWPD  MCFD
9/23/2012    23   125     0
8/23/2010    21   137     0
7/15/2009    29   123     0
I would like to just return the most recent test:
TEST_DT    BOPD  BWPD  MCFD
9/23/2012    23   125     0
Any help would be greatly appreciated...
I've tried working with max(TEST_DT) but have been unsuccessful.
SELECT tab.*
FROM (SELECT P.TEST_DT, P.BOPD, P.BWPD, P.MCFD
FROM WELL_TABLE C, PRODTEST_TABLE P
WHERE C.UNIQUE_WELL_ID = P.UNIQUE_WELL_ID
AND C.WELL_NAME=?
ORDER BY P.TEST_DT DESC) tab
WHERE ROWNUM=1;
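Note that ROWNUM is Oracle-specific. If the database behind Microsoft Query is SQL Server or Access instead, a TOP 1 variant is the equivalent sketch (keeping both parameters from the original query):
SELECT TOP 1 P.TEST_DT, P.BOPD, P.BWPD, P.MCFD
FROM WELL_TABLE C, PRODTEST_TABLE P
WHERE C.UNIQUE_WELL_ID = P.UNIQUE_WELL_ID
  AND C.WELL_NAME = ?
  AND P.TEST_DT >= ?
ORDER BY P.TEST_DT DESC;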