We have a lot of star schemas in our data warehouse. I thought I can create views in order to simplify SQL analysis of data.
Example SQL for profit&loss star:
select
month_number,
sum(amount)
from
bizdata.dw_daily_pl_fact dwdpf
join bizdata.dw_distance dwdis on (dwdis.distance_key= dwdpf.distance_key)
join bizdata.dw_ledger_account dwled on (dwled.ledger_account_key= dwdpf.ledger_account_key)
join bizdata.dw_party dwpar on (dwpar.party_key= dwdpf.company_key)
join bizdata.dw_party dwpar2 on (dwpar2.party_key= dwdpf.supplier_key)
join bizdata.dw_budget_code dwbud on (dwbud.budget_code_key= dwdpf.budget_code_key)
join bizdata.dw_time dwtim on (dwtim.time_key= dwdpf.time_key)
join bizdata.dw_project dwpro on (dwpro.project_key= dwdpf.project_key)
where
year_number = 2012
and budget_code = 'SALARIES'
group by
month_number
(There are approx 200 columns and 100k rows in this star)
If I have a view:
create or replace view bizdata.dwv_pl_fact as (
select
*
from
bizdata.dw_daily_pl_fact dwdpf
join bizdata.dw_distance dwdis on (dwdis.distance_key= dwdpf.distance_key)
join bizdata.dw_ledger_account dwled on (dwled.ledger_account_key= dwdpf.ledger_account_key)
join bizdata.dw_party dwpar on (dwpar.party_key= dwdpf.company_key)
join bizdata.dw_party dwpar2 on (dwpar2.party_key= dwdpf.supplier_key)
join bizdata.dw_budget_code dwbud on (dwbud.budget_code_key= dwdpf.budget_code_key)
join bizdata.dw_time dwtim on (dwtim.time_key= dwdpf.time_key)
join bizdata.dw_project dwpro on (dwpro.project_key= dwdpf.project_key)
);
I can simplify the statement to the following:
select
month_number,
sum(amount)
from
bizdata.dwv_pl_fact
where
year_number = 2012
and budget_code = 'SALARIES'
group by
month_number
My questions is - Are there any performance or other issues with such approach?
A view in PostgreSQL is just a query rewriting mechanism. So you can basically assume your user-supplied criteria get merged into the view's definition and the resulting query gets run.
Since 9.0 it the planner should even notice some joins in the resulting query are unnecessary and skip them. That seems particularly useful in your case.
However, it is possible that some criteria may not be pushed "inside" clauses in the view definition - these would be the same as you might see with a sub-query though. For example, a subquery with order-by + limit can present a boundary the planner can't see through.
HTH
You haven't mentioned about the environment. In a generic model, your approach seems to be OK but you missed one great point. Keep in mind you are bringing all columns in that view. if there are 100s of column and there is a row chain, it will be nightmare. So rewrite the query and build the view(dwv_plk_fact) like following and you should be ok.
create or replace view bizdata.dwv_pl_fact as (
select
<table_name>.month_number,
<table_name>.amount
from
bizdata.dw_daily_pl_fact dwdpf
join bizdata.dw_distance dwdis on (dwdis.distance_key= dwdpf.distance_key)
join bizdata.dw_ledger_account dwled on (dwled.ledger_account_key= dwdpf.ledger_account_key)
join bizdata.dw_party dwpar on (dwpar.party_key= dwdpf.company_key)
join bizdata.dw_party dwpar2 on (dwpar2.party_key= dwdpf.supplier_key)
join bizdata.dw_budget_code dwbud on (dwbud.budget_code_key= dwdpf.budget_code_key)
join bizdata.dw_time dwtim on (dwtim.time_key= dwdpf.time_key)
join bizdata.dw_project dwpro on (dwpro.project_key= dwdpf.project_key)
);
Related
I have written this SQL query to fetch the data from greenplum datalake. The primary table has hardy 800,000ish rows which I am joining with other table. The below query is taking insane amount of time to give result. What might be the possible reason for the longer query time? How to resolve it?
select
a.pole,
t.country_name,
a.service_area,
a.park_name,
t.turbine_platform_name,
a.turbine_subtype,
a.pad as "turbine_name",
t.system_number as "turbine_id",
a.customer,
a.service_contract,
a.component,
c.vendor_mfg as "component_manufacturer",
a.case_number,
a.description as "case_description",
a.rmd_diagnosis as "case_rmd_diagnostic_description",
a.priority as "case_priority",
a.status as "case_status",
a.actual_rootcause as "case_actual_rootcause",
a.site_trends_feedback as "case_site_feedback",
a.added as "date_case_added",
a.start as "date_case_started",
a.last_flagged as "date_case_flagged_by_algorithm_latest",
a.communicated as "date_case_communicated_to_field",
a.field_visible_date as "date_case_field_visbile_date",
a.fixed as "date_anamoly_fixed",
a.expected_clse as "date_expected_closure",
a.request_closure_date as "date_case_request_closure",
a.validation_date as "date_case_closure",
a.production_related,
a.estimated_value as "estimated_cost_avoidance",
a.cms,
a.anomaly_category,
a.additional_information as "case_additional_information",
a.model,
a.full_model,
a.sent_to_field as "case_sent_to_field"
from app_pul.anomaly_stage a
left join ge_cfg.turbine_detail t on a.scada_number = t.system_number and a.added > '2017-12-31'
left join tbwgr_v.pmt_wmf_tur_component_master_t c on a.component = c.component_name
Your query is basically:
select . . .
from app_pul.anomaly_stage a left join
ge_cfg.turbine_detail t
on a.scada_number = t.system_number and
a.added > '2017-12-31' left join
tbwgr_v.pmt_wmf_tur_component_master_t c
on a.component = c.component_name
First, the condition on a is ignored, because it is the first table in the left join and is the on clause. So, I assume you actually intend for it to filter, so write the query as:
select . . .
from app_pul.anomaly_stage a left join
ge_cfg.turbine_detail t
on a.scada_number = t.system_number left join
tbwgr_v.pmt_wmf_tur_component_master_t c
on a.component = c.component_name
where a.added > '2017-12-31'
That might help with performance. Then in Postgres, you would want indexes on turbine_detail(system_number) and pmt_wmf_tur_component_master_t(component_name). It is doubtful that an index would help on the first table, because you are already selecting a large amount of data.
I'm not sure if indexes would be appropriate in Greenplum.
Verify if the joins are using respective primary and foreign keys.
Try to execute the query removing one left join after the other, so you see the focus the problem.
Try using the plan execution.
I am fairly new in Access and SQL programming. I am trying to do the following:
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
and group by year even when there is no amount in some of the years. I would like to have these years listed as well for a report with charts. I'm not certain if this is possible, but every bit of help is appreciated.
My code so far is as follows:
SELECT
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
FROM
Base_CustomerT
INNER JOIN (
SO_SalesOrderPaymentHistoryLineT
INNER JOIN SO_SalesOrderT
ON SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId
) ON Base_CustomerT.CustomerId = SO_SalesOrderT.CustomerId
GROUP BY
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
SO_SalesOrderPaymentHistoryLineT.PaymentType,
Base_CustomerT.IsActive
HAVING
(((SO_SalesOrderPaymentHistoryLineT.PaymentType)=1)
AND ((Base_CustomerT.IsActive)=Yes))
ORDER BY
Base_CustomerT.SalesRep,
Base_CustomerT.Customer;
You need another table with all years listed -- you can create this on the fly or have one in the db... join from that. So if you had a table called alltheyears with a column called y that just listed the years then you could use code like this:
WITH minmax as
(
select min(year(SO_SalesOrderPaymentHistoryLineT.DatePaid) as minyear,
max(year(SO_SalesOrderPaymentHistoryLineT.DatePaid) as maxyear)
from SalesOrderPaymentHistoryLineT
), yearsused as
(
select y
from alltheyears, minmax
where alltheyears.y >= minyear and alltheyears.y <= maxyear
)
select *
from yearsused
join ( -- your query above goes here! -- ) T
ON year(T.SO_SalesOrderPaymentHistoryLineT.DatePaid) = yearsused.y
You need a data source that will provide the year numbers. You cannot manufacture them out of thin air. Supposing you had a table Interesting_year with a single column year, populated, say, with every distinct integer between 2000 and 2050, you could do something like this:
SELECT
base.SalesRep,
base.CustomerId,
base.Customer,
base.year,
Sum(NZ(data.Amount)) AS [Sum Of PaymentPerYear]
FROM
(SELECT * FROM Base_CustomerT INNER JOIN Year) AS base
LEFT JOIN
(SELECT * FROM
SO_SalesOrderT
INNER JOIN SO_SalesOrderPaymentHistoryLineT
ON (SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId)
) AS data
ON ((base.CustomerId = data.CustomerId)
AND (base.year = Year(data.DatePaid))),
WHERE
(data.PaymentType = 1)
AND (base.IsActive = Yes)
AND (base.year BETWEEN
(SELECT Min(year(DatePaid) FROM SO_SalesOrderPaymentHistoryLineT)
AND (SELECT Max(year(DatePaid) FROM SO_SalesOrderPaymentHistoryLineT))
GROUP BY
base.SalesRep,
base.CustomerId,
base.Customer,
base.year,
ORDER BY
base.SalesRep,
base.Customer;
Note the following:
The revised query first forms the Cartesian product of BaseCustomerT with Interesting_year in order to have base customer data associated with each year (this is sometimes called a CROSS JOIN, but it's the same thing as an INNER JOIN with no join predicate, which is what Access requires)
In order to have result rows for years with no payments, you must perform an outer join (in this case a LEFT JOIN). Where a (base customer, year) combination has no associated orders, the rest of the columns of the join result will be NULL.
I'm selecting the CustomerId from Base_CustomerT because you would sometimes get a NULL if you selected from SO_SalesOrderT as in the starting query
I'm using the Access Nz() function to convert NULL payment amounts to 0 (from rows corresponding to years with no payments)
I converted your HAVING clause to a WHERE clause. That's semantically equivalent in this particular case, and it will be more efficient because the WHERE filter is applied before groups are formed, and because it allows some columns to be omitted from the GROUP BY clause.
Following Hogan's example, I filter out data for years outside the overall range covered by your data. Alternatively, you could achieve the same effect without that filter condition and its subqueries by ensuring that table Intersting_year contains only the year numbers for which you want results.
Update: modified the query to a different, but logically equivalent "something like this" that I hope Access will like better. Aside from adding a bunch of parentheses, the main difference is making both the left and the right operand of the LEFT JOIN into a subquery. That's consistent with the consensus recommendation for resolving Access "ambiguous outer join" errors.
Thank you John for your help. I found a solution which works for me. It looks quiet different but I learned a lot out of it. If you are interested here is how it looks now.
SELECT DISTINCTROW
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
FROM
Base_Customer_RevenueYearQ
LEFT JOIN CustomerPaymentPerYearQ
ON (Base_Customer_RevenueYearQ.RevenueYear = CustomerPaymentPerYearQ.[RevenueYear])
AND (Base_Customer_RevenueYearQ.CustomerId = CustomerPaymentPerYearQ.CustomerId)
GROUP BY
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
;
I am struggling with combining the below Select Statments, I know I could cheat and add some fake columns in and then use Union, but I want to do this correctly.
Once I have them joined, I will be putting the Statment in to a XML file for use with Word and CRM4.
SELECT BILLTO_NAME,
BILLTO_LINE1,
BILLTO_LINE2,
BILLTO_LINE3,
BILLTO_CITY,
BILLTO_COUNTRY,
BILLTO_POSTALCODE,
ORDERNUMBER,
REQUESTDELIVERYBY,
MODIFIEDON,
SHIPTO_NAME,
SHIPTO_LINE1,
SHIPTO_LINE2,
SHIPTO_LINE3,
SHIPTO_CITY,
SHIPTO_STATEORPROVINCE,
SHIPTO_COUNTRY,
SHIPTO_POSTALCODE,
CREATEDBY
FROM SALESORDERBASE
SELECT QUANTITY,
DESCRIPTION
FROM SALESORDERDETAILBASE
SELECT NEW_ORDERNOTES,
NEW_NOTES
FROM SALESORDEREXTENSIONBASE
They all have the common column of SalesOrderID, which I need to add in somewhere as well.
You can use a LEFT JOIN on the tables:
SELECT ob.SalesOrderID
ob.BILLTO_NAME,
ob.BILLTO_LINE1,
ob.BILLTO_LINE2,
ob.BILLTO_LINE3,
ob.BILLTO_CITY,
ob.BILLTO_COUNTRY,
ob.BILLTO_POSTALCODE,
ob.ORDERNUMBER,
ob.REQUESTDELIVERYBY,
ob.MODIFIEDON,
ob.SHIPTO_NAME,
ob.SHIPTO_LINE1,
ob.SHIPTO_LINE2,
ob.SHIPTO_LINE3,
ob.SHIPTO_CITY,
ob.SHIPTO_STATEORPROVINCE,
ob.SHIPTO_COUNTRY,
ob.SHIPTO_POSTALCODE,
ob.CREATEDBY,
od.QUANTITY,
od.DESCRIPTION,
oe.NEW_ORDERNOTES,
oe.NEW_NOTES
FROM SALESORDERBASE ob
LEFT JOIN SALESORDERDETAILBASE od
on ob.SalesOrderID = od.SalesOrderID
LEFT JOIN SALESORDEREXTENSIONBASE oe
on ob.SalesOrderID = oe.SalesOrderID
Assuming the column that identifies the relationship is called id on all three tables, you can do this:
SELECT sob.BILLTO_NAME,
sob.BILLTO_LINE1,
sob.BILLTO_LINE2,
sob.BILLTO_LINE3,
sob.BILLTO_CITY,
sob.BILLTO_COUNTRY,
sob.BILLTO_POSTALCODE,
sob.ORDERNUMBER,
sob.REQUESTDELIVERYBY,
sob.MODIFIEDON,
sob.SHIPTO_NAME,
sob.SHIPTO_LINE1,
sob.SHIPTO_LINE2,
sob.SHIPTO_LINE3,
sob.SHIPTO_CITY,
sob.SHIPTO_STATEORPROVINCE,
sob.SHIPTO_COUNTRY,
sob.SHIPTO_POSTALCODE,
sob.CREATEDBY,
sodb.QUANTITY,
sodb.DESCRIPTION,
soeb.NEW_ORDERNOTES,
soeb.NEW_NOTES
From SalesOrderBase sob
JOIN SalesOrderDetailBase sodb
ON sob.id = sodb.SalesOrderID
JOIN SalesOrderExtensionBase soeb
ON sob.id = soeb.SalesOrderID
You can think of JOINing as slamming together rows side-by-side, whereas UNIONing is slamming together rows one on top of the other. UNIONS require that the columns be the same and JOINs require that there is a relationship of some kind between each row.
EDIT - The OP provided more details
I am trying to write a SQL Statement for Interbase.
Whats wrong with this SQL?
md_master (trm) = Master Table
cd_Med (cdt) = Detail table
SELECT trm.seq_no, trm.recipient_id, trm.payee_fullname, trm.payee_address1, trm.payee_address2, trm.payee_address3, trm.payee_address_city, trm.payee_address_state, trm.recip_zip, trm.recip_zip_4, trm.recip_zip_4_2, trm.check_no, trm.check_date, trm.check_amount,
cdt.com_ss_source_sys, cdt.cd_pay_date, cdt.com_set_amount,
bnk.name, bnk.address, bnk.transit_routing,
act.acct_no
FROM md_master trm, cd_med cdt, accounts act, banks bnk
join cd_med on cdt.master_id = trm.id
join accounts on act.acct_id = trm.account_tag
join banks on bnk.bank_id = act.bank_id
ORDER BY cdt.master_id
I don't get an error, the computer just keeps crunching away and hangs.
I don't know about Interbase specifically, but that FROM clause seems a little strange (perhaps just some syntax I'm not familiar with though). Does this help?
...
FROM md_master trm
join cd_med cdt on cdt.master_id = trm.id
join accounts act on act.acct_id = trm.account_tag
join banks bnk on bnk.bank_id = act.bank_id
By the way, you have no WHERE clause so if any of these tables is large, I wouldn't be overly surprised that it takes a long time to run.
You have been bitten by an anti-pattern called implicit join syntax
SELECT * FROM table_with_a_1000rows, othertable_with_a_1000rows
Will do a cross-join on both tables selecting 1 million rows in the output.
You are doing:
FROM md_master trm, cd_med cdt, accounts act, banks bnk
A cross join on 4 tables (combined with normal joins afterwards), which could easily generate many billions of rows.
No wonder interbase hangs; it is working until the end of time to generate more rows then there are atoms in the universe.
The solution
Never use , after the FROM clause, that is an implicit join and it is evil.
Only use explicit joins, like so:
SELECT
trm.seq_no, trm.recipient_id, trm.payee_fullname, trm.payee_address1
, trm.payee_address2, trm.payee_address3, trm.payee_address_city
, trm.payee_address_state, trm.recip_zip, trm.recip_zip_4, trm.recip_zip_4_2
, trm.check_no, trm.check_date, trm.check_amount
, cdt.com_ss_source_sys, cdt.cd_pay_date, cdt.com_set_amount
, bnk.name, bnk.address, bnk.transit_routing
, act.acct_no
FROM md_master trm
join cd_med on cdt.master_id = trm.id
join accounts on act.acct_id = trm.account_tag
join banks on bnk.bank_id = act.bank_id
ORDER BY cdt.master_id
The error lie in the from clause. You are using half with comma separated tables without a relation in where clause and half with joins.
Just use joins and all should work fine
Warning: Here be beginner SQL! Be gentle...
I have two queries that independently give me what I want from the relevant tables in a reasonably timely fashion, but when I try to combine the two in a (fugly) union, things quickly fall to bits and the query either gives me duplicate records, takes an inordinately long time to run, or refuses to run at all quoting various syntax errors at me.
Note: I had to create a 'dummy' table (tblAllDates) with a single field containing dates from 1 Jan 2008 as I need the query to return a single record from each day, and there are days in both tables that have no data. This is the only way I could figure to do this, no doubt there is a smarter way...
Here are the queries:
SELECT tblAllDates.date, SUM(tblvolumedata.STT)
FROM tblvolumedata RIGHT JOIN tblAllDates ON tblvolumedata.date=tblAllDates.date
GROUP BY tblAllDates.date;
SELECT tblAllDates.date, SUM(NZ(tblTimesheetData.batching)+NZ(tblTimesheetData.categorisation)+NZ(tblTimesheetData.CDT)+NZ(tblTimesheetData.CSI)+NZ(tblTimesheetData.destruction)+NZ(tblTimesheetData.extraction)+NZ(tblTimesheetData.indexing)+NZ(tblTimesheetData.mail)+NZ(tblTimesheetData.newlodgement)+NZ(tblTimesheetData.recordedDeliveries)+NZ(tblTimesheetData.retrieval)+NZ(tblTimesheetData.scanning)) AS VA
FROM tblTimesheetData RIGHT JOIN tblAllDates ON tblTimesheetData.date=tblAllDates.date
GROUP BY tblAllDates.date;
The best result I have managed is the following:
SELECT tblAllDates.date, 0 AS STT, SUM(NZ(tblTimesheetData.batching)+NZ(tblTimesheetData.categorisation)+NZ(tblTimesheetData.CDT)+NZ(tblTimesheetData.CSI)+NZ(tblTimesheetData.destruction)+NZ(tblTimesheetData.extraction)+NZ(tblTimesheetData.indexing)+NZ(tblTimesheetData.mail)+NZ(tblTimesheetData.newlodgement)+NZ(tblTimesheetData.recordedDeliveries)+NZ(tblTimesheetData.retrieval)+NZ(tblTimesheetData.scanning)) AS VA
FROM tblTimesheetData RIGHT JOIN tblAllDates ON tblTimesheetData.date=tblAllDates.date
GROUP BY tblAllDates.date
UNION SELECT tblAllDates.date, SUM(tblvolumedata.STT) AS STT, 0 AS VA
FROM tblvolumedata RIGHT JOIN tblAllDates ON tblvolumedata.date=tblAllDates.date
GROUP BY tblAllDates.date;
This gives me the VA and STT data I want, but in two records where I have data from both in a single day, like this:
date STT VA
28/07/2008 0 54020
28/07/2008 33812 0
29/07/2008 0 53890
29/07/2008 33289 0
30/07/2008 0 51780
30/07/2008 30456 0
31/07/2008 0 52790
31/07/2008 31305 0
What I'm after is the STT and VA data in single row per day. How might this be achieved, and how far am I away from a query that could be considered optimal? (don't laugh, I only seek to learn!)
You could put all of that into one query like so
SELECT
dates.date,
SUM(volume.STT) AS STT,
SUM(NZ(timesheet.batching)+NZ(timesheet.categorisation)+NZ(timesheet.CDT)+NZ(timesheet.CSI)+NZ(timesheet.destruction)+NZ(timesheet.extraction)+NZ(timesheet.indexing)+NZ(timesheet.mail)+NZ(timesheet.newlodgement)+NZ(timesheet.recordedDeliveries)+NZ(timesheet.retrieval)+NZ(timesheet.scanning)) AS VA
FROM
tblAllDates dates
LEFT JOIN tblvolumedata volume
ON dates.date = volume.date
LEFT JOIN tblTimesheetData timesheet
ON
dates.date timesheet.date
GROUP BY dates.date;
I've put the dates table first in the FROM clause and then LEFT JOINed the two other tables.
The jet database can be funny with more than one join in a query, so you may need to wrap one of the joins in parentheses (I believe this is referred to as Bill's SQL!) - I would recommend LEFT JOINing the tables in the query builder and then taking the SQL code view and modifying that to add in the SUMs, GROUP BY, etc.
EDIT:
Ensure that the date field in each table is indexed as you're joining each table on this field.
EDIT 2:
How about this -
SELECT date,
Sum(STT),
Sum(VA)
FROM
(SELECT dates.date, 0 AS STT, SUM(NZ(tblTimesheetData.batching)+NZ(tblTimesheetData.categorisation)+NZ(tblTimesheetData.CDT)+NZ(tblTimesheetData.CSI)+NZ(tblTimesheetData.destruction)+NZ(tblTimesheetData.extraction)+NZ(tblTimesheetData.indexing)+NZ(tblTimesheetData.mail)+NZ(tblTimesheetData.newlodgement)+NZ(tblTimesheetData.recordedDeliveries)+NZ(tblTimesheetData.retrieval)+NZ(tblTimesheetData.scanning)) AS VA
FROM tblTimesheetData RIGHT JOIN dates ON tblTimesheetData.date=dates.date
GROUP BY dates.date
UNION SELECT dates.date, SUM(tblvolumedata.STT) AS STT, 0 AS VA
FROM tblvolumedata RIGHT JOIN dates ON tblvolumedata.date=dates.date
GROUP BY dates.date
)
GROUP BY date;
Interestingly, When I ran my first statement against some test data, the figures for STT and VA had all been multiplied by 4, compared to the second statement. Very strange behaviour and certainly not what I expected.
The table of dates is the best way.
Combine the joins in there FROM clause. Something like this....
SELECT d.date,
a.value,
b.value
FROM tableOfDates d
RIGHT JOIN firstTable a
ON d.date = a.date
RIGHT JOIN secondTable b
ON d.date = b.date
Turn the SQL into views and join them on the dates.