I need to improve the performance of my view. Right now the SQL that defines the view is:
select tr.account_number , tr.actual_collection_trx_date ,s.customer_key
from fct_collections_trx tr,
stg_scd_customers_key s
where tr.account_number = s.account_number
and trunc(tr.actual_collection_trx_date) between s.start_date and s.end_date;
Table fct_collections_trx has roughly 170k records (the count changes every day).
Table stg_scd_customers_key has 430mil records.
Table fct_collections_trx has the following indexes: a single unique composite index on (ACCOUNT_NUMBER, SUB_ACCOUNT_NUMBER, ACTUAL_COLLECTION_TRX_DATE, COLLECTION_TRX_DATE, COLLECTION_ACTION_CODE) and a normal index on ENTRY_SCHEMA_DATE. DDL:
alter table stg_admin.FCT_COLLECTIONS_TRX
add primary key (ACCOUNT_NUMBER, SUB_ACCOUNT_NUMBER, ACTUAL_COLLECTION_TRX_DATE, COLLECTION_TRX_DATE, COLLECTION_ACTION_CODE)
using index
tablespace STG_COLLECTION_DATA
pctfree 10
initrans 2
maxtrans 255
storage
(
initial 80K
next 1M
minextents 1
maxextents unlimited
);
Table structure:
create table stg_admin.FCT_COLLECTIONS_TRX
(
account_number NUMBER(10) not null,
sub_account_number NUMBER(5) not null,
actual_collection_trx_date DATE not null,
customer_key NUMBER(10),
sub_account_key NUMBER(10),
schema_key VARCHAR2(10) not null,
collection_group_code CHAR(3),
collection_action_code CHAR(3) not null,
action_order NUMBER,
bucket NUMBER(5),
collection_trx_date DATE not null,
days_into_cycle NUMBER(5),
logical_delete_date DATE,
balance NUMBER(10,2),
abbrev CHAR(8),
customer_status CHAR(2),
sub_account_status CHAR(2),
entry_schema_date DATE,
next_collection_action_code CHAR(3),
next_collectin_trx_date DATE,
reject_key NUMBER(10) not null,
dwh_update_date DATE,
delta_type VARCHAR2(1)
)
Table stg_scd_customers_key has a single composite index on (ACCOUNT_NUMBER, START_DATE, END_DATE). DDL:
create unique index stg_admin.STG_SCD_CUST_KEY_PKP on stg_admin.STG_SCD_CUSTOMERS_KEY (ACCOUNT_NUMBER, START_DATE, END_DATE);
This table is also partitioned:
partition by range (END_DATE)
(
partition SCD_CUSTOMERS_20081103 values less than (TO_DATE(' 2008-11-04 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
tablespace FCT_CUSTOMER_SERVICES_DATA
pctfree 10
initrans 1
maxtrans 255
storage
(
initial 8M
next 1M
minextents 1
maxextents unlimited
)
Table structure:
create table stg_admin.STG_SCD_CUSTOMERS_KEY
(
customer_key NUMBER(18) not null,
account_number NUMBER(10) not null,
start_date DATE not null,
end_date DATE not null,
curr_ind NUMBER(1) not null
)
I can't add a filter on the big table (I need the full range of dates) and I can't use a materialized view. This query runs for about 20-40 minutes, and I have to make it faster.
I've already tried dropping the trunc; it makes no difference.
Any suggestions?
Explain plan:
First, write the query using explicit join syntax:
select tr.account_number , tr.actual_collection_trx_date ,s.customer_key
from fct_collections_trx tr join
stg_scd_customers_key s
on tr.account_number = s.account_number and
trunc(tr.actual_collection_trx_date) between s.start_date and s.end_date;
You already have appropriate indexes for the customers table. You can try an index on fct_collections_trx(account_number, trunc(actual_collection_trx_date), actual_collection_trx_date). Oracle might find this useful for the join.
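For illustration only, such a function-based index might be created as follows (the index name is made up); note that the query has to keep the trunc(...) expression exactly as indexed for Oracle to be able to use it:
create index stg_admin.fct_coll_trx_trunc_idx
  on stg_admin.fct_collections_trx
     (account_number, trunc(actual_collection_trx_date), actual_collection_trx_date);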
However, if you are looking for a single match, then I wonder if there is another approach that might work. How does the following query perform:
select tr.account_number , tr.actual_collection_trx_date,
(select min(s.customer_key) keep (dense_rank first order by s.start_date desc)
from stg_scd_customers_key s
where tr.account_number = s.account_number and
tr.actual_collection_trx_date >= s.start_date
) as customer_key
from fct_collections_trx tr ;
This query is not exactly the same as the original query, because it is not doing any filtering -- and it is not checking the end date. Sometimes, though, this phrasing can be more efficient.
Also, I think the trunc() is unnecessary in this case, so an index on stg_scd_customers_key(account_number, start_date, customer_key) is optimal.
The expression min(x) keep (dense_rank first order by) essentially does first() -- it gets the first element in a list. Note that the min() isn't important; max() works just as well. So, this expression is getting the first customer key that meets the conditions in the where clause. I have observed that this function is quite fast in Oracle, and often faster than other methods.
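As a small illustration (made-up values, not from the tables above), the expression picks the customer_key belonging to the latest start_date:
select min(customer_key) keep (dense_rank first order by start_date desc) as customer_key
from (select 101 as customer_key, date '2020-01-01' as start_date from dual
      union all
      select 102, date '2021-06-15' from dual);
-- returns 102, the key of the row with the most recent start_date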
If the start and end dates have no time elements (i.e. they both default to midnight), then you could do:
select tr.account_number , tr.actual_collection_trx_date ,s.customer_key
from fct_collections_trx tr,
stg_scd_customers_key s
where tr.account_number = s.account_number
and tr.actual_collection_trx_date >= s.start_date
and tr.actual_collection_trx_date < s.end_date + 1;
On top of that, you could add an index to each table, containing the following columns:
for fct_collections_trx: (account_number, actual_collection_trx_date)
for stg_scd_customers_key: (account_number, start_date, end_date, customer_key)
That way, the query should be able to use the indexes rather than having to go to the table as well.
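If it helps, the DDL for those two covering indexes might look like this (the index names are only placeholders); the second one would coexist with, or replace, the existing unique index on the customers table:
create index fct_coll_trx_acct_date_ix
  on fct_collections_trx (account_number, actual_collection_trx_date);
create index stg_scd_cust_acct_covering_ix
  on stg_scd_customers_key (account_number, start_date, end_date, customer_key);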
I suggest an index based on the most selective fields, which in your case are
START_DATE, END_DATE.
Try reversing the existing index (or adding a proper one) as
START_DATE, END_DATE, ACCOUNT_NUMBER
on table stg_scd_customers_key.
Related
Let's assume I have these tables (ignore that they are essentially the same, actual setup is more complex than this):
create table inbound (
id number(19,0) not null,
created_on timestamp(6),
place_id number(19,0),
qty_amount float(126),
constraint "inbound_pk" primary key (id),
constraint "inbound_place_FK" foreign key (place_id)
references place (id) on delete cascade
);
create table outbound (
id number(19,0) not null,
created_on timestamp(6),
place_id number(19,0),
qty_amount float(126),
constraint "outbound_pk" primary key (id),
constraint "outbound_place_FK" foreign key (place_id)
references place (id) on delete cascade
)
Then, I have this query:
with aligned_in (start_date, place_id, total) as (
select
get_week_start(place_id, created_on) start_date,
place_id,
sum(qty_amount) total
from inbound
where <....>
group by
get_week_start(place_id, created_on), place_id
),
aligned_out (start_date, place_id, total) as (
select
get_week_start(place_id, created_on) start_date,
place_id,
sum(qty_amount) total
from outbound
where <....>
group by get_week_start(place_id, created_on), place_id
)
select
start_date,
place_id,
aligned_in.total total_in,
aligned_out.total total_out
from aligned_in
left outer join aligned_out using(place_id, start_date)
For some reason, when executed on Oracle 12.2.0.1.0, this query throws an
ORA-00979: not a GROUP BY expression
error, with the line number pointing at the call to get_week_start.
While fiddling with it, I've also discovered the following:
The subqueries for aligned_in and aligned_out can be run completely fine by themselves
Removing the call to get_week_start from the subqueries' projection fixes it: the group by clause works without this call in the projection (but that obviously changes a lot about how the query is written and executes)
(The thing which I'm most confused about) This exact query without any alterations runs completely fine on Oracle 11.2.0.2.0
Most info on ORA-00979 is not very useful because it doesn't appear applicable at all to my query
Here, the get_week_start is a pretty simple function to find out what the start of a business week would be at a given Place (this is customer's data). Due to how it's defined, this function is not deterministic. However, I did run into suggestions that such functions should be marked deterministic, and did try doing that just to see what happens - and that did not help.
So, why is this happening?
What changed between versions 11.2.0 and 12.2.0 that caused this? Am I missing some configuration option? Can this be fixed without rewriting the query?
Edit:
Sample version of get_week_start as requested in comments:
create function get_week_start(place_id number, week_day date)
return date
as
start_date date;
begin
begin
select
trunc(next_day(week_day, o.business_week_start)) - 7
into start_date
from place
inner join place_owner o on o.id = place.owner_id
where place.id = place_id;
return start_date;
exception
when others then return null;
end;
end get_week_start;
Sample tables for place and place_owner:
create table place_owner (
id number(19,0) not null,
name varchar2(255) not null,
business_week_start varchar2(64) not null,
constraint "place_owner_pk" primary key (id)
);
create table place (
id number(19, 0) not null,
name varchar2(255) not null,
owner_id number(19,0) not null,
constraint "place_pk" primary key (id),
constraint "place_unq" unique (owner_id, name),
constraint "place_owner_fk" foreign key (owner_id)
references place_owner (id) on delete cascade
);
I would try CROSS/OUTER APPLY (Oracle 12c):
with aligned_in (start_date, place_id, total) as (
select
s.start_date,
place_id,
sum(qty_amount) total
from inbound
cross apply (SELECT get_week_start(place_id, created_on) AS start_date
FROM dual) s
where <....>
group by
s.start_date, place_id
),
...
Another approach:
with aligned_in (start_date, place_id, total) as (
SELECT start_date,
place_id,
sum(qty_amount) total
FROM (select get_week_start(place_id, created_on) AS start_date,
place_id,
qty_amount
from inbound
where <....>) sub
group by start_date, place_id
),
-- ...
There are two queries below. The first returns the count of the ID column, excluding NULL values;
the second returns the count of all rows in the table, including rows where ID is NULL.
select COUNT(ID) from TableName
select COUNT(*) from TableName
My confusion:
Is there any performance difference?
TL;DR: The plans might not be the same. You should test on appropriate
data, make sure you have the correct indexes, and then choose the best solution based on your investigation.
The query plans might not be the same depending on the indexing and the nullability of the column which is used in the COUNT function.
In the following example I create a table and fill it with one million rows.
All the columns have been indexed except column 'b'.
The conclusion is that some of these queries do result in the same execution plan but most of them are different.
This was tested on SQL Server 2014; I do not have access to a 2012 instance at the moment. You should test this yourself to figure out the best solution.
create table t1(id bigint identity,
dt datetime2(7) not null default(sysdatetime()),
a char(800) null,
b char(800) null,
c char(800) null);
-- We will use these 4 indexes. Only column 'b' does not have any supporting index on it.
alter table t1 add constraint [pk_t1] primary key NONCLUSTERED (id);
create clustered index cix_dt on t1(dt);
create nonclustered index ix_a on t1(a);
create nonclustered index ix_c on t1(c);
insert into T1 (a, b, c)
select top 1000000
a = case when low = 1 then null else left(REPLICATE(newid(), low), 800) end,
b = case when low between 1 and 10 then null else left(REPLICATE(newid(), 800-low), 800) end,
c = case when low between 1 and 192 then null else left(REPLICATE(newid(), 800-low), 800) end
from master..spt_values
cross join (select 1 from master..spt_values) m(ock)
where type = 'p';
checkpoint;
-- All rows, no matter if any columns are null or not
-- Uses primary key index
select count(*) from t1;
-- All not null,
-- Uses primary key index
select count(id) from t1;
-- Some values of 'a' are null
-- Uses the index on 'a'
select count(a) from t1;
-- Some values of b are null
-- Uses the clustered index
select count(b) from t1;
-- No values of dt are null and the table have a clustered index on 'dt'
-- Uses primary key index and not the clustered index as one could expect.
select count(dt) from t1;
-- Most values of c are null
-- Uses the index on c
select count(c) from t1;
Now what would happen if we were more explicit in what we wanted our count to do? If we tell the query planner that we want only rows where the column is not null, will that change anything?
-- Homework!
-- What happens if we explicitly count only rows where the column is not null? What if we add a filtered index to support this query?
-- Hint: It will once again be different than the other queries.
create index ix_c2 on t1(c) where c is not null;
select count(*) from t1 where c is not null;
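If you want to measure the differences yourself rather than just compare plans, one simple way (not part of the original test script above) is to enable I/O and time statistics around the counts:
set statistics io on;
set statistics time on;
select count(*) from t1;
select count(id) from t1;
select count(c) from t1;
select count(*) from t1 where c is not null;
set statistics io off;
set statistics time off;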
I am building a leaderboard for some of my online games. Here is what I need to do with the data:
Get rank of a player for a given game across multiple time frame (today, last week, all time, etc.)
Get paginated ranking (e.g. top score for last 24 hrs., get players between rank 25 and 50, get rank of a single user)
I came up with the following table definition and index, and I have a couple of questions.
Considering my scenarios, do I have a good primary key? The reason why I have a clustered key across gameId, playerName and score is simply because I want to make sure that all data for a given game is in the same area and that score is already sorted. Most of the time I will display the data in descending order of score (+ updatedDateTime for ties) for a given gameId. Is this the right strategy? In other words, I want to make sure that I can run my queries to get the rank of my players as fast as possible.
CREATE TABLE score (
[gameId] [smallint] NOT NULL,
[playerName] [nvarchar](50) NOT NULL,
[score] [int] NOT NULL,
[createdDateTime] [datetime2](3) NOT NULL,
[updatedDateTime] [datetime2](3) NOT NULL,
PRIMARY KEY CLUSTERED ([gameId] ASC, [playerName] ASC, [score] DESC, [updatedDateTime] ASC)
)
CREATE NONCLUSTERED INDEX [Score_Idx] ON score ([gameId] ASC, [score] DESC, [updatedDateTime] ASC) INCLUDE ([playerName])
Below is the first iteration of the query I will be using to get the rank of my players. However, I am a bit disappointed by the execution plan (see below). Why does SQL need to sort? The additional sort seems to come from the RANK function. But isn’t my data already sorted in descending order (based on the clustered key of the score table)? I am also wondering if I should normalize my table a bit more and move the PlayerName column out into a Player table. I originally decided to keep everything in the same table to minimize the number of joins.
DECLARE @GameId AS INT = 0
DECLARE @From AS DATETIME2(3) = '2013-10-01'
SELECT DENSE_RANK() OVER (ORDER BY Score DESC), s.PlayerName, s.Score, s.CountryCode, s.updatedDateTime
FROM [mrgleaderboard].[score] s
WHERE s.GameId = @GameId
AND (s.UpdatedDateTime >= @From OR @From IS NULL)
Thank you for the help!
[Updated]
Primary key is not good
Your unique entity is [GameID] + [PlayerName], and you have a composite clustered index of more than 120 bytes because of the nvarchar column. See the answer by @marc_s in the related topic SQL Server - Clustered index design for dictionary
Your table schema does not match your requirements for time periods
Example: I earn a score of 300 on Wednesday and that score is stored on the leaderboard. The next day I earn 250, but it will not be recorded on the leaderboard, and you get no results if I run a query for the Tuesday leaderboard
For complete information you can use a historical table of played-game scores, but it can be very expensive
CREATE TABLE GameLog (
[id] int NOT NULL IDENTITY
CONSTRAINT [PK_GameLog] PRIMARY KEY CLUSTERED,
[gameId] smallint NOT NULL,
[playerId] int NOT NULL,
[score] int NOT NULL,
[createdDateTime] datetime2(3) NOT NULL)
Here are some solutions related to the aggregation that can speed this up:
Indexed view on the historical table (see post by @Twinkles).
You need 3 indexed views for the 3 time periods. Potentially huge size of the historical table and the 3 indexed views. You cannot remove the "old" periods from the table. Saving a score becomes a performance issue.
Asynchronous leaderboard
Scores are saved in the historical table. A SQL job/"worker" (or several) runs on a schedule (once per minute?), sorts the historical table, and populates the leaderboard table (3 tables for the 3 time periods, or one table with a time-period key) with the precalculated rank of each user. This table can also be denormalized (carry score, datetime, PlayerName and ...). Pros: fast reads (no sorting), fast score saves, any time periods, flexible logic and flexible schedules. Cons: a user who has just finished a game does not immediately see themselves on the leaderboard
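A minimal sketch of what the scheduled population step might look like, assuming GameLog as the historical table and a denormalized Leaderboard_Daily target table (the target table and its rnk column are illustrative, not part of the schema above):
-- Rebuild today's leaderboard from the historical table (illustrative names).
TRUNCATE TABLE dbo.Leaderboard_Daily;
INSERT INTO dbo.Leaderboard_Daily (gameId, playerId, score, rnk)
SELECT gameId, playerId, score,
       DENSE_RANK() OVER (PARTITION BY gameId ORDER BY score DESC) AS rnk
FROM (SELECT gameId, playerId, MAX(score) AS score
      FROM dbo.GameLog
      WHERE createdDateTime >= CAST(GETDATE() AS date)  -- today's period
      GROUP BY gameId, playerId) AS best;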
Preaggregated leaderboard
Do the pre-treatment while recording the results of the game session. In your case, something like UPDATE [Leaderboard] SET score = @CurrentScore WHERE @CurrentScore > MAX(score) AND ... for the player / game id, but that way you only cover the "All time" leaderboard. The schema might look like this:
CREATE TABLE [Leaderboard] (
[id] int NOT NULL IDENTITY
CONSTRAINT [PK_Leaderboard] PRIMARY KEY CLUSTERED,
[gameId] smallint NOT NULL,
[playerId] int NOT NULL,
[timePeriod] tinyint NOT NULL, -- 0 -all time, 1-monthly, 2 -weekly, 3 -daily
[timePeriodFrom] date NOT NULL, -- '1900-01-01' for all time, '2013-11-01' for monthly, etc.
[score] int NOT NULL,
[createdDateTime] datetime2(3) NOT NULL
)
playerId timePeriod timePeriodFrom Score
----------------------------------------------
1 0 1900-01-01 300
...
1 1 2013-10-01 150
1 1 2013-11-01 300
...
1 2 2013-10-07 150
1 2 2013-11-18 300
...
1 3 2013-11-19 300
1 3 2013-11-20 250
...
So, you have to update the score for every time period. Also, as you can see, the leaderboard will contain "old" periods, such as the October monthly period. You may have to delete them if you do not need those statistics. Pros: does not need a historical table. Cons: a more complicated procedure for storing the result (sketched below), maintenance of the leaderboard, and the query requires sorting and a JOIN
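For illustration, the save-score procedure could be sketched roughly like this for a single time period (the @-variables are assumed parameters of the procedure; you would repeat or loop this for each period):
-- Raise the stored score if the new one is higher, otherwise insert a first row.
UPDATE [Leaderboard]
SET score = @CurrentScore, createdDateTime = @Now
WHERE gameId = @GameId AND playerId = @PlayerId
  AND [timePeriod] = @TimePeriod AND [timePeriodFrom] = @TimePeriodFrom
  AND score < @CurrentScore;
IF @@ROWCOUNT = 0 AND NOT EXISTS (
    SELECT 1 FROM [Leaderboard]
    WHERE gameId = @GameId AND playerId = @PlayerId
      AND [timePeriod] = @TimePeriod AND [timePeriodFrom] = @TimePeriodFrom)
  INSERT INTO [Leaderboard] (gameId, playerId, [timePeriod], [timePeriodFrom], score, createdDateTime)
  VALUES (@GameId, @PlayerId, @TimePeriod, @TimePeriodFrom, @CurrentScore, @Now);
The tables and ranking queries below illustrate the rest of this design.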
CREATE TABLE [Player] (
[id] int NOT NULL IDENTITY CONSTRAINT [PK_Player] PRIMARY KEY CLUSTERED,
[playerName] nvarchar(50) NOT NULL CONSTRAINT [UQ_Player_playerName] UNIQUE NONCLUSTERED)
CREATE TABLE [Leaderboard] (
[id] int NOT NULL IDENTITY CONSTRAINT [PK_Leaderboard] PRIMARY KEY CLUSTERED,
[gameId] smallint NOT NULL,
[playerId] int NOT NULL,
[timePeriod] tinyint NOT NULL, -- 0 -all time, 1-monthly, 2 -weekly, 3 -daily
[timePeriodFrom] date NOT NULL, -- '1900-01-01' for all time, '2013-11-01' for monthly, etc.
[score] int NOT NULL,
[createdDateTime] datetime2(3)
)
CREATE UNIQUE NONCLUSTERED INDEX [UQ_Leaderboard_gameId_playerId_timePeriod_timePeriodFrom] ON [Leaderboard] ([gameId] ASC, [playerId] ASC, [timePeriod] ASC, [timePeriodFrom] ASC)
CREATE NONCLUSTERED INDEX [IX_Leaderboard_gameId_timePeriod_timePeriodFrom_Score] ON [Leaderboard] ([gameId] ASC, [timePeriod] ASC, [timePeriodFrom] ASC, [score] ASC)
GO
-- Generate test data
-- Generate 500K unique players
;WITH digits (d) AS (SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION
SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9 UNION SELECT 0)
INSERT INTO Player (playerName)
SELECT TOP (500000) LEFT(CAST(NEWID() as nvarchar(50)), 20 + (ABS(CHECKSUM(NEWID())) & 15)) as Name
FROM digits CROSS JOIN digits ii CROSS JOIN digits iii CROSS JOIN digits iv CROSS JOIN digits v CROSS JOIN digits vi
-- Random score 500K players * 4 games = 2M rows
INSERT INTO [Leaderboard] (
[gameId],[playerId],[timePeriod],[timePeriodFrom],[score],[createdDateTime])
SELECT GameID, Player.id,ABS(CHECKSUM(NEWID())) & 3 as [timePeriod], DATEADD(MILLISECOND, CHECKSUM(NEWID()),GETDATE()) as Updated, ABS(CHECKSUM(NEWID())) & 65535 as score
, DATEADD(MILLISECOND, CHECKSUM(NEWID()),GETDATE()) as Created
FROM ( SELECT 1 as GameID UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) as Game
CROSS JOIN Player
ORDER BY NEWID()
UPDATE [Leaderboard] SET [timePeriodFrom]='19000101' WHERE [timePeriod] = 0
GO
DECLARE @From date = '19000101'--'20131108'
,@GameID int = 3
,@timePeriod tinyint = 0
-- Get paginated ranking
;With Lb as (
SELECT
DENSE_RANK() OVER (ORDER BY Score DESC) as Rnk
,Score, createdDateTime, playerId
FROM [Leaderboard]
WHERE GameId = @GameId
AND [timePeriod] = @timePeriod
AND [timePeriodFrom] = @From)
SELECT lb.rnk,lb.Score, lb.createdDateTime, lb.playerId, Player.playerName
FROM Lb INNER JOIN Player ON lb.playerId = Player.id
ORDER BY rnk OFFSET 75 ROWS FETCH NEXT 25 ROWS ONLY;
-- Get rank of a player for a given game
SELECT (SELECT COUNT(DISTINCT rnk.score)
FROM [Leaderboard] as rnk
WHERE rnk.GameId = @GameId
AND rnk.[timePeriod] = @timePeriod
AND rnk.[timePeriodFrom] = @From
AND rnk.score >= [Leaderboard].score) as rnk
,[Leaderboard].Score, [Leaderboard].createdDateTime, [Leaderboard].playerId, Player.playerName
FROM [Leaderboard] INNER JOIN Player ON [Leaderboard].playerId = Player.id
where [Leaderboard].GameId = @GameId
AND [Leaderboard].[timePeriod] = @timePeriod
AND [Leaderboard].[timePeriodFrom] = @From
and Player.playerName = N'785DDBBB-3000-4730-B'
GO
This is only an example to present the ideas; it can be optimized. For example, by combining the GameID, TimePeriod and TimePeriodDate columns into one column via a dictionary table, the index becomes more effective.
P.S. Sorry for my English. Feel free to fix grammatical or spelling errors
You could look into indexed views to create scoreboards for common time ranges (today, this week/month/year, all-time).
To get the rank of a player for a given game across multiple timeframes, you will select the game and rank (i.e. sort) by score over multiple timeframes. For this, your nonclustered index could be changed like this, since this is the way your select seems to query:
CREATE NONCLUSTERED INDEX [Score_Idx]
ON score ([gameId] ASC, [updatedDateTime] ASC, [score] DESC)
INCLUDE ([playerName])
For the paginated ranking:
For the 24h top score, I guess you will want all the top scores of a single user across all games within the last 24h. For this you will be querying [playername] and [updateddatetime] together with [gameid].
For the players between rank 25-50, I assume you are talking about a single game and have a long ranking that you can page through. The query will then be based on [gameid], [score] and a little on [updateddatetime] for the ties.
The single-user rank, probably for each game, is a little more difficult. You will need to query the leaderboards for all games to get the player's rank in them and then filter on the player. You will need [gameid], [score], [updateddatetime] and then filter by player.
Concluding all this, I propose you keep your nonclustered index and change the primary key to:
PRIMARY KEY CLUSTERED ([gameId] ASC, [score] DESC, [updatedDateTime] ASC)
For the 24h top score, I think this might help:
CREATE NONCLUSTERED INDEX [player_Idx]
ON score ([playerName] ASC)
INCLUDE ([gameId], [score])
The dense_rank query sorts because it selects [gameId], [updatedDateTime], [score]. See my comment on the nonclustered index above.
I would also think twice about including [updatedDateTime] in your queries and subsequently in your indexes. Maybe sometimes two players get the same rank; why not? [updatedDateTime] will make your index swell up significantly.
Also, you might think about partitioning the tables by [gameid].
As a bit of a sidetrack:
Ask yourself how accurate and how up to date do the scores in the leaderboard actually need to be?
As a player I don't care if I'm number 142134 in the world or number 142133. I do care if I beat my friends' exact score (but then I only need my score compared to a couple of other scores) and I want to know that my new highscore sends me from somewhere around 142000 to somewhere around 90000. (Yay!)
So if you want really fast leaderboards, you do not actually need all data to be up to date. You could daily or hourly compute a static sorted copy of the leaderboard and when showing player X's score, show at what rank it'd fit in the static copy.
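A sketch of that lookup against a static copy (the StaticLeaderboard table and the variables are hypothetical): the approximate rank is simply one more than the number of better scores.
SELECT COUNT(*) + 1 AS rank_in_static_copy
FROM StaticLeaderboard
WHERE gameId = @GameId
  AND score > @PlayerScore;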
When comparing to friends, last minute updates do matter, but you're dealing with only a couple hundred scores, so you can look up their actual scores in the up to date leaderboards.
Oh, and I care about the top 10 of course. Consider them my "friends" merely based on the fact that they scored so well, and show these values up to date.
Your clustered index is composite, which means the order is defined by more than one column. You request ORDER BY Score, but Score is not the leading column of the clustered index. For that reason, entries in the index are not necessarily in the order of Score, e.g. entries
1, 2, some date
2, 1, some other date
If you select just Score, the order will be
2
1
which needs to be sorted.
I would not put the "score" column into the clustered index, because it will probably change all the time ... and updates on a column that's part of the clustered index are expensive
I'm trying to create a view with row numbers like so:
create or replace view daily_transactions as
select
generate_series(1, count(t)) as id,
t.ic,
t.bio_id,
t.wp,
date_trunc('day', t.transaction_time)::date transaction_date,
min(t.transaction_time)::time time_in,
w.start_time wp_start,
w.start_time - min(t.transaction_time)::time in_diff,
max(t.transaction_time)::time time_out,
w.end_time wp_end,
max(t.transaction_time)::time - w.end_time out_diff,
count(t) total_transactions,
calc_att_status(date_trunc('day', t.transaction_time)::date,
min(t.transaction_time)::time,
max(t.transaction_time)::time,
w.start_time, w.end_time ) status
from transactions t
left join wp w on (t.wp = w.wp_name)
group by ic, bio_id, t.wp, date_trunc('day', transaction_time),
w.start_time, w.end_time;
I ended up with duplicate rows. SELECT DISTINCT doesn't work either. Any ideas?
Transaction Table:
create table transactions(
id serial primary key,
ic text references users(ic),
wp text references wp(wp_name),
serial_no integer,
bio_id integer,
node integer,
finger integer,
transaction_time timestamp,
transaction_type text,
transaction_status text
);
WP table:
create table wp(
id serial unique,
wp_name text primary key,
start_time time,
end_time time,
description text,
status text
);
View Output:
CREATE OR REPLACE VIEW daily_transactions as
SELECT row_number() OVER () AS id
, t.ic
, t.bio_id
, t.wp
, t.transaction_time::date AS transaction_date
, min(t.transaction_time)::time AS time_in
, w.start_time AS wp_start
, w.start_time - min(t.transaction_time)::time AS in_diff
, max(t.transaction_time)::time AS time_out
, w.end_time AS wp_end
, max(t.transaction_time)::time - w.end_time AS out_diff
, count(*) AS total_transactions
, calc_att_status(t.transaction_time::date, min(t.transaction_time)::time
, max(t.transaction_time)::time
, w.start_time, w.end_time) AS status
FROM transactions t
LEFT JOIN wp w ON t.wp = w.wp_name
GROUP BY t.ic, t.bio_id, t.wp, t.transaction_time::date
, w.start_time, w.end_time;
Major points
generate_series() is applied after aggregate functions, but produces multiple rows, thereby multiplying all output rows.
The window function row_number() is also applied after aggregate functions, but only generates a single number per row. You need PostgreSQL 8.4 or later for that.
date_trunc() is redundant in date_trunc('day', t.transaction_time)::date.
t.transaction_time::date achieves the same, simpler & faster.
Use count(*) instead of count(t). Same result here, but a bit faster.
Some other minor changes.
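One more note: row_number() OVER () with an empty window numbers the rows in an arbitrary order. If the id should be deterministic, you can order the window explicitly on grouped columns; the choice of ordering below is just an example:
SELECT row_number() OVER (ORDER BY t.ic, t.bio_id, t.wp, t.transaction_time::date) AS id
     , ...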
This query works (thanks to those that helped) to generate a 30-day moving average of volume.
SELECT x.symbol, x.dseqkey, AVG(y.VOLUME) moving_average
FROM STOCK_HIST x, STOCK_HIST y
WHERE x.dseqkey>=29 AND x.dseqkey BETWEEN y.dseqkey AND y.dseqkey+29
AND Y.Symbol=X.Symbol
GROUP BY x.symbol, x.dseqkey
ORDER BY x.dseqkey DESC
However, the performance is very bad. I am running the above against a view (STOCK_HIST) that brings two tables (A and B) together. Table A contains daily stock volume and the daily date for over 9,000 stocks dating back as far as 40 years (300+ rows per year for each of the 9,000 stocks). Table B is a "Date Key" table that links the date in table A to the DSEQKEY (int).
What are my options for performance improvement? I have heard that views are convenient but not performant. Should I just copy the columns needed from table A and B to a single table and then run the above query? I have indexes on the tables A and B on the stock symbol + date (A) and DSEQKEY (B).
Is it the view that's killing my performance? How can I improve this?
EDIT
By request, I have posted the 2 tables and the view below. Also, there is now one clustered index on the view and on each table. I am open to any recommendations, as this query, which produces the desired result, is still slow:
SELECT
x.symbol
, x.dseqkey
, AVG(y.VOLUME) moving_average
FROM STOCK_HIST x
JOIN STOCK_HIST y ON x.dseqkey BETWEEN y.dseqkey AND y.dseqkey+29 AND Y.Symbol=X.Symbol
WHERE x.dseqkey >= 15000
GROUP BY x.symbol, x.dseqkey
ORDER BY x.dseqkey DESC ;
HERE IS THE VIEW:
CREATE VIEW [dbo].[STOCK_HIST]
WITH SCHEMABINDING
AS
SELECT
dbo.DATE_MASTER.date
, dbo.DATE_MASTER.year
, dbo.DATE_MASTER.quarter
, dbo.DATE_MASTER.month
, dbo.DATE_MASTER.week
, dbo.DATE_MASTER.wday
, dbo.DATE_MASTER.day
, dbo.DATE_MASTER.nday
, dbo.DATE_MASTER.wkmax
, dbo.DATE_MASTER.momax
, dbo.DATE_MASTER.qtrmax
, dbo.DATE_MASTER.yrmax
, dbo.DATE_MASTER.dseqkey
, dbo.DATE_MASTER.wseqkey
, dbo.DATE_MASTER.mseqkey
, dbo.DATE_MASTER.qseqkey
, dbo.DATE_MASTER.yseqkey
, dbo.DATE_MASTER.tom
, dbo.QP_HISTORY.Symbol
, dbo.QP_HISTORY.[Open] as propen
, dbo.QP_HISTORY.High as prhigh
, dbo.QP_HISTORY.Low as prlow
, dbo.QP_HISTORY.[Close] as prclose
, dbo.QP_HISTORY.Volume
, dbo.QP_HISTORY.QRS
FROM dbo.DATE_MASTER
INNER JOIN dbo.QP_HISTORY ON dbo.DATE_MASTER.date = dbo.QP_HISTORY.QPDate ;
HERE IS DATE_MASTER TABLE:
CREATE TABLE [dbo].[DATE_MASTER] (
[date] [datetime] NULL
, [year] [int] NULL
, [quarter] [int] NULL
, [month] [int] NULL
, [week] [int] NULL
, [wday] [int] NULL
, [day] [int] NULL
, [nday] nvarchar NULL
, [wkmax] [bit] NOT NULL
, [momax] [bit] NOT NULL
, [qtrmax] [bit] NOT NULL
, [yrmax] [bit] NOT NULL
, [dseqkey] [int] IDENTITY(1,1) NOT NULL
, [wseqkey] [int] NULL
, [mseqkey] [int] NULL
, [qseqkey] [int] NULL
, [yseqkey] [int] NULL
, [tom] [bit] NOT NULL
) ON [PRIMARY] ;
HERE IS THE QP_HISTORY TABLE:
CREATE TABLE [dbo].[QP_HISTORY] (
[Symbol] varchar NULL
, [QPDate] [date] NULL
, [Open] [real] NULL
, [High] [real] NULL
, [Low] [real] NULL
, [Close] [real] NULL
, [Volume] [bigint] NULL
, [QRS] [smallint] NULL
) ON [PRIMARY] ;
HERE IS THE VIEW (STOCK_HIST) INDEX
CREATE UNIQUE CLUSTERED INDEX [ix_STOCK_HIST] ON [dbo].[STOCK_HIST]
(
[Symbol] ASC,
[dseqkey] ASC,
[Volume] ASC
)
HERE IS THE QP_HIST INDEX
CREATE UNIQUE CLUSTERED INDEX [IX_QP_HISTORY] ON [dbo].[QP_HISTORY]
(
[Symbol] ASC,
[QPDate] ASC,
[Close] ASC,
[Volume] ASC
)
HERE IS THE INDEX ON DATE_MASTER
CREATE UNIQUE CLUSTERED INDEX [IX_DATE_MASTER] ON [dbo].[DATE_MASTER]
(
[date] ASC,
[dseqkey] ASC,
[wseqkey] ASC,
[mseqkey] ASC
)
I do not have any primary keys set up. Would adding them help performance?
EDIT - After making suggested changes the query is slower than before. What ran in 10m 44s is currently at 30m and still running.
I made all of the requested changes, except that I did not rename the date column in Date_Master and I did not drop the QPDate column from QP_Hist. (I have reasons for this and do not see it impacting the performance, since I'm not referring to it in the query.)
REVISED QUERY
select x.symbol, x.dmdseqkey, avg(y.volume) as moving_average
from dbo.QP_HISTORY as x
join dbo.QP_HISTORY as y on (x.dmdseqkey between y.dmdseqkey and (y.dmdseqkey + 29))
and (y.symbol = x.symbol)
where x.dmdseqkey >= 20000
group by x.symbol, x.dmdseqkey
order by x.dmdseqkey desc ;
PK on QP_History
ALTER TABLE [dbo].[QP_HISTORY]
ADD CONSTRAINT [PK_QP_HISTORY] PRIMARY KEY CLUSTERED ([Symbol] ASC, [DMDSeqKey] ASC)
FK on QP_History
ALTER TABLE [dbo].[QP_HISTORY] ADD CONSTRAINT [FK_QP_HISTORY_DATE_MASTER] FOREIGN KEY([DMDSeqKey]) REFERENCES [dbo].[DATE_MASTER] ([dseqkey])
PK on Date_Master
ALTER TABLE [dbo].[DATE_MASTER]
ADD CONSTRAINT [PK_DATE_MASTER] PRIMARY KEY CLUSTERED ([dseqkey] ASC)
EDIT
HERE IS THE EXECUTION PLAN
First, separate the join and the filter.
(edit: fixed ON clause)
SELECT x.symbol, x.dseqkey, AVG(y.VOLUME) moving_average
FROM
STOCK_HIST x
JOIN
STOCK_HIST y ON x.dseqkey BETWEEN y.dseqkey AND y.dseqkey+29
AND Y.Symbol=X.Symbol
WHERE x.dseqkey>=29
GROUP BY x.symbol, x.dseqkey
ORDER BY x.dseqkey DESC
Also, what indexes do you have? I'd suggest an index on (dseqkey, symbol) INCLUDE (VOLUME).
Edit 3: you can't have an INCLUDE in a clustered index, my bad. Your syntax is OK.
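In case the wording is unclear, the suggested nonclustered index (as opposed to the clustered one) would be written roughly like this (the index name is arbitrary), either on the indexed view or on whichever table you end up querying:
CREATE NONCLUSTERED INDEX ix_stock_hist_dseqkey_symbol
ON dbo.STOCK_HIST (dseqkey, Symbol)
INCLUDE (Volume);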
Please try these permutations... the aim is to find the best index for the JOIN and WHERE, followed by the ORDER BY.
CREATE UNIQUE CLUSTERED INDEX [ix_STOCK_HIST] ON [dbo].[STOCK_HIST] (...
...[Symbol] ASC, [dseqkey] ASC, [Volume] ASC )
...[dseqkey] ASC, [Symbol] ASC, [Volume] ASC )
...[Symbol] ASC, [dseqkey] DESC, [Volume] ASC )
...[dseqkey] DESC, [Symbol] ASC, [Volume] ASC )
SQL Server does not support the LAG or LEAD clauses available in Oracle and PostgreSQL, nor does it support session variables like MySQL.
Calculating aggregates against moving windows is a pain in SQL Server.
So God knows I hate to say this, however, in this case a CURSOR based solution may be more efficient.
Try putting a clustered index on the view. That will persist the view to disk like a normal table, and your base tables won't have to be accessed every time.
That should speed things up a bit.
For a better answer, please post the link to your original question to see if a better solution can be found.
OK, so I'll start from the end. I would like to achieve this model.
With this in place, you can run the query on the history table directly; there is no need for the view and the join to dbo.DATE_MASTER.
select
x.symbol
, x.dseqkey
, avg(y.volume) as moving_average
from dbo.QP_HISTORY as x
join dbo.QP_HISTORY as y on (x.dSeqKey between y.dSeqKey and (y.dSeqKey + 29))
and (y.symbol = x.symbol)
where x.dseqkey >= 15000
group by x.symbol, x.dseqkey
order by x.dseqkey desc
OPTION (ORDER GROUP) ;
The QP_HISTORY is narrower than the STOCK_HISTORY view, so the query should be faster. The "redundant column removal" from joins is scheduled for the next generation of SQL Server (Denali), so for the time being, narrower usually means faster, at least for large tables. Also, the join on .. and the where clause nicely match the PK (Symbol, dSeqKey).
Now, how to achieve this:
a) Modify the [date] column in dbo.DATE_MASTER to be of type date instead of datetime. Rename it FullDate to avoid confusion. Not absolutely necessary, but it preserves my sanity.
b) Add PK to the dbo.DATE_MASTER
alter table dbo.DATE_MASTER add constraint pk_datemstr primary key (dSeqKey);
c) In the table QP_HISTORY, add a column dSeqKey and populate it for the matching QPDate dates (see the sketch after this list).
d) Drop the QPDate column from the table.
e) Add PK and FK to the QP_HISTORY
alter table dbo.QP_HISTORY
add constraint pk_qphist primary key (Symbol, dSeqKey)
, constraint fk1_qphist foreign key (dSeqKey)
references dbo.DATE_MASTER(dSeqKey) ;
f) Drop all those indexes mentioned at the end ouf your question, at least for the time being.
g) I do not see the size of the Symbol field. Define it as narrow as possible.
h) Needless to say, implement and test this on a development system first.
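As a rough sketch of step (c), assuming the DATE_MASTER date column was not renamed and its datetime values carry no time portion (names as in the question):
-- Add the surrogate date key and fill it in from the date dimension.
alter table dbo.QP_HISTORY add dSeqKey int null;
update h
set h.dSeqKey = dm.dseqkey
from dbo.QP_HISTORY as h
join dbo.DATE_MASTER as dm on dm.[date] = h.QPDate;
-- Once every row is populated, tighten the column so it can take part in the primary key.
alter table dbo.QP_HISTORY alter column dSeqKey int not null;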