I'll try to explain my issue, since I'm not using SQL directly.
I'm using the INFORMATICA tool, with mappings that process SQL data, so I'll translate the logic my map performs into SQL.
My map basically selects data from an SCD (slowly changing dimension) where start_date = sysdate and ind = 1 (this table has approximately 600 million records) using this query:
SELECT table.ACCOUNT_NUMBER, table.SUB_ACCOUNT_NUMBER, table.SUB_ACCOUNT_KEY
FROM table
WHERE table.CURR_IND=1
AND table.START_DATE=trunc(sysdate)
This table is indexed as follows:
SUB_ACCOUNT_KEY - UNIQUE
Then it adds another column and updates a different table that has approximately 8 million records. The equivalent SQL is roughly an update with a join (a sketch of the full statement is shown below):
SET table_2.ind = the_new_column,
    table_2.sub_account_key = table1.sub_account_key
WHERE table1.account_number = table_2.account_number
AND   table1.sub_account_number = table_2.sub_account_number
This table_2 is indexed as follows:
(ACCOUNT_NUMBER, SUB_ACCOUNT_NUMBER) - UNIQUE
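To be concrete, that update-with-join would look roughly like the MERGE below in Oracle. Table and column names follow the description above, and the_new_column is only a placeholder for the value the mapping derives, so this is a sketch, not the exact SQL INFORMATICA generates:
MERGE INTO table_2 t2
USING (
  SELECT account_number, sub_account_number, sub_account_key
  FROM   table1
  WHERE  curr_ind = 1
  AND    start_date = TRUNC(SYSDATE)
) t1
ON (    t1.account_number     = t2.account_number
    AND t1.sub_account_number = t2.sub_account_number)
WHEN MATCHED THEN UPDATE
  SET t2.ind             = the_new_column,   -- placeholder: the value the mapping derives
      t2.sub_account_key = t1.sub_account_key;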
Both the select and the update take some time to process, depending on the amount of data I get each day (roughly once every three months there is a day with about 30x the normal volume, which takes forever, about 2 hours).
So, my question is: how can I speed this process up, given the following limitation:
I can't add an index on the tables (unless given a very good reason), since they are used by many other processes and it could harm their performance.
suggestion 1: create a function based index:
CREATE INDEX index_name
ON table (TRUNC(START_DATE));
as you mentioned, this might not be possible because you can't add indexes.
suggestion 2: use BETWEEN on the bare column instead of applying TRUNC to it:
SELECT table.ACCOUNT_NUMBER, table.SUB_ACCOUNT_NUMBER, table.SUB_ACCOUNT_KEY
FROM table
WHERE table.CURR_IND=1
AND table.START_DATE BETWEEN TO_DATE('2016.02.14 12:00:00 AM', 'YYYY.MM.DD HH:MI:SS AM')
                         AND TO_DATE('2016.02.14 11:59:59 PM', 'YYYY.MM.DD HH:MI:SS AM');
(see also http://oraclecoder.com/tutorials/quick-tip-do-not-use-trunc-to-filter-on-a-date-and-time-field--2120)
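Since your filter is always "today", the same idea can be written against SYSDATE as a half-open range, which keeps START_DATE bare so an existing index on it could still be used. A sketch, assuming START_DATE may carry a time component:
SELECT table.ACCOUNT_NUMBER, table.SUB_ACCOUNT_NUMBER, table.SUB_ACCOUNT_KEY
FROM   table
WHERE  table.CURR_IND = 1
AND    table.START_DATE >= TRUNC(SYSDATE)       -- today at 00:00
AND    table.START_DATE <  TRUNC(SYSDATE) + 1;  -- strictly before tomorrow at 00:00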
This is essentially the same question you asked under "get current date formatted". You are either going to have to modify your SQL, or use a function-based index. Yes, indexes can cause some additional overhead on DML, but they can give dramatic improvement on SELECTs. Like all design decisions, you have to weigh the benefit against the cost and decide what is more important.
I would like to run this query about once every 5 minutes to be able to run an incremental query to MERGE to another table.
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even if the SELECT MAX doesn't extract the date from the query? Or is it just the columnar nature of BigQuery will make this optimal?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS view within your dataset (see the documentation).
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds metadata at the rate of one record per partition. It is therefore much smaller than your table, so it's an easy way to cut your query costs (and it is also much faster to query).
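For instance, to get a single watermark value for your incremental MERGE, you could take the most recent partition modification time. This is a sketch, assuming myTable is partitioned and that the partition's last-modified time is an acceptable proxy for MAX(timestamp):
SELECT MAX(LAST_MODIFIED_TIME) AS last_change
FROM   `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE  TABLE_NAME = "myTable";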
I want to understand: if I need to query a table and the query is something like the one below
Select * from table_name where date_column > sysdate-2 and date_column < sysdate;
Note: my intention is to select the data of each specific day.
How should I design my table for better results? I think partitioning based on date will give too many partitions and lead to a performance bottleneck, and I'm not sure whether bucketing works here. Please suggest an approach with some explanation.
If the daily volume is just not enough to justify a partition per day, consider partitioning by yyyyMM (year and month). In that case, your query changes to
Select * from table_name where
my_partition_col in (date_format(sysdate,'yyyyMM'), date_format(sysdate-2,'yyyyMM'))
AND date_column > sysdate-2 and date_column < sysdate;
This optimizes the storage and performance requirement.
You should partition by date.
You are correct that this will create a lot of partitions. Within Hive, each date will be a separate partition (a separate directory of files), and yes, Hive will need to maintain all of that, but that's exactly what Hive is best at.
Note: my intention is to select the data of each specific day.
Since this is your intention, you'll get the best performance with daily partitions.
Other sorts of queries, running across multiple dates, may result in the performance bottleneck you're expressing concern about. But if that occurs, you could consider creating a different table to address that use case.
For your primary, current use case, daily partitions are the solution.
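For example, a daily-partitioned table and a single-day query could look like the sketch below (table and column names are only illustrative, not from your question):
-- illustrative daily-partitioned table
CREATE TABLE events_daily (
  id      BIGINT,
  payload STRING
)
PARTITIONED BY (event_date STRING)   -- e.g. '2024-01-31'
STORED AS ORC;

-- a single-day query; the predicate on the partition column prunes to one partition
SELECT *
FROM   events_daily
WHERE  event_date = date_format(current_date, 'yyyy-MM-dd');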
We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, the statement above applies to any data type, not only dates. Also, the word many is relative to the number of records in the table. If the optimizer decides that the query will return a large fraction of all the records in your table, then it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records are in 2017 out of all records in the table? That ratio gives you the selectivity of your predicate (and, with the table size, its cardinality), which gives you an idea whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
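For illustration only (the index and table names below are hypothetical, and per your constraint you would need the vendor's sign-off), an index that matches the EXTRACT predicate would be a function-based index on the year expression:
CREATE INDEX events_performed_year_idx
  ON events_table (EXTRACT(YEAR FROM PERFORMED_DATE_TIME));

-- the query can then use it, because the indexed expression matches the predicate
SELECT *
FROM   events_table
WHERE  Event_Code = 102225120
AND    EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017;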
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
Applying a function to the column can also cause slowness given the number of records involved. I'm not sure whether a function-based index can help you here, but you can try.
Have you tried adding a year column to the table? If not, try adding one and populating it with the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider partitioned tables for data this big. For starters, see this link:
https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the first method would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
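A rough sketch of what such a shadow table could look like in a database you control (all names here are hypothetical, and source_pk stands for the primary key of the vendor table):
-- shadow table: just the PK and the date, kept in sync with the vendor table
CREATE TABLE shadow_event_dates (
  source_pk           NUMBER NOT NULL,
  performed_date_time DATE   NOT NULL
);
CREATE INDEX shadow_event_dates_dt_idx
  ON shadow_event_dates (performed_date_time);

-- find the relevant keys cheaply here, then join them back to the vendor table
SELECT s.source_pk
FROM   shadow_event_dates s
WHERE  s.performed_date_time >= DATE '2017-01-01'
AND    s.performed_date_time <  DATE '2018-01-01';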
Maybe this could be useful (you avoid functions on the column, which can also cause context switching, and an index on your date field, if there is one, could be used):
with dt as
(
  select to_date('01/01/2017', 'DD/MM/YYYY') as d1,
         to_date('31/12/2017', 'DD/MM/YYYY') as d2
  from dual
),
dates as
(
  select dt.d1 + rownum - 1 as d
  from dt
  connect by dt.d1 + rownum - 1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
Note that the equality join only matches rows whose PERFORMED_DATE_TIME has no time-of-day portion; if the column stores times, compare on a per-day range instead.
Move the date literals to the right-hand side and leave the column bare:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.
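A sketch of that scripting approach (the index and table names are hypothetical; the point is simply that the add/remove scripts stay under your control and are run around vendor upgrades):
-- add_indexes.sql: run after installing / upgrading the vendor schema
CREATE INDEX performed_date_time_idx ON events_table (PERFORMED_DATE_TIME);

-- remove_indexes.sql: run before handing the schema back for a vendor upgrade
DROP INDEX performed_date_time_idx;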
I have a query which retrieves shipment information.
I would like to be able to use an index on a date column. The where clause for this column looks like shipment.end_alloc_date >= to_date( last week ).
When I add shipment.end_alloc_date <= to_date( next week ) the index is used. However, I don't want to use this second line.
Does anyone know how to force Oracle to use this date index with only using the first restriction?
The table contains about 180,000 rows and both SQL statements retrieve 50 rows. However, when I run EXPLAIN PLAN, the index on end_alloc_date is only used in the second statement. Why is that, and is there something I can do to force Oracle to use the index?
1) select <some data> from shipment where shipment.end_alloc_date >= to_date( last week )
2) select <some data> from shipment where shipment.end_alloc_date >= to_date( last week ) and shipment.end_alloc_date <= to_date( next week )
Generally speaking, you should trust the optimizer to know its business, which is optimizing the performance of queries. In particular, you should expect that the optimizer knows when it will be beneficial to use an index and when it will not be beneficial to do so. If using an index won't benefit the performance of the query, then the optimizer won't use it.
So, some questions for you:
1. Is the query running too slow? (If not, why are you worried?)
2. What is the schema of the tables?
3. What are the indexes on the tables?
4. What are the cardinalities of the tables in question?
5. What exactly does the complete query look like?
6. What does the query plan look like?
7. What proportion of the rows in the table satisfy shipment.end_alloc_date >= to_date(last week)?
8. What proportion of the rows in the table satisfy shipment.end_alloc_date <= to_date(next week)?
Did you notice that these two conditions are not inverses of each other? I assume so, but that means that the best query plan for one may be different from the best query plan for the other.
The optimizer will be taking into account the answers to questions 2-8 in that list, and using its judgement to choose the best way of answering the query. You must know the answers to these questions if you think the optimizer is failing. But without that information, no-one here can provide you much help beyond vague hand-waving "look for optimizer hints in the manual".
It is a misconception that using the index is the fastest way to run a query.
I expect the optimizer has decided it's more efficient not to use the index when only 'shipment.end_alloc_date >= to_date( last week )' is specified.
For example, if that query yields many rows, then most likely the optimizer has chosen this route because it's more efficient not to use the index. It could choose a full table scan instead because it's quicker to read contiguous blocks of data than to look up rowids from the index.
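If, after checking the execution plan and the row counts, you still want to override the optimizer, an index hint is the usual mechanism. A sketch only (the index name is hypothetical, and TRUNC(SYSDATE) - 7 stands in for however you compute "last week"); verify with EXPLAIN PLAN that forcing the index is actually faster:
SELECT /*+ INDEX(s shipment_end_alloc_date_idx) */ s.*
FROM   shipment s
WHERE  s.end_alloc_date >= TRUNC(SYSDATE) - 7;  -- "last week" placeholder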
I have a Postgres DB running 7.4 (Yeah we're in the midst of upgrading)
I have four separate queries to get the Daily, Monthly, Yearly and Lifetime record counts
SELECT COUNT(field)
FROM database
WHERE date_field
BETWEEN DATE_TRUNC('DAY', LOCALTIMESTAMP)
AND DATE_TRUNC('DAY', LOCALTIMESTAMP) + INTERVAL '1 DAY'
For Month just replace the word DAY with MONTH in the query and so on for each time duration.
Looking for ideas on how to get all the desired results with one query and any optimizations one would recommend.
Thanks in advance!
NOTE: date_field is timestamp without time zone
UPDATE:
Sorry, I do filter out records with additional query constraints; I just wanted to give the gist of the date_field comparisons. Sorry for any confusion.
I have an idea of using prepared statements and a simple statistics table (record_count_t) for that:
-- DROP TABLE IF EXISTS record_count_t;
-- DEALLOCATE record_count;
-- DROP FUNCTION updateRecordCounts();
CREATE TABLE record_count_t (type char, count bigint);
INSERT INTO record_count_t (type) VALUES ('d'), ('m'), ('y'), ('l');
PREPARE record_count (text) AS
UPDATE record_count_t SET count =
(SELECT COUNT(field)
FROM database
WHERE
CASE WHEN $1 <> 'l' THEN
DATE_TRUNC($1, date_field) = DATE_TRUNC($1, LOCALTIMESTAMP)
ELSE TRUE END)
WHERE type = $1;
CREATE FUNCTION updateRecordCounts() RETURNS void AS
$$
EXECUTE record_count('d');
EXECUTE record_count('m');
EXECUTE record_count('y');
EXECUTE record_count('l');
$$
LANGUAGE SQL;
SELECT updateRecordCounts();
SELECT type,count FROM record_count_t;
Use the updateRecordCounts() function any time you need to refresh the statistics.
I'd guess that it is not possible to optimize this any further than it already is.
If you're collecting daily/monthly/yearly stats, as I'm assuming you are doing, one option (after upgrading, of course) is a with statement and the relevant joins, e.g.:
with daily_stats as (
(what you posted)
),
monthly_stats as (
(what you posted monthly)
),
etc.
select daily_stats.stats,
       monthly_stats.stats,
       etc.
from stats
left join yearly_stats on ...
left join monthly_stats on ...
left join daily_stats on ...
However, that will actually perform less well than running each query separately in a production environment, because you'll introduce left joins in the DB which could be done just as well in the middleware (i.e. show daily, then monthly, then yearly and finally lifetime stats). (If not better, since you'll be avoiding full table scans.)
By keeping things as is, you'll save the precious DB resources for dealing with reads and writes of actual data. The tradeoff (less network traffic between your database and your app) is almost certainly not worth it.
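For concreteness, the CTE-plus-join shape described above would look roughly like the sketch below (table and column names are the question's placeholders; the single-row aggregates are simply cross-joined, and the yearly CTE is omitted for brevity):
WITH daily_stats AS (
  SELECT COUNT(field) AS daily
  FROM database
  WHERE date_field >= DATE_TRUNC('DAY', LOCALTIMESTAMP)
    AND date_field <  DATE_TRUNC('DAY', LOCALTIMESTAMP) + INTERVAL '1 DAY'
),
monthly_stats AS (
  SELECT COUNT(field) AS monthly
  FROM database
  WHERE date_field >= DATE_TRUNC('MONTH', LOCALTIMESTAMP)
    AND date_field <  DATE_TRUNC('MONTH', LOCALTIMESTAMP) + INTERVAL '1 MONTH'
),
lifetime_stats AS (
  SELECT COUNT(field) AS lifetime
  FROM database
)
SELECT daily_stats.daily, monthly_stats.monthly, lifetime_stats.lifetime
FROM daily_stats, monthly_stats, lifetime_stats;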
Yikes! Don't do this!!! Not because you can't do what you're asking, but because you probably shouldn't be doing what you're asking in this manner. I'm guessing the reason you've got date_field in your example is because you've got a date_field attached to a user or some other meta-data.
Think about it: you are asking PostgreSQL to scan 100% of the records relevant to a given user. Unless this is a one-time operation, you almost assuredly do not want to do this. If this is a one-time operation and you are planning on caching this value as a meta-data, then who cares about the optimizations? Space is cheap and will save you heaps of execution time down the road.
You should add 4x per-user (or whatever it is) meta-data fields that help sum up the data. You have two options, I'll let you figure out how to use this so that you keep historical counts, but here's the easy version:
CREATE TABLE user_counts_only_keep_current (
  user_id BIGINT NOT NULL,  -- type should match "user"(id)
  lifetime INT DEFAULT 0,
  yearly INT DEFAULT 0,
  monthly INT DEFAULT 0,
  daily INT DEFAULT 0,
  last_update_utc TIMESTAMP WITH TIME ZONE,
  FOREIGN KEY(user_id) REFERENCES "user"(id)
);
CREATE UNIQUE INDEX this_tbl_user_id_udx ON user_counts_only_keep_current(user_id);
Set up some stored procedures that zero out individual columns if last_update_utc doesn't match the current day according to NOW(). You can get creative from here, but incrementing records like this is going to be the way to go.
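A minimal sketch of the increment path (the reset logic is simplified to the daily counter only; real code would also reset monthly/yearly on their own rollovers, and the literal user id is just a placeholder):
-- bump the counters for one user when a new record arrives
UPDATE user_counts_only_keep_current
SET daily    = CASE WHEN last_update_utc::date = NOW()::date
                    THEN daily + 1 ELSE 1 END,
    monthly  = monthly + 1,
    yearly   = yearly + 1,
    lifetime = lifetime + 1,
    last_update_utc = NOW()
WHERE user_id = 42;  -- placeholder id of the user whose counters to bump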
Handling of time-series data in any relational database requires special handling and maintenance. Look into PostgreSQL's table inheritance if you want good temporal data management... but really, don't do whatever it is you're about to do to your application, because it's almost certainly going to result in bad things(tm).