Currently I handle big data in Redshift, and I would like to ask about the best table schema.
The table is defined like this:
CREATE TABLE tmp_tbl_yyyy_MM (
    user_id    int,
    position_x int,
    position_y int,
    date       date,
    type       int
)
SORTKEY (date);
About 1 billion rows are inserted per month.
I usually scan data from the past 3 months, and according to this article, time-separated tables seem to be a good fit.
Therefore I separated the table by month, as in "_yyyy_MM".
Here is an example query I often run:
select user_id from (
    select * from tmp_tbl_yyyy_MM
    union all
    select * from tmp_tbl_yyyy_MM
) t
where
    (
        (position_x between ? and ? and position_y between ? and ?)
        or (position_x between ? and ? and position_y between ? and ?)
        or ...
    )
    and date between ? and ?
    and type = ?;
The position_x, position_y conditions are repeated over 1,000 times.
This query's plan is a sequential scan, so it's very slow.
Please teach me the best way to get the same results.
I guess the key points are the table design, the query, and the sort key.
Is UNION ALL bad?
Should I not separate the table by month?
Should the WHERE clause be pushed down into the subqueries?
Should I set an interleaved sort key on all the condition columns, i.e. position_x, position_y, date, type?
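For reference, a minimal sketch of what the interleaved sort key mentioned in the last question could look like on one monthly table. It simply mirrors the schema above; whether it actually beats the compound sort key on date depends on the workload, so treat it as an option to test, not a recommendation:

-- Sketch only: interleaved sort key over the filtered columns of one monthly table.
CREATE TABLE tmp_tbl_yyyy_MM (
    user_id    int,
    position_x int,
    position_y int,
    date       date,
    type       int
)
INTERLEAVED SORTKEY (position_x, position_y, date, type);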
I would like to run the following query about once every 5 minutes so that I can run an incremental query that MERGEs the results into another table:
SELECT MAX(timestamp) FROM dataset.myTable
-- timestamp is of type TIMESTAMP
My concern is that this will do a full scan of myTable on a regular basis.
What are the best practices for optimizing this query? Will partitioning help even if the SELECT MAX doesn't extract the date from the query? Or is it just the columnar nature of BigQuery that will make this optimal?
Thank you.
What you can do is, instead of querying your table directly, query the INFORMATION_SCHEMA.PARTITIONS view within your dataset (see the BigQuery documentation).
You can for instance go for:
SELECT LAST_MODIFIED_TIME
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE TABLE_NAME = "myTable"
The PARTITIONS view holds metadata at a rate of one record per partition. It is therefore much smaller than your table, which makes it an easy way to cut your query costs (it is also much faster to query).
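If a single checkpoint value is all that is needed, a minimal sketch (reusing the assumed project.dataset names from above) is to take the newest modification time across all partitions:

-- Sketch only: newest partition modification time as a single checkpoint value.
SELECT MAX(last_modified_time) AS checkpoint
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = "myTable";

Note that this is the partitions' last modification time rather than MAX(timestamp) itself, so it is only a good proxy when new rows arrive roughly in timestamp order.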
We have a database that is growing every day, roughly 40M records as of today.
This table/database is located in Azure.
The table has a primary key 'ClassifierID', and the query runs on this primary key.
The primary key is in the format of ID + timestamp (mmddyyyy HHMMSS), for example 'CNTR00220200 04052021 073000'.
Here is the query to get all the IDs by date:
Select distinct ScanID
From ClassifierResults
Where ClassifierID LIKE 'CNTR%04052020%'
Very simple and straightforward, but it sometimes takes over a minute to complete. Do you have any suggestions for how we can optimize the query? Thanks much.
The best thing here would be to fix your design so that (a) you are not storing the ID and the timestamp in the same text field, and (b) you are storing the timestamp in a proper date/timestamp column. Based on your single data point, I would suggest the following table design:
ID | timestamp
CNTR00220200 | timestamp '2021-04-05 07:30:00'
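As DDL, that design could look something like the sketch below. The column types are assumptions (the question doesn't state them), ScanID is carried over from the original query, and [timestamp] is bracketed only because timestamp is also a T-SQL type name:

-- Sketch only: column types are assumptions, not taken from the question.
CREATE TABLE yourTable (
    ID          varchar(20) NOT NULL,
    [timestamp] datetime2   NOT NULL,
    ScanID      varchar(20) NOT NULL
);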
Then, create an index on (ID, timestamp), and use this query:
SELECT *
FROM yourTable
WHERE ID LIKE 'CNTR%' AND
timestamp >= '2021-04-05' AND timestamp < '2021-04-06';
The above query searches for records having an ID starting with CNTR and falling exactly on the date 2021-04-05. Your SQL database should be able to use the composite index I suggested above on this query.
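And a minimal sketch of the composite index mentioned above (the index name is a placeholder):

-- Sketch only: lets the ID prefix match plus the timestamp range use the index.
CREATE INDEX IX_yourTable_ID_timestamp ON yourTable (ID, [timestamp]);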
I have two tables (pageviews and base_events) that are both partitioned on a date field, derived_tstamp. Every night I do an incremental update of the base_events table, querying the new data from pageviews like so:
select
*
from
`project.sp.pageviews`
where derived_tstamp > (select max(derived_tstamp) from `project.sp_modeled.base_events`)
Looking at the query costs, this query scans the full table instead of only the new data, even though it should usually only fetch yesterday's data.
Do you have any idea what's wrong with the query?
A subquery in the partition filter prevents partition pruning, so it triggers a full table scan. The solution is to use scripting. I solved my problem with the following query:
declare event_date_checkpoint DATE default (
select max(date(page_view_start)) from `project.sp_modeled.base_events`
);
select
*
from
`project.sp.pageviews`
where derived_tstamp > event_date_checkpoint
More on scripting:
https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#declare
Say we have a table partitioned as:
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT, combination_id BIGINT);
Now obviously year is going to store the year value (e.g. 2016), month will store the month value (e.g. 7), day will store the day (e.g. 18), and hour will store the hour value in 24-hour format (e.g. 13). And combination_id is going to be a combination of the padded values (if a value is a single digit, pad it with a 0 on the left) of all of these. So in this example the combination_id is 2016071813.
So we fire this query (let's call it Query A):
select * from mytable where combination_id = 2016071813
Now Hive doesn't know that combination_id is actually a combination of year, month, day and hour. So will this query not take proper advantage of partitioning?
In other words, if I have another query, call it Query B, will it be more optimal than Query A, or is there no difference?
select * from mytable where year=2016 and month=7 and day=18 and hour=13
If the Hive partitioning scheme is really hierarchical in nature, then Query B should be better from a performance point of view, is what I am thinking. Actually, I want to decide whether to get rid of combination_id from the partitioning scheme altogether if it is not contributing to better performance at all.
The only real advantage of using combination_id is being able to use the BETWEEN operator in a select:
select * from mytable where combination_id between 2016071813 and 2016071823
But if this is not going to take advantage of the partitioning scheme, it is going to hamper performance.
Yes. Hive partitioning is hierarchical.
You can easily check this by printing the partitions of the table with the query below.
show partitions MyTable;
Output:
year=2016/month=5/day=5/hour=5/combination_id=2016050505
year=2016/month=5/day=5/hour=6/combination_id=2016050506
year=2016/month=5/day=5/hour=7/combination_id=2016050507
In your scenario, you don't need to specify combination_id as a partition column if you are not using it for querying.
You can partition either by the year, month, day, and hour columns, or by combination_id only.
Partitioning by multiple columns helps performance for grouping operations.
Say you want to find the maximum of col1 for the month of March across the years 2016 and 2015.
Hive can easily fetch the records by going to the specific year partitions (year=2016, year=2015) and the month partition (month=3).
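A minimal sketch of that grouping example, assuming the year/month/day/hour layout from the question (the query itself is only an illustration):

-- Sketch only: the WHERE clause touches just the matching year/month partitions.
SELECT year, MAX(col1) AS max_col1
FROM MyTable
WHERE year IN (2015, 2016) AND month = 3
GROUP BY year;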
I have a table in a database that has 7k plus records. I have a query that searches for a particular id in that table (id is auto-incremented).
The query is like this:
select id,date_time from table order by date_time DESC;
This query searches across all of that 7k+ data. Isn't there any way I can optimize it so that the search is made only on 500 or 1000 records? These records will increase day by day, and my query will become heavier and heavier. Any suggestions?
I don't know if I'm missing something here, but what's wrong with:
select id,date_time from table where id=?id order by date_time DESC;
and ?id is the number of the id you are searching for...
And of course id should be a primary index.
If id is unique (possibly your primary key), then you don't need to sort by date_time and you're guaranteed to get back at most one row.
SELECT id, date_time FROM table WHERE id=<value>;
If id is not unique, then you can still use the same query, but you need to look at indexes, other constraints, and/or caching outside the database if the query becomes too slow.
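If id does turn out to be non-unique, a minimal sketch of the kind of index the previous answer alludes to (the table name is a placeholder, just as "table" in the question is):

-- Sketch only: finds the matching rows by id and returns them already ordered by date_time.
CREATE INDEX idx_id_date_time ON your_table (id, date_time);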