We have a large database with monthly partitioned tables. I need to aggregate a selection of these tables every month, but I don't want to edit the UNION ALL every month to add the new monthly table.
CREATE VIEW dynamic_view AS
SELECT timestamp,
traffic
FROM traffic_table_m_2017_01
UNION ALL
SELECT timestamp,
traffic
FROM traffic_table_m_2017_02
Is this where I would use a stored procedure? I am not really familiar with them.
I think it would also work as:
SELECT timestamp,
traffic
FROM REPLACE(REPLACE('traffic_table_m_yyyy_mm',
yyyy, FORMAT(GETDATE(),'yyyy', 'en-us')),
mm, FORMAT(GETDATE(),'mm', 'en-us'));
This might work for the current month but I would need to save the data from the past months which would also be an issue.
You should append each table to one larger table as it arrives, then run your queries against that. There are many ways to do this, but probably the fastest and most elegant is to use:
ALTER TABLE APPEND
Instructions here https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE_APPEND.html
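A minimal sketch of the "append each monthly table to one larger table" idea, written in Python against an in-memory SQLite database so it can be run anywhere. The table name traffic_all, the source_table bookkeeping column and the helper append_monthly_tables are assumptions made for illustration; on Redshift the INSERT ... SELECT step would be where ALTER TABLE APPEND goes, and that command moves the rows instead of copying them.

```python
import sqlite3


def append_monthly_tables(conn, target="traffic_all", prefix="traffic_table_m_"):
    """Copy every monthly table not yet loaded into one consolidated table.

    `target` and the `source_table` column are hypothetical names.
    Returns the list of monthly tables appended by this call.
    """
    cur = conn.cursor()
    cur.execute(
        f"CREATE TABLE IF NOT EXISTS {target} "
        "(source_table TEXT, timestamp TEXT, traffic INTEGER)"
    )
    # Discover monthly tables by name from SQLite's catalog.
    all_tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    monthly = sorted(n for n in all_tables if n.startswith(prefix))
    # Tables whose rows are already present in the consolidated table.
    loaded = {r[0] for r in cur.execute(
        f"SELECT DISTINCT source_table FROM {target}")}
    added = []
    for name in monthly:
        if name in loaded:
            continue
        cur.execute(
            f'INSERT INTO {target} SELECT ?, timestamp, traffic FROM "{name}"',
            (name,),
        )
        added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
for month, traffic in (("2017_01", 10), ("2017_02", 20)):
    conn.execute(
        f"CREATE TABLE traffic_table_m_{month} (timestamp TEXT, traffic INTEGER)")
    conn.execute(
        f"INSERT INTO traffic_table_m_{month} VALUES (?, ?)",
        (month.replace("_", "-") + "-15", traffic),
    )

first_run = append_monthly_tables(conn)   # loads both monthly tables
second_run = append_monthly_tables(conn)  # nothing new, so nothing is appended
total = conn.execute("SELECT SUM(traffic) FROM traffic_all").fetchone()[0]
```

Run on a schedule, a job like this picks up each new monthly table once, and the reporting queries only ever reference the consolidated table.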
Consider the following scenario:
I have a table, products, with 1 million product ids:
create table products (
pid number,
p_description varchar2(200)
)
There is also a relatively slow function
function getProductMetrics(pid,date) return number
which returns some metric for the given product at the given date.
There is also an annual report, executed every year, that is based on the following query:
select pid,p_description,getProductMetrics(pid,'2019-12-31') from
products
That query takes about 20-40 minutes to execute for a given year.
Would it be a correct approach to create a materialized view (MV) for this scenario, using the following:
CREATE TABLE mydates
(
mydate date
);
INSERT INTO mydates (mydate)
VALUES (DATE '2019-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2018-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2017-12-31');
CREATE MATERIALIZED VIEW metrics_summary
BUILD IMMEDIATE
REFRESH FORCE ON DEMAND
AS
SELECT pid,
getProductMetrics(pid, mydate) AS annual_metric,
mydate
FROM products,mydates
Or would it take forever?
Also, how and how often would I update this MV?
Metrics data is required for the end of each year.
But any year's data could be requested at any time.
Note, that I have no control over the slow function - it's just a given.
thanks.
First, you do not have a "group by" query, so you can remove that.
An MV would be most useful if you needed to recompute all of the data for all years. As this appears to be a summary, with no need to reprocess old data, updated only when certain threshold dates like end of year are passed, I would recommend putting the results in a normal table and only adding the updates as often as your threshold dates occur (annually?) using a stored procedure. Otherwise your MV will take longer to run and require more system resources with every execution that adds a new date.
Do not create a materialized view. This is not just a performance issue. It is also an archiving issue: You don't want to run the risk that historical results could change.
My advice is to create a single table with a "year" column. Run the query once per year and insert the rows into the new table. This is an archive of the results.
Note: If you want to recalculate previous years because the results may have changed (say the data is updated somehow), then you should store those results in a separate table and decide which version is the "right" version. You may find that you want an archive table with both the "as-of" date and the "run-date" to see how results might be changing.
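As a rough illustration of the archive-table approach, here is a sketch in Python with an in-memory SQLite database. The table metrics_archive, its columns, the helper archive_year and the toy get_product_metrics body are all assumptions; the real slow function and the real products table would take their place. Each year is computed once and stored with both the "as-of" date and the "run-date", and a second call for the same year does nothing.

```python
import sqlite3
from datetime import date


def get_product_metrics(pid, as_of):
    """Hypothetical stand-in for the slow function from the question."""
    return pid * as_of.year


def archive_year(conn, year, run_date):
    """Compute year-end metrics once and store them; return rows inserted."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics_archive ("
        " pid INTEGER, as_of_date TEXT, run_date TEXT, annual_metric REAL,"
        " PRIMARY KEY (pid, as_of_date, run_date))"
    )
    as_of = date(year, 12, 31)
    already = conn.execute(
        "SELECT COUNT(*) FROM metrics_archive WHERE as_of_date = ?",
        (as_of.isoformat(),),
    ).fetchone()[0]
    if already:
        # This year is archived; historical results are left untouched.
        return 0
    pids = [r[0] for r in conn.execute("SELECT pid FROM products")]
    rows = [
        (pid, as_of.isoformat(), run_date.isoformat(),
         get_product_metrics(pid, as_of))
        for pid in pids
    ]
    conn.executemany("INSERT INTO metrics_archive VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (pid INTEGER, p_description TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

inserted = archive_year(conn, 2019, date(2020, 1, 2))        # slow work runs once
inserted_again = archive_year(conn, 2019, date(2020, 6, 1))  # no recomputation
metric = conn.execute(
    "SELECT annual_metric FROM metrics_archive"
    " WHERE pid = 2 AND as_of_date = '2019-12-31'"
).fetchone()[0]
```

The annual report then reads from metrics_archive filtered by as_of_date, so any year can be requested at any time without calling the slow function again.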
I'm developing an application for managing deliveries in a company, using NetBeans and phpMyAdmin.
Every day I have to save hundreds of deliveries to the database, each with its own specific data but all with that day's date, so that I can query them later, for example:
select * from table_1 where date = '02/10/2016'
I can create a field in the table with type "date", but this date will be repeated hundreds of times in the table just to specify one day, and again for the next day, and so on...
What's the best way to stop the redundancy?
You could use datetime as its type, which will reduce your redundancy.
When you wish to retrieve all entries for a particular date, you could try using select * from table_1 where date LIKE '2016-10-02%' in your query. Note that the column name must not be wrapped in single quotes, otherwise it is compared as a string literal.
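For what it's worth, a half-open range on the datetime column selects the same day as the LIKE pattern and is usually easier for the database to serve from an index. This is a different technique from the LIKE query, sketched here in Python with SQLite (which stores the values as text); the sample rows and the helper deliveries_on are made up for illustration. The same predicate shape applies to a MySQL DATETIME column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_1 (id INTEGER PRIMARY KEY, date TEXT)")
conn.executemany(
    "INSERT INTO table_1 (date) VALUES (?)",
    [
        ("2016-10-01 23:59:59",),  # id 1: the day before
        ("2016-10-02 00:00:00",),  # id 2: first instant of the day
        ("2016-10-02 17:30:00",),  # id 3: later the same day
        ("2016-10-03 00:00:00",),  # id 4: first instant of the next day
    ],
)


def deliveries_on(conn, day, next_day):
    """Rows with day <= date < next_day (half-open range)."""
    return conn.execute(
        "SELECT id, date FROM table_1"
        " WHERE date >= ? AND date < ? ORDER BY date",
        (day, next_day),
    ).fetchall()


rows = deliveries_on(conn, "2016-10-02", "2016-10-03")
ids = [r[0] for r in rows]
```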
I know there are a lot of solutions to this but I am looking for a simple query to get all the dates between two dates.
I cannot declare variables.
As per the comment above, it's just guesswork without your table structures and further detail. Also, are you using a 3NF database or star schema structures? Is this a transaction system or a data warehouse?
As a general answer, I would recommend creating a Calendar table, that way you can create multiple columns for Working Day, Weekend Day, Business Day, etc. and add a date key value, starting at 1 and incrementing each day.
Your query then is a very simple sub-select or join to the table to do something like
SELECT date FROM Calendar WHERE date BETWEEN <x> AND <y>
How to create a Calender table for 100 years in Sql
There are other options like creating the calendar table using iterations (eg, as a CTE table) and linking to that.
SQL - Create a temp table or CTE of first day of the month and month names
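A small sketch of the CTE option, run through Python's sqlite3 module so it can be tried directly. The recursive CTE named calendar and the helper dates_between are illustrative names, and the date() arithmetic is SQLite's; other databases spell the "add one day" step differently. No variables are declared, only bound parameters.

```python
import sqlite3


def dates_between(conn, start, end):
    """Return every date from start to end inclusive as ISO strings."""
    sql = """
        WITH RECURSIVE calendar(d) AS (
            -- Anchor row: the start date, only if the range is not empty.
            SELECT date(:start) WHERE date(:start) <= date(:end)
            UNION ALL
            -- Recursive step: add one day until the end date is reached.
            SELECT date(d, '+1 day') FROM calendar WHERE d < date(:end)
        )
        SELECT d FROM calendar
    """
    return [r[0] for r in conn.execute(sql, {"start": start, "end": end})]


conn = sqlite3.connect(":memory:")
span = dates_between(conn, "2016-09-29", "2016-10-02")
leap = dates_between(conn, "2016-02-28", "2016-03-01")
empty = dates_between(conn, "2016-10-02", "2016-09-29")
```

For repeated use, the output of such a query could be inserted once into a permanent Calendar table, which then also has room for the working-day and weekend columns mentioned above.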
My question is about table partitioning in SQL Server 2008.
I have a program that loads data into a table every 10 mins or so. Approx 40 million rows per day.
The data is bcp'ed into the table and needs to be able to be loaded very quickly.
I would like to partition this table based on the date the data is inserted into the table. Each partition would contain the data loaded in one particular day.
The table should hold the last 50 days of data, so every night I need to drop any partitions older than 50 days.
I would like to have a process that aggregates data loaded into the current partition every hour into some aggregation tables. The summary will only ever run on the latest partition (since all other partitions will already be summarised) so it is important it is partitioned on insert_date.
Generally when querying the data, the insert date is specified (or multiple insert dates). The detailed data is queried by drilling down from the summarised data and as this is summarised based on insert date, the insert date is always specified when querying the detailed data in the partitioned table.
Can I create a default column in the table "Insert_date" that gets a value of Getdate() and then partition on this somehow?
OR
I can create a column in the table "insert_date" and put a hard coded value of today's date.
What would the partition function look like?
Would separate tables and a partitioned view be better suited?
I have tried both, and I think partitioned tables are cooler. But after trying to teach others how to maintain the code afterwards, it just wasn't justified. In that scenario we used a hard-coded date field that was set in the insert statement.
Now I use separate tables (31 days / 31 tables) plus an aggregation table, and there is an ugly UNION ALL query that joins together the monthly data.
Advantage: super simple SQL and simple C# code for bcp, and nobody has complained about complexity.
But if you have the infrastructure and a gaggle of .NET / SQL gurus, I would choose the partitioning strategy.
I'm partitioning my data on BigQuery by day, and I want a quick way to query "yesterday's data".
Is this possible? How can I write queries that automatically point to the latest data, without having to re-write the tables I want to query?
You can create a view with TABLE_QUERY to find yesterday's (or an arbitrary relative date) data.
For example, GitHubArchive stores daily tables, and I created a view that points to yesterday's table:
SELECT *
FROM TABLE_QUERY(githubarchive:day, 'table_id CONTAINS "events_"
AND table_id CONTAINS STRFTIME_UTC_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -1, "day"), "%Y%m%d")')
You can test and query this view:
SELECT COUNT(*)
FROM [fh-bigquery:public_dump.github_yesterday]
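If the table name has to be built on the client side instead, the same date arithmetic is only a few lines. A sketch in Python; the helper yesterday_table_id and its now parameter are assumptions for illustration, and the "events_" prefix with a "%Y%m%d" suffix is taken from the TABLE_QUERY filter above.

```python
from datetime import datetime, timedelta, timezone


def yesterday_table_id(now=None, prefix="events_"):
    """Build the daily table id for the day before `now` (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    return prefix + (now - timedelta(days=1)).strftime("%Y%m%d")


# Fixed timestamps make the month and year boundaries easy to check.
march_first = yesterday_table_id(datetime(2017, 3, 1, tzinfo=timezone.utc))
new_year = yesterday_table_id(datetime(2017, 1, 1, tzinfo=timezone.utc))
```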