My question is about table partitioning in SQL Server 2008.
I have a program that loads data into a table every 10 mins or so. Approx 40 million rows per day.
The data is bcp'ed into the table and needs to load very quickly.
I would like to partition this table based on the date the data is inserted into the table. Each partition would contain the data loaded in one particular day.
The table should hold the last 50 days of data, so every night I need to drop any partitions older than 50 days.
I would like to have a process that aggregates data loaded into the current partition every hour into some aggregation tables. The summary will only ever run on the latest partition (since all other partitions will already be summarised) so it is important it is partitioned on insert_date.
Generally when querying the data, the insert date is specified (or multiple insert dates). The detailed data is queried by drilling down from the summarised data and as this is summarised based on insert date, the insert date is always specified when querying the detailed data in the partitioned table.
Can I create a column "insert_date" in the table with a default value of GETDATE() and then partition on this somehow?
OR
I could create a column "insert_date" in the table and insert a hard-coded value of today's date.
What would the partition function look like?
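I am imagining something roughly like this (the function, scheme, and table names below are just placeholders, and there would be one boundary value per day), but I'm not sure it's right:

-- Placeholder sketch: daily RANGE RIGHT partition function on a date column
CREATE PARTITION FUNCTION pf_insert_date (date)
AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-01-02', '2012-01-03'); -- ...one boundary per day

CREATE PARTITION SCHEME ps_insert_date
AS PARTITION pf_insert_date ALL TO ([PRIMARY]);

CREATE TABLE dbo.DetailData
(
insert_date date NOT NULL DEFAULT (CONVERT(date, GETDATE())),
some_value  int NULL   -- ...plus the real bcp'ed columns
) ON ps_insert_date (insert_date);

-- Nightly: switch out and merge the boundary for any day older than 50 days
-- (ALTER TABLE ... SWITCH PARTITION, then ALTER PARTITION FUNCTION ... MERGE RANGE)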
Would separate tables and a partitioned view be better suited?
I have tried both, and even though I think partitioned tables are cooler, after trying to teach others how to maintain the code afterwards it just wasn't justified. In that scenario we used a hard-coded date field that was set in the insert statement.
Now I use separate tables (31 days / 31 tables) plus an aggregation table, and there is an ugly UNION ALL query that joins together the monthly data.
Advantage: super simple SQL, simple C# code for the bcp, and nobody has complained about complexity.
But if you have the infrastructure and a gaggle of .net / sql gurus I would choose the partitioning strategy.
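For what it's worth, the UNION ALL piece in my setup looks roughly like this (table and view names are made up):

CREATE VIEW dbo.AllDetailData
AS
SELECT * FROM dbo.DetailData_Day01
UNION ALL
SELECT * FROM dbo.DetailData_Day02
UNION ALL
-- ...one SELECT per daily table...
SELECT * FROM dbo.DetailData_Day31;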
I have a scheduled query which runs hourly. I want to partition the table hourly, so in the destination I provided mytable_{run_time|"%Y%m%d%H"}, but this creates a new table for every run in my BigQuery dataset. When I change the destination to mytable_{run_time|"%Y%m%d"}, it partitions the data correctly based on date.
How do I enable hourly partitioning in BigQuery?
What you are doing is aligned with table sharding, which you can do, but it is not as performant and involves more management. In theory it acts similarly to a partition, but it is not the same. What you are likely seeing when you use the format mytable_{run_time|"%Y%m%d"} is that you are inserting multiple hours into the same day's table, and depending on your table definition the data may be partitioned within that single day.
You will want to define the partition when you create the table; see below:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_a_time-unit_column-partitioned_table
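For example, an hourly time-unit column partitioned table can be created along these lines (dataset, table, and column names here are placeholders); your scheduled query can then append to this single table instead of creating a new shard every run:

-- Placeholder names; partitions the table by hour on event_time
CREATE TABLE mydataset.mytable
(
  event_time TIMESTAMP,
  payload    STRING
)
PARTITION BY TIMESTAMP_TRUNC(event_time, HOUR);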
Consider the following scenario:
I have a table products with 1 million product ids:
create table products (
pid number,
p_description varchar2(200)
)
There is also a relatively slow function
function getProductMetrics(pid, date) return number
which returns some metric for the given product at the given date.
There is also an annual report, executed every year, that is based on the following query:
select pid,p_description,getProductMetrics(pid,'2019-12-31') from
products
that query takes about 20-40 minutes to execute for a given year.
Would it be a correct approach to create a materialized view (MV) for this scenario using the following?
CREATE TABLE mydates
(
mydate date
);
INSERT INTO mydates (mydate)
VALUES (DATE '2019-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2018-12-31');
INSERT INTO mydates (mydate)
VALUES (DATE '2017-12-31');
CREATE MATERIALIZED VIEW metrics_summary
BUILD IMMEDIATE
REFRESH FORCE ON DEMAND
AS
SELECT pid,
       getProductMetrics(pid, mydate) AS annual_metric,
       mydate
FROM products, mydates
Or would it take forever?
Also, how and how often would I update this MV?
Metrics data is required for the end of each year.
But any year's data could be requested at any time.
Note, that I have no control over the slow function - it's just a given.
thanks.
First, you do not have a "group by" query, so you can remove that.
An MV would be most useful if you needed to recompute all of the data for all years. As this appears to be a summary, with no need to reprocess old data, updated only when certain threshold dates like end of year are passed, I would recommend putting the results in a normal table and only adding the updates as often as your threshold dates occur (annually?) using a stored procedure. Otherwise your MV will take longer to run and require more system resources with every execution that adds a new date.
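A rough sketch of what I mean, reusing getProductMetrics from your question (the table and procedure names are made up):

-- Plain results table instead of an MV
CREATE TABLE annual_metrics
(
  pid           NUMBER,
  mydate        DATE,
  annual_metric NUMBER
);

-- Run once per threshold date (e.g. every year end)
CREATE OR REPLACE PROCEDURE load_annual_metrics (p_date IN DATE) AS
BEGIN
  INSERT INTO annual_metrics (pid, mydate, annual_metric)
  SELECT pid, p_date, getProductMetrics(pid, p_date)
  FROM   products;
  COMMIT;
END;
/

Calling load_annual_metrics(DATE '2019-12-31') once is then the only time the slow function runs for that year; reports simply read annual_metrics.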
Do not create a materialized view. This is not just a performance issue. It is also an archiving issue: You don't want to run the risk that historical results could change.
My advice is to create a single table with a "year" column. Run the query once per year and insert the rows into the new table. This is an archive of the results.
Note: If you want to recalculate previous years because the results may have changed (say the data is updated somehow), then you should store those results in a separate table and decide which version is the "right" version. You may find that you want an archive table with both the "as-of" date and the "run-date" to see how results might be changing.
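A sketch of that archive table (names are made up), keeping both the as-of date and the run date so you can compare versions:

CREATE TABLE product_metrics_archive
(
  pid           NUMBER,
  as_of_date    DATE,     -- the year-end the metric describes
  run_date      DATE,     -- when the metric was computed
  annual_metric NUMBER
);

-- Latest computed value per product and year-end
SELECT pid, as_of_date, annual_metric
FROM (SELECT a.*,
             ROW_NUMBER() OVER (PARTITION BY pid, as_of_date ORDER BY run_date DESC) AS rn
      FROM product_metrics_archive a)
WHERE rn = 1;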
How can I see the number of new rows added to each of my database's tables in the past day?
Example result:
table_name new_rows
---------- -----------
users 32
questions 150
answers 98
...
I'm not seeing any table that stores this information in the PostgreSQL statistics collector: http://www.postgresql.org/docs/9.1/static/monitoring-stats.html
The only solution I can think of is to create a database table that stores the row count of each table at midnight each day.
Edit: I need this to work with any table, regardless of whether it has a "created_at" or other timestamp column. Many of the tables I would like to see the growth rate of do not have timestamp columns and can't have one added.
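For the snapshot idea, I picture something like this (the snapshot table name is made up), using the cumulative insert counters that the statistics collector does expose:

-- n_tup_ins is cumulative, so new rows per day = difference
-- between consecutive snapshots for the same table
CREATE TABLE row_count_snapshots
(
  snapshot_date date   NOT NULL DEFAULT current_date,
  table_name    name   NOT NULL,
  total_inserts bigint NOT NULL
);

-- Run once a day (e.g. from cron)
INSERT INTO row_count_snapshots (table_name, total_inserts)
SELECT relname, n_tup_ins
FROM pg_stat_user_tables;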
The easiest way is to add a column to your table that keeps track of the insert/update date.
Then to retrieve the rows, you can do a simple select for the last day.
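For example (the table name is only illustrative):

-- Add an insert-timestamp column with a default
ALTER TABLE users ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

-- Rows added in the past day
SELECT count(*) AS new_rows
FROM users
WHERE created_at >= now() - interval '1 day';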
To my knowledge, and I've done some research to make sure, there is no built-in functionality that allows you to do this without creating such a field.
I have a SQL Server database table that contains a few thousand records. These records are populated by PowerShell scripts on a weekly basis. These scripts basically overwrite last week's data, so the table only has information pertaining to the previous week. I would like to take a copy of that table's data each week and add a date column with that day's date beside each record. I need this so I can do trend analysis in the future.
Unfortunately, I don't have access to the PowerShell scripts to edit them. Is there any way I can accomplish this using MS SQL server or some other way?
You can do the following. Create a table that will contain the clone plus the date. Insert the results from your original table, along with the date, into your clone table. From your description you don't need a WHERE clause, because the original table is wiped each week and only holds the new data. After the initial table creation there is no need to do it again; you'll simply run the insert piece. Obviously the below is very basic and is just to provide you the framework.
CREATE TABLE yourTableClone
(
col1 int,
col2 varchar(5),
...
col5 date
)
insert into yourTableClone
select *, getdate()
from yourOriginalTable
I have a requirement like this: I need to delete all the customers who have not done a transaction for the past 800 days.
I have a table customer where customerID is the primary key.
The creditcard table has columns customerID and CreditcardID, where CreditcardID is the primary key.
The transaction table has columns transactiondatetime, CreditcardID and CreditcardTransactionID, which is the primary key in this table.
All the transaction table data is in a view called CreditcardTransaction, so I am using the view to get the information.
I have written a query to get the credit cards that have done a transaction in the past 800 days, get their CreditcardID, and store them in a table.
As the volume of data in the CreditcardTransaction view is around 60 million rows, the query I have written fails: it logs a message that the log file is full and throws a "system out of memory" exception.
INSERT INTO Tempcard
SELECT CreditcardID,transactiondatetime
FROM CreditcardTransaction WHERE
DATEDIFF(DAY ,CreditcardTransaction.transactiondatetime ,getdate())>600
I need to get the CreditcardID along with the last transactiondatetime.
I need to show this data in an Excel sheet, so I am dumping the data into a table and then inserting it into Excel.
What is the best solution I should go with here?
I am using an SSIS package (VS 2008 R2) where I call an SP to dump data into a table, then apply some business logic, and finally insert the data into an Excel sheet.
Thanks
prince
One thought: using a function in a WHERE clause can slow things down considerably. Consider adding a column named IdleTransactionDays. This lets you use the DATEDIFF function in the SELECT clause instead. Later, you can query the Tempcard table to return the records with IdleTransactionDays greater than 600, similar to this:
DECLARE @DMinus600 datetime = DATEADD(DAY, -600, GETDATE());  -- cutoff, 600 days ago (could also be used to filter the source directly)

INSERT INTO Tempcard
(CreditcardID, transactiondatetime, IdleTransactionDays)
SELECT CreditcardID, transactiondatetime, DATEDIFF(DAY, CreditcardTransaction.transactiondatetime, GETDATE())
FROM CreditcardTransaction

SELECT * FROM Tempcard
WHERE IdleTransactionDays > 600
Hope this helps,
Andy
Currently you're inserting those records row by row. You could create an SSIS package that reads your data with an OLE DB Source component, performs the necessary operations, and bulk inserts the rows (a minimally logged operation) into your destination table.
You could also directly output your rows into an Excel file. Writing rows to an intermediate table decreases performance.
If your source query still times out, investigate whether suitable indexes exist and make sure they are not too fragmented.
You could also partition your source data by year (based on transactiondatetime). This way the data will be loaded in bursts.
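A rough sketch of loading it in yearly bursts, reusing the Tempcard table from the question (the starting year is an assumption; adjust it to your oldest data):

DECLARE @year int = 2005;      -- oldest year in the data (assumption)
DECLARE @start datetime, @end datetime;

WHILE @year <= YEAR(GETDATE())
BEGIN
    SET @start = DATEADD(YEAR, @year - 1900, 0);   -- Jan 1st of @year
    SET @end   = DATEADD(YEAR, @year - 1899, 0);   -- Jan 1st of the following year

    INSERT INTO Tempcard (CreditcardID, transactiondatetime)
    SELECT CreditcardID, transactiondatetime
    FROM CreditcardTransaction
    WHERE transactiondatetime >= @start
      AND transactiondatetime < @end;

    SET @year = @year + 1;
END;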