I was searching for best practices for creating partitions by date using amazon-redshift-spectrum, but the examples only show the problem being solved by partitioning the table on a single date. What should I do if I have more than one date field?
E.g. mobile events with user_install_date and event_date.
How performant is it to partition your S3 data like:
installdate=2015-01-01/eventdate=2017-01-01
installdate=2015-01-01/eventdate=2017-01-02
installdate=2015-01-01/eventdate=2017-01-03
Will it kill my SELECT performance? What is the best strategy in this case?
If your data were partitioned in the above manner, then a query that only had eventdate in the WHERE clause (without installdate) would be less efficient.
It would still need to look through every installdate directory, but within each one it could skip over the eventdate directories that do not match the predicate.
Put the less-used parameter second.
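For what it's worth, here is a hedged sketch of how that layout maps onto a Spectrum external table (the schema, table, column names and S3 bucket are all made up):
CREATE EXTERNAL TABLE spectrum.mobile_events
(
    user_id    VARCHAR(64),
    event_name VARCHAR(64)
)
PARTITIONED BY (installdate DATE, eventdate DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/mobile-events/';

-- one partition is registered per installdate/eventdate prefix (this can be automated, e.g. via Glue)
ALTER TABLE spectrum.mobile_events
ADD PARTITION (installdate = '2015-01-01', eventdate = '2017-01-01')
LOCATION 's3://my-bucket/mobile-events/installdate=2015-01-01/eventdate=2017-01-01/';

-- filtering on both columns prunes straight to a handful of prefixes; filtering only on
-- eventdate still has to visit every installdate branch, as described above
SELECT COUNT(*)
FROM spectrum.mobile_events
WHERE installdate = '2015-01-01'
  AND eventdate BETWEEN '2017-01-01' AND '2017-01-03';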
I am new to SQL and have been trying to optimize the query performance of my microservices against my DB (Oracle SQL).
From my research I learned that you can use indexing and partitioning to improve query performance. I think I understand each concept and how to apply it, but I'm not sure about the difference between the two.
For example, suppose I have a table Orders with 100 million entries and these columns:
OrderId (PK)
CustomerId (6 digit unique number)
Order (what the order is. Ex: "Burger")
CreatedTime (Timestamp)
In essence, both methods "subdivide" the Orders table so that a DB query won't need to scan through all 100 million entries, right?
Let's say I want to find orders placed on "2020-01-30". I can create an index on CreatedTime to improve performance.
But I could also create a partition on CreatedTime (one partition per day) to improve performance.
Are there any differences between the two methods in this case? Is one better than the other?
There are several ways to partition - by range, by list, by hash, and by reference (although that tends to be uncommon).
If you partition by a date column, it would usually be by range, so one month/week/day goes into one partition, the next into another, and so on. If you want to filter rows where this date equals a value, the database can do a full scan of just the partition that houses that data. This can be quite efficient if most of the data in the partition matches your filter - apply the same reasoning you would to a full table scan in general, just with the table already filtered down to one partition. If, however, you are looking for an hour-long date range and you partition by month, you will be reading about 730 times more data than necessary. Local indexes are useful in that scenario.
Smaller partitions help with this too, but you can end up with thousands of partitions. If you have selective queries that don't know which partition needs to be read, you may want global indexes - but these add a lot of effort to all partition maintenance operations.
If you index the date column instead, you can quickly establish the location of the rows in the table that meet your filter. This is easy in an index because it is just a sorted list: you find the first key that matches the filter and read until it no longer matches. You then have to look up those rows using single-block reads. The usual efficiency rules of an index apply - the less data your filters leave to be read, the more useful the index will be.
Usually, queries include more than just a date filter. These additional filters might be more appropriate for your partitioning scheme. You could also just include those columns in your index (remembering the Golden Rule of Indexing: columns you filter with equality predicates go before columns you filter with range predicates).
You can generally get all the performance you need with just indexing. Partitioning really comes into play when you have important queries that need to read huge chunks of data (generally reporting queries) or when you need to do things like purge data older than X months.
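To make the two options concrete, here is a rough sketch in Oracle syntax, using names adapted from the question (Order is renamed order_item because ORDER is a reserved word, and CreatedTime is shown as a DATE, which in Oracle also holds the time of day):
-- daily interval (range) partitioning: Oracle creates one partition per day automatically
CREATE TABLE orders
(
    order_id     NUMBER        PRIMARY KEY,
    customer_id  NUMBER(6)     NOT NULL,
    order_item   VARCHAR2(100) NOT NULL,
    created_time DATE          NOT NULL
)
PARTITION BY RANGE (created_time)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p_first VALUES LESS THAN (DATE '2020-01-01')
);

-- a local index is partitioned the same way and helps selective filters within a single day
CREATE INDEX orders_created_time_lix ON orders (created_time) LOCAL;

-- the pure indexing alternative on an unpartitioned table would simply be:
-- CREATE INDEX orders_created_time_ix ON orders (created_time);
A query like WHERE created_time >= DATE '2020-01-30' AND created_time < DATE '2020-01-31' can then be answered by pruning to a single daily partition, by the index, or by both.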
I am working on a ledger with a table of Transactions. Each entry has a transaction_id, account_id, timestamp and other metadata. I need to query for all Transactions for a given account_id with a between operator on timestamp
My planned approach was to build indexes on account_id, transaction_id and timestamp. However, I have noted a limitation on inequalities and indexes in the AWS documentation, and I had planned to apply an inequality to timestamp:
Query performance is improved only when you use an equality predicate;
for example, fieldName = 123456789.
QLDB does not currently honor inequalities in query predicates. One
implication of this is that range filtered scans are not implemented.
...
Warning
QLDB requires an index to efficiently look up a document. Without an
indexed field in the WHERE predicate clause, QLDB needs to do a table
scan when reading documents. This can cause more query latency and can
also lead to more concurrency conflicts.
Transactions would be generated and grow indefinitely over time, and I would need to be able to query a week's worth of transactions at a time.
Current Query:
SELECT *
FROM Transactions
WHERE "account_id" = 'test_account' and "timestamp" BETWEEN `2020-07-05T00:00Z` AND `2020-07-12T00:00Z`
I know it is possible to stream the data to a database better suited to this query, such as DynamoDB, but I would like to know whether my performance concerns about the above query are valid and, if so, what the recommended indexes and query are to ensure this scales and does not result in a scan across all transactions for the given account_id.
Thanks for your question (well written and researched)!
QLDB, at the time of writing, does not support range indexes. So, the short answer is "you can't."
I'd be interested to know the intention behind your query. For example, is getting a list of transactions between two dates something you need to do to form a new transaction, or is it something you need for reporting purposes (e.g. displaying a user statement)?
Nearly every use-case I've encountered thus far is the latter (reporting), and is much better served by replicating data to something like ElasticSearch or Redshift. Typically this can be done with a couple of lines of code in a Lambda function and the cost is extremely low.
QLDB has a history() function that works like a charm to generate statements, since you can pass one or two dates as arguments for start and/or end dates.
You see, this is where QLDB gets tricky: when you think of it as a relational database.
The caveat here is that you would have to change your transactions to be updates in the account table rather than new inserts in a different table. This is because, by design, QLDB gives you the ledger of any table. Meaning you can later check all versions of that record and filter them as well.
Here's an example of what a history query would look like in an Accounts table:
SELECT ha.data.* from Accounts a by Accounts_id
JOIN history(Accounts, `2022-04-10T00:00:00.000Z`, `2022-04-13T23:59:59.999Z`) ha
ON ha.metadata.id = Accounts_id
WHERE a.account_id = 1234
The unusual-looking BY Accounts_id clause binds each document's id, which is what lets you join the table to its history; the WHERE clause then filters on an indexed column - in this case, account_id.
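For completeness, the only index this approach needs is the equality index the join and WHERE clause rely on (a sketch, assuming an Accounts table keyed by account_id; the date range itself is handled by history(), not by an index):
CREATE INDEX ON Accounts (account_id)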
Our use case for BigQuery is a little unique. I want to start using date-partitioned tables, but our data is very much eventual. It doesn't get inserted when it occurs, but eventually, when it is provided to the server. At times this can be days or even months after the fact. Thus, the _PARTITION_LOAD_TIME attribute is useless to us.
My question is: is there a way I can specify the column that would act like the _PARTITION_LOAD_TIME argument and still have the benefits of a date-partitioned table? If I could emulate this manually and have BigQuery update accordingly, then I could start using date-partitioned tables.
Anyone have a good solution here?
You don't need to create your own column.
The _PARTITIONTIME pseudo column will still work for you!
All you need to do is insert/load each data batch into its respective partition by referencing not just the table name but the table name with a partition decorator - like yourtable$20160718.
This way you can load data into the partition it belongs to.
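For example, a hedged sketch with the bq CLI (dataset, table, bucket and file format are placeholders; the decorator suffix is just the target partition's date in YYYYMMDD form):
bq load --source_format=NEWLINE_DELIMITED_JSON \
  'mydataset.yourtable$20160718' \
  gs://my-bucket/batch-2016-07-18.json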
I have an aggregate data set that spans multiple years. The data for each respective year is stored in a separate table named Data. The data is currently sitting in MS Access tables, and I will be migrating it to SQL Server.
I would prefer to keep the data for each year in separate tables, to be merged and queried at runtime. I do not want to do this at the expense of efficiency, however, as each year holds approx. 1.5M records of 40-ish fields.
I am trying to avoid an excessive number of UNIONs in the query. I would also like to avoid having to edit the query as each new year is added, leading to an ever-expanding list of UNIONs.
Is there an easy way to do these UNIONs at runtime without an extensive SQL query and high system utilization? Or, if all the data should be managed in one large table, is there a quick and easy way to append all the tables together in a single query?
If you really want to store them in separate tables, then I would create a view that does that unioning for you.
create view AllData
as
(
select * from Data2001
union all
select * from Data2002
union all
select * from Data2003
)
But to be honest, if you are going to use this, why not put all the data into one table? Then, if you wanted, you could create the per-year views the other way around.
create view Data2001
as
(
select * from AllData
where CreateDate >= '1/1/2001'
and CreateDate < '1/1/2002'
)
A single table is likely the best choice for this type of query. However, you have to balance that against the other work the DB is doing.
One choice you did not mention is creating a view that contains the unions and then querying the view. That way you only have to add a UNION statement to the view each year, and all queries using the view will stay correct. Personally, if I did that, I would write a creation script that creates the new year's table and then adjusts the view to add the UNION for that table. Once it was tested and I knew it would run, I would schedule it as a job to run on the last day of the year.
One way to do this is by using horizontal partitioning.
You basically create a partition function that tells the DBMS how to split the table into separate per-period partitions, with each partition guaranteed to contain only data for a specific year.
At query execution time, the optimiser can decide whether it is possible to completely ignore one or more partitions to speed up execution time.
The setup overhead of such a scheme is non-trivial, and it only really makes sense if you have a lot of data. Although 1.5 million rows per year might seem like a lot, depending on your query plans it shouldn't be any big deal (for a decently specced SQL Server). Refer to the documentation for details.
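As a rough sketch of what that might look like in SQL Server (the function, scheme, and boundary dates below are made up; one partition per year, all on the PRIMARY filegroup):
CREATE PARTITION FUNCTION pfDataYear (datetime)
    AS RANGE RIGHT FOR VALUES ('2001-01-01', '2002-01-01', '2003-01-01');

CREATE PARTITION SCHEME psDataYear
    AS PARTITION pfDataYear ALL TO ([PRIMARY]);

CREATE TABLE AllData
(
    Id         int      NOT NULL,
    CreateDate datetime NOT NULL,
    -- ...the remaining 40-ish columns go here...
    CONSTRAINT PK_AllData PRIMARY KEY (Id, CreateDate)
) ON psDataYear (CreateDate);
With RANGE RIGHT, each boundary date starts a new partition, so rows for 2002 land in the partition that begins at '2002-01-01'.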
I can't add comments due to low rep, but I definitely agree with one table. Partitioning is helpful for large data sets and is supported in SQL Server, where the data is being migrated to.
If the data is heavily used and frequently updated then monthly partitioning might be useful, but if not, given the size, partitioning probably isn't going to be very helpful.
I need to write a query that will group a large number of records by periods of time from Year to Hour.
My initial approach has been to decide the periods procedurally in C#, iterate through each, and run the SQL to get the data for that period, building up the dataset as I go:
SELECT Sum(someValues)
FROM table1
WHERE deliveryDate BETWEEN #fromDate AND #toDate
I've subsequently discovered I can group the records using Year(), Month(), Day(), datepart(week, date) and datepart(hh, date):
SELECT Sum(someValues)
FROM table1
GROUP BY Year(deliveryDate), Month(deliveryDate), Day(deliveryDate)
My concern is that using datepart in a GROUP BY will lead to worse performance than running the query multiple times for set periods, because the index on the datetime field can't be used as efficiently. Any thoughts on whether this is true?
Thanks.
As with anything performance related: measure.
Checking the query plan for the second approach will tell you about any obvious problems in advance (a full table scan when you know one is not needed), but there is no substitute for measuring. In SQL performance testing, that measurement should be done with appropriately sized test data.
Since this is a complex case - you are not simply comparing two different ways to do a single query, but a single-query approach against an iterative one - aspects of your environment may play a major role in the actual performance. Specifically:
1. the 'distance' between your application and the database, since the latency of each call is wasted time compared to the one-big-query approach
2. whether you are using prepared statements or not (causing additional parsing effort for the database engine on each query)
3. whether constructing the range queries itself is costly (heavily influenced by 2)
If you put a formula around the column side of a comparison, you get a table scan.
The index is on the field, not on datepart(field), so the expression has to be evaluated for every row - I think your hunch is right.
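To illustrate, against the same table (a minimal sketch):
-- not sargable: the function wraps the indexed column, so every row has to be evaluated
SELECT Sum(someValues) FROM table1 WHERE Year(deliveryDate) = 2008

-- sargable: the bare column is compared to constants, so an index range seek is possible
SELECT Sum(someValues) FROM table1 WHERE deliveryDate >= '2008-01-01' AND deliveryDate < '2009-01-01'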
You could do something similar to this:
SELECT Sum(someValues)
FROM
(
SELECT *, Year(deliveryDate) as Y, Month(deliveryDate) as M, Day(deliveryDate) as D
FROM table1
WHERE deliveryDate BETWEEN #fromDate AND #toDate
) t
GROUP BY Y, M, D
If you can tolerate the performance hit of joining in one more table, I have a suggestion that seems odd but works really well.
Create a table that I'll call ALMANAC, with columns like weekday, month, and year. You can even add columns for company-specific features of a date, such as whether the date is a company holiday. You might also want to add starting and ending timestamps, as referenced below.
Although you might get by with one row per day, when I did this I found it convenient to use one row per shift, with three shifts in a day. Even at that rate, a period of ten years was only a little over 10,000 rows.
When you write the SQL to populate this table, you can use all the built-in date functions to make the job easier. When you query, you can use the date column as a join condition, or you may need the two timestamps to provide a range for catching timestamps that fall within it. The rest is as easy as working with any other kind of data.
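A rough sketch of what the table and a query against it might look like (one row per day rather than per shift; all names besides table1, deliveryDate and someValues are made up):
CREATE TABLE Almanac
(
    day_start   datetime NOT NULL PRIMARY KEY,
    day_end     datetime NOT NULL,
    day_of_week tinyint  NOT NULL,
    month_no    tinyint  NOT NULL,
    year_no     smallint NOT NULL,
    is_holiday  bit      NOT NULL DEFAULT 0
)

-- the date parts are precalculated, so grouping becomes a range join plus an ordinary GROUP BY
SELECT a.year_no, a.month_no, Sum(t.someValues)
FROM table1 t
JOIN Almanac a
  ON t.deliveryDate >= a.day_start
 AND t.deliveryDate <  a.day_end
GROUP BY a.year_no, a.month_no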
I was looking for similar solution for reporting purposes, and came across this article called Group by Month (and other time periods). It shows various ways, good and bad, to group by the datetime field. Definitely worth looking at.
I think you should benchmark it to get reliable results, but IMHO my first thought is that letting the DB take care of it (your 2nd approach) will be much faster than doing it in your client code.
With your first approach you have multiple round trips to the DB, which I think will be far more expensive. :)
You may want to look at a dimensional approach (similar to what Walter Mitty has suggested), where each row has a foreign key to a date and/or time dimension. This allows very flexible summations through a join to this table, where the date parts are precalculated. In these cases, the key is usually a natural integer of the form YYYYMMDD or HHMMSS, which is relatively performant and also human readable.
Another alternative might be indexed views, where there are separate expressions for each of the date parts.
Or calculated columns.
But performance has to be tested and execution plans examined...
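For the indexed-view idea, a hedged sketch in SQL Server syntax (it assumes someValues is not nullable and that the usual indexed-view restrictions - SCHEMABINDING, two-part table names, COUNT_BIG(*) alongside the aggregate - are acceptable):
CREATE VIEW dbo.DeliveriesByDay
WITH SCHEMABINDING
AS
SELECT Year(deliveryDate)  AS Y,
       Month(deliveryDate) AS M,
       Day(deliveryDate)   AS D,
       Sum(someValues)     AS TotalValues,
       COUNT_BIG(*)        AS RowCnt
FROM dbo.table1
GROUP BY Year(deliveryDate), Month(deliveryDate), Day(deliveryDate)
GO

-- materialising the view makes the precalculated date parts directly indexable
CREATE UNIQUE CLUSTERED INDEX IX_DeliveriesByDay ON dbo.DeliveriesByDay (Y, M, D)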