Is there a performant way to search by a non-partitioned column in CrateDB?

My team and I have been using CrateDB for one of our projects over the past few years. We have a table with hundreds of millions of records, and performance is key.
As we've developed more and more features on this project, we've run into an interesting problem. We have a column on this table labeled 'persist_date', which is when the record actually got persisted into the table. These dates may not always align, and we could have a start_date of 2021-06-21 with a persist_date of 2021-10-14.
All of our queries up to this point have easily been able to filter against the start_date partition. Now we are encountering a problem that requires us to query against a non-partitioned column (persist_date).
As I understand it, CrateDB is really performant, but only when you query against one specific partition at a time. My question now is: how would I go about creating a partition for this other date column without duplicating my data? Is there anything other than a partition that might help, such as the way the table is clustered?

You could use both columns as partition values.
e.g.
CREATE TABLE two_parted (a TEXT, b TEXT, val DOUBLE) PARTITIONED BY (a,b);
If either a or b is used in a selection, this limits the query to the shards that hold the matching value. However, this can lead to many more shards, so you might want to partition not on a daily basis, but on a weekly or monthly one.
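For your start_date/persist_date case, a sketch could look like this (hypothetical table and column names, using generated columns so the raw dates stay unchanged):
CREATE TABLE records (
    start_date TIMESTAMP WITH TIME ZONE,
    persist_date TIMESTAMP WITH TIME ZONE,
    payload TEXT,
    start_month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', start_date),
    persist_month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', persist_date)
) PARTITIONED BY (start_month, persist_month);
Queries that additionally supply the matching start_month or persist_month value can then be restricted to the relevant partitions, without storing the data twice.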

Related

BigQuery: Max Date for Time Partition on Custom Date Column

I am currently working on the optimization of a huge table in Google's BigQuery. The table has approximately 19 billion records, resulting in a total size of 5.2 TB. In order to experiment with performance with regard to clustering and time partitioning, I duplicated the table with time partitioning on a custom DATE column, MyDate, which is frequently used in queries.
When performing a query with a WHERE clause (for instance, WHERE MyDate = "2022-08-08") on the time-partitioned table, the query is quicker and only reads around 20 GB, compared to the 5.2 TB scanned on the table without partitioning. So far, so good.
My issue, however, arises when applying an aggregate function, in my case MAX(MyDate): the queries on the partitioned and the non-partitioned tables read the same amount of data and execute in roughly the same time. I would have expected the query on the partitioned table to be much quicker, as it only needs to scan a single partition.
There seem to be workarounds by fetching the dataset's metadata (information schema) as described here. However, I would like to avoid solutions like this as it adds complexity to our queries.
Is there a more elegant way to get the MAX of a time-partitioned BigQuery table, based on a custom column, without scanning the whole table or fetching metadata from the information schema?
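For reference, the metadata workaround I am referring to looks something like this (hypothetical dataset and table names; daily partition_id values are YYYYMMDD strings):
SELECT MAX(PARSE_DATE('%Y%m%d', partition_id)) AS max_mydate
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'my_partitioned_table'
  AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__');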

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save the intermediate state of the table using the time travel feature, or to store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
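Storing only the diffs could then be done with a set operation against a snapshot, e.g. (a sketch, hypothetical table name):
SELECT * FROM mydataset.mytable
EXCEPT DISTINCT
SELECT * FROM mydataset.mytable
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
This returns the rows that exist now but did not exist an hour ago, i.e. new rows plus the new versions of updated rows.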
An approach would be to have 3 tables:
a base table in "append only" mode: only inserts are added, and updates are written as full rows, so this table holds every version of every record, like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if there is a special column kept in the first table)
a live table where you hold the current data (into this table you would do your MERGE statements, most probably from the base table).
If you also choose partitioning and clustering, you can leverage the discounted long-term storage price and scan less data.
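A sketch of such a MERGE from the append-only base table into the live table (hypothetical names; columns id, value, inserted_at):
MERGE mydataset.livetable AS live
USING (
  -- keep only the latest version of each record from the append-only base table
  SELECT * EXCEPT (rn) FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY inserted_at DESC) AS rn
    FROM mydataset.basetable
  ) WHERE rn = 1
) AS base
ON live.id = base.id
WHEN MATCHED THEN
  UPDATE SET value = base.value, inserted_at = base.inserted_at
WHEN NOT MATCHED THEN
  INSERT (id, value, inserted_at) VALUES (base.id, base.value, base.inserted_at);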
If the table is large but the amount of data updated per day is modest, then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases; for example, the first check of the day should filter for last_updated_date being either today or yesterday.
Depending on how modest the amount of data updated throughout a day is, even repeatedly querying the table all day could be affordable, because the BQ engine will scan only one daily partition.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
From this I inferred that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones), but that you can modify the table UPDATE statements so that this column is added to the newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old the row is, it will acquire the two new columns last_updated and last_updated_date - unless both columns have already been added by a previous update, in which case the two values will be updated rather than added. If there are several updates to the same row between the update checks, the latest update will still make the row discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally; see the sketch after this list):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked; where this piece of data is held (a table, durable config) is implementation dependent.
discover whether the current check is the first check of the day. If so, then additionally search for last_updated_date=yesterday AND last_updated>last_checked.
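A sketch of those two checks (hypothetical table name; @last_checked is a query parameter supplied by the scheduled process):
-- regular check: today's partition only
SELECT * FROM mydataset.mytable
WHERE last_updated_date = CURRENT_DATE()
  AND last_updated > @last_checked;
-- first check of the day: also look at yesterday's partition
SELECT * FROM mydataset.mytable
WHERE last_updated_date IN (CURRENT_DATE(), DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
  AND last_updated > @last_checked;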
Note 1: If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause a full table scan. And subject to the 'modest' assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2: The downside of this approach is that the checks for updates will not find rows that were updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the __NULL__ partition, together with rows that were never updated.) But I assume that until the changes to the UPDATE statements are made, it is impossible to distinguish between updated rows and non-updated ones anyway.
Note 3: This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning, clustering (with partitioning on a fake column), or both.
The detailed explanation of the initial answer (i.e. the part above the P.S.) ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comments you mentioned that there is already some partitioning in place. I'd suggest examining whether the existing partitioning is indispensable, and whether it can be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have a structure identical to MyData, with the exception that MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
Dataflow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging into MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have changed). The process then upserts/merges from MyDataUpdate into MyData, and the same update set is exported and downloaded to be loaded into PostgreSQL. Finally, the staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.
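A simplified sketch of that filtered merge (hypothetical columns id, event_date, value; MyData assumed to be partitioned on event_date; whether the date filter actually prunes partitions is worth verifying on your own data):
DECLARE changed_dates ARRAY<DATE> DEFAULT (
  SELECT ARRAY_AGG(DISTINCT event_date) FROM mydataset.MyDataUpdate);
MERGE mydataset.MyData AS target
USING mydataset.MyDataUpdate AS source
ON target.id = source.id
  AND target.event_date IN UNNEST(changed_dates)
WHEN MATCHED THEN
  UPDATE SET value = source.value
WHEN NOT MATCHED THEN
  INSERT (id, event_date, value) VALUES (source.id, source.event_date, source.value);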

I need to partition a huge table and truncate the partition on daily basis

I have a table tblCallDataStore which receives a flow of 2 million records daily. Every day, I need to delete any record older than 48 hours. If I create a delete job, it runs for more than 13 hours, sometimes even longer. What is a feasible way to partition the table and truncate the partition?
How may I do that?
I have a Receivedate column and I want to partition on that basis.
Unfortunately, you will need to bite the one-time bullet and add partitioning to your table (hint: base it on the DATEPART(DAY, ...) of the timestamp you are keying off of, so you can quickly get this as your @PartitionID variable later). After that, you can use the SWITCH PARTITION statement to move all rows of the specified partition ID to another table. You can then quickly TRUNCATE the receiving table, since you don't care about the data anymore.
There are a couple of tricky setup steps in this process: Managing the partition IDs themselves, and ensuring the receiving table has an identical structure (and various other requirements, such as not being able to refer to your partitioned table with foreign keys (umm... ouch)). Once you have the partition ID you want, the code is simple:
ALTER TABLE [x] SWITCH PARTITION @PartitionID TO [x_copy];
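Putting the hint and the switch together, a daily purge could look roughly like this (a sketch only; it assumes a persisted DATEPART(DAY, Receivedate) computed column, the clustered index rebuilt on the matching partition scheme, and an identically structured, empty tblCallDataStore_purge table on the same filegroup):
-- one-time setup (abridged): partition by day of month, 31 buckets
CREATE PARTITION FUNCTION pf_day (int)
    AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31);
CREATE PARTITION SCHEME ps_day AS PARTITION pf_day ALL TO ([PRIMARY]);
-- daily job: switch out the day bucket that has aged past 48 hours, then truncate it
DECLARE @PartitionID int = $PARTITION.pf_day(DATEPART(DAY, DATEADD(HOUR, -48, GETDATE())));
ALTER TABLE tblCallDataStore SWITCH PARTITION @PartitionID TO tblCallDataStore_purge;
TRUNCATE TABLE tblCallDataStore_purge;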
I have been meaning to write a stored procedure to automatically maintain the receiving tables, since at work we have just been manually updating both tables in tandem whenever we make a change (e.g. adding a new column or index). To manually do it, basically just copy the table definition and append something to all the entity names (e.g. just put the number 2 on the end).
P.S. Maybe you want to use hourly partitioning instead, then you have much more control over the "48 hours" requirement, and also have the flexibility to handle time zones (ugh). You'd just have to be more clever since DATEPART(HOUR, ...) obviously won't work directly. Also note that your partition IDs must be defined as a range, so eventually you will end up cycling through them, so keep that in mind.
P.P.S. As noted in the comments by David Browne, partitioning was an enterprise-only feature until SQL 2016 SP1.

Partition bigquery table with more than 4000 days of data?

I have about 11 years of data in a bunch of Avro files. I wanted to partition by the date of each row, but from the documentation it appears I can't because there are too many distinct dates?
Does clustering help here? The natural cluster key for my data would still have some values with data for more than 4,000 days.
Two solutions I see:
1)
Combine table sharding (per year) with time partitioning based on your column. I never tested that myself, but it should work, as every shard is seen as a separate table in BQ.
With that you are able to easily address the shard plus the partition with one wildcard/variable.
2)
A good workaround is to create an extra column with the date of your field that should be partitioned.
For every data entry older than 9 years (e.g. DATE_DIFF(current_date(), DATE('2009-01-01'), YEAR)), truncate the date to the 1st of the particular month.
With that you are able to create another 29 years of data.
Be aware that you cannot filter on that column with a date filter, e.g. in Data Studio. But for queries it works.
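The extra column can be derived roughly like this (a sketch with a hypothetical source column event_date):
SELECT
  *,
  CASE
    WHEN DATE_DIFF(CURRENT_DATE(), event_date, YEAR) > 9
      THEN DATE_TRUNC(event_date, MONTH)
    ELSE event_date
  END AS partition_date
FROM mydataset.source_table;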
Currently, as per the docs, clustering is supported for partitioned tables only. In the future it might support non-partitioned tables.
You can put all old data for a year into a single partition.
You need to add an extra column to your table for partitioning it.
Say, all data for year 2011 will go to partition 20110101.
For newer data (2019) you can have a separate partition for each date.
This is not a clean solution to the problem, but using it you can optimize further with clustering to keep table scans minimal.
4,000 daily partitions is just over 10 years of data. If you require a 'table' with more than 10 years of data, one workaround would be to use a view:
Split your table into decades ensuring all tables are partitioned on the same field and have the same schema
Union the tables together in a BigQuery view
This results in a view with 4,000+ partitions which business users can query without worrying about which version of a table they need to use or union-ing the tables themselves.
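Such a view might look like this (a sketch with hypothetical dataset and table names):
CREATE VIEW mydataset.all_events AS
SELECT * FROM mydataset.events_1990s
UNION ALL
SELECT * FROM mydataset.events_2000s
UNION ALL
SELECT * FROM mydataset.events_2010s;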
It might make sense to partition by week/month/year instead of day - depending on how much data you have per day.
In that case, see:
Partition by week/year/month to get over the partition limit?

SAP HANA PARTITIONED TABLE CALCULATION VIEW RUNNING SLOW IN COMPARISON TO NON-PARTITIONED TABLE CALCULATION VIEW

I have a large table, close to 1 GB in size and growing every week, with 190 million rows in total. I started getting alerts from HANA to partition this table, so I planned to partition it using a column that is frequently used in the WHERE clause.
My HANA system is a scale-out system with 8 nodes.
In order to compare query performance between the partitioned and the un-partitioned table, I created calculation views on top of the un-partitioned table and recorded the query performance.
I then partitioned the table using the HASH method, with the number of partitions equal to the number of servers, so that I would get good data distribution across servers. I created a calculation view on top of it and recorded the query performance.
To my surprise, I found that the calculation view on the un-partitioned table performs better than the calculation view on the partitioned table. This was a real shock, and I am not sure why.
I have PlanViz output files but am not sure where to attach them.
Can anyone explain why this is the behaviour?
Ok, this is not a straight-forward question that can be answered correctly as such.
What I can do though is to list some factors that likely will play a role here:
a non-partitioned table needs a single access to the table structure while the partitioned version requires at least one access for each partition
if the SELECT does not actually provide a WHERE condition that can be evaluated by the HASH function used for the partitioning, then all partitions always have to be evaluated and no partition pruning can take place (see the sketch after this list).
HASH partitioning does not take any additional knowledge about the data into account, which means that similar data does not get stored together. This has a negative impact on data compression. Also, each partition requires its own set of value dictionaries for the columns, whereas a single-partition/non-partitioned table only needs one dictionary per column.
You mentioned that you are using a scale-out system. If the table partitions are distributed across the different nodes, then every query will result in cross-node network communication. That is an additional workload and waiting time that simply does not exist with non-partitioned tables.
When joining partitioned tables each partition of the first table has to be joined with each partition of the second table, if no partition-wise join is possible.
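To illustrate the pruning point above, a minimal sketch (hypothetical table and column names):
CREATE COLUMN TABLE sales (customer_id INTEGER, amount DECIMAL(15,2))
  PARTITION BY HASH (customer_id) PARTITIONS 8;
-- this predicate can be evaluated by the hash function, so only one partition needs to be read
SELECT SUM(amount) FROM sales WHERE customer_id = 42;
-- this one cannot, so all 8 partitions have to be evaluated
SELECT SUM(amount) FROM sales WHERE amount > 100;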
There are other/more potential reasons for why a query against partitioned tables can be slower than against a non-partitioned table. All this is extensively explained in the SAP HANA Administration Guide.
As general guidance, tables should only be partitioned if that cannot be avoided and when the access patterns of the queries are well understood. It is definitely not a feature that you just "switch on" and have everything work fine.