Inconsistency in partition pruning in BigQuery - google-bigquery

What's the problem?
I'm trying to select two different date periods at once within a query (to be used in Data Studio), to do some complex period comparison calculations.
This has to all happen in the query, because it's getting used in Data Studio (and other reporting platforms).
However, the logic I'm using to prune the partitions seems to be handled inconsistently on the BigQuery side.
One of them prunes partitions correctly, the other doesn't, and some variations seem to scan arbitrary amounts of data.
Examples of partition pruning
Here's the first example. It either scans the correct amount, ~12 MB, or the entire table, ~5 GB.
SELECT
  columns...
FROM
  table
WHERE
  (
    date(_PARTITIONTIME) >= PARSE_DATE('%Y%m%d', "20220509")
    AND date(_PARTITIONTIME) <= PARSE_DATE('%Y%m%d', "20220509")
  )
  OR
  (
    date(_PARTITIONTIME) >= DATE_SUB(PARSE_DATE('%Y%m%d', "20220509"), INTERVAL 1 DAY)
    AND date(_PARTITIONTIME) <= DATE_SUB(PARSE_DATE('%Y%m%d', "20220509"), INTERVAL 1 DAY)
  )
  AND REGEXP_CONTAINS (path, r"/c/")
I can't consistently recreate the 12 MB version; it more typically returns the 5 GB version.
Question 1: I'm assuming this is some sort of weird caching thing, where it has somehow run the date calculation before?
Question 2: I'm also not sure why it's not pruning the partitions here, as I'm not doing any subqueries, just some date calculations. Why isn't it pruning?
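As a side note, AND binds tighter than OR, so as written the REGEXP_CONTAINS condition only applies to the second date range; that may or may not be intended. Since the two ranges here are adjacent single days, one possible restructuring (a sketch only; I haven't verified that it restores pruning on this table) is to collapse them into a single top-level filter on the partition column:
SELECT
  columns...
FROM
  table
WHERE
  -- One top-level partition predicate covering both days
  date(_PARTITIONTIME) BETWEEN
    DATE_SUB(PARSE_DATE('%Y%m%d', "20220509"), INTERVAL 1 DAY)
    AND PARSE_DATE('%Y%m%d', "20220509")
  AND REGEXP_CONTAINS(path, r"/c/")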

Related

Bigquery runs indefinitely

I have a query like this:
WITH A AS (
  SELECT id FROM db1.X AS d
  WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
),
B AS (
  SELECT id
  FROM db2.Y AS t
  WHERE
    t.start <= TIMESTAMP(DATE_SUB(current_date(), INTERVAL 7 DAY))
    AND t.end >= TIMESTAMP(current_date())
)
SELECT * FROM A AS d JOIN B AS t ON d.id = t.id;
db1.X has 1.6 Billion rows.
db2.Y has 15K rows.
db1.X is a materialized view on a bigger table.
db2.Y is a table with source as a google sheet.
Issue
The query keeps running indefinitely.
I had to cancel it when it reached about an hour, but one query which I left running went on for 6 hours and then timed out without any further error.
The query used to run fine until 2nd Jan. After that I reran it on 9th Jan and it never finished. Both tables are auto-populated, so it is possible that they crossed some threshold during this time, but I could not find any such threshold value. (Three other queries on the same tables met a similar fate.)
What I've tried
Removed the join and used a WHERE IN instead. Still never-ending.
No operation works on A, but everything works on B. For example, SELECT count(*) FROM B; works, while the equivalent on A keeps going. (Counting A does work when the definition of B is removed.)
The above behaviour is replicated even when not using subqueries.
A has 10.6 million rows, B has 31 rows (much less than the actual tables, but still the same result).
The actual query had no subqueries and used only multiple date comparisons while joining, so I added subqueries that filter the data before it goes into the join (this is the query above). But it also runs indefinitely.
JOIN EACH: this never got past syntax errors. Replacing JOIN with JOIN EACH in the above query complains about the "AS"; removing that, it complains that I should use dataset.tablename; fixing that, it complains Expected end of input but got "."
It turns out that the table size is the problem.
I created a smaller table and ran exactly the same queries, and that works.
This was also expected because the query just stopped running one day; the only variable was the amount of data in the source tables.
In my case, I needed the data every week, so I created a scheduled query to update the smaller table with only one month's worth of data.
The smaller versions of the tables have:
db1.X: 40 million rows
db2.Y: 400 rows
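A sketch of what such a scheduled query could look like (the name `db1.X_small` for the smaller table is my own placeholder, not something from the question):
-- Scheduled to run weekly; rewrites the small table with roughly one month of data.
CREATE OR REPLACE TABLE `db1.X_small` AS
SELECT *
FROM `db1.X`
WHERE DATE(date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH);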
Not sure what's going on exactly in terms of issues due to size, but apart from some code-clarity points your query should run as expected. Am I correct in reading from your query that table A should return results within the last 7 days, whereas table B should return results outside of the last 7 days? Some things you might try to make debugging easier:
Use BETWEEN and dates, e.g. WHERE DATE(d.date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
Use backticks (`) around the table names in your FROM clause to prevent table-name errors like the one you mentioned (expected end of input but got ".")
Limit your CTE instead of the outer query. A LIMIT in the outer query has no effect on the computed data, only on the output. E.g. to limit the source data from table A, instead use:
WITH A AS (
  SELECT id FROM `db1.X`
  WHERE DATE(date) BETWEEN DATE_SUB(current_date(), INTERVAL 7 DAY) AND current_date()
  LIMIT 10
)
...

Oracle SQL - Query with timestamp - Slow

I'm trying to write a query that looks for all records for yesterday from a column with type: TIMESTAMP(6) WITH LOCAL TIME ZONE.
The issue I'm facing is that the query runs extremely slowly. I'm wondering if there is something I'm doing incorrectly when querying with a timestamp format?
select * from inventory_transaction
WHERE Complete_timestamp >= to_timestamp(sysdate-1)
and Complete_timestamp < to_timestamp(sysdate);
First, if you want "yesterday", the logic would be:
select *
from inventory_transaction
where Complete_timestamp >= trunc(sysdate - 1) and
Complete_timestamp < trunc(sysdate);
What is important here is the trunc() so you get calendar days. Otherwise, you have the time in the comparison. Despite its name, sysdate has a time component.
I would also write sysdate - interval '1' day rather than sysdate - 1, but that has no impact on performance or results.
Why is this slow? There could be multiple reasons. The most likely is that inventory_transaction is a table and the column is not indexed. In that case, you want an index on inventory_transaction(complete_timestamp).
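A minimal sketch of that index (the index name is my own choice):
-- Lets the range predicate on complete_timestamp use an index range scan
-- rather than a full table scan.
create index inventory_transaction_ct_ix
  on inventory_transaction (complete_timestamp);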
Another possibility is that there is a lot of data. More data takes more time for the database to read. There is no way around that, other than upgrading the server or changing server parameters (if they are suboptimal).
Another possibility is that inventory_transaction is really a view. If that is the case, you need to optimize the view. It is possible that an index on an underlying column will speed the query. However, it might be the entire view that needs to be optimized.

What should be my hive partitioning strategy and view strategy so that query can efficiently run and return results within 10 seconds

My use case is that I have two data sources:
1. Source1 (as speed layer)
2. Hive external table on top of S3 (as batch layer)
I am using Presto to query data from both sources through a view.
I want to create a view that unions data from both sources, like: "create view test as select * from Source1.table union all select * from hive.table"
We keep 24 hours of data in Source1; after 24 hours that data is migrated to S3 via Hive.
Columns for the Source1 tables are: timestamp, logtype, company, category
Users will query data using a timestamp range (they can query data for the last 15/30 minutes, last x hours, last x days, last x months, etc.)
example: "select * from test where timestamp > (now() - interval '15' minute)", "select * from test where timestamp > (now() - interval '12' hour)", "select * from test where timestamp > (now() - interval '1' day)"
To satisfy these queries I need to partition the Hive table, and the user should not be aware of the underlying strategy, i.e. if the user is querying the last x minutes of data, he/she should not have to care whether Presto is reading the data from Source1 or from Hive.
What should be my hive partitioning strategy and view strategy so that query can efficiently run and return results within 10 seconds?
For Hive, a partition column should be used that will be queried in the filter.
In your case this is timestamp. However, if you used timestamp directly it would create a partition for every second (or millisecond), depending on the data in the column.
A better solution would be to create columns like year, month, day, hour (derived from timestamp) and to use these as partition columns; a sketch of such a table follows below.
The same strategy will work for Kudu; however, be advised it could create hot-spotting, since all newly arriving records will go to the same (most recent) partition, which will limit insert (and maybe query) performance.
To overcome this, use one additional column as a hash partition along with the timestamp-derived columns,
e.g. year, month, day, hour, logtype.
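For the Hive/S3 batch layer, the partition layout could look something like this (the table name, file format, and S3 location are placeholders I'm assuming, not from the question):
-- External table over S3, partitioned by time components derived from timestamp.
CREATE EXTERNAL TABLE logs_batch (
  `timestamp` TIMESTAMP,
  logtype     STRING,
  company     STRING,
  category    STRING
)
PARTITIONED BY (`year` INT, `month` INT, `day` INT, `hour` INT)
STORED AS PARQUET
LOCATION 's3://my-bucket/logs_batch/';
The job that migrates data out of Source1 after 24 hours would populate year/month/day/hour from each record's timestamp, and the Presto view's filters need to reference those partition columns (not just timestamp) for partition pruning to kick in.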

SQL to group time intervals by arbitrary time period

I need help with this SQL query. I have a big table with the following schema:
time_start (timestamp) - start time of the measurement,
duration (double) - duration of the measurement in seconds,
count_event1 (int) - number of measured events of type 1,
count_event2 (int) - number of measured events of type 2
I am guaranteed that no rows will overlap - in SQL terms, there are no two rows such that time_start1 < time_start2 AND time_start1 + duration1 > time_start2.
I would like to design an efficient SQL query which would group the measurements by some arbitrary time period (I call it the group_period), for instance 3 hours. I have already tried something like this:
SELECT
  ROUND(time_start / group_period, 0) AS time_period,
  SUM(count_event1) AS sum_event1,
  SUM(count_event2) AS sum_event2
FROM measurements
GROUP BY time_period;
However, there seems to be a problem. If there is a measurement with a duration greater than the group_period, I would expect such a measurement to be grouped into all the time periods it belongs to, but since the duration is never taken into account, it gets grouped only into the first one. Is there a way to fix this?
Performance is a concern for me because I expect the table size to grow considerably over time, reaching millions, possibly tens or hundreds of millions of rows. Do you have any suggestions for indexes or any other optimizations to improve the speed of this query?
Based on Timekiller's advice, I have come up with the following query:
-- Since there's a problem with declaring variables in PostgreSQL,
-- we will be using aliases for the arguments required by the script.
-- First some configuration:
-- group_period = 3600 -- group by 1 hour (= 3600 seconds)
-- min_time = 1440226301 -- Sat, 22 Aug 2015 06:51:41 GMT
-- max_time = 1450926301 -- Thu, 24 Dec 2015 03:05:01 GMT
-- Calculate the number of started periods in the given interval in advance.
-- period_count = CEIL((max_time - min_time) / group_period)
SET TIME ZONE UTC;
BEGIN TRANSACTION;
-- Create a temporary table and fill it with all time periods.
CREATE TEMP TABLE periods (period_start TIMESTAMP)
ON COMMIT DROP;
INSERT INTO periods (period_start)
SELECT to_timestamp(min_time + group_period * coefficient)
FROM generate_series(0, period_count) as coefficient;
-- Group data by the time periods.
-- Note that we don't require exact overlap of intervals:
-- A. [period_start, period_start + group_period]
-- B. [time_start, time_start + duration]
-- This would yield the best possible result but it would also slow
-- down the query significantly because of the part B.
-- We require only: period_start <= time_start <= period_start + group_period
SELECT
  period_start,
  COUNT(measurements.*) AS count_measurements,
  SUM(count_event1) AS sum_event1,
  SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
  ON time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;
COMMIT TRANSACTION;
It does exactly what I was going for, so mission accomplished. However, I would still appreciate it if anybody could give me some feedback on the performance of this query for the following conditions:
I expect the measurements table to have about 500-800 million rows.
The time_start column is the primary key and has a unique btree index on it.
I have no guarantees about min_time and max_time. I only know that the group period will be chosen so that 500 <= period_count <= 2000.
(This turned out way too large for a comment, so I'll post it as an answer instead).
Adding to my comment on your answer, you probably should go with getting best results first and optimize later if it turns out to be slow.
As for performance, one thing I've learned while working with databases is that you can't really predict performance. Query optimizers in advanced DBMS are complex and tend to behave differently on small and large data sets. You'll have to get your table filled with some large sample data, experiment with indexes and read the results of EXPLAIN, there's no other way.
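For the final SELECT in the script above, that would look something like this (run inside the same transaction, after the temporary table is filled):
-- Shows the actual plan, row counts, and buffer usage for the grouping query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  period_start,
  SUM(count_event1) AS sum_event1,
  SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
  ON time_start BETWEEN period_start AND (period_start + group_period)
GROUP BY period_start;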
There are a few things to suggest, though I know Oracle optimizer much better than Postgres, so some of them might not work.
Things will be faster if all fields you're checking against are included in the index. Since you're performing a left join and periods is a base, there's probably no reason to index it, since it'll be included fully either way. duration should be included in the index though, if you're going to go with proper interval overlap - this way, Postgres won't have to fetch the row to calculate the join condition, index will suffice. Chances are it will not even fetch the table rows at all since it needs no other data than what exists in indexes. I think it'll perform better if it's included as the second field to time_start index, at least in Oracle it would, but IIRC Postgres is able to join indexes together, so perhaps a second index would perform better - you'll have to check it with EXPLAIN.
Indexes and math don't mix well. Even if duration is included in the index, there's no guarantee it will be used in (time_start + duration) - though, again, look at EXPLAIN first. If it's not used, try to either create a function-based index (that is, include time_start + duration as a field), or alter the structure of the table a bit, so that time_start + duration is a separate column, and index that column instead.
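In Postgres terms the first of those is an expression index. A sketch, assuming time_start is a timestamp without time zone and duration is in seconds as described above (for timestamp with time zone the expression isn't immutable and the definition would need adjusting):
-- Index on the computed end of each measurement, so an overlap condition
-- written with exactly this expression can be answered from the index.
CREATE INDEX measurements_time_end_idx
  ON measurements ((time_start + duration * interval '1 second'));
The query then has to use that same expression verbatim for the planner to consider the index.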
If you don't really need the left join (that is, you're fine with missing empty periods), then use an inner join instead - the optimizer will likely start with the larger table (measurements) and join periods against it, possibly using a hash join instead of nested loops. If you do that, then you should also index your periods table in the same fashion, and perhaps restructure it the same way, so that it contains the start and end of each period explicitly, as the optimizer has even more options when it doesn't have to perform any operations on the columns.
Perhaps most importantly, if you have max_time and min_time, USE THEM to limit the set of measurements before joining! The smaller your sets, the faster it will work.
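In terms of the posted query, that just means adding a range predicate on measurements inside the join condition (reusing the min_time, max_time and group_period placeholders from the script); a sketch:
SELECT
  period_start,
  COUNT(measurements.*) AS count_measurements,
  SUM(count_event1) AS sum_event1,
  SUM(count_event2) AS sum_event2
FROM periods
LEFT JOIN measurements
  ON time_start BETWEEN period_start AND (period_start + group_period)
  -- Measurements outside [min_time, max_time + group_period] can never match a period.
  AND time_start BETWEEN to_timestamp(min_time) AND to_timestamp(max_time + group_period)
GROUP BY period_start;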

How to find periods without activity in BigQuery

I have a table of some type of activity in BigQuery with just about 40 MB of data right now. The activity date is stored in one of the fields (a string in the format YYYY-MM-DD HH:MM:SS). I need to find a way to determine periods of inactivity (with some predefined threshold) that runs in a reasonable amount of time.
The query that I built has already been running for an hour. Here it is:
SELECT t1.date, MIN(PARSE_UTC_USEC(t1.date) - PARSE_UTC_USEC(t2.date)) AS mintime
FROM logs t1
JOIN (SELECT date, http_error FROM logs) t2 ON t1.http_error = t2.http_error
WHERE PARSE_UTC_USEC(t1.date) > PARSE_UTC_USEC(t2.date)
GROUP BY t1.date
HAVING mintime > 1000;
The idea is:
1. Take the Cartesian product of the table with itself (http_error is a field that almost never changes value, so it does the trick).
2. Take only the pairs where date1 > date2.
3. For every date1, take the date2 with the minimal difference.
4. Restrict the choice to cases where this minimal difference is more than the threshold.
I admit that the real query I use is burdened a bit by fixes for invalid data (this adds additional operations). But I really need a better way to do this. I'll be glad to hear other ideas.
I don't know the granularity of inactivity you are looking for, but why not try bucketing by your timestamp, then counting the relative frequency of activities in each bucket:
SELECT
  UTC_USEC_TO_HOUR(PARSE_UTC_USEC(date)) AS hour_bucket,
  COUNT(*) AS activity_count
FROM logs
GROUP BY
  hour_bucket
ORDER BY
  activity_count ASC;
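Another idea, since the goal is gaps between consecutive events rather than hourly counts: compare each row's timestamp with the previous one using LAG(), which avoids the self-join entirely. A sketch, assuming the dialect used above supports LAG() and reusing the original threshold of 1000 (microseconds, since PARSE_UTC_USEC returns microseconds):
SELECT
  prev_date,
  date,
  PARSE_UTC_USEC(date) - PARSE_UTC_USEC(prev_date) AS gap_usec
FROM (
  -- The string format YYYY-MM-DD HH:MM:SS sorts chronologically, so ORDER BY date is safe.
  SELECT
    date,
    LAG(date, 1) OVER (ORDER BY date) AS prev_date
  FROM logs
)
WHERE PARSE_UTC_USEC(date) - PARSE_UTC_USEC(prev_date) > 1000
ORDER BY gap_usec DESC;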