BigQuery Performance on Date Partitioned Table - google-bigquery

I have a table with ~65M records per day. It is partitioned on date(GG_COMMIT_TIMESTAMP). I have a Spark job which does a left outer join with this table.
The Spark job runs faster if the data from this table is fetched date-wise. For example
where date(ext.GG_COMMIT_TIMESTAMP)=date('"+currentPartition+"')
But if I fetch the data based on the timestamp column, then the job is comparatively 5 to 6 times slower.
where ext.GG_COMMIT_TIMESTAMP >= TIMESTAMP_SUB(TIMESTAMP('" + currentPartition + "'),INTERVAL 3 HOUR) and ext.GG_COMMIT_TIMESTAMP <= TIMESTAMP_ADD(TIMESTAMP('" + currentPartition + "'),INTERVAL 90 MINUTE)
Could you please share why that is? There is no change in the volume of the left table while running the above join.
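The likely cause is partition pruning: a plain predicate on the partitioning column's date lets BigQuery scan only the matching partitions, while a raw timestamp range that straddles midnight, or that is computed inside a join, can end up scanning far more. A common workaround (a sketch, not the poster's exact job; '2021-01-15' stands in for currentPartition) is to keep an explicit date predicate alongside the timestamp range so the pruner can still restrict partitions:
-- restrict partitions first, then refine with the timestamp range
where DATE(ext.GG_COMMIT_TIMESTAMP) BETWEEN DATE_SUB(DATE('2021-01-15'), INTERVAL 1 DAY) AND DATE('2021-01-15')
and ext.GG_COMMIT_TIMESTAMP >= TIMESTAMP_SUB(TIMESTAMP('2021-01-15'), INTERVAL 3 HOUR)
and ext.GG_COMMIT_TIMESTAMP <= TIMESTAMP_ADD(TIMESTAMP('2021-01-15'), INTERVAL 90 MINUTE)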

Related

How to make a group query to select multiple rows?

I have a DateTime column (timestamp, e.g. 2022-05-22 10:10:12) with a batch of stamps for each day.
I need to filter the rows where the stamp is before 9 am (no problem here), and I'm using this code:
SELECT * FROM tickets
WHERE date_part('hour'::text, tickets.date_in) < 9::double precision;
The output is the list of the rows where the time in timestamp is less than 9 am (50 rows from 2000).
date_in
2022-05-22 08:10:12
2022-04-23 07:11:13
2022-06-15 08:45:26
Then I need to find all the days where at least one row has a stamp before 9 am, and here I'm stuck. Any idea how to select all the days where at least one stamp was before 9 am?
The code I'm trying:
SELECT * into temp1 FROM tickets
WHERE date_part('hour'::text, tickets.date_in) < 9::double precision
ORDER BY date_part('day'::text, date_in);
Select * into temp2
from tickets, temp1
where date_part('day'::text, tickets.date_in) = date_part('day'::text, temp1.date_in);
Update temp2 set distorted_route = 1;
But this is giving me nothing.
Expected output is to get all the days where at least one route was done before 9am:
date_in
2022-05-22 08:10:12
2022-05-22 10:11:45
2022-05-22 12:14:59
2022-04-23 07:11:13
2022-04-23 11:42:25
2022-06-15 08:45:26
2022-06-15 15:10:57
Should I make an additional table (temp1), feed it with the first query's result (just the rows before 9 am), and then make a cross-table query to find, in the source table public.tickets, all the days which are equal to those in public.temp1?
Select * from tickets, temp1
where TO_Char(tickets.date_in, 'YYYY-MM-DD')
= TO_Char(temp1.date_in, 'YYYY-MM-DD');
or like this:
SELECT *
FROM tickets
WHERE EXISTS (
SELECT 1 FROM temp1 WHERE TO_Char(tickets.date_in, 'YYYY-MM-DD') = TO_Char(temp1.date_in, 'YYYY-MM-DD')
);
Ideally, I'd want to avoid using a temporary table and make a request just for one table.
After that, I need to create a view or update and add some remarks to the source table.
Assuming you mean:
How to select all rows where at least one row exists with a timestamp before 9 am of the same day?
SELECT *
FROM   tickets t
WHERE  EXISTS (
   SELECT FROM tickets t1
   WHERE  t1.date_in::date = t.date_in::date  -- same day
   AND    t1.date_in::time < time '09:00'     -- time before 9:00
   )
ORDER  BY date_in;  -- optional, but typically helpful
Note that a row before 9 am matches itself, which is what your expected output shows, so there is no need to exclude self or involve a primary key.
But be aware that ...
... typically you'll want to work with timestamptz instead of timestamp. See:
Ignoring time zones altogether in Rails and PostgreSQL
https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29
... this query is slow for big tables, because it cannot use a plain index on (date_in) (it is not "sargable"). Related:
How do you do date math that ignores the year?
There are various ways to optimize performance. The best way depends on information not disclosed here, as is typical for performance questions.
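One such way, as a sketch (assuming the tickets/date_in names from the question): compute a per-day flag in a single pass with a window function, which avoids the correlated EXISTS entirely:
SELECT *
FROM  (
   SELECT t.*
        , bool_or(date_in::time < time '09:00')
             OVER (PARTITION BY date_in::date) AS early_row_on_day
   FROM   tickets t
   ) sub
WHERE  early_row_on_day
ORDER  BY date_in;
This reads the table once, at the cost of sorting by day; whether it beats the EXISTS variant depends on the undisclosed details mentioned above.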

Rolling count of rows within time interval [duplicate]

This question already has answers here:
Window Functions or Common Table Expressions: count previous rows within range
For an analysis I need to aggregate the rows of a single table depending on their creation time. Basically, I want to know the count of orders that have been created within a certain period of time before the current order. Can't seem to find the solution to this.
Table structure:
order_id | time_created
---------+-------------
       1 | 00:00
       2 | 00:01
       3 | 00:03
       4 | 00:05
       5 | 00:10
Expected result:
order_id | count within 3 seconds
---------+-----------------------
       1 | 1
       2 | 2
       3 | 3
       4 | 2
       5 | 1
Sounds like an application for window functions. But, sadly, that was not the case at the time: window frames could only be based on row counts, not on actual column values. (PostgreSQL 11 added RANGE frames with offsets; see the sketch after the LATERAL variant below.)
A simple query with LEFT JOIN can do the job:
SELECT t0.order_id
     , count(t1.time_created) AS count_within_3_sec
FROM   tbl t0
LEFT   JOIN tbl t1 ON t1.time_created BETWEEN t0.time_created - interval '3 sec'
                                          AND t0.time_created
GROUP  BY 1
ORDER  BY 1;
This does not work with time like in your minimal demo, as time does not wrap around midnight. I suppose it's reasonable to assume timestamp or timestamptz.
Since you include each row itself in the count, an INNER JOIN would work, too. (LEFT JOIN is still more reliable in the face of possible NULL values.)
Or use a LATERAL subquery, and you don't need to aggregate on the outer query level:
SELECT t0.order_id
     , t1.count_within_3_sec
FROM   tbl t0
LEFT   JOIN LATERAL (
   SELECT count(*) AS count_within_3_sec
   FROM   tbl t1
   WHERE  t1.time_created BETWEEN t0.time_created - interval '3 sec'
                              AND t0.time_created
   ) t1 ON true
ORDER  BY 1;
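As noted above, newer PostgreSQL can use a window function after all. A sketch, assuming PostgreSQL 11 or later (which added RANGE frames with offsets) and a timestamp column:
SELECT order_id
     , count(*) OVER (ORDER BY time_created
                      RANGE BETWEEN interval '3 sec' PRECEDING
                            AND CURRENT ROW) AS count_within_3_sec
FROM   tbl
ORDER  BY order_id;
This walks the table once and needs no self-join.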
Related:
Rolling sum / count / average over date interval
For big tables and many rows in the time frame, a procedural solution that walks through the table once will perform better. Like:
Window Functions or Common Table Expressions: count previous rows within range
Alternatives to broken PL/ruby: convert a warehouse journal table
GROUP BY and aggregate sequential numeric values

Can I replace an interval of partitions of a BigQuery partitioned table at once?

I'm working on BigQuery tables with the Python SDK and I want to achieve something that seems doable, but can't find anything in the documentation.
I have a table T partitioned by date, and I have a SELECT request that computes values over the X last days. In T, I would like to replace the partitions of the X last days with these values, without affecting the partitions older than X days.
Here is how we do it for replacing one partition only:
job_config = bigquery.QueryJobConfig()
job_config.destination = dataset.table("{}${}".format(table, date.strftime("%Y%m%d")))
job_config.use_legacy_sql = False
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
query_job = bigquery.job.QueryJob(str(uuid.uuid4()), query, client, job_config)
query_job.result()
I tried to go like this :
job_config.destination = dataset.table(table)
But it truncates all partitions, even those older than X days.
Is there a way to do this easily? Or do I have to loop over each partition of the interval?
Thanks
I don't think you can achieve it by playing with destination table.
Not considering the cost, what you can do with SQL is:
DELETE FROM your_ds.your_table WHERE partition_date > DATE_SUB(CURRENT_DATE(), INTERVAL X DAY);
Then
INSERT INTO your_ds.your_table SELECT (...)
Cost
The first DELETE will cost: the sum of bytes processed for all the columns referenced in all partitions for the tables scanned by the query, plus the sum of bytes for all columns in the modified or scanned partitions for the table being modified (at the time the DELETE starts).
The second INSERT INTO should cost the same as your current query.
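To run the pair from the Python SDK, a sketch like this works (your_ds.your_table, partition_date, and the final SELECT are placeholders carried over from the answer, not tested against a real dataset):
from google.cloud import bigquery

client = bigquery.Client()
x_days = 7  # hypothetical number of days to replace

# Delete the partitions of the last X days ...
client.query(
    "DELETE FROM your_ds.your_table "
    "WHERE partition_date > DATE_SUB(CURRENT_DATE(), INTERVAL {} DAY)".format(x_days)
).result()  # wait for the delete to finish before inserting

# ... then re-insert the recomputed rows.
client.query(
    "INSERT INTO your_ds.your_table SELECT (...)"  # your existing SELECT here
).result()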

Most efficient way to retrieve data by timestamps

I'm using PostgreSQL 9.2.8.
I have a table like:
CREATE TABLE foo
(
foo_date timestamp without time zone NOT NULL,
-- other columns, constraints
)
This table contains about 4,000,000 rows. One day of data is about 50,000 rows.
My goal is to retrieve one day data as fast as possible.
I have created an index like:
CREATE INDEX foo_foo_date_idx
ON foo
USING btree
(date_trunc('day'::text, foo_date));
And now I'm selecting data like this (now() is just an example; I need data from ANY day):
select *
from foo
where date_trunc('day'::text, now()) = date_trunc('day'::text, foo_date)
This query takes about 20 s.
Is there any possibility to obtain the same data in a shorter time?
It takes time to retrieve 50,000 rows. 20 seconds seems like a long time, but if the rows are wide, then that might be an issue.
You can directly index foo_date and use inequalities. So, you might try this version:
create index foo_foo_date_idx2 on foo(foo_date);
select f.*
from foo f
where f.foo_date >= date_trunc('day', now())
  and f.foo_date <  date_trunc('day', now() + interval '1 day');
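To confirm the planner actually uses the new index (a quick check, not part of the original answer):
explain (analyze, buffers)
select f.*
from foo f
where f.foo_date >= date_trunc('day', now())
  and f.foo_date <  date_trunc('day', now() + interval '1 day');
The plan should show an Index Scan (or Bitmap Index Scan) on foo_foo_date_idx2 instead of a Seq Scan.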

Postgres SQL select a range of records spaced out by a given interval

I am trying to determine if it is possible, using only SQL for Postgres, to select a range of time-ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5-minute intervals for that hour. The resulting rows should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions can turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
    ts,
    extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
    select
        current_time + (n || ' minute')::interval as ts
    from generate_series(1, 30) as n
) as timestamps
-- extract the minute and check if it's on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Notice how adding a computed (expression) index matching the WHERE clause (where the value of the expression makes up the index) could lead to major speed improvements. Maybe not very selective in this case, but good to be aware of.
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that has lots of interval examples. Hard to find in book stores now, but well worth it.
Extract the minutes, convert to int4, and see if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4(date_part('minute', COLUMN)) % 5 = 0;
If the intervals are not time-based and you just want every 5th row, or if the times are regular and you always have one record per minute, the below gives you one record per every 5:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo-code:
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.*
from tbl t
inner join
(
    select thetimeinterval, max(timecolumn) timecolumn
    from ( < the time series subquery > ) X
    left join tbl on tbl.timecolumn <= thetimeinterval
    group by thetimeinterval
) y on t.timecolumn = y.timecolumn
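As a concrete sketch of that pseudo-code, with generate_series supplying the time series (the bounds, step, and the names tbl/timecolumn are placeholders, not from the original answer):
select t.*
from tbl t
inner join
(
    select g.thetimeinterval, max(tbl.timecolumn) as timecolumn
    from generate_series(timestamp '2011-01-01 07:00',
                         timestamp '2011-01-01 08:00',
                         interval '5 min') as g(thetimeinterval)
    left join tbl on tbl.timecolumn <= g.thetimeinterval
    group by g.thetimeinterval
) y on t.timecolumn = y.timecolumn;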
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table  -- assuming a table with a timestamp column ts
group by bucket
order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or if your readings skip a minute. Instead of using min, even better would be to use one of the first() aggregate functions, code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) % 5 = 0;
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.
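For example (a sketch; the view name is made up, table and column names as above):
create view five_minute_samples as
select min(your_timestamp) as sample_ts
from your_table
group by cast(extract(minute from your_timestamp) as integer) / 5;
You can then join five_minute_samples back to your_table on sample_ts = your_timestamp to get the full rows.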