BigQuery Partition by Day (timestamp column) is not working - google-bigquery

I have a table partitioned by day on the column _installed_at_ (TIMESTAMP); see the screenshot. But when I run
SELECT * FROM `data-analytics-experiment.data_3rd_party.raw_adjust` WHERE DATE(_installed_at_) = "2022-05-31" LIMIT 1000
the query processes the entire table, so partition pruning is not happening, and it returns no results (see the second screenshot).
Help please T.T

Your screenshot shows that the table is partitioned, but most values in _installed_at_, the partitioning column, are not valid.
You might want to check whether _installed_at_ is properly generated, or correctly parsed from a string-formatted timestamp.
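One way to verify this (a sketch using BigQuery's INFORMATION_SCHEMA.PARTITIONS view): rows whose _installed_at_ is NULL or out of range land in the special __NULL__ / __UNPARTITIONED__ partitions, so counting rows per partition makes the problem visible.
-- Inspect how rows are distributed across partitions; a large __NULL__
-- or __UNPARTITIONED__ bucket means _installed_at_ is not being populated.
SELECT partition_id, total_rows
FROM `data-analytics-experiment.data_3rd_party.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'raw_adjust'
ORDER BY total_rows DESC;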

Related

Query just runs, doesn't execute

My query just runs and doesn't execute; what is wrong? I work in Oracle SQL Developer against a company server.
CREATE TABLE voice2020 AS
SELECT
  TO_CHAR(SDATE, 'YYYYMM') AS month,
  MSISDN,
  SUM(CH_MONEY_SUBS_DED) / 100 AS AIRTIME_VOICE,
  SUM(CALLDURATION / 60) AS MIN_USAGE,
  SUM(DUR_ONNET_OOB / 60) AS DUR_ONNET_OOB,
  SUM(DUR_ONNET_IB / 60) AS DUR_ONNET_IB,
  SUM(DUR_ONNET_FREE / 60) AS DUR_ONNET_FREE,
  SUM(DUR_OFFNET_OOB / 60) AS DUR_OFFNET_OOB,
  SUM(DUR_OFFNET_IB / 60) AS DUR_OFFNET_IB,
  SUM(DUR_OFFNET_FREE / 60) AS DUR_OFFNET_FREE,
  SUM(CASE WHEN SDATE < TO_DATE('20190301', 'YYYYMMDD')
           THEN CH_MONEY_PAID_DED - NVL(CH_MONEY_SUBS_DED, 0) - REV_VOICE_INT - REV_VOICE_ROAM_OUTGOING - REV_VOICE_ROAM_INCOMING
           ELSE CH_MONEY_OOB - REV_VOICE_INT - REV_VOICE_ROAM_OUTGOING - REV_VOICE_ROAM_INCOMING
      END) / 100 AS VOICE_OOB_SPEND
FROM CCN.CCN_VOICE_MSISDN_MM#xdr1
WHERE MSISDN IN (SELECT MSISDN FROM saayma_a.BASE30112020) -- change date
GROUP BY
  MSISDN,
  TO_CHAR(SDATE, 'YYYYMM');
This is a performance issue: the query driving your CREATE TABLE statement is taking too long to return a result set.
You are querying a table in a remote database (CCN.CCN_VOICE_MSISDN_MM#xdr1) and then filtering against a local table (saayma_a.BASE30112020). This means you are going to copy the whole remote table across the network, then discard the records which don't match the WHERE clause.
You know your data (or at least you should know it): does that sound efficient? If you're actually discarding most of the records you should try to filter CCN_VOICE_MSISDN_MM in the remote database.
If you need more advice you need to provide more information. Please read this post about asking Oracle tuning questions on this site, then edit your question to include some details.
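As a starting point for those details, you could capture the optimizer's plan for the driving SELECT (a minimal sketch; it reuses the table names from your query and only a couple of the columns):
-- Ask Oracle to explain the driving query without running it.
EXPLAIN PLAN FOR
SELECT TO_CHAR(SDATE, 'YYYYMM') AS month, MSISDN
FROM CCN.CCN_VOICE_MSISDN_MM#xdr1
WHERE MSISDN IN (SELECT MSISDN FROM saayma_a.BASE30112020);

-- Display the plan; look for REMOTE operations and their row estimates.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);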
You are executing a CTAS (CREATE TABLE AS SELECT); the purpose of this query is to create the table and populate it with the data generated by the query.
If you want to just execute the query and see the data, then remove the first line of your query:
-- CREATE TABLE voice2020 AS
SELECT
.....
Also, if you have already executed the CTAS once, the data from your query must already be present in the voice2020 table:
SELECT * FROM voice2020;
It looks like you are trying to copy the data from one table to another. Create the target table first if it does not exist, then try this statement:
insert into target_table select * from source_table;

Redshift - Issue displaying time difference stored in a table

I am trying to find the time difference between two timestamps and store it in a column. When I check the output in the table, I see a huge number rather than the difference in days/hours. I am using Amazon Redshift as the database.
Data type of the time_duration column: varchar
Given below is a sample:
order_no,order_date,complain_date,time_duration
1001,2018-03-10 04:00:00,2018-03-11 07:00:00,97200000000
But I am expecting the time_duration column to show 1 day, 3 hours.
This issue happens when I store time_duration in a table and then query to view the output.
Could anyone assist?
Thanks.
Do it the following way; it will give the difference in hours:
select datediff(hour, order_date, complain_date) as diff_in_hour from your_table;
If you want it in days, do it this way:
select datediff(day, order_date, complain_date) as diff_in_day from your_table;
You could use the DATEDIFF function to update the time_duration column in your table.
Refer to the Redshift documentation for more details.
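A minimal sketch of that update, assuming the table is named your_table with the columns from the sample above; it splits the hour difference into whole days and leftover hours and stores the formatted string:
-- DATEDIFF(hour, ...) returns an integer count of hour boundaries
-- crossed, so integer division and modulo split it into days and hours.
update your_table
set time_duration = cast(datediff(hour, order_date, complain_date) / 24 as varchar)
                 || ' day, '
                 || cast(datediff(hour, order_date, complain_date) % 24 as varchar)
                 || ' hours';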

Cannot query over table without a filter that can be used for partition elimination

I have a partitioned table and would love to use a MERGE statement, but for some reason it doesn't work:
MERGE `wr_live.p_email_event` t
using `wr_live.email_event` s
on t.user_id=s.user_id and t.event=s.event and t.timestamp=s.timestamp
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
values (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
I get
Cannot query over table 'wr_live.p_email_event' without a filter that
can be used for partition elimination.
What's the proper syntax? Also is there a way I can express shorter the insert stuff? without naming all columns?
What's the proper syntax?
As you can see from the error message, your partitioned wr_live.p_email_event table was created with require partition filter set to true. This means that any query over this table must include a filter on the partitioning field.
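To double-check which field that is and whether the requirement is set, one option (a sketch, assuming the dataset's INFORMATION_SCHEMA views are available to you) is to read the generated DDL:
-- The DDL shows the PARTITION BY clause and, when set,
-- require_partition_filter=true in the OPTIONS list.
SELECT ddl
FROM `wr_live.INFORMATION_SCHEMA.TABLES`
WHERE table_name = 'p_email_event';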
Assuming that timestamp IS the partitioning field, you can do something like below:
MERGE `wr_live.p_email_event` t
USING `wr_live.email_event` s
ON t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
VALUES (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
So you need to tune the line below so that, in reality, it does not filter out whatever you need to be involved:
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
For example, I have found that setting it to a timestamp in the future addresses the issue in many cases, like:
AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
Of course, if your wr_live.email_event table is also partitioned with require partition filter set to true, you need to add the same kind of filter for s.timestamp.
Also is there a way I can express shorter the insert stuff? without naming all columns?
BigQuery DML's INSERT requires column names to be specified; there is no way (at least that I am aware of) to avoid this with an INSERT statement.
In the meantime, you can avoid it by using DDL's CREATE TABLE from the result of a query. This does not require listing the columns.
For example, something like below
CREATE OR REPLACE TABLE `wr_live.p_email_event`
PARTITION BY DATE(timestamp) AS
SELECT * FROM `wr_live.p_email_event`
WHERE DATE(timestamp) <> DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
UNION ALL
SELECT * FROM `wr_live.email_event` s
WHERE NOT EXISTS (
SELECT 1 FROM `wr_live.p_email_event` t
WHERE t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
)
You might also want to include a table options list via OPTIONS(), but it looks like the partition-filter attribute is not supported there yet, so if you do have/need it, the statement above will "erase" that attribute :o(

Why Google BigQuery doesn't use partition date correctly when using views

I have a date-partitioned table (call it sample_table) with two columns, one storing dateTime in UTC and the other storing the timezone offset. I have a view on top of this table (call it sample_view). The view takes _partitiontime from the table and exposes it as a partitionDate column, and there is also another column, customerDateTime, which simply adds dateTime and timeZoneOffset.
When I query sample_table directly using only _partitiontime, BigQuery scans far less data (131 MB):
select
containerName,
count(*)
from
[sample_project.sample_table]
where
_partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
But when I run the same query with an additional filter on the dateTime column, to scan according to the customer's local date/time, BigQuery scans more (211 MB); I expected at most 131 MB:
select
containerName,
count(*)
from
[sample_project.sample_table]
where
_partitiontime between timestamp('2016-12-12') and timestamp('2016-12-19')
and DATE_ADD(dateTime, 3600, 'SECOND' ) between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
When I run a similar query against sample_view using partitionDate, BigQuery scans more (399 MB):
select
containerName,
count(*)
from
[sample_project.sample_view]
where
partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
And when I run the query against the view using partitionDate and the customerDateTime column as well, BigQuery scans even more (879 MB):
select
containerName,
count(*)
from
[sample_project.sample_view]
where
partitionDate between timestamp('2016-12-12') and timestamp('2016-12-19') and customerDateTime between timestamp('2016-12-12 08:00:00') and timestamp('2016-12-19 15:00:00')
and customer = 'X'
and containerName = 'XXX'
group by containerName
;
I'm not too sure whether I'm scanning the right partitions in any of the queries above. Why do I see the differences between these queries? Is exposing _partitiontime as a new column, partitionDate, a bad strategy? I'm not sure how else to use the partition date within Tableau without writing more queries. Please let me know if you require more details.
You will probably need to use standard SQL for the query instead, since legacy SQL has some limitations in terms of filter pushdown. I'm not very familiar with Tableau myself, but they have a help page for BigQuery that talks about switching between legacy and standard SQL.
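For illustration, the first query rewritten in standard SQL might look like the sketch below (note the backtick quoting and the explicit _PARTITIONTIME pseudo-column):
-- Standard SQL version of the first query; the _PARTITIONTIME filter
-- can be pushed down so only the selected partitions are scanned.
SELECT
  containerName,
  COUNT(*) AS cnt
FROM `sample_project.sample_table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-12-12') AND TIMESTAMP('2016-12-19')
  AND customer = 'X'
  AND containerName = 'XXX'
GROUP BY containerName;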
Just a guess: the problem you see is because you have repeated fields. Legacy and standard SQL deal differently with flattening results. Legacy SQL flattens results, so you see not the count of original records but the number of repeated values within them, whereas standard SQL keeps the original structure. In legacy SQL you need to take extra care to eliminate the effect of flattening; in standard SQL it is already taken care of.
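To illustrate the difference (a sketch with a hypothetical repeated field named tags; your actual schema may differ):
-- Counts one row per original record.
SELECT COUNT(*) FROM `sample_project.sample_table`;

-- Unnesting the repeated field multiplies rows, which is roughly what
-- legacy SQL's implicit flattening does to counts.
SELECT COUNT(*)
FROM `sample_project.sample_table` t, UNNEST(t.tags) AS tag;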

Oracle SQL use variable partition name

I run a daily report that has to query another table which is updated separately. Due to the high volume of records in the source table (8M+ per day), each day is stored in its own partition. The partition name has a standard format: P followed by a 4-digit year, 2-digit month, and 2-digit day, so yesterday's partition is P20140907.
At the moment I use this expression, but have to manually change the name of the partition each day:
select * from <source_table> partition (P20140907) where ....
By using SYSDATE, TO_CHAR and CONCAT I have created another table, called P_NAME2, that automatically generates and updates a string value holding the name of the partition I need to read. Now I need to update my main query so it does this:
select * from <source_table> partition (<string from P_NAME2>) where ....
You are working too hard. Oracle already does all these things for you. If you query the table using the correct date range, Oracle will perform the operation only on the relevant partitions; this is called pruning.
I suggest reading the docs on that.
If you're still skeptical, query ALL_TAB_PARTITIONS.HIGH_VALUE to get each partition's high value (the table you created ...).
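A minimal sketch of pruning in action, assuming the partition key is a date column (hypothetically named SDATE here):
-- This date range restricts the scan to yesterday's partition without
-- naming the partition explicitly; Oracle prunes the rest.
SELECT *
FROM source_table
WHERE SDATE >= TRUNC(SYSDATE) - 1
  AND SDATE <  TRUNC(SYSDATE);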
I thought I'd pop back to share how I solved this in the end. The source database has a habit of leaking dates across partitions, which is why queries for one day were going outside a single partition. I can't affect this, just work around it ...
begin
  execute immediate
    'create table LL_TEST as
     select *
     from SCHEMA.TABLE partition (P' || TO_CHAR(sysdate, 'YYYYMMDD') || ')
     where COLUMN_A = ''Something''
     and COLUMN_B = ''Something Else''';
end;
Using the PL/SQL script I create the partition name with TO_CHAR(sysdate,'YYYYMMDD') and concatenate the rest of the query around it.
Note that the values you are searching for in the WHERE clause require doubled apostrophes, so to send 'Something' to the query you need ''Something'' in the script.
It may not be pretty, but it works on the database that I have to use.