I have snapshots backed up every day, named in this format: TableName_20221218
I want to extract the date from the snapshot name to create a date column.
Currently I am adding the date columns manually, like this, but it is inconvenient because I have to update the code every day:
select id, date('2022-11-17') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221117`
union all
select id, date('2022-11-18') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221118`
union all
select id, date('2022-11-19') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221119`
union all
select id, date('2022-11-20') as date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent_20221120`
Instead, I want to take the date from the snapshot name automatically to create the date column, and turn the query into something like this:
select id, date, rfm_r, rfm_f, rfm_m, recency_score, phone, recency_score_detail, gender, age_group, city from `backup.t1_customer_dependent*`
Does anyone know how to do this?
I'm new to BigQuery, so any help will be greatly appreciated.
Thanks!
Consider using _TABLE_SUFFIX for wildcard tables.
SELECT id, PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) AS date, rfm_r ...
FROM `backup.t1_customer_dependent_*`
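PARSE_DATE uses strptime-style format elements, so the '%Y%m%d' pattern can be sanity-checked locally. A quick, purely illustrative Python check on one hypothetical suffix:

```python
from datetime import date, datetime

# What _TABLE_SUFFIX would yield for backup.t1_customer_dependent_20221118
suffix = "20221118"

# Same format string as PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) in the query above
parsed = datetime.strptime(suffix, "%Y%m%d").date()
print(parsed)  # 2022-11-18
```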
I am new to the SQL world and working with the query below; the table contains 3,000,000+ records. Can you please suggest how to reduce the query run time, or any other query that produces the same result?
I tried two queries:
#1
SELECT *
FROM (SELECT ID,
Priority,
Agent_Name,
Urgency,
Status,
Agent_Group_Name,
Country,
Region,
Due_by,
Type,
Created_Date,
Resolved_Date,
Closed_Date,
Resolution_Status,
Requester_Location,
WH_Region,
ExecDate,
Date,
Full_Date,
Datatype,
Department_Name,
Requester_Emails,
ROW_NUMBER()
OVER (
PARTITION BY ID
ORDER BY Execdate DESC ) nn
FROM weekly_tickets
WHERE Created_date >= '2022-01-01 12:00:00 AM') sub_table
WHERE sub_table.nn = 1
#2
WITH cte
AS (SELECT ID,
Priority,
Agent_Name,
Urgency,
Status,
Category,
Item_Category,
Agent_Group_Name,
What_is_the_Impact_,
Country,
Impact,
Region,
Resolution_Time_in_Bhrs,
Sub_Category,
Due_by,
Type,
Issue_Owner,
Created_Date,
Number_of_Users,
Approval_Status,
Resolved_Date,
Closed_Date,
How_is_the_issue_affecting_the_service_,
Number_of_Users_staffed,
Resolution_Status,
Sites,
Requester_Location,
Number_of_Users_affected,
WH_Region,
CampaignOriginId,
ExecDate,
Date,
Full_Date,
AgeEvol,
Datatype,
Department_Name,
Requester_Emails,
ROW_NUMBER()
OVER (
PARTITION BY ID
ORDER BY Execdate DESC ) nn
FROM weekly_tickets
WHERE Created_date >= '2022-01-01 12:00:00 AM')
SELECT *
FROM cte
WHERE cte.nn = 1
Always read the execution plan, or show it to someone else if it is too complex for you. This query's plan seems straightforward to predict with a high degree of certainty. I assume the following optimization steps:
Scan table and filter by Created_date, return all columns required by next steps (here all columns used by SELECT clause),
Order by ID, ExecDate DESC
Segment
[..]
The filter comes from the WHERE clause; the ordering and segmenting come from the OVER clause. I also assume the date filter significantly reduces the number of rows returned. With fewer rows to process, the query should perform better.
All that means you should start with the following index:
CREATE INDEX IX_Predicted ON weekly_tickets(Created_Date) INCLUDE (ID,
Priority,
Agent_Name,
Urgency,
Status,
Category,
Item_Category,
Agent_Group_Name,
What_is_the_Impact_,
Country,
Impact,
Region,
Resolution_Time_in_Bhrs,
Sub_Category,
Due_by,
Type,
Issue_Owner,
Number_of_Users,
Approval_Status,
Resolved_Date,
Closed_Date,
How_is_the_issue_affecting_the_service_,
Number_of_Users_staffed,
Resolution_Status,
Sites,
Requester_Location,
Number_of_Users_affected,
WH_Region,
CampaignOriginId,
ExecDate,
Date,
Full_Date,
AgeEvol,
Datatype,
Department_Name,
Requester_Emails);
You could create an index to help with performance:
CREATE NONCLUSTERED INDEX index_POC_ticket
ON dbo.weekly_tickets (ID, Execdate DESC)
WITH (DROP_EXISTING = ON);
Edit: added DESC to the Execdate column in the index, since I believe it should support the ordering in your window function.
If ID is the primary key, I don't believe you need to include it in the above index (maybe someone more knowledgeable can comment here).
I should think that your query can be rewritten as:
SELECT ID,
Priority,
Agent_Name,
Urgency,
STATUS,
Agent_Group_Name,
Country,
Region,
Due_by,
Type,
Created_Date,
Resolved_Date,
Closed_Date,
Resolution_Status,
Requester_Location,
WH_Region,
ExecDate,
Date,
Full_Date,
Datatype,
Department_Name,
Requester_Emails
FROM weekly_tickets AS wt
WHERE Created_date >= '2022-01-01 12:00:00 AM'
AND NOT EXISTS(SELECT *
FROM weekly_tickets AS t
                  WHERE wt.ID = t.ID
AND wt.Execdate < t.Execdate);
And it can be much more efficient with the following index:
CREATE INDEX X ON weekly_tickets (ID, Execdate);
Create the index if it does not already exist, and test it!
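A minimal way to convince yourself that the ROW_NUMBER form and the NOT EXISTS form select the same rows is to run both against a tiny table. The sketch below uses SQLite from Python with invented data (only a few of the question's columns); it illustrates the keep-latest-row-per-ID pattern, not the production query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE weekly_tickets (ID INT, Priority TEXT, Created_Date TEXT, ExecDate TEXT);
INSERT INTO weekly_tickets VALUES
  (1, 'High', '2022-01-05', '2022-03-01'),
  (1, 'High', '2022-01-05', '2022-03-08'),
  (2, 'Low',  '2022-01-10', '2022-02-01'),
  (2, 'Low',  '2022-01-10', '2022-02-15');
""")

# Keep the row with the latest ExecDate per ID, via ROW_NUMBER
row_number_q = """
SELECT ID, ExecDate FROM (
  SELECT ID, ExecDate,
         ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ExecDate DESC) AS nn
  FROM weekly_tickets
  WHERE Created_Date >= '2022-01-01'
) sub
WHERE nn = 1
ORDER BY ID
"""

# Same result via NOT EXISTS: keep a row only if no later row shares its ID
not_exists_q = """
SELECT ID, ExecDate FROM weekly_tickets AS wt
WHERE Created_Date >= '2022-01-01'
  AND NOT EXISTS (SELECT 1 FROM weekly_tickets AS t
                  WHERE t.ID = wt.ID AND t.ExecDate > wt.ExecDate)
ORDER BY ID
"""

latest_by_rownum = conn.execute(row_number_q).fetchall()
latest_by_exists = conn.execute(not_exists_q).fetchall()
print(latest_by_rownum)  # [(1, '2022-03-08'), (2, '2022-02-15')]
```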
I'm sure this is super simple, but I am still very new to SQL. I am trying to subtract two dates: one is just a birth year stored as an integer, and the other should be the current date. I am trying to find the age of these individuals. Any help would be greatly appreciated! What I have so far is below. Thank you!
select
distinct(count(usertype)),
gender, usertype,
date_diff(extract(year from current_date) as current_year, tripdata.birth_year, year)
from `project-1-349215.Dataset.tripdata`
Posting #JNevill's suggestion from the comments as a community-wiki answer:
with tripdata as (
select 1995 as birth_year, "premium" as usertype, "F" as gender
union all select 1990 as birth_year, "standard" as usertype, "F" as gender
union all select 1990 as birth_year, "standard" as usertype, "F" as gender
union all select 1993 as birth_year, "standard" as usertype, "F" as gender
union all select 1993 as birth_year, "premium" as usertype, "M" as gender
union all select 1994 as birth_year, "premium" as usertype, "F" as gender
)
select
count(usertype),
gender,
extract(year from current_date) - tripdata.birth_year as age
from tripdata
group by usertype, gender, tripdata.birth_year;
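The year-subtraction in this approach is easy to reason about outside SQL. A tiny Python equivalent (the helper name is made up for illustration), which also shows the approach's limitation: it ignores whether the birthday has already passed this year.

```python
from datetime import date

def age_from_birth_year(birth_year, today=None):
    # Same arithmetic as EXTRACT(YEAR FROM CURRENT_DATE) - birth_year:
    # a whole-year difference that ignores month and day.
    today = today or date.today()
    return today.year - birth_year

print(age_from_birth_year(1995, date(2022, 12, 18)))  # 27
```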
DATE_DIFF(column1, column2, DAY) AS date_diff_days
If you want the difference in minutes or seconds, change DAY to MINUTE or SECOND.
Nice to meet you, dear community!
I want to select users from several tables who performed their last event no later than 7 days after their registration day, and group them by their start version.
However, the number of selected users is quite low; could you please tell me where my mistake is?
FROM (SELECT user_id, country, user_creation_time, event_type, event_time, start_version
FROM PUBLIC.export_07_2020
UNION
SELECT user_id, country, user_creation_time, event_type, event_time, start_version
FROM PUBLIC.export_08_2020
UNION
SELECT user_id, country, user_creation_time, event_type, event_time, start_version
FROM PUBLIC.export_09_2020) dat
WHERE dat.country = 'United States'
AND dat.user_creation_time BETWEEN
'2020-07-01 00:00:00' AND '2020-09-23 23:59:59'
AND NOT EXISTS (SELECT dit.user_id
FROM (SELECT user_id,
country,
user_creation_time,
event_type,
event_time
FROM PUBLIC.export_07_2020
UNION
SELECT user_id,
country,
user_creation_time,
event_type,
event_time
FROM PUBLIC.export_08_2020
UNION
SELECT user_id,
country,
user_creation_time,
event_type,
event_time
FROM PUBLIC.export_09_2020) dit
WHERE dat.user_id = dit.user_id
AND Greatest(dit.event_time) >
Dateadd(day, 7, dit.user_creation_time))
GROUP BY dat.start_version
In BigQuery, I can successfully run the following query using standard SQL:
SELECT
COUNT(*) AS totalCount,
city,
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
FROM
sandbox.CountByCity
GROUP BY
city, start
But it fails when I nest the start value in a STRUCT, like this...
SELECT
COUNT(*) AS totalCount,
city,
STRUCT(
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
) as timeSpan
FROM
sandbox.CountByCity
GROUP BY
city, timeSpan.start
In this case, I get the following error message:
Cannot GROUP BY field references from SELECT list alias timeSpan at [10:11]
What is the correct way to write the query so that the start value is nested within a STRUCT?
You can do this using ANY_VALUE. The struct value that you get is well-defined, since the value is the same for the entire group:
SELECT
COUNT(*) AS totalCount,
city,
ANY_VALUE(STRUCT(
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
)) as timeSpan
FROM
sandbox.CountByCity
GROUP BY
city, DATE_TRUNC(timeInterval.intervalStart, YEAR);
Here is an example using some sample data:
WITH `sandbox.CountByCity` AS (
SELECT 'Seattle' AS city, STRUCT(DATE '2017-12-11' AS intervalStart) AS timeInterval UNION ALL
SELECT 'Seattle', STRUCT(DATE '2016-11-10' AS intervalStart) UNION ALL
SELECT 'Seattle', STRUCT(DATE '2017-03-24' AS intervalStart) UNION ALL
SELECT 'Kirkland', STRUCT(DATE '2017-02-01' AS intervalStart)
)
SELECT
COUNT(*) AS totalCount,
city,
ANY_VALUE(STRUCT(
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
)) as timeSpan
FROM
`sandbox.CountByCity`
GROUP BY
city, DATE_TRUNC(timeInterval.intervalStart, YEAR);
You could also consider submitting a feature request to enable GROUP BY with STRUCT types.
Not sure why exactly you would want this, but I believe you have a reason, so try the query below (at least formally, it does what you ask):
#standardSQL
SELECT
totalCount,
city,
STRUCT(start) timeSpan
FROM (
SELECT
COUNT(*) AS totalCount,
city,
DATE_TRUNC(timeInterval.intervalStart, YEAR) AS start
FROM `sandbox.CountByCity`
GROUP BY city, start
)
I am facing a simple problem with an SQL query that I do not know how to tackle.
I have a table with the following structure
CITY COUNTRY DATES TEMPERATURE
Note that for a given country I can have several cities, and for a given city I have several rows giving the TEMPERATURE at each available DATE. This is just a time series.
I would like to write a query that gives me, for every city, the DATE of the minimum TEMPERATURE and the DATE of the maximum TEMPERATURE. The query should return something like this:
CITY COUNTRY DATE_MIN_TEMPERATURE MIN_TEMPERATURE DATE_MAX_TEMPERATURE MAX_TEMPERATURE
Any idea on how to achieve this?
Best regards,
Deny
Oracle provides keep/dense_rank first for this purpose:
select city,
min(temperature) as min_temperature,
max(date) keep (dense_rank first order by temperature asc) as min_temperature_date,
max(temperature) as max_temperature,
max(date) keep (dense_rank first order by temperature desc) as max_temperature_date
from t
group by city;
Note that this returns only one date if there are ties. If you want to handle that, more logic is needed:
select city, min(temperature) as min_temperature,
listagg(case when seqnum_min = 1 then date end, ',') within group (order by date) as mindates,
max(temperature) as max_temperature,
listagg(case when seqnum_max = 1 then date end, ',') within group (order by date) as maxdates
from (select t.*,
rank() over (partition by city order by temperature) as seqnum_min,
rank() over (partition by city order by temperature desc) as seqnum_max
from t
) t
where seqnum_min = 1 or seqnum_max = 1
group by city;
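The rank()-based form is portable beyond Oracle. Here is a self-contained check in SQLite (via Python) with a few invented rows, using dt instead of the reserved word date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (city TEXT, country TEXT, dt TEXT, temperature INT);
INSERT INTO t VALUES
  ('Palermo', 'Italy',   '2014-02-13',  3),
  ('Palermo', 'Italy',   '1998-07-22', 42),
  ('Maseru',  'Lesotho', '1994-01-11', 34),
  ('Maseru',  'Lesotho', '2004-08-13', 12);
""")

# Rank each row per city from coldest and from hottest, then keep the extremes.
rows = conn.execute("""
SELECT city,
       MIN(temperature)                          AS min_temp,
       MAX(CASE WHEN seqnum_min = 1 THEN dt END) AS min_temp_date,
       MAX(temperature)                          AS max_temp,
       MAX(CASE WHEN seqnum_max = 1 THEN dt END) AS max_temp_date
FROM (SELECT t.*,
             RANK() OVER (PARTITION BY city ORDER BY temperature)      AS seqnum_min,
             RANK() OVER (PARTITION BY city ORDER BY temperature DESC) AS seqnum_max
      FROM t) ranked
WHERE seqnum_min = 1 OR seqnum_max = 1
GROUP BY city
ORDER BY city
""").fetchall()

for row in rows:
    print(row)
# ('Maseru', 12, '2004-08-13', 34, '1994-01-11')
# ('Palermo', 3, '2014-02-13', 42, '1998-07-22')
```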
In Oracle 11 and above, you can use PIVOT. In the solution below I use LISTAGG to show all the dates in case of ties. Another option is, in the case of ties, to show the most recent date when the extreme temperature was reached; if that is preferred, simply replace LISTAGG(dt, ....) (including the WITHIN GROUP clause) with MAX(dt). However, in that case the first solution offered by Gordon (using the first function) is more efficient anyway - no need for pivoting.
Note that I changed "date" to "dt" - DATE is a reserved word in Oracle. I also show the rows by country first, then city (the more logical ordering). I created test data in a WITH clause, but the solution is everything below the comment line.
with
inputs ( city, country, dt, temperature ) as (
select 'Palermo', 'Italy' , date '2014-02-13', 3 from dual union all
select 'Palermo', 'Italy' , date '2002-01-23', 3 from dual union all
select 'Palermo', 'Italy' , date '1998-07-22', 42 from dual union all
select 'Palermo', 'Italy' , date '1993-08-24', 30 from dual union all
select 'Maseru' , 'Lesotho', date '1994-01-11', 34 from dual union all
select 'Maseru' , 'Lesotho', date '2004-08-13', 12 from dual
)
-- >> end test data; solution (SQL query) begins with the next line
select country, city,
"'min'_DT" as date_min_temp, "'min'_TEMP" as min_temp,
"'max'_DT" as date_max_temp, "'max'_TEMP" as max_temp
from (
select city, country, dt, temperature,
case when temperature = min(temperature)
over (partition by city, country) then 'min'
when temperature = max(temperature)
over (partition by city, country) then 'max'
end as flag
from inputs
)
pivot ( listagg(to_char(dt, 'dd-MON-yyyy'), ', ')
within group (order by dt) as dt, min(temperature) as temp
for flag in ('min', 'max'))
order by country, city -- ORDER BY is optional
;
COUNTRY CITY DATE_MIN_TEMP MIN_TEMP DATE_MAX_TEMP MAX_TEMP
------- ------- ------------------------ ---------- -------------- ----------
Italy Palermo 23-JAN-2002, 13-FEB-2014 3 22-JUL-1998 42
Lesotho Maseru 13-AUG-2004 12 11-JAN-1994 34
2 rows selected.
Instead of the keep/dense_rank first function you can also use FIRST_VALUE and LAST_VALUE:
select distinct city,
       MIN(temperature) OVER (PARTITION BY city) as min_temperature,
       FIRST_VALUE(date) OVER (PARTITION BY city ORDER BY temperature) AS min_temperature_date,
       MAX(temperature) OVER (PARTITION BY city) as max_temperature,
       LAST_VALUE(date) OVER (PARTITION BY city ORDER BY temperature
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS max_temperature_date
FROM t;
Note the explicit frame on LAST_VALUE: with the default frame, which ends at the current row, LAST_VALUE would return the current row's date rather than the date of the maximum temperature.
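A caveat with LAST_VALUE in any engine: its default window frame ends at the current row, so without an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame it returns the current row's date, not the hottest day's. A small demonstration with SQLite from Python (invented data); the same frame rules apply in Oracle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (city TEXT, dt TEXT, temperature INT);
INSERT INTO t VALUES ('Palermo', '2014-02-13', 3),
                     ('Palermo', '1998-07-22', 42);
""")

# Default frame: ends at the current row, so each row sees a different "last" date.
default_frame = conn.execute("""
SELECT DISTINCT LAST_VALUE(dt) OVER (PARTITION BY city ORDER BY temperature)
FROM t
""").fetchall()

# Explicit full-partition frame: every row sees the hottest day's date.
full_frame = conn.execute("""
SELECT DISTINCT LAST_VALUE(dt) OVER (PARTITION BY city ORDER BY temperature
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM t
""").fetchall()

print(sorted(default_frame))  # [('1998-07-22',), ('2014-02-13',)] -- two answers!
print(full_frame)             # [('1998-07-22',)]
```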