Google BigQuery: how to convert LAG() function result to a timestamp - google-bigquery

Hi, is it possible to convert the result of a LAG() function to a timestamp? I basically want to get the difference between two timestamps in seconds.
With the following code the system tells me that the type of 'last_timestamp' is unknown. When I put the mouse cursor on the column 'last_timestamp' of the inner query, I can see that it is of type timestamp.
SELECT clientId, timestamp
FROM (
  SELECT clientId, timestamp,
    LAG(timestamp, 1) OVER (PARTITION BY clientId ORDER BY timestamp) AS last_timestamp
  FROM [oxidation.201602]
) last
WHERE (TIMESTAMP_TO_SEC(timestamp) - TIMESTAMP_TO_SEC(last_timestamp) >= (60 * 30))
  OR last_timestamp IS NULL

A workaround is to convert the timestamp to seconds with TIMESTAMP_TO_SEC() before applying LAG(), so the window function returns a plain integer whose type is known:
SELECT clientId, timestamp
FROM (
  SELECT clientId, timestamp, timestamp_sec,
    LAG(timestamp_sec, 1) OVER (PARTITION BY clientId ORDER BY timestamp_sec) AS prev_timestamp_sec
  FROM (
    SELECT clientId, timestamp, TIMESTAMP_TO_SEC(timestamp) AS timestamp_sec
    FROM [oxidation.201602]
  )
) last
WHERE timestamp_sec - prev_timestamp_sec >= 60 * 30
  OR prev_timestamp_sec IS NULL
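The keep-rows-that-start-a-new-30-minute-gap logic can be sketched outside BigQuery. Below is a minimal Python illustration using SQLite's window functions (SQLite 3.25+); the `events` table, its column names, and the sample rows are all made up for the demo:

```python
import sqlite3

# In-memory database with a hypothetical events table (client_id, ts as Unix seconds).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (client_id TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 0), ("a", 100), ("a", 100 + 60 * 30), ("b", 50)],
)

# Compute the previous timestamp per client with LAG(), then keep rows that
# start a new "session": the first row per client, or a gap of >= 30 minutes.
rows = conn.execute("""
    SELECT client_id, ts
    FROM (
        SELECT client_id, ts,
               LAG(ts) OVER (PARTITION BY client_id ORDER BY ts) AS prev_ts
        FROM events
    )
    WHERE prev_ts IS NULL OR ts - prev_ts >= 60 * 30
    ORDER BY client_id, ts
""").fetchall()
print(rows)  # [('a', 0), ('a', 1900), ('b', 50)]
```

The row ("a", 100) is dropped because it arrives only 100 seconds after the previous event for the same client.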

Related

Select latest 30 dates for each unique ID

This is a sample data file. The data contains unique IDs with different latitudes and longitudes at multiple timestamps. I would like to select the rows for the latest 30 days of coordinates for each unique ID. Please help me with how to write the query. The data is in a Hive table.
Regards,
Akshay
According to your example above (where ids 2 and 3 have no current-year dates), you can number the dates for each id (ordered by date descending) using the window function ROW_NUMBER(), then keep the latest 30 values:
-- get all values for each id where num <= 30 (the last 30 dates for each id)
select * from
(
  -- number each date for each id, ordered descending
  select *, row_number() over(partition by ID order by DATE desc) num from Table
) X
where num <= 30
If you need only unique dates (ignoring the time component) for each id, try this query:
select * from
(
  -- number each date for each id
  select *, row_number() over(partition by ID order by new_date desc) num
  from
  (
    -- remove duplicates using distinct
    select distinct ID, cast(DATE as date) new_date from Table
  ) X
) Y
where num <= 30
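A small SQLite sketch of the same ROW_NUMBER() pattern; the `readings` table, its columns, and the rows are hypothetical, and N = 2 is used instead of 30 to keep the example short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, "2024-01-01"), (1, "2024-01-02"), (1, "2024-01-03"),
     (2, "2024-01-05")],
)

# Number each row per id, newest first, then keep the top N (here N = 2).
rows = conn.execute("""
    SELECT id, day FROM (
        SELECT id, day,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY day DESC) AS num
        FROM readings
    )
    WHERE num <= 2
    ORDER BY id, day
""").fetchall()
print(rows)  # [(1, '2024-01-02'), (1, '2024-01-03'), (2, '2024-01-05')]
```

Note that id 1 loses its oldest date, while id 2 keeps its single row: the window restarts numbering for each partition.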
In Oracle this will be:
SELECT * FROM TEST_DATE1
WHERE DATEUPDT > SYSDATE - 30;
In SQL Server:
select * from MyTable
where [Date] >= dateadd(d, -30, getdate());
To group by ID and perform aggregation, use something like this:
select ID,
       count(*) row_count,
       max(Latitude) max_lat,
       max(Longitude) max_long
from MyTable
where [Date] >= dateadd(d, -30, getdate())
group by ID;

How to calculate the difference between dates in BigQuery

I have a table named Employees with columns PersonID, Name, StartDate. I want to calculate 1) the difference in days between the newest and oldest employee and 2) the longest period of time (in days) without any new hires. I have tried to use DATEDIFF; however, the dates are in a single column and I'm not sure what other method I should use. Any help would be greatly appreciated.
Below is for BigQuery Standard SQL
#standardSQL
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
You can test and play with the above using dummy data, as in the example below:
#standardSQL
WITH `project.dataset.your_table` AS (
  SELECT DATE '2019-01-01' StartDate UNION ALL
  SELECT DATE '2019-01-03' StartDate UNION ALL
  SELECT DATE '2019-01-13' StartDate
)
SELECT
SUM(days_before_next_hire) AS days_between_newest_and_oldest_employee,
MAX(days_before_next_hire) - 1 AS longest_period_without_new_hire
FROM (
SELECT
DATE_DIFF(
StartDate,
LAG(StartDate) OVER(ORDER BY StartDate),
DAY
) days_before_next_hire
FROM `project.dataset.your_table`
)
with this result:
Row  days_between_newest_and_oldest_employee  longest_period_without_new_hire
1    12                                       9
Note the use of -1 in calculating longest_period_without_new_hire; whether to apply this adjustment is up to you and depends on how you prefer to count gaps.
1) difference in days between the newest and oldest record
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT DATE_DIFF(MAX(date), MIN(date), DAY) max_minus_min
FROM table
2) the longest period of time (in days) without any new records
WITH table AS (
SELECT DATE(created_at) date, *
FROM `githubarchive.day.201901*`
WHERE _table_suffix<'2'
AND repo.name = 'google/bazel-common'
AND type='ForkEvent'
)
SELECT MAX(diff) max_diff
FROM (
SELECT DATE_DIFF(date, LAG(date) OVER(ORDER BY date), DAY) diff
FROM table
)
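Both metrics reduce to simple date arithmetic. Here is a minimal Python sketch (not BigQuery) over the same three dummy StartDate values used above:

```python
from datetime import date

start_dates = sorted([date(2019, 1, 1), date(2019, 1, 3), date(2019, 1, 13)])

# 1) difference in days between the newest and oldest record
max_minus_min = (start_dates[-1] - start_dates[0]).days

# 2) longest gap between consecutive hire dates, minus 1 to count
#    full days with no hires (matching the -1 adjustment in the query)
gaps = [(b - a).days for a, b in zip(start_dates, start_dates[1:])]
longest_without_hire = max(gaps) - 1

print(max_minus_min, longest_without_hire)  # 12 9
```

The sum of consecutive gaps equals max minus min, which is why the query can compute metric 1 as SUM(days_before_next_hire).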

Standard deviation of a set of dates

I have a table of transactions with columns id | client_id | datetime, and I have calculated the mean number of days between transactions to know how often these transactions are made by each client:
SELECT *, ((date_last_transaction - date_first_transaction)/total_transactions) AS frequency
FROM (
SELECT client_id, COUNT(id) AS total_transactions, MIN(datetime) AS date_first_transaction, MAX(datetime) AS date_last_transaction
FROM transactions
GROUP BY client_id
) AS t;
What are the existing methods to calculate the standard deviation (in days) of a set of dates with PostgreSQL? Preferably with only one query, if it is possible :-)
I have found this way:
SELECT extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 1 THEN
0
ELSE
SUM(time_since_last_invoice)/(COUNT(*)-1)
END
) * '1 day'::interval)) AS days_between_purchases,
extract(day from date_trunc('day', (
CASE WHEN COUNT(*) <= 2 THEN
0
ELSE
STDDEV(time_since_last_invoice)
END
) * '1 day'::interval)) AS range_of_days
FROM (
SELECT client_id, datetime, COALESCE(datetime - lag(datetime)
OVER (PARTITION BY client_id ORDER BY client_id, datetime
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
), 0
) AS time_since_last_invoice
FROM my_table
GROUP BY client_id, datetime
ORDER BY client_id, datetime
)
Explanation:
This query groups by client and date and then calculates the difference between each pair of transaction dates (datetime) per client_id, returning a table with these results. The outer query then processes that table and calculates the average time between differences greater than 0 (the first value in each group is excluded because it is the first transaction, so its interval is 0).
The standard deviation is calculated only when there exist 2 or more transaction dates for the same client, to avoid division-by-zero errors.
All differences are returned in PostgreSQL interval format.
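The underlying computation can be sketched in plain Python: take the day gaps between consecutive transactions and compute their mean and sample standard deviation (PostgreSQL's STDDEV is also the sample version). The per-client dates below are made up for the illustration:

```python
from datetime import date
from statistics import mean, stdev

# Hypothetical transaction dates for one client, sorted ascending.
txns = sorted([date(2024, 1, 1), date(2024, 1, 4),
               date(2024, 1, 10), date(2024, 1, 11)])

# Day gaps between consecutive transactions; the first row has no predecessor,
# mirroring the LAG(...) with a 1 PRECEDING frame in the query above.
gaps = [(b - a).days for a, b in zip(txns, txns[1:])]  # [3, 6, 1]

days_between_purchases = mean(gaps)                    # average interval in days
range_of_days = stdev(gaps) if len(gaps) >= 2 else 0   # sample stddev, guarded

print(gaps, days_between_purchases, round(range_of_days, 2))
```

The guard on len(gaps) plays the same role as the CASE WHEN COUNT(*) <= 2 branch: with fewer than two gaps a sample standard deviation is undefined.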

Compare timestamps stored as strings to a string formatted date

event_date contains timestamps stored as strings.
1382623200
1382682600
1384248600
...
How can I SELECT rows where event_date is less than a string formatted date? This is my best attempt:
SELECT *
FROM [analytics:workspace.events]
WHERE TIMESTAMP(event_date) < PARSE_UTC_USEC("2013-05-02 09:09:29");
I get all rows regardless of what date I pass to PARSE_UTC_USEC().
It looks like the event_date strings represent Unix seconds. Try this using standard SQL (uncheck "Use Legacy SQL" under "Show Options"):
WITH T AS (
SELECT x, event_date
FROM UNNEST(['1382623200',
'1382682600',
'1384248600']) AS event_date WITH OFFSET x
)
SELECT *
FROM (
SELECT * REPLACE (TIMESTAMP_SECONDS(CAST(event_date AS INT64)) AS event_date)
FROM T
)
WHERE event_date < '2013-05-02 09:09:29';
The subquery converts the event_date string into a timestamp using the REPLACE clause.
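Outside BigQuery, the same cast-then-compare logic is easy to sketch in Python. The three sample strings are from the question, and the cutoff matches the 2013-10-25 07:30 UTC example used below:

```python
from datetime import datetime, timezone

# Unix-seconds values stored as strings, as in the question.
event_dates = ["1382623200", "1382682600", "1384248600"]

cutoff = datetime(2013, 10, 25, 7, 30, tzinfo=timezone.utc)

# Cast each string to an integer of Unix seconds, then to a UTC timestamp,
# and keep only the rows before the cutoff.
kept = [s for s in event_dates
        if datetime.fromtimestamp(int(s), tz=timezone.utc) < cutoff]
print(kept)  # ['1382623200', '1382682600']
```

Comparing the raw strings lexicographically would also "work" here only by accident of equal string lengths; converting to a real timestamp type, as the REPLACE clause does, is the robust approach.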
Try the query below. Hope this helps:
SELECT event_date, TIMESTAMP(event_date) as ts
FROM -- [analytics:workspace.events]
(
SELECT event_date FROM
(SELECT '1382623200' AS event_date),
(SELECT '1382682600' AS event_date),
(SELECT '1384248600' AS event_date)
)
WHERE TIMESTAMP(event_date) < PARSE_UTC_USEC("2013-10-25 07:30:00")
above is just example - you should use your table in real life:
SELECT event_date, TIMESTAMP(event_date) as ts
FROM [analytics:workspace.events]
WHERE TIMESTAMP(event_date) < PARSE_UTC_USEC("2013-10-25 07:30:00")

Meaning of SQL code in reviewed code

I need to review some code from a test application written in PHP/MySQL.
The author of this code wrote three SQL queries.
I can't tell whether he intended some performance optimization here.
DB::fetch("SELECT COUNT(*) AS count, `ip`, `datetime`
    FROM `logs`
    WHERE `datetime` > ('2006-02-03' - INTERVAL 10 DAY)
    GROUP BY `ip`
    ORDER BY `datetime` DESC");
$hits = DB::fetchAll("SELECT COUNT(*) AS count, `datetime`
    FROM `logs`
    WHERE `datetime` > ('2006-02-03' - INTERVAL 10 DAY) AND `is_doc` = 1
    GROUP BY `datetime`
    ORDER BY `datetime` DESC");
$hosts = DB::fetchAll("SELECT COUNT(*) AS hosts, datetime
    FROM (
        SELECT `ip`, datetime
        FROM `logs`
        WHERE `is_doc` = 1
        GROUP BY `datetime`, `ip`
        ORDER BY `logs`.`datetime` DESC
    ) AS cnt
    WHERE cnt.datetime > ('2006-02-03' - INTERVAL 10 DAY)
    GROUP BY cnt.datetime
    ORDER BY datetime DESC");
The results of the first query are not used in the application.
The 1st query is invalid: it selects 2 columns plus 1 aggregate but only groups by 1 of the 2 selected columns.
The 2nd query gets a count of all rows in logs by date, since 10 days before 2006-02-03.
The 3rd query gets a count of all distinct ip values from logs per date, since 10 days before 2006-02-03, and could be better written as
SELECT COUNT(DISTINCT ip) hosts, datetime
FROM logs
WHERE is_doc = 1
  AND datetime > ('2006-02-03' - INTERVAL 10 DAY)
GROUP BY datetime
ORDER BY datetime DESC
If this was a submission for a job interview, you may wonder why the cutoff date isn't passed in as a variable.
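The COUNT(DISTINCT ip) rewrite collapses the subquery-plus-GROUP BY trick into one pass. A small SQLite sketch with made-up rows shows the behavior (the date filter is omitted here to keep the demo focused on the distinct count):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ip TEXT, day TEXT, is_doc INTEGER)")
conn.executemany(
    "INSERT INTO logs VALUES (?, ?, ?)",
    [("1.1.1.1", "2006-02-01", 1),
     ("1.1.1.1", "2006-02-01", 1),   # duplicate hit from the same host
     ("2.2.2.2", "2006-02-01", 1),
     ("1.1.1.1", "2006-02-02", 0)],  # is_doc = 0, filtered out
)

# Distinct hosts per day: duplicate hits from the same ip count once.
rows = conn.execute("""
    SELECT day, COUNT(DISTINCT ip) AS hosts
    FROM logs
    WHERE is_doc = 1
    GROUP BY day
    ORDER BY day DESC
""").fetchall()
print(rows)  # [('2006-02-01', 2)]
```

This is exactly what the original third query computes with its inner GROUP BY `datetime`, `ip`: one row per (date, ip) pair, then a count per date.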