Adding a new column to an Athena (Presto) table calculated by taking the difference between two rows - sql

Over the past few weeks, I've written a pipeline that picks up all the clickstream data being broadcast from a website. The pipeline makes use of AWS in the following way: S3 > EC2 (for transforms) > Athena (scanning a clean, partitioned S3). New data comes into the pipeline every 24 hours and this works great - my clickstream data is easily queryable. However, I now need to add some additional columns, i.e. time spent on each page. This can be achieved by sorting by user ID and timestamp and then taking the difference between the timestamp column of row_n1 and row_n2. So my questions are:
1) How can I do this via an SQL query? I'm struggling to get it to work, but my thinking is that once I do, I can trigger this query every 24 hours to run on the new clickstream data coming into Athena.
2) Is this a reasonable way to add additional columns or new aggregate tables? For example, build a query that runs every 24 hours on new data and appends to a new table.
Ideally, I don't want to touch any of the source code that's been written to do the "core" ETL pipeline.
For reference, my table looks similar to the following (with the new column timeSpentOnPage):
| userID     | eventNum | Category | Time            | ...... | timeSpentOnPage |
| '103-1023' | '3'      | 'View'   | '12-10-2019...' | ...... | 3s              |
Thanks for any direction/advice that can be provided.

I'm not entirely sure what you are asking, and some example data and expected output would be helpful. For example, I don't quite understand what you mean by row_n1 and row_n2.
I'm going to guess that you mean something like calculating the difference between the timestamps of consecutive rows. That can be achieved by a query like
SELECT
userID,
timestamp - LAG(timestamp, 1) OVER (PARTITION BY userID ORDER BY timestamp) AS timeSpentOnPage
FROM events
The LAG window function returns the value from a previous row (1 in this case means the previous row) within the window defined by the OVER clause (in this case all rows with the same userID, sorted by timestamp). It's kind of like GROUP BY, but evaluated for each row, if that makes sense.
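Note that in Athena/Presto, subtracting two timestamps yields an interval. If you would rather have the result as a number of seconds, a variant using date_diff might look like the sketch below (still assuming the same hypothetical events table, and quoting "timestamp" since it is also a type keyword):
-- date_diff returns whole seconds; "events" and the column names are the
-- placeholder names used in the query above
SELECT
  userID,
  date_diff(
    'second',
    LAG("timestamp", 1) OVER (PARTITION BY userID ORDER BY "timestamp"),
    "timestamp"
  ) AS timeSpentOnPage
FROM events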
It wouldn't quite give you the time spent on each page: some page views would look like they were very long when in fact there just wasn't any activity between them (say someone browsed some, went to lunch, and browsed some more – the last page view before lunch would look like it spanned the whole lunch).
There is no way to do the equivalent of UPDATE in Athena. The closest thing is doing a "CTAS" (Create Table AS) to create a new table (which with some automation can be turned into creating new partitions for existing tables).
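As an illustration, the CTAS route could look something like the following sketch (the bucket path, output table name, and column list are placeholders, not taken from your pipeline):
-- Materialize the derived column into a new Parquet-backed table.
-- The S3 location and all names below are placeholders.
CREATE TABLE events_with_duration
WITH (
  format = 'PARQUET',
  external_location = 's3://your-bucket/events_with_duration/'
) AS
SELECT
  userID,
  eventNum,
  Category,
  "timestamp",
  date_diff(
    'second',
    LAG("timestamp", 1) OVER (PARTITION BY userID ORDER BY "timestamp"),
    "timestamp"
  ) AS timeSpentOnPage
FROM events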
If you provide some more information about your data I can revise this answer with other suggestions.

What is the best approach for bulk cleaning a database table that has a large amount of duplicated data loaded every day (snowflake db)

Thanks in advance for reading this; I hope I explain my problem clearly.
In one of our domains, we have multiple pipelines where data flows from S3 into a Snowflake staging table using Airflow. The data itself originates from a number of different applications, but the process is always the same. The data is extracted from the application by the support teams (multiple support teams across multiple countries, using different technologies), then lands in AWS S3 and is bulk loaded into Snowflake. Due to limitations on the data from source, there often isn't any filter on the data itself, and effectively the staging table is loaded with the raw CSV every single day; a file date column is added to the data itself. The result is that we have tables that have been loaded with the same data every single day since 2009.
However, the data does change: from day to day a column value will change, so the file date is very useful in tracking changed attributes and something that I want to exploit. Further, if the data were cleansed we would need approximately 1% of the data.
These tables are huge (some contain around 16 trillion rows) but they can be quite narrow.
I would like to optimally loop through each day's worth of data and then load only new data into the staging tables, as opposed to just loading everything each day.
I have tried the following:
A query that windows over the entire set, compares the hashed value of each row (minus the file date), and only returns a row if it did not appear in the previous date's data set (roughly sketched below, after the next item). This works, but not for the larger tables, as the warehouse starts spilling to disk and then it takes hours.
A day-by-day loop that looks at each file date's data set and compares it to the previous day, loading only the difference. This takes too long for the initial clean of the tables, but it is what I am doing once the data has been cleaned and will form the initial load procedure.
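In rough outline, the query from that first attempt looked something like this (the table and column names here are just placeholders, and it assumes FILE_DATE is stored as a DATE):
-- Keep a row only if the same row hash did not appear on the previous file date.
-- RAW_STAGING, COL_A/COL_B/COL_C, and FILE_DATE are placeholder names;
-- the hash covers every column except FILE_DATE.
WITH hashed AS (
    SELECT t.*,
           HASH(COL_A, COL_B, COL_C) AS row_hash
    FROM RAW_STAGING t
)
SELECT h.*
FROM hashed h
WHERE NOT EXISTS (
    SELECT 1
    FROM hashed p
    WHERE p.row_hash = h.row_hash
      AND p.FILE_DATE = DATEADD(day, -1, h.FILE_DATE)  -- assumes FILE_DATE is a DATE
);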
The current solution is that I dynamically create multiple MINUS set statements, comparing each day against the day before, and then batch these into blocks of 10-20 based on the average daily row size. So, as an example:
INSERT INTO TEMP_TABLE
(SELECT * FROM TABLE_A WHERE FILE_DATE = 040123
 MINUS
 SELECT * FROM TABLE_A WHERE FILE_DATE = 030123)
UNION ALL
(SELECT * FROM TABLE_A WHERE FILE_DATE = 030123
 MINUS
 SELECT * FROM TABLE_A WHERE FILE_DATE = 020123)
etc...
This is not pretty, though it does work; however, it's taking around 12 hours to process 70-odd tables.
I would like advice on whether there is another approach.
Please bear in mind that I am limited to using Snowflake due to resourcing issues and politics.
Any guidance and ideas would be much appreciated.
Regards

Azure Stream Analytics : Select data with the last timestamp only

I'm working on a way to stream the status of some jobs that are running on an HPC resource (sort of like trying to create a dashboard to look at real-time flight status). I generate and push data every 60 seconds. Unfortunately, this way I end up with a lot of repeated data, as the status of each 'job' changes unpredictably. I need a way to keep only the latest data. I'm not an SQL pro and do this work in my free time, so any help will be appreciated!
Here is my query:
SELECT
Job, Ref, Location, Queue, Description, Status, ElapTime, CAST(Time AS datetime) AS Time
INTO
output_source
FROM
input_source
Here is what my output looks like when I test the query:
[Image: Query Test Result]
As you can see in the image, there are two sets of data with two different timestamps. I would like the query to return all the columns associated with only the last timestamp. How do I do this? Any ideas? Apologies if this is a repeated question; I have not found an answer that has helped me solve this problem.
Thanks for all your help!
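One pattern that might achieve this is Stream Analytics' TopOne() aggregate over a tumbling window, which keeps a single latest record per Job per window. This is an untested sketch under that assumption, not something from the original setup:
-- Keep only the most recent record per Job in each 60-second window.
-- TopOne() returns a record, so fields come out nested (latest.Job, latest.Status, ...).
-- Assumes Time sorts correctly as-is; it may need to be cast to datetime first.
SELECT
    TopOne() OVER (ORDER BY Time DESC) AS latest
INTO
    output_source
FROM
    input_source
GROUP BY
    Job, TumblingWindow(second, 60)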

Double or triple timestamp issue

I am using SQL Assistant, and my data brings in snapshots from a huge database in the form of timestamps. Occasionally the snapshots come in multiples per hour. The data is correct; multiple snapshots do happen from time to time within an hour (not always, but it does happen).
I am bringing this into Spotfire and viewing it by hour, and when more than one snapshot happens in the hour, the data shows as doubled.
I only want to display one per hour, preferably the last (max) timestamp for the hour. Example: for the 7 AM hour, the data has a snapshot for 7:10 AM and one for 7:55 AM.
These are correct, but I only want to display the last (max) timestamp, 7:55 AM in this case. I can't figure the issue out in Spotfire, so I am leaning towards a fix in SQL. How can I display only one row for each hour?
You'd do this similarly to how you'd probably do it in SQL -- using a ranking/rownumber function.
The basic way Rank in Spotfire works is Rank(Order columns, order direction, partitioned columns, tie method)
You need to partition by the combination of Date and Hour, and then sort descending by your timestamp column.
So the code to identify the rows that you want to isolate should be something along the lines of:
Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first")
What you do with it from here is going to depend on how you plan to use the data. For example, you can Limit Data Using Expression and set the code above = 1, which will limit your table accordingly (helpful if you don't want your users to accidentally forget to filter), or you can create a calculated column which turns it into a flag of some form, like here:
If(Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first") = 1, "Latest", "Duplicate")
Which allows your users to filter by this property. This way, they have the option to look at the extra rows.
Ultimately, though, if you want to only ever see these rows, and have no use for the earlier records, I'd probably do it in SQL, if you have that ability. This reduces the number of rows you have to load into your analytic.
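For reference, the SQL-side equivalent might look something like the sketch below, using a row-number function (the table name snapshots and column name snapshot_ts are placeholders):
-- Keep only the latest snapshot per calendar hour.
-- "snapshots" and "snapshot_ts" are placeholder names.
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY CAST(snapshot_ts AS DATE), EXTRACT(HOUR FROM snapshot_ts)
               ORDER BY snapshot_ts DESC
           ) AS rn
    FROM snapshots s
) ranked
WHERE rn = 1;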

Shifting Window in google Big Query dataset

I have 30 daily sharded tables in Big Query from Nov 1 to Nov 30, 2016.
Each of these tables follow the naming convention of "sample_datamart_YYYYMMDD".
Each of these daily tables have a field called timestampServer.
My goal is to advance the data by 24 hours at 00:00:00 UTC every day, so that the data is kept current without me having to copy the tables.
Is there any way to:
1) do a calculation on the field timestampServer so that it gets updated every 24 hours?
2) and at the same time rename the table from sample_datamart_20161130 to sample_datamart_20161201?
I've read the other posts and I think those are more about aggregations in a 30-day window. My objective is not to do any aggregations. I just want to move the whole dataset forward by 24 hours so that when I search for the last 1 day, there will always be data there.
Does anyone know if Google Cloud Datasets: Update would be able to perform these tasks?
https://cloud.google.com/bigquery/docs/reference/rest/v2/datasets/update#try-it
Thanks very much for any guidance.
As for #2 - how to rename the table from sample_datamart_20161130 to sample_datamart_20161201?
This can be achieved by copying the table to a new table and then deleting the original table.
There is zero extra cost, as the copy job is free of charge.
The table can be copied with the Jobs: Insert API using a copy configuration, and then the table can be deleted using the Tables: Delete API.
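As a rough sketch, the same copy-then-delete can these days also be expressed in BigQuery SQL DDL (the dataset name mydataset is a placeholder):
-- Copy the shard to the new name (no query scan involved), then drop the original.
-- "mydataset" is a placeholder dataset name.
CREATE TABLE mydataset.sample_datamart_20161201
COPY mydataset.sample_datamart_20161130;

DROP TABLE mydataset.sample_datamart_20161130;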
Just wanted to note that the above answer only directly addresses your (second) question. But somehow I feel you may be going in the wrong direction. If you describe in more detail what you are trying to achieve (as opposed to how you think you will implement it), we might be able to provide better help. If you go that way, I would recommend posting it as a separate question :o)

Bigquery and Tableau

I connected Tableau to BigQuery and was working on the dashboards. The issue here is that BigQuery charges for the data a query scans every time.
My table is 200 GB of data. When someone queries the dashboard in Tableau, it runs the full query. Using any filters on the dashboard, it runs again over the entire table.
On 200 GB of data, if someone applies 5 filters in different analyses, BigQuery is scanning 200 GB * 5 = 1 TB (nearly). For one day of testing the analysis we were charged for a 30 TB analysis, but the table behind it is only 200 GB. Is there any way I can restrict Tableau from running over the full table in BigQuery every time there is any change?
The extract in Tableau is indeed one valid strategy, but only when you are using a custom query. If you directly access the table it won't work, as that will download 200 GB to your machine.
Other options to limit the amount of data are:
Not calling any columns that you don't need. Do this by hiding unused fields in Tableau; it will then not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and you pay for the full 200 GB even if you don't use those fields.
Another option that we use a lot is partitioning our tables, for instance a partition per day of data if you have a date field. Using the TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the number of partitions, and hence rows, that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view and then use the view in Tableau (see the sketch below). Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
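As an illustration, the body of such a legacy-SQL view might look something like this sketch (the dataset mydataset, table prefix events_, and column names are placeholders):
-- Only scan the last 7 days of the day-sharded tables.
-- "mydataset", "events_", and the column names are placeholders.
SELECT
  col_a,
  col_b
FROM TABLE_DATE_RANGE(
  [mydataset.events_],
  DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY'),
  CURRENT_TIMESTAMP()
)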
1) Right now I'm learning BQ + Tableau too, and I found that using "Extract" is a must for BQ in Tableau. With this option you can also save time building the dashboard. So my current pipeline is "Build query > Add it to Tableau > Make dashboard > Upload dashboard to Tableau Online > Schedule update for Extract".
2) You can send a Custom Quota Request to Google and set up limits per project/per user.
3) If each of your queries touches 200 GB each time, consider optimizing those queries (don't use SELECT *, use only the dates you need, etc.).
The best approach I found was to partition the table in BQ based on a date (day) field which has no time component. BQ allows you to partition a table by a day-level field. The important thing here is that even though the field is a day/date with no time component, it should be a TIMESTAMP datatype in the BQ table, i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reason the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is that when you create a viz in Tableau, it will generate SQL to run against BQ, and for the partitioned field to be utilised by that Tableau-generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype and looked up the logs in GCP and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
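If it helps, a day-partitioned copy of a table along the lines described above could be created with a sketch like this in standard SQL (dataset, table, and column names are placeholders):
-- Materialize a copy of the table partitioned on the TIMESTAMP column.
-- "mydataset", "events", "events_partitioned", and "event_day" are placeholders.
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_day) AS
SELECT *
FROM mydataset.events;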
The thing about Tableau and BigQuery is that Tableau calculates the filter values using your query (live query). What I have seen in my project logging is that it creates filters from your own query:
select `Custom SQL Query`.filtered_column from ( your_actual_datasource_query ) as `Custom SQL Query` group by `Custom SQL Query`.filtered_column
Instead, try to create the Tableau data source with incremental extracts, and also try to have your query date-partitioned (BigQuery only supports date partitioning) so that you can limit the data use.