I'm working on a way to stream the status of jobs running on an HPC resource (somewhat like building a dashboard for real-time flight status). I generate and push data every 60 seconds. Unfortunately, I end up with a lot of repeated data because the status of each 'job' changes unpredictably. I need a way to keep only the latest data. I'm not an SQL pro and do this work in my free time, so any help will be appreciated!
Here is my query:
SELECT
Job, Ref, Location, Queue, Description, Status, ElapTime, CAST(Time AS datetime) AS Time
INTO
output_source
FROM
input_source
Here is what my output looks like when I test the query:
[Image: query test result]
As you can see in the image, there are two sets of data with two different timestamps. I would like the query to return all the columns associated with only the latest timestamp. How do I do this? Any ideas? Apologies if this is a repeated question; I have not found an answer that helped me solve this problem.
Thanks for all your help!
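In case a sketch helps: assuming the engine supports window functions, one common pattern is to rank the rows for each job by timestamp and keep only the newest one. The query below is only a sketch built from the column names above; if this runs on a streaming engine such as Azure Stream Analytics, ROW_NUMBER may not be available, and a windowed aggregate like TopOne() would be the analogous approach.
WITH ranked AS (
    SELECT
        Job, Ref, Location, Queue, Description, Status, ElapTime,
        CAST(Time AS datetime) AS Time,
        -- the newest row per Job gets rn = 1
        ROW_NUMBER() OVER (PARTITION BY Job ORDER BY CAST(Time AS datetime) DESC) AS rn
    FROM
        input_source
)
SELECT Job, Ref, Location, Queue, Description, Status, ElapTime, Time
INTO output_source
FROM ranked
WHERE rn = 1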
It looks like data is no longer being published to the public NOAA forecast table in BigQuery's public datasets. Does anyone know why that is happening? I cannot find any info about the data being discontinued on either website.
project: bigquery-public-data
dataset: noaa_global_forecast_system
table: NOAA_GFS0P25
BigQuery SQL that you can use to test this out:
SELECT * FROM `bigquery-public-data.noaa_global_forecast_system.NOAA_GFS0P25` WHERE DATE(creation_time) >= "2022-04-11" LIMIT 100
New forecast data has not been inserted into the table since 4/10/22. They have missed a day before, but we have never seen them miss multiple days in a row. We would like to know if we need to migrate to a new forecast source, but we cannot find any info on whether this one is being shut down or if they are just having temporary technical difficulties.
Thanks for the heads up! This looks like a temporary technical issue, and we are working on getting this dataset back up and running.
By default, how long is a temp table kept so that its data can still be retrieved? I know that we can set up the expiration, but what is the default?
And what about the job? What's its default expiration time, if it has one?
I tried to find this in the documentation but couldn't. We return the jobId to the client so they can get the data when the job is complete, but some clients store the jobId and try to fetch data with it 2 weeks or a month later.
What's the default time here, so I can explain it to them better?
Query results are stored for 24 hours:
All query results, including both interactive and batch queries, are cached in temporary tables for approximately 24 hours with some exceptions.
https://cloud.google.com/bigquery/docs/cached-results
As Alexey mentioned, query results are stored for 24 hours when using the cache.
Regarding the lifetime of BigQuery jobs, you can get the job history for the last six months.
On the other hand, based on your description, creating a new table from your query results with an expiration time seems to be the most appropriate strategy. You could also check whether materialized views could help you store the results of recurring queries.
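For example, a minimal sketch of persisting results with an automatic expiration in BigQuery (the dataset and table names are placeholders):
-- Materialize query results into a table that expires on its own after 7 days.
CREATE TABLE mydataset.job_results
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
) AS
SELECT *
FROM mydataset.source_table;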
I am using SQL Assistant, and my data brings in snapshots from a huge database in the form of timestamps. Occasionally the snapshots come in multiples per hour. The data is correct; multiple snapshots do happen within an hour from time to time, not always, but it does happen.
I am bringing this into Spotfire and viewing by hour, and when more than one snapshot happens in the hour, the data shows as doubled.
I only want to display one per hour, preferably the last (max) timestamp for the hour. For example, for the 7 am hour the data has a snapshot for 7:10 am and one for 7:55 am.
These are correct, but I only want to display the last (max) timestamp, 7:55 am in this case. I can't figure the issue out in Spotfire, so I am leaning towards a fix in SQL. How can I display only one per hour?
You'd do this similarly to how you'd probably do it in SQL: using a ranking/row-number function.
The basic way Rank in Spotfire works is Rank(Order columns, order direction, partitioned columns, tie method)
You need to partition by the combination of Date and Hour, and then sort descending by your timestamp column.
So the code to identify the rows that you want to isolate should be something along the lines of:
Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first")
What you do with it from here depends on how you plan to use the data. For example, you can Limit Data Using Expression and set the expression above = 1, which will limit your table accordingly (helpful if you don't want your users to accidentally forget to filter). Or you can create a calculated column that turns it into a flag of some form, like here:
If(Rank([TimestampColumn], "desc", Date([TimestampColumn]), Hour([TimestampColumn]), "ties.method=first") = 1, "Latest", "Duplicate")
This allows your users to filter by the flag while still having the option to look at the extra rows.
Ultimately, though, if you want to only ever see these rows, and have no use for the earlier records, I'd probably do it in SQL, if you have that ability. This reduces the number of rows you have to load into your analytic.
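If you do push it down to SQL, a minimal sketch would rank the rows within each date/hour bucket, assuming a dialect with window functions (the table and column names here are placeholders, and DATEPART is SQL Server syntax; other dialects use EXTRACT):
SELECT *
FROM (
    SELECT s.*,
           -- the latest snapshot in each date + hour bucket gets rn = 1
           ROW_NUMBER() OVER (
               PARTITION BY CAST(SnapshotTime AS date), DATEPART(hour, SnapshotTime)
               ORDER BY SnapshotTime DESC
           ) AS rn
    FROM Snapshots s
) t
WHERE rn = 1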
Over the past few weeks, I've written a pipeline that picks up all the clickstream data being broadcast from a website. The pipeline makes use of AWS in the following way: S3 > EC2 (for transforms) > Athena (scanning a clean, partitioned S3). New data comes into the pipeline every 24 hours, and this works great: my clickstream data is easily queryable. However, I now need to add some additional columns, i.e. time spent on each page. This can be achieved by sorting by user ID and timestamp, then taking the difference between the timestamp columns of row_n1 and row_n2. So my questions are:
1) How can I do this via an SQL query? I'm struggling to get it to work, but my thinking is that once I do, I can trigger this query every 24 hours to run on the new clickstream data coming into Athena.
2) Is this a reasonable way to add additional columns or new aggregate tables? For example, building a query that runs every 24 hours on new data and appends to a new table.
Ideally, I don't want to touch any of the source code that's been written to do the "core" ETL pipeline.
For reference, my table looks similar to the following (with the new column timeSpentOnPage):
| userID     | eventNum | Category | Time            | ...... | timeSpentOnPage |
| '103-1023' | '3'      | 'View'   | '12-10-2019...' | ...... | 3s              |
Thanks for any direction/advice that can be provided.
I'm not entirely sure what you are asking, and some example data and expected output would be helpful. For example, I don't quite understand what you mean by row_n1 and row_n2.
I'm going to guess that you mean something like calculating the difference between the timestamps of consecutive rows. That can be achieved by a query like
SELECT
userID,
timestamp - LAG(timestamp, 1) OVER (PARTITION BY userID ORDER BY timestamp) AS timeSpentOnPage
FROM events
The LAG window function returns the value from a previous row (1 in this case means one row back) within the window defined by the OVER clause (in this case all rows with the same userID, sorted by timestamp). It's kind of like GROUP BY, but evaluated for each row, if that makes sense.
It wouldn't quite give you the time spent on each page, though: some page views would look very long when in fact there was just no activity between them. Say someone browsed some, went to lunch, and browsed some more; the last page view before lunch would look like it spanned the whole lunch.
There is no way to do the equivalent of UPDATE in Athena. The closest thing is a "CTAS" (Create Table AS) query to create a new table (which, with some automation, can be turned into creating new partitions for existing tables).
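As a rough sketch of what that CTAS could look like here (the new table name and S3 location are placeholders; the column names are taken from the question):
-- Write a new table containing the derived column, stored as Parquet in S3.
CREATE TABLE events_with_dwell
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/events_with_dwell/'
) AS
SELECT
  userID,
  eventNum,
  Category,
  timestamp,
  timestamp - LAG(timestamp, 1) OVER (PARTITION BY userID ORDER BY timestamp) AS timeSpentOnPage
FROM events;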
If you provide some more information about your data I can revise this answer with other suggestions.
I have a view that lists employee (EmpID), request number (ReqNo), the date the request was opened (OpenDate), and the date it was moved to the next step in the process (AssignDate). What I am trying to do is get an average of the daily queue size. If EmpID 001 has 20 requests on 1/1/13, 24 on 1/2/13, and 21 on 1/3/13, the average over 3 days should be 21.66, rounded up to 22. I have the following view:
CREATE VIEW EmpReqs
AS
SELECT [EmpID], [OpenDate], [AssignDate], [ReqNo]
FROM [Metrics].[dbo].[Assignments]
WHERE OpenDate BETWEEN '01/01/2013' AND '12/31/2013' AND
[EmpID] IS NOT NULL AND
[ReqNo] NOT LIKE 'M%'
I then wrote a query to pull an individual employee's queue per day:
/* First attempt to generate daily queue #s */
SELECT * FROM BLReqs
WHERE [BusLiaison] LIKE 'PN' AND
[OpenDate] <= '11/15/2013' AND
[AssignDate] > '11/15/2013'
Because no one has attempted to pull this information before, I have no way of verifying how accurate the above is. I tried using current dates, since I can see those in our database to compare against, but the code doesn't work: nothing is returned when I change the dates to 2014 and run my query.
What is the easiest way to verify that my code is correct, short of manually counting a day's queue?
Can anyone see any issues with the above scripts?
Is there a way to get the above code to work with current dates?
This question is really hard to answer because it is kind of broad yet gives little information at the same time. I'll try anyway:
Because no one has attempted to pull this information before, I have no way of verifying how accurate the above is.
Try checking the result of this query for a few sampled dates.
I tried using current dates, since I can see those in our database to compare against, but the code doesn't work: nothing is returned when I change the dates to 2014 and run my query.
So clearly, the query is not working, and you should find out why. Run the query for a date for which you know it should return results but doesn't. Remove conditions one by one to see which one is incorrectly removing all rows. This should be enough to identify the bug. (One thing to check: the EmpReqs view itself restricts OpenDate to 2013, so if BLReqs is defined the same way, any query using 2014 dates will return nothing.)
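A quick sketch of that process, using the names from the question (adjust the dates to ones you expect to match):
-- Add the conditions back one at a time and watch where the count drops to zero.
SELECT COUNT(*) FROM BLReqs;                          -- any rows at all?

SELECT COUNT(*) FROM BLReqs
WHERE [BusLiaison] LIKE 'PN';                         -- condition 1

SELECT COUNT(*) FROM BLReqs
WHERE [BusLiaison] LIKE 'PN' AND
      [OpenDate] <= '11/15/2014';                     -- conditions 1 + 2

SELECT COUNT(*) FROM BLReqs
WHERE [BusLiaison] LIKE 'PN' AND
      [OpenDate] <= '11/15/2014' AND
      [AssignDate] > '11/15/2014';                    -- full query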
Can anyone see any issues with the above scripts?
No, they look fine; these are very simple queries. That's why I said we have too little information: some key piece of information is missing that would allow us to find the bug.
Is there a way to get the above code to work with current dates?
Stop staring at the code and hoping for a revelation. Debug it. Experiment.