How to find periods without activity in BigQuery

I have a table of activity events in BigQuery, currently only about 40 MB of data. The activity date is stored in one of the fields as a string in the format YYYY-MM-DD HH:MM:SS. I need a way to determine periods of inactivity (longer than some predefined threshold) that runs in a reasonable amount of time.
The query I built has already been running for an hour. Here it is:
SELECT t1.date, MIN(PARSE_UTC_USEC(t1.date) - PARSE_UTC_USEC(t2.date)) AS mintime
FROM logs t1
JOIN (SELECT date, http_error FROM logs) t2 ON t1.http_error = t2.http_error
WHERE PARSE_UTC_USEC(t1.date) > PARSE_UTC_USEC(t2.date)
GROUP BY t1.date
HAVING mintime > 1000;
The idea is:
1. Take the Cartesian product of the table with itself (http_error is a field that almost never changes value, so it works as the join key)
2. Keep only the pairs where date1 > date2
3. For every date1, take the date2 with the minimal difference
4. Restrict the result to cases where this minimal difference is greater than the threshold.
I admit the real query I use is burdened a bit by fixes for invalid data (this adds extra operations), but I really need a better approach for this. I'd be glad to hear other ideas.

I don't know the granularity of inactivity you are looking for, but why not try bucketing by your timestamp, then counting the relative frequency of activities in each bucket:
SELECT
  UTC_USEC_TO_HOUR(PARSE_UTC_USEC(date)) AS hour_bucket,
  COUNT(*) AS activity_count
FROM logs
GROUP BY hour_bucket
ORDER BY activity_count ASC;
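As a sketch of an alternative (not from the original answers): in standard BigQuery SQL you can compute the gap to the previous event with a window function, which avoids the self-join entirely. The table name logs, the column date, and the threshold value are assumptions carried over from the question.
-- Sketch: gap (in seconds) to the previous event; large gaps mark the end of an inactivity period
SELECT date, gap_seconds
FROM (
  SELECT
    date,
    TIMESTAMP_DIFF(
      PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', date),
      LAG(PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%S', date)) OVER (ORDER BY date),
      SECOND) AS gap_seconds
  FROM logs
) AS gaps
WHERE gap_seconds > 600  -- example threshold: 10 minutes of inactivity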

Related

Use SQL to ensure I have data for each day of a certain time period

I'm looking to select only one data point from each date in my report. I want to ensure each day is accounted for and has at least one row of information, as we had to do a few different things to move a large data file into our data warehouse (import one large Google Sheet for some data, use Python for daily pulls of some of the other data), and I want to make sure no date was left out. The data goes from last summer through now. I could use COUNT(DISTINCT ...) just to check that the number of distinct days matches the number of days between the first data point and yesterday (the latest data point), but I want to verify that each individual day is accounted for. I should mention I am in BigQuery. Also, an example of the created_at format is: 2021-02-09 17:05:44.583 UTC
This is what I have so far:
SELECT FIRST(created_at)
FROM 'large_table'
ORDER BY created_at
I know FIRST is probably not the best function for this case; it's currently just grabbing the very first data point in created_at, but it's a jumping-off point.
You can use aggregation:
select any_value(lt).*
from large_table lt
group by created_at
order by min(created_at);
Note: This assumes that created_at is a date -- or at least only has one value per date. You might need to convert it to a date:
select any_value(lt).*
from large_table lt
group by date(created_at)
order by min(created_at);
BigQuery equivalent of the query in your question:
SELECT created_at
FROM `large_table`
ORDER BY created_at
LIMIT 1
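Separately, since the question also wants to verify that every day is accounted for, here is a sketch (not from the original answer) that lists any missing days by comparing a generated calendar against the distinct dates in the table; the table name and the start date are assumptions:
-- Sketch: list calendar days that have no rows at all (start date is a placeholder)
WITH days_present AS (
  SELECT DISTINCT DATE(created_at) AS day
  FROM `large_table`
)
SELECT cal_day AS missing_day
FROM UNNEST(GENERATE_DATE_ARRAY(DATE '2020-06-01', CURRENT_DATE())) AS cal_day
LEFT JOIN days_present p ON p.day = cal_day
WHERE p.day IS NULL
ORDER BY cal_day;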

How to find where a total condition exist

I am trying to create a report that will show how long an automated sprinkler system has run for. The system is comprised of several sprinklers, with each one keeping track of only itself, and then sends that information to a database. My problem is that each sprinkler has its own run time (I.E. if 5 sprinklers all ran at the same time for 10 minutes, it would report back a total run time of 50 minutes), and I want to know only the net amount of run time - in this example, it would be 10 minutes.
The database is comprised of a time stamp and a boolean, where it records the time stamp every time a sprinkler is shut on or off (its on/off state is indicated by the 1/0 of the boolean).
So, to figure out the total net time the system was on each day - whether it was 1 sprinkler running or all of them - I need to check the database for time frames where no sprinklers were turned on at all (or, conversely, where any sprinkler at all was turned on). I would think the beginning of the query would look something like
SELECT * FROM MyTable
WHERE MyBoolean = 0
AND [ ... ]
But I'm not sure what the conditional statements that would follow the AND would be like to check the time stamps.
Is there a query I can send to the database that will report back this format of information?
EDIT:
Here's the table the data is recorded to - it's literally just a name, a boolean, and a datetime of when the boolean was changed, and that's the entire database
Every time a sprinkler turns on the number of running sprinklers increments by 1, and every time one turns off the number decrements by 1. If you transform the data so you get this:
timestamp on/off
07:00:05 1
07:03:10 1
07:05:45 -1
then you have a sequence of events in order; which sprinklers they refer to is irrelevant. (I've changed the zeros to -1 for reasons that will become evident in a moment. You can do this with "(2 * value) - 1")
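For example, the transformed sprinkler_events data could be produced from the original table with something like the following sketch (MyTable and MyBoolean come from the question; EventTime is an assumed name for the datetime column):
-- Sketch: map the 0/1 boolean to -1/+1 so on/off events can be summed
SELECT EventTime            AS ts,      -- called "timestamp" in the query below
       (2 * MyBoolean) - 1  AS on_off
FROM MyTable
ORDER BY EventTime;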
Now put a running total together:
select a.timestamp,
       (SELECT SUM(b.on_off)
        FROM sprinkler_events b
        WHERE b.timestamp <= a.timestamp) as run_total
from sprinkler_events a
order by a.timestamp;
where sprinkler_events is the transformed data I listed above. This will give you:
timestamp run_total
07:00:05 1
07:03:10 2
07:05:45 1
and so on. Every row in this which has a running total of zero is a time at which all sprinklers were off, which I think is what you're looking for. If you need to sum the time they were on or off, you'll need to do additional processing: search for "date difference between consecutive rows" and you'll see solutions for that.
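As a sketch of that additional processing (assuming a database with window functions; LAG and DATEDIFF here follow SQL Server naming, running_totals stands for the result of the query above, and ts for its timestamp column):
-- Sketch: length of each interval during which all sprinklers were off
-- (prev_total = 0 means everything was off from prev_ts until ts)
SELECT ts,
       DATEDIFF(second, prev_ts, ts) AS seconds_all_off
FROM (
    SELECT ts,
           LAG(ts)        OVER (ORDER BY ts) AS prev_ts,
           LAG(run_total) OVER (ORDER BY ts) AS prev_total
    FROM running_totals
) g
WHERE prev_total = 0;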
You might consider looking for whether all the sprinklers are currently off. For example:
SELECT COUNT(DISTINCT s._NAME) AS sprinklers_currently_off
FROM (
    SELECT
        _NAME,
        _VALUE,
        _TIMESTAMP,
        ROW_NUMBER() OVER (PARTITION BY _NAME ORDER BY _TIMESTAMP DESC, _VALUE) AS latest_rec
    FROM sprinklers
) s
WHERE
    _VALUE = 0
    AND latest_rec = 1
The inner query orders the records so that you can get the latest status of all the sprinklers, and the outer query counts how many are currently off. If you have 10 sprinklers you would report them all off when this query returns 10.
You could modify this by applying a date range to the inner query if you wanted to look into the past, but this should get you on the right track.
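For instance, a sketch of that date-range modification (the cutoff value is just a placeholder):
-- Sketch: same query, but "as of" a point in the past rather than right now
SELECT COUNT(DISTINCT s._NAME) AS sprinklers_off_at_cutoff
FROM (
    SELECT
        _NAME,
        _VALUE,
        ROW_NUMBER() OVER (PARTITION BY _NAME ORDER BY _TIMESTAMP DESC, _VALUE) AS latest_rec
    FROM sprinklers
    WHERE _TIMESTAMP <= '2020-07-01 12:00:00'   -- example cutoff
) s
WHERE s._VALUE = 0
  AND s.latest_rec = 1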

SQL for Next/Prior Business Day from Calendar table (in MS Access)

I have a Calendar table pulled from our mainframe DBs and saved as a local Access table. The table has history back to the 1930s (and I know we use back to the 50s in at least one place), resulting in 31k records. This Calendar table has 3 fields of interest:
Bus_Dt - every day, not just business days. Primary Key
Bus_Day_Ind - indicates if the day was a valid business day for the stock market.
Prir_Bus_Dt - the prior business day. Contains some errors (about 50), all old.
I have written a query to retrieve the first business day on or after the current calendar day, but it runs supremely slowly (5+ minutes). I have examined the showplan output and see it is being run via an x-join (cross product), which between 30k+ record tables gives a solution space (and date comparisons) on the order of nearly 10 million. However, the actual task is not hard, and could be performed comfortably by Excel in minimal time using a simple sort.
My question is thus, is there any way to fix the poor performance of the query, or is this an inherent failing of SQL? (DB2 run on the mainframe also is slow, though not crushingly so. Throwing cycles at the problem and all that.) Secondarily, if I were to trust prir_bus_dt, can I get there better? Or restrict the date range (aka, "cheat"), or any other tricks I didn't think of yet?
SQL:
SELECT TE2Clndr.BUS_DT AS Cal_Dt
, Min(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
, TE2Clndr AS TE2Clndr_1
WHERE TE2Clndr_1.BUS_DAY_IND="Y" AND
TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]
GROUP BY TE2Clndr.BUS_DT;
Showplan:
Inputs to Query
Table 'TE2Clndr'
Table 'TE2Clndr'
End inputs to Query
01) Restrict rows of table TE2Clndr
by scanning
testing expression "TE2Clndr_1.BUS_DAY_IND="Y""
store result in temporary table
02) Inner Join table 'TE2Clndr' to result of '01)'
using X-Prod join
then test expression "TE2Clndr.BUS_DT<=[te2clndr_1].[bus_dt]"
03) Group result of '02)'
Again, the question is, can this be made better (faster), or is this already as good as it gets?
I have a new query that is much faster for the same job, but it depends on the prir_bus_dt field (which has some errors). It also isn't great theory since prior business day is not necessarily available on everyone's calendar. So I don't consider this "the" answer, merely an answer.
New query:
SELECT TE2Clndr.BUS_DT as Cal_Dt
, Max(TE2Clndr_1.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr
INNER JOIN TE2Clndr AS TE2Clndr_1
ON TE2Clndr.PRIR_BUS_DT = TE2Clndr_1.PRIR_BUS_DT
GROUP BY TE2Clndr.BUS_DT;
What about this approach?
select min(bus_dt)
from te2Clndr
where bus_dt >= date()
and bus_day_ind = 'Y'
This is my reference for date() representing the current date
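If the goal is the next business day for every calendar day rather than just today, one alternative to the cross join is a correlated subquery; this is only a sketch, and whether it actually beats the original in Access depends on BUS_DT being indexed:
SELECT c.BUS_DT AS Cal_Dt,
       (SELECT Min(n.BUS_DT)
        FROM TE2Clndr AS n
        WHERE n.BUS_DAY_IND = "Y"
          AND n.BUS_DT >= c.BUS_DT) AS Next_Bus_Dt
FROM TE2Clndr AS c;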

Using Table Decorators on Big Query Web Interface

I saw the news about Table Decorators being available to limit the amount of data that is queried by specifying a time interval or limit. I did not see any examples on how to use the Table Decorators in the Big Query UI. Below is an example query that I'd like to run and only look at data that came in over the last 4hours. Any tips on how I can modify this query to utilize Table Decorators?
SELECT
foo,
count(*)
FROM [bigtable.201309010000]
GROUP BY 1
EDIT after trying example below
The first query above scans 180GB of data for the month of September (up through Sept 19th). I'd expect the query below to only scan data that came in during the time period specified, in this case 4 hours, so I'd expect the billing to be about 1.6GB, not 180GB. Is there a way to set up the ETL/query so we do not get billed for scanning the whole table?
SELECT
foo,
count(*)
FROM [bigtable.201309010000@-14400000]
GROUP BY 1
To use table decorators, you can either specify @timestamp or @timestamp-end_time. The timestamp can be negative, in which case it is relative; end_time can be empty, in which case it means the current time. You can use both of these special cases together to get a time range relative to now, e.g. [table@-time_in_ms-]. So for your case, since 4 hours is 14400000 milliseconds, you can use:
SELECT foo, count(*) FROM [dataset.table@-14400000-] GROUP BY 1
This is a little bit confusing; we're intending to publish better documentation and examples soon.

SQL: Calculating system load statistics

I have a table like this that stores messages coming through a system:
Message
-------
ID (bigint)
CreateDate (datetime)
Data (varchar(255))
I've been asked to calculate the messages saved per second at peak load. The only data I really have to work with is the CreateDate. The load on the system is not constant, there are times when we get a ton of traffic, and times when we get little traffic. I'm thinking there are two parts to this problem: 1. Determine ranges of time that are considered peak load, 2. Calculate the average messages per second during these times.
Is this the right approach? Are there things in SQL that can help with this? Any tips would be greatly appreciated.
I agree: you have to figure out what peak load is before you can start to create reports on it.
The first thing I would do is figure out how I am going to define peak load. For example, am I going to look at an hour-by-hour breakdown?
Next I would group by CreateDate formatted to second precision (no milliseconds). On top of that grouping I would compute an average based on the number of records, as sketched below.
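A rough sketch of that idea (SQL Server syntax to match the other answers; averaging per hour is just an illustrative choice, and seconds with zero messages are not counted):
-- Sketch: messages per one-second bucket, then the average bucket size per hour
SELECT
    LEFT(second_bucket, 13) AS hour_bucket,          -- yyyy-mm-dd hh
    AVG(per_second * 1.0)   AS avg_messages_per_second
FROM (
    SELECT CONVERT(char(19), CreateDate, 120) AS second_bucket,
           COUNT(*) AS per_second
    FROM Message
    GROUP BY CONVERT(char(19), CreateDate, 120)
) s
GROUP BY LEFT(second_bucket, 13)
ORDER BY avg_messages_per_second DESC;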
I don't think you'd need to know the peak hours; you can generate them with SQL, wrapping the full query and selecting the top 20 entries, for example:
select top 20 *
from (
[...load query here...]
) qry
order by LoadPerSecond desc
This answer had a good lesson about averages. You can calculate the load per second by looking at the load per hour, and dividing by 3600.
To get a first glimpse of the load for the last week, you could try (SQL Server syntax):
select datepart(dy, createdate) as DayOfYear,
       datepart(hour, createdate) as Hour,
       count(*)/3600.0 as LoadPerSecond
from message
where CreateDate > dateadd(day, -7, getdate())
group by datepart(dy, createdate), datepart(hour, createdate)
To find the peak load per minute:
select max(MessagesPerMinute)
from (
    select count(*) as MessagesPerMinute
    from message
    where CreateDate > dateadd(day, -7, getdate())
    group by datepart(dy, createdate), datepart(hour, createdate), datepart(minute, createdate)
) as PerMinute
Grouping by datepart(dy, ...) is an easy way to distinguish between days without worrying about month borders. It works until you select more than a year back, but that would be unusual for performance queries.
Warning: these will run slowly!
this will group your data into "second" buckets and list them from the most activity to least:
SELECT
CONVERT(char(19),CreateDate,120) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY CONVERT(Char(19),CreateDate,120)
ORDER BY 2 Desc
this will group your data into "minute" buckets and list them from the most activity to least:
SELECT
LEFT(CONVERT(char(19),CreateDate,120),16) AS CreateDateBucket,COUNT(*) AS CountOf
FROM Message
GROUP BY LEFT(CONVERT(char(19),CreateDate,120),16)
ORDER BY 2 Desc
I'd take those values and calculate whatever statistics are needed from them.