Azure Stream Analytics - long living calculation - azure-stream-analytics

I'm using Azure stream analytics for real-time analytics and I have a basic problem. I have a field which I would like to count the number of messages.
The json is in the following format:
{ categoryId: 100, name: 'hello' }
I would like to see the number of count by category, so I assume that the query in Azure stream analytics should be:
SELECT
categoryId,
count(*) as categoryCount
INTO
categoriesCount
FROM
categoriesInput
GROUP BY
categoryId
The problem is that I have to add TumblingWindow or SlidingWindows to the group by clause. Is there a way to avoid that and have the calculation running indefinitely ? Also I need to make sure the output is written to the SQL server.

How about a sliding window with the length of "1". This way it would act like a pointer and everytime it changes you can do the calculation?
Hope this helps!
Mert

Related

How to get count of active users grouped by version? (from Firebase using BigQuery)

Problem description
I'm trying to get the information of how many active users I have in my app separated by the 2 or 3 latest versions of the app.
I've read some documentations and other stack questions but none of them was solving my problem (and some others had outdated solutions).
Examples of solutions I tried:
https://support.google.com/firebase/answer/9037342?hl=en#zippy=%2Cin-this-article (N-day active users - This solution is probably the best, but even changing the dataset name correctly and removing the _TABLE_SUFFIX conditions it kept returning me a single column n_day_active_users_count = 0 )
https://gist.github.com/sbrissenden/cab9bd3a043f1879ded605cba5005457
(this is not returning any values for me, didn't understand why)
How can I get count of active Users from google analytics (this is not a good fit because the other part of my job is already done and generating charts on Data Studio, so using REST API would be harder to join my two solutions - one from BigQuery and other from REST API)
Discrepancies on "active users metric" between Firebase Analytics dashboard and BigQuery export (this one uses outdated variables)
So, I started to write the solution out of my head, and this is what I get so far:
SELECT
user_pseudo_id,
app_info.version,
ROUND(COUNT(DISTINCT user_pseudo_id) OVER (PARTITION BY app_info.version) / SUM(COUNT(DISTINCT user_pseudo_id)) OVER (), 3) AS adoption
FROM `projet-table.events_*`
WHERE platform = 'ANDROID'
GROUP BY app_info.version, user_pseudo_id
ORDER BY app_info.version
Conclusions
I'm not sure if my logic is correct, but I think I can use user_pseudo_id to calculate it, right? The general idea is: user_of_X_version/users_of_all_versions.
(And the results are kinda close to the ones showing at Google Analytics web platform - I believe the difference is due to the date that I turned on the BigQuery integration. But.... I'd like some confirmation on that: if my logic is correct).
The biggest problem in my code now is that I cannot write it without grouping by user_pseudo_id (Because when I don't BigQuery says: "SELECT list expression references column
user_pseudo_id which is neither grouped nor aggregated at [2:3]") and that's why I have duplicated rows in the query result
Also, about the first link of examples... Is there any possibility of a record with engagement_time_msec param with value < 0? If not, why is that condition in the where clause?

Azure Stream Analytics Query - get last request for all device

I have IoT Hub, it collects messages from many devices. Data from IoT Hub are sent to stream analytics, and now I would like to stream analytics, display a list of all devices along with the last request. That is, a table in which there are eg 10 devices and with each device its last request.
My actual code:
SELECT
deviceId,
param1 as humidity,
param2 as temperature,
datetime as data
FROM hubMessage
GROUP By deviceId, data,temperature,humidity,
TumblingWindow(minute,5)
Ont this query i have error on deviceId:
GROUP BY with no aggregate expressions is not supported.
I dont'h have any idea, how resolve my problem with not supported expression and change last request for all device ;/
GROUP BY with no aggregate expressions is not supported.
According to this error message,you need to use group by with aggregate expression.All aggregate expressions supported by ASA are listed here.
If you want to get the latest request,i think the TopOne is suitable for you.
If you want to select all the data match the filter,maybe you could use COUNT with group by.

Hybrid Query Example in AgensGraph

I am using agensgraph but I dont know how to write a hybrid query, any examples of hybrid query in agensgraph would help a lot.
In AgensGraph you can write hybrid queries in two ways:
Let's say you are creating the followings:
CREATE GRAPH AG;
CREATE VLABEL dev;
CREATE (:dev {name: 'someone', year: 2015});
CREATE (:dev {name: 'somebody', year: 2016});
CREATE TABLE history (year, event)
AS VALUES (1996, 'PostgreSQL'), (2016, 'AgensGraph');
1- Cypher in SQL
Syntax:
SELECT [column_name]
FROM ({table_name|SQL-query|CYPHERquery})
WHERE [column_name operator value];
Example:
SELECT n->>'name' as name
FROM history, (MATCH (n:dev) RETURN n) as dev
WHERE history.year > (n->>'year')::int;
Result:
name ----
someone
(1 row)
2- SQL in Cypher
Syntax:
MATCH [table_name]
WHERE (column_name operator {value|SQLquery|CYPHERquery})
RETURN [column_name];
Example:
MATCH (n:dev)
WHERE n.year < (SELECT year FROM history WHERE event =
'AgensGraph')
RETURN properties(n) AS n;
Result:
n ----
{"name": "someone", "year": 2015}
(1 row)
You can find more information here
I found more info on the hybrid query language in these slides. Every other bit of information I have been able to find is just the same example that Eya posted, in different places.
I agree that more information about the hybrid queries in AgensGraph would be great, as it seems like a killer feature of software.
Let’s assume that we have a network management system and we are keeping our network topology in graph part of the AgensGraph (Graph Format) and our time-series data (such as date&time information regarding specific devices) in the relational part of the AgensGraph (Table Format). So, in this case, we know that we have a graph, tables and if we want, we can write a hybrid query to fetch data from both models.
In our graph, we have different devices that are connected to each other such as a modem, IoT sensors, etc. for each of these devices, we also have some information respectively stored in tables - related to those devices such as download speed, the upload speed or CPU usage.
In the following hybrid queries, our goal is to collect the information regarding specific devices by querying both from the graph and the tables simultaneously.
Cypher in SQL
In this hybrid query, we are looking to find modem devices which are having issues and their abnormality type is 2 (2 indicates that this device is having some issues regarding its download and upload speed) and after we find those devices, our goal is to return their id, download, and upload speed to investigate the issue. As you can see in the following query our inner query is Cypher and our outer query is SQL.
SELECT id,sysdnbps, sysupbps
from public.modemrdb where to_jsonb(id) in
(SELECT id FROM (MATCH(m:modem) where
m.abnormaltype=2
return m.name)
AS s(id));
SQL in Cypher
In this hybrid query, we are looking to find modem devices which their CPU usages are more than 80 (not in range of threshold) which indicate there is an issue with these devices and after we find those devices, our goal is to return that modems and any IoT devices that are connected to them. As you can see in the following example our inner query is SQL and our outer query is Cypher.
MATCH p=(n:modem)-[r*1..2]->(iot)
WHERE n.name in
(SELECT to_jsonb(id)
FROM public.modemrdb
WHERE syscpuusage >= 80)
RETURN p;
This can be another example of a hybrid query.

Export Bigquery Logs

I want to analyze the activity on BigQuery during the past month.
I went to the cloud console and the (very inconvenient) log viewer. I set up exports to Big-query, and now I can run queries on the logs and analyze the activity. There is even very convenient guide here: https://cloud.google.com/bigquery/audit-logs.
However, all this helps to look at data collected from now on. I need to analyze past month.
Is there a way to export existing logs (rather than new) to Bigquery (or to flat file and later load them to BQ)?
Thanks
While you cannot "backstream" the BigQuery's logs of the past, there is something you can still do, depending on what kind of information you're looking for. If you need information about query jobs (jobs stats, config etc), you can call Jobs: list method of BigQuery API to list all jobs in your project. The data is preserved there for 6 months and if you're project owner, you can list the jobs of all users, regardless who actually ran it.
If you don't want to code anything, you can even use API Explorer to call the method and save the output as json file and then load it back into BigQuery's table.
Sample code to list jobs with BigQuery API. It requires some modification but it should be fairly easy to get it done.
You can use Jobs: list API to collect job info and upload it to GBQ
Since it is in GBQ - you can analyze it any way you want using power of BigQuery
You can either flatten result or use original - i recommend using original as it is less headache as no any transformation before loading to GBQ (you just literally upload whatever you got from API). Of course all this in simple app/script that you still have to write
Note: make sure you use full value for projection parameter
I was facing the same problem when I found a article which describes how to inspect Big Query using INFORMATION_SCHEMA without any script nor Jobs: list as mentioned by other OPs.
I was able to run and got this working.
# Monitor Query costs in BigQuery; standard-sql; 2020-06-21
# #see http://www.pascallandau.com/bigquery-snippets/monitor-query-costs/
DECLARE timezone STRING DEFAULT "Europe/Berlin";
DECLARE gb_divisor INT64 DEFAULT 1024*1024*1024;
DECLARE tb_divisor INT64 DEFAULT gb_divisor*1024;
DECLARE cost_per_tb_in_dollar INT64 DEFAULT 5;
DECLARE cost_factor FLOAT64 DEFAULT cost_per_tb_in_dollar / tb_divisor;
SELECT
DATE(creation_time, timezone) creation_date,
FORMAT_TIMESTAMP("%F %H:%I:%S", creation_time, timezone) as query_time,
job_id,
ROUND(total_bytes_processed / gb_divisor,2) as bytes_processed_in_gb,
IF(cache_hit != true, ROUND(total_bytes_processed * cost_factor,4), 0) as cost_in_dollar,
project_id,
user_email,
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE
DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) and CURRENT_DATE()
ORDER BY
bytes_processed_in_gb DESC
Credits: https://www.pascallandau.com/bigquery-snippets/monitor-query-costs/

How can I trigger an email or other notification based on a BigQuery query?

I would like to receive a notification, ideally via email, when some threshold is met in Google BigQuery. For example, if the query is:
SELECT name, count(id) FROM terrible_things
WHERE date(terrible_thing) < -1d
Then I would want to get an alert when there were greater than 0 results, and I would want that alert to contain the name of each object and how many there were.
BigQuery does not provide the kinds of services you'd need to build this without involving other technologies. However, you should be able to use something like appengine (which does have a task scheduling mechanism) to periodically issue your monitoring query probe, check the results of the job, and alert if there are nonzero rows in the results. Alternately, you could do this locally using some scripting and leveraging the BQ command line tool.
You could also refine things by using BQ's table decorators to only scan the data that's arrived since you last ran your monitoring query, if you retain knowledge of the last probe's execution in the calling system.
In short: Something else needs to issue the queries and react based on the outcome, but BQ can certainly evaluate the data.