Kusto query multiple resources by type, not by name - azure-log-analytics

I have written a simple query that shows exceptions from multiple instances of Application Insights:
app('app_insights_name_1').exceptions
// ... more unions here ...
| union app('app_insights_name_n').exceptions
| where timestamp > ago(24h)
| summarize count() by problemId, appName
| sort by count_ desc
I have also managed to write a query that finds all App Insights instances I would like the 1st query to run against:
resources
| where type in~ ('microsoft.insights/components', 'microsoft.insights')
| where resourceGroup in~ ('dev', 'test')
I don't think it is possible to combine both queries, but I wonder: is it possible to query a union of the exceptions tables (or traces, or any other table) of all instances of a given type in one or more resource groups?
Conceptually I wish for something similar to the query below:
resources
| where type in~ ('microsoft.insights/components', 'microsoft.insights')
| where resourceGroup in~ ('dev', 'test')
| MAGIC_HERE : get union of all exception tables from above
| where timestamp > ago(24h)
| summarize count() by problemId, appName
| sort by count_ desc

As suggested in the tech community blog below, you can integrate Log Analytics and Azure Resource Graph using workbooks.
The app expression is used in an Azure Monitor query to retrieve data from a specific Application Insights app in the same resource group, another resource group, or another subscription. This is useful for including application data in an Azure Monitor log query and for querying data across multiple applications in an Application Insights query.
Resources is one of the tables in the Azure Resource Graph Explorer service, which is designed to extend Azure Resource Management by providing efficient and performant resource exploration, with the ability to query at scale across a given set of subscriptions so that you can effectively govern your environment. These queries provide the following features:
Ability to query resources with complex filtering, grouping, and sorting by resource properties.
Ability to iteratively explore resources based on governance requirements.
Ability to assess the impact of applying policies in a vast cloud environment.
Ability to detail changes made to resource properties (preview).
Here is the reference GitHub repository for creating Application Insights workbooks.
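Outside of workbooks, the same two-step idea can be scripted: run the Resource Graph query first, then build the union of app(...).exceptions from its results. Below is a minimal Python sketch assuming the azure-identity, azure-mgmt-resourcegraph and azure-monitor-query packages; the subscription id, workspace id and resource-group names are placeholders, and the exact SDK surface may differ slightly between versions.

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.resourcegraph import ResourceGraphClient
from azure.mgmt.resourcegraph.models import QueryRequest, QueryRequestOptions
from azure.monitor.query import LogsQueryClient

credential = DefaultAzureCredential()
subscription_id = "<subscription-id>"          # placeholder
workspace_id = "<log-analytics-workspace-id>"  # placeholder

# Step 1: Azure Resource Graph - find the Application Insights instances.
arg_client = ResourceGraphClient(credential)
arg_query = """
resources
| where type in~ ('microsoft.insights/components')
| where resourceGroup in~ ('dev', 'test')
| project name
"""
arg_result = arg_client.resources(
    QueryRequest(
        subscriptions=[subscription_id],
        query=arg_query,
        options=QueryRequestOptions(result_format="objectArray"),  # rows as dicts
    )
)
app_names = [row["name"] for row in arg_result.data]

# Step 2: build the union of app(...).exceptions dynamically and run it.
union_part = " | union ".join(f"app('{name}').exceptions" for name in app_names)
kql = f"""
{union_part}
| summarize count() by problemId, appName
| sort by count_ desc
"""
logs_client = LogsQueryClient(credential)
response = logs_client.query_workspace(workspace_id, kql, timespan=timedelta(hours=24))
for table in response.tables:
    for row in table.rows:
        print(row)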

Related

Process Mining algorithm

If I have windows usage data table like
StartTime | EndTime | Window | Value
that records a history of window usage - how can we mine this data for repetitive patterns, e.g. wnd1 -> wnd2 -> wnd3 (sets of records that occur consistently together; the records making up different patterns may vary...)?
What algorithm is better to use for this? Are there any implementations for Excel, Python and Delphi?
It seems that your data is not suitable for process mining. In process mining we need a mandatory field, the case ID. Without this information it is almost impossible to benefit from most process mining techniques.
It would be great if you could provide a case ID, or use sequence mining techniques instead.
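For the sequence-mining direction, here is a minimal Python sketch that simply counts how often each run of k consecutive windows occurs, assuming the rows can be ordered by StartTime; the data and column order follow the table in the question and are purely illustrative.

from collections import Counter

rows = [
    # (StartTime, EndTime, Window, Value) - illustrative data only
    ("09:00", "09:05", "wnd1", 1),
    ("09:05", "09:08", "wnd2", 1),
    ("09:08", "09:15", "wnd3", 1),
    ("10:00", "10:02", "wnd1", 1),
    ("10:02", "10:06", "wnd2", 1),
    ("10:06", "10:09", "wnd3", 1),
]

# Order by StartTime and keep only the window names.
windows = [w for _, _, w, _ in sorted(rows, key=lambda r: r[0])]

def ngram_counts(seq, k):
    """Count every run of k consecutive windows."""
    return Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))

for pattern, count in ngram_counts(windows, 3).most_common(5):
    print(" -> ".join(pattern), count)  # e.g. wnd1 -> wnd2 -> wnd3  2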

How to do product catalog building with MySQL and tensorflow?

I am trying to build a product catalog with data from several e-commerce websites. The goal is to build a catalog where every product is specified as completely as possible, leveraging data across multiple sources.
This seems to be a highly complex task, as there is sometimes incorrect information, and in some cases the unique identifier is misspelled or not even present.
The current approach is to transform the extracted data into our format and then load it into a MySQL database. In this process obvious duplicates get removed, and I end up with about 250,000 records.
Now I am facing the problem of how to break this down even further, as there are thousands of duplicates left, but I cannot say which, since some of the info might not be accurate.
e.g.
ref_id | title | img | color_id | size | length | diameter | dial_id
Any one of these records might be incomplete or might even contain wrong values.
Looking more into the topic, this seems to be a common use case for deep learning with e.g. TensorFlow.
I am looking for an answer that will help me design this process. Is TensorFlow the right tool? Should I write all records to the DB and keep them? What could the process look like, etc.?
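Independent of TensorFlow, a useful first step is classical record linkage: compare candidate pairs and score them with a fuzzy similarity. The Python sketch below illustrates that idea only; the field names mirror the example row above, and the sample data, weights and threshold are arbitrary assumptions.

from difflib import SequenceMatcher
from itertools import combinations

records = [
    {"ref_id": "A-100", "title": "Diver Chrono 42mm Black", "color_id": 3, "diameter": 42},
    {"ref_id": "A100",  "title": "Diver Chrono 42 mm black", "color_id": 3, "diameter": 42},
    {"ref_id": "B-550", "title": "Dress Watch 36mm Silver", "color_id": 7, "diameter": 36},
]

def similarity(a, b):
    """Blend title similarity with agreement on structured fields."""
    title_score = SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()
    field_score = sum(a[f] == b[f] for f in ("color_id", "diameter")) / 2
    return 0.7 * title_score + 0.3 * field_score

# In practice you would block on cheap keys (e.g. color_id) before comparing all pairs.
for left, right in combinations(records, 2):
    score = similarity(left, right)
    if score > 0.85:  # arbitrary threshold for "probable duplicate"
        print(left["ref_id"], "~", right["ref_id"], round(score, 2))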

How to tell there is new data available in ga_sessions_intraday_ efficiently

Google Analytics data should be exported to BigQuery 3 times a day, according to the docs. I am trying to determine an efficient way to detect that new data is available in the ga_sessions_intraday_ table and then run a query in BQ to extract the new data.
My best idea is to poll ga_sessions_intraday_ by running a SQL query every hour. I would track the max visitStartTime (storing the state somewhere), and if a new max visitStartTime shows up in ga_sessions_intraday_, I would run my full queries.
The problem with this approach is that I need to store state about the max visitStartTime. I would prefer something simpler.
Does GA Big Query have a better way of telling that new data is available in ga_sessions_intraday_? Some kind of event that fires? Do I use the last modified date of the table (but I need to keep track of the time window to run against)?
Thanks in advance for your help,
Kevin
Last modified time on the table is probably the best approach here (and cheaper than issuing a probe query). I don't believe there is any other signalling mechanism for delivery of the data.
If your full queries run more quickly than your polling interval, you could probably just use the modified time of your derived tables to hold the data (and update when your output tables are older than your input tables).
Metadata queries are free, so you can even embed most of the logic in a query:
SELECT
  (
    SELECT MAX(last_modified_time)
    FROM `YOUR_INPUT_DATASET.__TABLES__`
  ) > (
    SELECT MAX(last_modified_time)
    FROM `YOUR_OUTPUT_DATASET.__TABLES__`
  ) AS need_update
If you have a mix of tables in your output dataset, you can be more selective (with a WHERE clause) to filter down the tables you examine.
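As a concrete illustration of that filtering, here is a small Python sketch using the google-cloud-bigquery client; the dataset names and table prefixes are placeholders, not anything prescribed by the answer above.

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  (SELECT MAX(last_modified_time)
   FROM `YOUR_INPUT_DATASET.__TABLES__`
   WHERE table_id LIKE 'ga_sessions_intraday_%')
  >
  (SELECT MAX(last_modified_time)
   FROM `YOUR_OUTPUT_DATASET.__TABLES__`
   WHERE table_id LIKE 'daily_report%') AS need_update
"""

# Metadata-only query: last_modified_time is milliseconds since the epoch.
need_update = next(iter(client.query(sql).result())).need_update
if need_update:
    print("input data is newer than the output tables - run the full queries")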
If you need a convenient place to run this scheduling logic (that isn't a developer's workstation), you might consider one of my previous answers. (Short version: Apps Script is pretty neat)
You might also consider filing a feature request for "materialized views" or "scheduled queries" on BigQuery's public issue tracker. I didn't see an existing entry for this with a quick skim, but I've certainly heard similar requests in the past.
I'm not sure how the Google Analytics team handles feature requests, but having a pubsub notification upon delivery of a new batch of Analytics data seems like it could be useful as well.

Bigquery caching when hitting table would provide a different result?

As part of our BigQuery solution we have a cron job which checks the latest table created in a dataset and will create more if this table is out of date. This check is done with the following query:
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting BigQuery's internal cache, even though running the query against the underlying table would provide a different result. This caching also occurs if I run the query in the web interface from the Google Cloud Console.
If I specify that the query should not use the cache via the
queryRequest.setUseQueryCache(false)
flag in the code then the tests pass correctly.
My understanding was that BigQuery's automatic caching would not occur if running the query against the underlying table would provide a different result. Am I incorrect in this assumption (and if so, when does caching occur), or is this a bug?
Well, the answer to your question is: your assumption is conceptually wrong. You always need to set the no-cache parameter if you want non-cached data. Even in the web UI there is an option you need to set. The default is to use the cached version.
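For illustration, this is how the same no-cache setting looks in the Python BigQuery client (the question uses the Java client's setUseQueryCache(false)); the dataset name is a placeholder.

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(use_query_cache=False)  # bypass cached results

sql = """
SELECT table_id
FROM `your_dataset.__TABLES__`
WHERE table_id LIKE 'table_root%'
ORDER BY creation_time DESC
LIMIT 1
"""
rows = list(client.query(sql, job_config=job_config).result())
print(rows[0].table_id if rows else "no tables yet")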
But, fundamentally you need to change the process and use the recent features:
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need to have a cron, as it will automatically create the missing tables.
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables
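Here is a hedged sketch of the template-table pattern with the Python client: rows are streamed against a pre-created template table and BigQuery materializes the suffixed table from its schema. The project, dataset, table and field names are placeholders.

from google.cloud import bigquery

client = bigquery.Client()
template = "your_project.your_dataset.events_template"  # pre-created with the schema

errors = client.insert_rows_json(
    template,
    [{"user_id": "42", "event": "click"}],  # rows matching the template schema
    template_suffix="_20240101",            # BigQuery creates events_template_20240101
)
if errors:
    print("insert errors:", errors)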

Feed aggregator using hbase. How to design the schema?

I am working on a project that involves monitoring a large number of RSS/Atom feeds. I want to use HBase for data storage and I have some problems designing the schema. For the first iteration I want to be able to generate an aggregated feed (the last 100 posts from all feeds in reverse chronological order).
Currently I am using two tables:
Feeds: column families Content and Meta : raw feed stored in Content:raw
Urls: column families Content and Meta : raw post version store in Content:raw and the rest of the data found in RSS stored in Meta
I need some sort of index table for the aggregated feed. How should I build that? Is HBase a good choice for this kind of application?
Question update: Is it possible (in HBase) to design a schema that could efficiently answer queries like the one listed below?
SELECT data FROM Urls ORDER BY date DESC LIMIT 100
Peter Rietzler's answer on the hbase-user mailing list:
Hi,
In our project we are handling event lists where we have similar requirements. We do ordering by choosing our row keys wisely. We use the following key for our events (they should be ordered by time in ascending order):
eventListName/yyyyMMddHHmmssSSS-000[-111]
where eventListName is the name of the event list, 000 is a three-digit instance id to disambiguate between different running instances of the application, and -111 is optional to disambiguate events that occurred in the same millisecond on one instance.
We additionally insert an artificial row for each day with the id
eventListName/yyyyMMddHHmmssSSS
This allows us to start scanning at the beginning of each day without searching through the event list.
You need to be aware of the fact that if you have a very high load of inserts, then one HBase region server is always busy inserting while the others are idle... if that's a problem for you, you have to find different keys for your purpose.
You could also use an HBase index table, but I have no experience with it, and I remember an email on the mailing list saying that this would double all requests because the API would first look up the index table and then the original table??? (please correct me if this is not right...)
Kind regards, Peter
Thanks Peter.
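To make the row-key idea from Peter's answer concrete, here is a small Python sketch using happybase; the host, table and column names are placeholders, and the 100-post aggregation is just one way to read the time-ordered keys back newest-first.

from collections import deque
from datetime import datetime, timezone

import happybase

connection = happybase.Connection("hbase-host")   # placeholder host
table = connection.table("Urls")                  # Content / Meta column families

def row_key(feed_list, posted_at, instance_id="000"):
    """feedListName/yyyyMMddHHmmssSSS-000, ordered by time as in the answer."""
    stamp = posted_at.strftime("%Y%m%d%H%M%S") + f"{posted_at.microsecond // 1000:03d}"
    return f"{feed_list}/{stamp}-{instance_id}".encode()

# Store a post under its time-ordered key.
now = datetime.now(timezone.utc)
table.put(row_key("all-feeds", now), {b"Content:raw": b"<entry>...</entry>"})

# Aggregated feed: scan the time-ordered keys, keep the newest 100 and reverse
# them for a newest-first listing (SELECT ... ORDER BY date DESC LIMIT 100).
latest = deque(maxlen=100)
for key, data in table.scan(row_prefix=b"all-feeds/"):
    latest.append((key, data.get(b"Content:raw", b"")))
for key, raw in reversed(latest):
    print(key, raw[:40])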