How can I improve KQL query for large dataset for heatmap - kql

I have a KQL query below which will provide a real nice heatmap to map out top access by country for Azure WAF.
The challenge here is that this query cannot go beyond 24 hours because the number of records I have is way too big. How can I improve this to display weekly and even monthly stats?
// source: https://datahub.io/core/geoip2-ipv4
set notruncation;
let CountryDB=externaldata(Network:string, geoname_id:string, continent_code:string, continent_name:string, country_iso_code:string, country_name:string)
[#"https://datahub.io/core/geoip2-ipv4/r/geoip2-ipv4.csv"]
| extend Dummy=1;
let AppGWAccess = AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS"
| where Category == "ApplicationGatewayAccessLog"
| where userAgent_s !in ("bot")
| project TimeGenerated, clientIP_s;
AppGWAccess
| extend Dummy=1
| summarize count() by Hour=bin(TimeGenerated, 6h), clientIP_s, Dummy
| partition by Hour (
    lookup (CountryDB | extend Dummy=1) on Dummy
    | where ipv4_is_match(clientIP_s, Network)
)
| summarize sum(count_) by country_name

What you're doing is creating 6-hour aggregations over all of the raw data every time the query runs. Instead, you should create a Materialized View that will do the aggregation in the background for you.
Quoting the documentation:
Materialized views expose an aggregation query over a source table. Materialized views always return an up-to-date result of the aggregation query (always fresh). Querying a materialized view is more performant than running the aggregation directly over the source table, which is performed each query.
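For example, a minimal sketch of such a view, assuming the WAF logs land in an Azure Data Explorer (Kusto) database where the materialized views feature is available; the view name WafAccessByIp is illustrative, not from the original query:
.create materialized-view WafAccessByIp on table AzureDiagnostics
{
    AzureDiagnostics
    | where ResourceType == "APPLICATIONGATEWAYS"
    | where Category == "ApplicationGatewayAccessLog"
    // pre-aggregate request counts per client IP into the same 6-hour bins used above
    | summarize Requests = count() by bin(TimeGenerated, 6h), clientIP_s
}
The externaldata country lookup cannot be part of the view itself, but it can run at query time over the much smaller aggregated output, e.g. materialized_view("WafAccessByIp") | where TimeGenerated > ago(30d) | ..., which is what makes weekly and monthly heatmaps feasible.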

Related

Display most recent data using kusto query

I have the following kusto query:
customEvents
| where name == "Tracker"
| project Id = tostring(customDimensions["Id"]),
          Rank = tostring(customDimensions["Rank"])
which gives a result in which the same Id is repeated multiple times. Is there a way to display ONLY the most recent data for each Id? How do I update the above Kusto query?
You can use the arg_max() aggregation function.
For example:
customEvents
| where name == "Tracker"
| summarize arg_max(timestamp, *) by Id
If this is a common use case, you can consider defining a materialized view that performs a similar aggregation, and then query the view instead of the table with the raw data.
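A minimal sketch of such a view, assuming the data sits in a Kusto (Azure Data Explorer) database where materialized views are available; the view name TrackerLatest and the extend step are illustrative:
.create materialized-view TrackerLatest on table customEvents
{
    customEvents
    | where name == "Tracker"
    | extend Id = tostring(customDimensions["Id"])
    // keep only the most recent record per Id
    | summarize arg_max(timestamp, *) by Id
}
Querying materialized_view("TrackerLatest") then returns one (latest) row per Id without re-aggregating the raw data.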

Query a dynamic deduplicated table

I am using BigQuery to give my colleagues access to aggregated data in our system.
I have a raw_orders table where I store orders data. The thing is that the lines in this table are subject to change across time. When a change occurs, I add a new line in this table. So my table looks like this:
+-----+-------+---------------------+---------------------+
| id | total | created_at | updated_at |
+-----+-------+---------------------+---------------------+
| ABC | 15.76 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| ABC | 12.43 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
| DEF | 19.03 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| DEF | 12.03 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
+-----+-------+---------------------+---------------------+
To allow my collaborators to query on a deduplicated table easily, I created a view of deduplicated lines using:
CREATE OR REPLACE VIEW xxx.orders AS
SELECT ro2.*
FROM (
  SELECT ro.id, MAX(ro.updated_at) AS max_updated_at
  FROM xxx.raw_orders ro
  GROUP BY ro.id
) tmp
INNER JOIN xxx.raw_orders ro2
  ON ro2.id = tmp.id AND ro2.updated_at = tmp.max_updated_at
ORDER BY ro2.created_at DESC
This works great, but I feel that I am spending too much budget on simple requests like:
SELECT * FROM rubee.orders WHERE created_at > '2020-11-01 00:00:00';
If I understand correctly, because of the view step, BigQuery must scan and deduplicate the whole table before returning even a single result.
Am I doing something wrong here? How do you give access to deduplicated data without spending too much? Would you have a better strategy for what I am trying to do?
Ideally, you would use a materialized view for this purpose, but right now BigQuery has limited support for materialized views. You cannot create a materialized view to replace the view you were using.
It is possible to create a materialized view for the inner query, which may make the whole query less expensive, but please read on.
Cost. There is no simple answer to whether you are "spending too much budget" on the query.
If you're on the pay-per-query plan and charged by "processed bytes", then although the query is more expensive for BigQuery to process, you're charged no more than scanning the whole table once (although technically the table was scanned more than once). In other words, deduplication is free. However, if your query pattern would allow you to cluster/partition the table so that the whole table is not scanned, then this "self-join" view does prevent you from saving budget.
If you have a reservation on slots, then you will benefit from making the query faster.
Suggestions. Given that the situation differs case by case, the general suggestions are:
If possible, separate the data into "archived" and "active" so that the "archived" data stays deduplicated (and partitioned/clustered to allow efficient search), and you only need a view to dedup the "active" data.
Creating a materialized view (on the inner "GROUP BY" query) may speed up the query a bit, but not necessarily make it "cheaper"; you may be charged for the size of the base table plus the materialized view.
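A minimal sketch of that approach, reusing the xxx dataset names from the question; the materialized view name is illustrative, and whether it actually saves money depends on the caveats above:
CREATE MATERIALIZED VIEW xxx.orders_latest_update AS
SELECT id, MAX(updated_at) AS max_updated_at
FROM xxx.raw_orders
GROUP BY id;

-- the deduplicating view can then join the raw table against the materialized view
CREATE OR REPLACE VIEW xxx.orders AS
SELECT ro.*
FROM xxx.raw_orders ro
INNER JOIN xxx.orders_latest_update m
  ON ro.id = m.id AND ro.updated_at = m.max_updated_at;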

How do you 'join' multiple SQL data sets side by side (that don't link to each other)?

How would I go about joining results from multiple SQL queries so that they are side by side (but unrelated)?
The reason I am thinking of this is so that I can run 1 query in Google Big Query and it will return 1 single table which I can import into Excel and do some charts.
e.g. Query 1 looks at dataset TableA and returns:
**Metric:** Sales
**Value:** 3,402
And then Query 2 looks at dataset TableB and returns:
**Name:** John
**DOB:** 13 March
They would both use different tables and different filters, etc.
What would I do to make it look like:
---Sales----------John----
---3,402-------13 March----
Or alternatively:
-----Sales--------3,402-----
-----John-------13 March----
Or is there a totally different way to do this?
I can see the use case for the above; I've used something similar to create a single table from multiple tables with different metrics to query in Data Studio, so that filters apply to all data in the dataset. However, in that case the data did share some dimensions that made it worthwhile.
If you are going to put those together with no relationship between the tables, I'd have four columns, with a TYPE column describing the data in that row, to make for easier filtering.
Type | Sales | Name | DOB
Use UNION ALL to put the rows together so you have something like
"Sales" | 3402 | null | null
"Customer Details" | null | John | 13 March
However, like the others said, make sure you have a good reason to do this; otherwise you're just creating a bigger table to query for no reason.
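A minimal sketch of that shape in BigQuery standard SQL; the table and column names (TableA, sale_amount, TableB, name, dob) are illustrative, not from the question:
SELECT
  'Sales' AS Type,
  CAST(SUM(sale_amount) AS STRING) AS Sales,
  CAST(NULL AS STRING) AS Name,
  CAST(NULL AS STRING) AS DOB
FROM TableA
UNION ALL
SELECT
  'Customer Details',
  CAST(NULL AS STRING),
  name,
  CAST(dob AS STRING)
FROM TableB;
Everything is cast to STRING so the two otherwise unrelated result sets are union-compatible; in Excel you can then pivot or filter on the Type column.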

django database design when you will have too many rows

I have a django web app with postgres db; the general operation is that every day I have an array of values that need to be stored in one of the tables.
There is no foreseeable need to query the values of the array but need to be able to plot the values for a specific day.
The problem is that this array is pretty big: if I were to store it in the DB one value per row, I'd have 60 million rows per year, but if I store each day's array as a blob object, I'd have 60 thousand rows per year.
Is it a good decision to use a blob object to reduce table size when you do not want to query within the row of values?
Here are the two options:
option1: keeping all
group(foreignkey)| parent(foreignkey) | pos(int) | length(int)
A | B | 232 | 45
A | B | 233 | 45
A | B | 234 | 45
A | B | 233 | 46
...
option2: collapsing the array into a blob:
group(fk)| parent(fk) | mean_len(float)| values(blob)
A | B | 45 |[(pos=232, len=45),...]
...
so I do NOT want to query pos or length but I want to query group or parent.
An example of read query that I'm talking about is:
SELECT * FROM "mytable"
LEFT OUTER JOIN "group"
  ON ( "group"."id" = "mytable"."group_id" )
ORDER BY "pos" DESC LIMIT 100
which is a typical django admin list_view page main query.
I tried loading the data and displaying the table in the Django admin page without doing any complex query (just a read query).
When I get past 1.5 million rows, the admin page freezes. All it takes is a simple count query on that table to cause the app to crash, so I should definitely either keep the data as a blob or not keep it in the DB at all and use the filesystem instead.
I want to emphasize that I used Django 1.8 as my test bench, so this is not a Postgres evaluation but rather a system evaluation of the Django admin with Postgres.
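For reference, a minimal sketch of what option 2 could look like as a Django model; the model and field names are illustrative, and the Group/Parent models are assumed to already exist:
import json

from django.db import models


class DailyValues(models.Model):
    # option 2: one row per day, with the per-position values collapsed into a blob
    group = models.ForeignKey("Group", on_delete=models.CASCADE)
    parent = models.ForeignKey("Parent", on_delete=models.CASCADE)
    mean_len = models.FloatField()
    values = models.BinaryField()  # e.g. json.dumps([{"pos": 232, "len": 45}, ...]).encode()

    def decoded_values(self):
        # decode the blob back into a list of dicts for plotting one day's values
        return json.loads(self.values.decode())

The group/parent foreign keys stay queryable (and indexable), while pos/length live only inside the blob, which matches the stated access pattern.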

Vertica and joins

I'm adapting a web analysis tool to use Vertica as the DB. I'm having real problems optimizing joins. I tried creating pre-join projections for some of my queries, and while it did make the queries blazing fast, it slowed data loading into the fact table to a crawl.
A simple INSERT INTO ... SELECT * FROM which we use to load data into the fact table from a staging table goes from taking ~5 seconds to taking 20+ minutes.
Because of this I dropped all pre-join projections and tried using the Database Designer to design query specific projections but it's not enough. Even with those projections a simple join is taking ~14 seconds, something that takes ~1 second with a pre-join projection.
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
We're running Vertica on a 5 node cluster, each node having 2 x quad core CPU and 32 GB of memory. The tables in my example query have 188,843,085 and 25,712,878 rows respectively.
The EXPLAIN output looks like this:
EXPLAIN SELECT referer_via_.url as referralPageUrl, COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_ ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd BETWEEN '20121123' AND '20121123'
  AND session.site_id = '49'
GROUP BY referer_via_.url
ORDER BY visits DESC LIMIT 250;
Access Path:
+-SELECT LIMIT 250 [Cost: 1M, Rows: 250 (STALE STATISTICS)] (PATH ID: 0)
| Output Only: 250 tuples
| Execute on: Query Initiator
| +---> SORT [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| | Order: count(DISTINCT "session".id) DESC
| | Output Only: 250 tuples
| | Execute on: All Nodes
| | +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 2)
| | | Aggregates: count(DISTINCT "session".id)
| | | Group By: referer_via_.url
| | | Execute on: All Nodes
| | | +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | | Group By: referer_via_.url, "session".id
| | | | Execute on: All Nodes
| | | | +---> JOIN HASH [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 4) Outer (RESEGMENT)
| | | | | Join Cond: ("session".referer_id = referer_via_.id)
| | | | | Execute on: All Nodes
| | | | | +-- Outer -> STORAGE ACCESS for session [Cost: 463, Rows: 1 (STALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5)
| | | | | | Projection: public.owa_session_projection
| | | | | | Materialize: "session".id, "session".referer_id
| | | | | | Filter: ("session".site_id = '49')
| | | | | | Filter: (("session".yyyymmdd >= 20121123) AND ("session".yyyymmdd <= 20121123))
| | | | | | Execute on: All Nodes
| | | | | +-- Inner -> STORAGE ACCESS for referer_via_ [Cost: 293K, Rows: 26M] (PATH ID: 6)
| | | | | | Projection: public.owa_referer_DBD_1_seg_Potency_20121122_Potency_20121122
| | | | | | Materialize: referer_via_.id, referer_via_.url
| | | | | | Execute on: All Nodes
To speed up the join:
Design the session table as partitioned on column "yyyymmdd". This will enable partition pruning.
Add a condition on column "yyyymmdd" to referer_via_ and partition on it, if that is possible (most likely not).
Have column site_id as close as possible to the beginning of the ORDER BY list in the (super)projection used for session.
Have both tables segmented on referer_id and id respectively (see the sketch after this list).
And having more nodes in the cluster does help.
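A minimal sketch of that physical design, using the table and column names from the query; the projection names and column lists are illustrative (the real tables will have more columns):
-- partition the fact table by date so the yyyymmdd filter can prune partitions
ALTER TABLE owa_session PARTITION BY yyyymmdd;

-- session projection: site_id early in the sort order, segmented on the join key
CREATE PROJECTION owa_session_join_p (id, referer_id, site_id, yyyymmdd)
AS SELECT id, referer_id, site_id, yyyymmdd
   FROM owa_session
   ORDER BY site_id, yyyymmdd
   SEGMENTED BY HASH(referer_id) ALL NODES;

-- matching referer projection, segmented on its id
CREATE PROJECTION owa_referer_join_p (id, url)
AS SELECT id, url
   FROM owa_referer
   ORDER BY id
   SEGMENTED BY HASH(id) ALL NODES;

-- then populate the new projections, e.g. SELECT START_REFRESH();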
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
I guess the amount affected would vary depending on the data sets and structures you are working with. But since this is the variable you changed, I believe it is safe to say the pre-join projection is causing the slowness. You are gaining query time at the expense of insertion time.
Someone please correct me if any of the following is wrong; I'm going by memory and by information picked up in conversations with others.
You can speed up your joins without a pre-join projection in a few ways. In this case, I believe segmenting the projections for both tables on the join predicate, the referer ID, would help, as would anything you can do to filter the data.
Looking at your explain plan, you are doing a hash join instead of a merge join, which you probably want to look at.
Lastly, I would like to know, via the explain plan or through the system tables, whether your query is actually using the projections the Database Designer recommended. If not, explicitly specify them in your query and see if that helps.
You seem to have a lot of STALE STATISTICS.
Responding to the stale statistics is important, because that is the reason your queries are slow: without statistics about the underlying data, Vertica's query optimizer cannot choose the best execution plan. Note that refreshing stale statistics only improves SELECT performance, not update performance.
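For example, refreshing statistics on both tables in the join (assuming they live in the public schema, as the projection names in the plan suggest):
-- refresh optimizer statistics on both tables involved in the join
SELECT ANALYZE_STATISTICS('public.owa_session');
SELECT ANALYZE_STATISTICS('public.owa_referer');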
If you update your tables regularly, do remember there are additional things you have to consider in Vertica. Please check the answer that I posted to this question.
I hope that helps improve your update speed.
Explore the AHM settings as explained in that answer. If you don't need to be able to select deleted rows in a table later, it is often a good idea not to keep them around. There are ways to keep only the latest epoch version of the data, or you can manually purge deleted data.
Let me know how it goes.
I think your query could be more explicit. Also, don't use that devil BETWEEN. Try this:
EXPLAIN SELECT
    referer_via_.url as referralPageUrl,
    COUNT(DISTINCT session.id) as visits
FROM owa_session as session
JOIN owa_referer AS referer_via_
    ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd >= '20121123'
    AND session.yyyymmdd <= '20121123'
    AND session.site_id = '49'
GROUP BY referer_via_.url
-- this `visits` column needs a table name
ORDER BY visits DESC LIMIT 250;
I'll also say I'm really perplexed as to why you would use the same date on both sides of BETWEEN; you may want to look into that.
This is my view coming from an academic background working with column databases, including Vertica (I am a recent PhD graduate in database systems).
My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?
Yes, updating projections is very slow, and you should ideally do it only in large batches to amortize the update cost. The fundamental reason is that each projection represents another copy of the data (of each table column that is part of the projection).
A single-row insert requires adding one value (one attribute) to each column in the projection. For example, a single-row insert into a table with 20 attributes requires at least 20 column updates. To make things worse, each column is sorted and compressed. This means that inserting the new value into a column requires multiple operations on large chunks of data: read data / decompress / update / sort / compress data / write data back. Vertica has several optimizations for updates but cannot completely hide the cost.
Projections can be thought of as the equivalent of multi-column indexes in a traditional row store (MySQL, PostgreSQL, Oracle, etc.). The upside of projections versus traditional B-tree indexes is that reading them (using them to answer a query) is much faster than using traditional indexes. The reasons are multiple: no need to access heap data as with non-clustered indexes, smaller size due to compression, etc. The flip side is that they are much more difficult to update. Trade-offs...