Log analytics query optimization - azure-log-analytics

I run this query on log analytics
Perf
| where TimeGenerated >= ago(5m)
| join kind = inner
(
Heartbeat
| where TimeGenerated >= ago(5m)
| summarize arg_max(TimeGenerated, *)
by SourceComputerId
) on Computer
| summarize arg_max(TimeGenerated, *) by SourceComputerId, CounterName
| extend Indicator = strcat(ObjectName,'-', CounterName)
| summarize dict = make_dictionary
(
pack
(
'WorkspaceId'
, TenantId
, Indicator
, CounterValue
, 'TimeGenerated'
, TimeGenerated
, 'Computer'
, Computer
)
) by SourceComputerId
| evaluate bag_unpack(dict)
But it's a little bit slow. Is there any way to optimize it? I want the fastest possible query that achieves the same results.

It's somewhat challenging to say without knowing the size of the data (e.g. record count) on each of the join legs and the cardinality of the SourceComputerId column.
I would recommend that you go over the query best practices document, which covers several techniques for optimization, and see if that helps.
Update: explicitly mentioning the best practices that may be helpful in your case (for you to verify):
When using the join operator, choose the table with fewer rows as the first one (left-most).
When the left side is small (up to 100,000 records) and the right side is big, it is recommended to use hint.strategy=broadcast.
When both sides of the join are too big and the join key has high cardinality, it is recommended to use hint.strategy=shuffle.
When the group-by keys of the summarize operator have high cardinality (best practice: above 1 million), it is recommended to use hint.strategy=shuffle.
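A sketch only, applying the first two points to your query under the assumption that the summarized Heartbeat leg (one row per computer) is the smaller side - verify against your actual volumes. Swapping the join sides changes which side's duplicate column names get the "1" suffix, so the Heartbeat leg is projected down to the join columns first to keep the downstream references intact:
Heartbeat
| where TimeGenerated >= ago(5m)
| summarize arg_max(TimeGenerated, *) by SourceComputerId
| project SourceComputerId, Computer // keep only what the join needs from this leg
| join kind = inner hint.strategy = broadcast
(
    Perf
    | where TimeGenerated >= ago(5m)
) on Computer
| summarize arg_max(TimeGenerated, *) by SourceComputerId, CounterName
| extend Indicator = strcat(ObjectName, '-', CounterName)
// ... the remaining make_dictionary / bag_unpack steps are unchanged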

Related

Google Bigquery use of substr, never returns back results

I have a table which holds two sets of data; one set has information like:
Type | Name | Id
PackagedDrug | Pseudoephedrine HCl Oral Tablet 120 MG | 110
PackagedDrug | Pseudoephedrine HCl Oral Tablet 60 MG | 111
DrugName | Pseudoephedrine HCl | 112
What I want to do is join the PackagedDrug concepts with the DrugName concepts, i.e. get all Ids of Type PackagedDrug whose Name matches the Name of a Type DrugName row. If I hardcode the Name for DrugName in the following query, it runs instantaneously, but if I take out the hardcoding it just keeps on running. Could you please suggest suitable ways to speed up this BigQuery query?
SELECT a.MSC_ID MSC_id, a.MSC_CONcept_type, a.concept_id, a.concept_name , b.concept_name
from
(select MSC_id, MSC_CONcept_type, concept_id, concept_name
FROM [ClientAlerts.MSC_Concepts]
where MSC_CONcept_type in ('MediSpan.Concepts.PackagedDrug') ) a
CROSS JOIN
(select MSC_CONcept_type, concept_id, concept_name , length(concept_name) len
FROM [ClientAlerts.MSC_Concepts]
where MSC_CONcept_type in ('MediSpan.Concepts.NamebasedClassification.DrugName')
-- and concept_name in ('Pseudoephedrine HCl')
) b
where substr(a.concept_name,1,b.len)+' ' = b.concept_name
Thanks,
Savita
This has nothing to do with BigQuery itself. When you hardcode the value, the rows are filtered much faster, because the engine only has to look for the hardcoded value rather than check every row.
If you don't use the hardcoded value, it has to look at WAY more rows, comparing ALL the rows from your first subquery with all the rows from your second. Honestly, given the use case as you describe it, I can't think of any way to do this faster.
But one question does come to mind: why do you have a "type" column? It seems like it should be two different tables instead.
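For illustration only, a hedged sketch of that two-table idea - the PackagedDrugs and DrugNames table names are hypothetical, and STARTS_WITH/CONCAT assume BigQuery standard SQL rather than the legacy dialect used above. Note it is still a prefix (non-equi) join, so splitting the tables mainly clarifies the model rather than guaranteeing speed:
SELECT
  pd.concept_id   AS packaged_drug_id,
  pd.concept_name AS packaged_drug_name,
  dn.concept_id   AS drug_name_id,
  dn.concept_name AS drug_name
FROM PackagedDrugs AS pd
JOIN DrugNames AS dn
  -- prefix match: the packaged drug name starts with the drug name followed by a space
  ON STARTS_WITH(pd.concept_name, CONCAT(dn.concept_name, ' '));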

Data Driven Restrictions of Plan Selections

I have a complex data structure I am working with and I am not quite sure how to tackle it in a single SQL query, although my gut tells me this should be possible to do.
The essence of what I am doing is trying to display the results of available plans for a given vendor based on the selected hardware model. The results should adhere to only possible combinations, and the plans contain restrictions which are currently stored as key/value pairs in a restrictions table. Below is a simplification of what I am working with:
(I will use a wireless device analogy since almost everyone is familiar with cell phones.)
models Table
model_id
vendor_id
is_data
is_voice
is_4g
is_3g
Sample Data:
model_id,vendor_id,is_data,is_voice,is_4g,is_3g
DeviceA,Sprint,1,1,0,1
DeviceB,Sprint,1,0,1,0
DeviceC,Sprint,0,1,0,0
DeviceD,Sprint,0,1,0,0
DeviceE,Sprint,0,1,0,0
DeviceF,Verizon,1,1,0,1
DeviceG,Verizon,1,0,1,0
DeviceH,Verizon,0,1,0,0
DeviceI,Verizon,0,1,0,0
DeviceJ,Verizon,0,1,0,0
DeviceK,Tmobile,1,1,0,1
DeviceL,Tmobile,1,0,1,0
DeviceM,Tmobile,0,1,0,0
DeviceN,Tmobile,0,1,0,0
DeviceO,Tmobile,0,1,0,0
plans Table
plan_id
vendor_id
name
Sample Data:
plan_id,vendor_id,name
PlanA,Sprint,Big Data Only Plan
PlanB,Verizon,Small Data Only Plan
PlanC,Sprint,300 Min Plan
PlanD,Verizon,900 Min Plan
PlanE,Verizon,Big Data Only Plan
PlanF,Tmobile,Small Data Only Plan
PlanG,Tmobile,300 Min Plan
PlanH,Tmobile,1000 Min Plan
plan_restrictions Table
restriction_id
vendor_id
plan_id
type
value
Sample Data:
restriction_id,vendor_id,plan_id,type,value
1,Sprint,PlanA,radio,3G
2,Sprint,PlanA,device_type,data
3,Verizon,PlanB,radio,4G
4,Sprint,PlanC,radio,3G
5,Sprint,PlanC,device_type,voice
6,Verizon,PlanD,radio,3G
7,Verizon,PlanD,device_type,voice
8,Verizon,PlanE,radio,3G
9,Verizon,PlanE,device_type,voice
10,Tmobile,PlanF,device_type,data
11,Tmobile,PlanG,device_type,voice
12,Tmobile,PlanH,device_type,voice
Restrictions keyed (I have closer to 50 actually, here is a same type of representation):
type / value possibilities
radio / 3g, 4g
device_type / data, voice
I am open to the possibility of restructuring the tables to make it easier to re-query, however I need to retain a certain amount of flexibility since I do have about 1000 models, 1000 plans, and about 2000 restrictions.
I personally think there is some sort of structure issue here, ie. models perhaps should have their elements as key/value pairs in a separate table, but that is even more complexity, and I haven't determined yet how to properly apply data driven restrictions in the first place.
Something like this should get you started:
SELECT p.name
FROM Plans as p
INNER JOIN plan_restrictions as pr
ON p.plan_id = pr.plan_id
INNER JOIN models as m
ON pr.model_id = m.model_id
WHERE p.vendor_id = 1 AND m.is_data = 1 AND m.is_4g = 1 AND ...
I kicked this around for about the last hour with the other DBAs here and think I solved it. I am posting this for anyone who finds themselves in a similar situation. The biggest problem was that I was too close to the data and was trying to enforce "meaningful" properties and restrictions between the plans' needs and the models' properties, which isn't really necessary.
I can restructure my data to be in the following tables:
Plans
Restrictions
Models
Plans would have a many to many relationship to Restrictions
Models would have a many to many relationship to Restrictions
I would resolve the many-to-many relationships with interim (junction) tables:
Plans_Restrictions
Models_Restrictions
This would allow me to have stupid "Restrictions" such as a "Red Thing"
I would query as a chain:
Plans
Plans_Restrictions
Restrictions
Models_Restrictions
Models
i.e. to get all models, with their property information (restriction info), that are eligible for a plan, I could use:
SELECT
M.*
,R.*
FROM (
SELECT P1.*
FROM Plans P1
WHERE id_vendor = #id_vendor
) P
INNER JOIN Plans_Restrictions PR
ON P.plan_id = PR.plan_id
INNER JOIN Restrictions R
ON PR.property = R.property
INNER JOIN Model_Restrictions MR
ON R.property = MR.property
INNER JOIN Model M
ON MR.model_id = M.model_id
And to get all the plans that are eligible for a model, I would reverse the 5-table chained join, as sketched below.
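A hedged sketch of that reversed chain, assuming the same restructured tables; #model_id is a placeholder parameter, mirroring the #id_vendor placeholder above:
-- plans eligible for the given model, joined through the shared restrictions
SELECT
P.*
,R.*
FROM (
SELECT M1.*
FROM Models M1
WHERE model_id = #model_id
) M
INNER JOIN Model_Restrictions MR
ON M.model_id = MR.model_id
INNER JOIN Restrictions R
ON MR.property = R.property
INNER JOIN Plans_Restrictions PR
ON R.property = PR.property
INNER JOIN Plans P
ON PR.plan_id = P.plan_id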
Thanks Abe. Writing this all down in detail to explain it, and understanding why your suggestion didn't solve my problem, really helped me understand what my problem was and what I really needed to do. I don't think I would have solved it so fast without you.

SQL - JOIN on two result tables, ideas to refactor?

This may already have been asked, but StackOverflow is massive and trying to google for something specific enough to actually help is a nightmare!
I've ended up with a fairly large SQL query, and was wondering if SO could maybe point out easier methods for doing it that I might have missed.
I have a table called usage with the following structure:
host | character varying(32) |
usage | integer |
logtime | timestamp without time zone | default now()
I want to get the usage value for both the MAX and MIN logtimes. After working through some of my old textbooks (been a while since I really used SQL properly), I've ended up with this JOIN:
SELECT *
FROM (SELECT u.host,u.usage AS min_val,r2.min
FROM usage u
JOIN (SELECT host,min(logtime) FROM usage GROUP BY host) r2
ON u.host = r2.host AND u.logtime = r2.min) min_table
NATURAL JOIN (SELECT u.host,u.usage AS max_val,r1.max
FROM usage u
JOIN (SELECT host,max(logtime) FROM usage GROUP BY host) r1
ON u.host = r1.host AND u.logtime = r1.max) max_table
;
This seems like a messy way to do it, as I'm basically running the same query twice, once with MAX and once with MIN. I can get both logtime columns in one query by doing SELECT usage,MAX(logtime),MIN(logtime) FROM ..., but I couldn't work out how to then show the usage values that correspond to the two different records.
Any ideas?
With PostgreSQL 9.1 you have window functions at your disposal (8.4+):
SELECT DISTINCT
u.host
,first_value(usage) OVER w AS first_usage
,last_value(usage) OVER w AS last_usage
FROM usage u
WINDOW w AS (PARTITION BY host ORDER BY logtime, usage
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
I sort the partition by logtime and, additionally, by usage to break any ties and arrive at a stable result. Read about window functions in the manual.
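If you prefer to avoid the DISTINCT over the windowed result, here is a rough alternative sketch using PostgreSQL's DISTINCT ON against the same usage table - just an option to benchmark, not necessarily faster:
SELECT mn.host, mn.usage AS min_val, mx.usage AS max_val
FROM (
  -- one row per host: the usage at the earliest logtime
  SELECT DISTINCT ON (host) host, usage
  FROM usage
  ORDER BY host, logtime ASC
) mn
JOIN (
  -- one row per host: the usage at the latest logtime
  SELECT DISTINCT ON (host) host, usage
  FROM usage
  ORDER BY host, logtime DESC
) mx USING (host);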

How to optimize group by in table with huge number of records

I have a Person table with a huge number of records (about 16 million), and I have a requirement to find all persons with the same last name, the same first letter of the first name, and the same birth year; in other words, I want to show presumed duplicate persons in the UI for users to analyze and decide whether they are the same person or not.
Here is the query I wrote:
SELECT *
FROM Person INNER JOIN
(
SELECT SUBSTRING(firstName, 1, 1) firstNameF,lastName,YEAR(birthDate) birthYear
FROM Person
GROUP BY SUBSTRING(firstName, 1,1),lastName,YEAR(birthDate)
HAVING count(*) > 1
) as dupPersons
ON SUBSTRING(Person.firstName,1,1) = dupPersons.firstNameF and Person.lastName = dupPersons.lastName and YEAR(Person.birthDate) = dupPersons.birthYear
order by Person.lastName,Person.firstName
But as I am not an SQL expert, I want to know: is this a good way to do it? Is there a more optimized way?
EDIT
Note that I can cut the data, which may contribute to the optimization;
for example, if I cut the data by 2 it could return two persons:
Johan Smith |
Jane Smith  | have same last name and first-name initial
Jack Smith  |
Mark Tween  | have same last name and first-name initial
Mac Tween   |
If the performance using a GROUP BY is not adequate, you could try using an INNER JOIN:
SELECT *
FROM Person p1
INNER JOIN Person p2 ON p2.PersonID > p1.PersonID
WHERE SUBSTRING(p2.Firstname, 1, 1) = SUBSTRING(p1.Firstname, 1, 1)
AND p2.LastName = p1.LastName
AND YEAR(p2.BirthDate) = YEAR(p1.BirthDate)
ORDER BY
p1.LastName, p1.FirstName
Well, if you're not an expert, the query you wrote says to me that you're at least pretty competent. When we look at whether a query is "optimized", there are two immediate parts to that: (1) the query just on its own has something notably wrong with it - a bad join, keyword misuse, exploding result set size, superstitions about NOT IN, etc.; (2) the context that the query operates within - DB specifics, task specifics, etc.
Your query passes #1, no problem. I would have written it differently - aliased the Person table, used LEFT(P.FirstName, 1) instead of SUBSTRING, and used a CTE (WITH-clause) instead of a subquery. But these aren't optimization issues. Maybe I'd use WITH(READUNCOMMITTED) if the results weren't sensitive to dirty reads. Out of any further context, your query doesn't look like a bomb waiting to go off.
As for #2 - You should probably switch to specifics. Like "I have to run this every week. It takes 17 minutes. How can I get it down to under a minute?" Then people will ask you what your plan looks like, what indexes you have, etc.
Things I'd want to know:
How long does it already take to run?
What's your runtime window? (User & app tolerance for query time.)
Is this run once a day? Week? Month? Quarter?
Do you have the permission to create tables, change current tables, or alter indexes?
Maybe based on having run it, what's the ratio of duplicates you're expecting to find? 5%? 90%?
How stable is the matching criteria requirement?
Example scenario: If this were a run-on-command feature that will be in my app indefinitely, run weekly, with 10% or fewer records expected to be duplicates, with the ability to change the DB how I'd like, and with firm (not fluctuating) duplicate-matching criteria, and I wanted to cut it from 90s to 5s, I'd create a dedicated BirthYear column (possibly a persisted computed column off of BirthDate) and an index on LastName ASC, BirthYear ASC, FirstName ASC. If too many of those stipulations change, I might go in a different direction entirely.
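A minimal sketch of that scenario's computed column and index, assuming SQL Server (the WITH(READUNCOMMITTED) mention suggests it); the BirthYear column and the index name are illustrative:
-- persisted computed column: YEAR() is deterministic, so it can be persisted and indexed
ALTER TABLE Person ADD BirthYear AS YEAR(BirthDate) PERSISTED;
-- index matching the duplicate-search grouping
CREATE INDEX IX_Person_DupSearch ON Person (LastName ASC, BirthYear ASC, FirstName ASC);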
You can try something like this and see the difference in the execution plans, or benchmark the performance of the results:
;WITH DupPersons AS
(
SELECT *, COUNT(1) OVER(PARTITION BY SUBSTRING(firstName, 1, 1), lastName, YEAR(birthDate)) Quant
FROM Person
)
SELECT *
FROM DupPersons
WHERE Quant > 1
Of course, it would also help to know your table definition and the indexes you have created. I think it may help to add a computed column with the year of birthDate and create an index on it, and the same with the first letter of firstName.

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using mySQL 5.1.x)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null,
PRIMARY KEY (id)
);
I have four questions:
1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain, where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?
2. How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?
4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;
The date arithmetic syntax is probably wrong, but you get the idea.
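For MySQL specifically, the date arithmetic could be written like this (same logic as above, only the interval syntax adjusted; a sketch, not tested against your schema):
-- weight change over the last 7 days, one row per user with readings on both dates
SELECT curr.user_id, curr.weight - prev.weight AS weight_gain
FROM foodbar curr
JOIN foodbar prev
ON curr.user_id = prev.user_id
WHERE curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY;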
How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
See above; add ORDER BY curr.weight - prev.weight DESC and LIMIT N.
For the last two questions: don't speculate, examine execution plans. (PostgreSQL has EXPLAIN ANALYZE; MySQL has its own EXPLAIN.) You'll probably find you need to index the columns that participate in WHERE and JOIN clauses, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If mySQL supports calculated columns in a table and allows indexing on those columns then that might help.
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weighings every day - using just somebody's equation of curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but make sure you provide an ORDER BY so that the limit is applied correctly; see the sketch below.
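A minimal sketch of the delta idea combined with ORDER BY/LIMIT, assuming the foodbar table with weight_delta as defined above (the 7-day window and the top 10 are illustrative):
-- total change per user over the last 7 days, biggest gainers first
SELECT user_id, SUM(weight_delta) AS weight_gain
FROM foodbar
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY weight_gain DESC
LIMIT 10;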
It's not obvious, but there's some important information missing in the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is which columns you are filtering by or joining on. The optimiser will use the index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of the data, to be considered useful). There's always a trade-off between the slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. This is a complex calculation based on a number of queries and a lot of processing that has gone before, so Weight will provide zero benefit as an index.
Another note is that even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table. (I.e. a table scan will be used to obtain the bulk of the data.)
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on User_id, Created_at would be useful (more so if it is the clustered index).
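As a sketch (MySQL syntax; if you adopt the composite primary key suggested earlier in this thread, that key already covers these lookups):
-- composite index matching the User_id + Created_at filtering and joins; the name is illustrative
CREATE INDEX idx_foodbar_user_created ON foodbar (user_id, created_at);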
4: No; unfortunately it is mathematically impossible to determine the ordering of the product from the individual values H and W independently. E.g. H=3 and W=3 are both less than 5, yet the product 3*3 is greater than 5*1.
You would have to actually store the calculation and create an index on that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.
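For illustration only: generated (computed) columns arrived in MySQL 5.7, so this is not available on the 5.1.x mentioned in the question, and height is the hypothetical extra column from question 4:
-- stored generated column for the product, plus an index on it; names are illustrative
ALTER TABLE foodbar
ADD COLUMN weight_height_product DOUBLE AS (weight * height) STORED,
ADD INDEX idx_weight_height_product (weight_height_product);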