I am using Application Insights to monitor API usage in my application. I am trying to generate a report listing how many times a particular API was called over the last 2 months. Here is my query:
requests
| where timestamp >= ago(24*60h)
| summarize count() by name
| order by count_ desc
The problem is that the 'name' of the API also has parameters attached to the URL, so the same API appears many times in the result set with different parameters (e.g. GET api/getTasks/1, GET api/getTasks/2). I looked through the 'requests' schema to check whether there is a column with the API name without parameters, but couldn't find one. Is there a way to group by 'name' without parameters in Application Insights? Please help with the query. Thanks so much in advance.
This cuts everything after the second slash:
requests
| where timestamp > ago(1d)
| extend idx = indexof(name, "/", indexof(name, "api/") + 4)
| extend strippedname = iff(idx >= 0, substring(name, 0, idx), name)
| summarize count() by strippedname
| order by count_
Another approach (if API surface is small) is to extract values through nested iif operators.
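For example, something along these lines (only a sketch: GET api/getTasks is taken from the question, GET api/getUsers is a made-up second endpoint to show the nesting):

requests
| where timestamp >= ago(24*60h)
// map each known endpoint prefix to a fixed name, fall back to the raw name
| extend strippedname = iff(name startswith "GET api/getTasks", "GET api/getTasks",
                        iff(name startswith "GET api/getUsers", "GET api/getUsers", name))
| summarize count() by strippedname
| order by count_ desc

This only stays manageable while the list of endpoints is short, which is why the indexof/substring version above scales better.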
For AWS ALB access logs (https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-access-logs.html), I would like an example Athena SQL query that sorts descending/ascending by the count of the client:port field for a given elb_status_code/target_status_code, between a start and end date (DD-MM-YYYY HH-MM).
The result of the query for target_status_code=500 should look like this:
+------------------+----------------------------------+
| client:port      | count of target_status_code=500  |
+------------------+----------------------------------+
| 70.132.2.XX:port | 2570                             |
| 70.132.2.XX:port | 2315                             |
| 80.122.1.XX:port | 1750                             |
| ...              | ...                              |
+------------------+----------------------------------+
The point is to find the top client:port values (the IP address and port of the requesting client) with elb_status_code/target_status_code = 4xx or 5xx (https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
Using the table described in Querying Classic Load Balancer Logs, and assuming you partition it by date (the partition key is called date_partition_key below), you could do something like this:
SELECT
  CONCAT(request_ip, ':', CAST(request_port AS VARCHAR)) AS client_port,
  COUNT(*) AS count_of_status_500
FROM elb_logs
WHERE elb_response_code = '500'
  AND date_partition_key BETWEEN '2022-01-01' AND '2022-01-03'
GROUP BY 1
ORDER BY 2 DESC
The 1 and 2 in the group and order by clauses refer back to the first and second items in the select list, i.e. the client port and the count, respectively. It's just a convenient way of not having to repeat the function calls etc.
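Written out in full, just for illustration, the last two clauses would read:

GROUP BY CONCAT(request_ip, ':', CAST(request_port AS VARCHAR))
ORDER BY COUNT(*) DESC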
Meanwhile I found this link:
https://aws.amazon.com/premiumsupport/knowledge-center/athena-analyze-access-logs/
with some example ALB access log queries. It may be useful for users not very familiar with SQL queries (like me).
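To cover the whole 4xx/5xx range rather than a single code, a variation along these lines should work (only a sketch against the same elb_logs table as above; the dates are placeholders, and you would swap in the backend/target status column if that is the one you care about):

SELECT
  CONCAT(request_ip, ':', CAST(request_port AS VARCHAR)) AS client_port,
  COUNT(*) AS error_count
FROM elb_logs
WHERE (elb_response_code LIKE '4%' OR elb_response_code LIKE '5%')
  AND date_partition_key BETWEEN '2022-01-01' AND '2022-01-03'
GROUP BY 1
ORDER BY 2 DESC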
Context:
I have a test table:
=> \d+ test
Table "public.test"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------------+------------------------+-----------+----------+---------+----------+--------
------+-------------
id | character varying(255) | | | | extended |
|
configuration | jsonb | | | | extended |
|
The configuration column contains "well-defined" JSON, which has a key called source_url (skipping other non-relevant keys). An example value for the configuration column is:
{
"source_url": "https://<resource-address>?Signature=R1UzTGphWEhrTTFFZnc0Q4qkGRxkA5%2BHFZSfx3vNEvRsrlDcHdntArfHwkWiT7Qxi%2BWVJ4DbHJeFp3GpbS%2Bcb1H3r1PXPkfKB7Fjr6tFRCetDWAOtwrDrVOkR9G1m7iOePdi1RW%2Fn1LKE7MzQUImpkcZXkpHTUgzXpE3TPgoeVtVOXXt3qQBARpdSixzDU8dW%2FcftEkMDVuj4B%2Bwiecf6st21MjBPjzD4GNVA%2F6bgvKA6ExrdYmM5S6TYm1lz2e6juk81%2Fk4eDecUtjfOj9ekZiGJVMyrD5Tyw%2FTWOrfUB2VM1uw1PFT2Gqet87jNRDAtiIrJiw1lfB7Od1AwNxIk0Rqkrju8jWxmQhvb1BJLV%2BoRH56OHdm5nHXFmQdldVpyagQ8bQXoKmYmZPuxQb6t9FAyovGMav3aMsxWqIuKTxLzjB89XmgwBTxZSv5E9bkWUbom2%2BWq4O3%2BCrVxYwsqg%3D%3D&Expires-At=1569340020&Issued-At=1568293200"
.
.
}
The URL contains a query param called Expires-At.
Problem:
There is a scheduled job that runs every 24 hours. This job should find all such records which are expired/about to expire (and then do something about it).
Solution:
I have this query to get my job done:
select * from test where to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint) <= now() + interval '24 hours';
Explanation:
The query first splits the source_url at Expires-At= and takes the part to the right of it, then splits the result on & and takes the part to the left of it, which yields the exact epoch time needed, as text.
The same query also works for the corner case where Expires-At is the last query param in the source_url.
Once the epoch time is extracted as text, it is cast to a bigint, converted to a Postgres timestamp, and that timestamp is compared to see whether it is less than or equal to the time 24 hours from now().
All rows passing the above condition are selected.
So, at the end of each run, the scheduler refreshes all the URLs that will expire in the next 24 hours (including the ones which have already expired).
Questions:
Though this solves my problem, I really don't like this solution. It involves a lot of string manipulation, which I find unclean. Is there a cleaner way to do this?
If we "have" to go with the above solution, can we even use indexes for this kind of query? I know functions like lower() and upper() can be indexed, but I really can't think of any way to index this query.
Alternatives:
Unless there is a real clean solution, I am going to go with this:
I would introduce a new key inside the configuration JSON called expires_at, making sure it gets filled with the correct value every time a row is inserted, as sketched below.
And then directly query this newly added field (with an index on the configuration column).
I admit that this way I am repeating the information in Expires-At, but of all the possible solutions I could think of, this is the one I find cleanest.
Is there a better way than this that you folks can think of?
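For what it is worth, that alternative could look roughly like this (only a sketch; it assumes expires_at holds the same epoch value that currently lives only inside the URL, and it indexes the extracted value rather than the whole configuration column so the planner can use it for the range comparison):

-- expression index on the extracted epoch value
CREATE INDEX ON test (((configuration->>'expires_at')::bigint));

-- the scheduled job's query then becomes
SELECT *
FROM test
WHERE (configuration->>'expires_at')::bigint <= extract(epoch FROM now() + interval '24 hours');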
EDIT:
Updated the query to use substring() with regex instead of inner split_part():
select * from test where to_timestamp(split_part(substring(configuration->>'source_url' from 'Expires-At=\d+'), '=', 2)::bigint) <= now() + interval '24 hours';
Given your current data model, I don't find your WHERE condition that bad.
You can index it with
CREATE INDEX ON test (
    to_timestamp(
        split_part(
            split_part(configuration->>'source_url', 'Expires-At=', 2),
            '&',
            1
        )::bigint
    )
);
Essentially, you have to index the whole expression on the left side of the comparison. You can only do that if all functions and operators involved are IMMUTABLE, which I think they are in your case.
I would change the data model though. First, I don't see the value of having a jsonb column with a single value in it. Why not have the URL as a text column instead?
You could go farther and split the URL into individual parts which are stored in columns.
If all this is a good idea depends on how you use the value in the database: often it is a good idea to split off those parts of the data that you use in WHERE conditions and the like and leave the rest "in a lump". This is to some extent a matter of taste.
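As a rough sketch of that split-off idea (the column name source_url_expires_at is made up here):

ALTER TABLE test ADD COLUMN source_url_expires_at timestamptz;
CREATE INDEX ON test (source_url_expires_at);

-- the job's query becomes a plain indexed range condition
SELECT *
FROM test
WHERE source_url_expires_at <= now() + interval '24 hours';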
You can use a URI parsing module, if that is the part you find unclean. You could use plperl or plpythonu, with whatever URI parser library in them you prefer. But if your json is really "well defined" I don't see much point. Unless you are already using plperl or plpythonu, adding those dependencies probably adds more "dirt" than it removes.
You can build an index:
create index on test (to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint));
set enable_seqscan TO off;
explain select * from test where to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint) <= now() + interval '24 hours';
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using test_to_timestamp_idx1 on test (cost=0.13..8.15 rows=1 width=36)
Index Cond: (to_timestamp(((split_part(split_part((configuration ->> 'source_url'::text), 'Expires-At='::text, 2), '&'::text, 1))::bigint)::double precision) <= (now() + '24:00:00'::interval))
"I would introduce a new key inside the configuration JSON called expires_at, making sure it gets filled with the correct value every time a row is inserted."
Isn't that just re-arranging the dirt? It makes the query look nicer, at the expense of making the insert uglier. Perhaps you could put it in an INSERT OR UPDATE trigger.
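A minimal sketch of such a trigger, assuming the expires_at key inside configuration proposed in the question (the function and trigger names are made up):

CREATE OR REPLACE FUNCTION fill_expires_at() RETURNS trigger AS $$
BEGIN
    -- copy the Expires-At query parameter into a top-level expires_at key
    NEW.configuration := jsonb_set(
        NEW.configuration,
        '{expires_at}',
        to_jsonb(split_part(split_part(NEW.configuration->>'source_url',
                                       'Expires-At=', 2), '&', 1)::bigint)
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER fill_expires_at
BEFORE INSERT OR UPDATE ON test
FOR EACH ROW EXECUTE PROCEDURE fill_expires_at();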
Can someone explain why these two queries (sometimes) cause errors? I googled some explanations but none of them were right. I don't want to fix it; these queries are actually meant to be used for a SQL injection attack (I think error-based SQL injection). The triggered error should be "duplicate entry". I'm trying to find out why they sometimes cause errors.
Thanks.
select
count(*)
from
information_schema.tables
group by
concat(version(),
floor(rand()*2));
select
count(*),
concat(version(),
floor(rand()*2))x
from
information_schema.tables
group by
x;
It seems the second one is trying to guess which database the victim of the injection is using.
The second one is giving me this:
+----------+------------------+
| count(*) | x |
+----------+------------------+
| 88 | 10.1.38-MariaDB0 |
| 90 | 10.1.38-MariaDB1 |
+----------+------------------+
Okay, I'm going to post an answer - and it's more of a frame challenge to the question itself.
Basically: this query is silly, and it should be rewritten; find out what it's supposed to do and rewrite it in a way that makes sense.
What does the query currently do?
It looks like it's getting a count of the tables in the current database... except it's grouping by a calculated column. And that column looks like it takes Version() and appends either a '0' or a '1' to it (chosen randomly).
So the end result? Two rows, each with a numerical value, the sum of which adds up to the total number of tables in the current database. If there are 30 tables, you might get 13/17 one time, 19/11 the next, followed by 16/14.
I have a hard time believing that this is what the query is supposed to do. So instead of just trying to fix the "error" - dig in and figure out what piece of data it should be returning - and then rewrite the proc to do it.
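If the goal really is just the table count, the straightforward rewrite needs neither the GROUP BY nor the random expression:

SELECT COUNT(*) AS table_count
FROM information_schema.tables;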
I have this request:
Model.group(:p_id).pluck("AVG(desired)")
=> [0.77666666666666667e1, 0.431666666666666667e2, ...]
but when I ran this SQL:
SELECT AVG(desired) AS desired
FROM model
GROUP BY p_id
I got
+------------------+
| desired          |
+------------------+
| 7.76666666666667 |
| 43.1666666666667 |
| ...              |
+------------------+
What is the reason for this? Sure, I can multiply, but I bet there should be an explanation.
I found that
Model.group(:p_id).pluck("AVG(desired)").map{|a| a.to_f}
=> [7.76666666666667,43.1666666666667, ...]
Now I'm struggling with another task: I need numeric attributes in the pluck, so my request is:
Model.group(:p_id).pluck("p_id, AVG(desired)")
How do I get the correct AVG value in this case?
0.77666666666666667e1 is (almost) 7.76666666666667; they're the same number in two different representations with slightly different precision. If you dump the first one into irb, you'll see:
> 0.77666666666666667e1
=> 7.766666666666667
When you perform an avg in the database, the result has type numeric which ActiveRecord represents using Ruby's BigDecimal. The BigDecimal values are being displayed in scientific notation but that shouldn't make any difference when you format your data for display.
In any case, pluck isn't the right tool for this job, you want to use average:
Model.group(:p_id).average(:desired)
That will give you a Hash which maps p_id to averages. You'll still get the averages in BigDecimals but that really shouldn't be a problem.
Finally I've found a solution:
Model.group(:p_id).pluck("p_id, AVG(Round(desired))")
=> [[1,7.76666666666667],[2,43.1666666666667], ...]
I have a problem retrieving language information from GitHub Archive on Google BigQuery since the structure of the tables changed at the beginning of 2015.
When querying the github_timeline table I have a field named repository_language. It allows me to get my language statistics.
Unfortunately, the structure changed for 2015 and that table doesn't contain any events after 2014.
For example the following query doesn't return any data:
select
repository_language, repository_url, created_at
FROM [githubarchive:github.timeline]
where
PARSE_UTC_USEC(created_at) > PARSE_UTC_USEC('2015-01-02 00:00:00')
Events for 2015 are in the githubarchive:month & githubarchive:day tables. None of them have language information though (or at least not a repository_language column).
Can anyone help me?
Look at the payload field. It is a string that, I think, actually holds JSON with all the "missing" attributes. You can process it using JSON Functions.
Added Query
Try the query below:
SELECT
JSON_EXTRACT_SCALAR(payload, '$.pull_request.head.repo.language') AS language,
COUNT(1) AS usage
FROM [githubarchive:month.201601]
GROUP BY language
HAVING NOT language IS NULL
ORDER BY usage DESC
What Mikhail said + you can use a query like this:
SELECT JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.language') language, COUNT(*) c
FROM [githubarchive:month.201501]
GROUP BY 1
ORDER BY 2 DESC