It seems the '| count' expression works unexpectedly on a tabular expression bound with the 'as' operator. This query returned 100 records instead of 1 record:
traces
| take 100
| as traces100;
traces100
| count
You should do it differently:
let traces100 = traces | take 100;
traces100
| count
It's because as and let are different:
let statements (which is what you need) bind names to expressions. For the rest of the scope in which the let statement appears, the name can be used to refer to its bound value. See more details in the doc.
as (which is what you tried to use) binds a name to the operator's input tabular expression, allowing the query to reference the value of that tabular expression multiple times within a single query, without binding a name through a let statement. See more details in the doc.
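For illustration, here is a minimal sketch (not from the original thread) of what as is meant for: naming the operator's input so it can be reused later within the same query. The name T and the union are only illustrative:
traces
| take 100
| as T        // binds the name T to this intermediate tabular expression
| union T     // reuses T within the same query; yields 100 + 100 rows
| count       // 200
Note that, per the docs, without hint.materialized=true the named expression may be evaluated again on each reference.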
Related
What I have: A datasource with a string column, let's call it "name".
There are more, but those are not relevant to the question.
The "name" column in the context of a concrete query contains only 2 distinct values:
""
"SomeName"
But either of the two appears a varying number of times; there will only ever be those two.
Now, what I need is: in the context of a summarize statement, I need a column filled with the two distinct values concatenated together, so that I end up with just "SomeName".
What I have does not meet this requirement, and I cannot manage to find a solution:
datatable(name:string)["","SomeName","SomeName"] // just to give a minimal reproducible example
| summarize Name = strcat_array(make_list(name), "")
which gives me
| Name
> SomeNameSomeName
but I need just
| Name
> SomeName
I am aware that I need to do some sort of "distinct" somehow and somewhere, or maybe there is a completely different solution that gets to the same result?
So, my question is: what do I need to change in the shown query to fulfill my requirement?
take_any()
When the function is provided with a single column reference, it will
attempt to return a non-null/non-empty value, if such value is
present.
datatable(name:string)["","SomeName","SomeName", ""]
| summarize take_any(name)
name
SomeName
Wow, just as I posted the question, I found an answer:
datatable(name:string)["","SomeName","SomeName", ""]
| summarize Name = max(name)
I have no idea why this works for a string column, but here I am. (Presumably it is because strings compare lexicographically, and the empty string sorts before any non-empty value, so max() returns "SomeName".)
This results in my desired outcome:
| Name
> SomeName
...which I suppose is probably less efficient than David's answer, so I'll prefer his.
Reference:
https://learn.microsoft.com/en-us/azure/kusto/query/materializefunction
https://learn.microsoft.com/en-us/azure/kusto/query/batches
I am using Microsoft's demo portal: https://aka.ms/LADemo
A query can include multiple tabular expression statements, as long as
they are delimited by a semicolon (;) character. The query then
returns multiple tabular results, as produced by the tabular
expression statements, and ordered according to the order of the
statements in the query text.
When I run the following code I only get the result of the first expression, not the second and third ones, which is contrary to the above statement. Why?
let randomSet = materialize(range x from 1 to 30000000 step 1 | project value = rand(10000000));
randomSet | summarize dcount(value);
randomSet | top 3 by value;
randomSet | summarize sum(value)
Picture of the result set: (screenshot omitted)
Log Analytics does not support this feature. To try it out, use the Kusto (Azure Data Explorer) demo cluster; there you will see multiple tables in the output pane.
Context:
I have a test table:
=> \d+ test
                                         Table "public.test"
    Column     |          Type          | Collation | Nullable | Default | Storage  | Stats target | Description
---------------+------------------------+-----------+----------+---------+----------+--------------+-------------
 id            | character varying(255) |           |          |         | extended |              |
 configuration | jsonb                  |           |          |         | extended |              |
The configuration column contains "well-defined" json, which has a key called source_url (skipping other non-relevant keys). An example value for the configuration column is:
{
"source_url": "https://<resource-address>?Signature=R1UzTGphWEhrTTFFZnc0Q4qkGRxkA5%2BHFZSfx3vNEvRsrlDcHdntArfHwkWiT7Qxi%2BWVJ4DbHJeFp3GpbS%2Bcb1H3r1PXPkfKB7Fjr6tFRCetDWAOtwrDrVOkR9G1m7iOePdi1RW%2Fn1LKE7MzQUImpkcZXkpHTUgzXpE3TPgoeVtVOXXt3qQBARpdSixzDU8dW%2FcftEkMDVuj4B%2Bwiecf6st21MjBPjzD4GNVA%2F6bgvKA6ExrdYmM5S6TYm1lz2e6juk81%2Fk4eDecUtjfOj9ekZiGJVMyrD5Tyw%2FTWOrfUB2VM1uw1PFT2Gqet87jNRDAtiIrJiw1lfB7Od1AwNxIk0Rqkrju8jWxmQhvb1BJLV%2BoRH56OHdm5nHXFmQdldVpyagQ8bQXoKmYmZPuxQb6t9FAyovGMav3aMsxWqIuKTxLzjB89XmgwBTxZSv5E9bkWUbom2%2BWq4O3%2BCrVxYwsqg%3D%3D&Expires-At=1569340020&Issued-At=1568293200"
.
.
}
The URL contains a query param Expires-At.
Problem:
There is a scheduled job that runs every 24 hours. This job should find all records that are expired or about to expire (and then do something about them).
Solution:
I have this query to get my job done:
select * from test where to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint) <= now() + interval '24 hours';
Explanation:
The query first splits the source_url at Expires-At= and picks the part to the right of it; it then splits the resulting string on & and picks the left part, thus getting the exact epoch time needed, as text.
The same query also works for the corner case where Expires-At is the last query param in the source_url.
Once it extracts the epoch time as text, it first converts it to a bigint and then converts that to a Postgres timestamp; this timestamp is then compared to see whether it is less than or equal to the time 24 hours from now().
All rows passing the above condition are selected.
So, at the end of each run, the scheduler refreshes all the URLs that will expire in the next 24 hours (including the ones which have already expired).
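To make the string gymnastics easier to follow, here is a hedged sketch of the intermediate steps, using a made-up literal URL:
-- each extraction step shown separately (illustrative URL and epoch value)
select
    split_part(url, 'Expires-At=', 2)                      as after_marker, -- '1569340020&Issued-At=...'
    split_part(split_part(url, 'Expires-At=', 2), '&', 1)  as epoch_text,   -- '1569340020'
    to_timestamp(
        split_part(split_part(url, 'Expires-At=', 2), '&', 1)::bigint
    )                                                      as expires_at    -- timestamptz
from (values
    ('https://example.com/x?Signature=abc&Expires-At=1569340020&Issued-At=1568293200')
) as t(url);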
Questions:
Though this solves my problem, I really don't like this solution. It involves a lot of string manipulation, which I find unclean. Is there a cleaner way to do this?
If we "have" to go with the above solution, can we even use indexes for this kind of query? I know functions like lower() and upper() can be indexed, but I really can't think of any way to index this query.
Alternatives:
Unless there is a really clean solution, I am going to go with this:
I would introduce a new key inside the configuration json called expires_at, making sure it gets filled with the correct value every time a row is inserted.
And then directly query this newly added field (with an index on the configuration column).
I admit that this way I am repeating the Expires-At information, but of all the possible solutions I could think of, this is the one I find cleanest.
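For concreteness, a hedged sketch of what that alternative could look like, assuming a top-level expires_at key holding the epoch seconds (the names here are made up):
-- expression index on the new key; note the extra parentheses required
-- around a non-function expression in CREATE INDEX
create index test_expires_at_idx
    on test (((configuration->>'expires_at')::bigint));

-- the scheduled job's query then becomes a plain comparison
select *
from test
where (configuration->>'expires_at')::bigint
      <= extract(epoch from now() + interval '24 hours');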
Is there a better way than this that you folks can think of?
EDIT:
Updated the query to use substring() with a regex instead of the inner split_part():
select * from test where to_timestamp(split_part(substring(configuration->>'source_url' from 'Expires-At=\d+'), '=', 2)::bigint) <= now() + interval '24 hours';
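One caveat worth noting (my addition, not from the thread): an expression index is only usable when the query's expression matches the indexed expression, so the regex variant would need an index of its own, roughly:
-- index matching the regex-based extraction expression exactly
create index on test (
    to_timestamp(
        split_part(
            substring(configuration->>'source_url' from 'Expires-At=\d+'),
            '=', 2
        )::bigint
    )
);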
Given your current data model, I don't find your WHERE condition that bad.
You can index it with
CREATE INDEX ON test (
    to_timestamp(
        split_part(
            split_part(
                configuration->>'source_url',
                'Expires-At=',
                2
            ),
            '&',
            1
        )::bigint
    )
);
Essentially, you have to index the whole expression on the left side of <=. You can only do that if all functions and operators involved are IMMUTABLE, which I think they are in your case.
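If in doubt, function volatility can be checked in the system catalog; a minimal sketch ('i' = immutable, 's' = stable, 'v' = volatile). Note that to_timestamp() has several variants, and it is the one taking a double precision epoch that is the immutable one:
-- list volatility for the functions used in the index expression
select proname,
       pg_get_function_arguments(oid) as arguments,
       provolatile
from pg_proc
where proname in ('to_timestamp', 'split_part');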
I would change the data model though. First, I don't see the value of having a jsonb column with a single value in it. Why not have the URL as a text column instead?
You could go farther and split the URL into individual parts which are stored in columns.
Whether all this is a good idea depends on how you use the value in the database: often it is a good idea to split off those parts of the data that you use in WHERE conditions and the like, and to leave the rest "in a lump". This is to some extent a matter of taste.
You can use a URI parsing module if that is the part you find unclean. You could use plperl or plpythonu with whatever URI parser library you prefer. But if your json really is "well defined", I don't see much point. Unless you are already using plperl or plpythonu, adding those dependencies probably adds more "dirt" than it removes.
You can build an index:
create index on test (to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint));
set enable_seqscan TO off;
explain select * from test where to_timestamp(split_part(split_part(configuration->>'source_url', 'Expires-At=', 2), '&', 1)::bigint) <= now() + interval '24 hours';
                                                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using test_to_timestamp_idx1 on test  (cost=0.13..8.15 rows=1 width=36)
   Index Cond: (to_timestamp(((split_part(split_part((configuration ->> 'source_url'::text), 'Expires-At='::text, 2), '&'::text, 1))::bigint)::double precision) <= (now() + '24:00:00'::interval))
I would introduce a new key inside the configuration json called expires_at, making sure it gets filled with the correct value every time a row is inserted.
Isn't that just re-arranging the dirt? It makes the query look nicer at the expense of making the insert uglier. Perhaps you could populate it in an INSERT OR UPDATE trigger.
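A hedged sketch of such a trigger, reusing the extraction expression from above (the function and trigger names are made up; EXECUTE FUNCTION needs PostgreSQL 11+, older versions use EXECUTE PROCEDURE):
-- keep a top-level expires_at key in sync on every insert or update
create or replace function test_set_expires_at() returns trigger as $$
begin
    new.configuration := jsonb_set(
        new.configuration,
        '{expires_at}',
        to_jsonb(split_part(split_part(new.configuration->>'source_url',
                                       'Expires-At=', 2), '&', 1)::bigint)
    );
    return new;
end;
$$ language plpgsql;

create trigger test_set_expires_at
    before insert or update on test
    for each row
    execute function test_set_expires_at();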
Can someone explain why these two queries (sometimes) cause errors? I googled for explanations, but none of them were right. I don't want to fix it; these queries are actually meant to be used for a SQL injection attack (error-based SQL injection, I think). The triggered error should be "duplicate entry". I'm trying to find out why they only sometimes cause errors.
Thanks.
select count(*)
from information_schema.tables
group by concat(version(), floor(rand()*2));

select count(*), concat(version(), floor(rand()*2)) x
from information_schema.tables
group by x;
It seems the second one is trying to find out which database version the victim of the injection is using. As for why the "duplicate entry" error only happens sometimes: the usual explanation is that MySQL evaluates the non-deterministic GROUP BY key more than once per row, so a row can be looked up in the internal temporary table under one key (say, the one ending in 0) and then inserted under the other (ending in 1), colliding with an entry that already exists.
The second one is giving me this:
+----------+------------------+
| count(*) | x |
+----------+------------------+
| 88 | 10.1.38-MariaDB0 |
| 90 | 10.1.38-MariaDB1 |
+----------+------------------+
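For what it's worth, error-based injection write-ups usually seed the generator so that the collision, and therefore the error, is reproducible; a sketch of that seeded variant (my addition):
-- rand(0) produces a fixed sequence of values, so the duplicate-key
-- collision in the internal temporary table happens reliably
select count(*), concat(version(), floor(rand(0)*2)) x
from information_schema.tables
group by x;
-- typically fails with something like:
-- ERROR 1062 (23000): Duplicate entry '10.1.38-MariaDB1' for key 'group_key'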
Okay, I'm going to post an answer, and it's more of a frame challenge to the question itself.
Basically: this query is silly and should be rewritten; find out what it's supposed to do and rewrite it in a way that makes sense.
What does the query currently do?
It looks like it's getting a count of the tables in the current database... except it's grouping by a calculated column. And that column takes Version() and appends either a '0' or a '1' to it (chosen randomly).
So the end result? Two rows, each with a numerical value, the sum of which adds up to the total number of tables in the current database. If there are 30 tables, you might get 13/17 one time, 19/11 the next, followed by 16/14.
I have a hard time believing that this is what the query is supposed to do. So instead of just trying to fix the "error", dig in and figure out what piece of data it should be returning, and then rewrite the proc to do it.
I have a pivot table chart in QlikView that has a dimension and an expression. The dimension is a column with 5 possible values: 'a','b','c','d','e'.
Is there a way to restrict the values to 'a','b' and 'c' only?
I would prefer to enforce this from the chart properties with a condition, instead of choosing the values from a listbox if possible.
Thank you very much, I_saw_drones! There is a problem I have, though: I have different expressions defined depending on the category, like this:
IF([Category] = 'A', COUNT({<[field1] = {'x','y'}>} [field2]),
IF([Category] = 'B', SUM({<[field3] = {'z'}>} [field4]),
IF([Category] = 'C', ..., 0)))
In this case, where would I add {$<Category={'A','B','C'}>}? My expression so far doesn't help because, although I tell QV to use a different formula/calculation for each category, the category overall (all 5 values) represents the dimension.
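One hedged possibility (my addition, not from the original thread): modifiers inside a single set expression are ANDed together, so the Category restriction could be merged into each branch's existing set modifier, roughly:
IF([Category] = 'A',
   COUNT({$<[field1] = {'x','y'}, Category = {'A','B','C'}>} [field2]),
IF([Category] = 'B',
   SUM({$<[field3] = {'z'}, Category = {'A','B','C'}>} [field4]),
   0))
Note this only restricts what each expression aggregates; hiding the unwanted dimension rows still needs something like the calculated-dimension approach shown further below.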
One possible method to do this is to use QlikView's Set Analysis to create an expression which sums only your desired values.
For this example, I have a very simple load script:
LOAD * INLINE [
Category, Value
A, 1
B, 2
C, 3
D, 4
E, 5
];
I then have the following Pivot Table Chart set up with a single expression, =sum(Value), which just sums the values (screenshot omitted).
What we need to do is to modify the expression, so that it only sums A, B and C from the Category field.
If I then use QlikView's Set Analysis to modify the expression to the following:
=sum({$<Category={A,B,C}>} Value)
I then achieve my desired result (screenshot omitted).
This then restricts my Pivot Table Chart to displaying only these three values for Category without me having to make a selection in a Listbox. The form of this expression also allows other dimensions to be filtered at the same time (i.e. the selections "add up"), so I could say, filter on a Country dimension, and my restriction for Category would still be applied.
How this works
Let's pick apart the expression:
=sum({$<Category={A,B,C}>} Value)
Here you can recognise the original form we had before (sum(Value)), but with a modification. The part {$<Category={A,B,C}>} is the Set Analysis part and has this format: {set_identifier<set_modifier>}. Coming back to our original expression:
{: Set Analysis expressions always start with a {.
$: Set Identifier: this symbol represents the current selections in the QlikView document. This means that any subsequent restrictions are applied on top of the existing selections. 1 can also be used; it represents the full set of data in your document irrespective of selections (see the example after this list).
<: Start of the set modifiers.
Category={A,B,C}: The dimension that we wish to place a restriction on. The values required are contained within the curly braces and in this case they are ORed together.
>: End of the set modifiers.
}: End of the set analysis expression.
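For example, building on the note about the 1 identifier above, the same restriction applied to the full data set, ignoring the user's current selections, would be:
=sum({1<Category={A,B,C}>} Value)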
Set Analysis can be quite complex and I've only scratched the surface here; I would definitely recommend checking the QlikView topic "Set Analysis" in both the installed help file and the reference manual (PDF).
Finally, Set Analysis in QlikView is quite powerful; however, it should be used sparingly as it can lead to performance problems. In this case, as this is a fairly simple expression, performance should be reasonable.
Whoa! A year later, but what you are looking for is something like this:
Go to the dimension sheet, then select the Category dimension, and click on the Edit Dimension button.
There you can use something like this:
= If(Match(Category, 'a', 'b', 'c'), Category, Null())
This will make the object display only the a, b and c Categories, plus a line for the Null value.
What's left is to check the "Suppress value when null" option on the Dimension sheet.
See you around!
Just thought of another solution to this, which may still be useful to people looking into this.
How about creating a bookmark with the categories that you want and then setting the expressions to be evaluated in the context of that bookmark only?
(Will expand on this later, but take a look at how set analysis can be affected by a bookmark)
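As a hedged sketch of the idea (assuming a bookmark whose ID is BM01 captures the Category selection), a bookmark ID can be used directly as the set identifier:
=sum({BM01} Value)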