set analysis expression distinct count with multiple exclusions - qlikview

I am trying to display a unique count of people. There are multiple programs that the people could have been in, and some of them I don't want to be counted (they were only on a wait list). I've searched the internet and tried an if statement as well as set analysis. I keep getting an answer that isn't accurate and isn't changed by my set analysis.
My current expression is:
=count({<[ProgramName]-={'p2', 'p3', 'p4', 'p5'...}>} distinct [HumanName])
Here is an example of data assuming we are counting the number of distinct people in p1. The count of unique people would be 4:
In my real data I have 50 programs, I want to exclude about 20 and count distinct patients that are in the other 30.
I've tried moving the distinct to the front of the expression (count(distinct( and that didn't affect my answer. When I remove the '-' making the expression include the listed programs, my answer remains the same.
I tried using count({1-[ProgramName]={....} which also returns the same value.
Please help! I am new to set analysis although I'm familiar with other types of coding. Thank you :)

The expression count({<[Program]-={'p2', 'p3', 'p4', 'p5'}>} distinct Name) should work. Usually, when it doesn't work, it means something in your schema is wrong (or it is a QV bug, which can happen).
But you can try this expression (slower):
count(distinct if(match([Program],'p2', 'p3', 'p4', 'p5')=0,[Name]))
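For readers who want to check the number they expect, the row-by-row logic of the if(match(...)=0, ...) expression above can be sketched in Python. The rows below are hypothetical (the example table from the question was not preserved), and the function name is made up for illustration; this is not QlikView code.

```python
# Hypothetical (person, program) rows; the original example data is not available.
rows = [
    ("Ann", "p1"), ("Ann", "p2"),
    ("Bob", "p1"),
    ("Cat", "p3"),
    ("Dan", "p1"),
    ("Eve", "p1"), ("Eve", "p4"),
]
excluded = {"p2", "p3", "p4", "p5"}


def distinct_people(rows, excluded):
    """Count distinct people having at least one row in a non-excluded program."""
    return len({person for person, program in rows if program not in excluded})


print(distinct_people(rows, excluded))  # 4: Ann, Bob, Dan, Eve (Cat is only in p3)
```

If the count in the chart does not move when the exclusion list changes, comparing it against a hand calculation like this one may help tell a data model problem from an expression problem.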

Aggregation of an annotation in GROUP BY in Django

UPDATE
Thanks to the posted answer, I found a much simpler way to formulate the problem. The original question can be seen in the revision history.
The problem
I am trying to translate an SQL query into Django, but am getting an error that I don't understand.
Here is the Django model I have:
class Title(models.Model):
    title_id = models.CharField(primary_key=True, max_length=12)
    title = models.CharField(max_length=80)
    publisher = models.CharField(max_length=100)
    price = models.DecimalField(decimal_places=2, blank=True, null=True)
I have the following data:
publisher             title_id  price  title
--------------------  --------  -----  -----------------------------------
New Age Books         PS2106    7      Life Without Fear
New Age Books         PS2091    10.95  Is Anger the Enemy?
New Age Books         BU2075    2.99   You Can Combat Computer Stress!
New Age Books         TC7777    14.99  Sushi, Anyone?
Binnet & Hardley      MC3021    2.99   The Gourmet Microwave
Binnet & Hardley      MC2222    19.99  Silicon Valley Gastronomic Treats
Algodata Infosystems  PC1035    22.95  But Is It User Friendly?
Algodata Infosystems  BU1032    19.99  The Busy Executive's Database Guide
Algodata Infosystems  PC8888    20     Secrets of Silicon Valley
Here is what I want to do: introduce an annotated field dbl_price which is twice the price, then group the resulting queryset by publisher, and for each publisher, compute the total of all dbl_price values for all titles published by that publisher.
The SQL query that does this is as follows:
SELECT SUM(dbl_price) AS total_dbl_price, publisher
FROM (
    SELECT price * 2 AS dbl_price, publisher
    FROM title
) AS A
GROUP BY publisher
The desired output would be:
publisher             tot_dbl_prices
--------------------  --------------
Algodata Infosystems  125.88
Binnet & Hardley      45.96
New Age Books         71.86
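As a sanity check of the SQL and the expected totals, here is a minimal sketch that loads the sample rows into an in-memory SQLite database and runs the same nested query. SQLite, the REAL column type and the added ORDER BY are assumptions made for the sake of a self-contained example; the question does not say which database backs the Django project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE title (title_id TEXT PRIMARY KEY, title TEXT, publisher TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO title VALUES (?, ?, ?, ?)",
    [
        ("PS2106", "Life Without Fear", "New Age Books", 7),
        ("PS2091", "Is Anger the Enemy?", "New Age Books", 10.95),
        ("BU2075", "You Can Combat Computer Stress!", "New Age Books", 2.99),
        ("TC7777", "Sushi, Anyone?", "New Age Books", 14.99),
        ("MC3021", "The Gourmet Microwave", "Binnet & Hardley", 2.99),
        ("MC2222", "Silicon Valley Gastronomic Treats", "Binnet & Hardley", 19.99),
        ("PC1035", "But Is It User Friendly?", "Algodata Infosystems", 22.95),
        ("BU1032", "The Busy Executive's Database Guide", "Algodata Infosystems", 19.99),
        ("PC8888", "Secrets of Silicon Valley", "Algodata Infosystems", 20),
    ],
)

# Same shape as the query in the question: annotate in a subquery, then group.
query = """
SELECT publisher, SUM(dbl_price) AS total_dbl_price
FROM (SELECT price * 2 AS dbl_price, publisher FROM title) AS A
GROUP BY publisher
ORDER BY publisher
"""
totals = {publisher: round(total, 2) for publisher, total in conn.execute(query)}
for publisher, total in totals.items():
    print(publisher, total)
```

The rounded totals match the desired output table: 125.88, 45.96 and 71.86.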
Django query
The query would look like:
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .values('publisher')
    .annotate(tot_dbl_prices=Sum('dbl_price')))
but gives an error:
KeyError: 'dbl_price'.
which indicates that it can't find the field dbl_price in the queryset.
The reason for the error
Here is why this error happens: the documentation says
You should also note that average_rating has been explicitly included
in the list of values to be returned. This is required because of the ordering of the values() and annotate() clause.
If the values() clause precedes the annotate() clause, any annotations
will be automatically added to the result set. However, if the
values() clause is applied after the annotate() clause, you need to explicitly include the aggregate column.
So, the dbl_price could not be found in aggregation, because it was created by a prior annotate, but wasn't included in values().
However, I can't include it in values either, because I want to use values (followed by another annotate) as a grouping device, since
If the values() clause precedes the annotate(), the annotation will be computed using the grouping described by the values() clause.
which is the basis of how Django implements SQL GROUP BY. This means that I can't include dbl_price inside values(), because then the grouping will be based on unique combinations of both fields publisher and dbl_price, whereas I need to group by publisher only.
So, the following query, which only differs from the above in that I aggregate over model's price field rather than annotated dbl_price field, actually works:
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .values('publisher')
    .annotate(sum_of_prices=Count('price')))
because the price field is in the model rather than being an annotated field, and so we don't need to include it in values to keep it in the queryset.
The question
So, here we have it: I need to include annotated property into values to keep it in the queryset, but I can't do that because values is also used for grouping (which will be wrong with an extra field). The problem essentially is due to the two very different ways that values is used in Django, depending on the context (whether or not values is followed by annotate) - which is (1) value extraction (SQL plain SELECT list) and (2) grouping + aggregation over the groups (SQL GROUP BY) - and in this case these two ways seem to conflict.
My question is: is there any way to solve this problem (without things like falling back to raw sql)?
Please note: the specific example in question can be solved by moving all annotate statements after values, which was noted by several answers. However, I am more interested in solutions (or discussion) which would keep the annotate statement(s) before values(), for three reasons: 1. There are also more complex examples, where the suggested workaround would not work. 2. I can imagine situations, where the annotated queryset has been passed to another function, which actually does GROUP BY, so that the only thing we know is the set of names of annotated fields, and their types. 3. The situation seems to be pretty straightforward, and it would surprise me if this clash of two distinct uses of values() has not been noticed and discussed before.

Update: Since Django 2.1, everything works out of the box. No workarounds needed and the produced query is correct.
This is maybe a bit too late, but I have found the solution (tested with Django 1.11.1).
The problem is that the call to .values('publisher'), which is required to provide the grouping, removes all annotations that are not included in the .values() fields parameter.
And we can't include dbl_price in the fields parameter, because that would add another column to the GROUP BY clause.
The solution is to first perform every aggregation that requires annotated fields, then call .values() and include those aggregations in the fields parameter (this won't add to GROUP BY, because they are aggregations).
Then we should call .annotate() with ANY expression. This makes Django add a GROUP BY clause to the SQL query using the only non-aggregate field in the query, publisher.
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(sum_of_prices=Sum('dbl_price'))
    .values('publisher', 'sum_of_prices')
    .annotate(titles_count=Count('pk')))  # 'pk' because this model's primary key is title_id, not id
The only downside of this approach: if you don't need any aggregation other than the one over the annotated field, you have to include one anyway. Without the last call to .annotate() (and it must include at least one expression!), Django will not add GROUP BY to the SQL query. One way to deal with this is to create a copy of your field:
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price'))  # note the underscore!
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices')))
Also note that you should be careful with QuerySet ordering. It is best to call .order_by() either without parameters, to clear the ordering, or with your GROUP BY field. If the resulting query contains ordering by any other field, the grouping will be wrong.
https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Also, you might want to remove that fake annotation from your output, so call .values() again.
So, final code looks like:
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price'))
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices'))
    .values('publisher', 'sum_of_prices')
    .order_by('publisher'))

This is expected from the way group_by works in Django. All annotated fields are added in GROUP BY clause. However, I am unable to comment on why it was written this way.
You can get your query to work like this:
(Title.objects
    .values('publisher')
    .annotate(total_dbl_price=Sum(2*F('price'))))
which produces following SQL:
SELECT publisher, SUM((2 * price)) AS total_dbl_price
FROM title
GROUP BY publisher
which just happens to work in your case.
I understand this might not be the complete solution you were looking for, but some even complex annotations can also be accommodated in this solution by using CombinedExpressions(I hope!).

Your problem comes from values() followed by annotate(). The order is important.
This is explained in the documentation about the order of annotate and values clauses:
https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#order-of-annotate-and-values-clauses
.values('pub_id') limits the queryset's fields to pub_id, so you can't annotate on income.
The values() method takes optional positional arguments, *fields,
which specify field names to which the SELECT should be limited.

This solution by @alexandr addresses it properly.
https://stackoverflow.com/a/44915227/6323666
What you require is this:
from django.db.models import Sum
Title.objects.values('publisher').annotate(tot_dbl_prices=2*Sum('price'))
Note that I reversed the scenario here by summing the prices first and then doubling the sum, whereas you were trying to double each price and then sum. Hope this is fine.

Query to Find Adjacent Date Records

There exists in my database a page_history table; the idea is that whenever a record in the page table is changed, that record's old values are stored in the history table.
My job now is to find occasions in which a record was changed, and retrieve the pre- and post-conditions of that change. Specifically, I want to know when a page changed groups, and what groups were involved in the change. The query I have below can find these instances, but with the use of the min function, I can only get back the values that match between the two records:
select page_id,
       original_group,
       min(created2) change_date
  from (select h.page_id,
               h.group_id original_group,
               i.group_id new_group,
               h.created_dttm created1,
               i.created_dttm created2
          from page_history h,
               page_history i
         where h.page_id = i.page_id
           and h.created_dttm < i.created_dttm
           and h.group_id != i.group_id)
 group by page_id, original_group, created1
 order by page_id
When I try to get, say, any details of the second record, like new_group, I'm hit with an ORA-00979: not a GROUP BY expression error. I don't want to group by new_group, though, because that's going to destroy the logic (I think it would find records displaying times a page changed from a group to another group, regardless of any changes to other groups in between).
My question, then, is how can I modify this query, or go about writing a new one, that achieves a similar end, but with the added availability of columns that do not match between the two records? In essence, how can I find that min record without sacrificing all the other columns I'm not trying to compare? I don't exactly need a complete answer, any suggestions that point me in the right direction would be appreciated.
I use PL/SQL Developer, and it looks like version 11.2.0.2.0 of Oracle.
EDIT: I have found a solution. It's not pretty, and I'd still like to see some alternatives, but if helping me out would threaten to explode your brain, I would advise relocating to an easier question.

Without seeing your table structure it's hard to rewrite the query, but when a min function is used like that, it invariably seems better to put it into a separate sub-select to get the value you want, and then compare against the result of that sub-select.
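One way to read that suggestion: compute the minimum later timestamp in a correlated sub-select, join back to the row carrying that timestamp, and only then compare the groups. The sketch below does this against an in-memory SQLite table rather than Oracle. The three-column table layout, the sample rows and the assumption that timestamps are unique per page are all hypothetical, since the real page_history structure was not posted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal layout; the real page_history table has more columns.
conn.execute("CREATE TABLE page_history (page_id INTEGER, group_id INTEGER, created_dttm TEXT)")
conn.executemany(
    "INSERT INTO page_history VALUES (?, ?, ?)",
    [
        (1, 10, "2020-01-01"),
        (1, 10, "2020-02-01"),  # edit that kept the same group
        (1, 20, "2020-03-01"),
        (1, 10, "2020-04-01"),
        (2, 5, "2020-01-15"),
        (2, 7, "2020-01-20"),
    ],
)

# For each history row h, find the immediately following row i of the same page
# (the MIN lives in its own sub-select), then keep pairs whose group differs.
query = """
SELECT h.page_id,
       h.group_id     AS original_group,
       i.group_id     AS new_group,
       i.created_dttm AS change_date
FROM page_history h
JOIN page_history i
  ON i.page_id = h.page_id
 AND i.created_dttm = (SELECT MIN(n.created_dttm)
                       FROM page_history n
                       WHERE n.page_id = h.page_id
                         AND n.created_dttm > h.created_dttm)
WHERE h.group_id != i.group_id
ORDER BY h.page_id, change_date
"""
changes = conn.execute(query).fetchall()
for row in changes:
    print(row)
```

Because the join returns the whole adjacent row, any of its columns (new_group here) can be selected without a GROUP BY, which sidesteps the ORA-00979 problem described in the question.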

SQL (MS) - Custom compare for Null=Value on many columns

I have a table with 50 columns of identifying information that is inconsistently filled out, even for the same individual. Sadly, individuals do not have a unique identifier in this system.
For example, sometimes we may capture a person's middle name or preferred name, and sometimes it is null - for the SAME individual.
Simplest solution I could think of would be a custom compare function that takes (NULL,VALUE) and returns true, but I'm not sure how to implement this, or if it's even wise.
Ideally I would like to link up records with a lag over partition, but there is frustratingly little information on how partition works other than it takes a 'value expression'. I have tested that it can accept multiple comma separated columns, but the occurrence of null values causes us to miss matches.
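To make the (NULL, VALUE) idea concrete, here is a small Python sketch of a comparison that treats a missing value as matching anything. The function names and the three-field records are hypothetical, and this only models the comparison itself, not how it would be wired into a T-SQL window function.

```python
def null_tolerant_equal(a, b):
    """Treat None (NULL) as a wildcard: it matches any value, including another None."""
    return a is None or b is None or a == b


def records_match(left, right):
    """Two records match when every column pair matches under the wildcard rule."""
    return all(null_tolerant_equal(a, b) for a, b in zip(left, right))


# Hypothetical (first, middle, last) records for illustration.
a = ("John", None, "Smith")
b = ("John", "Quincy", "Smith")
c = ("John", "Robert", "Smith")

print(records_match(a, b))  # True
print(records_match(a, c))  # True
print(records_match(b, c))  # False
```

The example also hints at why caution is warranted: the relation is not transitive (a matches both b and c, but b and c do not match each other), so it cannot serve directly as a grouping or partitioning key.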

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, and so the query is failing.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this would give me an error, because when it reaches "X" it will try using a negative number for the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
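For illustration, the "case when" guard mentioned at the end of the question might look like the sketch below, run here against in-memory SQLite. This is only meant to show the shape of the guard: SQLite spells the functions length and substr rather than len and substring, and unlike SQL Server it does not raise an error on a negative length, so the sketch does not reproduce the failure itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO names VALUES (?, ?)", [(1, "Peter"), (2, "X")])

# The guard sits inside the expression, so it does not depend on when
# a join or WHERE filter happens to be applied.
query = """
SELECT id,
       CASE WHEN length(name) > 3
            THEN substr(name, 1, length(name) - 3)
       END AS trimmed
FROM names
ORDER BY id
"""
result = conn.execute(query).fetchall()
print(result)  # [(1, 'Pe'), (2, None)]
```

Rows that are too short come back as NULL instead of triggering the substring call with a negative length.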

Martin gave this link that pretty much explains what is going on - the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.

Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
Build a working table from all of the table constructors in the FROM clause.
Remove from the working table those rows that do not satisfy the WHERE clause.
Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
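The "as if" model above can be mimicked with a toy Python sketch using the NAMES and LONG_NAMES rows from the question. Everything here (the function names, the strict substring helper) is hypothetical and says nothing about how any real engine is implemented; it only contrasts the logical order with a plan that computes the SELECT expression before filtering.

```python
def strict_substring(text, start, length):
    """Mimic a SUBSTRING that rejects a negative length (1-based start)."""
    if length < 0:
        raise ValueError("invalid length passed to substring")
    return text[start - 1:start - 1 + length]


NAMES = [{"id": 1, "name": "Peter"}, {"id": 2, "name": "X"}]
LONG_NAMES = [{"id": 1, "name_length": 5}]


def logical_order(names, long_names):
    """Build the working table (FROM with the join) first, then evaluate SELECT."""
    long_ids = {row["id"] for row in long_names}
    working = [row for row in names if row["id"] in long_ids]
    return [strict_substring(r["name"], 1, len(r["name"]) - 3) for r in working]


def reordered_plan(names, long_names):
    """Evaluate the SELECT expression for every row, then filter by the join."""
    computed = [(r["id"], strict_substring(r["name"], 1, len(r["name"]) - 3)) for r in names]
    long_ids = {row["id"] for row in long_names}
    return [value for row_id, value in computed if row_id in long_ids]


print(logical_order(NAMES, LONG_NAMES))  # ['Pe']
try:
    reordered_plan(NAMES, LONG_NAMES)
except ValueError as exc:
    print("reordered plan failed:", exc)
```

Both functions describe the same result set on paper; only the second one ever calls the substring helper on "X", which is the behaviour the question observed.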
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.

You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution-time statistics. If you are using SQL Server Management Studio, there is a toolbar above the query editor where you can view the estimated execution plan; it shows how your query will be rearranged to gain some speed. So if you have just used your NAMES table and it is still in the buffer, the engine might first try to process that data, and then join it with the other table.

Explanation of particular sql injection

Browsing through the more dubious parts of the web, I happened to come across this particular SQL injection:
http://server/path/page.php?id=1+union+select+0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4,5,6--
My knowledge of SQL - which I thought was half decent - seems very limiting as I read this.
Since I develop extensively for the web, I was curious to see what this code actually does and more importantly how it works.

It replaces a query that is improperly built by string concatenation (rather than properly parametrized), like this:
$sql = '
SELECT *
FROM products
WHERE id = ' . $_GET['id'];
with this query:
SELECT *
FROM products
WHERE id = 1
UNION
select 0,1,concat_ws(user(),0x3A,database(),0x3A,version()),3,4,5,6
which gives you information about the database name, the version and the username used for the connection.
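The difference between concatenating the parameter and binding it can be sketched with Python's sqlite3 module. The three-column products table and the payload are hypothetical and adapted to SQLite (sqlite_version() stands in for the MySQL functions in the URL); the point is only to show why the concatenated form lets a UNION through while a bound parameter does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (1, 'Doodad Foo', 9.99)")

# What an attacker might put in the id parameter (three columns to match the table).
user_input = "1 UNION SELECT 0, sqlite_version(), 0"

# Unsafe: the input becomes part of the SQL text, so the UNION is executed.
unsafe_sql = "SELECT * FROM products WHERE id = " + user_input
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: the input is bound as a single value and is never parsed as SQL.
safe_rows = conn.execute("SELECT * FROM products WHERE id = ?", (user_input,)).fetchall()

print(len(unsafe_rows))  # 2: the real product plus the injected row
print(safe_rows)         # []: no product has that odd string as its id
```

With the bound parameter the whole payload is compared against id as one value, so it matches nothing.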

The injection result relies on some assumptions about the underlying query syntax.
What is being assumed here is that there is a query somewhere in the code which will take the "id" parameter and substitute it directly into the query, without bothering to sanitize it.
It's assuming a naive query syntax of something like:
select * from records where id = {id param}
What this does is result in a substituted query (in your above example) of:
select * from records where id = 1 union select 0, 1 , concat_ws(user(),0x3a,database(),0x3a,version()), 3, 4, 5, 6 --
Now, what this does that is useful is that it manages to grab not only the record that the program was interested in, but also it UNIONs it with a bogus dataset that tells the attacker (these values appear separated by colons in the third column):
the username with which we are connected to the database
the name of the database
the version of the db software
You could get the same information by simply running:
select concat_ws(user(),0x3a,database(),0x3a,version())
Directly at a sql prompt, and you'll get something like:
joe:production_db:mysql v. whatever
Additionally, since UNION does an implicit sort, and the first column in the bogus data set starts with a 0, chances are pretty good that your bogus result will be at the top of the list. This is important because the program is probably only using the first result, or there is an additional little bit of SQL in the basic expression I gave you above that limits the result set to one record.
The reason that there is the above noise (e.g. the select 0,1,...etc) is that in order for this to work, the statement you are calling the UNION with must have the same number of columns as the first result set. As a consequence, the above injection attack only works if the corresponding record table has 7 columns. Otherwise you'll get an error about mismatched column counts and this attack won't really give you what you want. The double dashes (--) are just to make sure anything that might happen afterwards in the substitution is ignored, and I get the results I want. The 0x3a garbage is the hex code for a colon, and is just saying "separate my values by colons". (Strictly speaking, MySQL's CONCAT_WS takes the separator as its first argument, so as written user() ends up being the separator; the intent is clearly colon-separated values.)
Now, what makes this query useful as an attack vector is that it is easily rewritten by hand if the table has more or fewer than 7 columns.
For example if the above query didn't work, and the table in question has 5 columns, after some experimentation I would hit upon the following query url to use as an injection vector:
http://server/path/page.php?id=1+union+select+0,1,concat_ws(user(),0x3a,database(),0x3a,version()),3,4--
The number of columns the attacker is guessing is probably based on an educated look at the page. For example if you're looking at a page listing all the Doodads in a store, and it looks like:
Name       | Type  | Manufacturer
Doodad Foo | Shiny | Shiny Co.
Doodad Bar | Flat  | Simple Doodads, Inc.
It's a pretty good guess that the table you're looking at has 4 columns (remember there's most likely a primary key hiding somewhere if we're searching by an 'id' parameter).
Sorry for the wall of text, but hopefully that answers your question.

This code adds an additional union query to the select statement that is being executed on page.php. The injector has determined that the original query has 7 fields, hence the selection of the numeric placeholder values (column counts must match in a union). The concat_ws just makes one field with the values for the database user, the database, and the version, separated by colons.

It seems to retrieve the user used to connect to the database, the database name (database() returns the name, not an address and port), and its version. These values would then be shown in the page output or in the error message.
