COUNT Function: Result of a query - sql

I'm having trouble with a part of the following question. Thank you in advance for your help. I have a hard time visualizing this "fake" database table. I was hoping someone could help me run through my logic and see if it's correct. If someone could just point me in the right direction that would be great!
About:
Sesame is a way to find online class for adults & activities for adults around you.
Imagine a database table named activities. It has four columns:
activity_id [int, non null]
activity_provider_id [int, non null]
area_id [int, nullable]
starts_at [timestamp, non null]
Question: Given the following query, which counts would you expect to return the highest and lowest values? Which counts would you expect to be the same? Why?
select
count(activity_id),
count(distinct activity_provider_id),
count(area_id),
count(distinct area_id),
count(*)

 from activities
My Solution
Highest values: count(*)
Reasoning: The Count(*) function returns the number of rows returned by a SELECT statement, including NULL and duplicates.
Lowest values: count(distinct activity_provider_id)
Reasoning: Less activity providers per activity per area*
Same: Unsure - Could someone just point me in the right direction?

count(*) takes in account all rows in the table, while count(some_col) only counts non-null values of some_col.
Since activity_id is a non nullable colum, one would expect the following expressions return the same, "highest" count:
count(activity_id)
count(*)
As for wich expression returns the lowest count out of the three remaining choices, it is not really possible to tell for sure from the information provided in the question. If actually depends whether that are more, or less, distinct areas than activity providers.
There even is an edge case where all expressions return the same case, if all activity providers (resp. areas) are not null and unique in the table.

Related

Finding the last 4, 3, 2, 1 months consecutive order drops among clients based on drop variance

Here I have this query that finds out the drop percentage of a bunch of clients based on the orders they have received(i.e. It finds the percentage difference in orders by comparing the current month with the previous month). What I want to achieve here is to have a field where I can see the clients who had 4 months continuous drop, 3 months drop, 2 months drop, and 1 month drop.
I know, it can only be achieved by comparing the last 4 months using the lag function or sub queries. can you guys pls help me out on this one, would appreciate it very much
select
fd.customers2, fd.Month1, fd.year1, fd.variance, case when
(fd.variance < -0.00001 and fd.year1 = '2022.0' and fd.Month1 = '1')
then '1month drop' else fd.customers2 end as 1_most_host_drop
from 
(SELECT
c.*,
sa.customers as customers2,
sum(sa.order) as orders,
date_part(mon, sa.date) as Month1,
date_part(year, sa.date) as year1,
(cast(orders - LAG(orders) OVER(Partition by customers2 ORDER BY
 year1, Month1) as NUMERIC(10,2))/NULLIF(LAG(orders) 
OVER(partition by customers2 ORDER BY year1, Month1) * 1, 0)) AS variance
FROM stats sa join (select distinct
    d.id, d.customers 
     from configer d 
    ) c on sa.customers=c.customers
WHERE sa.date >= '2021-04-1' 
GROUP BY Month1, sa.customers, c.id,  year1, 
     c.customers)fd
In a spirit of friendliness: I think you are a little premature in posting this here as there are several issues with the syntax before even reaching the point where you can solve the problem:
You have at least two places with a comma immediately preceding the word FROM:
...AS variance, FROM stats_archive sa ...
...d.customers, FROM config d...
Recommend you don't use VARIANCE as an alias (it is a system function in PostgreSQL and so is likely also a system function name in Redshift)
Not super important, but there's no need for c.* - just select the columns you will use
DATE_PART requires a string as the first parameter DATE_PART('mon',current_date)
I might be wrong about this, but I suspect you cannot use column aliases in the partition by or order by of a window function. Put the originating expressions there instead:
... OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
LAG has three parameters. (1) The column you want to retrieve the value from, (2) the row offset, where a positive integer indicates how many rows prior to the current row you should retrieve a value from according to the partition and order context and (3) the value the function should return as a default (in case of the first row in the partition). As such, you don't need NULLIF. So, to get the row immediately prior to the current row, or return 0 in case the current row is the first row in the partition:
LAG(orders,1,0) OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
If you use 0 as a default in the calculation of what is currently aliased variance, you will almost certainly run into a div/0 error either now, or worse, when you least expect it in the future. You should protect against that with some CASE logic or better, provide a more appropriate default value or even better, calculate the LAG with the default 0, then filter out the 0 rows before doing the calculation.
You can't use column aliases in the GROUP BY. You must reference each field that is not participating in an aggregate in the group by, whether through direct mention (sa.date) or indirectly in an expression (DATE_PART('mon',sa.date))
Your date should be '2021-04-01'
All in all, without sample data, expected results using the posted sample data and without first removing syntax errors, it is a tall order to attempt to offer advice on the problem which is any more specific than:
Build the source of the calculation as a completely separate query first. Calculate the LAG in that source query. Only when you've run that source query and verified that the LAG is producing the correct result should you then wrap it as a sub-query or CTE (not sure if Redshift supports these, but presumably) at which point you can filter out the rows with a zero as the denominator (the first month of orders for each customer).
Good luck!

how to calculate prevalence using sql code

I am trying to calculate prevalence in sql.
kind of stuck in writing the code.
I want to make automative code.
I have check that I have 1453477 of sample size and number of people who has disease is 851451 using count.
The formula of calculating prevalence is no.of person who has disease/no.sample size.
select (COUNT(condition_id)/COUNT(person_id)) as prevalence
from disease
where condition_id=12345;
when I run above code, I get 1 as a output where I am suppose to get 0.5858.
Can some one please help me out?
Thanks!
In your current query you count the number of rows in the disease table, once using the column condition_id, once using the column person_id. But the number of rows is the same - this is why you get 1 as a result.
I think you need to find the number of different values for these columns. This can be done using count distinct:
select (COUNT(DISTINCT condition_id)/COUNT(DISTINCT person_id)) as prevalence
from disease
where condition_id=12345;
You can cast by
count(...)/count(...)::numeric(6,4) or
count(...)/count(...)::decimal
as two options.
Important point is apply cast to denominator or numerator part(in this case denominator), Do not apply to division as
(count(...)/count(...))::numeric(6,4) which again results an integer.
I am pretty sure that the logic that you want is something like this:
select avg( (condition_id = 12345)::int )
from disease;
Your version doesn't have the sample size, because you are filtering out people without the condition.
If you have duplicate people in the data, then this is a little more complicated. One method is:
select (count(distinct person_id) filter (where condition_id = 12345)::numeric /
count(distinct person_id
)
from disease;

SQL counting number of rows

I am looking for a way to search for a certain number of rows as a quality check. For example, we have tables that have a certain set of results that are needed.
Here is a quick table for an example:
ID: Name: Result: Reportable:
ONE A 10 X
TWO B 12 X
THREE C 1
FOUR D 18 X
FOUR(redo) D 11 X
So we are looking to double check results as there are people who accidentally report results multiple times (as in the case with ID FOUR). We have used having counts but we need the numbers to be specific and need a query to verify that number is satisfied.
In the table above we only want IDs ONE, TWO, and FOUR, however we have 4 results (one extra). Currently we have our check showing the count needed (ie 3) and the current result count (4) to show the mismatch but want a query to easily only show the result needed. We would need the redo result most of the time so we have set it so we take the latest date, but it doesn't help filter how many rows or results. I apologize if anything is confusing and I am not able to share the SQL query that we have currently. It's my first time posting so if I need to clarify anything please let me know as this seems to be very complicated. Thank you for your time.
EDIT: The details
We have one table (Table A) letting us know which results are reportable. The ones that are reportable go into another table (Table B). We have had issues in which people have made too many results reportable which overpopulates the Table B. Our old query had a count in Table B, but due to mistakes in people placing multiple reportables, samples which had many redos seem to be finished as they were all placed and met the count in Table B.
So now by using the Table A that helps tell us how many are Reportable, we want this to double check that the samples are indeed ready.
As I understand the question, you want ids that have multiple reportables. Assuming you really mean name, then:
select name
from t
where reportable = 'X'
group by name
having count(*) >= 2;

SQL grouping with "Invalid use of group" error

I'll be upfront, this is a homework question, but I've been stuck on this one for hours and I just was looking for a push in the right direction. First I'll give you the relations and the hw question for background, then I'll explain my question:
Branch (BookCode, BranchNum, OnHand)
HW problem: List the BranchNum for all branches that have at least one book that has at least 10 copies on hand.
My question: I understand that I must take the SUM(OnHand) grouped by BookCode, but how do I then take that and group it by BranchNum? This is logically what I come up with and various versions:
select distinct BranchNum
from Inventory
where sum(OnHand) >= 10
group by BookCode;
but I keep getting an error that says "Invalid use of group function."
Could someone please explain what is wrong here?
UPDATE:
I understand now, I had to use the HAVING statement, the basic form is this:
select distinct (what you want to display)
from (table)
group by
having
Try this one.
SELECT BranchNum
FROM Inventory
GROUP BY BranchNum
HAVING SUM(OnHand) >= 10
You can also find Group By Clause with example here.
Although all comments in the question seem to be valid and add information they all seem to be missing why your query is not working. The reason is simple and is strictly related by the state/phase at which the sum is calculated.
The where clause is the first thing that will get executed. This means it will filter all rows at the beginning. Then the group by will come in effect and will merge all rows that are not specified in the clause and apply the aggregated functions (if any).
So if you try to add an aggregated function to the where clause you're trying to aggregate before data is being grouped by and even filtered. The having clause gets executed after the group by and allows you to filter the aggregated functions, as they have already been calculated.
That's why you can write HAVING SUM(OnHand) >= 10 and you can't write WHERE SUM(OnHand) >= 10.
Hope this helps!

What is an unbounded query?

Is an unbounded query a query without a WHERE param = value statement?
Apologies for the simplicity of this one.
An unbounded query is one where the search criteria is not particularly specific, and is thus likely to return a very large result set. A query without a WHERE clause would certainly fall into this category, but let's consider for a moment some other possibilities. Let's say we have tables as follows:
CREATE TABLE SALES_DATA
(ID_SALES_DATA NUMBER PRIMARY KEY,
TRANSACTION_DATE DATE NOT NULL
LOCATION NUMBER NOT NULL,
TOTAL_SALE_AMOUNT NUMBER NOT NULL,
...etc...);
CREATE TABLE LOCATION
(LOCATION NUMBER PRIMARY KEY,
DISTRICT NUMBER NOT NULL,
...etc...);
Suppose that we want to pull in a specific transaction, and we know the ID of the sale:
SELECT * FROM SALES_DATA WHERE ID_SALES_DATA = <whatever>
In this case the query is bounded, and we can guarantee it's going to pull in either one or zero rows.
Another example of a bounded query, but with a large result set would be the one produced when the director of district 23 says "I want to see the total sales for each store in my district for every day last year", which would be something like
SELECT LOCATION, TRUNC(TRANSACTION_DATE), SUM(TOTAL_SALE_AMOUNT)
FROM SALES_DATA S,
LOCATION L
WHERE S.TRANSACTION_DATE BETWEEN '01-JAN-2009' AND '31-DEC-2009' AND
L.LOCATION = S.LOCATION AND
L.DISTRICT = 23
GROUP BY LOCATION,
TRUNC(TRANSACTION_DATE)
ORDER BY LOCATION,
TRUNC(TRANSACTION_DATE)
In this case the query should return 365 (or fewer, if stores are not open every day) rows for each store in district 23. If there's 25 stores in the district it'll return 9125 rows or fewer.
On the other hand, let's say our VP of Sales wants some data. He/she/it isn't quite certain what's wanted, but he/she/it is pretty sure that whatever it is happened in the first six months of the year...not quite sure about which year...and not sure about the location, either - probably in district 23 (he/she/it has had a running feud with the individual who runs district 23 for the past 6 years, ever since that golf tournament where...well, never mind...but if a problem can be hung on the door of district 23's director so be it!)...and of course he/she/it wants all the details, and have it on his/her/its desk toot sweet! And thus we get a query that looks something like
SELECT L.DISTRICT, S.LOCATION, S.TRANSACTION_DATE,
S.something, S.something_else, S.some_more_stuff
FROM SALES_DATA S,
LOCATIONS L
WHERE EXTRACT(MONTH FROM S.TRANSACTION_DATE) <= 6 AND
L.LOCATION = S.LOCATION
ORDER BY L.DISTRICT,
S.LOCATION
This is an example of an unbounded query. How many rows will it return? Good question - that depends on how business conditions were, how many location were open, how many days there were in February, etc.
Put more simply, if you can look at a query and have a pretty good idea of how many rows it's going to return (even though that number might be relatively large) the query is bounded. If you can't, it's unbounded.
Share and enjoy.
http://hibernatingrhinos.com/Products/EFProf/learn#UnboundedResultSet
An unbounded result set is where a query is performed and does not explicitly limit the number of returned results from a query. Usually, this means that the application assumes that a query will always return only a few records. That works well in development and in testing, but it is a time bomb waiting to explode in production.
The query may suddenly start returning thousands upon thousands of rows, and in some cases, it may return millions of rows. This leads to more load on the database server, the application server, and the network. In many cases, it can grind the entire system to a halt, usually ending with the application servers crashing with out of memory errors.
Here is one example of a query that will trigger the unbounded result set warning:
var query = from post in blogDataContext.Posts
where post.Category == "Performance"
select post;
If the performance category has many posts, we are going to load all of them, which is probably not what was intended. This can be fixed fairly easily by using pagination by utilizing the Take() method:
var query = (from post in blogDataContext.Posts
where post.Category == "Performance"
select post)
.Take(15);
Now we are assured that we only need to handle a predictable, small result set, and if we need to work with all of them, we can page through the records as needed. Paging is implemented using the Skip() method, which instructs Entity Framework to skip (at the database level) N number of records before taking the next page.
But there is another common occurrence of the unbounded result set problem from directly traversing the object graph, as in the following example:
var post = postRepository.Get(id);
foreach (var comment in post.Comments)
{
// do something interesting with the comment
}
Here, again, we are loading the entire set without regard for how big the result set may be. Entity Framework does not provide a good way of paging through a collection when traversing the object graph. It is recommended that you would issue a separate and explicit query for the contents of the collection, which will allow you to page through that collection without loading too much data into memory.