This is hard to explain, but say I have this query:
SELECT *
FROM "late_fee_tiers"
And it returns this:
I have a validation in code set up to prevent duplicate days from being saved (notice there are 2 rows of days = 2).
I want my query to double-check there are only unique rows of day, and if there are multiple, select the first one (so it should return 3 rows with 2,3,5).
My first thought is to use GROUP BY day, while selecting a MIN("id").
The problem is, I don't understand SQL enough, because it forces me to add different aggregator functions to every single column... but what if I don't want to do that? I want THAT row to be "chosen" according to the single aggregator function I define, I don't need multiple aggregators creating some weird hybrid row. I just want the MIN() function to choose that 1 row and fill in all the rest of the values for that row.
What function do I use to do this, or how would I do it?
Thanks
You want to use DISTINCT ON:
select distinct on (day) *
from "late_fee_tiers"
order by day, id;
Why day is also required in the order by:
From the official documentation:
The DISTINCT ON expression(s) must match the leftmost ORDER BY
expression(s). The ORDER BY clause will normally contain additional
expression(s) that determine the desired precedence of rows within
each DISTINCT ON group.
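DISTINCT ON is a PostgreSQL extension, so here is a rough sketch of the same "first row per day" idea in portable SQL, run through Python's built-in sqlite3 module. It joins each row against the smallest id for its day. The sample rows and the fee column are made up for illustration; only the table name, day, and id come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE late_fee_tiers (id INTEGER PRIMARY KEY, day INTEGER, fee REAL)"
)
# Hypothetical data: day = 2 appears twice, as in the question.
conn.executemany(
    "INSERT INTO late_fee_tiers (id, day, fee) VALUES (?, ?, ?)",
    [(1, 2, 5.0), (2, 2, 7.5), (3, 3, 10.0), (4, 5, 20.0)],
)

# The subquery finds the lowest id per day; the join then pulls in the
# complete row for that id, so no per-column aggregates are needed.
rows = conn.execute(
    """
    SELECT t.id, t.day, t.fee
    FROM late_fee_tiers AS t
    JOIN (SELECT day, MIN(id) AS first_id
          FROM late_fee_tiers
          GROUP BY day) AS firsts
      ON firsts.first_id = t.id
    ORDER BY t.day
    """
).fetchall()
print(rows)
```

Each returned row is a whole original row, not a mix of aggregated values, which is what the question was asking for.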
Related
We are working on converting Informatica mappings to Google BigQuery SQL. In one of the mappings, there are a couple of ports/columns, say A and B, which are neither grouped by in the Aggregator transformation nor given any aggregation function such as sum or avg.
According to senior devs in my org, in Informatica we will get the last values of these ports/columns as the result of the aggregator. My question is, how do we reproduce this behaviour in BigQuery SQL? We cannot use columns in the select statement that are not present in the GROUP BY clause, and we don't want to group by these columns.
For getting last value of the column, we have LAST_VALUE() analytic function in bigquery, but even then we cannot use the group by and analytic function in same select statement.
I would really appreciate some help!
Use some aggregation function.
In Informatica you will get the LAST value. This is not deterministic. It basically means that either
you have the same values across the whole column,
you don't care which one you get, or
you have a specific order, on which the last value is taken.
The first two cases mean you can use MIN / MAX / whatever. The result will be the same or you don't care.
If the last one is your case, ARRAY_AGG should help you, as per this answer.
To convert an Informatica mapping with an aggregator to BigQuery SQL, I would use row_number() over (partition by id order by id) as rn and then filter on rn = 1 in the outer query.
Informatica aggregator - id is group by column.
Equivalent SQL should look like this -
select a, b, id
from
  (select a, b, id,
          -- Mimics the Informatica aggregator; id is the group-by port.
          -- If a sorter precedes the aggregator, list its ports here in the same sequence but with the direction reversed (asc/desc).
          row_number() over (partition by id order by id desc) as rn
   from mytable) rs
where rs.rn = 1 -- keeps only the latest row per id
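As a runnable sketch of the row_number approach, the snippet below uses Python's sqlite3 module (window functions need SQLite 3.25 or newer). Ordering by the partition column alone leaves the choice within a group arbitrary, so this example assumes a hypothetical load_seq column that records arrival order; the sample data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, a TEXT, b TEXT, load_seq INTEGER)"
)
# Hypothetical rows; load_seq stands in for whatever defines "last".
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [(1, "a1", "b1", 1), (1, "a2", "b2", 2), (2, "a3", "b3", 1)],
)

# Number the rows of each id from newest to oldest, then keep number 1.
latest = conn.execute(
    """
    SELECT id, a, b
    FROM (SELECT id, a, b,
                 ROW_NUMBER() OVER (PARTITION BY id
                                    ORDER BY load_seq DESC) AS rn
          FROM mytable) AS rs
    WHERE rs.rn = 1
    ORDER BY id
    """
).fetchall()
print(latest)
```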
I am using an Oracle DB and have an aggregate script. We found that some of the rows in the table are repeated and unwanted, and hence should not be added to the sum.
Now suppose I use DISTINCT just after SELECT: will DISTINCT be applied before the aggregation or after it?
If you use SELECT DISTINCT, then the result set will have no duplicate rows.
If you use SELECT COUNT(DISTINCT), then the count will only count distinct values.
If you are thinking of using SUM(DISTINCT) (or DISTINCT with any other aggregation function) be warned. I have never used it (except perhaps as a demonstration), and I have written a fair number of queries.
You really need to solve the problem at the source. For instance, if accounts are being repeated, then SUM(DISTINCT) does not distinguish between accounts, only by the values assigned to the account. You need to get the logic right.
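A small sketch of why SUM(DISTINCT) can mislead, using Python's sqlite3 module. The payments table, its columns, and the amounts are all hypothetical: account 1 has an unwanted duplicate row, while account 2 legitimately has the same amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (account_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(1, 100), (1, 100), (2, 100), (3, 50)],  # second row is the duplicate
)

# Plain SUM counts the duplicate row twice.
plain_sum = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]

# SUM(DISTINCT) collapses equal amounts, even across different accounts.
distinct_sum = conn.execute(
    "SELECT SUM(DISTINCT amount) FROM payments"
).fetchone()[0]

# Removing duplicate rows first, then summing, keeps account 2's amount.
dedup_sum = conn.execute(
    """
    SELECT SUM(amount)
    FROM (SELECT DISTINCT account_id, amount FROM payments)
    """
).fetchone()[0]

print(plain_sum, distinct_sum, dedup_sum)
```

Only the third query matches the intent here, and even that depends on knowing which columns define a duplicate.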
When you say that you have repeated rows, you must have a clear idea of uniqueness for the combination of some specific columns.
If you expect that certain column combinations are unique within specified groups, you can detect the groups deviating from that using queries following the pattern below.
select <your group by columns>
from <your table name>
group by <your group by columns>
having (max(A)!=min(A) or max(B)!=min(B) or max(C)!=min(C))
Then you have to decide what to do with the problem. I would suggest cleaning up and adding unique constraints to the table.
The aggregate query you mention would run successfully for the rows in your table not having duplicate values for the combination of columns that needs to be unique. Using my example you could get the aggregates for that part of your data using the inverted having predicate.
It would be something like this
select <your aggregate functions, counts, sums, averages and so on>
from <your table name>
group by <your group by columns>
having (max(A)=min(A) and max(B)=min(B) and max(C)=min(C))
If you must include the groups breaking uniqueness expectations you must somehow do a qualified selection of which of the variants in the group to use - you could for example go for the last one or the first one if one of your columns should happen to express something about when the row was created.
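The two HAVING patterns above can be tried out with a quick sketch in Python's sqlite3 module. The orders table, its columns, and the rows are assumptions made up for this example: region is expected to be constant per customer, and customer 2 breaks that expectation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, region TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "north", 10), (1, "north", 20), (2, "east", 5), (2, "west", 7)],
)

# Groups where the supposedly constant column takes more than one value.
deviating = conn.execute(
    """
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING MAX(region) != MIN(region)
    """
).fetchall()

# Aggregates restricted to the groups that meet the expectation.
clean_totals = conn.execute(
    """
    SELECT customer_id, SUM(amount)
    FROM orders
    GROUP BY customer_id
    HAVING MAX(region) = MIN(region)
    """
).fetchall()

print(deviating, clean_totals)
```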
Is "TOP 1" a reliable substitute for aggregate functions such as MIN() and MAX()..? Shown below is a basic query in Access 2007, to determine the first time a customer ordered a certain product during a certain month. The DBMS system is probably irrelevant, since this question could apply to any system.
In this query, "TOP 1" is used in combination with ORDER BY on the date field. This returns one record, that being the oldest by date. But...is this ok..? What can go wrong..? Is there a better way..?
SELECT TOP 1 DAACCT, DAITEM, DAQTY, DAIDAT
FROM fqlOrdersGrandHistory
WHERE DAACCT="T7414" AND DAITEM="45234" AND (DAIDAT>=20170501 AND DAIDAT<=20170531)
ORDER BY DAIDAT;
Reliable is a subjective term. Yes, TOP 1 will give you a result, as will MAX() or MIN(). It depends on what you are after.
If you look for a specific user only (as you appear to be in this case) and sort by DATE ascending and use TOP 1, you will get all of the details for that one record. However, if you are looking for the first purchase of every user in the table, then TOP 1 will only give you the info for the very first person who made an order.
On the other hand, if you use SELECT DAACCT, MIN(DAIDAT) FROM table GROUP BY DAACCT then you will get the earliest purchase date for each user. This assumes you are storing DAIDAT as a date with a time component, not just the date value itself. If you store only the date, several records can share that earliest value.
TL;DR: If you stick with the concept of the query 1) looking for a very specific user for 2) a very specific product and 3) your dates are stored as proper dates, TOP 1 will be better to use than an aggregate function. If any one of these three conditions is not met, reevaluate.
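To make the difference concrete, here is a minimal sketch in Python's sqlite3 module. SQLite spells TOP 1 as LIMIT 1; the table name orders and the sample rows are invented, and only the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (DAACCT TEXT, DAITEM TEXT, DAQTY INTEGER, DAIDAT INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("T7414", "45234", 3, 20170512),
        ("T7414", "45234", 1, 20170503),
        ("T9000", "45234", 2, 20170501),
    ],
)

# One specific account: sort by date and keep the first full record.
first_for_one = conn.execute(
    """
    SELECT DAACCT, DAITEM, DAQTY, DAIDAT
    FROM orders
    WHERE DAACCT = 'T7414' AND DAITEM = '45234'
    ORDER BY DAIDAT
    LIMIT 1
    """
).fetchone()

# Every account: an aggregate with GROUP BY gives one date per account,
# but not the other columns of that record.
first_per_account = conn.execute(
    """
    SELECT DAACCT, MIN(DAIDAT)
    FROM orders
    GROUP BY DAACCT
    ORDER BY DAACCT
    """
).fetchall()

print(first_for_one, first_per_account)
```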
TOP is used to limit the fetched rows, and yes, it's fine unless you have multiple records with the same data. I'm not sure how it relates to Min() or Max(): you use the Min() or Max() aggregate functions when you are grouping rows with GROUP BY. Even if you don't specify a GROUP BY, grouping happens on the entire result set.
It is OK if either one field or a combination of fields of the selected fields is unique.
If not, the result set will contain all the records where the field or combination match. To avoid this, always include, say, an autonumber field in the selected fields.
I have a database (running on postgres 9.3) of bookings of resources. This database contains a table reservations which contains beside other values the start and stop time of the reservation (as timestamp with time zone)
Now I need to know how much reservation time a given company currently has booked in the future, in terms of the total hours of all these reservations added together.
I have put together the following query that does the job:
SELECT EXTRACT(EPOCH FROM Sum(stop-start))/3600 AS total
FROM (reservations JOIN partners ON partner = email)
WHERE stop > now() AND company = 'givencompany'
This works quite well if the given company has reservations in the future. The problem I am experiencing is that when the company doesn't have any reservations, the query does in fact return a row, but the column total is empty, whereas I would like it to return no row at all (or a row containing 0, if returning nothing is too complicated) in that case.
Is this possible to accomplish with a different SELECT or another modification to the database or does the consuming application have to check for null every time?
Sorry if my question is trivial but I am very new to databases altogether
Edit
I found out that I could default the returned value with 0 by using COALESCE but I would much prefer it if no row would be returned
Short answer: just add HAVING Sum(stop-start) IS NOT NULL at the end of query.
Long answer:
This query has no explicit GROUP BY, but since it aggregates the rows with sum(), it's implicitly turned into a GROUP BY query, with all the rows matching the WHERE condition taken as one group.
See the doc on SELECT :
without GROUP BY, an aggregate produces a single value computed across
all the selected rows
And about the HAVING clause:
The presence of HAVING turns a query into a grouped query even if
there is no GROUP BY clause. This is the same as what happens when the
query contains aggregate functions but no GROUP BY clause. All the
selected rows are considered to form a single group, and the SELECT
list and HAVING clause can only reference table columns from within
aggregate functions. Such a query will emit a single row if the HAVING
condition is true, zero rows if it is not true.
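The single-row behaviour described above is easy to reproduce. This sketch uses Python's sqlite3 module with a made-up, simplified reservations table (a plain hours column instead of start and stop timestamps). Older SQLite releases reject HAVING without GROUP BY, so the example filters the NULL in an outer query instead, which has the same effect as the HAVING clause suggested for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (company TEXT, hours REAL)")
conn.executemany(
    "INSERT INTO reservations VALUES (?, ?)",
    [("acme", 2.0), ("acme", 3.5)],
)

# An aggregate without GROUP BY always yields exactly one row,
# even when no input row matches; the sum is then NULL (None).
no_match = conn.execute(
    "SELECT SUM(hours) AS total FROM reservations WHERE company = ?",
    ("othercompany",),
).fetchall()

# Wrapping the aggregate and filtering out NULL gives zero rows instead.
filtered = conn.execute(
    """
    SELECT total
    FROM (SELECT SUM(hours) AS total
          FROM reservations
          WHERE company = ?)
    WHERE total IS NOT NULL
    """,
    ("othercompany",),
).fetchall()

print(no_match, filtered)
```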
I am using Jaspersoft's iReport to create a report that will pull data from my Maintenance Assistant CMMS database. The DB is on the localhost, and I am not creating any tables or columns. MA CMMS takes care of that. I only want to pull the data to arrange in a report.
Here is my code:
SELECT *
FROM "tblworkordertask"
WHERE "dbltimespenthours" > 0
AND "dtmdatecompleted" BETWEEN $P{DATE_FROM} AND $P{DATE_TO}
GROUP BY "intworkorderid"
and my error:
Caused by: java.sql.SQLSyntaxErrorException: Column reference 'tblWorkOrderTask.id' is invalid, or is part of an invalid expression. For a SELECT list with a GROUP BY, the columns and expressions being selected may only contain valid grouping expressions and valid aggregate expressions.
I don't know why the error is referring to 'tblWorkOrderTask.id' because I don't have such a column, nor did I ask for that column.
If I take out the group by clause, it works fine, but as you could expect, I get multiple results with the same WorkOrderID. I want to group it by this column, and then count the results. I tried using SELECT DISTINCT, but then I get errors about columns that aren't selected.
You're selecting all columns in the tblWorkOrderTask table with SELECT *. The "id" column is the first column in that table, so it is the first one the error reports. You are getting the error because not every selected column appears in the GROUP BY clause or inside an aggregate function.
This select would work, but I'm not sure what information you need out of your table.
SELECT id, intworkorderid
FROM tblWorkOrderTask
group by id, intworkorderid
http://www.w3schools.com/sql/sql_groupby.asp
Get rid of the GROUP BY clause -- if you're just trying to order the result, then use ORDER BY instead; but otherwise, you don't need either.
EDIT
As the error says, everything in your SELECT list must be one of two things: either 1) also listed in your GROUP BY list, or 2) an aggregated value. Here is a sample that will work:
SELECT intworkorderid, COUNT(*)
FROM "tblworkordertask"
WHERE "dbltimespenthours" > 0
AND "dtmdatecompleted" BETWEEN $P{DATE_FROM} AND $P{DATE_TO}
GROUP BY "intworkorderid"
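For a quick check of what that grouped query returns, here is a small sketch in Python's sqlite3 module. The rows are invented and the date filter is left out to keep it short; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tblworkordertask (
        id INTEGER PRIMARY KEY,
        intworkorderid INTEGER,
        dbltimespenthours REAL
    )
    """
)
# Hypothetical tasks: work order 200 has one task with no time spent.
conn.executemany(
    "INSERT INTO tblworkordertask VALUES (?, ?, ?)",
    [(1, 100, 1.5), (2, 100, 0.5), (3, 200, 2.0), (4, 200, 0.0)],
)

# Only the grouping column and an aggregate appear in the SELECT list.
counts = conn.execute(
    """
    SELECT intworkorderid, COUNT(*)
    FROM tblworkordertask
    WHERE dbltimespenthours > 0
    GROUP BY intworkorderid
    ORDER BY intworkorderid
    """
).fetchall()
print(counts)
```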
Yes - in order to use group by, you need to be specific in the select line.
So first, decide which fields you want to display. If you want them all, then include them all.
As soon as you add a COUNT() function to get a count of the selected fields, you will need to add the GROUP BY clause. COUNT() is an AGGREGATE function, like SUM() and AVG().
It's a little counter-intuitive and a bit of a pain to specify so many fields in the GROUP BY clause, but it's necessary.
The FIRST GROUP BY field is the most important, since this is usually what you are concerned about.
This first field can be any of the SELECTed fields, it is not necessarily the first.
Include EVERY field in your GROUP BY that is not an AGGREGATE function like COUNT().
Also, if you are trying to COUNT a group of orders, you probably don't want or need all of the fields in the SELECT.
You probably want to specify just the fields that are unique to the work order ID.
Example: If you want to get a COUNT of these fields, you would specify all of the SELECTED fields EXCEPT the COUNT().
SELECT
intWorkOrderID,
COUNT(id),
strDescription
FROM tblworkordertask
WHERE dbltimespenthours > 0
AND dtmdatecompleted BETWEEN $P{DATE_FROM} AND $P{DATE_TO}
GROUP BY
intworkorderid,
strDescription