I have a simple query such as this
select duration, host from Jobs
group by host;
I want it to actually group by a pool of hosts, which is something that needs to be defined at query time.
For example, host01-10 would be pool1, host11-20 would be pool2, etc.
At the moment there isn't a field that says which pool a host is in, so it needs to be derived from the host field.
How do I achieve that? I want to be able to create some sort of function on the fly to manipulate the field so that it is groupable, for example:
def get_pool(host):
    # host01-10 belong to pool1, so the upper bound is inclusive
    if get_hostnumber(host) <= 10:
        return 'pool1'
    elif ...:
        ...
select duration, get_pool(host) from Jobs
group by get_pool(host);
In SQL, you don't need a function for this. I would suggest just using a case expression:
select (case when host <= 'host10' then 'pool1'
when host <= 'host20' then 'pool2'
. . .
end) as hostgrp, sum(duration) as duration
from jobs
group by (case when host <= 'host10' then 'pool1'
when host <= 'host20' then 'pool2'
. . .
end);
For your particular example, you could get away with:
select 'pool' || floor( (cast(substr(host, 5, 6) as number) + 9) / 10),
       sum(duration) as duration
from jobs
group by 'pool' || floor( (cast(substr(host, 5, 6) as number) + 9) / 10);
And, lest I forget, if you have a permanent mapping between hosts and their groups, then you should put a hosts reference table in the database with a second column for the group. Then this query would simply use a join, and any other query you write would have the same information available.
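A minimal sketch of the reference-table idea, run against an in-memory SQLite database from Python rather than the asker's actual database. The `host_pools` table, its columns, and the sample rows are hypothetical names and data invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (host TEXT, duration INTEGER);
    -- hypothetical reference table: one row per host, with its pool
    CREATE TABLE host_pools (host TEXT PRIMARY KEY, pool TEXT);
    INSERT INTO jobs VALUES
        ('host01', 5), ('host10', 7), ('host11', 3), ('host20', 4);
    INSERT INTO host_pools VALUES
        ('host01', 'pool1'), ('host10', 'pool1'),
        ('host11', 'pool2'), ('host20', 'pool2');
""")

# The grouping key now comes from the join instead of being derived from the host name.
totals = conn.execute("""
    SELECT hp.pool, SUM(j.duration) AS duration
    FROM jobs j
    JOIN host_pools hp ON hp.host = j.host
    GROUP BY hp.pool
    ORDER BY hp.pool
""").fetchall()

print(totals)
```

With this layout, changing which pool a host belongs to is a data update rather than a query change.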
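You can use a case expression in both the select list and the group by clause (assuming here that host holds just the numeric part of the host name):
select sum(duration) as duration, (case when host <= 10 then 'pool1' when host between 11 and 20 then 'pool2' end) as pool
from Jobs
group by (case when host <= 10 then 'pool1' when host between 11 and 20 then 'pool2' end);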
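The two ways of deriving a pool discussed above (a case expression on the host name, and arithmetic on its numeric suffix) can be compared side by side. This is only a sketch using SQLite through Python, with made-up sample rows; it assumes host names of the form `hostNN` with a zero-padded number, and it uses SQLite's integer division in place of `floor`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (host TEXT, duration INTEGER);
    INSERT INTO jobs VALUES
        ('host01', 5), ('host09', 1), ('host10', 7),
        ('host11', 3), ('host20', 4);
""")

# Variant 1: case expression using string comparison on zero-padded host names.
case_totals = conn.execute("""
    SELECT CASE WHEN host <= 'host10' THEN 'pool1'
                WHEN host <= 'host20' THEN 'pool2'
           END AS pool,
           SUM(duration) AS duration
    FROM jobs
    GROUP BY 1
    ORDER BY 1
""").fetchall()

# Variant 2: arithmetic on the numeric suffix; (n + 9) / 10 maps 1-10 to 1, 11-20 to 2.
arith_totals = conn.execute("""
    SELECT 'pool' || ((CAST(substr(host, 5) AS INTEGER) + 9) / 10) AS pool,
           SUM(duration) AS duration
    FROM jobs
    GROUP BY 1
    ORDER BY 1
""").fetchall()

print(case_totals)
print(arith_totals)
```

Both variants put host10 in pool1 and host11 in pool2, which matches the pools described in the question.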
I am using PSQL.
I have a table with a few columns, one column is event that can have 4 different values - X1, X2, Y1, Y2. I have another column that is the name of the service and I want to group by using this column.
My goal is to write a query that, for each service name, verifies that count(X1) == count(X2) and, if not, displays a new column containing "error".
Is this even possible? I am kinda new to SQL and not sure how to write this.
So far I tried something like this
select
service_name, event, count(service_name)
from
service_table st
group by
(service_name, event);
I am getting the count of each event for a specific service_name, but I would like to verify that the count of event 1 == the count of event 2 for each service_name.
I want to add that each service_name can only have 2 different events.
You may not need a subquery/CTE for this, but it will work (and makes the logic easier to follow):
WITH event_counts_by_service AS (SELECT
service_name
, COUNT(CASE WHEN event='X1' THEN 1 END) AS count_x1
, COUNT(CASE WHEN event='X2' THEN 1 END) AS count_x2
FROM service_table
GROUP BY service_name)
SELECT service_name
, CASE WHEN count_x1=count_x2 THEN NULL ELSE 'Error' END AS are_counts_equal
FROM event_counts_by_service
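As the answer notes, the CTE is optional. Below is a sketch of the same check written as a single grouped query, run on SQLite from Python instead of PostgreSQL; the service names and rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE service_table (service_name TEXT, event TEXT);
    INSERT INTO service_table VALUES
        ('billing', 'X1'), ('billing', 'X2'),
        ('search', 'X1'), ('search', 'X1'), ('search', 'X2');
""")

# COUNT ignores NULLs, so each CASE counts only the rows for one event type.
rows = conn.execute("""
    SELECT service_name,
           COUNT(CASE WHEN event = 'X1' THEN 1 END) AS count_x1,
           COUNT(CASE WHEN event = 'X2' THEN 1 END) AS count_x2,
           CASE WHEN COUNT(CASE WHEN event = 'X1' THEN 1 END)
                   = COUNT(CASE WHEN event = 'X2' THEN 1 END)
                THEN NULL ELSE 'Error' END AS status
    FROM service_table
    GROUP BY service_name
    ORDER BY service_name
""").fetchall()

for row in rows:
    print(row)
```

The CTE version avoids repeating the two count expressions, which is the readability benefit the answer mentions.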
My application is used with different database instances. A particular query executes in 1 second on all database instances except one, where it takes more than 30 minutes, although the data volume is almost the same. What can be the reason? My database is Oracle 11g.
Here is the query
SELECT b.VC_CUSTOMER_NAME customer,
TO_CHAR( sum(c.INV_VALUE), '999,999,999,999') value,
ROUND(
(SUM (c.inv_value) / (SELECT SUM (c.inv_value)
FROM mks_mst_customer b,
sls_temp_invoice_ticket c,
sls_dt_invoice_ticket d
WHERE c.vc_comp_code = b.vc_comp_code
AND b.vc_comp_code = '01'
AND INV_LABEL LIKE 'COLLECT FROM CUSTOMER%'
AND d.vc_ticket_no=c.vc_ticket_no
AND d.dt_invoice_date BETWEEN '01-Dec-2021' AND '07-Dec-2021'
AND b.nu_account_code=c.nu_account_code)
)* 100
) PERCENT
FROM mks_mst_customer b,
sls_temp_invoice_ticket c,
sls_dt_invoice_ticket d
WHERE c.vc_comp_code = b.vc_comp_code
AND b.vc_comp_code = '01'
AND INV_LABEL like 'COLLECT FROM CUSTOMER%'
AND b.nu_account_code=c.nu_account_code
AND d.vc_ticket_no=c.vc_ticket_no
AND d.dt_invoice_date BETWEEN '01-Dec-2021' AND '07-Dec-2021'
GROUP BY b.VC_CUSTOMER_NAME
ORDER BY SUM(c.INV_VALUE) DESC
The most obvious step would be to check the indexes; on the slow instance they might not be configured.
A slightly more demanding step would be to get the statistics.
Bit of an SQL newbie question..
If I have a table along the lines of the following :
host    fault                                                    fatal  groupname
Host A  Data smells iffy                                         n      research
Host B  Flanklecrumpet needs a cuddle                            y      production
Host A  RAM loves EWE                                            n      research
Host Z  One of the crossbeams gone askew on the treadle          y      research
.. and I want to get some stats, I can..
select count(distinct host) as hosts, count(host) as faults, groupname from tablename group by groupname
.. which gives me the number of faults and affected hosts per groupname.
hosts  faults  groupname
2      3       research
1      1       production
Can I, in the same query, show the number of fatal entries?
I would use aggregation, but in Postgres would phrase this as:
select groupname, count(distinct host) as hosts,
count(*) as num_faults,
       count(*) filter (where fatal = 'y') as num_fatal
from t
group by groupname;
Use conditional aggregation:
select count(distinct host) as hosts,
       count(host) as faults,
       sum(case when fatal = 'y' then 1 else 0 end) as numberofentry,
       groupname
from tablename
group by groupname;
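To check the conditional aggregation against the sample table from the question, here is a small sketch using SQLite from Python. The table name `tablename` and the four rows come from the question; the alias `num_fatal` follows the first answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (host TEXT, fault TEXT, fatal TEXT, groupname TEXT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?)",
    [
        ("Host A", "Data smells iffy", "n", "research"),
        ("Host B", "Flanklecrumpet needs a cuddle", "y", "production"),
        ("Host A", "RAM loves EWE", "n", "research"),
        ("Host Z", "One of the crossbeams gone askew on the treadle", "y", "research"),
    ],
)

# SUM over a 0/1 CASE counts only the fatal rows within each group.
stats = conn.execute("""
    SELECT groupname,
           COUNT(DISTINCT host) AS hosts,
           COUNT(*) AS faults,
           SUM(CASE WHEN fatal = 'y' THEN 1 ELSE 0 END) AS num_fatal
    FROM tablename
    GROUP BY groupname
    ORDER BY groupname
""").fetchall()

for row in stats:
    print(row)
```

Note that the comparison is case-sensitive here, so the literal has to match the lowercase 'y' stored in the fatal column.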
I am using PostgreSQL on Amazon Redshift.
My table is :
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2018-01-26 00:39:51','2018-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34') --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I need to compare the end time of each record with the start time of the next record and, if the time difference is < 10 seconds, carry on to the next record's end time, and so on until the last record.
I.e. datediff(seconds, 2018-01-26 00:39:55, 2018-01-26 00:39:56) is < 10 seconds.
I tried this :
SELECT a.app_nm
,min(a.start)
,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1
It works, but it doesn't return row y when the condition fails.
There are two reasons that row y is not returned:
The condition b.start > a.start means that a row will never join with itself.
The GROUP BY returns only one record per APP_nm value, and all rows have the same value.
However, there are further logic errors that the query does not handle. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of the DISTINCT ON clause, which returns one row per distinct value of the specified column(s). However, DISTINCT ON is not supported by Redshift.
Some potential workarounds: DISTINCT ON like functionality for Redshift
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.
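The "loop through the rows and remember the start time" approach described above could look roughly like this in plain Python. It is a sketch, not the answerer's script: the function name `merge_sessions` is hypothetical, the rows are assumed to have already been extracted from the table as strings, and the first row's year is assumed to be 2016 so that it matches the expected output in the question.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
MAX_GAP = timedelta(seconds=10)


def merge_sessions(rows, max_gap=MAX_GAP):
    """Merge (app, start, end) rows whose gap to the previous row is below max_gap."""
    sessions = []
    # Sorting by (app, start) works because the timestamps are in ISO order.
    for app, start, end in sorted(rows):
        start_dt = datetime.strptime(start, FMT)
        end_dt = datetime.strptime(end, FMT)
        last = sessions[-1] if sessions else None
        if last and last[0] == app and start_dt - last[2] < max_gap:
            # Same session: keep the remembered start and extend the end time.
            last[2] = max(last[2], end_dt)
        else:
            # Gap too large (or a different app): start a new session.
            sessions.append([app, start_dt, end_dt])
    return [(a, s.strftime(FMT), e.strftime(FMT)) for a, s, e in sessions]


rows = [
    ("AFH", "2016-01-26 00:39:51", "2016-01-26 00:39:55"),  # year assumed to be 2016
    ("AFH", "2016-01-26 00:39:56", "2016-01-26 00:40:01"),
    ("AFH", "2016-01-26 00:40:05", "2016-01-26 00:40:11"),
    ("AFH", "2016-01-26 00:40:12", "2016-01-26 00:40:15"),  # row x
    ("AFH", "2016-01-26 00:40:35", "2016-01-26 00:41:34"),  # row y
]

merged = merge_sessions(rows)
for session in merged:
    print(session)
```

Row y starts 20 seconds after row x ends, so it opens a second session instead of being dropped.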
Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using window functions.
The complete solution might look like this:
SELECT
    start AS session_start,
    session_end
FROM (
    SELECT
        start,
        end1,
        lead(end1, 1) OVER (ORDER BY end1) AS session_end,
        session_boundary
    FROM (
        SELECT
            start,
            end1,
            CASE WHEN session_switch = 0 AND reverse_session_switch = 1
                THEN 'start'
                ELSE 'end' END AS session_boundary
        FROM (
            SELECT
                start,
                end1,
                CASE WHEN datediff(seconds, end1, lead(start, 1) OVER (ORDER BY end1 ASC)) > 10
                    THEN 1
                    ELSE 0 END AS session_switch,
                CASE WHEN datediff(seconds, lead(end1, 1) OVER (ORDER BY end1 DESC), start) > 10
                    THEN 1
                    ELSE 0 END AS reverse_session_switch
            FROM app_tax
        ) AS sessioned
        WHERE session_switch != 0 OR reverse_session_switch != 0
        UNION
        SELECT
            start,
            end1,
            'start'
        FROM (
            SELECT
                start,
                end1,
                row_number() OVER (PARTITION BY APP_nm ORDER BY end1 ASC) AS row_num
            FROM APP_Tax
        ) AS with_row_number
        WHERE row_num = 1
    ) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC
;
Here is the breakdown (by subquery name):
sessioned - we first identify the switch rows (out and in), i.e. the rows where the gap between one row's end and the next row's start exceeds the limit.
with_row_number - just a patch to extract the first row, because there is no switch into it (there is an implicit switch that we record as 'start').
with_boundary - then we identify the rows where specific switches occur. If you run the subquery by itself, it is clear that a session starts when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions, so they are ignored.
with_end - finally, we pair each 'start' row with the end time of the row that follows it (thus defining the session duration), and remove the 'end' rows.
The with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result, which is the session duration.
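For comparison, here is a sketch of a different, more compact window-function formulation (often called "gaps and islands"), not the query above: flag each row that starts a new session, take a running sum of the flags as a session id, then aggregate per session. It runs on SQLite (3.25 or later, for window functions) from Python rather than on Redshift, so `strftime('%s', ...)` stands in for `datediff`. The column names `start_ts` and `end_ts` are renamed for illustration, and the first row's year is assumed to be 2016.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_tax (app_nm TEXT, start_ts TEXT, end_ts TEXT)")
conn.executemany(
    "INSERT INTO app_tax VALUES (?, ?, ?)",
    [
        ("AFH", "2016-01-26 00:39:51", "2016-01-26 00:39:55"),  # year assumed to be 2016
        ("AFH", "2016-01-26 00:39:56", "2016-01-26 00:40:01"),
        ("AFH", "2016-01-26 00:40:05", "2016-01-26 00:40:11"),
        ("AFH", "2016-01-26 00:40:12", "2016-01-26 00:40:15"),  # row x
        ("AFH", "2016-01-26 00:40:35", "2016-01-26 00:41:34"),  # row y
    ],
)

sessions = conn.execute("""
    WITH flagged AS (
        -- 1 when the gap to the previous row's end is >= 10 seconds, or there is
        -- no previous row (the NULL comparison falls through to ELSE)
        SELECT app_nm, start_ts, end_ts,
               CASE WHEN CAST(strftime('%s', start_ts) AS INTEGER)
                       - CAST(strftime('%s', LAG(end_ts) OVER (
                             PARTITION BY app_nm ORDER BY start_ts)) AS INTEGER) < 10
                    THEN 0 ELSE 1 END AS is_new_session
        FROM app_tax
    ),
    numbered AS (
        -- the running sum of the flags gives every row of a session the same id
        SELECT app_nm, start_ts, end_ts,
               SUM(is_new_session) OVER (
                   PARTITION BY app_nm ORDER BY start_ts) AS session_id
        FROM flagged
    )
    SELECT app_nm, MIN(start_ts), MAX(end_ts)
    FROM numbered
    GROUP BY app_nm, session_id
    ORDER BY app_nm, MIN(start_ts)
""").fetchall()

for session in sessions:
    print(session)
```

Because every row gets a session id, the last session and single-row sessions keep their own end time without a special case for the first row.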
I am trying to write a query for
"What are the names of the producers with at least 2 properties with areas less than 10"
I have made the following query that seems to work:
select Producers.name
from Producers
where (
select count(Properties.prop_id)
from Properties
where Properties.area < 10 and Properties.owner = Producers.nif
) >= 2;
Yet my lecturer was not very happy about it. He even thought (or at least gave me the impression) that this kind of query wouldn't be valid in Oracle.
How should one write this query, then? (I have at the moment no way of getting to speak with him, by the way.)
Here are the tables:
Producer (nif (pk), name, ...)
Property (area, owner (fk to producer), area, ... )
The having clause is typically used to filter on aggregate data (like counts, sums, max, etc).
select
producers.name,
count(*)
from
producers,
property
where
producers.nif = property.owner and
property.area < 10
group by
producers.name
having
count(*) >= 2
select p.name
from Producers p, Properties pr
where p.nif = pr.owner
AND pr.area < 10
GROUP BY p.name
having Count(*) >= 2
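A small sketch comparing the join-plus-HAVING form with the correlated-subquery form from the question, run on SQLite from Python rather than Oracle. The producer names, ids and areas are invented sample data, and the table and column names follow the question's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Producers (nif INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Properties (
        prop_id INTEGER PRIMARY KEY,
        owner INTEGER REFERENCES Producers(nif),
        area REAL
    );
    INSERT INTO Producers VALUES (1, 'Ana'), (2, 'Bruno'), (3, 'Carla');
    INSERT INTO Properties VALUES
        (10, 1, 5), (11, 1, 8), (12, 1, 40),
        (20, 2, 9), (21, 2, 15),
        (30, 3, 50);
""")

# Join + HAVING: filter rows first (area < 10), then filter groups by their count.
# Grouping by nif as well keeps two producers with the same name apart.
having_names = [name for (name,) in conn.execute("""
    SELECT p.name
    FROM Producers p
    JOIN Properties pr ON pr.owner = p.nif
    WHERE pr.area < 10
    GROUP BY p.nif, p.name
    HAVING COUNT(*) >= 2
    ORDER BY p.name
""")]

# Correlated subquery, as in the question, with the inner condition tied to Producers.nif.
subquery_names = [name for (name,) in conn.execute("""
    SELECT p.name
    FROM Producers p
    WHERE (SELECT COUNT(pr.prop_id)
           FROM Properties pr
           WHERE pr.area < 10 AND pr.owner = p.nif) >= 2
    ORDER BY p.name
""")]

print(having_names, subquery_names)
```

On this data both forms return the same producer, so the choice between them is mostly a matter of style and of what the course expects.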