My table looks like this
The goal is to count how many times the actuator_state of a specific actuator (from the column actuator_name) changes in a period of time. Keep in mind that a specific actuator has several sub-actuators (for instance, Heater has Heater0, Heater1, etc.), and the goal is to count the total number of changes across Heater0 + Heater1 + Heater2 + Heater3 and so on. (Also, the name of the table is state_actuator.)
I tried this:
SELECT actuator_nome AS NOME,
SUM (DISTINCT CASE WHEN actuator_state.actuator AND DISTINCT actuator_state.actuator_time AND DISTINCT actuator_state.actuator_state THEN 1 ELSE 0) AS TROCAS_ESTADO
FROM actuator_state WHERE actuator_time BETWEEN '2020-05-17 16:58:54' AND '2020-05-17 17:09:58' AND actuator_name='Heater'
The result should be
Heater: 5;
(for instance Heater0 has changed 3 times and Heater1 two times and other Heaters 0 changes)
You can use window functions for this:
select
actuator_name,
count(*) filter(where actuator_state <> lag_actuator_state) no_changes
from (
select
t.*,
lag(actuator_state)
over(partition by actuator_name, actuator order by actuator_time) lag_actuator_state
from mytable t
where actuator_time between '2020-05-17 16:58:54' and '2020-05-17 17:09:58'
) t
group by actuator_name
The subquery uses lag() to retrieve the "previous" state of each actuator. Then, the outer query aggregates by actuator_name and performs a count that increments by 1 every time two consecutive values are not equal.
You can add additional filters in the where clause of the subquery as needed.
Note that this query does not count the first value in the period as a change. Only further changes are taken into account.
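To make the behaviour concrete, here is a minimal sketch that runs the same lag() idea against an in-memory SQLite database from Python. The column layout and the sample rows are assumptions made up for illustration (the original table was not shown), and SUM(CASE ...) is swapped in for count(*) filter(...) only so that the sketch also runs on older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: actuator_name is the family, actuator is the individual unit.
conn.execute("""
    CREATE TABLE state_actuator (
        actuator_name  TEXT,
        actuator       TEXT,
        actuator_state INTEGER,
        actuator_time  TEXT
    )
""")
sample_rows = [
    ("Heater", "Heater0", 0, "2020-05-17 17:00:00"),
    ("Heater", "Heater0", 1, "2020-05-17 17:01:00"),
    ("Heater", "Heater0", 0, "2020-05-17 17:02:00"),
    ("Heater", "Heater0", 1, "2020-05-17 17:03:00"),  # Heater0: 3 changes
    ("Heater", "Heater1", 1, "2020-05-17 17:00:30"),
    ("Heater", "Heater1", 0, "2020-05-17 17:01:30"),
    ("Heater", "Heater1", 0, "2020-05-17 17:02:30"),
    ("Heater", "Heater1", 1, "2020-05-17 17:03:30"),  # Heater1: 2 changes
    ("Heater", "Heater2", 1, "2020-05-17 17:00:45"),
    ("Heater", "Heater2", 1, "2020-05-17 17:02:45"),  # Heater2: 0 changes
]
conn.executemany("INSERT INTO state_actuator VALUES (?, ?, ?, ?)", sample_rows)

result = conn.execute("""
    SELECT actuator_name,
           -- lag_state is NULL on each unit's first row, so that row adds 0
           SUM(CASE WHEN actuator_state <> lag_state THEN 1 ELSE 0 END) AS no_changes
    FROM (
        SELECT t.*,
               LAG(actuator_state) OVER (
                   PARTITION BY actuator_name, actuator
                   ORDER BY actuator_time
               ) AS lag_state
        FROM state_actuator t
        WHERE actuator_time BETWEEN '2020-05-17 16:58:54' AND '2020-05-17 17:09:58'
    ) t
    GROUP BY actuator_name
""").fetchall()

print(result)
```

With these made-up rows the per-unit changes (3 + 2 + 0) add up to the Heater: 5 the question asks for.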
You can use lag():
select actuator_name,
count(*) filter (where prev_as is distinct from actuator_state)
from (select sa.*,
lag(actuator_state) over (partition by actuator order by actuator_time) as prev_as
from state_actuator sa
) sa
where actuator_time between '2020-05-17 16:58:54' and '2020-05-17 17:09:58'
group by actuator_name;
You can filter on a particular name in the where clause as well.
Note that this counts the first appearance as a "change". It is not clear if that matches your intention.
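The difference between the two comparisons can be checked with a small sketch. It assumes an in-memory SQLite database and invented sample rows. SQLite's null-safe operator IS NOT is used here as a stand-in for IS DISTINCT FROM, which older SQLite versions do not accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced layout: one unit with four readings.
conn.execute(
    "CREATE TABLE state_actuator (actuator TEXT, actuator_state INTEGER, actuator_time TEXT)"
)
conn.executemany(
    "INSERT INTO state_actuator VALUES (?, ?, ?)",
    [
        ("Heater0", 0, "2020-05-17 17:00:00"),
        ("Heater0", 1, "2020-05-17 17:01:00"),
        ("Heater0", 1, "2020-05-17 17:02:00"),
        ("Heater0", 0, "2020-05-17 17:03:00"),
    ],
)

plain, null_safe = conn.execute("""
    SELECT
        -- '<>' yields NULL when prev_as is NULL, so the first row is skipped
        SUM(CASE WHEN prev_as <> actuator_state THEN 1 ELSE 0 END),
        -- 'IS NOT' treats NULL as a comparable value, so the first row counts
        SUM(CASE WHEN prev_as IS NOT actuator_state THEN 1 ELSE 0 END)
    FROM (
        SELECT actuator_state,
               LAG(actuator_state) OVER (
                   PARTITION BY actuator ORDER BY actuator_time
               ) AS prev_as
        FROM state_actuator
    ) sa
""").fetchone()

print(plain, null_safe)
```

The readings 0, 1, 1, 0 contain two real transitions; the null-safe comparison reports one more because the first row has no predecessor.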
Simplified example:
In hive, I have a table t with two columns:
Name, Value
Bob, 2
Betty, 4
Robb, 3
I want to do a case when that uses the total of the Value column:
Select
Name
, CASE
When value>0.5*sum(value) over () THEN '0'
When value>0.9*sum(value) over () THEN '1'
ELSE '2'
END as var
From table
I don't like the fact that sum(value) over () is computed twice. Is there a way to compute it only once? Added twist: I want to do this in one query, so without declaring user variables.
I was thinking of scalar queries:
With total as
(Select sum(value) from table)
Select
Name
, CASE
When value>0.5*(select * from total) THEN '0'
When value>0.9*(select * from total) THEN '1'
ELSE '2'
END as var
From table;
But this doesn’t work.
TLDR: Is there a way to simplify the first query without user variables?
Don't worry about that. Let the optimizer worry about it. But, you can use a subquery or CTE if you don't want to repeat the expression:
select Name,
(case when value > 0.5 * total then '0'
when value > 0.9 * total then '1'
else '2'
end) as var
From (select t.*, sum(value) over () as total
from table t
) t;
Cross join a subquery that fetches the sum to the table:
Select
t.Name
, CASE
When t.value>0.9*tt.value THEN '1'
When t.value>0.5*tt.value THEN '0'
ELSE '2'
END as var
From table t cross join (select sum(value) value from table) tt
and change the order of the WHEN clauses in the CASE expression because as they are, the 2nd case will never succeed.
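As a rough illustration of why the order of the WHEN branches matters, the sketch below evaluates both orderings side by side. It runs on SQLite rather than Hive, and the table name t and the sample values are assumptions chosen so that one row exceeds 90% of the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, value INTEGER)")
# Invented values: Betty holds 19 of a total of 20, i.e. more than 90%.
conn.executemany("INSERT INTO t VALUES (?, ?)", [("Bob", 1), ("Betty", 19), ("Robb", 0)])

rows = conn.execute("""
    SELECT t.name,
           -- Original order: anything above 90% is also above 50%,
           -- so the first branch always wins and '1' is unreachable.
           CASE WHEN t.value > 0.5 * tt.total THEN '0'
                WHEN t.value > 0.9 * tt.total THEN '1'
                ELSE '2'
           END AS original_order,
           -- Most restrictive test first.
           CASE WHEN t.value > 0.9 * tt.total THEN '1'
                WHEN t.value > 0.5 * tt.total THEN '0'
                ELSE '2'
           END AS fixed_order
    FROM t
    CROSS JOIN (SELECT SUM(value) AS total FROM t) tt  -- total is computed once
    ORDER BY t.name
""").fetchall()

for row in rows:
    print(row)
```

Only Betty's row differs between the two columns: the original order labels it '0', the reordered version '1'.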
Since I/O is the major factor that slows down Hive queries, we should strive to reduce the number of stages to get better performance.
So it's better not to use a sub-query or CTE here.
Try this SQL with a global window clause:
select
name,
case
when value > 0.5*sum(value) over w then '0'
when value > 0.9*sum(value) over w then '1'
else '2'
end as var
from my_table
window w as (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
In this case, the window clause is the recommended way to reduce code repetition.
Both the windowing and the sum aggregation will be computed only once. You can run explain select..., confirming that only ONE meaningful MR stage will be launched.
Edit:
1. A simple select clause on a subquery is not something to worry about. It can be pushed down to the last phase of the subquery, which avoids an additional MR stage.
2. Two identical aggregations residing in the same query block will only be evaluated once. So don’t worry about potential repeated calculation.
I am using the Google BigQuery public table bigquery-public-data.crypto_ethereum_classic.transactions as a reference.
For each transaction row, I want to calculate the count of all transactions to the same address that occurred before that transaction, and sum of the gas usage of them. I am sure I can do this with a join as I have tried and Google accepts my old query, but since there is so much data as a result of the (inner) join, there is almost always a "quota limit exceeded" error. At the same time, I think a subquery solution is inefficient, as it is querying almost the same thing in both aggregate functions.
In a perfect world the query would use something like a join to create a temporary table with all columns I need (transaction_hash, receipt_gas_used, to_address, block_timestamp), according to the conditions (where to_address = table_1.to_address and block_timestamp < table_1.block_timestamp), where I can then perform the aggregate functions on the columns of that table.
What I have so far and what I'm looking for is something like...:
SELECT
table_1.*,
COUNT(
DISTINCT IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `hash`, NULL)
) as txn_count,
SUM(
IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `receipt_gas_used`, NULL)
) as total_gas_used
from
`bigquery-public-data.crypto_ethereum_classic.transactions` as table_1
where block_number >= 3000000 and block_number <= 3500000 #just to subset the data a bit
I think you want window functions:
select t.*,
row_number() over (partition by to_address order by block_timestamp) as txn_seqnum,
sum(receipt_gas_used) over (partition by to_address order by block_timestamp) as total_gas_used
from `bigquery-public-data.crypto_ethereum_classic.transactions` as t
where block_number >= 3000000 and block_number <= 3500000 #just to subset the
If you really have ties and need the distinct, then use dense_rank() instead of row_number().
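If the current transaction should be left out of its own totals, as the question describes ("occurred before that transaction"), one possible variant is an explicit window frame that ends one row before the current row. The sketch below assumes a tiny invented transactions table in SQLite, not the real BigQuery dataset, and it numbers rows rather than comparing timestamps, so rows with identical timestamps would need an extra tie-breaker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical miniature of the transactions table; only the columns used here.
conn.execute("""
    CREATE TABLE transactions (
        hash TEXT, to_address TEXT, block_timestamp INTEGER, receipt_gas_used INTEGER
    )
""")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [("h1", "A", 1, 100), ("h2", "A", 2, 200), ("h3", "A", 3, 50), ("h4", "B", 1, 10)],
)

rows = conn.execute("""
    SELECT hash,
           COUNT(*) OVER w AS txn_count,
           -- SUM over an empty frame is NULL, so fall back to 0
           COALESCE(SUM(receipt_gas_used) OVER w, 0) AS total_gas_used
    FROM transactions
    WINDOW w AS (
        PARTITION BY to_address
        ORDER BY block_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING  -- strictly earlier rows
    )
    ORDER BY to_address, block_timestamp
""").fetchall()

for row in rows:
    print(row)
```

Each row then carries the count and gas total of the earlier transactions to the same address only, with zeros for the first transaction per address.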
I'm trying to write a query to return the town, and the number of runners from each town where the number of runners is greater than 5.
My query right now looks like this:
select hometown, count(hometown) from marathon2016 where count(hometown) > 5 group by hometown order by count(hometown) desc;
but sqlite3 responds with this:
Error: misuse of aggregate: count()
What am I doing wrong? Why can't I use count() here, and what should I use instead?
When you're trying to use an aggregate function (such as count) in a WHERE clause, you're usually looking for HAVING instead of WHERE:
select hometown, count(hometown)
from marathon2016
group by hometown
having count(*) > 5
order by count(*) desc
You can't use an aggregate in a WHERE clause because aggregates are computed across multiple rows (as specified by GROUP BY), whereas WHERE filters individual rows to determine which row set GROUP BY is applied to. In other words, WHERE happens before grouping, and aggregates apply after grouping.
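The two behaviours can be reproduced with a short sketch using Python's sqlite3 module. The table contents are invented for illustration, and the exact wording of the error message may differ between SQLite versions, so the sketch only checks that an error is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marathon2016 (hometown TEXT)")
# Invented data: 7, 6 and 2 runners from three towns.
runners = [("Ogdenville",)] * 7 + [("Springfield",)] * 6 + [("Shelbyville",)] * 2
conn.executemany("INSERT INTO marathon2016 VALUES (?)", runners)

# Aggregate in WHERE: rejected, because WHERE runs before rows are grouped.
where_failed = False
try:
    conn.execute(
        "SELECT hometown, count(hometown) FROM marathon2016 "
        "WHERE count(hometown) > 5 GROUP BY hometown"
    )
except sqlite3.Error as exc:
    where_failed = True
    print("WHERE version failed:", exc)

# Aggregate in HAVING: evaluated after grouping, so it filters whole groups.
result = conn.execute("""
    SELECT hometown, count(hometown)
    FROM marathon2016
    GROUP BY hometown
    HAVING count(*) > 5
    ORDER BY count(*) DESC
""").fetchall()

print(result)
```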
Try the following:
select
hometown,
count(hometown) as hometown_count
from
marathon2016
group by
hometown
having
hometown_count > 5
order by
hometown_count desc;
There may be a very simple way to do this, but I can't quite think of it. I have a dataset that returns a minimum job title and minimum effective date, then all effdts greater than the min_effdt. In order to use this data in a charting program, I would like to rank each successive effdt if it exists, as in Min Role Effdt, then 2nd, 3rd, Max. Of course, there could be anywhere from 2 to 20 jobs per person.
At first I considered trying a case statement, but I don't think that works when analyzing two columns at once. Is there a SQL statement that will allow ranking? Right now my data looks like
Employee Number | Min Base Role | Min Role Effdt | Base Role | Role Effdt
and comes from two tables, with the 2nd table brought in twice to get the Role / Effdt as Min, then All greater than Min.
I am using ORACLE. Code is below:
SELECT DISTINCT AL4.FULL_NAME,
AL4.EMPLOYEE_NUMBER,
AL4.HIRE_DATE,
AL4.DATE_OF_BIRTH,
AL4.AGE,
AL4.TERM_DATE,
AL4.ETHNIC_ORIGIN,
AL2.RECORDVALUE AS MIN_BASE_ROLE,
AL3.RECORDVALUE AS BASE_ROLE,
AL3.EFFECTIVE_START_DATE AS "ROLE EFFECTIVE DATE",
AL2.EFFECTIVE_START_DATE AS "MIN ROLE EFFDT"
FROM T1 AL2,
T2 AL3,
T3 AL4
WHERE AL4.PERSON_ID = AL2.PERSON_ID
AND AL4.PERSON_ID = AL3.PERSON_ID
AND AL4.EMPLOYEE_NUMBER = AL2.HISL_ID
AND AL4.EMPLOYEE_NUMBER = AL3.HISL_ID
AND AL2.RECORDTYPE = 'BASE_ROLE'
AND AL3.RECORDTYPE = 'BASE_ROLE'
AND AL2.EFFECTIVE_START_DATE = (SELECT MIN(A.EFFECTIVE_START_DATE) from T1 A where A.person_id = al2.person_id and a.recordtype = al2.recordtype)
AND AL3.EFFECTIVE_START_DATE > AL2.EFFECTIVE_START_DATE
AND (AL4.TERM_DATE >= '01-JAN-2012' or AL4.TERM_DATE is NULL)
order by AL4.EMPLOYEE_NUMBER
The function that you are looking for is row_number(). I think the expression you want is:
row_number() over (partition by AL4.EMPLOYEE_NUMBER
order by AL3.EFFECTIVE_START_DATE
) as ranking
The function row_number() says "assign a sequential number to a group of rows". The partition by clause defines the group, where the numbering starts over again at 1. The order by clause specifies the ordering within the group.
Similar functions rank() and dense_rank() might also be useful. They differ in how they handle duplicate values.
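A small sketch of how the three functions differ on duplicate values, run on SQLite from Python. The roles table, its columns and its rows are hypothetical and stand in for the joined Oracle result; two roles deliberately share the same effective date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical simplified table: one row per role held by an employee.
conn.execute("CREATE TABLE roles (employee_number INTEGER, base_role TEXT, effdt TEXT)")
conn.executemany(
    "INSERT INTO roles VALUES (?, ?, ?)",
    [
        (1, "Analyst", "2012-01-01"),
        (1, "Senior",  "2013-01-01"),
        (1, "Lead",    "2013-01-01"),  # same date as the previous row: a tie
        (1, "Manager", "2014-01-01"),
    ],
)

rows = conn.execute("""
    SELECT effdt,
           ROW_NUMBER() OVER w AS rn,   -- always 1, 2, 3, ... (ties broken arbitrarily)
           RANK()       OVER w AS rk,   -- ties share a number, then a gap follows
           DENSE_RANK() OVER w AS drk   -- ties share a number, no gap
    FROM roles
    WINDOW w AS (PARTITION BY employee_number ORDER BY effdt)
    ORDER BY effdt, rn
""").fetchall()

for row in rows:
    print(row)
```

On the tied date, row_number() still hands out 2 and 3, rank() gives 2 twice and then jumps to 4, and dense_rank() gives 2 twice and continues with 3.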
Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get started.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
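For a quick check of the windowed min() approach, here is a sketch that loads the sample data from the question into an in-memory SQLite database (the answer itself targets PostgreSQL). The table name t and the column names day, place and athlete follow the query above. The frame is written out in full as ROWS BETWEEN 3 PRECEDING AND CURRENT ROW, which is what the short form ROWS 3 PRECEDING defaults to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (day TEXT, athlete TEXT, place INTEGER)")
# Sample data from the question.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2013-06-22", "Johnson", 2),
        ("2013-06-21", "Johnson", 1),
        ("2013-06-20", "Johnson", 4),
        ("2013-06-19", "Johnson", 2),
        ("2013-06-18", "Johnson", 3),
        ("2013-06-17", "Johnson", 4),
        ("2013-06-16", "Johnson", 3),
        ("2013-06-15", "Johnson", 1),
    ],
)

rows = conn.execute("""
    SELECT day, athlete, place
    FROM (
        SELECT *,
               -- best place over the current row and the three before it
               MIN(place) OVER (
                   PARTITION BY athlete
                   ORDER BY day
                   ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
               ) AS best
        FROM t
    ) sub
    WHERE best > 1   -- no win anywhere in the frame
    ORDER BY day
""").fetchall()

for row in rows:
    print(row)
```

This returns the rows for 2013-06-19 and 2013-06-20, the two rows the question expects to match.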
@Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.