Count how many records have a below-average salary (Hive)

SELECT
*,
COUNT (AnnualSalary < avg(AnnualSalary)) AS Count
FROM Assessment
GROUP BY ServiceType
This is a Hive query. I'm trying to count how many records in the table earn less than the average salary.

First, distribute rows into different partitions according to their ServiceType. When neither ORDER BY nor a window frame is specified, the default frame is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, i.e. the whole partition.
Then, apply AVG as an analytic function over each partition to get its average AnnualSalary. This tells us whether each record's AnnualSalary is below the average of its partition.
Finally, sum the resulting 0/1 flags per ServiceType over the intermediate result set to get the count.
SELECT
SERVICETYPE,
SUM(ISBELOW)
FROM (
SELECT
*,
CASE
WHEN ANNUALSALARY < AVG(ANNUALSALARY) OVER (PARTITION BY SERVICETYPE) THEN 1
ELSE 0
END AS ISBELOW
FROM ASSESSMENT
) TMP
GROUP BY SERVICETYPE
;
Note that a HAVING clause filters groups after GROUP BY, by which point the details of individual rows are already lost, so it cannot do this row-level comparison.
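To check the logic without a Hive cluster, here is a minimal sketch that runs the same query shape in SQLite through Python's sqlite3 module. The sample rows are made up for illustration, and SQLite (3.25 or later, for window functions) only stands in for Hive, so this verifies the approach rather than any Hive-specific behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Assessment (ServiceType TEXT, AnnualSalary REAL)")
# Hypothetical sample data, not taken from the question.
conn.executemany(
    "INSERT INTO Assessment VALUES (?, ?)",
    [("A", 10), ("A", 20), ("A", 60),  # average 30, two rows below it
     ("B", 50), ("B", 50)],            # average 50, no rows below it
)

below_counts = conn.execute("""
    SELECT ServiceType, SUM(IsBelow) AS BelowCount
    FROM (
        SELECT ServiceType,
               CASE
                   WHEN AnnualSalary < AVG(AnnualSalary) OVER (PARTITION BY ServiceType)
                   THEN 1 ELSE 0
               END AS IsBelow
        FROM Assessment
    ) tmp
    GROUP BY ServiceType
    ORDER BY ServiceType
""").fetchall()

print(below_counts)  # [('A', 2), ('B', 0)]
```

Groups with no below-average rows still appear with a count of 0, which a WHERE filter on the flag would not give you.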

Related

How to implement a reset when (Teradata) using ANSI SQL only?

I need to write a query that counts the number of times a customer's transactions exceed 250 Pounds: add cumulatively until the sum exceeds 250, then reset and start again from the following row until it exceeds 250, and so on. This can be done with the Teradata keywords 'RESET WHEN', but I am supposed to write a query composed only of ANSI SQL syntax.
Can anyone help with that?
SUM(sales) OVER (
PARTITION BY region
ORDER BY day_of_calendar
RESET WHEN sales < /* preceding row */ SUM(sales) OVER (
PARTITION BY region
ORDER BY day_of_calendar
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
ROWS UNBOUNDED PRECEDING
)
This is a sample of the customer input (image: https://i.stack.imgur.com/lu4Jp.png).
The expected output was given as a second image.
Every time the customer's total spent exceeds 250, I should be summing from 0 once again and find the day at which the customer exceeded 250 USD.
Without your table definitions, and with only a screenshot of a very limited dataset, it is hard to test my answer on your data. So I'm showing it first on the dataset supplied in the MATCH_RECOGNIZE tutorial on Live SQL, and then with your columns:
SELECT
*
FROM
ticker MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
nvl(SUM(up.price),0) AS tot
ALL ROWS PER MATCH
PATTERN ( up* ) DEFINE
up AS SUM(up.price) - up.price <= 100
);
So on your table this would be something like
SELECT
*
FROM
your_table MATCH_RECOGNIZE (
PARTITION BY region
ORDER BY day_of_calendar
MEASURES
nvl(SUM(up.sales),0) AS tot
ALL ROWS PER MATCH
PATTERN ( up* ) DEFINE
up AS SUM(up.sales) - up.sales <= 250
);
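If MATCH_RECOGNIZE is not available on the target database, one alternative is a recursive CTE that carries the running total from row to row and restarts it once the previous total has exceeded 250. The sketch below is an assumption-laden illustration, not the answer's method: it reuses the your_table, region, day_of_calendar and sales names from above, the sample rows are invented, and it runs in SQLite (3.25 or later) via Python only so that it can be executed as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (region TEXT, day_of_calendar INTEGER, sales INTEGER)"
)
# Hypothetical sample data for a single region.
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?, ?)",
    [("X", 1, 100), ("X", 2, 100), ("X", 3, 100),
     ("X", 4, 50), ("X", 5, 250), ("X", 6, 10)],
)

rows = conn.execute("""
    WITH RECURSIVE ordered AS (
        -- Number the rows so each one can find its successor.
        SELECT region, day_of_calendar, sales,
               ROW_NUMBER() OVER (PARTITION BY region ORDER BY day_of_calendar) AS rn
        FROM your_table
    ),
    running AS (
        SELECT region, day_of_calendar, sales, rn, sales AS tot
        FROM ordered
        WHERE rn = 1
        UNION ALL
        -- Restart from the current row once the previous total exceeded 250.
        SELECT o.region, o.day_of_calendar, o.sales, o.rn,
               CASE WHEN r.tot > 250 THEN o.sales ELSE r.tot + o.sales END
        FROM running r
        JOIN ordered o ON o.region = r.region AND o.rn = r.rn + 1
    )
    SELECT day_of_calendar, sales, tot
    FROM running
    ORDER BY region, day_of_calendar
""").fetchall()

totals = [tot for _, _, tot in rows]
exceed_count = sum(1 for t in totals if t > 250)
print(totals, exceed_count)  # [100, 200, 300, 50, 300, 10] 2
```

The rows where tot is above 250 are the days on which the threshold was crossed. Because the recursion walks one row at a time, this may be slow on large partitions.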

Is there a way to calculate percentile using percentile_cont() function over a rolling window in Big Query?

I have a dataset with the following columns
city
user
week
month
earnings
Ideally I want to calculate the 50th percentile with percentile_cont(earnings, 0.5) over (partition by city order by month range between 1 preceding and current row). But BigQuery doesn't support window framing in percentile_cont. Can anyone help me with a workaround for this problem?
If I understand correctly, you can aggregate into an array and then unnest:
select t.*,
       -- percentile_cont needs the percentile as its second argument
       (select percentile_cont(earning, 0.5) over ()
        from unnest(ar_earnings) earning
        limit 1
       ) as median_2months
from (select t.*,
             -- collect the earnings of the previous and current month
             array_agg(earnings) over (partition by city
                                       order by month
                                       range between 1 preceding and current row
                                      ) as ar_earnings
      from t
     ) t;
You don't provide sample data, but this version assumes that month is an incrementing integer that represents the month. You may need to adjust the range depending on the type.

Returning single records per month

I have a use case where a query needs to return only a single row for every end of month.
I tried using SELECT DISTINCT, but it shows multiple records for the same end of month:
SELECT DISTINCT CASE
WHEN eff_interest_balance < 0.01 THEN trial_balance_date
WHEN date_paid < trial_balance_date THEN date_paid
END as A
, period
FROM dbo.Intpayments
WHERE loan_number = 60023
ORDER BY period ASC
The query should return a single date for each month.
Distinct is returning unique rows, not grouping them. You are looking to aggregate rows. This means using some combination of aggregate functions and group by.
What your current query is missing is some sort of logic for aggregating the rows that are in the same period. Do you want to compare the sum of these values? The min, the max?
In any case, the basic idea of aggregating and grouping would look like this - I don't think this summing is what you want, but the query shows the basic idea of aggregating and grouping:
SELECT
period
, SUM(eff_interest_balance) AS SumOfBalance
FROM dbo.Intpayments
WHERE loan_number = 60023
GROUP BY period
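As a concrete variant of the same idea, the sketch below keeps the question's CASE expression and collapses each period to one row by taking the earliest resulting date. Choosing MIN is purely an assumption made for illustration, since the question does not say which date should win, and MAX would work the same way. The sample rows are invented, dates are stored as ISO strings, and SQLite via Python stands in for SQL Server (so the dbo schema prefix is dropped).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Intpayments (
        loan_number INTEGER, period INTEGER,
        trial_balance_date TEXT, date_paid TEXT, eff_interest_balance REAL
    )
""")
# Hypothetical sample rows: two per period for one loan.
conn.executemany(
    "INSERT INTO Intpayments VALUES (?, ?, ?, ?, ?)",
    [(60023, 1, "2020-01-31", "2020-01-15", 0.0),
     (60023, 1, "2020-01-31", "2020-01-20", 100.0),
     (60023, 2, "2020-02-29", "2020-02-10", 5.0),
     (60023, 2, "2020-02-29", "2020-03-05", 5.0)],
)

one_row_per_period = conn.execute("""
    SELECT period,
           MIN(CASE
                   WHEN eff_interest_balance < 0.01 THEN trial_balance_date
                   WHEN date_paid < trial_balance_date THEN date_paid
               END) AS A
    FROM Intpayments
    WHERE loan_number = 60023
    GROUP BY period
    ORDER BY period
""").fetchall()

print(one_row_per_period)  # [(1, '2020-01-20'), (2, '2020-02-10')]
```

Rows for which neither CASE branch matches yield NULL, and MIN ignores those, so they do not affect the result for their period.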

Redshift: Find MAX in list disregarding non-incremental numbers

I work for a sports film analysis company. We have teams with unique team IDs and I would like to find the number of consecutive weeks they have uploaded film to our site moving backwards from today. Each upload also has its own row in a separate table that I can join on teamid and has a unique date of when it was uploaded. So far I put together a simple query that pulls each unique DATEDIFF(week) value and groups on teamid.
Select teamid, MAX(weekdiff)
from (Select teamid, DATEDIFF(week, dateuploaded, GETDATE()) as weekdiff
from leroy_events
group by teamid, weekdiff) t
group by teamid
What I am given is a list of teamIDs and unique weekly date differences. I would like to then find the max for each teamID without breaking an increment of 1. For example, if my data set is:
Team datediff
11453 0
11453 1
11453 2
11453 5
11453 7
11453 13
I would like the max value for team: 11453 to be 2.
Any ideas would be awesome.
I have simplified your example assuming that I already have a table with weekdiff column. That would be what you're doing with DATEDIFF to calculate it.
First, I'm using LAG() window function to assign previous value (in ordered set) of a weekdiff to the current row.
Then, using a WHERE condition I'm retrieving max(weekdiff) value that has a previous value which is current_value - 1 for consecutive weekdiffs.
Data:
create table leroy_events ( teamid int, weekdiff int);
insert into leroy_events values (11453,0),(11453,1),(11453,2),(11453,5),(11453,7),(11453,13);
Code:
WITH initial_data AS (
Select
teamid,
weekdiff,
lag(weekdiff,1) over (partition by teamid order by weekdiff) as lag_weekdiff
from
leroy_events
)
SELECT
teamid,
max(weekdiff) AS max_weekdiff_consecutive
FROM
initial_data
WHERE weekdiff = lag_weekdiff + 1 -- this ensures max() is taken only over rows that continue a consecutive increment
GROUP BY 1
SQLFiddle with your sample data to see how this code works.
Result:
teamid max_weekdiff_consecutive
11453 2
You can use SQL window functions to probe relationships between rows of the table. In this case the lag() function can be used to look at the previous row relative to a given order and grouping. That way you can determine whether a given row is part of a group of consecutive rows.
You still need overall to aggregate or filter to reduce the number of rows for each group of interest (i.e. each team) to 1. It's convenient in this case to aggregate. Overall, it might look like this:
select
team,
case min(datediff)
when 0 then max(datediff)
else -1
end as max_weeks
from (
select
team,
datediff,
case
when (lag(datediff) over (partition by team order by datediff) != datediff - 1)
then 0
else 1
end as is_consec
from diffs
) cd
where is_consec = 1
group by team
The inline view just adds an is_consec column to the data, marking whether each row is part of a group of consecutive rows. The outer query filters on that column (you cannot filter directly on a window function), and chooses the maximum datediff from the remaining rows for each team.
There are a few subtleties there:
The case expression in the inline view is written as it is to exploit the fact that the lag() computed for the first row of each partition will be NULL, which does not evaluate unequal (nor equal) to any value. Thus the first row in each partition is always marked consecutive.
The case testing min(datediff) in the outer select clause picks up teams that have no record with datediff = 0, and assigns -1 to column max_weeks for them.
It would also have been possible to mark rows non-consecutive if the first in their group did not have datediff = 0, but then you would lose such teams from the results altogether.
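A comparison with only the previous row also accepts a later run such as 5, 6, which would then win the MAX. One possible way to restrict the result to the unbroken run that starts at 0 is to compare each weekdiff with its zero-based position in the ordered list. The sketch below assumes one row per distinct weekdiff per team (as in the question's grouped query) and uses the leroy_events table from the first answer. Team 22222 and team 33333 are hypothetical additions, and SQLite (3.25 or later) via Python stands in for Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leroy_events (teamid INTEGER, weekdiff INTEGER)")
conn.executemany(
    "INSERT INTO leroy_events VALUES (?, ?)",
    [(11453, 0), (11453, 1), (11453, 2), (11453, 5), (11453, 7), (11453, 13),
     (22222, 0), (22222, 1), (22222, 2), (22222, 5), (22222, 6),  # later run 5, 6
     (33333, 1), (33333, 2)],                                     # no upload this week
)

streaks = conn.execute("""
    SELECT teamid, MAX(weekdiff) AS max_weekdiff_consecutive
    FROM (
        SELECT teamid, weekdiff,
               -- In an unbroken run from 0, the n-th row (zero-based) has weekdiff n.
               ROW_NUMBER() OVER (PARTITION BY teamid ORDER BY weekdiff) - 1 AS expected
        FROM leroy_events
    ) numbered
    WHERE weekdiff = expected
    GROUP BY teamid
    ORDER BY teamid
""").fetchall()

print(streaks)  # [(11453, 2), (22222, 2)]
```

Teams whose smallest weekdiff is not 0, like 33333 here, drop out of the result entirely. A LEFT JOIN from the team table would be needed to report them.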

Filtering by window function result in Postgresql

Ok, initially this was just a joke I had with a friend of mine, but it turned into an interesting technical question :)
I have the following stuff table:
CREATE TABLE stuff
(
id serial PRIMARY KEY,
volume integer NOT NULL DEFAULT 0,
priority smallint NOT NULL DEFAULT 0
);
The table contains the records for all of my stuff, with respective volume and priority (how much I need it).
I have a bag with specified volume, say 1000. I want to select from the table all stuff I can put into a bag, packing the most important stuff first.
This seems like the case for using window functions, so here is the query I came up with:
select s.*, sum(volume) OVER previous_rows as total
from stuff s
where total < 1000
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
order by priority desc
The problem with it, however, is that Postgres complains:
ERROR: column "total" does not exist
LINE 3: where total < 1000
If I remove this filter, total column gets properly calculated, results properly sorted but all stuff gets selected, which is not what I want.
So, how do I do this? How do I select only items that can fit into the bag?
I don't know if this qualifies as "more elegant" but it is written in a different manner than Cybernate's solution (although it is essentially the same)
WITH window_table AS
(
SELECT s.*,
sum(volume) OVER previous_rows as total
FROM stuff s
WINDOW previous_rows as
(ORDER BY priority desc ROWS between UNBOUNDED PRECEDING and CURRENT ROW)
)
SELECT *
FROM window_table
WHERE total < 1000
ORDER BY priority DESC
If by "more elegant" you mean something that avoids the sub-select, then the answer is "no".
I haven't worked with PostgreSQL. However, my best guess would be using an inline view.
SELECT a.*
FROM (
SELECT s.*, sum(volume) OVER previous_rows AS total
FROM stuff AS s
WINDOW previous_rows AS (
ORDER BY priority desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
)
ORDER BY priority DESC
) AS a
WHERE a.total < 1000;
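The sub-select is needed because WHERE is evaluated before window functions are computed, so the total column does not exist yet at the point where the original query tries to filter on it. The sketch below runs the wrapped form on a few invented rows. It uses SQLite (3.25 or later) via Python rather than PostgreSQL, writes the window inline instead of using a named WINDOW clause, and replaces serial/smallint with plain INTEGER columns, all of which are assumptions made only to keep it self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE stuff (
        id INTEGER PRIMARY KEY,
        volume INTEGER NOT NULL DEFAULT 0,
        priority INTEGER NOT NULL DEFAULT 0
    )
""")
# Hypothetical items: (id, volume, priority), with distinct priorities.
conn.executemany(
    "INSERT INTO stuff VALUES (?, ?, ?)",
    [(1, 400, 10), (2, 500, 9), (3, 300, 8), (4, 50, 7)],
)

packed = conn.execute("""
    SELECT id, volume, total
    FROM (
        -- The running total is computed here, before any filtering.
        SELECT s.*,
               SUM(volume) OVER (ORDER BY priority DESC
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total
        FROM stuff AS s
    ) AS a
    WHERE total < 1000  -- the alias is visible only in the outer query
    ORDER BY priority DESC
""").fetchall()

print(packed)  # [(1, 400, 400), (2, 500, 900)]
```

The running total never skips an item, so item 4 (volume 50) is left out even though it would still fit after items 1 and 2. The query packs strictly in priority order and stops contributing once the total reaches the limit.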