SQL code for time and record specific selection? - sql

I've been racking my brain over the following problem. I work with a large dataset containing records of species observations and locations. Here's what I want to do:
For each five-year period I want to know the number of distinct species that were described in that particular period but not in any earlier period. So for each new period, the query needs to check whether a species was already described before. This needs to be done for 100 separate areas. Each record already stores the five-year period in which it was recorded.
My final result should be a table with areas on the x-axis, periods on the y-axis, and in the cells the number of newly described species for each period, per area. It would be great if this were possible in one query, but I can also do the pivot in Excel; I would already be very happy to get, for each area, the number of distinct new species per time period.

PostgreSQL supports Windowed Aggregate Functions:
SELECT
  area, period, SUM(x) AS newSpecies
FROM
(
  SELECT area, period,
         CASE -- check for the first description
           WHEN date_col = MIN(date_col) OVER (PARTITION BY species) THEN 1
           ELSE 0
         END AS x
  FROM au.trans
) AS dt
GROUP BY area, period
Depending on your data you might need to switch to ROW_NUMBER instead:
CASE -- check for the first description
  WHEN ROW_NUMBER() OVER (PARTITION BY species ORDER BY date_col) = 1 THEN 1
  ELSE 0
END AS x
Now you just have to pivot that data. PostgreSQL has no PIVOT keyword (the closest is crosstab from the tablefunc extension), so you'll need the classical MAX(CASE) approach, adding one column per area:
SELECT period,
       -- copy & modify this line for each area
       MAX(CASE WHEN area = 'area52' THEN newSpecies ELSE 0 END) AS area52,
       ....
FROM (previous query) AS dt
GROUP BY period
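Putting both steps together, here is a minimal sketch using the FILTER clause (Postgres 9.4+) instead of MAX(CASE); the area names are invented, and if "new" should be judged per area rather than globally, change the partition to PARTITION BY area, species:

```sql
-- Sketch: one row per period, one column per area,
-- assuming au.trans(species, area, period, date_col); area names are examples.
SELECT period,
       SUM(x) FILTER (WHERE area = 'area51') AS area51,  -- copy & modify per area
       SUM(x) FILTER (WHERE area = 'area52') AS area52
FROM (
  SELECT area, period,
         CASE WHEN date_col = MIN(date_col) OVER (PARTITION BY species)
              THEN 1 ELSE 0 END AS x
  FROM au.trans
) AS dt
GROUP BY period
ORDER BY period;
```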

Related

Finding the initial sampled time window after using SAMPLE BY again

I can't seem to find a perhaps easy solution to what I'm trying to accomplish here, using SQL and, more importantly, QuestDB. I also find it hard to put my exact question into words so bear with me.
Input
My real input is different of course but a similar dataset or case is the gas_prices table on the demo page of QuestDB. On https://demo.questdb.io, you can directly write and run queries against some sample database, so it should be easy enough to follow.
The main task I want to accomplish is to find out which month was responsible for the year's highest gallon price.
Output
Using the following query, I can get the average gallon price per month just fine.
SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
timestamp                   | avg_per_month
2000-06-05T00:00:00.000000Z | 1.6724
2000-07-05T00:00:00.000000Z | 1.69275
2000-08-05T00:00:00.000000Z | 1.635
...                         | ...
Then, I take all these monthly averages, group them by year, and return the maximum gallon price per year by wrapping the above query in a subquery, like so:
SELECT timestamp, max(avg_per_month) as max_per_year FROM (
  SELECT timestamp, avg(galon_price) as avg_per_month FROM 'gas_prices' SAMPLE BY 1M
) SAMPLE BY 12M
timestamp                   | max_per_year
2000-01-05T00:00:00.000000Z | 1.69275
2001-01-05T00:00:00.000000Z | 1.767399999999
2002-01-05T00:00:00.000000Z | 1.52075
...                         | ...
Wanted output
I want to know which month was responsible for the maximum price of a year.
Looking at the output of the above query, we see that the maximum gallon price for the year 2000 was 1.69275. Which month of 2000 had this amount as its average price? I'd like to display that month in an additional column.
For the first row, July 2000 is shown in the additional column because it is responsible for the highest average price of 2000. For the second row it is May 2001, as that month had the highest average price of 2001.
timestamp                   | max_per_year   | which_month_is_responsible
2000-01-05T00:00:00.000000Z | 1.69275        | 2000-07-05T00:00:00.000000Z
2001-01-05T00:00:00.000000Z | 1.767399999999 | 2001-05-05T00:00:00.000000Z
...                         | ...            | ...
What did I try?
I tried adding a subquery to the SELECT to get a "duplicate" of the timestamp column, but that's apparently not valid in QuestDB, so presumably the solution involves more subqueries in the FROM, or a UNION?
Who can help me out with this? The data is there in the database and it can be calculated; it's just a matter of getting it out.
I think 'wanted output' can be achieved with window functions.
Please have a look at:
CREATE TABLE electricity (ts TIMESTAMP, consumption DOUBLE) TIMESTAMP(ts);
INSERT INTO electricity
SELECT (x*1000000)::timestamp, rnd_double()
FROM long_sequence(10000000);
SELECT day, ts, max_per_day
FROM
(
  SELECT timestamp_floor('d', ts) AS day,
         ts,
         avg_in_15_min AS max_per_day,
         row_number() OVER (PARTITION BY timestamp_floor('d', ts)
                            ORDER BY avg_in_15_min DESC) AS rn_per_day
  FROM
  (
    SELECT ts, avg(consumption) AS avg_in_15_min
    FROM electricity
    SAMPLE BY 15m
  )
) WHERE rn_per_day = 1
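Translated back to the gas_prices example, the same pattern would look roughly like this (a sketch; it assumes your QuestDB version supports window functions and that timestamp_floor accepts the 'y' unit):

```sql
-- Sketch: for each year, the month whose monthly average was the year's maximum.
SELECT timestamp AS which_month_is_responsible,
       avg_per_month AS max_per_year
FROM (
  SELECT timestamp,
         avg_per_month,
         row_number() OVER (PARTITION BY timestamp_floor('y', timestamp)
                            ORDER BY avg_per_month DESC) AS rn_per_year
  FROM (
    SELECT timestamp, avg(galon_price) AS avg_per_month
    FROM 'gas_prices'
    SAMPLE BY 1M
  )
) WHERE rn_per_year = 1;
```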

Group or Sum the data based on overlapping period

I'm working on migrating legacy system data to a new system, trying to migrate the data with its history based on the change date. My current query produces the output below.
Since it's a legacy system, some of the data falls within the same period. I want to group the data by id and name, and flag each row's value as active or inactive depending on whether it falls within the same period as another row.
My expected output:
For example, let's take 119 and walk through it. One row (marked yellow) doesn't overlap any of the other rows, but the other two rows overlap during the period 01-Nov-18 to 30-Sep-19.
I need to split the data at the overlapping period and add the values only for the overlapped part. So I need to look at combinations of dates, which means introducing two rows: one for the non-overlapped part, which results in the two rows below.
Another row is introduced for the overlapped part.
The same scenario applies to 148324: two rows are introduced, one overlapped and one non-overlapped.
Also, is it possible to isolate the non-overlapped data based on some condition? I want to move only the overlapping data to a temp table, and move the non-overlapped data directly to the output table.
I don't think I have a 100% solution; it's hard to decide which data are right and how to sort them.
This query is based on the lead/lag analytic functions. I had to replace NULL values with sentinel values at the ends of each sequence (far future and far past).
Please try and modify this query; I hope it fits your case.
My table:
Query:
SELECT id, name, value, startdate, enddate,
       CASE WHEN nvl(next_startdate, 29993112) > nvl(prev_enddate, 19900101)
            THEN 'Y' ELSE 'N' END AS active
FROM
(
  SELECT datatable.*,
         lag(enddate)    OVER (PARTITION BY id, name ORDER BY startdate, value DESC) AS prev_enddate,
         lead(startdate) OVER (PARTITION BY id, name ORDER BY startdate, value DESC) AS next_startdate
  FROM datatable
) dt
Results:
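If your database supports DATE literals, COALESCE with real dates avoids comparing DATE columns against numeric sentinels; a sketch of the same lead/lag idea (the sentinel dates are arbitrary far-past/far-future values):

```sql
-- Sketch: same active/inactive flag, with DATE sentinels instead of numeric literals.
SELECT id, name, value, startdate, enddate,
       CASE WHEN COALESCE(next_startdate, DATE '2999-12-31')
                 > COALESCE(prev_enddate,  DATE '1990-01-01')
            THEN 'Y' ELSE 'N' END AS active
FROM (
  SELECT d.*,
         LAG(enddate)    OVER (PARTITION BY id, name ORDER BY startdate, value DESC) AS prev_enddate,
         LEAD(startdate) OVER (PARTITION BY id, name ORDER BY startdate, value DESC) AS next_startdate
  FROM datatable d
) dt;
```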

Determining a subset of user's whose scores have improved using SQL

To start, I'm not sure if SQL is the best way to go about this, but given that my data is currently in a Postgres table, I figured that solving this problem using SQL would be the most logical place to start. I'll start out with my problem in plain english:
Problem statement in english: I have a bunch of users (> 1 million) taking daily tests on my app. Their scores range from 0 to 100. I have about 5 years of this data. I would like to know which users have improved "most significantly" during this time.
There are quite a few things that I should elaborate on:
Improvement is arbitrary, but let's say that by "improvement" I mean that the difference between the average score of the first N tests and the average of the last N tests is at least D.
This means there must be at least 2N rows for a user, but let's say that to be eligible for analysis a user must have taken at least M * N tests. Finally, the gap between the first test and the last test should be at least Y years.
To summarize, we have:
N: The number of tests we are averaging to determine initial and final performance scores.
M: Will be multiplied by N to determine the minimum number of tests that a user must have taken to be eligible for this analysis.
D: The improvement threshold used to filter for the most-improved users.
Y: The number of years that a user must have participated for.
Test table schema (relative parts)
user_id (UUID): The ID of the user who took this test
score (INT): The score on this day's test
created_at (DATETIME): The test date (one per day per user)
My question
What would be a good way to query this in SQL?
Ideally the solution would be relatively fast (run within less than a minute or so). I can add table indices or make any other similar structural changes if required.
My thoughts so far
I feel like there may be a way to create groups by a user_id, but only show the groups passing the initial constraints:
Having at least N * M entries in the group
Difference between the first and last entries being at least Y
But after that, I'm really not sure. Are there ways to create sub-groups within a group, potentially adding a new "average score" attribute for that group? (Even getting this far could be sufficient if it's not possible to omit results where the difference between the first and last score averages is below D.)
Well, you can do this in SQL using window functions and conditional aggregation (#N, #M, #Y are parameter placeholders; the column names follow the schema above):
select user_id,
       avg(score) filter (where seqnum_asc <= #N) as first_n_avg,
       avg(score) filter (where seqnum_desc <= #N) as last_n_avg
from (select t.*,
             row_number() over (partition by user_id order by created_at) as seqnum_asc,
             row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
             count(*) over (partition by user_id) as cnt
      from t
     ) t
where cnt >= #M * #N
group by user_id
having max(created_at) >= min(created_at) + #Y * interval '1 year'
order by avg(score) filter (where seqnum_desc <= #N) - avg(score) filter (where seqnum_asc <= #N) desc;
You can add the D condition to the having clause as well; note that Postgres doesn't allow select-list aliases there, so repeat the expressions: having ... and avg(score) filter (where seqnum_desc <= #N) - avg(score) filter (where seqnum_asc <= #N) >= #D.
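To keep the runtime under a minute on millions of rows, a composite index matching the window partitioning and ordering may help (a sketch; "tests" is a placeholder for your actual table name):

```sql
-- Assumption: the table is named "tests"; substitute your real table name.
CREATE INDEX idx_tests_user_created ON tests (user_id, created_at);
```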

Calculation of weighted average counts in SQL

I have a query that I am currently using to find counts:
select Name, Count(Distinct ID) as cnt, Status, Team, Date
from list
group by Name, Status, Team, Date
In addition to the counts, I need to calculate a goal based on weighted average of counts per status and team, for each day.
For example, if Name 1 counts are divided into 50% Status1-Team1(X) and 50% Status2-Team2(Y) yesterday, then today's goal for Name1 needs to be (X+Y)/2.
The table would look like this, with the 'Goal' field needed as the output:
What is the best way to do this in the same query?
I'm almost guessing here since you did not provide more details, but maybe you want something like this:
SELECT q.Name, q.Status, q.Team, q.Date, q.cnt,
       (SELECT SUM(q2.cnt)
          FROM (SELECT Name, COUNT(DISTINCT ID) AS cnt
                  FROM list GROUP BY Name, Status, Team, Date) q2
         WHERE q2.Name = q.Name)
       / (SELECT COUNT(*) FROM list WHERE Name = q.Name) AS goal
FROM (SELECT Name, COUNT(DISTINCT ID) AS cnt, Status, Team, Date
        FROM list
       GROUP BY Name, Status, Team, Date) AS q
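Reading the example literally (today's goal for a name is the average of yesterday's counts across its status/team combinations), a sketch in Postgres-flavoured SQL; the CTE names, cnt, and goal are invented, and the date arithmetic varies by dialect:

```sql
-- Sketch: goal for each (name, date) = average of that name's counts on the previous day.
WITH daily AS (
  SELECT Name, Status, Team, Date AS d, COUNT(DISTINCT ID) AS cnt
  FROM list
  GROUP BY Name, Status, Team, Date
),
day_avg AS (
  SELECT Name, d, AVG(cnt) AS avg_cnt
  FROM daily
  GROUP BY Name, d
)
SELECT t.Name, t.Status, t.Team, t.d AS Date, t.cnt,
       y.avg_cnt AS goal              -- yesterday's average becomes today's goal
FROM daily t
LEFT JOIN day_avg y
  ON y.Name = t.Name
 AND y.d = t.d - INTERVAL '1 day';
```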

Use row number in aggregate sum over UNBOUNDED FOLLOWING SQL

I would like to apply a discount rate when summing cashflows over a number of periods. To do this I need to multiply each of the remaining cashflows by a discount factor corresponding to its distance from the current period. I could do this if I knew the row number of each period, but I can't use it inside the window calculation I am using. The example below shows the column 'Remaining Interest', which is what I am trying to calculate from the raw period and interest data.
select Period,
       RemainingInterest = SUM(PeriodInterestPaid)
           OVER (PARTITION BY Name ORDER BY Period
                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM CF A
Period | Interest | Remaining Interest (Query) | Remaining Interest (Required)
1      | 1000     | 1000+2000                  | 1000/1.02^1 + 2000/1.02^2
2      | 2000     | 2000                       | 2000/1.02^1
Hi, I hope I understand well: from the query it looks like you need the sum of the values per period, but you said you need a multiplication.
If it is just the sum, there's no need for a window function; a plain GROUP BY will do:
select Period, SUM(PeriodInterestPaid) as RemainingInterest
FROM CF A
GROUP BY Period
If you want the multiplication, you would still group, but with a different expression.
Please explain exactly what you need.
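For the 'Remaining Interest (Required)' column itself, the per-row discount can't be expressed inside a single window frame, but a self-join can do it; a sketch assuming consecutive integer periods and a fixed 2% per-period rate:

```sql
-- Sketch: discount each remaining cashflow by 1.02^(periods ahead), then sum.
SELECT a.Name, a.Period,
       SUM(b.PeriodInterestPaid / POWER(1.02, b.Period - a.Period + 1)) AS RemainingInterest
FROM CF a
JOIN CF b
  ON b.Name   = a.Name
 AND b.Period >= a.Period
GROUP BY a.Name, a.Period
ORDER BY a.Name, a.Period;
```

For period 1 in the example this yields 1000/1.02^1 + 2000/1.02^2, and for period 2 it yields 2000/1.02^1, matching the required column.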