Redshift querying data on dates - sql

I am trying to query data from a Redshift table.
I have a users table with columns such as name, age, gender and created_at, for example:
| name | age | gender | created_at |
|------|-----|--------|------------|
| X    | 24  | F      | some_date  |
I need to query the above table in such a way that I get additional columns such as created_this_week, created_last_week, created_last_4_week, current_month, last_month etc.
Each additional flag column should be 'Y' when the row satisfies the corresponding condition: created in the current week, last week, the current month, last month, or the last 4 weeks (excluding the current week, i.e. the 4 weeks ending with last week), something like below.
| name | age | gender | created_at  | current_week | last_week | last_4_week | current_mnth | last_mnth |
|------|-----|--------|-------------|--------------|-----------|-------------|--------------|-----------|
| X    | 24  | F      | CURRENTDATE | Y            | N         | N           | Y            | N         |
| F    | 21  | M      | lst_wk_dt   | N            | Y         | Y           | Depends      | Depends   |
I am new to PostgreSQL and Redshift and still in my learning phase. I spent the past few hours trying to do this myself but was unsuccessful. I'd really appreciate it if someone could help me out with this one.

You would use case expressions:
select t.*,
       (case when created_at >= now() - interval '1 week' then 'Y' else 'N' end) as week1,
       (case when created_at >= now() - interval '4 week' then 'Y' else 'N' end) as week4,
       . . .
from t;
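If the flags need to follow calendar boundaries (this week vs. last week, this month vs. last month) rather than rolling intervals, the comparisons can be anchored with date_trunc. A minimal sketch against the users table from the question, assuming created_at is a date or timestamp; date_trunc, add_months and day-based interval literals are all available in Redshift:
select u.*,
       case when created_at >= date_trunc('week', current_date)
            then 'Y' else 'N' end as current_week,
       case when created_at >= date_trunc('week', current_date) - interval '7 days'
             and created_at <  date_trunc('week', current_date)
            then 'Y' else 'N' end as last_week,
       -- the 4 full weeks ending with last week, i.e. excluding the current week
       case when created_at >= date_trunc('week', current_date) - interval '28 days'
             and created_at <  date_trunc('week', current_date)
            then 'Y' else 'N' end as last_4_week,
       case when created_at >= date_trunc('month', current_date)
            then 'Y' else 'N' end as current_mnth,
       case when created_at >= add_months(date_trunc('month', current_date), -1)
             and created_at <  date_trunc('month', current_date)
            then 'Y' else 'N' end as last_mnth
from users u;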

Related

Pgsql- How to filter report days with pgsql?

Let's say I have a table Transaction with the following data:
Transaction
| id | user_id | amount | created_at |
|----|---------|--------|------------|
| 1  | 1       | 100    | 2021-09-11 |
| 2  | 1       | 1000   | 2021-09-12 |
| 3  | 1       | -100   | 2021-09-12 |
| 4  | 2       | 200    | 2021-10-13 |
| 5  | 2       | 3000   | 2021-10-20 |
| 6  | 3       | -200   | 2021-10-21 |
I want to filter this data by the last 4 days, 15 days, or 28 days.
Note: if the user clicks the 4-days option, the report should cover the last 4 days.
I want this data:
total commission (sum of all transaction amounts * 5%)
Total Top up (the positive amounts)
Total Debut (the negative amounts)
Please help me out, and sorry for the basic question!
Expected result, if the user filters to the last 4 days:
Let's say the current date is 2021-09-16.
So the result is:
- TotalCommission: (1000 - 100) * 5%
- TotalTopUp: 1000
- TotalDebut: -100
I suspect you want:
SELECT SUM(amount) * 0.05 AS TotalCommission,
       SUM(amount) FILTER (WHERE amount > 0) AS TotalUp,
       SUM(amount) FILTER (WHERE amount < 0) AS TotalDown
FROM t
WHERE created_at >= CURRENT_DATE - 4 * INTERVAL '1 DAY';
This assumes that there are no future created_at values (which seems like a reasonable assumption). You can replace the 4 with whatever number of days you want.
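If the target database doesn't support the aggregate FILTER clause (it was added in PostgreSQL 9.4 and isn't available in Redshift, for example), the same conditional sums can be written with CASE expressions; a sketch under the same assumptions:
SELECT SUM(amount) * 0.05 AS TotalCommission,
       SUM(CASE WHEN amount > 0 THEN amount END) AS TotalUp,
       SUM(CASE WHEN amount < 0 THEN amount END) AS TotalDown
FROM t
WHERE created_at >= CURRENT_DATE - 4 * INTERVAL '1 DAY';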
Take a look at the aggregate functions sum, max and min. Last four days should look like this:
SELECT
    sum(amount) * .05 AS TotalCommission,
    max(amount) AS TotalUp,
    min(amount) AS TotalDebut
FROM t
WHERE created_at BETWEEN CURRENT_DATE - 4 AND CURRENT_DATE;
Demo: db<>fiddle
Your description indicates that you want to specify the number of days to process, and your expected results suggest you are looking for results by user_id (perhaps not, as only user 1 falls into the range). Perhaps the best option would be to wrap the query in a SQL function. And since all your sample data lies well past the current date, you would need to parameterize the end date as well. So the result becomes:
create or replace
function Commissions( user_id_in     integer default null
                    , days_before_in integer default 0
                    , end_date_in    date    default current_date
                    )
  returns table( user_id         integer
               , totalcommission numeric
               , totalup         numeric
               , totaldown       numeric
               )
 language sql
as $$
    select user_id
         , sum(amount) * 0.05
         , sum(amount) filter (where amount > 0)
         , sum(amount) filter (where amount < 0)
      from transaction
     where (user_id = user_id_in or user_id_in is null)
       and created_at <@ daterange( (end_date_in - days_before_in * interval '1 day')::date
                                  , end_date_in
                                  , '[]'::text  -- indicates inclusive of both dates
                                  )
     group by user_id;
$$;
See demo here. You may just want to play around with the parameters and see the results.
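For example, to reproduce the scenario from the question (the last 4 days ending on 2021-09-16), a call might look like this; the parameter values are illustrative:
-- all users, last 4 days ending 2021-09-16
select * from Commissions(null, 4, date '2021-09-16');
-- only user 1 over the same window
select * from Commissions(1, 4, date '2021-09-16');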

how to query time-series data in postgresql to find spikes

I have a table called cpu_usages and I'm trying to find spikes in CPU usage. My table has 4 columns:
id serial
at timestamp
cpu_usage float
cpu_core int
The at column stores a timestamp for every minute of every day.
I want to select every row for which, within the next 3 minutes after that row's timestamp, at least one row has a cpu_usage at least 3% higher than the starting row's value.
So for example if I have these rows:
| id | at               | cpu_usage | cpu_core |
|----|------------------|-----------|----------|
| 1  | 2019-01-01 00:00 | 1         | 0        |
| 2  | 2019-01-01 00:01 | 1         | 0        |
| 3  | 2019-01-01 00:02 | 4         | 0        |
| 4  | 2019-01-01 00:03 | 1         | 0        |
| 5  | 2019-01-01 00:04 | 1         | 0        |
| 6  | 2019-01-01 00:05 | 1         | 0        |
| 7  | 2019-01-01 00:06 | 1         | 0        |
| 8  | 2019-01-01 00:07 | 1         | 0        |
| 9  | 2019-01-01 00:08 | 6         | 0        |
| 10 | 2019-01-01 00:00 | 1         | 1        |
| 11 | 2019-01-01 00:01 | 1         | 1        |
| 12 | 2019-01-01 00:02 | 4         | 1        |
| 13 | 2019-01-01 00:03 | 1         | 1        |
| 14 | 2019-01-01 00:04 | 1         | 1        |
| 15 | 2019-01-01 00:05 | 1         | 1        |
| 16 | 2019-01-01 00:06 | 1         | 1        |
| 17 | 2019-01-01 00:07 | 1         | 1        |
| 18 | 2019-01-01 00:08 | 6         | 1        |
It would return rows:
1,2,6,7,8
I am not sure how to do this because it sounds like it needs some sort of nested joins.
Can anyone assist me with this?
This answers the original version of the question.
Just use window functions. Assuming you want to return the larger (spike) value, you want to look back, not forward:
select t.*
from (select cu.*,
             min(cpu_usage) over (order by at
                                  range between interval '3 minute' preceding
                                            and interval '1 second' preceding
                                 ) as previous_min
      from cpu_usages cu
     ) t
where previous_min * 1.03 < cpu_usage;
EDIT:
Looking forward (returning the starting row when anything in the next 3 minutes is at least 3% higher), this would be:
select t.*
from (select cu.*,
             max(cpu_usage) over (order by at
                                  range between interval '1 second' following
                                            and interval '3 minute' following
                                 ) as next_max
      from cpu_usages cu
     ) t
where next_max >= cpu_usage * 1.03;
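Since the sample data contains two cpu_core values, the window presumably has to be evaluated per core so one core's readings are never compared against another's. A sketch adding a PARTITION BY, assuming the schema from the question (RANGE frames with interval offsets need PostgreSQL 11 or later):
select t.*
from (select cu.*,
             max(cpu_usage) over (partition by cpu_core
                                  order by at
                                  range between interval '1 second' following
                                            and interval '3 minute' following
                                 ) as next_max
      from cpu_usages cu
     ) t
where next_max >= cpu_usage * 1.03
order by cpu_core, at;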

Calculate percentage of subqueries from the same table with group by and having clauses

I'm using a Postgres 10.12 DB, which contains a table with various fields about tests:
| test_name | result  | report_time | main_version | environment |
|-----------|---------|-------------|--------------|-------------|
| A         | error   | 29/11/2020  | 1            | john        |
| A         | failure | 28/12/2020  | 1            | john        |
| A         | error   | 29/12/2020  | 1            | alice       |
| B         | passed  | 30/12/2020  | 2            | ben         |
| C         | failure | 31/12/2020  | 2            | alice       |
| A         | error   | 31/12/2020  | 2            | john        |
I'm trying to calculate the percentage of tests which have both 'failure/error' and 'passed' results out of all the tests that ran on the same day.
I created the following query:
SELECT s.environment, COUNT(*) AS total, COUNT(*)::float / t.total_tests * 100 AS percentage
FROM (
    SELECT test_name, environment
    FROM tests
    WHERE report_time >= now() - interval '5 day'
      AND main_version = '1' AND environment = 'John'
    GROUP BY test_name, environment
    HAVING COUNT(CASE WHEN result IN ('failure', 'error') THEN 1 ELSE NULL END) > 0
       AND COUNT(CASE WHEN result = 'passed' THEN 1 ELSE NULL END) > 0
    ORDER BY environment ASC
) s
CROSS JOIN (
    SELECT COUNT(*) AS total_tests
    FROM tests
    WHERE report_time >= now() - interval '5 day'
      AND main_version = '1' AND environment = 'John'
) t
GROUP BY s.environment, t.total_tests
Which works fine for a single environment and version. When I try to combine environments, the count is wrong.
How can I correctly calculate the percentage per day?
I'm trying to calculate the percentage of tests which have both 'failure/error' and 'passed' results out of all the tests that ran on the same day.
I don't know what "same day" is referring to. The sample query takes data from a five-day range, so I might guess that is what you mean.
In any case, the basic idea is to use conditional aggregation:
SELECT test_name, environment,
AVG( (result = 'passed')::int ) as passed_ratio,
AVG( (result in ('failure', 'error') )::int ) as fail_error_ratio
FROM tests
WHERE report_time >= now() - interval '5 day' AND
main_version = '1' AND
environment = 'John'
GROUP BY test_name, environment;
This returns ratios between 0 and 1. If you want percentages between 0 and 100 just multiply by 100.
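If "the same day" refers to the calendar day of report_time, one way to get a per-day percentage is to first collapse each test (per environment) to one row per day, then count how many of those rows saw both a pass and a failure/error. A sketch, with the column names taken from the question and the "mixed result" definition assumed:
SELECT day,
       COUNT(*) AS total_tests,
       COUNT(*) FILTER (WHERE had_fail AND had_pass) AS mixed_tests,
       COUNT(*) FILTER (WHERE had_fail AND had_pass)::float / COUNT(*) * 100 AS percentage
FROM (
    SELECT report_time::date AS day,
           test_name,
           environment,
           bool_or(result IN ('failure', 'error')) AS had_fail,
           bool_or(result = 'passed') AS had_pass
    FROM tests
    WHERE report_time >= now() - interval '5 day'
      AND main_version = '1'
    GROUP BY report_time::date, test_name, environment
) per_test
GROUP BY day
ORDER BY day;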

How to write a SQL query to calculate instances where a row containing a distinct id occurs 7 days after the first occurrence of that unique id?

I am looking to return a date, the count of distinct_ids whose first occurrence is on that date, the number of those distinct_ids that occur again 7 days after their first occurrence, and the percentage (occurrences after 7 days / number of first occurrences).
example data_import table
+---------------------+------------------+
| time | distinct_id |
+---------------------+------------------+
| 2018/10/01 | 1 | first instance of `1`
+---------------------+------------------+
| 2018/10/01 | 2 | also first instance, but does not occur 7 days later
+---------------------+------------------+
| 2018/10/02 | 1 | should be disregarded (not first instance of 1)
+---------------------+------------------+
| 2018/10/02 | 3 | first instance of `3`
+---------------------+------------------+
| 2018/10/08 | 1 | First instance 7 days after first instance of `1`
+---------------------+------------------+
| 2018/10/08 | 1 | Don't count as this is the 2nd instance of `1` on this day
+---------------------+------------------+
| 2018/10/09 | 3 | 7 days after first instance of `3`
+---------------------+------------------+
| 2018/10/09 | 1 | 7 days after non-first instance of `1`
+---------------------+------------------+
And the expected return.
+---------------------+----------------------+------------------------+---------------------------+
| time | num_of_1st_instance | num_occur_7_days_after | percent_used_7_days_after |
+---------------------+----------------------+------------------------+---------------------------+
| 2018/10/01 | 2 | 1 | .50 |
+---------------------+----------------------+------------------------+---------------------------+
| 2018/10/02 | 1 | 1 | 1.0 |
+---------------------+----------------------+------------------------+---------------------------+
| 2018/10/03 | 0 | 0 | 0 |
+---------------------+----------------------+------------------------+---------------------------+
The query I have written is close, but it counts occurrences other than the first for a distinct_id.
In my example, this query would include the occurrence of distinct_id 1 on 2018/10/02 and its occurrence seven days after that, on 2018/10/09. That is not wanted, because the 2018/10/02 occurrence of distinct_id 1 is not its first.
SELECT
    data_import.time AS date,
    count(distinct data_import.distinct_id) AS num_installs_on_install_date,
    count(distinct future_activity.distinct_id) AS num_occur_7_days_after,
    count(distinct future_activity.distinct_id) / count(distinct data_import.distinct_id)::float AS percent_used_7_days_after
FROM data_import
LEFT JOIN data_import AS future_activity ON
    data_import.distinct_id = future_activity.distinct_id
    AND DATE(data_import.time) = DATE(future_activity.time) - INTERVAL '7 days'
    AND data_import.time = (SELECT time
                            FROM data_import
                            WHERE distinct_id = future_activity.distinct_id
                            ORDER BY time
                            LIMIT 1)
GROUP BY DATE(data_import.time)
I hope that I explained this clearly. Please let me know how I can change my current query, or suggest a different approach to the solution.
Hmmm. Does this do what you want?
select di.time, sum( (seqnum = 1)::int ) as first_instance,
       sum( flag_7day ) as num_after_7_day,
       sum( flag_7day ) * 1.0 / nullif(sum( (seqnum = 1)::int ), 0) as ratio
from (select di.*,
             row_number() over (partition by distinct_id order by time) as seqnum,
             (case when exists (select 1
                                from data_import di2
                                where di2.distinct_id = di.distinct_id
                                  and di2.time > di.time + interval '7 day'
                               )
                   then 1 else 0
              end) as flag_7day
      from data_import di
     ) di
group by di.time;
This doesn't return days with no first instances. Those days seem a bit awkward with respect to the ratio, so I'm not 100% sure that you really need them. If you do, it is easy enough to include a generate_series() to generate all dates in the range that you want.
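A sketch of that idea: join a calendar built with generate_series() against the per-day aggregation so that days without any first instances still appear with zeros. Only the first-instance count is shown here to keep it short; the 7-day columns from the query above can be joined in the same way (the date range is illustrative):
select d.day::date as time,
       count(fi.distinct_id) as num_of_1st_instance
from generate_series(date '2018-10-01', date '2018-10-09', interval '1 day') as d(day)
left join (select distinct_id, min(time) as first_time
           from data_import
           group by distinct_id
          ) fi
       on fi.first_time::date = d.day::date
group by d.day
order by d.day;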

PostgreSQL query group by two "parameters"

I've been trying to figure out the following PostgreSQL query with no success for two days now.
Let's say I have the following table:
| date       | value |
|------------|-------|
| 2018-05-11 | 0.20  |
| 2018-05-11 | -0.12 |
| 2018-05-11 | 0.15  |
| 2018-05-10 | -1.20 |
| 2018-05-10 | -0.70 |
| 2018-05-10 | -0.16 |
| 2018-05-10 | 0.07  |
And I need to find out the query to count positive and negative values per day:
| date       | positives | negatives |
|------------|-----------|-----------|
| 2018-05-11 | 2         | 1         |
| 2018-05-10 | 1         | 3         |
I've been able to figure out the query to extract only positives or negatives, but not both at the same time:
SELECT to_char(table.date, 'DD/MM') AS date,
       COUNT(*) AS negative
FROM table
WHERE table.date >= DATE(NOW() - '20 days'::INTERVAL)
  AND value < '0'
GROUP BY to_char(date, 'DD/MM'), table.date
ORDER BY table.date DESC;
Can someone please assist? This is driving me mad. Thank you.
Use a FILTER clause with the aggregate function.
SELECT to_char(table.date, 'DD/MM') AS date,
       COUNT(*) FILTER (WHERE value < 0) AS negative,
       COUNT(*) FILTER (WHERE value > 0) AS positive
FROM table
WHERE table.date >= DATE(NOW() - '20 days'::INTERVAL)
GROUP BY 1, table.date
ORDER BY table.date DESC;
I would simply do:
select date_trunc('day', t.date) as dte,
       sum( (value < 0)::int ) as negatives,
       sum( (value > 0)::int ) as positives
from t
where t.date >= current_date - interval '20 days'
group by date_trunc('day', t.date)
order by dte desc;
Notes:
I prefer using date_trunc() to casting to a string for removing the time component.
You don't need to use now() and convert to a date. You can just use current_date.
Converting a string to an interval seems awkward, when you can specify an interval using the interval keyword.