Duplicate records (rows) with modified values (Postgresql) - sql

I have a dataset that I am preparing for a pivot view (in Excel). It's customer data, and I want to create a view that lets me summarize how many customers were active every month. Therefore I need to duplicate rows with a modified value for first_status_date. This is my data:
This is where I am trying to get: (The green rows are duplicated, and the cells in bold are modified.)
Because I am adding a month on top of the value of the cell above, I am thinking of working with the lag function. But I am not familiar with duplicating rows. Ids 123 and 356 need 2 duplicated rows because there is a 2 month difference between last and first status date; id 221 needs only 1 duplicated row, as there is a one month difference.

You can use generate_series() to generate the new rows - a little tweaking is needed for the end date to ensure that we do get a row for the last month, regardless of the actual month day:
select
t.id,
t.first_status,
d.first_status_date,
t.last_status,
t.last_status_date
from mytable t
cross join lateral generate_series(
t.first_status_date,
date_trunc('month', t.last_status_date) + interval '1 month',
interval '1 month'
) d(first_status_date)
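To see the query above in action, here is a minimal PostgreSQL sketch. The sample rows are hypothetical (the original data is not reproduced in the post), and the cast to date is only there to make the output easier to read:

```sql
-- Hypothetical sample rows standing in for mytable.
with mytable (id, first_status, first_status_date, last_status, last_status_date) as (
    values
        (123, 'active', date '2020-01-15', 'inactive', date '2020-03-20'),
        (221, 'active', date '2020-02-10', 'inactive', date '2020-03-05')
)
select
    t.id,
    t.first_status,
    d.first_status_date::date as first_status_date,
    t.last_status,
    t.last_status_date
from mytable t
-- One generated row per month step, starting at the original first_status_date.
cross join lateral generate_series(
    t.first_status_date,
    date_trunc('month', t.last_status_date) + interval '1 month',
    interval '1 month'
) d(first_status_date)
order by t.id, d.first_status_date;
-- id 123 yields 3 rows (Jan 15, Feb 15, Mar 15); id 221 yields 2 rows (Feb 10, Mar 10).
```

One thing to watch: generate_series includes the stop value, so a first_status_date that falls exactly on the first day of a month would yield one extra row. Assuming the columns are plain dates, subtracting interval '1 day' from the upper bound is one way to avoid that.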

Related

Where clause in a calculation

Say I have this table:
month       num_of_fruits  harvested
------------------------------------
2022-01-01  133            3
2022-02-01  145            12
2022-03-01  123            5
2022-04-01  111            4
2022-05-01  164            9
..          ..             ..
I want to add a new column called lost based on the month and num_of_fruits columns. Computing this column requires a calculation: harvested - (num_of_fruits - num_of_fruits(last_month))
I'm having trouble with the part in parentheses: getting last month's num_of_fruits. I have this to start:
select
id,
"month",
num_of_fruits,
harvested,
harvested - (num_of_fruits - num_of_fruits WHERE date_trunc('month', "month" - interval '1' month)) as lost,
selecting other columns..
It's giving me an error in the where clause.
Can you have a where clause inside a select statement? How would I take the last month's num_of_fruits and subtract it with this month's num_of_fruits - all while inside the select statement?
Any help or advice will greatly help me! Thank you so much in advance!
If you want to check other rows in the table, you will likely want either a subquery in your SELECT or to join the table to itself.
I think you are probably trying to do:
SELECT
harvested - (num_of_fruits - (SELECT num_of_fruits FROM mytable t2 WHERE t2.month = date_trunc('month', t1."month" - interval '1' month))) as lost
FROM mytable t1
Note that I created a whole new subquery (SELECT/FROM/WHERE) within your existing SELECT statement, instead of just adding a stray WHERE clause.
I also changed your condition so that it actually compares the result of date_trunc with something.
It's not clear to me that you actually need date_trunc here (and, if you do, you might want it on both sides of the comparison), but you can use the basic idea above and fix the condition to match your needs.
An alternative (joining to self) to consider might be:
SELECT
t1.harvested - (t1.num_of_fruits - t2.num_of_fruits) AS lost
FROM mytable t1 LEFT OUTER JOIN mytable t2
ON t2."month" = date_trunc('month', t1."month" - interval '1' month)
If you know that you always have one row per month, so the previous row (ordered by month) is also the previous month, you could just use LAG:
SELECT
harvested - (num_of_fruits - LAG(num_of_fruits, 1) OVER (ORDER BY month)) AS lost
FROM mytable
LAG(num_of_fruits, 1) OVER (ORDER BY month) means "the num_of_fruits from the previous row in the table when the table is ordered by month".
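As a quick check of the LAG approach, here is a self-contained PostgreSQL sketch that feeds the five sample rows from the question through the calculation. The CTE simply stands in for the real table:

```sql
-- Sample rows copied from the question; the CTE stands in for mytable.
with mytable ("month", num_of_fruits, harvested) as (
    values
        (date '2022-01-01', 133, 3),
        (date '2022-02-01', 145, 12),
        (date '2022-03-01', 123, 5),
        (date '2022-04-01', 111, 4),
        (date '2022-05-01', 164, 9)
)
select
    "month",
    num_of_fruits,
    harvested,
    -- LAG returns NULL for the first row, so lost is NULL there too.
    harvested - (num_of_fruits - lag(num_of_fruits, 1) over (order by "month")) as lost
from mytable
order by "month";
-- lost: NULL, 0, 27, 16, -44
```

If a NULL for the first month is not wanted, wrapping the LAG call in COALESCE with a chosen default is one option.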

How to select data but without similar times?

I have a table with a create_dt timestamp column, and I need to select records while skipping those whose create_dt is close (within 15 minutes) to that of another record.
So I need to get only one record instead of two if the second create_dt is within 15 minutes of the first one.
The date and time format is 'DD.MM.YYYY HH24:MI:SS', e.g. '29.03.2019 00:00:00'. Thanks
It's a bit unclear what exactly you want, but one thing I can think of, is to round all values to the nearest "15 minute" and then only pick one row from those "15 minute" intervals:
with rounded as (
select create_dt,
date '0001-01-01' + (round((cast(create_dt as date) - date '0001-01-01') * 24 * 60 / 15) * 15 / 60 / 24) as rounded,
... other columns ....
from your_table
), numbered as (
select create_dt,
rounded,
row_number() over (partition by rounded order by create_dt) as rn,
... other columns ....
from rounded
)
select *
from numbered
where rn = 1;
The expression date '0001-01-01' + (round((cast(create_dt as date) - date '0001-01-01') * 24 * 60 / 15) * 15 / 60 / 24) will return create_dt rounded up or down to the nearest "15 minutes" boundary.
The row_number() then numbers the rows within each 15-minute interval, and the final select picks the first row of each interval.
Online example: https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e6c7ea651c26a6f07ccb961185652de7
I'm going to walk you through this conceptually. First of all, there's a difficulty in doing this that you might not have noticed.
Let's say you wanted one record per hour or per day. If two records were created on the same day, you only want one of them in your results. Which one?
I mention this because the designers of SQL could not build in a single answer for which one to pick. SQL cannot show data from both records without both records appearing in the tabular output.
This is a common problem, and the feature the designers of SQL provided to handle it only works when there is no ambiguity about how to produce one result row from several records. That feature is GROUP BY, but it can only show fields other than the timestamp if they are the same for all records in the time period. Every selected field has to appear in the GROUP BY, and if records in the same time period differ in any of those fields, they produce multiple rows in your output. So although GROUP BY is the tool for this problem, you might not be able to use it.
So here is the solution you want. If multiple records are close together, don't include the records after the first one. You want a WHERE clause that excludes a record if another record closely precedes it. The test for each record in the result therefore involves other records in the table, so you need to join the table to itself.
Let's say we have a table named error_events. If we get multiples of the same value in the field error_type very close to the time of other similar events, we only want to see the first one. The SQL will look something like this:
SELECT A.*
FROM error_events A
INNER JOIN error_events B ON A.error_type = B.error_type
WHERE ???
You will have to figure out the details of the WHERE clause, and the timestamp functions will depend on which RDBMS product you are using (MySQL and PostgreSQL, for instance, may work differently).
You want only the records for which no other record is earlier by less than 15 minutes. You do want the original record: it will match itself in the join, but it will be the only record in the time period between its timestamp and 15 minutes prior.
So an example WHERE clause would be
WHERE B.create_dt BETWEEN [15 minutes before A.create_dt] and A.create_dt
GROUP BY A.*
HAVING 1 = COUNT(B.pkey)
Like we said, you will have to find out how your database product subtracts time, and how 15 minutes is represented in that difference.
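For one concrete reading of the idea above, here is a sketch in PostgreSQL syntax (other products spell the interval arithmetic differently). It swaps the GROUP BY / HAVING pseudocode for an equivalent NOT EXISTS test; the sample rows and the error_type values are hypothetical:

```sql
-- Hypothetical sample rows standing in for error_events.
with error_events (pkey, error_type, create_dt) as (
    values
        (1, 'timeout', timestamp '2019-03-29 00:00:00'),
        (2, 'timeout', timestamp '2019-03-29 00:10:00'),  -- within 15 minutes of pkey 1
        (3, 'timeout', timestamp '2019-03-29 01:00:00'),
        (4, 'disk',    timestamp '2019-03-29 00:05:00')   -- different error_type
)
select a.*
from error_events a
where not exists (
    select 1
    from error_events b
    where b.error_type = a.error_type
      -- b falls inside the 15 minutes leading up to a ...
      and b.create_dt >= a.create_dt - interval '15 minutes'
      -- ... and sorts strictly before a (pkey breaks ties on equal timestamps)
      and (b.create_dt, b.pkey) < (a.create_dt, a.pkey)
)
order by a.create_dt;
-- keeps pkey 1, 4 and 3; pkey 2 is dropped
```

Note that this drops every record that has any predecessor within 15 minutes, so a long chain of closely spaced records collapses to its first member only.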

Calculate stdev over a variable range in SQL Server

Table format is as follows:
Date ID subID value
-----------------------------
7/1/1996 100 1 .0543
7/1/1996 100 2 .0023
7/1/1996 200 1 -.0410
8/1/1996 100 1 -.0230
8/1/1996 200 1 .0121
I'd like to apply STDEV to the value column where date falls within a specified range, grouping on the ID column.
Desired output would look something like this:
DateRange  ID   std_v
1          100  .0232
2          100  .0323
1          200  .0423
One idea I've had that works but is clunky, involves creating an additional column (which I've called 'partition') to identify a 'group' of values over which STDEV is taken (by using the OVER function and PARTITION BY applied to 'partition' and 'ID' variables).
Creating the partition variable involves a prior CASE expression in which a given record is assigned a partition based on its date falling within a given range, i.e.:
...
, partition = CASE
WHEN date BETWEEN '7/1/1996' AND '10/1/1996' THEN 1
WHEN date BETWEEN '10/1/1996' AND '1/1/1997' THEN 2
...
Ideally, I'd be able to apply STDEV and the OVER function partitioning on the variable ID and variable date ranges (eg, say, trailing 3 months for a given reference date). Once this works for the 3 month period described above, I'd like to be able to make the date range variable, creating an additional '#dateRange' variable at the start of the program to be able to run this for 2, 3, 6, etc month ranges.
I ended up coming upon a solution to my question.
You can join the original table to a second table, consisting of a unique list of the dates in the first table, applying a BETWEEN clause to specify desired range.
Sample query below.
Initial table (#excessRet), with columns:
Date, ID, subID, value
Second table (#dates), a unique list of the dates in the first table, with columns:
Date
select d.date, er.id, STDEV(er.value)
from #dates d
inner join #excessRet er
on er.date between DATEADD(m, -36, d.date) and d.date
group by d.date, er.id
order by er.id, d.date
To achieve the desired next step referenced above (making range variable), simply create a variable at the outset and replace "36" with the variable.
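A minimal T-SQL sketch of that last step follows. The variable name @months is an assumption for illustration, and the temp tables are filled with the sample rows from the question so the script runs on its own:

```sql
-- Sample rows copied from the question's table.
CREATE TABLE #excessRet ([date] date, id int, subID int, value float);
INSERT INTO #excessRet ([date], id, subID, value) VALUES
    ('1996-07-01', 100, 1,  0.0543),
    ('1996-07-01', 100, 2,  0.0023),
    ('1996-07-01', 200, 1, -0.0410),
    ('1996-08-01', 100, 1, -0.0230),
    ('1996-08-01', 200, 1,  0.0121);

-- Unique list of dates, as described in the answer.
SELECT DISTINCT [date] INTO #dates FROM #excessRet;

-- Hypothetical variable: length of the trailing window in months (2, 3, 6, ...).
DECLARE @months int = 3;

SELECT d.[date], er.id, STDEV(er.value) AS std_v
FROM #dates d
INNER JOIN #excessRet er
    ON er.[date] BETWEEN DATEADD(month, -@months, d.[date]) AND d.[date]
GROUP BY d.[date], er.id
ORDER BY er.id, d.[date];
```

BETWEEN is inclusive at both ends, so the window also contains the date exactly @months earlier. STDEV returns NULL for groups with a single value.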

analyze range and if true tell me

I want to see if the price of a stock has changed by 5% this week. I have data that captures the price everyday. I can get the rows from the last 7 days by doing the following:
select price from data where date(capture_timestamp)>date(current_timestamp)-7;
But then how do I analyze that and see if the price has increased or decreased 5%? Is it possible to do all this with one sql statement? I would like to be able to then insert any results of it into a new table but I just want to focus on it printing out in the shell first.
Thanks.
It seems odd to have only one stock in a table called data. What you need to do is bring the two rows together for last week's and today's values, as in the following query:
select d.price
from data d cross join
data dprev
where cast(d.capture_timestamp as date) = cast(current_timestamp as date) and
cast(dprev.capture_timestamp as date) = cast(current_timestamp as date) - 7 and
d.price > dprev.price * 1.05
If the data table contains the stock ticker, the cross join would be an equijoin.
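A sketch of that equijoin variant in PostgreSQL syntax might look like this. The ticker column and the sample rows are assumptions; the rows are built relative to current_date so the example returns something whenever it is run:

```sql
-- Hypothetical sample rows; a ticker column is assumed, as suggested above.
with data (ticker, price, capture_timestamp) as (
    values
        ('ABC', 100.00, current_date - 7 + time '16:00'),
        ('ABC', 106.00, current_date + time '09:30'),
        ('XYZ',  50.00, current_date - 7 + time '16:00'),
        ('XYZ',  51.00, current_date + time '09:30')
)
select d.ticker, dprev.price as price_last_week, d.price as price_today
from data d
join data dprev
  on dprev.ticker = d.ticker   -- the equijoin: pair each stock with itself only
where cast(d.capture_timestamp as date) = current_date
  and cast(dprev.capture_timestamp as date) = current_date - 7
  and d.price > dprev.price * 1.05;
-- returns only ABC (106 > 100 * 1.05); XYZ rose by just 2%
```

To catch decreases as well, the last condition could be widened with `or d.price < dprev.price * 0.95`.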
You may be able to build on the following query, used as a subquery, for whatever calculations you want to do. It assumes one record per day: the window is literally the 7 preceding rows plus the current row, not 7 calendar days.
SELECT ticker, price, capture_ts
,MIN(price) OVER (PARTITION BY ticker ORDER BY capture_ts ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS min_prev_7_records
,MAX(price) OVER (PARTITION BY ticker ORDER BY capture_ts ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS max_prev_7_records
FROM data
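One way the windowed query might be used as a subquery is sketched below in PostgreSQL syntax. The sample rows are hypothetical, and "changed by 5%" is read here as "at least 5% above the window's low or at least 5% below the window's high", which is only one possible interpretation:

```sql
-- Hypothetical sample rows, one record per day.
with data (ticker, price, capture_ts) as (
    values
        ('ABC', 100.00, timestamp '2024-01-01 16:00'),
        ('ABC', 102.00, timestamp '2024-01-02 16:00'),
        ('ABC',  99.00, timestamp '2024-01-03 16:00'),
        ('ABC', 104.50, timestamp '2024-01-04 16:00')
),
windowed as (
    select ticker, price, capture_ts,
           min(price) over w as min_prev_7_records,
           max(price) over w as max_prev_7_records
    from data
    window w as (partition by ticker order by capture_ts
                 rows between 7 preceding and current row)
)
select ticker, capture_ts, price,
       price >= min_prev_7_records * 1.05 as up_5_pct,
       price <= max_prev_7_records * 0.95 as down_5_pct
from windowed
order by ticker, capture_ts;
```

Because the frame includes the current row, the minimum and maximum can be the current price itself, in which case both flags are false.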

SQL: need only 1 row per particular timestamp

i have some SQL code that is inserting values from another (non sql-based) system. one of the values i get is a timestamp.
i can get multiple inserts that have the same timestamp (albeit different values for other fields).
my problem is that i am trying to get the first insert happening every day (based upon timestamp) since a particular day (i.e. give me the first insert of each day since January 28, 2007...)
my code to get the first timestamp of every day is as follows:
SELECT MIN(my_timestamp) AS first_timestamp
FROM my_schema.my_table
WHERE my_col1 = 'WHATEVER'
AND my_timestamp > timestamp '2010-Jul-27 07:45:24' - INTERVAL '365 DAY'
GROUP BY DATE (my_timestamp);
This gives me the list of times available. But when I join against these times, I can get several rows, as there are lots of rows that match these times. So for 365 days, I may get 5,000 rows (I could be inserting 100 rows at 00:00:00 every day).
Assuming, in the example above, my_table has columns my_col1 and my_col2, how can I get exactly 365 rows that contain my_col1 & my_col2? it doesn't matter which row i get back if there are multiple rows for a date; any row will suffice.
it's an odd question. the overall problem is: given a timestamp, how can one get 1-row-per-timestamp even if there are multiple rows that have said timestamp (assuming there is no other priority)?
thanks for the help in advance.
EDIT:
So, let's say for example, this table has the following columns: my_col1, my_col2, and my_timestamp.
Here are example values (in order of my_col1 - my_col2 - my_timestamp):
'my_val1' - 10 - '2010-07-01 01:01:01'
'my_val2' - 11 - '2010-07-01 01:01:01'
'my_val3' - 12 - '2010-07-01 01:01:01'
'my_val4' - 13 - '2010-07-01 01:01:02'
'my_val5' - 14 - '2010-07-02 01:01:01'
'my_val6' - 15 - '2010-07-02 01:01:01'
'my_val7' - 16 - '2010-07-03 01:01:01'
in the end, i would want only 3 rows, 1 with a timestamp with '2010-07-01 01:01:01', one with '2010-07-02 01:01:01', and one with '2010-07-03 01:01:01'. the third one is easy, since there is only 1 row with that last timestamp. but the first two are the tricky ones. the sql i posted above will ignore the row with 'my_val4'.
i need a query that will return me all of the columns, not just the dates.
how would i get sql to give me either the first or last of the values that would match that timestamp (it doesn't matter either way. i just need to get 1-per first-day's timestamp matching)?
select distinct on (date(my_timestamp)) *
from my_table
order by date(my_timestamp), my_timestamp
This selects all columns, exactly one row per date(my_timestamp). The single row per day is the first row for the group, as determined by order by (so that's the row with minimal my_timestamp).
Of course you can add whatever joins, wheres etc. you need. But this is the stub you're looking for.
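For illustration, here is the stub run against the sample rows from the question's EDIT, in PostgreSQL. The extra my_col2 sort key is an addition, not part of the answer above; it only makes the choice among rows with identical timestamps deterministic:

```sql
-- Sample rows copied from the question's EDIT.
with my_table (my_col1, my_col2, my_timestamp) as (
    values
        ('my_val1', 10, timestamp '2010-07-01 01:01:01'),
        ('my_val2', 11, timestamp '2010-07-01 01:01:01'),
        ('my_val3', 12, timestamp '2010-07-01 01:01:01'),
        ('my_val4', 13, timestamp '2010-07-01 01:01:02'),
        ('my_val5', 14, timestamp '2010-07-02 01:01:01'),
        ('my_val6', 15, timestamp '2010-07-02 01:01:01'),
        ('my_val7', 16, timestamp '2010-07-03 01:01:01')
)
select distinct on (date(my_timestamp)) *
from my_table
where my_timestamp > timestamp '2010-07-27 07:45:24' - interval '365 days'
-- The leading ORDER BY expression must match the DISTINCT ON expression.
order by date(my_timestamp), my_timestamp, my_col2;
-- returns the rows for my_val1, my_val5 and my_val7
```

Without the tie-breaker, any of the rows sharing the earliest timestamp of a day may be returned, which matches the question's "any row will suffice".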
The solution is to use the SQL's DISTINCT statement (http://www.sql-tutorial.com/sql-distinct-sql-tutorial/):
SELECT DISTINCT MIN(my_timestamp) AS first_timestamp FROM my_schema.my_table WHERE my_col1 = 'WHATEVER' AND my_timestamp > timestamp '2010-Jul-27 07:45:24' - INTERVAL '365 DAY' GROUP BY DATE (my_timestamp);
I know you already have an answer, but I still don't understand why you have mentioned a join in your question. Why not just include the rest of the columns in your query, like this:
SELECT MIN(my_timestamp) AS first_timestamp, my_col1, my_col2
FROM my_table
GROUP BY DATE(my_timestamp);
This works in MySQL. Does it not return the expected result in PostgreSQL?