How to subtract data from a column in SQL

Good morning. I would like your help writing a query in SQL (PostgreSQL). I have a column with the following data:
" | Old 0.00 New 50.2 F20190429093143 | Old 50.20 New 50.2 F20191111151118 | Old 50.20 New 50.2 F20191202110735 | Old 50.20 New 53.2 F20201124173459 | Old 53.2 New 158.63 F20201125093143"
What I want to get are the last 3 updates, that is, the last 3 " | Old ..." entries, so that I keep something like this:
| Old 50.20 New 50.2 F20191202110735 | Old 50.20 New 53.2 F20201124173459 | Old 53.2 New 158.63 F20201125093143
I can't think of how to write the query to get that value back.

Assuming you also have some unique key/column in your table, the following hack would achieve what you want:
with entries as (
  select t.id,
         x.entry,
         regexp_split_to_array(x.entry, '\s+') as items
  from horribly_designed_table t
    cross join regexp_split_to_table(t.line, '\s{0,1}\|\s*') as x(entry)
  where nullif(trim(x.entry), '') is not null
), numbered as (
  select id,
         entry,
         to_timestamp(items[5], '"F"yyyymmddhh24miss') as ts,
         row_number() over (partition by id order by to_timestamp(items[5], '"F"yyyymmddhh24miss') desc) as rn
  from entries
)
select id,
       string_agg(entry, ' | ' order by ts) filter (where rn <= 3) as new_line
from numbered
group by id;
The first CTE splits the line into multiple rows, one per "entry". It also creates an array of the items within each entry.
The second CTE then uses a window function to rank each entry by its last item (converted to a timestamp).
Finally, the last 3 entries for each ID are aggregated back into a single line.
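For reference, a minimal setup to test the query against (the table and column names horribly_designed_table, id and line are taken from the query above; the row is the sample value from the question):
create table horribly_designed_table (id integer, line text);
insert into horribly_designed_table (id, line) values
  (1, ' | Old 0.00 New 50.2 F20190429093143 | Old 50.20 New 50.2 F20191111151118 | Old 50.20 New 50.2 F20191202110735 | Old 50.20 New 53.2 F20201124173459 | Old 53.2 New 158.63 F20201125093143');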
Online example

Related

SQL Having/Where clause to compare MAX from current/another table

I have a table that has date information and is being copied to another table, and I'm trying to perform an incremental load.
date = date format
hour = int
| person | date       | hour |
|--------|------------|------|
| bob    | 2023-01-01 | 1    |
| bill   | 2023-01-02 | 2    |
select * into test.person_copy from
(select * from original.person) as p
My thought process for performing the incremental load is to check the max(date) & max(hour) from the original table against the copied table, to identify the gap between the max values of the two tables. However, I'm not entirely sure how to implement the logic, as it doesn't seem straightforward with a WHERE clause. A HAVING clause might make more sense, but that doesn't seem correct either:
select * into test.person_copy from
(select * from original.person org
Having max(org.date, org.hour) > (select max(copy.date,copy.hour) from test.person_copy copy)
)
The other variation I had in mind was to use HAVING NOT IN:
Having max(org.date, org.hour) NOT IN (select max(copy.date,copy.hour) from test.person_copy copy)
I wasn't sure if the logic is correct. The hour field is important, but I can live with just the date field.
The expected output would be that the logic checks the existing max(date) and only inserts rows whose date doesn't already exist; see 2023-01-03 in the example below.
| person | date       | hour |
|--------|------------|------|
| bob    | 2023-01-01 | 1    |
| bill   | 2023-01-02 | 2    |
| test   | 2023-01-03 | 2    |
I don't have access to a Redshift environment, but the following query should work:
insert into test.person_copy
select *
from original.person org
where dateadd(hrs, org.hour, org.date) >
      (select max(dateadd(hrs, cpy.hour, cpy.date))
       from test.person_copy cpy)
This assumes that when the previous hour's copy was made, the entire set of source rows for that date & hour was copied (so the new incremental load only needs the dates & hours not already copied). It also means that you need additional criteria in the select to make sure you include only completed date-hours (i.e. that you don't copy rows with hour = 10 while the time is still 10:30).
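A sketch of such a criterion (assuming the server clock and the data are in the same time zone; dateadd, date_trunc and getdate() are built-in Redshift functions) would append an extra predicate to the WHERE clause above, excluding the date-hour that is still in progress:
  and dateadd(hrs, org.hour, org.date) < date_trunc('hour', getdate())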

Incremental integer ID in Impala

I am using Impala to query Parquet tables and cannot find a way to generate an integer column ranging from 1..n. The column is supposed to be used as an ID reference. Currently I am aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me, since I have to pass the ID to another system which expects an ID in the style of 1..n. I also already know that Impala has no auto-increment implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id`-column.
| my_id | example_uuid  | some_content |
|-------|---------------|--------------|
| 1     | 50d53ca4-b... | "a"          |
| 2     | 6ba8dd54-1... | "b"          |
| 3     | 515362df-f... | "c"          |
| 4     | a52db5e9-e... | "d"          |
|-------|---------------|--------------|
How can I achieve the desired result (an integer ID ranging from 1..n)?
Note: This question differs from this one, which specifically handles Kudu tables. However, the answers there should be applicable to this question as well.
Since other Q&As like this one only came up with uuid()-like answers, I put some thought into it and finally came up with this solution:
SELECT
  row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id,
  some_content
FROM some_table
row_number() generates a continuous integer number over a provided partition. Unlike rank(), row_number() always provides an incremented number within its partition (even if duplicates occur).
PARTITION BY "dummy" puts the entire table into a single partition. This works since "dummy" is interpreted in the execution graph as a temporary column yielding only the string value "dummy". Anything analogous to "dummy" works as well.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise set your respective column), we use the "dummy" workaround here as well.
The command creates the desired incremental ID without any nested SQL statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1     | "a"          |
| 2     | "b"          |
| 3     | "c"          |
| 4     | "d"          |
|-------|--------------|
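If the generated IDs should be persisted rather than computed at query time, the statement can be wrapped in Impala's CREATE TABLE AS SELECT (a sketch; some_table_with_id is an assumed name):
CREATE TABLE some_table_with_id AS
SELECT
  row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id,
  some_content
FROM some_table;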
I used Markus's answer on a large partitioned table and found that I was getting duplicate IDs. I think the IDs were only unique within their partition; possibly PARTITION BY "dummy" leads Impala to think that each partition can execute row_number() on its own. I was able to get it working by specifying an actual column to order by and no PARTITION BY:
SELECT
  row_number() OVER (ORDER BY actual_column) as my_id,
  some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.

How to get the set size, first and last record in a db2 ordered set with one call

I have a very big transaction table on DB2 v11, and I need to query a subset of it as efficiently as possible. All I need is the total count of the set (not known in advance; it's based on criteria, let's say 1 day), the ID of the first record, and the ID of the last record.
The old code was fetching the entire table and then using just the first record's ID, the last record's ID, and the size, without making use of the rest. Now this code is timing out. It's a complex query with several joins.
Is there a way to fetch the size of the set, the first record, and the last record, all in one select query?
I've read that reordering the list in order to fetch the first record (fetching with DESC, then changing to ASC) is not efficient.
sample table 1 TRANSACTION_RECORDS:
tdID TIMESTAMP name
-------------------------------
123 2020-03-31 john
234 2020-03-31 dan
456 2020-03-01 Eve
675 2020-04-01 joy
sample table 2 TRANSACTION_TYPE:
invoiceId tdID account
------------------------------
897 123 abc
898 123 def
877 234 mnc
899 456 opp
Sample query
select Min(tr.tdID), Max(tr.tdID)
from TRANSACTION_RECORDS TR
  join TRANSACTION_TYPE TT
    on TR.tdID = TT.tdID
where Date(TR.TIMESTAMP) = '2020-03-31'
group by TR.tdID
order by TR.tdID ASC
This returns multiple rows (but it requires the GROUP BY):
123,123
234,234
456,456
What I want is:
123,456
As I mentioned in the comments, for this query you need neither GROUP BY nor ORDER BY; just do:
select Min(tr.tdID), Max(tr.tdID)
from TRANSACTION_RECORDS TR
  join TRANSACTION_TYPE TT
    on TR.tdID = TT.tdID
where Date(TR.TIMESTAMP) = '2020-03-31'
It should work as expected.
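Since the question also asks for the total count of the set in the same call, a count can be added alongside the min/max. Assuming you want the number of distinct transactions (the join to TRANSACTION_TYPE can multiply rows), a sketch:
select Min(tr.tdID), Max(tr.tdID), Count(distinct tr.tdID)
from TRANSACTION_RECORDS TR
  join TRANSACTION_TYPE TT
    on TR.tdID = TT.tdID
where Date(TR.TIMESTAMP) = '2020-03-31'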

Select unique records and display as category headers in rails

I have a Rails 3.2 app running on PostgreSQL, and some data I want to display in my view, which is stored in the database in this structure:
+----+--------+------------------+--------------------+
| id | name | sched_start_date | task |
+----+--------+------------------+--------------------+
| 1 | "Ben" | 2013-03-01 | "Check for debris" |
+----+--------+------------------+--------------------+
| 2 | "Toby" | 2013-03-02 | "Carry out Y1.1" |
+----+--------+------------------+--------------------+
| 3 | "Toby" | 2013-03-03 | "Check oil seals" |
+----+--------+------------------+--------------------+
I would like to display a list of tasks for each name, with the names ordered ascending by the first sched_start_date they have. It should look like this:
Ben
2013-03-01 – Check for debris
Toby
2013-03-02 – Carry out Y1.1
2013-03-03 – Check oil seals
The approach I started taking was to run a query for unique names ordered by sched_start_date ASC, then run a query for each name to get their tasks.
To get a list of unique names, the SQL would look like this:
select *
from (
  select distinct on (name) name, sched_start_date
  from tasks
) p
order by sched_start_date;
I would like to know if this is the correct approach (querying for unique names and then running another query for all their tasks), or if there is a better Rails way.
To get the data sorted as you describe, you might want to use min() as a window function in the ORDER BY clause:
SELECT name, sched_start_date, task
FROM tasks
ORDER BY min(sched_start_date) OVER (PARTITION BY name), 1, 2, 3
Your original query would need an additional ORDER BY item to get the earliest date per name:
SELECT DISTINCT ON (name) name, sched_start_date, task
FROM tasks
ORDER BY 1, 2, 3;
I also added task (3) as the last ORDER BY item to break ties, in case there can be more than one task per date.
But the output is still ordered by name, not by date.
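If the names should instead be ordered by their earliest date, the DISTINCT ON query can be wrapped in a subquery, just like the query in the question (a sketch):
SELECT name, sched_start_date, task
FROM (
  SELECT DISTINCT ON (name) name, sched_start_date, task
  FROM tasks
  ORDER BY 1, 2, 3
) sub
ORDER BY sched_start_date;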
Getting your peculiar format with all data stuffed into one column is a bit more complex:
SELECT one_col
FROM (
  WITH x AS (
    SELECT name, min(sched_start_date) AS min_start
    FROM tasks
    GROUP BY 1
  )
  SELECT 2 AS rnk, name
       , sched_start_date::text || ' – ' || task AS one_col
       , sched_start_date, min_start
  FROM tasks
  JOIN x USING (name)
  UNION ALL
  SELECT 1 AS rnk, name, name, NULL::date, min_start
  FROM x
  ORDER BY min_start, name, rnk, sched_start_date, one_col
) y
Assuming that you have associations in your model, you would be able to run:
@employees = Employee.order(:name, :sched_start_date, :task).includes(:tasks)
You could then iterate over them:
@employees.each do |employee|
  employee.name
  employee.tasks.each do |task|
    task.name
  end
end
This isn't going to exactly match your needs, but it should show you where to start.

PostgreSQL calculate the top places per group and other statistics

I have a table with the following structure:
| user_id | place | type_of_place | money_earned | time |
|---------+-------+---------------+--------------+------|
|         |       |               |              |      |
The table is very large, several millions of rows. The data is in a PostgreSQL 9.1 database.
I want to calculate, per user_id and type_of_place: the mean, the standard deviation, the top 5 places (ordered by counts), and the most frequent hour of the day (the mode).
The resulting data must be in this form:
| user_id | type_of_place | avg | stddev | top5_places      | mode |
+---------+---------------+-----+--------+------------------+------+
| 1       | tp1           | 10  | 1      | {p1,p2,p3,p4,p5} | 8    |
| 2       | tp1           | 3   | 2      | {p3,p4}          | 23   |
| 1       | tp3           | 1   | 1      | {p1}             | 4    |
etc.
Is there a way of doing this efficiently with window functions?
What if I also want to group by week (i.e. by another column that holds the week number)?
Thank you!
A standard GROUP BY query will get you most of the way:
SELECT
user_id,
type_of_place,
avg(money_earned) AS avg,
stddev(money_earned) AS stddev
FROM
earnings -- I'm not sure what your data table is called...
GROUP BY
user_id,
type_of_place
This leaves the top5_places and mode columns. These are both also aggregates, but not ones which are defined in the standard PostgreSQL installation. Luckily, you can add them.
Here's a page discussing how to define a mode aggregate function: http://wiki.postgresql.org/wiki/Aggregate_Mode
Once you have a mode aggregate function, assuming time is a timestamp of some kind, the expression you will add to the select list will be:
SELECT
...
mode(extract(hour FROM time)) AS mode -- Add this expression
FROM
...
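Incidentally, on PostgreSQL 9.4 and later (newer than the 9.1 mentioned in the question), there is a built-in ordered-set aggregate that makes the custom function unnecessary:
mode() WITHIN GROUP (ORDER BY extract(hour FROM time)) AS mode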
Assuming order by money
For top5_places, there are several approaches, but the quickest is probably to use PostgreSQL's builtin array_agg function, and take the first 5 elements:
SELECT
...
(array_agg(place ORDER BY money_earned DESC))[1:5] AS top5_places -- Add this expression
FROM
...
One alternative is to define another aggregate called (for instance) top5, which performs the same function. This could be more efficient if there are many distinct places for each user/type of place combination, since it can stop accumulating after the first 5, whereas the above expression will generally build a complete array of all places, and then truncate to the first 5.
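A minimal sketch of what such a top5 aggregate could look like (top5 and top5_accum are illustrative names, not built-ins; the state function caps the state array at 5 elements instead of accumulating everything):
CREATE FUNCTION top5_accum(anyarray, anyelement) RETURNS anyarray AS $$
  SELECT CASE WHEN array_length($1, 1) >= 5 THEN $1 ELSE $1 || $2 END
$$ LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE top5(anyelement) (
  SFUNC = top5_accum,
  STYPE = anyarray,
  INITCOND = '{}'
);
It would then be called as top5(place ORDER BY money_earned DESC) in the select list.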
Either approach assumes that a place has a unique earnings entry for each user/type combination. If a place can occur more than once, and you want to sort by sum(money_earned) for each place, then you need to use a subquery as in the examples below...
Order by counts
Ok, so the places should be ordered by how often they occur. Here's a quick way, which uses a couple of subqueries -- add this as an expression to the select-clause of the above query:
(SELECT (array_agg(place ORDER BY cnt DESC))[1:5]
 FROM (SELECT place, count(*)
       FROM earnings AS t2
       WHERE t2.user_id = earnings.user_id
         AND t2.type_of_place = earnings.type_of_place
       GROUP BY place) AS s (place, cnt)
) AS top5_places
The inner subquery called s evaluates to a table of each place for that user/type combination, and the number of times it occurs (which I've called cnt). These are then fed to array_agg in descending order of that count.
I suspect there could be much neater (and probably more efficient) ways of writing it. If not, then I would recommend trying to move this complicated expression into a function or aggregate, if you can...
Histogram of places in each hour
We'll use a similar expression, which will return the array of counts, ordered by hour:
(SELECT array_agg(cnt ORDER BY hour DESC)
 FROM (SELECT extract(hour FROM time), count(*)
       FROM earnings AS t2
       WHERE t2.user_id = earnings.user_id
         AND t2.type_of_place = earnings.type_of_place
       GROUP BY 1) AS s (hour, cnt)
) AS hourly_histogram
(Add that to the select-clause of the original query.)