I am unable to write the SQL for my problem.
I have a table with two columns, item code and expiration date:
Itemcode  Expiration
Abc123    2014-08-08
Abc234    2014-07-07
Cfg345    2014-06-06
Cfg567    2014-07-08
The output should be based on the first 3 characters of the item code and the minimum expiration date, like below:
Abc  2014-07-07  Abc234
Cfg  2014-06-06  Cfg345
Thanks
EDITED:
The query goes like this; it actually joins multiple tables to fetch the itemcode and expiration:
select substr(y.itemcode,1,3),
min(x.expiration_date) expiry,
y.itemcode
from X x, Y y
where y.id = x.id
and x.number in
(select number from xyz
where id = x.id
and codec in ('C', 'M', 'T', 'H')
)
group by substr(y.itemcode,1,3), y.itemcode
I am not familiar with "m". Here is an ANSI standard SQL solution:
select substring(itemcode, 1, 3), expiration, itemcode
from (select t.*,
row_number() over (partition by substring(itemcode, 1, 3)
order by expiration
) as seqnum
from table t
) t
where seqnum = 1;
Most databases support this functionality. Some might have slightly different names (such as substr() or left() for the substring operation).
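As a sanity check of the approach, here is the same pattern run against the question's sample data in SQLite through Python (window functions need SQLite 3.25 or later; the table name `items` is just a stand-in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (itemcode TEXT, expiration TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("Abc123", "2014-08-08"), ("Abc234", "2014-07-07"),
     ("Cfg345", "2014-06-06"), ("Cfg567", "2014-07-08")],
)

# Rank rows within each 3-character prefix by expiration (ascending),
# then keep only the earliest row per prefix.
rows = conn.execute("""
    SELECT prefix, expiration, itemcode
    FROM (SELECT substr(itemcode, 1, 3) AS prefix,
                 expiration,
                 itemcode,
                 row_number() OVER (PARTITION BY substr(itemcode, 1, 3)
                                    ORDER BY expiration) AS seqnum
          FROM items)
    WHERE seqnum = 1
    ORDER BY prefix
""").fetchall()
print(rows)  # [('Abc', '2014-07-07', 'Abc234'), ('Cfg', '2014-06-06', 'Cfg345')]
```

The same subquery shape works in most databases; only the substring function name tends to vary.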
Related
I'm trying to update rows in a single table by splitting them into two "sets" of rows.
The top part of the set should have its status set to X and the bottom one should have its status set to Y.
I've tried putting together a query that looks like this
WITH x_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
LIMIT 5
), y_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
((SELECT id from x_status), 'X'),
((SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (folks.ids);
When I run this query I get the following error:
pq: more than one row returned by a subquery used as an expression
This makes sense: folks.ids is expected to return a list of IDs (hence the IN clause in the UPDATE statement), but I suspect the problem is that I cannot return the list in the VALUES statement in the FROM clause, as it turns into something like this:
(1, 2, 3, 4, 5, 5)
(6, 7, 8, 9, 1)
Is there a way this UPDATE can be done using a CTE query at all? I could split this into two separate UPDATE queries, but a single CTE query would be cleaner and in theory faster.
I think I understand now... if I get your problem, you want to set the status to 'X' for the five most recently registered records and 'Y' for everything else?
In that case I think the row_number() analytic would work -- and it should do it in a single statement, with two scans, eliminating one ORDER BY. Let me know if something like this does what you seek.
with ranked as (
select
id, row_number() over (order by date_registered desc) as rn
from people
where surname = 'foo'
)
update people p
set
status = case when r.rn <= 5 then 'X' else 'Y' end
from ranked r
where
p.id = r.id
Any time you do an update from another data set, it's helpful to have a where clause that defines the relationship between the two datasets (the non-ANSI join syntax). This makes it iron-clad what you are updating.
Also I believe this code is pretty readable so it will be easier to build on if you need to make tweaks.
Let me know if I missed the boat.
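For anyone who wants to try the ranked-update idea without a Postgres instance handy, here is a rough equivalent in SQLite through Python. Since UPDATE ... FROM only arrived in SQLite 3.33, the rank is pulled in through a correlated subquery instead; the schema is a minimal stand-in for the people table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, surname TEXT, "
             "date_registered TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO people (surname, date_registered) VALUES (?, ?)",
    [("foo", f"2020-01-{d:02d}") for d in range(1, 9)],
)

# Rank everyone by registration date (newest first) and set the top 5
# to 'X', the rest to 'Y', in a single statement.
conn.execute("""
    UPDATE people
    SET status = (
        SELECT CASE WHEN r.rn <= 5 THEN 'X' ELSE 'Y' END
        FROM (SELECT id, row_number() OVER
                        (ORDER BY date_registered DESC) AS rn
              FROM people WHERE surname = 'foo') AS r
        WHERE r.id = people.id)
    WHERE surname = 'foo'
""")
counts = dict(conn.execute(
    "SELECT status, count(*) FROM people GROUP BY status").fetchall())
print(counts)  # {'X': 5, 'Y': 3}
```

With eight rows, the five newest registrations end up with status X and the remaining three with Y, which is the split the question asks for.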
So after more tinkering, I've come up with a solution.
The previous query fails because the IDs in the subqueries are not grouped into arrays, so the result expands into a huge list, as I suspected.
The solution is to group the IDs in each subquery into an ARRAY -- that way they are returned as a single value in the ids column.
This is the query that does the job. Note that we must unnest the IDs in the WHERE clause:
WITH x_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
LIMIT 5
), y_status AS (
SELECT id
FROM people
WHERE surname = 'foo'
ORDER BY date_registered DESC
OFFSET 5
)
UPDATE people
SET status = folks.status
FROM (values
(ARRAY(SELECT id from x_status), 'X'),
(ARRAY(SELECT id from y_status), 'Y')
) as folks (ids, status)
WHERE id IN (SELECT * from unnest(folks.ids));
Please help me with the following problem. I have already spent a week trying to put all the logic into one SQL query, but still have no elegant result. I hope the SQL experts can give me a hint.
I have a table which has 4 fields: date, expire_month, expire_year and value. The primary key is defined on the first three fields, so for a given date several values are present with different expire_month and expire_year. I need to choose one value from them for every date present in the table.
For example, when I execute a query:
SELECT date, expire_month, expire_year, value FROM futures
WHERE date = '1989-12-01' ORDER BY expire_year, expire_month;
I get a list of values for the same date sorted by expiry (months are coded with letters):
1989-12-01 Z 1989 408.25
1989-12-01 H 1990 408.25
1989-12-01 K 1990 389
1989-12-01 N 1990 359.75
1989-12-01 U 1990 364.5
1989-12-01 Z 1990 375
The correct single value for that date is the k-th record from the top. For example, if k is 2 then the «correct single» record would be:
1989-12-01 H 1990 408.25
How can I select these «correct single» values for every date in my table?
You can do it with rank():
select t.date, t.expire_month, t.expire_year, t.value from (
select *,
rank() over(partition by date order by expire_year, expire_month) rn
from futures
) t
where t.rn = 2
The column rn in the subquery is the rank of the row within its date group. Change 2 to the rank you want.
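A quick way to verify this on the sample data is SQLite through Python (window functions need SQLite 3.25 or later). Note that for these month codes, alphabetical order happens to match expiry order, so ORDER BY expire_month works as-is:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE futures (date TEXT, expire_month TEXT, "
             "expire_year INTEGER, value REAL)")
conn.executemany("INSERT INTO futures VALUES (?, ?, ?, ?)", [
    ("1989-12-01", "Z", 1989, 408.25),
    ("1989-12-01", "H", 1990, 408.25),
    ("1989-12-01", "K", 1990, 389),
    ("1989-12-01", "N", 1990, 359.75),
    ("1989-12-01", "U", 1990, 364.5),
    ("1989-12-01", "Z", 1990, 375),
])

K = 2  # pick the k-th contract by expiry for every date
row = conn.execute("""
    SELECT date, expire_month, expire_year, value
    FROM (SELECT *, rank() OVER (PARTITION BY date
                                 ORDER BY expire_year, expire_month) AS rn
          FROM futures)
    WHERE rn = ?
""", (K,)).fetchone()
print(row)  # ('1989-12-01', 'H', 1990, 408.25)
```

With K = 2 this returns the H 1990 row, matching the expected «correct single» record from the question.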
While forpas's answer is the better one (though I think I'd use row_number() instead of rank() here), window functions are a fairly recent addition to SQLite (in 3.25). If you're stuck on an old version and can't upgrade, here's an alternative:
SELECT date, expire_month, expire_year, value
FROM futures AS f
WHERE (date, expire_month, expire_year) =
(SELECT f2.date, f2.expire_month, f2.expire_year
FROM futures AS f2
WHERE f.date = f2.date
ORDER BY f2.expire_year, f2.expire_month
LIMIT 1 OFFSET 1)
ORDER BY date;
The OFFSET value is 1 less than the Kth row - so 1 for the second row, 2 for the third row, etc.
It executes a correlated subquery for every row in the table, though, which isn't ideal. Hopefully your composite primary key columns are in the order date, expire_year, expire_month, which will help a lot by eliminating the need for additional sorting in it.
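Here is that pre-window-function query run on the sample data, using SQLite through Python (row-value comparisons need SQLite 3.15 or later; the table is a stand-in with the question's values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE futures (date TEXT, expire_month TEXT, "
             "expire_year INTEGER, value REAL)")
conn.executemany("INSERT INTO futures VALUES (?, ?, ?, ?)", [
    ("1989-12-01", "Z", 1989, 408.25),
    ("1989-12-01", "H", 1990, 408.25),
    ("1989-12-01", "K", 1990, 389),
    ("1989-12-01", "N", 1990, 359.75),
    ("1989-12-01", "U", 1990, 364.5),
    ("1989-12-01", "Z", 1990, 375),
])

# Row-value comparison against a correlated, ordered subquery:
# LIMIT 1 OFFSET 1 picks the 2nd-earliest expiry for each row's date.
rows = conn.execute("""
    SELECT date, expire_month, expire_year, value
    FROM futures AS f
    WHERE (date, expire_month, expire_year) =
          (SELECT f2.date, f2.expire_month, f2.expire_year
           FROM futures AS f2
           WHERE f.date = f2.date
           ORDER BY f2.expire_year, f2.expire_month
           LIMIT 1 OFFSET 1)
    ORDER BY date
""").fetchall()
print(rows)  # [('1989-12-01', 'H', 1990, 408.25)]
```

It produces the same H 1990 row as the window-function version, one row per date.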
You can try the following query.
select * from
(
SELECT rownum seq, t.* FROM
(
SELECT date1, expire_month, expire_year, value FROM testtable
WHERE date1 = to_date('1989-12-01','yyyy-mm-dd')
ORDER BY expire_year, expire_month
) t
)
where seq=2
I have a question on how to write a self-join query.
The online session table holds all user activities. Each activity has a state ID and a timestamp recording the user's login time.
For example:
State TimeStamp User
X 1300 A
Y 1700 A
X 0700 B
Z 1500 B
Y 1600 B
X 2100 C
A little explanation: in the above table, user A logged in to state X at 1300, then logged in to state Y at 1700, so user A spent 0400 (assume that means 4 hours) in state X.
The same logic applies to user B.
Then for user C, who never changed state, we use the current time minus the login timestamp of X.
The output should look like:
State Time User
X 0400(or 4) A
X 0800(or 8) B
Z 0100(or 1) B
X result of Now-2100 C
.......
Edit: just to make the problem clearer, let's now assume it's the SQL Server DBMS, but it's OK to use another DBMS.
The input timestamps are stored in the default datetime format, YYYY-MM-DD HH:MM:SS.
You didn't mention which DBMS you're using, so I'm writing this how I'd do it in MS SQL Server (TSQL). You'll need access to the LAG function, which is not universal.
What LAG does is allow you to compare values from a previous row, based on some shared column value, in this case User. This code catches those comparisons in the prev_ fields. I'm using count() to differentiate users with more than one line from users with only one line. The single-line users are handled separately after the union all.
You'll notice that I'm not using your field names until the final output step. This is because State, Timestamp and User are all reserved words, i.e. words that do something in SQL code. I strongly recommend you use field names that are not reserved words.
This code does have a major limitation; it doesn't work for the now-minus-time portion if it's not the same day. So in your example, it would have to be between 21:01 and 23:59 the same day for it to work. If you wanted to do this robustly you'd use datetime format for your times, which would make this a lot easier and eliminate the limitation. But this answer is for your data, so:
SELECT
b.prev_state AS [State]
,b.Online_time - b.prev_time AS [Time]
,b.U_ID as [User]
FROM
(SELECT
t.Online_state
,t.U_ID
,t.Online_time
,LAG(t.online_time) OVER (PARTITION BY t.U_ID ORDER BY t.U_ID, t.online_time) AS prev_time
,LAG(t.online_state) OVER (PARTITION BY t.U_ID ORDER BY t.U_ID, t.online_time) AS prev_state
FROM online_t AS t
inner join
(SELECT
U_ID,
count(U_ID) AS tot
FROM online_t
GROUP BY U_ID) AS a
on t.U_ID = a.U_ID
WHERE tot > 1) AS b
WHERE prev_time is not null
union all
SELECT
t.Online_state AS [State]
,concat(datepart(hh,getdate()),'00') - t.Online_time AS [Time]
,t.U_ID AS [USER]
FROM online_t AS t
inner join
(SELECT
U_ID
,count(U_ID) as tot
FROM online_t
GROUP BY U_ID) as a
on t.U_ID = a.U_ID
WHERE tot = 1
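A stripped-down sketch of the same windowed idea, run in SQLite through Python with invented column names (online_state, online_time, u_id) and proper datetime strings. It uses lead() to fetch the same user's next login rather than lag(), and simply drops each user's open final state instead of comparing it to now:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE online_t (online_state TEXT, online_time TEXT, u_id TEXT)")
conn.executemany("INSERT INTO online_t VALUES (?, ?, ?)", [
    ("X", "2017-03-15 13:00:00", "A"),
    ("Y", "2017-03-15 17:00:00", "A"),
    ("X", "2017-03-15 07:00:00", "B"),
    ("Z", "2017-03-15 15:00:00", "B"),
    ("Y", "2017-03-15 16:00:00", "B"),
    ("X", "2017-03-15 21:00:00", "C"),
])

# For each row, lead() fetches the same user's next login; the gap
# between the two timestamps is the time spent in the current state.
# Rows with no successor (each user's last state) are left out here.
rows = conn.execute("""
    SELECT online_state,
           CAST(round((julianday(next_time) - julianday(online_time)) * 24)
                AS INTEGER) AS hours,
           u_id
    FROM (SELECT *, lead(online_time) OVER
                      (PARTITION BY u_id ORDER BY online_time) AS next_time
          FROM online_t)
    WHERE next_time IS NOT NULL
    ORDER BY u_id, online_time
""").fetchall()
print(rows)  # [('X', 4, 'A'), ('X', 8, 'B'), ('Z', 1, 'B')]
```

Storing real datetimes, as suggested above, is what makes the subtraction trivial and removes the same-day limitation.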
I have a solution using Oracle analytic functions, which may not be available to you. I'm also treating the timestamps as Oracle varchars.
I'm using LEAD() in a subquery to return the "next user" and the "next time".
Then using a CASE statement to handle the different scenarios.
SELECT M.THESTATE,
CASE
WHEN M.USERID = M2.NEXT_USER THEN M2.NEXT_TIME-M.THETIME
WHEN M.USERID <> M2.NEXT_USER THEN NULL
ELSE M.THETIME-0 END AS TOTALTIME
,M.USERID
FROM MYTEST M
JOIN
(
SELECT USERID, THESTATE, THETIME
,LEAD(THETIME) OVER (ORDER BY USERID, THETIME) AS NEXT_TIME
,LEAD(USERID) OVER (ORDER BY USERID, THETIME) AS NEXT_USER
FROM MYTEST
ORDER BY USERID
) M2 ON M2.USERID = M.USERID AND M2.THESTATE=M.THESTATE
WHERE
CASE
WHEN M.USERID = M2.NEXT_USER THEN M2.NEXT_TIME-M.THETIME
WHEN M.USERID <> M2.NEXT_USER THEN NULL
ELSE M.THETIME-0 END
IS NOT NULL;
Including your input in a WITH clause (I use the TIMESTAMP type for your "timestamp", and some databases don't like it if you use reserved words ("user", "timestamp") for column names), try this:
WITH
-- input, don't use in query
input(state,"timestamp","user") AS (
SELECT 'X',TIMESTAMP '2017-03-15 13:00:00','A'
UNION ALL SELECT 'Y',TIMESTAMP '2017-03-15 17:00:00','A'
UNION ALL SELECT 'X',TIMESTAMP '2017-03-15 07:00:00','B'
UNION ALL SELECT 'Z',TIMESTAMP '2017-03-15 15:00:00','B'
UNION ALL SELECT 'Y',TIMESTAMP '2017-03-15 16:00:00','B'
UNION ALL SELECT 'X',TIMESTAMP '2017-03-15 21:00:00','C'
)
,
-- start real query here, comma above would
-- be the WITH keyword
state_duration_user AS (
SELECT
state
, IFNULL(
LEAD("timestamp") OVER(PARTITION BY "user" ORDER BY "timestamp")
, CURRENT_TIMESTAMP
) - "timestamp"
AS "time"
, "user"
FROM input
)
SELECT
state
, CAST(SUM("time") AS TIME(0)) AS "time"
, "user"
FROM state_duration_user
GROUP BY
state
, "user"
;
state|time |user
X |04:00:00|A
X |08:00:00|B
Z |01:00:00|B
Y |(now - 17:00)|A
Y |(now - 16:00)|B
X |(now - 21:00)|C
I have a slots table like this:
Column | Type |
------------+-----------------------------+
id | integer |
begin_at | timestamp without time zone |
end_at | timestamp without time zone |
user_id | integer |
and I like to select merged rows for continuous time. Let's say I have (simplified) data like :
(1, 5:15, 5:30, 1)
(2, 5:15, 5:30, 2)
(3, 5:30, 5:45, 2)
(4, 5:45, 6:00, 2)
(5, 8:15, 8:30, 2)
(6, 8:30, 8:45, 2)
I would like to know if it's possible to select rows formatted like :
(5:15, 5:30, 1)
(5:15, 6:00, 2) // <======= rows id 2,3 and 4 merged
(8:15, 8:45, 2) // <======= rows id 5 and 6 merged
EDIT:
Here's the SQLfiddle
I'm using Postgresql, version 9.3!
Thank you!
Here is one method for solving this problem. Create a flag that marks when a record does not overlap with the previous one -- that is the start of a group. Then take the cumulative sum of this flag and use it for grouping:
select user_id, min(begin_at) as begin_at, max(end_at) as end_at
from (select s.*, sum(startflag) over (partition by user_id order by begin_at) as grp
from (select s.*,
(case when lag(end_at) over (partition by user_id order by begin_at) >= begin_at
then 0 else 1
end) as startflag
from slots s
) s
) s
group by user_id, grp;
Here is a SQL Fiddle.
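The gaps-and-islands pattern above also runs unchanged in SQLite (3.25 or later), which makes it easy to test locally; here is the question's sample data pushed through it via Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slots (id INTEGER, begin_at TEXT, "
             "end_at TEXT, user_id INTEGER)")
conn.executemany("INSERT INTO slots VALUES (?, ?, ?, ?)", [
    (1, "05:15", "05:30", 1),
    (2, "05:15", "05:30", 2),
    (3, "05:30", "05:45", 2),
    (4, "05:45", "06:00", 2),
    (5, "08:15", "08:30", 2),
    (6, "08:30", "08:45", 2),
])

# startflag is 1 whenever a slot does not touch the previous slot of
# the same user; the running sum of the flag labels each island of
# contiguous slots, which is then collapsed with min/max.
rows = conn.execute("""
    SELECT min(begin_at), max(end_at), user_id
    FROM (SELECT *, sum(startflag) OVER
                      (PARTITION BY user_id ORDER BY begin_at) AS grp
          FROM (SELECT *, CASE WHEN lag(end_at) OVER
                                      (PARTITION BY user_id
                                       ORDER BY begin_at) >= begin_at
                               THEN 0 ELSE 1 END AS startflag
                FROM slots))
    GROUP BY user_id, grp
    ORDER BY user_id, min(begin_at)
""").fetchall()
print(rows)  # [('05:15', '05:30', 1), ('05:15', '06:00', 2), ('08:15', '08:45', 2)]
```

Rows 2, 3 and 4 collapse into one island and rows 5 and 6 into another, exactly the merged output the question asks for.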
Gordon Linoff already provided the answer (I upvoted).
I've used the same approach, but wanted to deal with tsrange type.
So I came up with this construct:
SELECT min(id) b_id, min(begin_at) b_at, max(end_at) e_at, grp, user_id
FROM (
SELECT t.*, sum(g) OVER (ORDER BY id) grp
FROM (
SELECT s.*, (NOT r -|- lag(r,1,r)
OVER (PARTITION BY user_id ORDER BY id))::int g
FROM (SELECT id,begin_at,end_at,user_id,
tsrange(begin_at,end_at,'[)') r FROM slots) s
) t
) u
GROUP BY grp, user_id
ORDER BY grp;
Unfortunately, at the top level one has to use min(begin_at) and max(end_at), as there are no aggregate functions for the range union operator (+).
I create ranges with exclusive upper bounds, this allows me to use “is adjacent to” (-|-) operator. I compare current tsrange with the one on the previous row, defaulting to the current one in case there's no previous. Then I negate the comparison and cast to integer, which gives me 1 in cases when new group starts.
I get a null value when using the CORR() function on a table-join query. However, on a query without a join the CORR() function returns a value, and I do get values for the other fields. I have tried giving the fields aliases, or no aliases, but I can't seem to get a value for correlation in Query 2.
Thanks in advance.
Query 1
Returns a value for correlation. Query and result json link below.
select DATE(Time ) as date, ROUND(AVG(Price),2) as price, ROUND(SUM(amount),2) as volume, CORR(price, amount) as correlation
from
ds_5.tb_4981, ds_5.tb_4978, ds_5.tb_4967
where YEAR(Time) = 2014
group by date
order by date ASC
Query 1 result json: https://json.datadives.com/64cbd7a4a5aba3a864b17a719148620f.json
Query 2
Null value for correlation. Query and result json link below.
select bitcoin.date as date, bitcoin.btcprice, blockchain.trans_vol, CORR(bitcoin.btcprice,blockchain.trans_vol) as correlation
from
(select DATE(time) as date, AVG(price) as btcprice
from
ds_5.tb_4981, ds_5.tb_4978, ds_5.tb_4967
where YEAR(Time) = 2014
group by date) as bitcoin
JOIN
(select
DATE(blocktime) as date, SUM(vout.value) as trans_vol
from ds_14.tb_7917, ds_14.tb_7918, ds_14.tb_7919, ds_14.tb_7920, ds_14.tb_7921, ds_14.tb_7922, ds_14.tb_7923, ds_14.tb_7924, ds_14.tb_7925, ds_14.tb_7926, ds_14.tb_7927, ds_14.tb_7928, ds_14.tb_7934, ds_14.tb_7972, ds_14.tb_8016, ds_14.tb_8086, ds_14.tb_9743, ds_14.tb_9888, ds_14.tb_10084, ds_14.tb_10136, ds_14.tb_10500, ds_14.tb_10601
where YEAR(blocktime) = 2014
group by Date) as blockchain
on bitcoin.date = blockchain.date
group each by date, bitcoin.btcprice, blockchain.trans_vol
order by date ASC
Query 2 result json: https://json.datadives.com/9427dc9f51ba36add5f008403def7b6d.json
I took the CSV you linked and left it here: https://bigquery.cloud.google.com/table/fh-bigquery:public_dump.datadivescsv
(I'm not sure why you would prefer to share the CSV as a file, instead of creating a public dataset in BigQuery and sharing the link.)
So this works:
SELECT CORR(btc_price, trans_vol)
FROM [fh-bigquery:public_dump.datadivescsv]
-0.004957046970769512
But this doesn't:
SELECT CORR(btc_price, trans_vol)
FROM [fh-bigquery:public_dump.datadivescsv]
GROUP BY date
null
null
...
null
And that's expected!
Why: to compute a correlation we need at least two pairs of numbers. Grouping by date in the second query leaves us with groups of one element each, so the correlation cannot be computed.
(Side note: the correlation between 2 pairs is always 1 or -1. We really need at least 3, and many more for the results to be significant.)
SELECT CORR(x, y)
FROM (SELECT 1 x, 2 y)
null
SELECT CORR(x, y)
FROM (SELECT 1 x, 2 y), (SELECT 3 x, 8 y)
1.0
SELECT CORR(x, y)
FROM (SELECT 1 x, 2 y), (SELECT 3 x, 8 y), (SELECT 7 x, 1 y)
-0.3170147297373293
... and so on
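The same behavior is easy to reproduce outside BigQuery. Here is a small hand-rolled Pearson correlation in Python (a sketch, not BigQuery's implementation) showing nan for one point, ±1 for two, and the three-point value from the queries above:

```python
from math import sqrt, isnan

def corr(xs, ys):
    """Sample Pearson correlation; nan when fewer than two points
    or when either column has no variance."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

print(corr([1], [2]))              # nan (a single point, like each date group)
print(corr([1, 3], [2, 8]))        # 1.0 (two points always give +/-1)
print(corr([1, 3, 7], [2, 8, 1]))  # about -0.317, matching the query above
```

This is why adding GROUP BY date to Query 2 turns every correlation into null: each group holds exactly one (price, volume) pair.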