Transitive SQL query on the same table

Hey, consider the following table and data...
in_timestamp | out_timestamp | name | in_id | out_id | in_server | out_server | status
timestamp1 | timestamp2 | data1 |id1 | id2 | others-server1 | my-server1 | success
timestamp2 | timestamp3 | data1 | id2 | id3 | my-server1 | my-server2 | success
timestamp3 | timestamp4 | data1 | id3 | id4 | my-server2 | my-server3 | success
timestamp4 | timestamp5 | data1 | id4 | id5 | my-server3 | others-server2 | success
The above data represents the log of an execution flow of some data across servers.
E.g. some data has flowed from 'others-server1' through a bunch of 'my-servers' and finally to the destined 'others-server2'.
Question :
1) I need to give this log to the client in a presentable form, where he doesn't need to know anything about the bunch of 'my-servers'. All I am supposed to give is the timestamp when the data entered my infrastructure and when it left, drilling down to the following info:
in_timestamp (of 'others_server1' to 'my-server1')
out_timestamp (of 'my-server3' to 'others-server2')
name
status
I want to write SQL for this. Can someone help?
NOTE: there might not be 3 'my-servers' all the time; it differs from situation to situation. E.g. there might be 4 'my-servers' involved for, say, data2!
2) Are there any other alternatives to SQL? I mean stored procs/etc?
3) Optimizations? (The records are huge in number! As of now, it is around 5 million a day, and we are supposed to show records that are up to a week old.)
In advance, THANKS FOR THE HELP! :)

WITH RECURSIVE foo AS
(
SELECT *, in_timestamp AS timestamp1, 1 AS hop, ARRAY[in_id] AS hops
FROM log_parsing.log_of_sent_mails
WHERE in_server = 'others-server1'
UNION ALL
SELECT t_alias2.*, foo.timestamp1, foo.hop + 1, foo.hops || t_alias2.in_id
FROM foo
JOIN log_parsing.log_of_sent_mails t_alias2
ON t_alias2.in_id = foo.out_id
)
SELECT *
FROM foo
ORDER BY hop DESC
LIMIT 1

Your table has a hierarchical structure (an adjacency list). This can be queried efficiently in PostgreSQL 8.4 and later using recursive CTEs. Quassnoi has written a blog post about how to implement it. The query you need to write is quite complex, but he explains it well, with examples very similar to what you need. In particular, his last example demonstrates a query that gets the complete path from the first node to the last by using an array.
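To make the technique concrete, here is a minimal runnable sketch of the same chain walk, written in Python against SQLite (which also supports recursive CTEs). The table and server names are simplified stand-ins for the question's log_parsing.log_of_sent_mails:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log (
    in_ts TEXT, out_ts TEXT, name TEXT,
    in_id TEXT, out_id TEXT,
    in_server TEXT, out_server TEXT, status TEXT
);
INSERT INTO log VALUES
 ('t1','t2','data1','id1','id2','others-server1','my-server1','success'),
 ('t2','t3','data1','id2','id3','my-server1','my-server2','success'),
 ('t3','t4','data1','id3','id4','my-server2','my-server3','success'),
 ('t4','t5','data1','id4','id5','my-server3','others-server2','success');
""")

# Walk the chain: anchor on rows entering from outside, then follow
# out_id -> in_id, carrying the entry timestamp and a hop counter.
row = conn.execute("""
WITH RECURSIVE foo(in_ts, out_ts, name, in_id, out_id, start_ts, hop) AS (
    SELECT in_ts, out_ts, name, in_id, out_id, in_ts, 1
    FROM log WHERE in_server = 'others-server1'
    UNION ALL
    SELECT l.in_ts, l.out_ts, l.name, l.in_id, l.out_id, foo.start_ts, foo.hop + 1
    FROM foo JOIN log l ON l.in_id = foo.out_id
)
SELECT name, start_ts, out_ts, hop FROM foo
ORDER BY hop DESC LIMIT 1
""").fetchone()
print(row)  # ('data1', 't1', 't5', 4)
```

The deepest hop pairs the original entry timestamp with the final exit timestamp, which is exactly the client-facing row the question asks for.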

One way of doing it, if the data is STABLE (i.e. never changes once inserted), is to compute the transitive relationships on the fly at insert time (e.g. via a trigger, or by the app which does the insertion).
E.g. you have a new column "start_ts" in your table; when you insert a record:
in_timestamp | out_timestamp | name | in_id | out_id | in_server | out_server | status
timestamp3 | timestamp4 | data1 | id3 | id4 | my-server2 | my-server3 | success
... then your logic automatically finds the record with name=data1 and out_id=id3 and clones its start_ts into the newly inserted record. You may also need some special logic for propagating the last status, depending on how you compute those transitive values.
BTW, you don't necessarily have to look up the previous (name=data1 and out_id=id3) record - you can persist the start_ts value in the data record's metadata itself while processing.
Then the final report is simply "select start_ts, out_ts from T where out_server = 'others-server2'" (of course a bit more complicated as far as out_server and status are concerned, but still a single simple select).
A second option is of course the more straightforward loop computing the resulting report - google or search Stack Overflow for SQL BFS implementations if you're not sure how.
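A minimal sketch of the insert-time idea, in Python with SQLite standing in for the real database (table and column names are illustrative): the inserting code looks up the predecessor hop and clones its start_ts, so the client report becomes one flat SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE log (
    in_ts TEXT, out_ts TEXT, name TEXT, in_id TEXT, out_id TEXT,
    in_server TEXT, out_server TEXT, status TEXT, start_ts TEXT)""")

def insert_hop(row):
    # Find the hop whose out_id matches this row's in_id and inherit
    # its start_ts; fall back to this row's own in_ts for the first hop.
    prev = conn.execute(
        "SELECT start_ts FROM log WHERE name = ? AND out_id = ?",
        (row[2], row[3])).fetchone()
    start_ts = prev[0] if prev else row[0]
    conn.execute("INSERT INTO log VALUES (?,?,?,?,?,?,?,?,?)", row + (start_ts,))

for hop in [
    ('t1','t2','data1','id1','id2','others-server1','my-server1','success'),
    ('t2','t3','data1','id2','id3','my-server1','my-server2','success'),
    ('t3','t4','data1','id3','id4','my-server2','my-server3','success'),
    ('t4','t5','data1','id4','id5','my-server3','others-server2','success'),
]:
    insert_hop(hop)

# The client-facing report is now a plain single-table SELECT.
report = conn.execute("""SELECT name, start_ts, out_ts, status
    FROM log WHERE out_server = 'others-server2'""").fetchall()
print(report)  # [('data1', 't1', 't5', 'success')]
```

The trade-off is one extra lookup per insert in exchange for a report query that stays trivially cheap regardless of chain length.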

@Other Readers:
Refer to the first answer, posted by Mark Byers, first. I used 'answering' rather than 'commenting' on his post since I needed to use tables/links etc., which are not available when commenting on answers. :)
@Mark Byers:
Thanks for the link... It really helped me, and I was able to figure out the way to generate the path between the servers... Have a look at what I was able to do.
in_id | in_timestamp | out_timestamp | name | hops_count | path
id1 | timestamp1 | timestamp2 | data1 | 1 | {id1}
id2 | timestamp2 | timestamp3 | data1 | 2 | {id1,id2}
id3 | timestamp3 | timestamp4 | data1 | 3 | {id1,id2,id3}
id4 | timestamp4 | timestamp5 | data1 | 4 | {id1,id2,id3,id4}
* path is generated using 'in_id'
I used the following query...
WITH RECURSIVE foo AS
(
SELECT t_alias1, 1 AS hops_count, ARRAY[in_id] AS hops
FROM log_parsing.log_of_sent_mails t_alias1
WHERE in_server = 'others-server1'
UNION ALL
SELECT t_alias2, foo.hops_count + 1 AS hops_count, foo.hops || t_alias2.in_id
FROM foo
JOIN log_parsing.log_of_sent_mails t_alias2
ON t_alias2.in_id = (foo.t_alias1).out_id
)
SELECT (foo.t_alias1).in_id,
       (foo.t_alias1).name,
       (foo.t_alias1).in_timestamp,
       hops_count,
       hops::VARCHAR AS path
FROM foo
ORDER BY hops_count
But I could not reach the ultimate stage yet. Here is what I wish to get ultimately...
in_id | in_timestamp | out_timestamp | name | hops_count | path
id4 | timestamp1 | timestamp5 | data1 | 4 | {id1,id2,id3,id4}
* Observe the timestamps. This is required since I do not wish the client to know about the internal infrastructure; for him, the time lag between timestamp1 and timestamp5 is what matters.
Any clue how I could possibly achieve it?
P.S. I would try contacting Quassnoi too. :)
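One possible way to get that final single row is to aggregate over the recursion instead of listing every hop. A sketch in Python/SQLite (string concatenation stands in for the PostgreSQL array; MIN/MAX on the timestamp strings works here only because they sort lexicographically):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log (
    in_ts TEXT, out_ts TEXT, name TEXT, in_id TEXT, out_id TEXT,
    in_server TEXT, out_server TEXT, status TEXT
);
INSERT INTO log VALUES
 ('timestamp1','timestamp2','data1','id1','id2','others-server1','my-server1','success'),
 ('timestamp2','timestamp3','data1','id2','id3','my-server1','my-server2','success'),
 ('timestamp3','timestamp4','data1','id3','id4','my-server2','my-server3','success'),
 ('timestamp4','timestamp5','data1','id4','id5','my-server3','others-server2','success');
""")

# Collapse the per-hop rows into one row per name: the first in_ts,
# the last out_ts, the hop count, and the full path.
row = conn.execute("""
WITH RECURSIVE foo(in_ts, out_ts, name, out_id, hop, path) AS (
    SELECT in_ts, out_ts, name, out_id, 1, in_id
    FROM log WHERE in_server = 'others-server1'
    UNION ALL
    SELECT l.in_ts, l.out_ts, l.name, l.out_id, foo.hop + 1,
           foo.path || ',' || l.in_id
    FROM foo JOIN log l ON l.in_id = foo.out_id
)
SELECT name, MIN(in_ts), MAX(out_ts), MAX(hop), MAX(path)
FROM foo GROUP BY name
""").fetchone()
print(row)  # ('data1', 'timestamp1', 'timestamp5', 4, 'id1,id2,id3,id4')
```

MAX(path) picks the longest path here only because each path extends the previous one; with the PostgreSQL array version, ordering by hops_count DESC and taking one row per name is the safer equivalent.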

Related

SQL structure for multiple queries of the same table (using window function, case, join)

I have a complex production SQL question. It's actually PrestoDB on Hadoop, but it conforms to common SQL.
I've got to get a bunch of metrics from a table, a little like this (sorry if the tables are mangled):
+--------+--------------+------------------+
| device | install_date | customer_account |
+--------+--------------+------------------+
| dev 1 | 1-Jun | 123 |
| dev 1 | 4-Jun | 456 |
| dev 1 | 10-Jun | 789 |
| dev 2 | 20-Jun | 50 |
| dev 2 | 25-Jun | 60 |
+--------+--------------+------------------+
I need something like this:
+--------+------------------+-------------------------+
| device | max_install_date | previous_account_number |
+--------+------------------+-------------------------+
| dev 1 | 10-Jun | 456 |
| dev 2 | 25-Jun | 50 |
+--------+------------------+-------------------------+
I can do two separate queries to get max install date and previous account number, like this:
select device, max(install_date) as max_install_date
from (select [a whole bunch of stuff], dense_rank() over(partition by device order by [something_else]) rnk
from some_table a
)
But how do you combine them into one query to get one line for each device? I have RANK, WITH statements, CASE statements, and one join. They all work individually, but I'm banging my head trying to understand how to combine them all.
I need to understand how to structure big queries.
P.S. Any good books you'd recommend on advanced SQL for data analysis? I see a bunch on Amazon, but nothing that tells me how to construct big queries like this. I'm not a DBA, I'm a data guy.
Thanks.
You can use a correlated subquery approach:
select t.*
from table t
where install_date = (select max(install_date) from table t1 where t1.device = t.device);
This assumes install_date has a reasonable date format.
I think you want:
select t.*
from (select t.*, max(install_date) over (partition by device) as max_install_date,
lag(customer_account) over (partition by device order by install_date) as prev_customer_account
from t
) t
where install_date = max_install_date;
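A runnable sketch of that combined window-function query, here in Python on SQLite 3.25+ (which supports the same MAX() OVER and LAG() functions); the dates are made concrete as ISO strings so that string MAX behaves like a date MAX:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (device TEXT, install_date TEXT, customer_account INTEGER);
INSERT INTO t VALUES
 ('dev 1','2019-06-01',123), ('dev 1','2019-06-04',456), ('dev 1','2019-06-10',789),
 ('dev 2','2019-06-20',50),  ('dev 2','2019-06-25',60);
""")

# Compute both window functions in one pass over the table, then keep
# only the row where install_date equals the per-device maximum.
rows = conn.execute("""
SELECT device, install_date AS max_install_date,
       prev_customer_account AS previous_account_number
FROM (SELECT t.*,
             MAX(install_date) OVER (PARTITION BY device) AS max_install_date,
             LAG(customer_account) OVER (PARTITION BY device
                                         ORDER BY install_date) AS prev_customer_account
      FROM t)
WHERE install_date = max_install_date
ORDER BY device
""").fetchall()
print(rows)  # [('dev 1', '2019-06-10', 456), ('dev 2', '2019-06-25', 50)]
```

The key structural idea: compute every windowed metric in one derived table, then filter in the outer query, since window results can't be referenced in the same SELECT's WHERE clause.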

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly about hierarchical data structures as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there is a way to do this in a specific case I have, where I don't necessarily have a parent id. My data will look like this when I initially load the file:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, its a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know it's probably possible, I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns, and use the view as a way of joining repeatedly on those two columns. However, I have some problems:
I don't know how many sections will occur in the file for the same object.
The file can contain other objects as well, so joining on the first two columns would be problematic if you have something like:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I basically had to write a recursive descent parser for the specific file format, and it simply took too long because I had to get the data into the database afterwards, and it was too much for Entity Framework. It was taking hours just to convert one file, since these files are excessively large.
Either way, @Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):
+----+------------+--------------------------+----------------------------------+
| ID | header     | x                        | value                            |
+----+------------+--------------------------+----------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8         | 3,Formula,1,2,3,4,5,6,7,8        |
| 2  | ,,         | ,1,x,y,z,t,u,v           | ,1,x,y,z,t,u,v                   |
| 3  | ,,         | ,2,q,r,s,l,m,n           | ,2,q,r,s,l,m,n                   |
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst |
| 5  | ,4,        | ,Data,1,2,3,4            | ,4,Data,1,2,3,4                  |
| 6  | *,record,  | ,lmn,opq,rst             | ,lmn,opq,rst                     |
| 7  | ,,         | ,1,t,u,v                 | ,1,t,u,v                         |
+----+------------+--------------------------+----------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis; it's more of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.
How you get the ID column depends on what you want. For instance, in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrence of that sub-string? Code accordingly. If you want 5, it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
Here is a sample, but it differs somewhat from what you need.
That is because I use the value up to the second comma as the group header, so ,,1 and ,,2 are treated as the same group; if you can use a parent id to indicate a group, that would be better.
DECLARE @testdata TABLE(ID int,Line varchar(8000))
INSERT INTO @testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'
;WITH t AS(
SELECT *,REPLACE(SUBSTRING(t.Line,LEN(c.header)+1,LEN(t.Line)),',...','') AS data
FROM @testdata AS t
CROSS APPLY(VALUES(LEFT(t.Line,CHARINDEX(',',t.Line, CHARINDEX(',',t.Line)+1 )))) c(header)
)
SELECT MIN(ID) AS ID,t.header,c.x,t.header+STUFF(c.x,1,1,'') AS value
FROM t
OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header=t.header FOR XML PATH('') ) c(x)
GROUP BY t.header,c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+
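The same group-concatenation logic, sketched in plain Python (with the trailing ellipses dropped from the sample lines): key each line by its prefix up to the second comma and append the remainders in ID order. It reproduces the caveat above, i.e. the ,,1 and ,,2 lines fall into one group because they share the ,, header.

```python
lines = [
    (1, '3,Formula,1,2,3,4'),
    (2, '*,record,abc,efg,hij'),
    (3, ',,1,x,y,z'),
    (4, ',,2,q,r,s'),
    (5, '3,Formula,5,6,7,8'),
    (6, '*,record,lmn,opq,rst'),
]

groups = {}  # header -> (first ID seen, concatenated remainder)
for line_id, line in lines:
    # Split at the second comma: prefix is the group header.
    second_comma = line.index(',', line.index(',') + 1)
    header, rest = line[:second_comma + 1], line[second_comma + 1:]
    if header in groups:
        first_id, acc = groups[header]
        groups[header] = (first_id, acc + ',' + rest)
    else:
        groups[header] = (line_id, rest)

merged = [(i, h + rest) for h, (i, rest) in groups.items()]
print(sorted(merged))
# [(1, '3,Formula,1,2,3,4,5,6,7,8'),
#  (2, '*,record,abc,efg,hij,lmn,opq,rst'),
#  (3, ',,1,x,y,z,2,q,r,s')]
```

As the answers note, there is nothing recursive here: it is a grouping key plus string aggregation, which is exactly what the STUFF..FOR XML / GROUP BY query does in T-SQL.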

How to search for a text in all rows, without specifying each column separately

E.g.
Given the following table and data, find the rows that contain the word 'on' (case insensitive)
create table t (i int,dt date,s1 string,s2 string,s3 string)
;
insert into t
select inline
(
array
(
struct(1,date '2017-03-15','Now we take our time','so nonchalant','And spend our nights so bon vivant')
,struct(2,date '2017-03-16','Quick as a wink','She changed her mind','She stood on the tracks')
,struct(3,date '2017-03-17','But I’m talking a Greyhound','On the Hudson River Line','I’m in a New York state of mind')
)
)
;
select * from t
;
+-----+------------+-----------------------------+--------------------------+------------------------------------+
| t.i | t.dt | t.s1 | t.s2 | t.s3 |
+-----+------------+-----------------------------+--------------------------+------------------------------------+
| 1 | 2017-03-15 | Now we take our time | so nonchalant | And spend our nights so bon vivant |
| 2 | 2017-03-16 | Quick as a wink | She changed her mind | She stood on the tracks |
| 3 | 2017-03-17 | But I’m talking a Greyhound | On the Hudson River Line | I’m in a New York state of mind |
+-----+------------+-----------------------------+--------------------------+------------------------------------+
The easy (but limited) solution
This solution is relevant to tables that contain "primitive" types only
(no structs, arrays, maps etc.).
The problem with that solution is that all the columns are concatenated without a separator (no, concat_ws(*) yields an exception), so words at the boundaries become a single word, e.g. Greyhound and On become GreyhoundOn.
select i
,regexp_replace(concat(*),'(?i)on','==>$0<==') as rec
from t
where concat(*) rlike '(?i)on'
;
+---+-----------------------------------------------------------------------------------------------------------+
| | rec |
+---+-----------------------------------------------------------------------------------------------------------+
| 1 | 12017-03-15Now we take our timeso n==>on<==chalantAnd spend our nights so b==>on<== vivant |
| 2 | 22017-03-16Quick as a winkShe changed her mindShe stood ==>on<== the tracks |
| 3 | 32017-03-17But I’m talking a Greyhound==>On<== the Huds==>on<== River LineI’m in a New York state of mind |
+---+-----------------------------------------------------------------------------------------------------------+
The complex (but agile) solution
This solution is relevant to tables that contain "primitive" types only
(no structs, arrays, maps etc.).
I pushed the envelope here, but succeeded in generating a delimited string with all the columns.
Now it is possible to look for whole words.
(?ix) http://www.regular-expressions.info/modifiers.html
select i
,regexp_replace(concat(*),'(?ix)\\b on \\b','==>$0<==') as delim_rec
from (select i
,printf(concat('%s',repeat('|||%s',field(unhex(1),*,unhex(1))-2)),*) as delim_rec
from t
) t
where delim_rec rlike '(?ix)\\b on \\b'
;
+---+------------------------------------------------------------------------------------------------------------------+
| i | delim_rec |
+---+------------------------------------------------------------------------------------------------------------------+
| 2 | 22|||2017-03-16|||Quick as a wink|||She changed her mind|||She stood ==>on<== the tracks |
| 3 | 33|||2017-03-17|||But I’m talking a Greyhound|||==>On<== the Hudson River Line|||I’m in a New York state of mind |
+---+------------------------------------------------------------------------------------------------------------------+
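The core of the delimiter trick, sketched outside Hive in plain Python: join the columns with a separator so that word boundaries survive concatenation, then do a case-insensitive whole-word search for "on" and mark the matches.

```python
import re

# The question's sample rows (i, dt, s1, s2, s3).
rows = [
    (1, '2017-03-15', 'Now we take our time', 'so nonchalant',
     'And spend our nights so bon vivant'),
    (2, '2017-03-16', 'Quick as a wink', 'She changed her mind',
     'She stood on the tracks'),
    (3, '2017-03-17', 'But I’m talking a Greyhound', 'On the Hudson River Line',
     'I’m in a New York state of mind'),
]

word = re.compile(r'(?i)\bon\b')  # case-insensitive whole word, like (?ix)\b on \b
matches = []
for row in rows:
    # '|||' plays the role of the unhex(1)/\x01 column delimiter.
    delim_rec = '|||'.join(str(c) for c in row)
    if word.search(delim_rec):
        matches.append((row[0], word.sub(lambda m: '==>' + m.group(0) + '<==', delim_rec)))

print([i for i, _ in matches])  # [2, 3]
```

Row 1 is correctly skipped: "nonchalant" and "bon" contain the letters but not the whole word, which is what the delimiter + word-boundary combination buys over plain concat(*).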
Using an additional external table
create external table t_ext (rec string)
row format delimited
fields terminated by '0'
location '/user/hive/warehouse/t'
;
select cast(split(rec,'\\x01')[0] as int) as i
,regexp_replace(regexp_replace(rec,'(?ix)\\b on \\b','==>$0<=='),'\\x01','|||') as rec
from t_ext
where rec rlike '(?ix)\\b on \\b'
;
+---+-----------------------------------------------------------------------------------------------------------------+
| i | rec |
+---+-----------------------------------------------------------------------------------------------------------------+
| 2 | 2|||2017-03-16|||Quick as a wink|||She changed her mind|||She stood ==>on<== the tracks |
| 3 | 3|||2017-03-17|||But I’m talking a Greyhound|||==>On<== the Hudson River Line|||I’m in a New York state of mind |
+---+-----------------------------------------------------------------------------------------------------------------+

Split row into multiple rows in SQL Server for insert

I am trying to create a SQL query that will insert a single row as 2 rows in another table.
My data looks like this:
size | indemnity_factor | monitoring_factor
--------------------------------------------
0 | 1.00 | 1.5
The end data looks like this:
id | claim_component_type_code | size | adjustment_factor | valid_from_date
------------------------------------------------------------------------------
1 | Indemnity | 0 | 2.5000000 | 2014-01-01
1 | Monitoring | 1 | 1.5000000 | 2014-01-01
I want to add an entry of Indemnity and Monitoring for every row in the first data source. I haven't really got an idea how to go about it; I'd be very appreciative if someone could help. Sorry for the rough data, but I can't post images with my reputation, apparently.
Thanks in advance.
Use unpivot
select * from
(select size, indemnity_factor as indemnity, monitoring_factor as monitoring
from yourtable) src
unpivot (adjustment_factor for claim_component_type_code in (indemnity, monitoring) ) u
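For what it's worth, UNPIVOT is T-SQL-specific; the same reshape can be written portably as a UNION ALL. A sketch in Python/SQLite with the question's sample row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE factors (size INTEGER, indemnity_factor REAL, monitoring_factor REAL);
INSERT INTO factors VALUES (0, 1.00, 1.5);
""")

# One SELECT per column being unpivoted, each tagging its rows with the
# component type; UNION ALL stacks them into the long format.
rows = conn.execute("""
SELECT size, 'Indemnity' AS claim_component_type_code,
       indemnity_factor AS adjustment_factor
FROM factors
UNION ALL
SELECT size, 'Monitoring', monitoring_factor
FROM factors
ORDER BY claim_component_type_code
""").fetchall()
print(rows)  # [(0, 'Indemnity', 1.0), (0, 'Monitoring', 1.5)]
```

The UNION ALL form is more verbose than UNPIVOT but works on every SQL engine and makes it easy to add fixed columns such as valid_from_date to each branch.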

How do I query messages stored in a table such that I get messages grouped by sender and the groups sorted by time?

Overall scenario: I am storing conversations in a table. I need to retrieve the messages for a particular location such that they're grouped into conversations, and the groups of conversations are sorted by the most recent message received in that group. This is analogous to how text messages are organized on a phone, or Facebook's news feed ordering. I'm storing the messages in the following schema:
Location_id | SentByUser | Customer | Message | Time
1 | Yes | 555-123-1234 | Hello world | 2013-12-01 10:00:00
1 | No | 555-123-1234 | Thank you | 2013-12-01 12:00:00
1 | Yes | 999-999-9999 | Winter is coming | 2013-12-03 11:00:20
1 | Yes | 555-123-1234 | Foo Bar | 2013-12-02 11:00:00
1 | No | 999-999-9999 | Thank you | 2013-12-04 13:00:00
1 | Yes | 111-111-1111 | Foo Foo Bar | 2013-12-05 01:00:00
In this case, if I was building the conversation tree for location id, I'd want the following output:
Location_id | SentByUser | Customer | Message | Time
1 | Yes | 111-111-1111 | Foo Foo Bar | 2013-12-05 01:00:00
1 | Yes | 999-999-9999 | Winter is coming | 2013-12-03 11:00:20
1 | No | 999-999-9999 | Thank you | 2013-12-04 13:00:00
1 | Yes | 555-123-1234 | Hello world | 2013-12-01 10:00:00
1 | No | 555-123-1234 | Thank you | 2013-12-01 12:00:00
1 | Yes | 555-123-1234 | Foo Bar | 2013-12-02 11:00:00
So what I'd like to do is group all the conversations by the Customer field, then order the groups by Time, and lastly order the messages within each group as well. This is because I'm building an interface that's similar to text messages. For each location there may be hundreds of conversations, and I'm only going to show a handful at a time. If I ensure that my query output is ordered, I don't have to worry about the server maintaining any state. The client can simply say: give me the next 100 messages, etc.
My question is two fold:
1. Is there a simple way to sub-order results? Is there an easy way without doing a complex join back on the table itself or creating a new table to maintain some order?
2. Is the way I'm approaching this good practice? That is, is there a better way to store and retrieve messages such that the server doesn't have to maintain state? Is there a better pattern that I should consider?
I looked at various questions and answers, and the best one I could find was What is the most efficient/elegant way to parse a flat table into a tree?, but it doesn't seem fully applicable to my case because the author is talking about multi-branch trees.
It seems like you want two different queries. This is written in T-SQL for SQL Server, but could easily be adapted for SQLite or MySQL or whatever you're working with.
1) Show me the Customer groups ordered by most recent
select Location_id, Customer, Max(Time) as LatestMessageTime from #Table
group by Location_id, Customer order by LatestMessageTime desc
This would be similar to the first view of your text message application.
2) Show me the Messages in order given a Location_id and Customer
declare @Location int, @Customer varchar(900)
set @Location = 1
set @Customer = '999-999-9999'
select * from #Table where Location_id = @Location and Customer = @Customer
order by Time desc
If you just wanted the sample output, you don't need anything too complex:
select t.*, g.MostRecentTime from #Table t LEFT OUTER JOIN
(select Location_id, Customer, Max(Time) as MostRecentTime from #Table
group by Location_id, Customer) g on g.Location_id = t.Location_id and g.Customer = t.Customer
order by MostRecentTime desc, Location_id, Customer, Time
Here's a SQLFiddle of it: http://sqlfiddle.com/#!6/ae3f8/1/0
I think this is an acceptable way to store the information. As far as retrieving it, I'd have two different stored procedures: give me the 'summary' (1 above), and give me the 'messages' for a given location and customer (2 above). I'd also order by .... Customer, Time desc so that the most recent messages are returned first, and then it goes 'back' into the past rather than loading the oldest first.
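A runnable sketch of the single-query variant (Python/SQLite, using the question's sample data): each message is tagged with its conversation's most recent timestamp via a join back to a grouped subquery, then sorted by that first and by message time within the conversation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE msgs (location_id INT, customer TEXT, message TEXT, time TEXT);
INSERT INTO msgs VALUES
 (1,'555-123-1234','Hello world','2013-12-01 10:00:00'),
 (1,'555-123-1234','Thank you','2013-12-01 12:00:00'),
 (1,'999-999-9999','Winter is coming','2013-12-03 11:00:20'),
 (1,'555-123-1234','Foo Bar','2013-12-02 11:00:00'),
 (1,'999-999-9999','Thank you','2013-12-04 13:00:00'),
 (1,'111-111-1111','Foo Foo Bar','2013-12-05 01:00:00');
""")

# Conversations with the newest activity come first; messages inside a
# conversation run oldest to newest, matching the question's sample output.
rows = conn.execute("""
SELECT m.customer, m.message
FROM msgs m
JOIN (SELECT location_id, customer, MAX(time) AS recent
      FROM msgs GROUP BY location_id, customer) g
  ON g.location_id = m.location_id AND g.customer = m.customer
ORDER BY g.recent DESC, m.customer, m.time
""").fetchall()
print([c for c, _ in rows])
# ['111-111-1111', '999-999-9999', '999-999-9999',
#  '555-123-1234', '555-123-1234', '555-123-1234']
```

The grouped subquery plays the role of the answer's "summary" query, so paging the client through this single ordered result needs no server-side state.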