sequencing data using hive functions

I have a Hive table:
create table abc ( id int, channel string, time int );
insert into table abc values
(1,'a', 12),
(1,'c', 10),
(1,'b', 15),
(2,'a', 15),
(2,'c', 12),
(2,'c', 7);
I want the resulting table to look like this:
id , journey
1, c->a->b
2, c->c->a
The journey column should list the channels in ascending order of time for each id.
I have tried:
select id , concat_ws("->", collect_list(channel)) as journey
from abc
group by id
but it does not preserve the order.

Use a subquery that orders by time (to preserve the order), then apply collect_list with a GROUP BY clause in the outer query.
hive> select id , concat_ws("->", collect_list(channel)) as journey from
(
select * from abc order by time
)t
group by id;
+-----+----------------+--+
| id  | journey        |
+-----+----------------+--+
| 1   | 'c'->'a'->'b'  |
| 2   | 'c'->'c'->'a'  |
+-----+----------------+--+
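As far as I know, Hive does not document a guarantee that collect_list keeps the order produced by a subquery's ORDER BY, so it can be worth checking the output against an independent computation. The intended per-id ordering can be sketched outside Hive in plain Python. This only illustrates the logic on the sample rows from the question; the function name build_journeys is hypothetical.

```python
from collections import defaultdict

# Sample rows from the question: (id, channel, time)
rows = [
    (1, "a", 12), (1, "c", 10), (1, "b", 15),
    (2, "a", 15), (2, "c", 12), (2, "c", 7),
]

def build_journeys(rows, sep="->"):
    """Join each id's channels in ascending order of time."""
    by_id = defaultdict(list)
    for id_, channel, time in rows:
        by_id[id_].append((time, channel))
    # sorted() on (time, channel) tuples orders by time first
    return {
        id_: sep.join(channel for _, channel in sorted(events))
        for id_, events in sorted(by_id.items())
    }

journeys = build_journeys(rows)
for id_, journey in journeys.items():
    print(id_, journey)
```

Sorting inside each group, rather than relying on the order in which rows arrive, is what makes the result independent of how the rows were scanned.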

Related

speeding up recursive query/ looping through values

Suppose a table with this structure:
create table tab1
(
id int,
valid_from timestamp
)
I need to build a query such that when there are duplicates over the pair (id, valid_from), e.g.
id valid_from
1 2000-01-01 12:00:00
1 2000-01-01 12:00:00
then one second needs to be added to the valid_from column of each subsequent duplicate row.
For example, if there are three duplicate rows, the result should be as follows:
id valid_from
1 2000-01-01 12:00:00
1 2000-01-01 12:00:01
1 2000-01-01 12:00:02
I tried running a recursive CTE query, but because some (id, valid_from) pairs have a large number of duplicates (about 160 in the current data set), it is really slow.
Thanks

If the "next second" is not already occupied by another row, then:
WITH TAB (id, valid_from) AS
(
VALUES
(1, TIMESTAMP('2000-01-01 12:00:00'))
, (1, TIMESTAMP('2000-01-01 12:00:00'))
, (1, TIMESTAMP('2000-01-01 12:00:00'))
, (2, TIMESTAMP('2000-01-01 12:00:00'))
, (2, TIMESTAMP('2000-01-01 12:00:01'))
, (2, TIMESTAMP('2000-01-01 12:00:01'))
)
SELECT ID, VALID_FROM
, VALID_FROM + (ROWNUMBER() OVER (PARTITION BY ID, VALID_FROM) - 1) SECOND AS VALID_FROM2
FROM TAB
ORDER BY ID, VALID_FROM2;
The result is:
|ID |VALID_FROM |VALID_FROM2 |
|-----------|--------------------------|--------------------------|
|1 |2000-01-01-12.00.00.000000|2000-01-01-12.00.00.000000|
|1 |2000-01-01-12.00.00.000000|2000-01-01-12.00.01.000000|
|1 |2000-01-01-12.00.00.000000|2000-01-01-12.00.02.000000|
|2 |2000-01-01-12.00.00.000000|2000-01-01-12.00.00.000000|
|2 |2000-01-01-12.00.01.000000|2000-01-01-12.00.01.000000|
|2 |2000-01-01-12.00.01.000000|2000-01-01-12.00.02.000000|

You can use window functions:
select id,
valid_from + (row_number() over (partition by id, valid_from order by valid_from) - 1) second
from t;
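The idea of numbering the rows within each (id, valid_from) pair and adding that number (minus one) as seconds can be traced in plain Python. This is a minimal sketch on the sample values used above, not a replacement for the SQL; the function name spread_duplicates is hypothetical, and like the queries it assumes the shifted timestamps do not collide with seconds that are already occupied.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def spread_duplicates(rows):
    """Add k seconds to the k-th duplicate (k = 0, 1, 2, ...) of each (id, valid_from) pair."""
    seen = defaultdict(int)  # how many times each pair has been seen so far
    result = []
    for id_, valid_from in rows:
        offset = seen[(id_, valid_from)]
        seen[(id_, valid_from)] += 1
        result.append((id_, valid_from + timedelta(seconds=offset)))
    return sorted(result)

t0 = datetime(2000, 1, 1, 12, 0, 0)
t1 = t0 + timedelta(seconds=1)
rows = [(1, t0), (1, t0), (1, t0), (2, t0), (2, t1), (2, t1)]
shifted = spread_duplicates(rows)
for id_, ts in shifted:
    print(id_, ts)
```

Each row is visited once, which is why a row-numbering approach avoids the repeated passes of a recursive query.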

SQL - Create an array based on the values of two other columns

I have the following data:
-----------------------------------------
| client_id | link_hash_a | link_hash_b |
-----------------------------------------
| 1 | abc | xyz |
| 2 | def | xyz |
| 3 | def | uvw |
-----------------------------------------
I would like to use SQL to create an array of the client_id values that are linked through the hash values in the columns link_hash_a and link_hash_b.
In this example, the result would be a single array {1,2,3}, because clients 1 and 2 are linked by the value xyz in the link_hash_b column and clients 2 and 3 are linked by the value def in the link_hash_a column.
Is there a way to do that with an SQL query? Thank you very much for your input.

As an alternative, this approach can be used:
SELECT groupUniqArrayArray(client_ids) client_ids
FROM (
SELECT link_hash, groupArray(client_id) client_ids
FROM (
SELECT DISTINCT client_id, arrayJoin([link_hash_a, link_hash_b]) as link_hash
FROM (
/* test data */
SELECT data.1 client_id, data.2 link_hash_a, data.3 link_hash_b
FROM (
SELECT arrayJoin([
(1, 'abc', 'xyz'),
(2, 'def', 'xyz'),
(3, 'def', 'uvw')]) data)))
GROUP BY link_hash
HAVING count() = 2)
/* result
┌─client_ids─┐
│ [2,1,3] │
└────────────┘
*/
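To make the steps of the query easier to follow, the same grouping can be sketched in plain Python: unpivot the two hash columns, collect the clients per hash value, keep the hash values shared by several clients, and merge the remaining client ids. The function name linked_clients is hypothetical, and the sketch assumes "shared by at least two clients" where the query above uses count() = 2.

```python
from collections import defaultdict

# Sample rows from the question: (client_id, link_hash_a, link_hash_b)
rows = [(1, "abc", "xyz"), (2, "def", "xyz"), (3, "def", "uvw")]

def linked_clients(rows, min_clients=2):
    """Return the sorted ids of clients that share a hash value with another client."""
    clients_by_hash = defaultdict(set)
    for client_id, hash_a, hash_b in rows:
        # Like arrayJoin([link_hash_a, link_hash_b]): one entry per hash value
        clients_by_hash[hash_a].add(client_id)
        clients_by_hash[hash_b].add(client_id)
    linked = set()
    for clients in clients_by_hash.values():
        if len(clients) >= min_clients:  # assumption: ">= 2" rather than "= 2"
            linked |= clients
    return sorted(linked)

result = linked_clients(rows)
print(result)
```

Note that, like the query, this treats values from link_hash_a and link_hash_b as one pool of hash values.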

I think I found a way. I used another column, club_id, which identifies the club the clients belong to. In this case, clients 1, 2 and 3 all belong to club_id 1, for example.
Here is my code in ClickHouse SQL, where input_table is the table of data shown in the question:
SELECT club_id, arrayConcat( clt_a, clt_b ) as tot_clt_arr, arrayUniq( arrayConcat( clt_a, clt_b ) ) as tot_clt
FROM
(
SELECT club_id, clt_a
FROM
(
SELECT club_id, link_hash_a, groupUniqArray(client_id) as clt_a
FROM input_table
GROUP BY club_id, link_hash_a
)
WHERE length(clt_a) >= 2
) JOIN
(
SELECT club_id, clt_b
FROM
(
SELECT club_id, link_hash_b, groupUniqArray(client_id) as clt_b
FROM input_table
GROUP BY club_id, link_hash_b
)
WHERE length(clt_b) >= 2
)
USING club_id
GROUP BY club_id, tot_clt_arr;
It returns the array of client_id as well as the number of unique client_id in the tot_clt column.
Thank you @TomášZáluský for your help.

SQLite query - filter name where each associated id is contained within a set of ids

I'm trying to work out a query that will find me all of the distinct Names whose LocationIDs are in a given set of ids. The catch is if any of the LocationIDs associated with a distinct Name are not in the set, then the Name should not be in the results.
Say I have the following table:
ID | LocationID | ... | Name
-----------------------------
1 | 1 | ... | A
2 | 1 | ... | B
3 | 2 | ... | B
I need a query similar to
SELECT DISTINCT Name FROM table WHERE LocationID IN (1, 2);
The problem with the above is that it only checks whether the LocationID is 1 OR 2, so it returns the following:
A
B
But what I need it to return is
B
Since B is the only Name where both of its LocationIDs are in the set (1, 2)

You can try writing two subqueries:
one gets the count of distinct LocationIDs that match your condition,
the other gets the same count for each Name.
Then join them on the count, so that a Name is returned only when it matches every LocationID in your condition.
Schema (SQLite v3.17)
CREATE TABLE T(
ID int,
LocationID int,
Name varchar(5)
);
INSERT INTO T VALUES (1, 1,'A');
INSERT INTO T VALUES (2, 1,'B');
INSERT INTO T VALUES (3, 2,'B');
Query #1
SELECT t2.Name
FROM
(
SELECT COUNT(DISTINCT LocationID) cnt
FROM T
WHERE LocationID IN (1, 2)
) t1
JOIN
(
SELECT COUNT(DISTINCT LocationID) cnt,Name
FROM T
WHERE LocationID IN (1, 2)
GROUP BY Name
) t2 on t1.cnt = t2.cnt;
| Name |
| ---- |
| B |

You can just use aggregation. Assuming no duplicates in your table:
SELECT Name
FROM table
WHERE LocationID IN (1, 2)
GROUP BY Name
HAVING COUNT(*) = 2;
If Name/LocationID pairs can be duplicated, use HAVING COUNT(DISTINCT LocationID) = 2.
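Because the WHERE clause filters rows before grouping, a Name that also has a LocationID outside the set would still pass the count check. If the condition from the question (no LocationID outside the set) must hold as well, both checks can be moved into HAVING. The following is a minimal sketch using Python's built-in sqlite3 module; the rows for Name C are hypothetical and were added only to exercise that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID INTEGER, LocationID INTEGER, Name TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        (1, 1, "A"),
        (2, 1, "B"), (3, 2, "B"),
        (4, 1, "C"), (5, 2, "C"), (6, 3, "C"),  # hypothetical: extra location 3
    ],
)

# Keep a Name only if it has both locations 1 and 2 and none outside that set.
query = """
    SELECT Name
    FROM T
    GROUP BY Name
    HAVING COUNT(DISTINCT CASE WHEN LocationID IN (1, 2) THEN LocationID END) = 2
       AND SUM(CASE WHEN LocationID NOT IN (1, 2) THEN 1 ELSE 0 END) = 0
    ORDER BY Name
"""
names = [row[0] for row in conn.execute(query)]
print(names)
```

If Names with extra locations should be allowed, dropping the second HAVING condition gives the same behaviour as the WHERE-based query.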

List the names of people that have never scored above 3

From a table like this:
Name | Score
------ | ------
Bill | 1
Bill | 2
Bill | 1
Steve | 1
Steve | 4
Steve | 1
Return the names of people that have never scored above 3
Answer would be:
Name |
------ |
Bill |

The key is to get the maximum score for each person, then keep only those whose maximum is at most 3. To get the maximum you need an aggregate (GROUP BY and MAX). To filter on an aggregate you must use HAVING rather than WHERE. So you would end up with:
SELECT Name, MAX(Score) AS HighScore
FROM Table
GROUP BY Name
HAVING MAX(Score) <= 3;

One solution would be:
SELECT DISTINCT name
FROM mytable
WHERE Name NOT IN
( SELECT Name
FROM mytable
WHERE score > 3
)
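To check that the GROUP BY/HAVING form and the NOT IN form agree on the sample data, both can be run against an in-memory SQLite database from Python. This is only a sketch; the table name scores is an assumption chosen to avoid reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (Name TEXT, Score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("Bill", 1), ("Bill", 2), ("Bill", 1),
     ("Steve", 1), ("Steve", 4), ("Steve", 1)],
)

# Variant 1: aggregate per person, then filter on the maximum.
having_query = """
    SELECT Name FROM scores
    GROUP BY Name
    HAVING MAX(Score) <= 3
    ORDER BY Name
"""
# Variant 2: exclude every person who has at least one score above 3.
not_in_query = """
    SELECT DISTINCT Name FROM scores
    WHERE Name NOT IN (SELECT Name FROM scores WHERE Score > 3)
    ORDER BY Name
"""
having_names = [r[0] for r in conn.execute(having_query)]
not_in_names = [r[0] for r in conn.execute(not_in_query)]
print(having_names, not_in_names)
```

A plain row filter on Score would not be enough here, because Steve also has rows with low scores.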

Sample table:
DECLARE @Table1 TABLE
(Name varchar(5), Score int)
;
INSERT INTO @Table1
(Name, Score)
VALUES
('Bill', 1),
('Bill', 2),
('Bill', 1),
('Steve', 1),
('Steve', 4),
('Steve', 1)
;
Script:
;with CTE AS (
select Name,Score from @Table1
GROUP BY Name,Score
HAVING (Score) > 3 )
Select
NAME,
Score
from @Table1 T
where not EXISTS
(select name from CTE
where name = T.Name )
Result :
NAME Score
Bill 1
Bill 2
Bill 1

SELECT name
FROM table_name
GROUP BY name
HAVING MAX(score) <= 3

SQL Ignore duplicate primary keys

Imagine you have a set of results from a SELECT statement:
ID (pk) Name Address
1 a b
1 c d
1 e f
2 a b
3 a d
2 a d
Is it possible to alter the SQL statement to get ONLY one record for ID 1?
I have a SELECT statement that returns multiple rows that can share the same primary key. I want to take only one of those records when, say, five records have the same primary key.
SQL: http://pastebin.com/cFCBA2Uy
Screenshot: http://i.imgur.com/UlMBZhC.png
What I want is to show only one file which is for e.g. File Number: 925, 890

You stated that it does not matter which row is chosen when there is more than one row for the same id; you just want one row for each id.
The following query does what you asked for:
DECLARE @T table
(
id int,
name varchar(50),
address varchar(50)
)
INSERT INTO @T VALUES
(1, 'a', 'b'),
(1, 'c', 'd'),
(1, 'e', 'f'),
(2, 'a', 'b'),
(3, 'a', 'd'),
(2, 'a', 'd');
WITH A AS
(
SELECT
t.id, t.name, t.address,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY (SELECT NULL)) AS RowNumber
FROM
@T t
)
SELECT
A.id, A.name, A.address
FROM
A
WHERE
A.RowNumber = 1
But I think there should be a criterion. If you find one, express it as the ORDER BY inside the OVER clause.
EDIT:
Here you have the result:
+----+------+---------+
| id | name | address |
+----+------+---------+
| 1 | a | b |
| 2 | a | b |
| 3 | a | d |
+----+------+---------+
Disclaimer: the query I wrote is non-deterministic; different conditions (indexes, statistics, etc.) might lead to different results.
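To illustrate how an explicit criterion makes the choice repeatable, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The tie-breaker ORDER BY name, address is an assumption made for the example; any column that expresses the real preference would work the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "a", "b"), (1, "c", "d"), (1, "e", "f"),
     (2, "a", "b"), (3, "a", "d"), (2, "a", "d")],
)

# Assumed tie-breaker: within each id, keep the row that sorts first by (name, address).
query = """
    WITH numbered AS (
        SELECT id, name, address,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY name, address) AS rn
        FROM t
    )
    SELECT id, name, address
    FROM numbered
    WHERE rn = 1
    ORDER BY id
"""
first_rows = conn.execute(query).fetchall()
print(first_rows)
```

With a full ordering inside the OVER clause, the same row is picked for each id regardless of indexes or insertion order.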
