Hive - collect_list with multiple columns? - sql

Say my table looks like this:
Name,Subject,Score
Jon,English,80
Amy,Geography,70
Matt,English,90
Jon,Math,100
Jon,History,60
Amy,French,90
Is there a way to use collect_list so that I can get my query as such:
Jon: English:80; Math:100; History:60
Amy: Geography:70; French:90
Matt: English:90
EDIT:
The complication here is that the collect_list UDF permits only one argument, i.e. one column.
Something like
SELECT name, collect_list(subject), collect_list(score) from mytable group by name
results in
Jon | [English,Math,History] | [80,100,60]
Amy | [Geography,French] | [70,90]
Matt | [English] | [90]

Not sure if this is what you needed.
select * from t0;
+-------+------------+-------+--+
| t0.a | t0.b | t0.c |
+-------+------------+-------+--+
| Jon | English | 80 |
| Amy | Geography | 70 |
| Matt | English | 90 |
| Jon | Math | 100 |
| Jon | History | 60 |
| Amy | French | 90 |
+-------+------------+-------+--+
select a, collect_list(concat_ws(':',b,cast(c as string))) from t0 group by a;
+-------+-----------------------------------------+--+
| a | _c1 |
+-------+-----------------------------------------+--+
| Amy | ["Geography:70","French:90"] |
| Jon | ["English:80","Math:100","History:60"] |
| Matt | ["English:90"] |
+-------+-----------------------------------------+--+

Related

Teradata SQL code to join against a string

I have a table A that has the below values
+----+----------+-----------------------+
| ID | Date | Name |
+----+----------+-----------------------+
| 1 | 1/4/2019 | Kara,Sara,John |
| 2 | 3/2/2018 | Sara |
| 3 | 4/3/2019 | Lynn,John,Chris,Agnes |
| 4 | 2/1/2020 | Phillip, Anton |
| 5 | 5/1/2020 | Quinn |
| 6 | 7/6/2020 | Idie,John |
+----+----------+-----------------------+
And a table B that has the below values
+-------+
| Name |
+-------+
| John |
| Sara |
| Chris |
+-------+
I would like the output to be as below:
+----+----------+-----------------------+--------+-----------------+
| ID | Date | Name | B.Name | Exists in List? |
+----+----------+-----------------------+--------+-----------------+
| 1 | 1/4/2019 | Kara,Sara,John | Sara | Yes |
| 1 | 1/4/2019 | Kara, Sara, John | John | Yes |
| 2 | 3/2/2018 | Sara | Sara | Yes |
| 3 | 4/3/2019 | Lynn,John,Chris,Agnes | John | Yes |
| 3 | 4/3/2019 | Lynn,John,Chris,Agens | Chris | Yes |
| 4 | 2/1/2020 | Phillip, Anton | | No |
| 5 | 5/1/2020 | Quinn | | No |
| 6 | 7/6/2020 | Idie,John | John | Yes |
+----+----------+-----------------------+--------+-----------------+
I tried using CONTAINS but looks like teradata sql does not accept it. Tried CSVLD to convert text to column.However since there is no fixed number of commas that the string can accept, I cannot use CSVLD function if I do not know precisely how many columns I need to re-create from the text beforehand.
Wondering if there is any alternative to join a column against a string of values? Appreciate your kind input.
You should really fix your data model -- storing multiple values in a string is a bad, bad data design. SQL has a great way of storing lists -- it is called a table.
Assuming you are stuck with someone else's really, really bad data model, you can use a left join:
select a.*, b.name,
(case when b.name is not null then 'Yes' else 'No' end) as in_list
from a left join
b
on ',' || a.name || ',' like '%,' || b.name || ',%';

SQL query to find a partial string match that could include special characters

SQL query with special character ()
The original query (big thanks to GMB) can find any items in address (users table) that have a match in address (address_effect table).
The query works fine if address contains ',' but I can't seem to make it work if there is '()' in the address field.
Here is the sql query that's not working:
UPDATE users u
SET u.COUNT = (
SELECT COUNT(*) FROM address_effect a
WHERE FIND_IN_SET(a.address, REPLACE(u.address, ', ', ','')'))
)
Fyi, I'm testing this on my local system with XAMPP (using MariaDB).
I tried to identify '()' as an escape character by prepending it with backslash '' but it doesn't help.
user table
+--------+-------------+---------------+--------------------------+--------+
| ID | firstname | lastname | address | count |
| | | | | |
+--------------------------------------------------------------------------+
| 1 | john | doe |james street, idaho, usa | |
| | | | | |
+--------------------------------------------------------------------------+
| 2 | cindy | smith |rollingwood av,lyn, canada| |
| | | | | |
+--------------------------------------------------------------------------+
| 3 | rita | chatsworth |arajo ct, alameda, cali | |
| | | | | |
+--------------------------------------------------------------------------+
| 4 | randy | plies |smith spring, lima, (peru)| |
| | | | | |
+--------------------------------------------------------------------------+
| 5 | Matt | gwalio |park lane, (atlanta), usa | |
| | | | | |
+--------------------------------------------------------------------------+
address_effect table
+---------+----------------+
|address |effect |
+---------+----------------+
|idaho |potato, tater |
+--------------------------+
|canada |cold, tundra |
+--------------------------+
|fremont | crowded |
+--------------------------+
|peru |alpaca |
+--------------------------+
|atlanta |peach, cnn |
+--------------------------+
|usa |big, hard |
+--------+-----------------+
I would suggest using regular expressions for this. It seems more general than fiddling with the string:
update users u
set count = (select count(*)
from address_effect ae
where u.address regexp concat('[[:<:]]', ae.address, '[[:>:]]'))
);
The funky character class is MySQL's way of delineating a word boundary (I am more used to \W but MySQL doesn't support that).
Here is a db<>fiddle.
Just like you replace the space after each comma with just a comma, use REPLACE() to remove the chars '(' and ')':
FIND_IN_SET(a.address, REPLACE(REPLACE(REPLACE(u.address, ', ', ','), '(', ''), ')', ''))
See the demo.
Results:
| ID | firstname | lastname | address | count |
| --- | --------- | ---------- | -------------------------- | ----- |
| 1 | john | doe | james street, idaho, usa | 2 |
| 2 | cindy | smith | rollingwood av,lyn, canada | 1 |
| 3 | rita | chatsworth | arajo ct, alameda, cali | 0 |
| 4 | randy | plies | smith spring, lima, (peru) | 1 |
| 5 | Matt | gwalio | park lane, (atlanta), usa | 2 |

Updating table based on the results of previous query

How can I update the table based on the results of the previous query?
The original query (big thanks to GMB) can find any items in address (users table) that have a match in address (address_effect table).
From the result of this query, I want to find the count of address in the address_effect table and add it into a new column in the table “users”. For example, john doe has a match with idaho and usa in the address column so it’ll show a count of ‘2’ in the count column.
Fyi, I'm testing this on my local system with XAMPP (using MariaDB).
user table
+--------+-------------+---------------+--------------------------+--------+
| ID | firstname | lastname | address | count |
| | | | | |
+--------------------------------------------------------------------------+
| 1 | john | doe |james street, idaho, usa | |
| | | | | |
+--------------------------------------------------------------------------+
| 2 | cindy | smith |rollingwood av,lyn, canada| |
| | | | | |
+--------------------------------------------------------------------------+
| 3 | rita | chatsworth |arajo ct, alameda, cali | |
| | | | | |
+--------------------------------------------------------------------------+
| 4 | randy | plies |smith spring, lima, peru | |
| | | | | |
+--------------------------------------------------------------------------+
| 5 | Matt | gwalio |park lane, atlanta, usa | |
| | | | | |
+--------------------------------------------------------------------------+
address_effect table
+---------+----------------+
|address |effect |
+---------+----------------+
|idaho |potato, tater |
+--------------------------+
|canada |cold, tundra |
+--------------------------+
|fremont | crowded |
+--------------------------+
|peru |alpaca |
+--------------------------+
|atlanta |peach, cnn |
+--------------------------+
|usa |big, hard |
+--------+-----------------+
Use a correlated subquery which returns the number of matches:
UPDATE user u
SET u.count = (
SELECT COUNT(*)
FROM address_effect a
WHERE FIND_IN_SET(a.address, REPLACE(u.address, ', ', ','))
)
See the demo.
Results:
> ID | firstname | lastname | address | count
> -: | :-------- | :--------- | :------------------------- | ----:
> 1 | john | doe | james street, idaho, usa | 2
> 2 | cindy | smith | rollingwood av,lyn, canada | 1
> 3 | rita | chatsworth | arajo ct, alameda, cali | 0
> 4 | randy | plies | smith spring, lima, peru | 1
> 5 | Matt | gwalio | park lane, atlanta, usa | 2
Notice: I checked it in MySQL, but not in MariaDB.
The count column of users table may be able to be updated using UPDATE statement with INNER JOIN. Then you can use a query that modifies the original query to use "GROUP BY".
UPDATE users AS u
INNER JOIN
(
-- your original query modified
SELECT u.ID AS ID, count(u.ID) AS count
FROM users u
INNER JOIN address_effect a
ON FIND_IN_SET(a.address, REPLACE(u.address, ', ', ','))
GROUP BY u.ID
) AS c ON u.ID=c.ID
SET u.count=c.count;

SQL, query to check and list distinct entries that occur in another table within a specific time frame

I'm using Oracle.
I have two tables. One contains users and the other is an access log of sorts. I need to list all users whose latest log entry appears in the log within a specified time frame including the timestamp of the latest entry. A single user can have several entries in the log.
Here are simplified versions of the tables:
Users
|----------------------------------|
| userid| username | name |
|----------------------------------|
| 1 | josm | John Smith |
| 2 | lajo | Laura Jones |
| 3 | miwi | Mike Williams |
| 4 | subo | Susan Brown |
| 5 | peda | Peter Davis |
| 6 | jami | Jane Miller |
|----------------------------------|
Log
|----------------------------------|
| userid| action | timestamp |
|----------------------------------|
| 3 | a | 20-01-2020 |
| 2 | v | 19-11-2019 |
| 2 | y | 02-11-2019 |
| 4 | b | 15-09-2019 |
| 1 | a | 23-05-2019 |
| 6 | y | 22-05-2019 |
| 3 | b | 16-04-2019 |
| 2 | a | 07-01-2019 |
| 5 | v | 18-11-2018 |
| 6 | a | 12-09-2018 |
|----------------------------------|
Desired result if the time frame is set to last six months:
|---------------------------------------|
| username | name | timestamp |
|--------------------------|------------|
| miwi | Mike Williams | 20-01-2020 |
| lajo | Laura Jones | 19-11-2019 |
| subo | Susan Brown | 15-09-2019 |
|---------------------------------------|
Any help will be greatly appreciated.
You can use aggregation:
select u.username, u.userid, max(l.timestamp)
from logs l join
users u
on l.userid = u.userid
group by u.username, u.userid
having max(l.timestamp) >= add_months(sysdate, -6)

SQL only select rows with max date within each user

SQL beginner here. I've got a simple test that users take, and each row is the answer to one of their questions. They're allowed to take the exam once per day, so some people take it a second time on another day, and thus will have many rows with different test dates. What I'm basically trying to do is get each user's most recent score.
Here is what my data looks like (table name is dumdum):
+----------+----------------+----------+------------------+
| USERNAME | CORRECT_ANSWER | RESPONSE | DATE_TAKEN |
+----------+----------------+----------+------------------+
| matt | 1 | 1 | 3/23/15 1:04:26 |
| matt | 2 | 2 | 3/23/15 1:04:28 |
| matt | 3 | 3 | 3/23/15 1:04:23 |
| david | 1 | 3 | 3/20/15 1:04:25 |
| david | 2 | 2 | 3/20/15 1:04:28 |
| david | 3 | 1 | 3/20/15 1:04:30 |
| david | 1 | 1 | 3/21/15 11:03:14 |
| david | 2 | 3 | 3/21/15 11:03:17 |
| david | 3 | 2 | 3/21/15 11:03:19 |
| chris | 1 | 2 | 3/17/15 12:45:52 |
| chris | 2 | 2 | 3/17/15 12:45:56 |
| chris | 3 | 3 | 3/17/15 12:45:59 |
| peter | 1 | 1 | 3/19/15 2:45:33 |
| peter | 2 | 3 | 3/19/15 2:45:35 |
| peter | 3 | 2 | 3/19/15 2:45:38 |
| peter | 1 | 1 | 3/20/15 12:32:04 |
| peter | 2 | 2 | 3/20/15 12:32:05 |
| peter | 3 | 3 | 3/20/15 12:32:05 |
+----------+----------------+----------+------------------+
and what I'm trying to get in the end...
+----------+------------------+-------+
| USERNAME | MOST_RECENT_TEST | SCORE |
+----------+------------------+-------+
| matt | 3/23/2015 | 100 |
| david | 3/21/2015 | 33 |
| chris | 3/17/2015 | 67 |
| peter | 3/20/2015 | 100 |
+----------+------------------+-------+
I ran into some trouble because I need to go by day, and not by day/time, so I had to do a weird maneuver where I went to character and back to date... This is what I have so far, but I can't figure out how to use only the scores from the most recent test (right now it's factoring in all scores from every test ever taken)...
SELECT username, to_date(substr(max(test_date),1,9),'dd-MON-yy') as most_recent_test, round((sum(case when response=correct_answer then 1 end)/3)*100,0) as score
FROM dumdum group by username
Any help would be appreciated! Thanks!
There are several solutions to this problem this one uses the WITH clause and the RANK function.
It also uses the TRUNC function rather than to_date(substr(
with mxDate as
(SELECT USERNAME,
TRUNC(DATE_TAKEN) as MOST_RECENT_TEST,
CASE WHEN CORRECT_ANSWER = RESPONSE THEN 1 else 0 END as SCORE,
RANK () OVER (PARTITION BY USERNAME
ORDER BY TRUNC(DATE_TAKEN) DESC) Rk
FROM dumdum)
SELECT
USERNAME,
MOST_RECENT_TEST,
SUM(SCORE)/3 * 100
FROM
mxDate
WHERE
rk = 1
GROUP BY
USERNAME,
MOST_RECENT_TEST
Demo