Impala pivot from column to row, column names disappear - sql

I am kind of new to Impala, and to SQL in general. I am trying to do a pivot operation, starting from this table.
Input:
Table name: MyName
+----------+----------+----------+
| Column A | Column B | Column C |
+----------+----------+----------+
| a1       | b1       | c1       |
| a2       | b2       | c2       |
| a3       | b3       | c3       |
+----------+----------+----------+
I want to obtain this other table, transposed, so that b1, b2, b3 go from column values to column headers.
output:
+----+----+----+
| b1 | b2 | b3 |
+----+----+----+
| a1 | a2 | a3 |
| c1 | c2 | c3 |
+----+----+----+
This is the code I have come up with so far:
select b_column,
max(case where b_column='b%' then column_a, column_c end) column_a, column_c
from MyName
group by b_column;
But it's not working and I am feeling pretty stuck.
Can anyone give me a hint/suggestion on how to solve the issue?
Thanks so much in advance!

If you are trying to do a pivot in Impala in general, you can't: per the 6.1 documentation, PIVOT is not currently supported.
https://www.cloudera.com/documentation/enterprise/6/6.1/topics/impala_reserved_words.html

select max(case when b_column = 'b1' then column_a end) as b1,
       max(case when b_column = 'b2' then column_a end) as b2,
       max(case when b_column = 'b3' then column_a end) as b3
from MyName
union all
select max(case when b_column = 'b1' then column_c end),
       max(case when b_column = 'b2' then column_c end),
       max(case when b_column = 'b3' then column_c end)
from MyName;
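Since conditional aggregation works the same way in most engines, the query shape is easy to sanity-check outside Impala. This is only a sketch: Python's sqlite3 stands in for Impala here, and the table and column names are taken from the question.

```python
import sqlite3

# Build the sample table from the question (sqlite3 as a stand-in for Impala).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyName (column_a TEXT, b_column TEXT, column_c TEXT)")
conn.executemany("INSERT INTO MyName VALUES (?, ?, ?)",
                 [("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a3", "b3", "c3")])

# Conditional aggregation: each MAX(CASE ...) picks out the value belonging
# to one b_column key, turning the b values into column headers.
rows = conn.execute("""
    SELECT MAX(CASE WHEN b_column = 'b1' THEN column_a END) AS b1,
           MAX(CASE WHEN b_column = 'b2' THEN column_a END) AS b2,
           MAX(CASE WHEN b_column = 'b3' THEN column_a END) AS b3
    FROM MyName
    UNION ALL
    SELECT MAX(CASE WHEN b_column = 'b1' THEN column_c END),
           MAX(CASE WHEN b_column = 'b2' THEN column_c END),
           MAX(CASE WHEN b_column = 'b3' THEN column_c END)
    FROM MyName
""").fetchall()
print(sorted(rows))  # [('a1', 'a2', 'a3'), ('c1', 'c2', 'c3')]
```

Note that this hard-codes the b values; with an unknown set of keys you would need to build the query string dynamically.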

SQL query for many to many exclusive IN query

I have a table Table1 with columns A and B (a many-to-many table).
| ColumnA | ColumnB |
|---------|---------|
| a1      | b1      |
| a1      | b2      |
| a2      | b1      |
| a2      | b3      |
| a3      | b2      |
I want a list of the A values whose associated B values are ONLY in a given list of Bs.
So, from the above table, if the list is [b1, b2]:
Expected: [a1, a3]
a2 is not included, as it is also associated with b3.
You can use aggregation and having:
select a
from ab
group by a
having sum(case when b not in ('b1', 'b2') then 1 else 0 end) = 0;
The having clause is checking the number of rows that are not in the list. The = 0 says there are none.
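The aggregation-plus-HAVING pattern can be sketched and checked with Python's sqlite3 (a sketch only; table name `ab` and the sample data come from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ab (a TEXT, b TEXT)")
conn.executemany("INSERT INTO ab VALUES (?, ?)",
                 [("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
                  ("a2", "b3"), ("a3", "b2")])

# A group survives only if it has zero rows whose b is outside the list.
rows = conn.execute("""
    SELECT a
    FROM ab
    GROUP BY a
    HAVING SUM(CASE WHEN b NOT IN ('b1', 'b2') THEN 1 ELSE 0 END) = 0
    ORDER BY a
""").fetchall()
print(rows)  # [('a1',), ('a3',)]
```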
Assuming there are no nulls in ColumnB, you can use NOT EXISTS:
select t.*
from tablename t
where not exists (select 1 from tablename where ColumnA = t.ColumnA and ColumnB not in ('b1', 'b2'))
If you want only the distinct values of ColumnA:
select distinct t.ColumnA
from tablename t
where not exists (select 1 from tablename where ColumnA = t.ColumnA and ColumnB not in ('b1', 'b2'))
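The NOT EXISTS variant behaves the same way; here is a quick sqlite3 check of the distinct-values query (a sketch, with the table name and data taken from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (ColumnA TEXT, ColumnB TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?, ?)",
                 [("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
                  ("a2", "b3"), ("a3", "b2")])

# Keep an A only if no row with the same A has a B outside the list.
rows = conn.execute("""
    SELECT DISTINCT t.ColumnA
    FROM tablename t
    WHERE NOT EXISTS (SELECT 1 FROM tablename
                      WHERE ColumnA = t.ColumnA
                        AND ColumnB NOT IN ('b1', 'b2'))
    ORDER BY t.ColumnA
""").fetchall()
print(rows)  # [('a1',), ('a3',)]
```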

Conditional Joins in SQL Server - how do I make SQL check whether the conditions are met first, then do the JOIN after

I am trying to do fuzzy string matching to get as many matches as I can. First I implemented the Levenshtein distance algorithm (http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx) and stored it as a function, dbo.distance.
My first table (t1) looks like this:
Name | Synonym
A | A1
A | A2
A | A3
B | B1
B | B2
My second table (t2) looks like this:
The ID field may look very much like Name or Synonym:
ID | Description
A | XXX
B | YYY
My goal is to make left joins either on the Name or its Synonyms when the distance between 2 strings from each table (t1 and t2) are smaller than 2.
Here is my current work:
SELECT *
FROM (
    SELECT t2.ID, ROW_NUMBER() over (partition by ID order by ID) as rn
    FROM table1 as t1
    LEFT JOIN table2 as t2
        ON (upper(trim(t1.Name)) = upper(trim(t2.ID)) OR upper(trim(t1.Synonym)) = upper(trim(t2.ID)))
    WHERE (dbo.distance(t1.Name, t2.ID) <= 2 OR dbo.distance(t1.Synonym, t2.ID) <= 2)
) temp
WHERE rn = 1
Ideally, as long as their distance is smaller than 2, the join should still happen.
Adding that condition should produce more matches, but it doesn't.
Am I missing anything here?
I was wondering if my problem is coming from this:
My intention is to see if the conditions meet, if so then just do the join.
But my code here probably tells SQL to join first, and then filter afterwards.
Is there a way to let it see if the condition qualifies and then do the join "after"?
I have tried the DIFFERENCE function just for demo purposes. It measures how similar two strings are, returning 4 (the strongest match) down to 0 (the weakest match). You can apply similar logic using your distance function.
DECLARE @table1 TABLE(Name varchar(10), synon varchar(10))
DECLARE @table2 TABLE(Name varchar(10), synon varchar(10))

INSERT INTO @table1
VALUES ('A','A1'),('A','A2'),('A','A3'),('B','B1'),('B','B2'),('B','B3')
INSERT INTO @table2
VALUES ('A','A1'),('A','A2'),('C','C1'),('B','B2'),('B','B3')

SELECT t1.name, t1.synon, t2.Name, t2.synon
FROM @table1 AS t1
CROSS APPLY (SELECT t2.Name, t2.synon
             FROM @table2 AS t2
             WHERE DIFFERENCE(t2.Name, t1.Name) = 4
                OR DIFFERENCE(t2.synon, t1.synon) = 4) AS t2
+------+-------+------+-------+
| name | synon | Name | synon |
+------+-------+------+-------+
| A | A1 | A | A1 |
| A | A2 | A | A1 |
| A | A3 | A | A1 |
| A | A1 | A | A2 |
| A | A2 | A | A2 |
| A | A3 | A | A2 |
| B | B1 | B | B2 |
| B | B2 | B | B2 |
| B | B3 | B | B2 |
| B | B1 | B | B3 |
| B | B2 | B | B3 |
| B | B3 | B | B3 |
+------+-------+------+-------+
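The underlying "join rows whose edit distance is at most 2" idea can also be sketched outside SQL Server. This is only an illustration: the Levenshtein function below is the classic dynamic-programming version, and the sample names are invented for the demo (they are not the question's data).

```python
def levenshtein(s, t):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

# Invented sample data: (Name, Synonym) pairs and (ID, Description) pairs.
t1 = [("Acme Corp", "ACME"), ("Beta LLC", "BetaL")]
t2 = [("ACME Corp", "desc1"), ("Gamma", "desc2")]

# Mimic the WHERE clause: keep a pair when either the name or the synonym
# is within edit distance 2 of the id (case-folded, like upper(trim(...))).
matches = [(name, id_) for name, syn in t1 for id_, _desc in t2
           if levenshtein(name.upper(), id_.upper()) <= 2
           or levenshtein(syn.upper(), id_.upper()) <= 2]
print(matches)  # [('Acme Corp', 'ACME Corp')]
```

Note that computing the distance for every pair of rows is O(n*m) comparisons, which is exactly why the original query's equality join plus distance filter finds fewer matches: the ON clause already restricts the pairs before the distance is ever checked.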

How can I transform a column containing some strings into multi columns using postgresql?

I have the following table (the data type of the column value is varchar; some values, such as c2 and c4, are missing):
 id | value
----+------------------------
  1 | {{a1,b1,c1},{a2,b2,}}
  2 | {{a3,b3,c3},{a4,b4}}
and I would like to obtain something like:
 id | A  | B  | C
----+----+----+----
  1 | a1 | b1 | c1
  1 | a2 | b2 |
  2 | a3 | b3 | c3
  2 | a4 | b4 |
I am trying to use regexp_split_to_array, without any success so far.
How can this be achieved?
Thank you!
This assumes you know what the possible values are (e.g. a*, b*) because otherwise generating the appropriate columns for the value types will require dynamic sql.
Setup:
CREATE TABLE t (id INTEGER, value VARCHAR);
INSERT INTO t
VALUES
(1, '{{a1,b1,c1},{a2,b2,}}'),
(2, '{{a3,b3,c3},{a4,b4}}')
;
Query:
SELECT
    id,
    NULLIF(r[1], '') AS a,
    NULLIF(r[2], '') AS b,
    NULLIF(r[3], '') AS c
FROM (
    SELECT id, regexp_split_to_array(r[1], ',') AS r
    FROM (
        SELECT id, regexp_matches(value, '{([^{][^}]+)}', 'g') AS r
        FROM t
    ) x
) x;
Result:
| id | a | b | c |
| --- | --- | --- | --- |
| 1 | a1 | b1 | c1 |
| 1 | a2 | b2 | |
| 2 | a3 | b3 | c3 |
| 2 | a4 | b4 | |
Note that if it's possible for earlier values to be missing, e.g. {b1,c1} where a1 is missing, then the query would have to be different.
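If you just want to verify the parsing logic itself, the same match-then-split idea can be sketched in plain Python, with `re.findall` playing the role of `regexp_matches` (a sketch only; the data is the question's sample):

```python
import re

rows = [(1, "{{a1,b1,c1},{a2,b2,}}"), (2, "{{a3,b3,c3},{a4,b4}}")]

out = []
for id_, value in rows:
    # Pull each inner {...} group, then split it on commas; empty or
    # missing trailing fields become None, padded out to three columns.
    for group in re.findall(r"\{([^{}]+)\}", value):
        parts = [p or None for p in group.split(",")]
        parts += [None] * (3 - len(parts))
        out.append((id_, *parts[:3]))
print(out)
# [(1, 'a1', 'b1', 'c1'), (1, 'a2', 'b2', None),
#  (2, 'a3', 'b3', 'c3'), (2, 'a4', 'b4', None)]
```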
You can use string_to_array to convert the string to an array and then explode it into multiple rows with unnest:
EXAMPLE
SELECT unnest(string_to_array('{1 2 3},{4 5 6},{7 8 9}', ','));
{1 2 3}
{4 5 6}
{7 8 9}

T-SQL - Remove All Duplicates Except Most Recent (SQL Server 2005)

I have a T-SQL function that will pull all records inserted into a main table within the last 60 minutes and insert them into a table variable. I've then got some more code that will filter that set into another table variable to be returned.
In this set I'm expecting some records to have multiple occurrences, each with a unique datetime. I would like to delete every record that occurs 3 or more times, keeping only the one with the most recent datetime value.
EDIT: Sorry, I thought I was more clear than it appears I actually was.
This data is error log data from a legacy system, so duplicates can be expected. The idea is that if they cross a certain threshold they need to be reported.
For example, the below is what should end up in #table_variable_2:
|    | ColA | ColB | DateTimeColumn          | ColC |
|----|------|------|-------------------------|------|
| 1  | A    | B    | 2015-08-24 11:06:14.000 | C    |
| 2  | A    | B    | 2015-08-24 11:18:58.000 | C    |
| 3  | A    | B    | 2015-08-24 12:07:45.000 | C    |
| 4  | A2   | B2   | 2015-08-24 12:17:24.000 | C2   |
| 5  | A2   | B2   | 2015-08-24 13:25:32.000 | C2   |
| 6  | A3   | B3   | 2015-08-24 14:52:10.000 | C3   |
| 7  | A3   | B3   | 2015-08-24 14:52:34.000 | C3   |
| 8  | A3   | B3   | 2015-08-24 14:52:45.000 | C3   |
| 9  | A3   | B3   | 2015-08-24 14:53:15.000 | C3   |
| 10 | A3   | B3   | 2015-08-24 14:53:32.000 | C3   |
This is what I expect to be returned:
|   | ColA | ColB | DateTimeColumn          | ColC |
|---|------|------|-------------------------|------|
| 1 | A    | B    | 2015-08-24 12:07:45.000 | C    |
| 2 | A2   | B2   | 2015-08-24 12:17:24.000 | C2   |
| 3 | A2   | B2   | 2015-08-24 13:25:32.000 | C2   |
| 4 | A3   | B3   | 2015-08-24 14:53:32.000 | C3   |
It's okay to have some duplicates, there's just the chance of having a lot of them.
EDIT 2: Solved without a CTE:
DELETE a
FROM #rtrn_tbl AS a
INNER JOIN (
    SELECT ColA, ColB, ColC, MAX(DateTimeColumn) AS MaxDate
    FROM #rtrn_tbl
    GROUP BY ColA, ColB, ColC
    HAVING COUNT(*) > 2
) AS b
    ON a.ColA = b.ColA AND a.ColB = b.ColB AND a.ColC = b.ColC
WHERE a.DateTimeColumn <> b.MaxDate;
I think you have to use PARTITION BY ColA, ColB, ColC ORDER BY DateTimeColumn DESC instead, then you can delete all but one (the most recent):
WITH cte AS
(
SELECT ColA, ColB, DateTimeColumn, ColC,
ROW_NUMBER() OVER (PARTITION BY ColA, ColB, ColC ORDER BY DateTimeColumn DESC) AS r_count
FROM #table_variable_2
)
DELETE
FROM cte
WHERE r_count > 1
WITH cte AS (
    SELECT ColA, ColB, DateTimeColumn, ColC,
           ROW_NUMBER() OVER (PARTITION BY ColA, ColB, ColC
                              ORDER BY DateTimeColumn DESC) AS r_count,
           COUNT(*) OVER (PARTITION BY ColA, ColB, ColC) AS occurrences
    FROM #table_variable_2
),
cte1 AS (SELECT * FROM cte WHERE occurrences >= 3)
DELETE FROM cte1
WHERE r_count <> 1
The second CTE selects all records from groups with 3 or more occurrences; the DELETE then removes everything except the most recent record in each of those groups.
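The keep-only-the-latest-when-a-group-crosses-the-threshold idea is easy to verify with window functions in sqlite3 (a sketch: sqlite stands in for SQL Server, `rowid` replaces the CTE-delete idiom, and only a subset of the question's rows is used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (ColA TEXT, ColB TEXT, dt TEXT, ColC TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?, ?)", [
    ("A3", "B3", "2015-08-24 14:52:10.000", "C3"),
    ("A3", "B3", "2015-08-24 14:53:32.000", "C3"),
    ("A3", "B3", "2015-08-24 14:52:45.000", "C3"),
    ("A",  "B",  "2015-08-24 11:06:14.000", "C"),
    ("A",  "B",  "2015-08-24 12:07:45.000", "C"),
])

# For groups with 3+ rows, delete everything but the most recent row;
# smaller groups are left untouched.
conn.execute("""
    DELETE FROM logs
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   COUNT(*) OVER (PARTITION BY ColA, ColB, ColC) AS n,
                   ROW_NUMBER() OVER (PARTITION BY ColA, ColB, ColC
                                      ORDER BY dt DESC) AS rn
            FROM logs
        )
        WHERE n >= 3 AND rn > 1
    )
""")
rows = conn.execute("SELECT ColA, dt FROM logs ORDER BY ColA, dt").fetchall()
print(rows)
# [('A', '2015-08-24 11:06:14.000'), ('A', '2015-08-24 12:07:45.000'),
#  ('A3', '2015-08-24 14:53:32.000')]
```

This requires SQLite 3.25+ for window functions (bundled with recent Python builds).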

SQL query transposing columns

I have a table in the following structure:
id | att1 | att2 | att3
-----------------------
1 | a1 | b1 | c1
2 | a2 | b2 | c2
3 | a3 | b3 | c3
And I want to transpose the columns to become rows for each id. Like this:
id | attname | value
------------------------
1 | att1 | a1
1 | att2 | b1
1 | att3 | c1
2 | att1 | a2
2 | att2 | b2
2 | att3 | c2
3 | att1 | a3
3 | att2 | b3
3 | att3 | c3
I was reading up on the PIVOT function and wasn't sure if it would do the job or how to use it. Any help would be appreciated.
You can use unpivot for this task.
SELECT *
FROM table
UNPIVOT (
value FOR att_name
IN (att1 as 'att1', att2 as 'att2', att3 as 'att3')
);
Here is another method :
SELECT id,attname, value
FROM Yourtable
CROSS APPLY ( VALUES ('att1',att1), ('att2',att2), ('att3',att3)) t(attname, value)
Do UNION ALL:
select id, 'att1' as attname, att1 as value from tablename
union all
select id, 'att2' as attname, att2 as value from tablename
union all
select id, 'att3' as attname, att3 as value from tablename
Note that VALUE is a reserved word in SQL, so if that's your real column name you need to double quote it, "value".
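The UNION ALL approach works in essentially any engine, so it is easy to sanity-check; here is a minimal sqlite3 sketch (table name and sample data taken from the question, with `val` used instead of the reserved word `value`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, att1 TEXT, att2 TEXT, att3 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)",
                 [(1, "a1", "b1", "c1"), (2, "a2", "b2", "c2")])

# One SELECT per attribute column; UNION ALL stacks them into
# (id, attname, val) rows -- a manual unpivot.
rows = conn.execute("""
    SELECT id, 'att1' AS attname, att1 AS val FROM t
    UNION ALL
    SELECT id, 'att2', att2 FROM t
    UNION ALL
    SELECT id, 'att3', att3 FROM t
    ORDER BY id, attname
""").fetchall()
print(rows)
# [(1, 'att1', 'a1'), (1, 'att2', 'b1'), (1, 'att3', 'c1'),
#  (2, 'att1', 'a2'), (2, 'att2', 'b2'), (2, 'att3', 'c2')]
```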