SQL query showing just a few DISTINCT records - sql

I have the following records:
Col_1 | Col_2 | Col_3
------+-------+------
A | A | XYZ01
A | A | XYZ02
A | A | XYZ03
A | B | XYZ04
B | B | XYZ05
B | B | XYZ06
B | B | XYZ07
B | B | XYZ08
I need a query that returns at most 2 records for each distinct combination of Col_1 and Col_2 (regardless of Col_3), i.e. a 2-record sample of each distinct (Col_1, Col_2) combination.
So this query should return:
Col_1 | Col_2 | Col_3
------+-------+------
A | A | XYZ01
A | A | XYZ02
A | B | XYZ04
B | B | XYZ05
B | B | XYZ06

SELECT *
FROM (
    SELECT col_1,
           col_2,
           col_3,
           row_number() OVER (
               PARTITION BY col_1, col_2
               ORDER BY col_1
           ) AS foo
    FROM TABLENAME
) bar
WHERE foo < 3
A TOP clause will not work here because you want to 'group by' multiple columns. What helps is partitioning the data and assigning a row number within each partition. By partitioning on col_1 and col_2 we create 3 different groupings:
1. All rows with 'A' in col_1 and 'A' in col_2
2. All rows with 'A' in col_1 and 'B' in col_2
3. All rows with 'B' in col_1 and 'B' in col_2
We order by col_1 (I picked this because your result set was ordered that way). Then for each row in a grouping, row_number() counts the rows and assigns the row number.
We use this as a derived table and select * from it where the row number is less than 3. This gets us the first two rows in each grouping.
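As a quick check, here is a sketch of the same ROW_NUMBER() approach run against the sample data with SQLite via Python's sqlite3 (ordering by col_3 inside each partition to make the output deterministic; requires SQLite 3.25+ for window functions):

```python
import sqlite3

# Build the question's sample table in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablename (col_1 TEXT, col_2 TEXT, col_3 TEXT);
INSERT INTO tablename VALUES
  ('A','A','XYZ01'), ('A','A','XYZ02'), ('A','A','XYZ03'),
  ('A','B','XYZ04'),
  ('B','B','XYZ05'), ('B','B','XYZ06'), ('B','B','XYZ07'), ('B','B','XYZ08');
""")

# Number the rows within each (col_1, col_2) partition, keep the first two.
rows = conn.execute("""
SELECT col_1, col_2, col_3
FROM (
  SELECT col_1, col_2, col_3,
         ROW_NUMBER() OVER (PARTITION BY col_1, col_2 ORDER BY col_3) AS foo
  FROM tablename
) bar
WHERE foo < 3
ORDER BY col_1, col_2, col_3
""").fetchall()
print(rows)
```

This yields the five rows from the expected result: two for (A, A), one for (A, B), and two for (B, B).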

For Oracle, an analytic function also works. Note that RANK() ordered by the partition column itself would tie every row at rank 1, so ROW_NUMBER() is the safer choice here:
SELECT *
FROM (SELECT Col_1,
             Col_2,
             Col_3,
             ROW_NUMBER() OVER (PARTITION BY Col_1, Col_2 ORDER BY Col_3) AS part
      FROM someTable) st
WHERE st.part < 3;
Since the alias cannot be referenced in the outer query's own WHERE clause, the derived table is needed; it may not be the most elegant, but it works.
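One caveat worth demonstrating: RANK() ordered by a column that is constant within its partition ties every row at rank 1, which is why ROW_NUMBER() is usually preferred for "top N per group". A small SQLite sketch (table name assumed, not from Oracle itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sometable (col_1 TEXT, col_2 TEXT, col_3 TEXT);
INSERT INTO sometable VALUES
  ('A','A','XYZ01'), ('A','A','XYZ02'), ('A','A','XYZ03');
""")

# RANK() ordered by a column that is constant within the partition:
# every row ties, so every rank comes back as 1.
ranks = [r[0] for r in conn.execute("""
SELECT RANK() OVER (PARTITION BY col_1, col_2 ORDER BY col_1) FROM sometable
""")]
print(ranks)  # [1, 1, 1]
```

A filter like `WHERE part < 2` against these ranks would therefore return all three rows, not one.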

Related

How do I find the closest number across columns?

I have this table:
col_1 | col_2 | col_3 | compare
------+-------+-------+--------
1.1 | 2.1 | 3.1 | 2
------+-------+-------+--------
10 | 9 | 1 | 15
I want to derive a new column, choice, indicating the column closest to the compare value:
col_1 | col_2 | col_3 | compare | choice
------+-------+-------+---------+-------
1.1 | 2.1 | 3.1 | 2 | col_2
------+-------+-------+---------+-------
10 | 9 | 1 | 15 | col_1
Choice refers to the column where cell value is closest to the compare value.
I think the simplest method is apply:
select t.*, v.which as choice
from t cross apply
(select top (1) v.*
from (values ('col_1', col_1), ('col_2', col_2), ('col_3', col_3)
) v(which, val)
order by abs(v.val - t.compare)
) v;
In the event of ties, this returns an arbitrary closest column.
You can also use case expressions, but that gets complicated. With no NULL values:
select t.*,
(case when abs(compare - col_1) <= abs(compare - col_2) and
abs(compare - col_1) <= abs(compare - col_3)
then 'col_1'
when abs(compare - col_2) <= abs(compare - col_3)
then 'col_2'
else 'col_3'
end) as choice
from t;
In the event of ties, this returns the first column.
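The CASE approach can be verified with SQLite via Python's sqlite3, using the question's sample rows (a sketch; the first branch compares col_1 against both col_2 and col_3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (col_1 REAL, col_2 REAL, col_3 REAL, compare REAL);
INSERT INTO t VALUES (1.1, 2.1, 3.1, 2), (10, 9, 1, 15);
""")

# Pick the column whose value has the smallest absolute distance to compare;
# ties resolve to the earliest column tested.
choices = [r[0] for r in conn.execute("""
SELECT CASE WHEN abs(compare - col_1) <= abs(compare - col_2)
             AND abs(compare - col_1) <= abs(compare - col_3)
            THEN 'col_1'
            WHEN abs(compare - col_2) <= abs(compare - col_3)
            THEN 'col_2'
            ELSE 'col_3'
       END AS choice
FROM t
""")]
print(choices)  # ['col_2', 'col_1']
```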

How to combine multiple maps in Hive?

Is there a Hive UDF that creates a map with unique values?
For ex:
col_1 | col_2
-------------
a | x
a | y
b | y
b | y
c | z
c | NULL
d | NULL
This should return a map as follows
{ a : [x,y], b : [y], c:[z] }
I'm looking for something similar to Presto's multimap_agg function.
Use collect_set to remove duplicate col_2 values per col_1, then wrap the result with map. Note that this returns one single-entry map per col_1 group rather than a single combined map.
select map(col_1, uniq_col_2)
from (select col_1, collect_set(col_2) as uniq_col_2
      from tbl
      where col_2 is not null
      group by col_1
     ) t
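Outside Hive, the same collect_set-then-map logic can be sketched in plain Python (an illustration of the semantics, not Hive itself):

```python
from collections import defaultdict

# The question's sample rows; None stands in for NULL.
rows = [("a", "x"), ("a", "y"), ("b", "y"), ("b", "y"),
        ("c", "z"), ("c", None), ("d", None)]

multimap = defaultdict(set)
for col_1, col_2 in rows:
    if col_2 is not None:           # WHERE col_2 IS NOT NULL
        multimap[col_1].add(col_2)  # collect_set: deduplicates per key

# Sort each value set for a stable, readable result.
result = {k: sorted(v) for k, v in multimap.items()}
print(result)  # {'a': ['x', 'y'], 'b': ['y'], 'c': ['z']}
```

Key 'd' disappears entirely because all of its col_2 values are NULL, matching the expected output.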

How to keep a column from a specific data set using a DATA step in SAS

I have 2 tables that both have a column with the same attributes. I want to select that column and other columns from both tables. Example:
table_1
ID | column_1 | column_2
1 | col_1 | col_2
2 | col_1 | col_2
3 | col_1 | col_2
table_2
ID | column_3 | column_4
4 | col_3 | col_4
5 | col_3 | col_4
6 | col_3 | col_4
I want to create a table as
Required
ID | column_1 | column_4
1 | col_1 | col_4
2 | col_1 | col_4
3 | col_1 | col_4
I want to do it using a DATA step:
data required;
    set table_1 table_2;
    keep ID column_1 column_4;
run;
but it's giving me 6 rows.
I can get my table using proc sql
proc sql noprint;
create table required as
select t1.Id, t1.column_1, t2.column_4
from table_1 as t1, table_2 as t2;
quit;
I'm looking to do the same with a DATA step.
If you use a SET then the datasets are read sequentially. If you want to combine the datasets record by record then use a MERGE instead. Normally you would use a BY statement to combine the records based on some key variables. But if you leave off the BY statement then SAS will combine the records in order.
Also watch out for variable name conflicts. Both of your inputs have an ID variable. It looks like you only want to keep the one from the first dataset. You can use the KEEP= dataset option to tell SAS which variables to include from a dataset.
data required;
    merge table_1 (keep=id column_1) table_2 (keep=column_4);
run;
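The SET-versus-MERGE distinction can be illustrated outside SAS; here is a rough Python analogy (just the row-pairing idea, not full SAS semantics), using made-up in-memory rows:

```python
from itertools import zip_longest

# Rows as dicts, already reduced to the kept variables.
table_1 = [{"ID": 1, "column_1": "col_1"},
           {"ID": 2, "column_1": "col_1"},
           {"ID": 3, "column_1": "col_1"}]
table_2 = [{"column_4": "col_4"},
           {"column_4": "col_4"},
           {"column_4": "col_4"}]

# SET-style: datasets are read sequentially (concatenated) -> 6 rows.
set_style = table_1 + table_2

# MERGE-style without BY: rows are paired by position -> 3 rows.
merge_style = [{**a, **b}
               for a, b in zip_longest(table_1, table_2, fillvalue={})]
print(len(set_style), len(merge_style))  # 6 3
```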

Find and update non-duplicated records based on one of the columns

I want to find all the non-duplicated records and update one of the columns.
Ex.
Col_1 | Col_2 | Col_3 | Col_4 | Col_5
A | AA | BB | 1 |
A | AB | BC | 2 |
A | AC | BD | 3 |
B | BB | CC | 1 |
B | BB | CC | 2 |
C | CC | DD | 1 |
My query has to group by Col_1; I want to find the groups whose records are not duplicated on Col_2 and Col_3, and then update Col_5.
Basically output should be as below,
Col_1 | Col_2 | Col_3 | Col_4 | Col_5
A | AA | BB | 1 | 1
A | AB | BC | 2 | 1
A | AC | BD | 3 | 1
B | BB | CC | 1 | 0
B | BB | CC | 2 | 0
C | CC | DD | 1 | 0
Does anyone have an idea how I can achieve this? This is a large database, so performance is also a key factor.
Thanks heaps.
There are plenty of ways to do it. This solution was written on Postgres, which is what I have access to, but I bet it will also work in T-SQL, as the syntax involved is common.
;WITH
cte_1 AS (
SELECT col_1 FROM some_table GROUP BY col_1 HAVING count(*) > 1
),
cte_2 AS (
SELECT col_1 FROM some_table GROUP BY col_1, col_2, col_3 HAVING count(*) > 1
),
cte_3 AS (
SELECT cte_1.col_1 FROM cte_1
LEFT JOIN cte_2 ON cte_1.col_1 = cte_2.col_1
WHERE cte_2.col_1 IS NULL
)
UPDATE some_table SET col_5 = 1
FROM cte_3 WHERE cte_3.col_1 = some_table.col_1;
So, what happens above?
First we build three CTE semi-tables, which let us split the logic into smaller parts:
1. cte_1 extracts the col_1 values that have multiple rows (and so can have multiple col_2 and col_3 values)
2. cte_2 selects the col_1 values that have non-unique (col_2, col_3) pairs
3. cte_3 returns the col_1 values whose (col_2, col_3) pairs are all unique, using a LEFT JOIN anti-join
Using the last structure, cte_3, we are able to update some_table correctly.
I assume that your table is called some_table here. If you're worrying about performance, you should add a primary key, and it would also be good to have indexes on col_2 and col_3 (standalone, though composite ones on (col_1, col_2) and so on may help too).
You may also want to move from CTEs to temporary tables (which could likewise be indexed to gain efficiency).
Please also note that this query works fine with your example, but without real data it may be just guessing: what should happen when col_1 = 'A' has some unique and some non-unique (col_2, col_3) pairs at the same time? The variant below accounts for that by comparing row counts.
But I believe it's a good point to start.
;WITH
cte_1 AS (
SELECT col_1, count(*) as items FROM some_table GROUP BY col_1 HAVING count(*) > 1
),
cte_2 AS (
SELECT col_1, count(*) as items FROM some_table GROUP BY col_1, col_2, col_3 HAVING count(*) > 1
),
cte_3 AS (
SELECT cte_1.col_1 FROM cte_1
LEFT JOIN cte_2 ON cte_1.col_1 = cte_2.col_1
WHERE cte_2.col_1 IS NULL OR cte_1.items > cte_2.items
GROUP BY cte_1.col_1
)
UPDATE some_table SET col_5 = 1
FROM cte_3 WHERE cte_3.col_1 = some_table.col_1;
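The first CTE version can be sanity-checked with SQLite via Python's sqlite3 (a sketch; it uses WHERE col_1 IN (...) instead of UPDATE ... FROM, which older SQLite versions lack, and initializes col_5 to 0 rather than NULL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE some_table
  (col_1 TEXT, col_2 TEXT, col_3 TEXT, col_4 INT, col_5 INT DEFAULT 0);
INSERT INTO some_table (col_1, col_2, col_3, col_4) VALUES
  ('A','AA','BB',1), ('A','AB','BC',2), ('A','AC','BD',3),
  ('B','BB','CC',1), ('B','BB','CC',2),
  ('C','CC','DD',1);
""")

# Flag groups that have multiple rows (cte_1) but no duplicated
# (col_2, col_3) pair (anti-join against cte_2).
conn.execute("""
WITH
cte_1 AS (SELECT col_1 FROM some_table GROUP BY col_1 HAVING count(*) > 1),
cte_2 AS (SELECT col_1 FROM some_table
          GROUP BY col_1, col_2, col_3 HAVING count(*) > 1),
cte_3 AS (SELECT cte_1.col_1 FROM cte_1
          LEFT JOIN cte_2 ON cte_1.col_1 = cte_2.col_1
          WHERE cte_2.col_1 IS NULL)
UPDATE some_table SET col_5 = 1
WHERE col_1 IN (SELECT col_1 FROM cte_3)
""")
flags = [r[0] for r in
         conn.execute("SELECT col_5 FROM some_table ORDER BY col_1, col_4")]
print(flags)  # [1, 1, 1, 0, 0, 0]
```

Only the three 'A' rows get col_5 = 1, matching the expected output.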

Cannot use group by and over(partition by) in the same query?

I have a table myTable with 3 columns. col_1 is an INTEGER and the other 2 columns are DOUBLE. For example, col_1 takes the values {1, 2} and col_2 the values {0.1, 0.2, 0.3}; every col_1 value appears with all the col_2 values, and a (col_1, col_2) pair can repeat. The third column can have any value, as shown below:
col_1 | col_2 | Value
----------------------
1 | 0.1 | 1.0
1 | 0.2 | 2.0
1 | 0.2 | 3.0
1 | 0.3 | 4.0
1 | 0.3 | 5.0
2 | 0.1 | 6.0
2 | 0.1 | 7.0
2 | 0.1 | 8.0
2 | 0.2 | 9.0
2 | 0.3 | 10.0
What I want is to use the aggregate function SUM() on the Value column, partitioned by col_1 and grouped by col_2. The above table should then look like this:
col_1 | col_2 | sum_value
----------------------
1 | 0.1 | 1.0
1 | 0.2 | 5.0
1 | 0.3 | 9.0
2 | 0.1 | 21.0
2 | 0.2 | 9.0
2 | 0.3 | 10.0
I tried the following SQL query:
SELECT col_1, col_2, sum(Value) over(partition by col_1) as sum_value
from myTable
GROUP BY col_1, col_2
But on DB2 v10.5 it gave the following error:
SQL0119N An expression starting with "Value" specified in a SELECT
clause, HAVING clause, or ORDER BY clause is not specified in the
GROUP BY clause or it is in a SELECT clause, HAVING clause, or ORDER
BY clause with a column function and no GROUP BY clause is specified.
Can you kindly point out what is wrong? I do not have much experience with SQL.
Thank you.
Yes, you can, but you should be consistent regarding the grouping levels. That is, if your query is a GROUP BY query, then inside an analytic function you cannot use "detail" columns; you can only use the GROUP BY columns or the non-analytic aggregates, like this example:
select product_id, company,
       sum(members) as No_of_Members,
       sum(sum(members)) over (partition by company) as TotalMembership
from Product_Membership
group by product_id, company
Hope that helps
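The two grouping levels can also be made explicit with a derived table, which sidesteps the nested-aggregate syntax entirely: the inner query performs the GROUP BY, and the window function then runs over the grouped rows. A sketch with SQLite via Python's sqlite3 (not the original DB2 setup), using the question's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myTable (col_1 INT, col_2 REAL, Value REAL);
INSERT INTO myTable VALUES
  (1,0.1,1.0),(1,0.2,2.0),(1,0.2,3.0),(1,0.3,4.0),(1,0.3,5.0),
  (2,0.1,6.0),(2,0.1,7.0),(2,0.1,8.0),(2,0.2,9.0),(2,0.3,10.0);
""")

# Inner query: SUM(Value) per (col_1, col_2) group.
# Outer query: window SUM over the grouped rows, partitioned by col_1.
rows = conn.execute("""
SELECT col_1, col_2, grp_sum,
       SUM(grp_sum) OVER (PARTITION BY col_1) AS part_sum
FROM (SELECT col_1, col_2, SUM(Value) AS grp_sum
      FROM myTable GROUP BY col_1, col_2)
ORDER BY col_1, col_2
""").fetchall()
print(rows)
```

grp_sum reproduces the question's desired sum_value column (1.0, 5.0, 9.0, 21.0, 9.0, 10.0), while part_sum shows what a partition by col_1 would give instead (15.0 and 40.0).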
SELECT col_1, col_2, sum(sum(Value)) over(partition by col_1) as sum_value
-- also try changing "col_1" to "col_2" in OVER
from myTable
GROUP BY col_2, col_1
I found the solution.
I do not need to use OVER(PARTITION BY col_1) because it is already in the GROUP BY clause. Thus, the following query gives me the right answer:
SELECT col_1, col_2, sum(Value) as sum_value
from myTable GROUP BY col_1, col_2
since I am already grouping w.r.t col_1 and col_2.
Dave, thanks, I got the idea from your post.