SQL Joining 2 Tables - sql

I would like to merge two tables into one and also add a counter next to that. What I have now is:
SELECT [CUCY_DATA].*, [DIM].[Col1], [DIM].[Col2],
(SELECT COUNT([Cut Counter]) FROM [MSD]
WHERE [CUCY_DATA].[Cut Counter] = [MSD].[Cut Counter]
) AS [Nr Of Errors]
FROM [CUCY_DATA] FULL JOIN [DIM]
ON [CUCY_DATA].[Cut Counter] = [DIM].[Cut Counter]
This returns the data, but where the values don't match, NULLs are filled in. For instance, I have this:
Table CUCY_DATA
|_Cut Counter_|_Data1_|_Data2_|
| 1 | 12 | 24 |
| 2 | 13 | 26 |
| 3 | 10 | 20 |
| 4 | 11 | 22 |
Table DIM
|_Cut Counter_|_Col1_|_Col2_|
| 1 | 25 | 40 |
| 3 | 50 | 45 |
And they need to be merged into:
|_Cut Counter_|_Data1_|_Data2_|_Col1_|_Col2_|
| 1 | 12 | 24 | 25 | 40 |
| 2 | 13 | 26 | 25 | 40 |
| 3 | 10 | 20 | 50 | 45 |
| 4 | 11 | 22 | 50 | 45 |
SO THIS IS WRONG:
|_Cut Counter_|_Data1_|_Data2_|_Col1__|_Col2__|
| 1 | 12 | 24 | 25 | 40 |
| 2 | 13 | 26 | NULL | NULL |
| 3 | 10 | 20 | 50 | 45 |
| 4 | 11 | 22 | NULL | NULL |
Kind regards, Bob

How are you getting the col1 and col2 values where there is no corresponding row in your DIM table? (Rows 2 and 4). Your "wrong" result is exactly correct, that's what the outer join does.
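If the intended rule is "use the DIM row with the largest Cut Counter that does not exceed the current one" (an assumption read off the desired output, not something the question states), a correlated subquery in the join condition can express it. The sketch below uses Python's built-in sqlite3 module only to make the SQL runnable; the aliases c, d and d2 are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CUCY_DATA ("Cut Counter" INTEGER, Data1 INTEGER, Data2 INTEGER);
CREATE TABLE DIM ("Cut Counter" INTEGER, Col1 INTEGER, Col2 INTEGER);
INSERT INTO CUCY_DATA VALUES (1, 12, 24), (2, 13, 26), (3, 10, 20), (4, 11, 22);
INSERT INTO DIM VALUES (1, 25, 40), (3, 50, 45);
""")

# Assumed rule: each CUCY_DATA row takes the DIM row with the highest
# "Cut Counter" that is less than or equal to its own "Cut Counter".
rows = conn.execute("""
SELECT c."Cut Counter", c.Data1, c.Data2, d.Col1, d.Col2
FROM CUCY_DATA AS c
LEFT JOIN DIM AS d
  ON d."Cut Counter" = (SELECT MAX(d2."Cut Counter")
                        FROM DIM AS d2
                        WHERE d2."Cut Counter" <= c."Cut Counter")
ORDER BY c."Cut Counter"
""").fetchall()

for row in rows:
    print(row)
```

Rows of CUCY_DATA that come before the first DIM row would still get NULLs under this rule, because there is nothing to carry forward.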


Incremental/Update in hive

I have a Hive external table (Hive version below 0.14) with data such as:
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10011 | 10 | 3 | 0 |
| 10012 | 9 | 0 | 40 |
| 10015 | 10 | 3 | 0 |
| 10017 | 9 | 0 | 40 |
+--------+------+------+------+
And I have a delta file having data given below.
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10012 | 50 | 3 | 10 | --> update
| 10013 | 29 | 0 | 40 | --> insert
| 10014 | 10 | 3 | 0 | --> update
| 10013 | 19 | 0 | 40 | --> update
| 10015 | 70 | 3 | 0 | --> update
| 10016 | 17 | 0 | 40 | --> insert
+--------+------+------+------+
How can I update my Hive table with the delta file without using Sqoop? Any help on how to proceed would be great! Thanks.
The duplicate row appears because there are duplicates in the delta file. How do you know which one to keep? The last one?
In that case you can use, for example, row_number() and keep the row with the maximum value per id, something like the query below. Note that without an ORDER BY inside the window, which row ends up numbered last is not guaranteed.
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
( SELECT *, row_number() over( partition by id ) as row_num
,COUNT(*) OVER (PARTITION BY id) AS cnt
FROM temp_table
) tmp
ON initial.id=tmp.id
WHERE row_num=cnt
OR row_num IS NULL;
Output:
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+
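The same "keep the last delta row per id, then fall back to the existing row" logic can be tried outside Hive. The sketch below is not the Hive query above: it uses Python's sqlite3 module, replaces the FULL OUTER JOIN with a UNION ALL, and assumes a hypothetical seq column that records the load order of the delta rows so that "last" is well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (id INTEGER, A INTEGER, B INTEGER, C INTEGER);
-- seq is a hypothetical load-order column; it is not part of the original delta file.
CREATE TABLE temp_table (seq INTEGER PRIMARY KEY, id INTEGER, A INTEGER, B INTEGER, C INTEGER);
INSERT INTO table_a VALUES
  (10011, 10, 3, 0), (10012, 9, 0, 40), (10015, 10, 3, 0), (10017, 9, 0, 40);
INSERT INTO temp_table (id, A, B, C) VALUES
  (10012, 50, 3, 10), (10013, 29, 0, 40), (10014, 10, 3, 0),
  (10013, 19, 0, 40), (10015, 70, 3, 0), (10016, 17, 0, 40);
""")

merged = conn.execute("""
-- Latest delta row for every id present in the delta
SELECT id, A, B, C FROM temp_table
WHERE seq IN (SELECT MAX(seq) FROM temp_table GROUP BY id)
UNION ALL
-- Existing rows that the delta does not touch
SELECT id, A, B, C FROM table_a
WHERE id NOT IN (SELECT id FROM temp_table)
ORDER BY id
""").fetchall()

for row in merged:
    print(row)
```

With the sample data this keeps (10013, 19, 0, 40) and drops the earlier (10013, 29, 0, 40), matching the output shown above.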
You can load the file into a temporary table in Hive and then execute a FULL OUTER JOIN between the two tables.
Query Example:
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
temp_table tmp on initial.id=tmp.id;
Output
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 29 | 0 | 40 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+

Pandas - Grouping Rows With Same Value in Dataframe

Here is the dataframe in question:
|City|District|Population| Code | ID |
| A | 4 | 2000 | 3 | 21 |
| A | 8 | 7000 | 3 | 21 |
| A | 38 | 3000 | 3 | 21 |
| A | 7 | 2000 | 3 | 21 |
| B | 34 | 3000 | 6 | 84 |
| B | 9 | 5000 | 6 | 84 |
| C | 4 | 9000 | 1 | 28 |
| C | 21 | 1000 | 1 | 28 |
| C | 32 | 5000 | 1 | 28 |
| C | 46 | 20 | 1 | 28 |
I want to regroup the population counts by city to have this kind of output:
|City|Population| Code | ID |
| A | 14000 | 3 | 21 |
| B | 8000 | 6 | 84 |
| C | 15020 | 1 | 28 |
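df = df.groupby(['City', 'Code', 'ID'], as_index=False)['Population'].sum()
You can group by 'City', 'Code' and 'ID', then sum 'Population'. Passing as_index=False keeps the grouping keys as regular columns instead of moving them into the index.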

How to make rows shuffle?

There is a table with more than 10 rows, and I need to shuffle all the rows randomly and create a new table from the result. Any ideas?
Using select * from table order by random() seems slow.
The raw table looks like this; note that the target column splits it into two parts:
+--------+------+--------+------+-----+--------+
| cst_id | name | salary | fund | age | target |
+--------+------+--------+------+-----+--------+
| 1 | a | 100 | Y | 33 | 0 |
| 2 | b | 200 | Y | 21 | 0 |
| 3 | c | 300 | Y | 45 | 0 |
| 4 | d | 400 | N | 26 | 0 |
| 5 | e | 500 | N | 37 | 0 |
| 6 | f | 600 | Y | 56 | 0 |
| 7 | g | 700 | Y | 44 | 0 |
| 8 | h | 800 | N | 22 | 1 |
| 9 | i | 900 | N | 38 | 1 |
| 10 | j | 1000 | Y | 61 | 1 |
| 11 | k | 1100 | N | 51 | 1 |
| 12 | l | 1200 | N | 21 | 1 |
| 13 | m | 1300 | Y | 32 | 1 |
| 14 | n | 1400 | N | 17 | 1 |
+--------+------+--------+------+-----+--------+
after:
+--------+------+--------+------+-----+--------+
| cst_id | name | salary | fund | age | target |
+--------+------+--------+------+-----+--------+
| 1 | a | 100 | Y | 33 | 0 |
| 2 | b | 200 | Y | 21 | 0 |
| 8 | h | 800 | N | 22 | 1 |
| 9 | i | 900 | N | 38 | 1 |
| 3 | c | 300 | Y | 45 | 0 |
| 13 | m | 1300 | Y | 32 | 1 |
| 14 | n | 1400 | N | 17 | 1 |
| 5 | e | 500 | N | 37 | 0 |
| 6 | f | 600 | Y | 56 | 0 |
| 7 | g | 700 | Y | 44 | 0 |
| 10 | j | 1000 | Y | 61 | 1 |
| 11 | k | 1100 | N | 51 | 1 |
| 4 | d | 400 | N | 26 | 0 |
+--------+------+--------+------+-----+--------+
The following explains how to create a NEW table from an existing one, with the same schema and the same data, but with the rows shuffled.
Create a new table and import all the rows from the first table, ordered randomly by the RAND() SQL function:
CREATE TABLE new_table SELECT * FROM old_table ORDER BY RAND()
Or if you have created a table identical to the structure of the old one, use INSERT INTO instead:
INSERT INTO new_table SELECT * FROM old_table ORDER BY RAND()
That is, of course, if you want to preserve the primary key of each row, which is most likely what you want with old tables because of legacy code and relationships between data entities. However, if you want a brand-new table with all the records rearranged, as if for a different application, you can ignore the primary key by not importing the ID field of the old table.
For instance, say the old table has the fields ID, col1 and col2. To create a new, shuffled version of the old table without the ID:
CREATE TABLE new_table SELECT col1, col2 FROM old_table ORDER BY RAND()
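A new primary key ID is assigned automatically only if new_table defines an auto-increment ID column; a plain CREATE TABLE ... SELECT copies just the selected columns.
Keep in mind, however, that relations in SQL have no inherent order. Rows in a table are not stored in a guaranteed sequence, so you may get a different order on retrieval unless the query itself uses ORDER BY.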
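Because a table has no inherent order, one way to make a shuffle stick is to store the random value in a column and sort by it when reading. The sketch below illustrates that idea with Python's sqlite3 module rather than the MySQL statements above; the column name shuffle_key and the reduced three-column table are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE old_table (cst_id INTEGER PRIMARY KEY, name TEXT, target INTEGER)"
)
# 14 sample rows: names a..n, target 0 for the first seven rows and 1 for the rest.
conn.executemany(
    "INSERT INTO old_table VALUES (?, ?, ?)",
    [(i, chr(96 + i), 0 if i <= 7 else 1) for i in range(1, 15)],
)

# Persist one random value per row; shuffle_key is a hypothetical column name.
conn.execute(
    "CREATE TABLE new_table AS SELECT *, random() AS shuffle_key FROM old_table"
)

# The shuffled order is only guaranteed when the query asks for it.
shuffled_query = "SELECT cst_id, name, target FROM new_table ORDER BY shuffle_key"
first_read = conn.execute(shuffled_query).fetchall()
second_read = conn.execute(shuffled_query).fetchall()

for row in first_read:
    print(row)
```

Reading the table twice with the same ORDER BY returns the same shuffled sequence, which a bare SELECT * does not promise.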

Why is this WHERE NOT EXISTS() condition returning false pos/neg? (Itzik Ben-Gan identifying gaps)

I am learning Itzik Ben-Gan's 'solution 1' method of finding gaps to apply to a business problem.
Code here: http://rextester.com/CPXO58771
First, you start with a list of sequence values, some that break the sequence:
| A.seqval |
|----------|
| 2 |
| 3 |
| 11 |
| 12 |
| 13 |
| 31 |
| 33 |
| 34 |
| 35 |
| 42 |
And this query produces the results below:
SELECT
seqval + 1 AS start_range
,(SELECT MIN(B.seqval)
FROM dbo.NumSeq AS B
WHERE B.seqval > A.seqval) - 1 AS end_range
FROM
dbo.NumSeq AS A
WHERE 1=1
AND NOT EXISTS (SELECT *
FROM dbo.NumSeq AS B
WHERE B.seqval = A.seqval + 1)
AND seqval < (SELECT MAX(seqval) FROM dbo.NumSeq)
| start_range | end_range |
|-------------|-----------|
| 4 | 10 |
| 14 | 30 |
| 32 | 32 |
| 36 | 41 |
I am struggling to understand how the WHERE NOT EXISTS() filter is working. I went row-by-row to check the logic in the NOT EXISTS() subquery:
| A.seqval | A.seqval+1 | | B.seqval | | NOT EXISTS (SELECT * FROM NumSeq AS B WHERE B.seqval=A.seqval+1) |
|----------|------------|---|----------|---|-------------------------------------------------------------------------|
| 2 | 3 | | 2 | | TRUE- there is no A.seqval+1 = 2 |
| 3 | 4 | | 3 | | FALSE- there is a A.seqval+1 row = 3 |
| 11 | 12 | | 11 | | TRUE- there is no A.seqval+1 = 11 |
| 12 | 13 | | 12 | | FALSE- there is a A.seqval+1 row = 12 |
| 13 | 14 | | 13 | | FALSE- there is a A.seqval+1 row = 13 |
| 31 | 32 | | 31 | | ??? -there is no A.seqval+1 = 31, but this returns as TRUE in the query |
| 33 | 34 | | 33 | | TRUE- there is no A.seqval+1 = 33 |
| 34 | 35 | | 34 | | FALSE- there is a A.seqval+1 row = 34 |
| 35 | 36 | | 35 | | FALSE- there is a A.seqval+1 row = 35 |
| 42 | 43 | | 42 | | TRUE- there is no A.seqval+1 = 42 |
If NOT EXISTS() is returning TRUE for rows where a value doesn't exist, shouldn't they be included instead of excluded in the output?
How is seqval=31 identified as a true positive, when it is returned as FALSE?
I don't think your step-through handles the NOT in NOT EXISTS correctly.
Consider the first record: 2+1 = 3, and 3 exists in the data set, so NOT EXISTS is false, not true. Set A and set B are compared on A.seqval + 1 = B.seqval:
| A.seqval + 1 | B.seqval | EXISTS | NOT EXISTS |
|--------------|----------|--------|------------|
| | 2 | never needed, as we always look up A.seqval + 1 | |
| 2 + 1 = 3 | 3 | True | False |
| 3 + 1 = 4 | NULL | False | True |
| 31 + 1 = 32 | NULL | False | True |
| 42 + 1 = 43 | NULL | excluded by `where seqval < (SELECT MAX(seqval) FROM dbo.NumSeq)` | |
Carry this on to 31: 31+1 = 32, and 32 does not exist in the set, so NOT EXISTS is TRUE, as it should be.
If we asked whether 32 exists in the set, the answer would be false; because the test is NOT EXISTS, the result is true.
+----------+------------+-------------------------------------+----------------------------------+--------+------------+-----------------------------------------+
| A.SeqVal | A.SeqVal+1 | Does A.SeqVal+1 exist in A.SEQ Val? | Since it exists do we return it? | SeqVal | Now Add 1 | Now subtract 1 from next value in A.Seq |
+----------+------------+-------------------------------------+----------------------------------+--------+------------+-----------------------------------------+
| 2 | 3 | Yes | No | | | |
| 3 | 4 | No | Yes | 3 | 4 | 10 |
| 11 | 12 | Yes | No | | | |
| 12 | 13 | Yes | No | | | |
| 13 | 14 | No | Yes | 13 | 14 | 30 |
| 31 | 32 | No | yes | 31 | 32 | 32 |
| 33 | 34 | Yes | No | | | |
| 34 | 35 | Yes | No | | | |
| 35 | 36 | No | Yes | 35 | 36 | 41 |
| 42 | 43 | No | Yes | 42 | Ignore max | |
+----------+------------+-------------------------------------+----------------------------------+--------+------------+-----------------------------------------+
Now add one. Why? Because the value right after a returned seqval is the first missing value in the gap.
Now subtract one from the next value in A.seqval. Why? Because that next value exists, but the one just before it does not, so it is the last missing value in the gap.
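To check the walkthrough above, the query from the question can be run against the sample values. This sketch uses Python's sqlite3 module, so the table is named NumSeq without the dbo schema, and an ORDER BY is added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NumSeq (seqval INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO NumSeq VALUES (?)",
    [(v,) for v in (2, 3, 11, 12, 13, 31, 33, 34, 35, 42)],
)

gaps = conn.execute("""
SELECT
    A.seqval + 1 AS start_range,
    (SELECT MIN(B.seqval)
     FROM NumSeq AS B
     WHERE B.seqval > A.seqval) - 1 AS end_range
FROM NumSeq AS A
WHERE NOT EXISTS (SELECT *                      -- keep A only if A.seqval + 1 is missing
                  FROM NumSeq AS B
                  WHERE B.seqval = A.seqval + 1)
  AND A.seqval < (SELECT MAX(seqval) FROM NumSeq)  -- drop the last value (42)
ORDER BY start_range
""").fetchall()

for start_range, end_range in gaps:
    print(start_range, end_range)
```

The rows that survive the NOT EXISTS filter are 3, 13, 31 and 35, which are exactly the values that sit just before a gap.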

SQL Select and Group By clause

I have data as per the table below. I pass in a list of numbers and need the raceId where all the numbers appear in the data column for that race.
+-----+--------+------+
| Id | raceId | data |
+-----+--------+------+
| 14 | 1 | 1 |
| 12 | 1 | 2 |
| 13 | 1 | 3 |
| 16 | 1 | 8 |
| 47 | 2 | 1 |
| 43 | 2 | 2 |
| 46 | 2 | 6 |
| 40 | 2 | 7 |
| 42 | 2 | 8 |
| 68 | 3 | 3 |
| 69 | 3 | 6 |
| 65 | 3 | 7 |
| 90 | 4 | 1 |
| 89 | 4 | 2 |
| 95 | 4 | 6 |
| 92 | 4 | 7 |
| 93 | 4 | 8 |
| 114 | 5 | 1 |
| 116 | 5 | 2 |
| 117 | 5 | 3 |
| 118 | 5 | 8 |
| 138 | 6 | 2 |
| 139 | 6 | 6 |
| 140 | 6 | 7 |
| 137 | 6 | 8 |
+-----+--------+------+
For example, if I pass in 1,2,7 I would get the following raceIds:
2 and 4
I have tried the simple statement
SELECT * FROM table WHERE ((data = 1) or (data = 2) or (data = 7))
But I don't really understand the GROUP BY clause, or indeed whether it is the correct way of doing this.
select raceId
from yourtable
where data in (1,2,7)
group by raceId
having count(raceId) = 3 /* length(1,2,7) */
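This assumes each (raceId, data) pair is unique. If it's not, then you should use
select raceId
from (select distinct raceId, data
from yourtable
where data in (1,2,7)) as t /* most databases require an alias on a derived table */
group by raceId
having count(raceId) = 3
SELECT DISTINCT raceId FROM yourtable WHERE data IN (1, 2, 7)
Note that this last query returns every race that contains at least one of the values, not only the races that contain all of them.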
This is an example of a "set-within-sets" query. I like to solve these with group by and having.
select raceid
from races
where data in (1, 2, 7)
group by raceid
having count(*) = 3;
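The GROUP BY / HAVING approach can be checked against the sample data. The sketch below runs it through Python's sqlite3 module; the helper name races_containing_all is made up for the example, and it uses COUNT(DISTINCT data) so that duplicate (raceId, data) pairs would not inflate the count.

```python
import sqlite3

SAMPLE_ROWS = [
    (14, 1, 1), (12, 1, 2), (13, 1, 3), (16, 1, 8),
    (47, 2, 1), (43, 2, 2), (46, 2, 6), (40, 2, 7), (42, 2, 8),
    (68, 3, 3), (69, 3, 6), (65, 3, 7),
    (90, 4, 1), (89, 4, 2), (95, 4, 6), (92, 4, 7), (93, 4, 8),
    (114, 5, 1), (116, 5, 2), (117, 5, 3), (118, 5, 8),
    (138, 6, 2), (139, 6, 6), (140, 6, 7), (137, 6, 8),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE races (Id INTEGER PRIMARY KEY, raceId INTEGER, data INTEGER)")
conn.executemany("INSERT INTO races VALUES (?, ?, ?)", SAMPLE_ROWS)


def races_containing_all(connection, wanted):
    """Return the raceIds whose rows cover every value in `wanted` (non-empty)."""
    values = sorted(set(wanted))
    placeholders = ",".join("?" * len(values))
    sql = (
        f"SELECT raceId FROM races WHERE data IN ({placeholders}) "
        "GROUP BY raceId "
        "HAVING COUNT(DISTINCT data) = ? "  # one distinct hit per wanted value
        "ORDER BY raceId"
    )
    return [row[0] for row in connection.execute(sql, (*values, len(values)))]


print(races_containing_all(conn, [1, 2, 7]))
```

The WHERE clause first narrows each race to the rows that match any wanted value, and the HAVING clause then keeps only the races where the number of distinct matches equals the length of the list.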