How to make rows shuffle? - sql

There is a table with over 10+ rows, and now needed to shuffle all rows randomly and create a new table on it. any ideas ?
Using select * from table order by random() seems slow.
raw table is like,and the target column is separated into two parts:
+--------+------+--------+------+-----+--------+
| cst_id | name | salary | fund | age | target |
+--------+------+--------+------+-----+--------+
| 1 | a | 100 | Y | 33 | 0 |
| 2 | b | 200 | Y | 21 | 0 |
| 3 | c | 300 | Y | 45 | 0 |
| 4 | d | 400 | N | 26 | 0 |
| 5 | e | 500 | N | 37 | 0 |
| 6 | f | 600 | Y | 56 | 0 |
| 7 | g | 700 | Y | 44 | 0 |
| 8 | h | 800 | N | 22 | 1 |
| 9 | i | 900 | N | 38 | 1 |
| 10 | j | 1000 | Y | 61 | 1 |
| 11 | k | 1100 | N | 51 | 1 |
| 12 | l | 1200 | N | 21 | 1 |
| 13 | m | 1300 | Y | 32 | 1 |
| 14 | n | 1400 | N | 17 | 1 |
+--------+------+--------+------+-----+--------+
after:
+--------+------+--------+------+-----+--------+
| cst_id | name | salary | fund | age | target |
+--------+------+--------+------+-----+--------+
| 1 | a | 100 | Y | 33 | 0 |
| 2 | b | 200 | Y | 21 | 0 |
| 8 | h | 800 | N | 22 | 1 |
| 9 | i | 900 | N | 38 | 1 |
| 3 | c | 300 | Y | 45 | 0 |
| 13 | m | 1300 | Y | 32 | 1 |
| 14 | n | 1400 | N | 17 | 1 |
| 5 | e | 500 | N | 37 | 0 |
| 6 | f | 600 | Y | 56 | 0 |
| 7 | g | 700 | Y | 44 | 0 |
| 10 | j | 1000 | Y | 61 | 1 |
| 11 | k | 1100 | N | 51 | 1 |
| 4 | d | 400 | N | 26 | 0 |
+--------+------+--------+------+-----+--------+

Following explanation is to create NEW table from existing one with same data as in old one(same schema) with shuffled rows.
Create a new table and import all those rows and records from first table, randomly selected and ordered by the RAND() SQL function:
CREATE TABLE new_table SELECT * FROM old_table ORDER BY RAND()
Or if you have created a table identical to the structure of the old one, use INSERT INTO instead:
INSERT INTO new_table SELECT * FROM old_table ORDER BY RAND()
That is of course if you want to preserve the primary key identification of each row, which is most likely what you want to do with old tables because of the legacy code and data entity relationships. However, if you want a grand new table with all the shuffled records completely rearranged in order as if it’s for a different application, you can ignore the primary key or ID by not importing the ID field of the old table.
For instance, you got ID, col1 and col2 in the old table as data fields. To create a grand new reordered or shuffled rows version of old table:
CREATE TABLE new_table SELECT col1, col2 FROM old_table ORDER BY RAND()
And a new primary key ID will be automatically assigned to each of the rows in the new table.
But in SQL, Relations have no order. Rows in a relational database are not sorted. You may get different order while retrieving.

Related

Using a Group BY or PARTITION on multiple columns where records are grouped together if ANY of the columns return a match

Let's say I have the following table:
Record_ID | Match_criteria_1 | Match_criteria_2 | Match_criteria_3 | Dollars
1 | A | V | F | 10
2 | A | W | G | 20
3 | B | W | H | 30
4 | B | X | I | 40
5 | C | Y | F | 50
6 | C | Z | J | 60
If I try to use a 'GROUP BY' or 'Over (PARTITION BY)' on Match_criteria_1, Match_criteria_2, and Match_criteria_3, I would end up with separate 6 groups/partitions.
SELECT *, sum(Dollars) OVER (PARTITION BY Match_criteria_1, Match_criteria_2, Match_Criteria_3) AS Total_Dollars
FROM My_table
Record_ID | Match_criteria_1 | Match_criteria_2 | Match_criteria_3 | Dollars | Total_Dollars
1 | A | V | F | 10 | 10
2 | A | W | G | 20 | 20
3 | B | W | H | 30 | 30
4 | B | X | I | 40 | 40
5 | C | Y | F | 50 | 50
6 | C | Z | J | 60 | 60
As you can see, none of the records have the same Match_criteria_1, Match_criteria_2, and Match_criteria_3.
But what if I wanted to group records that have the same Match_criteria_1, Match_criteria_2 OR Match_criteria_3?
So using my example, Record 1 matches with Record 2 due to Match_criteria_1, Record 2 matches with Record 3 due to Match_criteria_2, Record 3 matches with Record 4 due to Match_criteria_1, Record 5 matches with Record 1 due to Match_Criteria_3, and Record 6 matches with Record 5 due to Match_criteria 1 (so sort of a transitive property thing going on). The desired result then is :
Record_ID | Match_criteria_1 | Match_criteria_2 | Match_criteria_3 | Dollars | Total_Dollars
1 | A | V | F | 10 | 210
2 | A | W | G | 20 | 210
3 | B | W | H | 30 | 210
4 | B | X | I | 40 | 210
5 | C | Y | F | 50 | 210
6 | C | Z | J | 60 | 210
where Total_dollars is the sum of every record due to the fact that all six policies match with each other due to transitivity. So Records 1 and 6 may have no match criteria in common but they are still grouped together because they both match with Record 5.
Perhaps I am understanding this incorrectly, but I think you could just get the total dollars for all rows that match other rows and join to that? I'm not sure exactly what your expected output should be if a row doesn't match. This answer will still include that row, but the total dollars would not include the amount in that row.
SELECT test.*, total_dollars
FROM test,
(SELECT sum(dollars) as total_dollars
FROM (select distinct test.* FROM test
JOIN test test2 ON (ARRAY[test.match_criteria_1, test.match_criteria_2, test.match_criteria_3] && ARRAY[test2.match_criteria_1, test2.match_criteria_2, test2.match_criteria_3])
AND test.record_id != test2.record_id order by 1
) sub)
sub2;
I added another row that shouldn't match any of the others:
insert into test VALUES (7, 'M', 'N', 'O', 100);
and here are the results:
record_id | match_criteria_1 | match_criteria_2 | match_criteria_3 | dollars | total_dollars
-----------+------------------+------------------+------------------+---------+---------------
1 | A | V | F | 10 | 210
2 | A | W | G | 20 | 210
3 | B | W | H | 30 | 210
4 | B | X | I | 40 | 210
5 | C | Y | F | 50 | 210
6 | C | Z | J | 60 | 210
7 | M | N | O | 100 | 210
(7 rows)

Incremental/Update in hive

I have a hive external table with data say, (version less than 0.14)
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10011 | 10 | 3 | 0 |
| 10012 | 9 | 0 | 40 |
| 10015 | 10 | 3 | 0 |
| 10017 | 9 | 0 | 40 |
+--------+------+------+------+
And I have a delta file having data given below.
+--------+------+------+------+
| id | A | B | C |
+--------+------+------+------+
| 10012 | 50 | 3 | 10 | --> update
| 10013 | 29 | 0 | 40 | --> insert
| 10014 | 10 | 3 | 0 | --> update
| 10013 | 19 | 0 | 40 | --> update
| 10015 | 70 | 3 | 0 | --> update
| 10016 | 17 | 0 | 40 | --> insert
+--------+------+------+------+
How can I update my hive table with the delta file, without using sqoop. Any help on how to proceed will be great! Thanks.
This is because there is duplicates in the file. How do you know which you should keep? The last one?
In that case you can use, for example, the row_number and then get the maximum value. Something like that.
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
( SELECT *, row_number() over( partition by id ) as row_num
,COUNT(*) OVER (PARTITION BY id) AS cnt
FROM temp_table
) tmp
ON initial.id=tmp.id
WHERE row_num=cnt
OR row_num IS NULL;
Output:
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+
You can load the file to a temporary table in hive and then execute a FULL OUTER JOIN between the two tables.
Query Example:
SELECT coalesce(tmp.id,initial.id) as id,
coalesce(tmp.A, initial.A) as A,
coalesce(tmp.B,initial.B) as B,
coalesce(tmp.C, initial.C) as C
FROM
table_a initial
FULL OUTER JOIN
temp_table tmp on initial.id=tmp.id;
Output
+--------+-----+----+-----+--+
| id | a | b | c |
+--------+-----+----+-----+--+
| 10011 | 10 | 3 | 0 |
| 10012 | 50 | 3 | 10 |
| 10013 | 29 | 0 | 40 |
| 10013 | 19 | 0 | 40 |
| 10014 | 10 | 3 | 0 |
| 10015 | 70 | 3 | 0 |
| 10016 | 17 | 0 | 40 |
| 10017 | 9 | 0 | 40 |
+--------+-----+----+-----+--+

Avoid repeated values in a join

I have two tables - a header and a matrix/details.
*Header Table* *Matrix / Details Table*
+----+--------+-----+ +----+--------+------+
| ID | Parent | Qty | | ID | Child | Qty |
+----+--------+-----+ +----+--------+------+
| 1 | A | 10 | | 1 | X | 100 |
| 2 | B | 20 | | 1 | Y | 1000 |
| 3 | C | 30 | | 2 | X | 200 |
+----+--------+-----+ | 2 | Y | 2000 |
| 3 | X | 30 |
| 3 | Y | 300 |
| 3 | Z | 3000 |
+----+--------+------+
I'm Joining these two tables based on ID.
I don't want the result to have duplicated values from header table.
I expect a result like following:
*Current Result* *Expected Result*
+----+--------+-----+-------+------+ +----+--------+-----+-------+------+
| ID | Parent | Qty | Child | Qty | | ID | Parent | Qty | Child | Qty |
+----+--------+-----+-------+------+ +----+--------+-----+-------+------+
| 1 | A | 10 | X | 100 | | 1 | A | 10 | X | 100 |
| 1 | A | 10 | Y | 1000 | | | | | Y | 1000 |
| 2 | B | 20 | X | 200 | | 2 | B | 20 | X | 200 |
| 2 | B | 20 | Y | 2000 | | | | | Y | 2000 |
| 3 | C | 30 | X | 30 | | 3 | C | 30 | X | 30 |
| 3 | C | 30 | Y | 300 | | | | | Y | 300 |
| 3 | C | 30 | Z | 3000 | | | | | Z | 3000 |
+----+--------+-----+-------+------+ +----+--------+-----+-------+------+
Is this possible? If not any, alternate solution available?
Thanks in advance...
If you are using SQL Server,Try with the below query.
;WITH CTE_1
AS
(SELECT *,ROW_NUMBER()OVER(PARTITION BY ID,Parent,Quantity ORDER BY ID ) RNO
FROM Header H
JOIN [Matrix / Details] M
ON H.ID=M.ID)
SELECT CASE WHEN RNO=1 THEN CAST(ID as VARCHAR(50)) ELSE '' END ID,
CASE WHEN RNO=1 THEN Parent ELSE '' END Parent,
CASE WHEN RNO=1 THEN cast(Quantity as VARCHAR(50)) ELSE '' END Quantity,
Child,Qty
FROM CTE_1
ORDER BY ID,Parent,Quantity

How do I select columns whenever they change?

I'm trying to create a slowly changing dimension (type 2 dimension) and am a bit lost on how to logically write it out. Say that we have a source table with a grain of Person | Country | Department | Login Time. I want to create this dimension table with Person | Country | Department | Eff Start time | Eff End Time.
Data could look like this:
Person | Country | Department | Login Time
------------------------------------------
Bob | CANADA | Marketing | 2009-01-01
Bob | CANADA | Marketing | 2009-02-01
Bob | USA | Marketing | 2009-03-01
Bob | USA | Sales | 2009-04-01
Bob | MEX | Product | 2009-05-01
Bob | MEX | Product | 2009-06-01
Bob | MEX | Product | 2009-07-01
Bob | CANADA | Marketing | 2009-08-01
What I want in the Type 2 dimension would look like this:
Person | Country | Department | Eff Start time | Eff End Time
------------------------------------------------------------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 | NULL
Assume that Bob's name, Country and Department hasn't been updated since 2009-08-01 so it's left as NULL
What function would work best here? This is on Netezza, which uses a flavor of Postgres.
Obviously GROUP BY would not work here because of same groupings later on (I added in Bob | CANADA | Marketing at the last row to show this.
EDIT
Including a hash column on Person, Country, and Department, would make sense, correct? Thinking of using logic of
SELECT PERSON, COUNTRY, DEPARTMENT
FROM table t1
where
person = person
AND t1.hash <> hash_function(person, country, department)
Answer
create table so (
person varchar(32)
,country varchar(32)
,department varchar(32)
,login_time date
) distribute on random;
insert into so values ('Bob','CANADA','Marketing','2009-01-01');
insert into so values ('Bob','CANADA','Marketing','2009-02-01');
insert into so values ('Bob','USA','Marketing','2009-03-01');
insert into so values ('Bob','USA','Sales','2009-04-01');
insert into so values ('Bob','MEX','Product','2009-05-01');
insert into so values ('Bob','MEX','Product','2009-06-01');
insert into so values ('Bob','MEX','Product','2009-07-01');
insert into so values ('Bob','CANADA','Marketing','2009-08-01');
/* ************************************************************************** */
with prm as ( --Create an ordinal primary key.
select
*
,row_number() over (
partition by person
order by login_time
) rwn
from
so
), chn as ( --Chain events to their previous and next event.
select
cur.rwn
,cur.person
,cur.country
,cur.department
,cur.login_time cur_login
,case
when
cur.country = prv.country
and cur.department = prv.department
then 1
else 0
end prv_equal
,case
when
(
cur.country = nxt.country
and cur.department = nxt.department
) or nxt.rwn is null --No next record should be equivalent to matching.
then 1
else 0
end nxt_equal
,case prv_equal
when 0 then cur_login
else null
end eff_login_start_sparse
,case
when eff_login_start_sparse is null
then max(eff_login_start_sparse) over (
partition by cur.person
order by rwn
rows unbounded preceding --The secret sauce.
)
else eff_login_start_sparse
end eff_login_start
,case nxt_equal
when 0 then cur_login
else null
end eff_login_end
from
prm cur
left outer join prm nxt on
cur.person = nxt.person
and cur.rwn + 1 = nxt.rwn
left outer join prm prv on
cur.person = prv.person
and cur.rwn - 1 = prv.rwn
), grp as ( --Group by login starts.
select
person
,country
,department
,eff_login_start
,max(eff_login_end) eff_login_end
from
chn
group by
person
,country
,department
,eff_login_start
), led as ( --Change the effective end to be the next start, if desired.
select
person
,country
,department
,eff_login_start
,case
when eff_login_end is null
then null
else
lead(eff_login_start) over (
partition by person
order by eff_login_start
)
end eff_login_end
from
grp
)
select * from led order by eff_login_start;
This code returns the following table.
PERSON | COUNTRY | DEPARTMENT | EFF_LOGIN_START | EFF_LOGIN_END
--------+---------+------------+-----------------+---------------
Bob | CANADA | Marketing | 2009-01-01 | 2009-03-01
Bob | USA | Marketing | 2009-03-01 | 2009-04-01
Bob | USA | Sales | 2009-04-01 | 2009-05-01
Bob | MEX | Product | 2009-05-01 | 2009-08-01
Bob | CANADA | Marketing | 2009-08-01 |
Explanation
I must have solved this four or five times in the past few years and keep neglecting to write it down formally. I'm glad to have the chance to do it, so this is a great question.
When attempting this, I like writing down the problem in matrix form. Here's the input, presuming that all values have the same key in the SCD.
Cv | Ce
----|----
A | 10
A | 11
B | 14
C | 16
D | 18
D | 25
D | 34
A | 40
Where Cv is the value that we'll need to compare against (again, presuming that the key value for the SCD is equal in this data; we'll be partitioning over the key value the entire time so it's irrelevant to the solution) and Ce is the event time.
First, we need an ordinal primary key. I've designated this Ck in the table. This will allow us to join the table to itself to get the previous and next events. I've called these columns Pk (previous key), Nk (next key), Pv, and Nv.
Cv | Ce | Ck | Pk | Pv | Nk | Nv |
----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A |
A | 11 | 2 | 1 | A | 3 | B |
B | 14 | 3 | 2 | A | 4 | C |
C | 16 | 4 | 3 | B | 5 | D |
D | 18 | 5 | 4 | C | 6 | D |
D | 25 | 6 | 5 | D | 7 | D |
D | 34 | 7 | 6 | D | 8 | A |
A | 40 | 8 | 7 | D | | |
Now we need some columns to see if we're at the beginning or end of a contiguous event block. I'll call these Pc and Nc, for contiguous. Pc is defined as Pv = Cv => true. 1 represents true and 0 represents false. Nc is defined similarly, except that the null case defaults to true (we'll see why in a minute)
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc |
----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 |
A | 40 | 8 | 7 | D | | | 0 | 1 |
Now you can start to see how the 1,1 combination of Pc,Nc is a completely useless record. We know this intuitively, since Bob's Mex/Product combination on the 6th row is pretty much useless information when building an SCD.
So let's get rid of the useless information. I'll add two new columns here: an almost-complete effective start time called Sn and an actually-complete effective end time called Ee. Sn is is populated with Ce when Pc is 0 and Ee is populated with Ce when Nc is 0.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee |
----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | |
This looks really close, but we still have the problem that we can't group by Cv (person/country/department). What we need is for Sn to populate all those nulls with the previous value of Sn. You could join this table to itself on rwn < rwn and get the maximum, but I'm going to be lazy and use Netezza's analytic functions and the rows unbounded preceding clause. It's a shortcut to the method I just described. So we're going to create another column called Es, efffective start, defined as follows.
case
when Sn is null
then max(Sn) over (
partition by k --key value of the SCD
order by Ck
rows unbounded preceding
)
else Sn
end Es
With that definition, we get this.
Cv | Ce | Ck | Pk | Pv | Nk | Nv | Pc | Nc | Sn | Ee | Es |
----|----|----|----|----|----|----|----|----|----|----|----|
A | 10 | 1 | | | 2 | A | 0 | 1 | 10 | | 10 |
A | 11 | 2 | 1 | A | 3 | B | 1 | 0 | | 11 | 10 |
B | 14 | 3 | 2 | A | 4 | C | 0 | 0 | 14 | 14 | 14 |
C | 16 | 4 | 3 | B | 5 | D | 0 | 0 | 16 | 16 | 16 |
D | 18 | 5 | 4 | C | 6 | D | 0 | 1 | 18 | | 18 |
D | 25 | 6 | 5 | D | 7 | D | 1 | 1 | | | 18 |
D | 34 | 7 | 6 | D | 8 | A | 1 | 0 | | 34 | 18 |
A | 40 | 8 | 7 | D | | | 0 | 1 | 40 | | 40 |
The rest is trivial. Group by Es and grab the max of Ee to obtain this table.
Cv | Es | Ee |
----|----|----|
A | 10 | 11 |
B | 14 | 14 |
C | 16 | 16 |
D | 18 | 34 |
A | 40 | |
If you want to populate the effective end time with the next start, join the table again to itself or use the lead() window function to grab it.

SQL Joining 2 Tables

I would like to merge two tables into one and also add a counter next to that. What i have now is
SELECT [CUCY_DATA].*, [DIM].[Col1], [DIM].[Col2],
(SELECT COUNT([Cut Counter]) FROM [MSD]
WHERE [CUCY_DATA].[Cut Counter] = [MSD].[Cut Counter]
) AS [Nr Of Errors]
FROM [CUCY_DATA] FULL JOIN [DIM]
ON [CUCY_DATA].[Cut Counter] = [DIM].[Cut Counter]
This way the data is inserted but where the values don't match nulls are inserted. I want for instance this
Table CUCY_DATA
|_Cut Counter_|_Data1_|_Data2_|
| 1 | 12 | 24 |
| 2 | 13 | 26 |
| 3 | 10 | 20 |
| 4 | 11 | 22 |
Table DIM
|_Cut Counter_|_Col1_|_Col2_|
| 1 | 25 | 40 |
| 3 | 50 | 45 |
And they need to be merged into:
|_Cut Counter_|_Data1_|_Data2_|_Col1_|_Col2_|
| 1 | 12 | 24 | 25 | 40 |
| 2 | 13 | 26 | 25 | 40 |
| 3 | 10 | 20 | 50 | 45 |
| 4 | 11 | 22 | 50 | 45 |
SO THIS IS WRONG:
|_Cut Counter_|_Data1_|_Data2_|_Col1__|_Col2__|
| 1 | 12 | 24 | 25 | 40 |
| 2 | 13 | 26 | NULL | NULL |
| 3 | 10 | 20 | 50 | 45 |
| 4 | 11 | 22 | NULL | NULL |
Kind regards, Bob
How are you getting the col1 and col2 values where there is no corresponding row in your DIM table? (Rows 2 and 4). Your "wrong" result is exactly correct, that's what the outer join does.