What is the best option to split a table with 6 billion records?

We have a table on a HANA database that contains 6 billion records. The PK of this table consists of 5 columns of type varchar(30). We would like to divide this set of 6 billion records into batches of up to 50 million records each, which we could then replicate with an external tool. So the task is to split the set into batches of about 50 million rows, so that we can refer to each batch in a WHERE clause. Example:
select * from bigTable where partition = 1; -- about 50 million records
select * from bigTable where partition = 2; -- about 50 million records
Is there any function or approach in HANA that we can use?

Thank you for the answer. To be honest, I finally wrote a query in T-SQL that does what I want, but I don't know if it is possible to rewrite it in HANA:
SELECT
    CONVERT(INT, SUBSTRING(HASHBYTES('MD5', PKcol1 + PKcol2 + PKcol3 + PKcol4 + PKcol5), 1, 1)) % 120,
    COUNT(*)
FROM bigTable
GROUP BY CONVERT(INT, SUBSTRING(HASHBYTES('MD5', PKcol1 + PKcol2 + PKcol3 + PKcol4 + PKcol5), 1, 1)) % 120;
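If the MD5 approach turns out hard to reproduce with HANA's hash functions, one alternative (a sketch of mine, not from this thread, assuming HANA's NTILE window function and that a one-off full scan in key order is acceptable) is to deal the rows directly into a fixed number of roughly equal buckets; the bucket count of 120 mirrors the T-SQL above:

SELECT partition_id, COUNT(*)
FROM (
    -- NTILE assigns each row a bucket number from 1 to 120
    SELECT NTILE(120) OVER (ORDER BY PKcol1, PKcol2, PKcol3, PKcol4, PKcol5) AS partition_id
    FROM bigTable
) AS buckets
GROUP BY partition_id;

Materializing partition_id into a helper table would then let the "WHERE partition = n" queries from the question work as written.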

If you ORDER BY all primary key fields, you can navigate through all entries with OFFSET and LIMIT, as long as reading happens in one session so that the cursor stays stable.
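A sketch of that paging pattern with the batch size from the question:

SELECT * FROM bigTable
ORDER BY PKcol1, PKcol2, PKcol3, PKcol4, PKcol5
LIMIT 50000000 OFFSET 0;   -- batch 1
-- batch 2: LIMIT 50000000 OFFSET 50000000, and so on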

Related

How to update 4,000,000 random records in SQL Server

I have 10 million rows in total and I want to update random records with some values.
I want to update 4 million random rows with DateKey = 1, the next 4 million random rows with DateKey = 2, and the last 2 million with DateKey = 3.
I tried to learn functions like RAND() and wanted to use one in my query, but I couldn't make it work. I could only update the top 4 million rows, but I want random records, not the first 4 million.
Here is my SQL statement:
update top (4000000) [FACT_INTERNATIONAL]
set [DateKey] = 2
But I want 4,000,000 random rows, not the top ones. I am using SQL Server 2017.
You can use row_number():
with toupdate as (
      select t.*, row_number() over (order by newid()) as seqnum
      from [FACT_INTERNATIONAL] t
     )
update toupdate
    set datekey = (case when seqnum <= 4000000 then 1
                        when seqnum <= 8000000 then 2
                        else 3
                   end);
Note: updating 10,000,000 records is an expensive operation, so this will probably take a long time. It is often more efficient to create a new table and copy the existing data over. However, your question is not about performance.
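A minimal sketch of that copy-based approach (SomeColumn stands in for the fact table's real columns, and the target table name is hypothetical):

SELECT s.SomeColumn,   -- list the remaining columns of the fact table here
       CASE WHEN s.seqnum <= 4000000 THEN 1
            WHEN s.seqnum <= 8000000 THEN 2
            ELSE 3
       END AS DateKey
INTO FACT_INTERNATIONAL_new   -- hypothetical name for the rebuilt table
FROM (SELECT f.*, ROW_NUMBER() OVER (ORDER BY NEWID()) AS seqnum
      FROM [FACT_INTERNATIONAL] f) AS s;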
You can use NEWID() in ORDER BY like this:
UPDATE X
SET [DateKey] = 2
FROM (SELECT TOP (4000000) *
      FROM [FACT_INTERNATIONAL] AS X
      ORDER BY NEWID()) AS X
Updating 4 million records in one statement is expensive, and SQL Server may time out or fill the transaction log after a while. In my opinion, try it with fewer records per statement.
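One common pattern for that, sketched here under the assumption that not-yet-processed rows can be recognized (the NULL check below is hypothetical), is to update in fixed-size chunks so each transaction stays small:

WHILE 1 = 1
BEGIN
    UPDATE TOP (100000) [FACT_INTERNATIONAL]
    SET [DateKey] = 2
    WHERE [DateKey] IS NULL;   -- hypothetical marker for rows not yet updated

    IF @@ROWCOUNT = 0 BREAK;   -- stop when no matching rows remain
END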

How to do a SQL count only up to x number?

I want to know whether a given query returns more than x rows, in a performant way. Say I have a query that outputs 2 billion rows, but I only want to know whether the result set is bigger than 10k; how would I do this without the SQL engine counting all the way up to 2 billion?
I tried this
SELECT 1
WHERE EXISTS (SELECT COUNT(1)
              FROM mytable
              WHERE somefilter = 58
              HAVING COUNT(1) < 10000)
but it seems just as slow as (or slower than) a simple count:
SELECT COUNT(1)
FROM mytable
WHERE somefilter = 58
This is for SQL Server 2016.
Any ideas?
You can do:
select 1
where (select count(*)
       from (select top (10000) 1 as dummy
             from mytable
             where somefilter = 58
            ) x
      ) < 10000
The innermost subquery returns at most 10,000 rows. If there are more than 10,000 matching rows, the query stops after the first 10,000 and the count comes back as exactly 10,000.
Note: if there are fewer than 10,000 matching rows, this will not help performance, because all matching rows still have to be generated. You may need to add indexes or partitions to really improve performance.
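A hedged alternative sketch (not from the original answer): probe directly for the existence of a 10,001st row, so the engine can stop as soon as it finds one, and the query returns 1 exactly when the filter matches more than 10,000 rows:

SELECT 1
WHERE EXISTS (SELECT 1
              FROM mytable
              WHERE somefilter = 58
              ORDER BY (SELECT NULL)   -- ORDER BY is required for OFFSET
              OFFSET 10000 ROWS FETCH NEXT 1 ROWS ONLY);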

SQL Server 2012 : update a row with unique number

I have a table with 50k records. Now I want to update one column of the table with a random number. The number should have 7 digits.
I don't want to do that with a procedure or a loop.
PinDetailId PinNo
--------------------
783 2722692
784 9888648
785 6215578
786 7917727
I have tried this code, but I was not able to succeed; I need a 7-digit number.
SELECT
    FLOOR(ABS(CHECKSUM(NEWID())) / 2147483647.0 * 3 + 1) AS rn,
    FLOOR(2000 + RAND() * (3000 - 2000)) AS rn2
FROM
    [GeneratePinDetail]
Random
For a random number, you can use ABS(CHECKSUM(NewId())) % range + lowerbound:
(source: How do I generate random number for each row in a TSQL Select?)
INSERT INTO ResultsTable (PinDetailId, PinNo)
SELECT PinDetailId,
       (ABS(CHECKSUM(NewId())) % 9000000 + 1000000) AS PinNo   -- yields 1000000..9999999
FROM GeneratePinDetail
ORDER BY PinDetailId ASC;
Likely Not Unique
I cannot guarantee these will be unique, but they should be evenly distributed (an equal chance of any 7-digit number). If you want to check for duplicates, you can run this:
SELECT result.PinDetailId, result.PinNo
FROM ResultsTable result
INNER JOIN (
    SELECT PinNo
    FROM ResultsTable
    GROUP BY PinNo
    HAVING COUNT(1) > 1
) test
    ON result.PinNo = test.PinNo;
You can create a sequence object and update your fields from it; the sequence increments automatically each time a value is requested.
https://learn.microsoft.com/en-us/sql/t-sql/functions/next-value-for-transact-sql
Updated based on comment:
After retrieving the 'next value for' from the sequence, you can apply operations to it to randomize it. The sequence can then be used to create a unique seed for your randomization function.
If you don't want to create a function yourself, SQL Server already has the RAND function built in.
https://learn.microsoft.com/en-us/sql/t-sql/functions/rand-transact-sql
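A minimal sketch combining those two hints, using the table from the question; the sequence name and the multiplier are my own. Because 4999999 is coprime to 9000000, the mapping x -> (x * 4999999) % 9000000 is a bijection on 0..8999999, so the generated pins are guaranteed unique and always have 7 digits:

CREATE SEQUENCE PinSeq AS bigint START WITH 0 INCREMENT BY 1;   -- hypothetical name

UPDATE g
SET PinNo = (NEXT VALUE FOR PinSeq) * 4999999 % 9000000 + 1000000
FROM GeneratePinDetail AS g;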

Giving Range to the SQL Column

I have a SQL table with a value column and a Probability column. I want to select one row from it at random, but I want to give more chance to the rows with higher weighted probability. I can do this with
Order By abs(checksum(newid()))
but the differences between the probabilities are too large, so it gives far too much chance to the highest probability: after picking that value around 74 times, it picks another value once, then again around 74 times. I want to soften this, so that the top value is picked maybe 3-4 times before the others get a turn. I am thinking of giving ranges to the probabilities, like a running total:
Row[i] = Row[i-1] + Row[i]
How can I do this? Do I need to create a function, or is there any other way to achieve this? I am a newbie. Any help will be appreciated. Thank you.
EDIT:
I have a solution to my problem, but I have one more question.
If I have a table as follows:
Column1  Column2
1        50
2        30
3        20
can I get this?
Column1  Column2  Column3
1        50       50
2        30       80
3        20       100
Each time I want to add the value to the existing running total. Is there any way?
UPDATE:
I finally got a solution after 3 hours: I just take repeated square roots of my probabilities, which narrows the differences between them. I add a column with
sqrt(sqrt(sqrt(Probability)))....:-)
I'd handle it with something like
ORDER BY rand() * power(<probability-field-name>, <n>)
For different values of n you distort the linear probabilities into a simple polynomial. Small values of n (e.g. 0.5) compress the probabilities towards 1 and thus make less probable choices more probable; big values of n (e.g. 2) do the opposite and further reduce the probability of already improbable values.
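Note that in SQL Server specifically, RAND() is evaluated only once per query, so a per-row source of randomness such as CHECKSUM(NEWID()) is needed instead. A hedged T-SQL adaptation of the same idea (the table name is hypothetical, the Probability column is from the question, and n = 0.5 is just an example):

SELECT TOP (1) *
FROM yourTable   -- hypothetical table name
ORDER BY (ABS(CHECKSUM(NEWID())) % 1000000) / 1000000.0   -- per-row random in [0, 1)
         * POWER(Probability, 0.5) DESC;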
Since the difference in probabilities is too great, you need to add a computed field with a revised weighting that has a more even probability distribution. How you do that depends on your data and your preferred distribution. One way is to "normalize" the weighting to an integer between 1 and 10, so that the lowest probability is never more than ten times smaller than the highest.
Answer to your recent question:
SELECT t.Column1,
       t.Column2,
       (SELECT SUM(Column2)
        FROM yourTable t2
        WHERE t2.Column1 <= t.Column1) AS Column3
FROM yourTable t
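On SQL Server 2012 or later, the same running total can be computed with a window function instead of a correlated subquery (a sketch using the same illustrative names):

SELECT Column1,
       Column2,
       SUM(Column2) OVER (ORDER BY Column1
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Column3
FROM yourTable;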
Here is a basic example of how to select one row from the table while taking the assigned row weights into account.
Suppose we have this table:
CREATE TABLE TableWithWeights(
    Id int NOT NULL PRIMARY KEY,
    DataColumn nvarchar(50) NOT NULL,
    Weight decimal(18, 6) NOT NULL -- Weight column
)
Let's fill the table with sample data.
INSERT INTO TableWithWeights VALUES(1, 'Frequent', 50)
INSERT INTO TableWithWeights VALUES(2, 'Common', 30)
INSERT INTO TableWithWeights VALUES(3, 'Rare', 20)
This is the query that returns one random row, taking the given row weights into account:
SELECT * FROM
    (SELECT tww1.*, -- Select original table data
            -- Add column with the sum of all weights of previous rows
            (SELECT SUM(tww2.Weight) - tww1.Weight
             FROM TableWithWeights tww2
             WHERE tww2.Id <= tww1.Id) AS SumOfWeightsOfPreviousRows
     FROM TableWithWeights tww1) AS tww,
    -- Add column with a random number within the range [0, SumOfWeights)
    (SELECT RAND() * SUM(Weight) AS rnd
     FROM TableWithWeights) AS r
WHERE (tww.SumOfWeightsOfPreviousRows <= r.rnd)
  AND (r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)
To check the query results, we can run it 100 times.
DECLARE @count AS int;
SET @count = 0;
WHILE (@count < 100)
BEGIN
    -- This is the query that returns one random row,
    -- taking the given row weights into account
    SELECT * FROM
        (SELECT tww1.*, -- Select original table data
                -- Add column with the sum of all weights of previous rows
                (SELECT SUM(tww2.Weight) - tww1.Weight
                 FROM TableWithWeights tww2
                 WHERE tww2.Id <= tww1.Id) AS SumOfWeightsOfPreviousRows
         FROM TableWithWeights tww1) AS tww,
        -- Add column with a random number within the range [0, SumOfWeights)
        (SELECT RAND() * SUM(Weight) AS rnd
         FROM TableWithWeights) AS r
    WHERE (tww.SumOfWeightsOfPreviousRows <= r.rnd)
      AND (r.rnd < tww.SumOfWeightsOfPreviousRows + tww.Weight)

    -- Increase counter
    SET @count += 1
END
P.S. The query was tested on SQL Server 2008 R2. And of course the query can be optimized (it's easy to do once you get the idea).

SQL Server SQL Select: How do I select rows where sum of a column is within a specified multiple?

I have a process that needs to select rows from a table (queued items); each row has a quantity column, and I need to select rows whose quantities add up to a specific multiple. The multiple is on the order of 4, 8, or 10, but could in theory be any multiple, odd or even.
Any suggestions on how to select rows where the sum of a field is a specified multiple?
My first thought would be to use some kind of MOD function, which in SQL Server is the % operator. The criteria would be something like this:
WHERE MyField % 4 = 0 OR MyField % 8 = 0
It might not be that fast, so another way might be to make a temp table containing, say, 100 values of the X times table (where X is the multiple you are looking for) and join on that.
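Neither snippet above looks at the sum of the quantities, only at individual values, so here is a hedged sketch of my reading of the requirement (table and column names are hypothetical): take queued rows in order and keep the longest prefix whose running quantity total lands exactly on a multiple of the batch size.

DECLARE @batch int = 4;   -- the desired multiple (illustrative)

WITH running AS (
    SELECT Id, Quantity,
           SUM(Quantity) OVER (ORDER BY Id
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total
    FROM QueueTable          -- hypothetical queue table
)
SELECT *
FROM running
WHERE Id <= (SELECT MAX(Id) FROM running WHERE total % @batch = 0);
-- returns no rows if no prefix sums to an exact multiple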