Split string values into equal, consistent partitions - apache-spark-sql

I need to split my data into 80 partitions regardless of what the key of the data is, and each time the same data should return the same partition value. Is there an algorithm that can be used to implement this?
The key is a combination of multiple fields.
I am planning to generate a surrogate key for the key combination and apply a range function using the min and max values to split the data into the desired number of partitions. But if the same key arrives tomorrow, I would have to look back to get its surrogate key so that the same keys fall on the same partition.
Is there an existing algorithm/formula/PySpark function where I pass a string value and it returns the same number each time, while making sure the string values are distributed equally?
df_1=spark.sql("select column_1,column_2,column_2,hash(column_1) % 20 as part from temptable")
df_1.createOrReplaceTempView("test")
spark.sql("select part,count(*) from test group by part").show(160,False)

If you can't use a numeric key and just take a modulus, then...
Hash the string value to a number and take it mod 80; that sorts the keys into 80 buckets (numbered 0 - 79), and the same string always maps to the same bucket. Note, though, that Python's built-in hash() is only stable within a single process for strings (it is randomized unless PYTHONHASHSEED is fixed), so for a value that must be reproducible across runs use a deterministic hash such as Spark's hash() function or a digest from hashlib.
e.g. something like this:
bucket = abs(hash(key_string) % 80)
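If the bucket number has to be reproducible across runs (and usable outside of Spark too), a digest-based hash is a safer choice than the built-in hash(). A minimal sketch, with the column and table names below taken as placeholders:
import hashlib

def bucket_of(key_string, n_buckets=80):
    # an MD5 digest of the UTF-8 bytes is deterministic across runs and machines,
    # unlike Python's built-in hash() for strings (which is seeded per process)
    return int(hashlib.md5(key_string.encode("utf-8")).hexdigest(), 16) % n_buckets

# inside Spark, the built-in hash() and pmod() SQL functions give the same kind
# of stable bucketing directly on the key columns
df_1 = spark.sql("select column_1, column_2, pmod(hash(column_1, column_2), 80) as part from temptable")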

Related

Set number of records in a file while unloading in Athena to S3

I am using the CTAS command and I was wondering if there is a way to set the number of records per file in S3. All I have found so far is how to set the file size, as described in this link:
https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
However, I will not know the size beforehand.
I will be using this command:
CREATE TABLE "historic_climate_gz_20_files"
WITH (
external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/',
format = 'TEXTFILE',
bucket_count=20,
bucketed_by = ARRAY['yearmonthday']
) as
select * from historic_climate_gz
I don't see any option to set the number of records.
How can I do this?
Thanks in advance.
Nobody knows beforehand exactly how many buckets should be created to get the desired size or number of rows per file.
That is why (as described in the link you provided) you measure the total data size and divide it by the required bucket size to get the number of buckets. You cannot specify the size, only the number of buckets, when creating the table. After you have calculated the number of buckets, create a new bucketed table and reload the data from the initial table.
So, if you want each file to contain a desired number of rows rather than to be a desired size, calculate the number of buckets using the same approach.
Calculate the total number of rows in the initial dataset:
select count(*) as cnt from historic_climate_gz
For example, suppose your table contains 1000000 (1M) rows and you want buckets of 10K rows each. Then bucket_count = 1000000 / 10000 = 100.
Create a new bucketed table with 100 buckets and reload the data; each file in it will contain approximately 10K rows (if the bucket key is evenly distributed and has enough cardinality).
CREATE TABLE "historic_climate_gz_20_files"
WITH (
external_location = 's3://awsexamplebucket/historic_climate_gz_20_files/',
format = 'TEXTFILE',
bucket_count=100, ---100 buckets
bucketed_by = ARRAY['yearmonthday']
) as
select * from historic_climate_gz
With a bucketed table you can only control the number of buckets and the bucket key; the size or number of rows per file is an approximation (an expectation), not an exact figure. For a bucket key that is not evenly distributed you will of course get buckets of different sizes, some bigger and some smaller; for an evenly distributed key you will get approximately the same number of rows per bucket.
Which row goes to which bucket is decided by this function:
`hash(bucket_key) MOD number_of_buckets` where hash returns an integer.
MOD produces bucket numbers in the range [0, number_of_buckets - 1]. The number of buckets and the bucket key are what you can specify before the load.
Rows with the same bucket-key value are written to the same bucket, so if the bucket-key distribution is skewed, the bucket sizes and row counts will be skewed accordingly.
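Conceptually the assignment works like the sketch below (crc32 is only a stand-in here for Athena's internal hash function):
import zlib

def bucket_for(bucket_key, number_of_buckets):
    # the same key value always maps to the same bucket number in [0, number_of_buckets - 1]
    return zlib.crc32(str(bucket_key).encode("utf-8")) % number_of_buckets

print(bucket_for("20200101", 100))  # repeated calls always print the same bucket number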

Increment Redis counter used as a value only when the key is unique

I have to count unique entries from a stream of transactions using Redis. There will be at least 1K jobs concurrently checking whether a transaction is unique and, if it is, storing the transaction type as a key whose value is an incremented counter. This counter is in turn shared by all threads.
If all threads do
Check if the key exists: exists(transactionType)
Increment the counter: val count = incr(counter)
Set the new value: setnx(transactionType, count)
this creates two problems:
The counter is incremented unnecessarily, since the count may already have been set by another thread.
Every transaction needs an exists, an increment and then an insert (3 operations).
Is there a better way of doing this increment-and-set when the value does not already exist?
private void checkAndIncrement(String transactionType, Jedis redisHandle) {
    if (transactionType != null) {
        // only consume a counter value if this transaction type has not been seen yet
        if (!redisHandle.exists(transactionType)) {
            long count = redisHandle.incr("t_counter");
            redisHandle.setnx(transactionType, "" + count);
        }
    }
}
EDIT:
Once a value is created, say T1 = 100, the transaction should also be identifiable by the number 100. I would have to store another map with the counter as the key and the transaction type as the value.
Two options:
Use a hash, HSETNX to add keys to the hash (just set the value to 1 or "" or anything), and HLEN to get the count of keys in the hash. You can always start over with HDEL. You could also use HINCRBY instead of HSETNX to additionally find out how many times each key appears.
Use a hyperloglog. Use PFADD to insert elements and PFCOUNT to retrieve the count. HyperLogLog is a probabilistic algorithm; the memory usage for a HLL doesn't go up with the number of unique items the way a hash does, but the count returned is only approximate (usually within about 1% of the true value).
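For illustration, a minimal redis-py sketch of both options (the key names are just placeholders, not anything Redis prescribes):
import redis

r = redis.Redis()

def record_exact(transaction_type):
    # Option 1: one hash field per transaction type; HSETNX is a no-op if the field already exists
    r.hsetnx("transaction_types", transaction_type, 1)
    return r.hlen("transaction_types")           # exact count of unique types

def record_approx(transaction_type):
    # Option 2: HyperLogLog; near-constant memory, approximate count
    r.pfadd("transaction_types_hll", transaction_type)
    return r.pfcount("transaction_types_hll")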

Subtract a column's group mean from the column value

I have a companies dataset with 35 columns. The companies can belong to one of 8 different groups. For each group, how do I create a new dataframe that subtracts the group's column mean from the original value?
Here is an example of part of the dataset.
So, for example, for row 1 I want to subtract the mean of BANK_AND_DEP for Consumer Markets from the value 7204.400207. I need to do this for each column.
I assume this is some kind of combination of a transform and a lambda, but I cannot get the syntax right.
Although it might seem counter-intuitive for this to involve a loop at all, looping over the columns lets you do the subtraction as a vectorized operation, which will be quicker than .apply(). To get the value to subtract, combine .groupby() and .transform() to compute the per-group mean of the column; then just subtract it.
for column in df.columns:
    if column == 'Cluster':
        continue  # don't subtract anything from the grouping column itself
    df['new_' + column] = df[column] - df.groupby('Cluster')[column].transform('mean')
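For example, on a small frame with made-up values (assuming the group column is called Cluster, as in the snippet above):
import pandas as pd

df = pd.DataFrame({
    "Cluster": ["Consumer Markets", "Consumer Markets", "Banking"],
    "BANK_AND_DEP": [7204.40, 1000.00, 50.00],
})

for column in df.columns:
    if column == "Cluster":
        continue
    df["new_" + column] = df[column] - df.groupby("Cluster")[column].transform("mean")

print(df)  # first row: new_BANK_AND_DEP = 7204.40 - mean(7204.40, 1000.00) = 3102.20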

Delete rows based on char in the index string

I have the following dataframe:
df = pd.DataFrame(np.random.randn(4, 1), index=['mark13', 'luisgimenez', 'miguel72', 'luis34'],columns=['probability'])
probability
mark13 -1.054687
luisgimenez 0.081224
miguel72 -0.893619
luis34 -1.576941
I would like to remove the rows where the last character of the index string is not a number.
The desired output would look something like this:
(dropping the row whose index does not end with a number)
probability
mark13 -1.054687
miguel72 -0.893619
luis34 -1.576941
I am sure the direction I need is boolean indexing, but I do not know how to reference the last character of the index name.
# use isdigit() on the last character of the index to build a mask array for filtering rows
df[[e[-1].isdigit() for e in df.index]]
Out[496]:
probability
mark13 -0.111338
miguel72 0.548725
luis34 0.682949
You can use the str accessor to check if the last character is a number:
df[df.index.str[-1].str.isdigit()]
Out:
probability
mark13 -0.350466
miguel72 1.220434
luis34 -0.962123

Find index of an ordered set of N elements

Problem description:
A set of lists of N integers i1, i2, ..., iN with 0 <= i1 <= i2 <= ... <= iN <= M is created by starting with one integer 0 <= i1 <= M and repeatedly adding an integer that is greater than or equal to the last one added.
When the last integer has been added to obtain the final set of lists, the index runs from 0 to Binomial(M+N, N) - 1.
For example, for M=3, i1=0,1,2,3
so the lists are
{0},{1},...,{3}.
Adding another integer i2>=i1 will result in
{0,0},{0,1},{0,2},{0,3},
{1,1},{1,2},{1,3},
{2,2},{2,3}
{3,3}
with indices
0,1,2,3,
4,5,6,
7,8,
9.
This index can be expressed in terms of i1, i2, ..., iN and M. If the >= conditions were not present, it would simply be i1*(M+1)^(N-1) + i2*(M+1)^(N-2) + ... + iN*(M+1)^(N-N). In the case above, however, there is a negative shift in the index due to the restrictions. For example, for N=2 the shift is -i1*(i1+1)/2 and the index is i = i1*(M+1)^1 + i2*(M+1)^0 - i1*(i1+1)/2.
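As a sanity check, the N=2 formula can be compared against a brute-force enumeration; a small Python sketch using M=3, as in the example above:
M = 3
expected_index = 0
all_match = True
for i1 in range(M + 1):
    for i2 in range(i1, M + 1):
        i = i1 * (M + 1) + i2 - i1 * (i1 + 1) // 2   # index formula for N=2
        all_match = all_match and (i == expected_index)
        expected_index += 1
print(all_match)  # True: the formula reproduces the indices 0, 1, ..., 9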
Question:
Does anyone, especially with a mathematics background, know how to write the index for the general N-element case, or just the final expression? Any help would be appreciated!
Thanks!