I have numeric values stored in a column of a Spark SQL database as strings. I store these numeric values as strings since they can potentially overflow all numeric datatypes (>128bit numbers).
So far, I could use the normal SUM() function to sum up values. I am curious if it's always safe to do this and how I can deal with cases where it doesn't work.
My thoughts:
I think internally numeric string values are casted to real numeric datatypes during the summation. In cases where this internal casting fails, the whole summation will fail.
Decimal type has maximum of 38 digits. It is not mentioned in Spark docs actually, but you get an error trying to create a type with bigger precision. Databricks docs state this explicitly.
Still, 38 digits allow you to store and safely add 126-bit (log2(10^38)) numbers.
val df = Seq(
"123456789012345678901234567890",
"1",
"100000100000100000100000100001"
).toDF("x")
df.withColumn("x", $"x".cast(new DecimalType(38))).agg(sum($"x")).show(false)
+------------------------------+
|sum(x) |
+------------------------------+
|223456889012445679001234667892|
+------------------------------+
If you need even more you can create a custom aggregate function operating on strings.
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
object BigSum extends Aggregator[String, String, String] {
def zero: String = "0"
def reduce(buffer: String, x: String): String = (BigInt(buffer) + BigInt(x)).toString
def merge(b1: String, b2: String): String = (BigInt(b1) + BigInt(b2)).toString
def finish(b: String): String = b
def bufferEncoder: Encoder[String] = Encoders.STRING
def outputEncoder: Encoder[String] = Encoders.STRING
}
val big_sum = udaf(BigSum)
df.agg(big_sum($"x")).show(false)
+------------------------------+
|bigsum$(x) |
+------------------------------+
|223456889012445679001234667892|
+------------------------------+
PS. My BigSum is not super efficient due to converting back and forth, it would be better to have buffer as BigInt - I just didn't know how to write Encoder[BigInt]...
Related
I have a org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first(("col1"),ignoreNulls = true).over(window)) that I want to convert, it into an sql spark column. Anyone could help on that ?
Thank you,
#Jaime Caffarel: this is exactly what i am trying to do, this will give you more visibility. You may also check the error msg in the 2d screenshot
From the documentation of the class org.apache.spark.sql.Column
A column that will be computed based on the data in a DataFrame. A new
column is constructed based on the input columns present in a
dataframe:
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated
with a DataFrame. col("columnName.field") // Extracting a
struct field col("a.column.with.dots") // Escape . in column
names. $"columnName" // Scala short hand for a named
column. expr("a + 1") // A column that is constructed
from a parsed SQL Expression. lit("abc") // A
column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that the product_id is a unique key per each row, I would do something like this:
val filled_column = df.select(df("product_id"), last(("last_prev_week_nopromo"), ignoreNulls = true) over window)
This way, you are also selecting the product_id that you will use as key. Then, you can do the following
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
.join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("driver_id"), "inner")
// orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?
I'm using this for adding commas into number.
val commaNumber = NumberFormat.getNumberInstance(Locale.US).format(floatValue)
floatValue is 8.1E-7 , but commaNumber shows just 0.
How to covert Float to String with comma without Rounding?
For the US locale, the maximumFractionalDigits of the number format is 3, therefore, it will try to format 8.1e-7 with only 3 fractional digits, which makes it 0.000. Since the minimumFractionalDigits is also 0, it tries to remove all the unnecessary 0s, making the final result "0".
You should set maximumFractionalDigits to at least 7 if you want to precisely display the number 8.1e-7.
val numberFormat = NumberFormat.getNumberInstance(Locale.US)
numberFormat.maximumFractionDigits = 7
val commaNumber = numberFormat.format(floatValue)
As for the commas, that is already part of the US locale. It uses grouping, and , is its grouping separator. The commas will be inserted if they are needed.
I need to read a csv file from S3 ,it has string,double data but i will read as string which will provide a dynamic frame of only string. I want to do below for each row
concatenate few columns and create new columns
Add new columns
Convert value in 3rd column from string to date
Convert values of column 4,5,6 individually from string to decimal
Storename,code,created_date,performancedata,accumulateddata,maxmontlydata
GHJ 0,GHJ0000001,2020-03-31,0015.5126-,0024.0446-,0017.1811-
MULT,C000000001,2020-03-31,0015.6743-,0024.4533-,0018.0719-
Below is the code that I have written so far
def ConvertToDec(myString):
pattern = re.compile("[0-9]{0,4}[\\.]?[0-9]{0,4}[-]?")
myString=myString.strip()
doubleVal="";
if myString and not pattern.match(myString):
doubleVal=-9999.9999;
else:
doubleVal=-Decimal(myString);
return doubleVal
def rowwise_function(row):
row_dict = row.asDict()
data='d';
if not row_dict['code']:
data=row_dict['code']
else:
data='CD'
if not row_dict['performancedata']:
data= data +row_dict['performancedata']
else:
data=data + 'HJ'
// new columns
row_dict['LC_CODE']=data
row_dict['CD_CD']=123
row_dict['GBL']=123.345
if rec["created_date"]:
rec["created_date"]= convStr =datetime.datetime.strptime(rec["created_date"], '%Y-%m-%d')
if rec["performancedata"]
rec["performancedata"] = ConvertToDec(rec["performancedata"])
newrow = Row(**row_dict)
return newrow
store_df = spark.read.option("header","true").csv("C:\\STOREDATA.TXT", sep="|")
ratings_rdd = store_df.rdd
ratings_rdd_new = ratings_rdd.map(lambda row: rowwise_function(row))
updatedDF=spark.createDataFrame(ratings_rdd_new)
Basically, I am creating almost new DataFrame. My questions are below -
is this right approach ?
Since i am my changing schema mostly is there any other approach
Use Spark dataframes/sql, why use rdd? You don't need to perform any low level data operations, all are column level so dataframes are easier/efficient to use.
To create new columns - .withColumn(<col_name>, <expression/value>) (refer)
All the if's can be made .filter (refer)
The whole ConvertToDec can be written better using strip and ast module or float.
I have a dataframe which has a column called hexa which has hex values like this. They are of dtype object.
hexa
0 00802259AA8D6204
1 00802259AA7F4504
2 00802259AA8D5A04
I would like to remove the first and last bits and reverse the values bitwise as follows:
hexa-rev
0 628DAA592280
1 457FAA592280
2 5A8DAA592280
Please help
I'll show you the complete solution up here and then explain its parts below:
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
reversed_bits = [list_of_bits[-i] for i in range(1,len(list_of_bits)+1)]
return ''.join(reversed_bits)
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
There are possibly a couple ways of doing it, but this way should solve your problem. The general strategy will be defining a function and then using the apply() method to apply it to all values in the column. It should look something like this:
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
Now we need to define the function we're going to apply to it. Breaking it down into its parts, we strip the first and last bit by indexing. Because of how negative indexes work, this will eliminate the first and last bit, regardless of the size. Your result is a list of characters that we will join together after processing.
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
The second line iterates through the list of characters, matches the first and second character of each bit together, and then concatenates them into a single string representing the bit.
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
The second to last line returns the list you just made in reverse order. Lastly, the function returns a single string of bits.
def reverse_bits(bits):
trimmed_bits = bits[2:-2]
list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
reversed_bits = [list_of_bits[-i] for i in range(1,len(list_of_bits)+1)]
return ''.join(reversed_bits)
I explained it in reverse order, but you want to define this function that you want applied to your column, and then use the apply() function to make it happen.
if we write 12wkd3, how to choose/filter 123 as integer in octave?
example in octave:
A = input("A?\n")
A?
12wkd3
A = 123
while 12wkd3 is user keyboard input and A = 123 is the expected answer.
assuming that the general form you're looking for is taking an arbitrary string from the user input, removing anything non-numeric, and storing the result it as an integer:
A = input("A? /n",'s');
A = int32(str2num(A(isdigit(A))));
example output:
A?
324bhtk.p89u34
A = 3248934
to clarify what's written above:
in the input statement, the 's' argument causes the answer to get stored as a string, otherwise it's evaluated by Octave first. most inputs would produce errors, others may be interpreted as functions or variables.
isdigit(A) produces a logical array of values for A with a 1 for any character that is a 0-9 number, and 0 otherwise.
isdigit('a1 3 b.') = [0 1 0 1 0 0 0]
A(isdigit(A)) will produce a substring from A using only those values corresponding to a 1 in the logical array above.
A(isdigit(A)) = 13
that still returns a string, so you need to convert it into a number using str2num(). that, however, outputs a double precision number. so finally to get it to be an integer you can use int32()