How to load big double numbers in a PySpark DataFrame and persist them back without switching to scientific notation or losing precision?

I have a CSV like this:
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
I want to load it with the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV with the structure below:
+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+
The problem is that whenever I load it, the numbers are shown in scientific notation, and I cannot persist them back without specifying the precision and scale of my data. I want to keep whatever precision is already in the file; I can't know it in advance.
Here's what I have tried:
Loading it with DoubleType() gives me scientific notation:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DoubleType())
])
csv_file = "Downloads/test.csv"
df2 = (spark.read.format("csv")
       .option("sep", ",")
       .option("header", "true")
       .schema(schema)
       .load(csv_file))
df2.show()
+-----+--------------------+
|  COL|                 VAL|
+-----+--------------------+
| TEST|1.0000000012345679E8|
|TEST2|    2.000000001234E8|
|TEST3|     9999.1234679123|
+-----+--------------------+
Loading it with DecimalType(), I have to specify precision and scale; otherwise I lose the digits after the decimal point. But specifying them, besides the risk of not getting the correct value (my data might be rounded), pads the values with trailing zeros.
For example, using StructField('VAL', DecimalType(38, 18)) I get:
[Row(COL='TEST', VAL=Decimal('100000000.123456790000000000')),
Row(COL='TEST2', VAL=Decimal('200000000.123400000000000000')),
Row(COL='TEST3', VAL=Decimal('9999.123467912300000000'))]
Note that in this case I get trailing zeros that I don't want in my new file.
The only way I found to address it was a UDF that first uses float() to get rid of the scientific notation and then converts the result to a string, to make sure it is persisted the way I want:
to_decimal = udf(lambda n: str(float(n)))
df2 = df2.select("*", to_decimal("VAL").alias("VAL2"))
df2 = df2.select(["COL", "VAL2"]).withColumnRenamed("VAL2", "VAL")
df2.show()
display(df2.schema)
+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+
StructType(List(StructField(COL,StringType,true),StructField(VAL,StringType,true)))
Is there any way to achieve the same without the UDF trick?
Thank you!

The best way I found to address it is shown below. It still uses a UDF, but now without the string workaround to avoid scientific notation. I won't mark it as the accepted answer yet, because I still hope someone will come up with a solution without a UDF (or a good explanation of why it is not possible without one).
The CSV:
$ cat /Users/bambrozi/Downloads/testf.csv
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
TEST4,123456789.01234567
Load the CSV using DecimalType(38, 18), the precision and scale Spark uses as its system default for decimals (note that PySpark's DecimalType() constructor itself defaults to (10, 0)):
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])
csv_file = "Downloads/testf.csv"
df2 = (spark.read.format("csv")
       .option("sep", ",")
       .option("header", "true")
       .schema(schema)
       .load(csv_file))
df2.show(truncate=False)
output:
+-----+----------------------------+
|COL  |VAL                         |
+-----+----------------------------+
|TEST |100000000.123456790000000000|
|TEST2|200000000.123400000000000000|
|TEST3|9999.123467912300000000     |
|TEST4|123456789.012345670000000000|
+-----+----------------------------+
When you are ready to output it (print it or save it to a new file), strip the trailing zeros:
import decimal
import pyspark.sql.functions as F
normalize_decimals = F.udf(lambda dec: dec.normalize())
(df2
 .withColumn('VAL', normalize_decimals(F.col('VAL')))
 .show(truncate=False))
output:
+-----+------------------+
|COL  |VAL               |
+-----+------------------+
|TEST |100000000.12345679|
|TEST2|200000000.1234    |
|TEST3|9999.1234679123   |
|TEST4|123456789.01234567|
+-----+------------------+
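One caveat with normalize(): for values with no fractional part it can switch to exponent form, which brings scientific notation back. A minimal plain-Python sketch of a safer formatter follows; strip_trailing_zeros is a hypothetical helper name, not part of PySpark, and the None check is there because Spark passes SQL NULL to a Python UDF as None.

```python
from decimal import Decimal
from typing import Optional


def strip_trailing_zeros(dec: Optional[Decimal]) -> Optional[str]:
    """Return dec as a plain (non-scientific) string without trailing zeros."""
    if dec is None:  # SQL NULL arrives in a Python UDF as None
        return None
    # normalize() drops trailing zeros but may switch to exponent form
    # (Decimal('100.000') becomes Decimal('1E+2')); the 'f' presentation
    # type renders the value in fixed-point notation again.
    return format(dec.normalize(), "f")


print(strip_trailing_zeros(Decimal("100000000.123456790000000000")))  # 100000000.12345679
print(strip_trailing_zeros(Decimal("100.000")))  # 100
```

It could presumably be registered in place of the lambda above, e.g. F.udf(strip_trailing_zeros, StringType()), so that the column is explicitly a string.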

You can do this in Spark with a SQL query:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
val sparkConf: SparkConf = new SparkConf(true)
.setAppName(this.getClass.getName)
.setMaster("local[*]")
implicit val spark: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
val df = spark.read.option("header", "true").format("csv").load(csv_file)
df.createOrReplaceTempView("table")
val query = "Select cast(VAL as BigDecimal) as VAL, COL from table"
val result = spark.sql(query)
result.show()
result.coalesce(1).write.option("header", "true").mode("overwrite").csv(outputPath + table)
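On the question of avoiding a UDF altogether: one possible direction, assuming the decimal column is first cast to a string, is to trim the trailing zeros with a regular expression. The sketch below only demonstrates the two patterns in plain Python; strip_zeros is a hypothetical helper. The same patterns could in principle be passed to pyspark.sql.functions.regexp_replace (with the Java-style group reference $1 instead of \1), but how a given Spark version renders a decimal cast to string should be checked first.

```python
import re


def strip_zeros(text: str) -> str:
    """Remove trailing zeros after the decimal point, and a dangling point."""
    # Keep the shortest fractional part that is followed only by zeros.
    text = re.sub(r"(\.\d*?)0+$", r"\1", text)
    # If nothing is left after the point, drop the point as well.
    return re.sub(r"\.$", "", text)


print(strip_zeros("100000000.123456790000000000"))  # 100000000.12345679
print(strip_zeros("200.000"))                       # 200
print(strip_zeros("100"))                           # 100 (no decimal point, untouched)
```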

Related

optimize pyspark code to find a keyword and its count in a dataframe

We have a lot of files in our S3 bucket. My current PySpark code reads each file, goes through its columns looking for each keyword, and returns a DataFrame with the number of matches per keyword, column and file.
Here is the PySpark code (we are using Databricks, if that helps):
import s3fs
fs = s3fs.S3FileSystem()
from pyspark.sql.functions import lower, col
keywords = ['%keyword1%','%keyword2%']
prefix = ''
deployment_id = ''
pull_id = ''
paths = fs.ls(prefix+'/'+deployment_id+'/'+pull_id)
result = []
errors = []
try:
    for path in paths:
        df = spark.read.parquet('s3://'+path)
        print(path)
        for keyword in keywords:
            for col in df.columns:
                filtered_df = df.filter(lower(df[col]).like(keyword))
                filtered_count = filtered_df.count()
                if filtered_count > 0 :
                    #print(col +' has '+ str(filtered_count) +' appearences')
                    result.append({'keyword': keyword, 'column': col, 'count': filtered_count,'table':path.split('/')[-1]})
except Exception as e:
    errors.append({'error_msg':e})
try:
    errors = spark.createDataFrame(errors)
except Exception as e:
    print('no errors')
try:
    result = spark.createDataFrame(result)
    result.display()
except Exception as e:
    print('problem with results. May be no results')
I am new to PySpark, Databricks and Spark. The code above runs very slowly; I know that because we have local Python code that is faster. We wanted to use PySpark on Databricks because we thought it would be faster, and because with the local code we have to enter AWS access keys every day and sometimes get a memory error when a file is huge.
NOTE - The code above reads the data faster, but the search itself seems slower than in the local Python code.
Here is the Python code on our local system:
def search_df(self,keyword,df,regex=False):
    start=time.time()
    if regex:
        mask = df.applymap(lambda x: re.search(keyword,x) is not None if isinstance(x,str) else False).to_numpy()
    else:
        mask = df.applymap(lambda x: keyword.lower() in x.lower() if isinstance(x,str) else False).to_numpy()
I was hoping for some code changes to the PySpark version that would make it faster.
Thanks.
I tried changing .like(keyword) to .contains(keyword) to see if that is faster, but it doesn't seem to work.
Check out the code below. It defines a function that uses a list comprehension to count, for every column of the df, the rows matching a keyword. The function is then called once per keyword; each call returns a new df, and these are then unioned using reduce.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from functools import reduce
sampleData = [["Hello s1","Y","Hi s1"],["What is your name?","What is s1","what is s2?"] ]
df = spark.createDataFrame(sampleData,["col1","col2","col3"])
df.show()
# Sample input dataframe
+------------------+----------+-----------+
| col1| col2| col3|
+------------------+----------+-----------+
| Hello s1| Y| Hi s1|
|What is your name?|What is s1|what is s2?|
+------------------+----------+-----------+
keywords=["s1","s2"]
def calc(k) -> DataFrame:
    return df.select([F.count(F.when(F.col(c).rlike(k),c)).alias(c) for c in df.columns] ).withColumn("keyword",F.lit(k))
lst=[calc(k) for k in keywords]
fDf=reduce(DataFrame.unionByName, [y for y in lst])
stExpr="stack(3,'col1',col1,'col2',col2,'col3',col3) as (ColName,Count)"
fDf.select("keyword",F.expr(stExpr)).show()
# Output
+-------+-------+-----+
|keyword|ColName|Count|
+-------+-------+-----+
| s1| col1| 1|
| s1| col2| 1|
| s1| col3| 1|
| s2| col1| 0|
| s2| col2| 0|
| s2| col3| 1|
+-------+-------+-----+
You can add a where clause at the end to keep only the rows with a count greater than 0:
.where("Count > 0")
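The stack expression above is hard-coded for three columns. A small sketch of how it might be generated from df.columns instead is shown here; build_stack_expr is a hypothetical helper, and it assumes column names contain no quotes or backticks.

```python
from typing import Sequence


def build_stack_expr(columns: Sequence[str]) -> str:
    """Build a Spark SQL stack() expression that unpivots `columns`
    into (ColName, Count) rows."""
    # Each column contributes a pair: its name as a string literal,
    # followed by a backtick-quoted reference to the column itself.
    pairs = ", ".join(f"'{c}', `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as (ColName, Count)"


print(build_stack_expr(["col1", "col2", "col3"]))
# stack(3, 'col1', `col1`, 'col2', `col2`, 'col3', `col3`) as (ColName, Count)
```

With this, stExpr could be built as build_stack_expr([c for c in fDf.columns if c != "keyword"]) rather than typed by hand.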

PySpark: Transform values of given column in the DataFrame

I am new to PySpark and Spark in general.
I would like to apply a transformation to a given column in the DataFrame, essentially calling a function for each value in that specific column.
I have my DataFrame df that looks like this:
df.show()
+------------+--------------------+
|version | body |
+------------+--------------------+
| 1|9gIAAAASAQAEAAAAA...|
| 2|2gIAAAASAQAEAAAAA...|
| 3|3gIAAAASAQAEAAAAA...|
| 1|7gIAKAASAQAEAAAAA...|
+------------+--------------------+
I need to read the value of the body column for each row where the version is 1 and then decrypt it (I have my own logic/function which takes a string and returns a decrypted string). Finally, I want to write the decrypted values in CSV format to an S3 bucket.
def decrypt(encrypted_string: str) -> str:
    # code that returns decrypted string
    ...
So, When I do following, I get the corresponding filtered values to which I need to apply my decrypt function.
df.where(col('version') =='1')\
.select(col('body')).show()
+--------------------+
| body|
+--------------------+
|9gIAAAASAQAEAAAAA...|
|7gIAKAASAQAEAAAAA...|
+--------------------+
However, I am not clear how to do that. I tried to use collect() but then it defeats the purpose of using Spark.
I also tried using .rdd.map as follows but that did not work.
df.where(col('version') =='1')\
.select(col('body'))\
.rdd.map(lambda x: decrypt).toDF().show()
OR
.rdd.map(decrypt).toDF().show()
Could someone please help with this.
Please try:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
decrypt_udf = udf(decrypt, StringType())
df.where(col('version') == '1').withColumn('body', decrypt_udf('body'))
Got some clue from this post: Pyspark DataFrame UDF on Text Column.
Looks like I can simply get it with the following. I was doing it without udf earlier, so it wasn't working.
dummy_function_udf = udf(decrypt, StringType())
df.where(col('version') == '1')\
.select(col('body')) \
.withColumn('decryptedBody', dummy_function_udf('body')) \
.show()
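One thing to watch with this approach: Spark hands SQL NULLs to a Python UDF as None, so if body can be null the wrapped function has to cope with that. A minimal sketch of a guard follows; null_safe is a hypothetical helper, and the decrypt shown here is only a stand-in (plain base64 decoding) for the real decryption logic.

```python
import base64
from typing import Callable, Optional


def null_safe(fn: Callable[[str], str]) -> Callable[[Optional[str]], Optional[str]]:
    """Wrap a string function so that None passes through untouched."""
    def wrapper(value: Optional[str]) -> Optional[str]:
        return None if value is None else fn(value)
    return wrapper


# Hypothetical stand-in for the real decrypt(): plain base64 decoding.
def decrypt(encrypted_string: str) -> str:
    return base64.b64decode(encrypted_string).decode("utf-8")


safe_decrypt = null_safe(decrypt)
print(safe_decrypt("aGVsbG8="))  # hello
print(safe_decrypt(None))        # None
```

The wrapped function would then be what gets passed to udf, e.g. udf(safe_decrypt, StringType()).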

Strange conversion of pandas dataframe to spark dataframe with defined schema

I'm facing the following problem and couldn't find an answer yet: when converting a pandas dataframe with integers to a pyspark dataframe with a schema that expects the data to be strings, the values change to "strange" strings, as in the example below. I've saved a lot of important data like that, and I wonder why that happened and whether it is possible to "decode" these symbols back to integers. Thanks in advance!
import pandas as pd
from pyspark.sql.types import StructType, StructField,StringType
df = pd.DataFrame(data = {"a": [111,222, 333]})
schema = StructType([
StructField("a", StringType(), True)
])
sparkdf = spark.createDataFrame(df, schema)
sparkdf.show()
Output:
+---+
| a|
+---+
| o|
| Þ|
| ō|
+---+
I cannot reproduce the problem on any recent version but the most likely reason is that you incorrectly defined the schema (in combination with enabled Arrow support).
Either cast the input:
df["a"] = df.a.astype("str")
or define the correct schema:
from pyspark.sql.types import LongType
schema = StructType([
StructField("a", LongType(), True)
])
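As for decoding the data already saved: in the sample output each character's Unicode code point equals the original integer (chr(111) is 'o', chr(222) is 'Þ', chr(333) is 'ō'). Assuming every saved value was mangled the same way, into a single character, ord() would recover it. This is an inference from the three sample rows only, and decode_value is a hypothetical helper; values that ended up as empty or multi-character strings would need separate investigation.

```python
def decode_value(mangled: str) -> int:
    """Recover the integer from a one-character string via its code point."""
    if len(mangled) != 1:
        raise ValueError(f"expected a single character, got {mangled!r}")
    return ord(mangled)


print([decode_value(s) for s in ["o", "Þ", "ō"]])  # [111, 222, 333]
```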

Spark DataFrame equivalent to Pandas Dataframe `.iloc()` method?

Is there a way to reference Spark DataFrame columns by position using an integer?
Analogous Pandas DataFrame operation:
df.iloc[:, 0] # Give me all the rows at column position 0
The closest equivalent of pandas df.iloc is collect (note that it brings all rows to the driver)
PySpark examples:
X = df.collect()[0]['age']
or
X = df.collect()[0][1] #row 0 col 1
Not really, but you can try something like this:
Python:
df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1]) # I assume [:1] is what you really want
## DataFrame[_1: bigint]
or
df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
Scala
val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)
Note:
Spark SQL doesn't support row indexing, and it is unlikely that it ever will, so it is not possible to index along the row dimension.
You can use like this in spark-shell.
scala>: df.columns
Array[String] = Array(age, name)
scala>: df.select(df.columns(0)).show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+
As of Spark 3.1.1 on Databricks, it's a matter of selecting the column of interest, and applying limit:
%python
retDF = (inputDF
.select(col(inputDF
.columns[0]))
.limit(100)
)

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating:
(df.groupBy("group")
.agg({"money":"sum"})
.show(100)
)
This will give me:
group SUM(money#2L)
A 137461285853
B 172185566943
C 271179590646
The aggregation works just fine but I dislike the new column name SUM(money#2L). Is there a way to rename this column into something human readable from the .agg method? Maybe something more similar to what one would do in dplyr:
df %>% group_by(group) %>% summarise(sum_money = sum(money))
Although I still prefer dplyr syntax, this code snippet will do:
import pyspark.sql.functions as sf
(df.groupBy("group")
.agg(sf.sum('money').alias('money'))
.show(100))
It gets verbose.
withColumnRenamed should do the trick. Here is the link to the pyspark.sql API.
df.groupBy("group")\
  .agg({"money":"sum"})\
  .withColumnRenamed("SUM(money)", "money")\
  .show(100)
I made a little helper function for this that might help some people out.
import re
from functools import partial

def rename_cols(agg_df, ignore_first_n=1):
    """changes the default spark aggregate names `avg(colname)`
    to something a bit more useful. Pass an aggregated dataframe
    and the number of leading (grouping) columns to leave untouched.
    """
    delimiters = "(", ")"
    split_pattern = '|'.join(map(re.escape, delimiters))
    splitter = partial(re.split, split_pattern)
    # 'avg(rate)' -> ['avg', 'rate', ''] -> 'avg_rate_' -> 'avg_rate'
    split_agg = lambda x: '_'.join(splitter(x))[:-1]
    renamed = map(split_agg, agg_df.columns[ignore_first_n:])
    renamed = zip(agg_df.columns[ignore_first_n:], renamed)
    for old, new in renamed:
        agg_df = agg_df.withColumnRenamed(old, new)
    return agg_df
An example:
gb = (df.selectExpr("id", "rank", "rate", "price", "clicks")
.groupby("id")
.agg({"rank": "mean",
"*": "count",
"rate": "mean",
"price": "mean",
"clicks": "mean",
})
)
>>> gb.columns
['id',
'avg(rate)',
'count(1)',
'avg(price)',
'avg(rank)',
'avg(clicks)']
>>> rename_cols(gb).columns
['id',
'avg_rate',
'count_1',
'avg_price',
'avg_rank',
'avg_clicks']
Doing at least a bit to save people from typing so much.
It's as simple as:
val maxVideoLenPerItemDf = requiredItemsFiltered.groupBy("itemId").agg(max("playBackDuration").as("customVideoLength"))
maxVideoLenPerItemDf.show()
Use .as in agg to name the new column created.
.alias and .withColumnRenamed both work if you're willing to hard-code your column names. If you need a programmatic solution, e.g. friendlier names for an aggregation of all remaining columns, this provides a good starting point:
grouping_column = 'group'
cols = [F.sum(F.col(x)).alias(x) for x in df.columns if x != grouping_column]
(
df
.groupBy(grouping_column)
.agg(
*cols
)
)
df = df.groupby('Device_ID').agg(aggregate_methods)
for column in df.columns:
    start_index = column.find('(')
    end_index = column.find(')')
    # str.find returns -1 (which is truthy) when the character is missing
    if start_index != -1 and end_index != -1:
        df = df.withColumnRenamed(column, column[start_index+1:end_index])
The above code can strip out anything that is outside of the "()". For example, "sum(foo)" will be renamed as "foo".
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, "siva", 100), (2, "siva2", 200),(3, "siva3", 300),(4, "siva4", 400),(5, "siva5", 500)]
schema = ['id', 'name', 'sallary']
df = spark.createDataFrame(data, schema=schema)
df.show()
+---+-----+-------+
| id| name|sallary|
+---+-----+-------+
| 1| siva| 100|
| 2|siva2| 200|
| 3|siva3| 300|
| 4|siva4| 400|
| 5|siva5| 500|
+---+-----+-------+
df.agg({"sallary": "max"}).withColumnRenamed('max(sallary)', 'max').show()
+---+
|max|
+---+
|500|
+---+
While the previously given answers are good, I think they're lacking a neat way to deal with dictionary-usage in the .agg()
If you want to use a dict, which actually might be also dynamically generated because you have hundreds of columns, you can use the following without dealing with dozens of code-lines:
# Your dictionary-version of using the .agg()-function
# Note: The provided logic could actually also be applied to a non-dictionary approach
df = df.groupBy("group")\
.agg({
"money":"sum"
, "...": "..."
})
# Now do the renaming
newColumnNames = ["group", "money", "..."] # Provide the names for ALL columns of the new df
df = df.toDF(*newColumnNames) # Do the renaming
Of course the newColumnNames-list can also be dynamically generated. E.g., if you only append columns from the aggregation to your df you can pre-store newColumnNames = df.columns and then just append the additional names.
Anyhow, be aware that newColumnNames must contain all column names of the dataframe, not only those to be renamed (because .toDF() creates a new dataframe due to Spark's immutable RDDs)!
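As an illustration of generating the list dynamically, here is a small plain-Python sketch that derives friendlier names from the default aggregate names. friendly_names is a hypothetical helper, and the assumption is that aggregate columns follow the func(col) naming shown earlier, such as sum(money) or count(1).

```python
import re
from typing import List

# Matches names of the shape func(col), e.g. 'sum(money)' or 'count(1)'.
_AGG_NAME = re.compile(r"^(\w+)\((.*)\)$")


def friendly_names(columns: List[str]) -> List[str]:
    """Map default aggregate names such as 'sum(money)' to 'sum_money';
    names without the func(col) shape are returned unchanged."""
    return [_AGG_NAME.sub(r"\1_\2", name) for name in columns]


print(friendly_names(["group", "sum(money)", "count(1)"]))
# ['group', 'sum_money', 'count_1']
```

The result covers every column, so it could be passed straight on, e.g. df = df.toDF(*friendly_names(df.columns)).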
Another quick little one-liner to add to the mix:
agg_df = (df.groupBy('group')
    .agg({'money':'sum',
          'moreMoney':'sum',
          'evenMoreMoney':'sum'
          }))
# iterate over the aggregated frame's columns, not the original df's
agg_df.select(*(col(i).alias(i.replace("(",'_').replace(')','')) for i in agg_df.columns))
Just change the alias function to whatever you'd like to name them. The above generates sum_money, sum_moreMoney, since I do like seeing the operator in the variable name.