PySpark: Transform values of a given column in the DataFrame

I am new to PySpark and Spark in general.
I would like to apply a transformation to a given column in the DataFrame, essentially calling a function for each value in that specific column.
I have my DataFrame df that looks like this:
df.show()
+-------+--------------------+
|version|                body|
+-------+--------------------+
|      1|9gIAAAASAQAEAAAAA...|
|      2|2gIAAAASAQAEAAAAA...|
|      3|3gIAAAASAQAEAAAAA...|
|      1|7gIAKAASAQAEAAAAA...|
+-------+--------------------+
I need to read the value of the body column for each row where the version is 1 and then decrypt it (I have my own logic/function that takes a string and returns a decrypted string). Finally, I need to write the decrypted values in CSV format to an S3 bucket.
def decrypt(encrypted_string: str):
    # code that returns the decrypted string
So, when I do the following, I get the corresponding filtered values to which I need to apply my decrypt function:
df.where(col('version') =='1')\
.select(col('body')).show()
+--------------------+
| body|
+--------------------+
|9gIAAAASAQAEAAAAA...|
|7gIAKAASAQAEAAAAA...|
+--------------------+
However, I am not clear on how to do that. I tried using collect(), but that defeats the purpose of using Spark.
I also tried using .rdd.map as follows, but that did not work:
df.where(col('version') =='1')\
.select(col('body'))\
.rdd.map(lambda x: decrypt).toDF().show()
OR
.rdd.map(decrypt).toDF().show()
Could someone please help with this?

Please try:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

decrypt_udf = udf(decrypt, StringType())
df.where(col('version') == '1').withColumn('body', decrypt_udf('body'))
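To round out the original goal of writing the decrypted values to S3 as CSV, here is a minimal sketch, assuming the cluster already has S3 credentials configured; the bucket path is a placeholder:
decrypted = df.where(col('version') == '1') \
    .withColumn('body', decrypt_udf('body')) \
    .select('body')

# 's3a://my-bucket/decrypted/' is a hypothetical path; point it at your bucket
decrypted.write.mode('overwrite').csv('s3a://my-bucket/decrypted/')
Note that decrypt runs on the executors, so it (and anything it depends on) must be importable/serializable there.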

Got a clue from this post: Pyspark DataFrame UDF on Text Column.
Looks like I can simply get it with the following. I was doing it without a udf earlier, which is why it wasn't working.
dummy_function_udf = udf(decrypt, StringType())
df.where(col('version') == '1')\
.select(col('body')) \
.withColumn('decryptedBody', dummy_function_udf('body')) \
.show()

Related

How do I replace column after encrypting it by using Spark (PySpark)?

I have a question about replacing personal information with encrypted data using Spark.
Let's say for example, if I have a table like:
std_name  phone_number
John      585-1243-2156
Susan     585-4567-2156
I want to change phone_number to encrypted form like:
std_name  phone_number
John      avawehna'vqqa
Susan     vabdsvwegq'qb
I have tried using withColumn with udf, but it does not work well.
Can someone help me out?
You haven't provided your encryption function, but I will assume the problem was something simple. If you create a UDF, it is run separately for every row, so you can use plain Python inside it.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('John', '585-1243-2156'),
     ('Susan', '585-4567-2156')],
    ['std_name', 'phone_number']
)

@F.udf
def encrypting(data):
    # encryption logic:
    encrypted_data = 'xyz' + data[::-1].replace('-', 'w')
    return encrypted_data

df = df.withColumn('phone_number', encrypting('phone_number'))
df.show()
# +--------+----------------+
# |std_name| phone_number|
# +--------+----------------+
# | John|xyz6512w3421w585|
# | Susan|xyz6512w7654w585|
# +--------+----------------+
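If you need real encryption rather than the toy reversal above, one option is to wrap a proper library in the same kind of UDF. A sketch, assuming the cryptography package is installed on the driver and executors; the key handling here is purely illustrative (in practice, load the key from a secret store):
from cryptography.fernet import Fernet
from pyspark.sql import functions as F

key = Fernet.generate_key()  # illustrative only; do not generate keys like this in production

@F.udf
def encrypt_value(data):
    # Fernet.encrypt returns bytes, so decode to keep the column a string
    return Fernet(key).encrypt(data.encode('utf-8')).decode('utf-8')

df = df.withColumn('phone_number', encrypt_value('phone_number'))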

How to convert pyspark dataframe to JSON?

I have a pyspark dataframe and I want to convert it into a list containing JSON objects.
For that I have done the following:
df.toJSON().collect()
But this operation sends the data to the driver, which is costly and takes too much time, and my dataframe contains millions of records. So is there another way to do it, without the collect() operation, that is more optimized than collect()?
Below is my dataframe df:
product  cost
pen        10
book       40
bottle     80
glass      55
and the output should look like below:
df2 = [{product:'pen',cost:10},{product:'book',cost:40},{product:'bottle',cost:80},{product:'glass',cost:55}]
When I print the datatype of df2 it should be list.
If you want to create a JSON object inside the dataframe, use the collect_list + create_map + to_json functions.
(or)
To write it out as a JSON document to a file, don't use to_json; use .write.json() instead.
Create JSON object:
from pyspark.sql.functions import collect_list, create_map, lit

df.agg(collect_list(create_map(lit("product"),"product",lit("cost"),"cost")).alias("stru")).\
selectExpr("to_json(stru) as json").\
show(10,False)
#+-------------------------------------------------------------------------------------------------------------------------------+
#|json |
#+-------------------------------------------------------------------------------------------------------------------------------+
#|[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]|
#+-------------------------------------------------------------------------------------------------------------------------------+
#write to hdfs use .saveAsTextFile
df.agg(collect_list(create_map(lit("product"),"product",lit("cost"),"cost")).alias("stru")).selectExpr("to_json(stru) as json").rdd.map(lambda x:x['json']).saveAsTextFile("<path>")
#cat part-00000
#[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]
Create JSON file:
df.agg(collect_list(create_map(lit("product"),"product",lit("cost"),"cost")).alias("stru")).write.mode("overwrite").json("<path>")
#cat part-00000-3a19165e-219e-4485-adb8-ef91589d6e31-c000.json
#{"stru":[{"product":"pen","cost":"10"},{"product":"book","cost":"40"},{"product":"bottle","cost":"80"},{"product":"glass","cost":"55"}]}

How to add delimiters to a csv file

I have a csv file with no delimiters. Is it possible to add delimiters at certain positions in PySpark? For example,
my file looks like:
USDINRFUTCUR23Feb201700000000FF00000000000001990067895000000000NNN*12
USDINRFUTCUR24Feb201700000000FF00000000000001990067895000000000NNN*12
USDINRFUTCUR25Feb201700000000FF00000000000001990067895000000000NNN*12
and I want delimiters at the 3rd, 6th and 12th positions.
For fixed-width files there is pandas.read_fwf():
import pandas as pd

widths = [
    3,
    6,
    12,
]
df = pd.read_fwf("fixed_width.txt", widths=widths)
df
For a distributed PySpark solution, there is no similar way to add delimiters right as you read (as there is in pandas). A scalable way to solve this is to read the data as-is into one column, then use the code below (using pyspark functions) to create your columns.
Creating a sample dataframe:
from pyspark.sql import functions as F

data = [['USDINRFUTCUR23Feb201700000000FF00000000000001990067895000000000NNN*12'],
        ['USDINRFUTCUR24Feb201700000000FF00000000000001990067895000000000NNN*12'],
        ['USDINRFUTCUR25Feb201700000000FF00000000000001990067895000000000NNN*12']]
df = spark.createDataFrame(data, ['col1'])
df.show(truncate=False)
+---------------------------------------------------------------------+
|col1 |
+---------------------------------------------------------------------+
|USDINRFUTCUR23Feb201700000000FF00000000000001990067895000000000NNN*12|
|USDINRFUTCUR24Feb201700000000FF00000000000001990067895000000000NNN*12|
|USDINRFUTCUR25Feb201700000000FF00000000000001990067895000000000NNN*12|
+---------------------------------------------------------------------+
Use substr and withColumn to create the new columns, then drop the original one. You could also wrap this logic in a def (function) so you can reuse it and simplify your pipeline; a sketch of that idea follows after the output below.
df.withColumn("Currency1", F.col("col1").substr(0,3))\
.withColumn("Currency2", F.col("col1").substr(4,3))\
.withColumn("Type", F.col("col1").substr(7,6))\
.withColumn("Time", F.expr("""substr(col1,13,length(col1))"""))\
.drop("col1").show(truncate=False)
#output
+---------+---------+------+---------------------------------------------------------+
|Currency1|Currency2|Type |Time |
+---------+---------+------+---------------------------------------------------------+
|USD |INR |FUTCUR|23Feb201700000000FF00000000000001990067895000000000NNN*12|
|USD |INR |FUTCUR|24Feb201700000000FF00000000000001990067895000000000NNN*12|
|USD |INR |FUTCUR|25Feb201700000000FF00000000000001990067895000000000NNN*12|
+---------+---------+------+---------------------------------------------------------+
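As a rough sketch of that reusable-function idea (the name split_fixed_width and the (column, width) spec format are my own, not from the answer above):
def split_fixed_width(df, col_name, spec):
    # spec is a list of (new_column_name, width) pairs; whatever remains after
    # the last width is kept in a trailing column called 'rest'
    pos = 1  # Column.substr is 1-based
    for name, width in spec:
        df = df.withColumn(name, F.col(col_name).substr(pos, width))
        pos += width
    df = df.withColumn("rest", F.expr("substr({0}, {1}, length({0}))".format(col_name, pos)))
    return df.drop(col_name)

split_fixed_width(df, "col1", [("Currency1", 3), ("Currency2", 3), ("Type", 6)]).show(truncate=False)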

Performing different computations conditioned on a column value in a spark dataframe

I have a pyspark dataframe with 2 columns, A and B. I need rows of B to be processed differently, based on values of the A column. In plain pandas I might do this:
import pandas as pd

funcDict = {}
funcDict['f1'] = (lambda x: x + 1000)
funcDict['f2'] = (lambda x: x * x)

df = pd.DataFrame([['a',1],['b',2],['b',3],['a',4]], columns=['A','B'])
df['newCol'] = df.apply(lambda x: funcDict['f1'](x['B']) if x['A'] == 'a' else funcDict['f2'](x['B']), axis=1)
The easy ways I can think of to do this in (py)spark are:
Use files
read the data into a dataframe
partition by column A and write to separate files (write.partitionBy)
read in each file and then process it separately
or else
Use expr
read the data into a dataframe
write an unwieldy expr (from a readability/maintenance perspective) to conditionally do something different based on the value of the column
this will not look anywhere near as "clean" as the pandas code above
Is there a more appropriate way to handle this requirement? From an efficiency perspective, I expect the first approach to be cleaner but to have a longer run time due to the partition-write-read; the second approach is not as good from the code perspective, and is harder to extend and maintain.
More fundamentally, would you choose to use something completely different (e.g. message queues) instead (relative latency difference notwithstanding)?
EDIT 1
Based on my limited knowledge of pyspark, the solution proposed by user pissall (https://stackoverflow.com/users/8805315/pissall) works as long as the processing isn't very complex. If it is, I don't know how to do it without resorting to UDFs, which come with their own disadvantages. Consider the simple example below.
# create a 2-column data frame
# where I wish to extract the city
# in column B differently based on
# the type given in column A
# This requires taking a different
# substring (prefix or suffix) from column B
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sparkSession.createDataFrame([
    (1, "NewYork_NY"),
    (2, "FL_Miami"),
    (1, "LA_CA"),
    (1, "Chicago_IL"),
    (2, "PA_Kutztown")
], ["A", "B"])

# create UDFs to get the left and right substrings
# I do not know how to avoid creating UDFs
# for this type of processing
getCityLeft = udf(lambda x: x[0:-3], StringType())
getCityRight = udf(lambda x: x[3:], StringType())

# apply the UDFs
df = df.withColumn("city", F.when(F.col("A") == 1, getCityLeft(F.col("B"))) \
                            .otherwise(getCityRight(F.col("B"))))
Is there a way to do this in a simpler manner without resorting to UDFs? If I use expr, I can do this, but as I mentioned earlier, it doesn't seem elegant.
What about using when?
import pyspark.sql.functions as F
df = df.withColumn("transformed_B", F.when(F.col("A") == "a", F.col("B") + 1000).otherwise(F.col("B") * F.col("B")))
EDIT after more clarity on the question:
You can use split on _ and take the first or the second part of it based on your condition.
Is this the expected output?
df.withColumn("city", F.when(F.col("A") == 1, F.split("B", "_")[0]).otherwise(F.split("B", "_")[1])).show()
+---+-----------+--------+
| A| B| city|
+---+-----------+--------+
| 1| NewYork_NY| NewYork|
| 2| FL_Miami| Miami|
| 1| LA_CA| LA|
| 1| Chicago_IL| Chicago|
| 2|PA_Kutztown|Kutztown|
+---+-----------+--------+
UDF approach:
from pyspark.sql.types import StringType

def sub_string(ref_col, city_col):
    # ref_col is the reference column (A) and city_col is the string we want to sub (B)
    if ref_col == 1:
        return city_col[0:-3]
    return city_col[3:]

sub_str_udf = F.udf(sub_string, StringType())
df = df.withColumn("city", sub_str_udf(F.col("A"), F.col("B")))
Also, please look into: remove last few characters in PySpark dataframe column
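For completeness, the same prefix/suffix logic can also be expressed with built-in string functions only, avoiding UDFs entirely. A sketch (my own variant, not from the answers above):
df = df.withColumn(
    "city",
    F.when(F.col("A") == 1, F.expr("substr(B, 1, length(B) - 3)"))
     .otherwise(F.expr("substr(B, 4, length(B))"))
)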

How to convert a dataframe to a variable

Is there any direct function to convert a dataframe result and assign it to a variable?
For example, the code below returns this:
>>> partitionRecordCount = spark.sql("select count(*) from mydb.mytable where partition_date = 'yyyymmdd'")
>>> partitionRecordCount.show()
+--------+
|count(1)|
+--------+
| 206157|
+--------+
what I need is like below:
>>> partitionRecordCount
206157
I need that record count as an integer value directly in the variable on the left-hand side, rather than a dataframe. Please advise.
See this answer
get value out of dataframe
So for your example you can just change it to:
partitionRecordCount = partitionRecordCount.collect()[0]
Try
partitionRecordCount.collect()[0][0]
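As a side note (a sketch, not from the answers above): since the query is just a count, you can also get the scalar directly, for example:
partitionRecordCount = spark.sql(
    "select count(*) from mydb.mytable where partition_date = 'yyyymmdd'"
).first()[0]

# or, equivalently, with the DataFrame API
partitionRecordCount = spark.table("mydb.mytable") \
    .where("partition_date = 'yyyymmdd'") \
    .count()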