Performing different computations conditioned on a column value in a Spark dataframe

I have a pyspark dataframe with 2 columns, A and B. I need rows of B to be processed differently, based on values of the A column. In plain pandas I might do this:
import pandas as pd

funcDict = {}
funcDict['f1'] = (lambda x: x + 1000)
funcDict['f2'] = (lambda x: x * x)

df = pd.DataFrame([['a', 1], ['b', 2], ['b', 3], ['a', 4]], columns=['A', 'B'])
df['newCol'] = df.apply(
    lambda x: funcDict['f1'](x['B']) if x['A'] == 'a' else funcDict['f2'](x['B']),
    axis=1
)
The easy ways I can think of to do this in (py)spark are:

Use files:
- read the data into a dataframe
- partition by column A and write to separate files (write.partitionBy)
- read each file back in and process it separately (a rough sketch of this approach follows below)

or else, use expr:
- read the data into a dataframe
- write an unwieldy expr (from a readability/maintenance perspective) to conditionally do something different based on the value of the column
- this will not look anywhere near as "clean" as the pandas code above

Is there any other, more appropriate way to handle this requirement? From the efficiency perspective, I expect the first approach to be cleaner code-wise but to take longer to run due to the partition-write-read cycle, while the second approach is worse from the code perspective and harder to extend and maintain.
More fundamentally, would you choose to use something completely different (e.g. message queues) instead (relative latency difference notwithstanding)?
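For concreteness, a rough sketch of the first ("use files") approach; the paths and partition values here are illustrative, not from the original question:
from pyspark.sql import functions as F

# write one directory per value of A, then process each partition separately
df.write.partitionBy("A").parquet("/tmp/split_by_A")

df_a = sparkSession.read.parquet("/tmp/split_by_A/A=a").withColumn("newCol", F.col("B") + 1000)
df_b = sparkSession.read.parquet("/tmp/split_by_A/A=b").withColumn("newCol", F.col("B") * F.col("B"))
result = df_a.unionByName(df_b)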
EDIT 1
Based on my limited knowledge of pyspark, the solution proposed by user pissall (https://stackoverflow.com/users/8805315/pissall) works as long as the processing isn't very complex. Once it is, I don't know how to do it without resorting to UDFs, which come with their own disadvantages. Consider the simple example below.
# create a 2-column data frame
# where I wish to extract the city
# in column B differently based on
# the type given in column A.
# This requires taking a different
# substring (prefix or suffix) from column B
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sparkSession.createDataFrame([
    (1, "NewYork_NY"),
    (2, "FL_Miami"),
    (1, "LA_CA"),
    (1, "Chicago_IL"),
    (2, "PA_Kutztown")
], ["A", "B"])

# create UDFs to get left and right substrings;
# I do not know how to avoid creating UDFs
# for this type of processing
getCityLeft = udf(lambda x: x[0:-3], StringType())
getCityRight = udf(lambda x: x[3:], StringType())

# apply the UDFs
df = df.withColumn("city", F.when(F.col("A") == 1, getCityLeft(F.col("B")))
                            .otherwise(getCityRight(F.col("B"))))
Is there a way to do this in a simpler manner without resorting to UDFs? If I use expr, I can do this, but as I mentioned earlier, it doesn't seem elegant.
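For comparison, the expr-based version alluded to above might look roughly like this (a sketch only; it uses a SQL CASE expression with the same prefix/suffix logic as the UDFs):
# sketch: CASE expression instead of UDFs
df = df.withColumn(
    "city",
    F.expr("CASE WHEN A = 1 THEN substring(B, 1, length(B) - 3) ELSE substring(B, 4) END")
)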

What about using when?
import pyspark.sql.functions as F
df = df.withColumn(
    "transformed_B",
    F.when(F.col("A") == "a", F.col("B") + 1000).otherwise(F.col("B") * F.col("B"))
)
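If there are more than two cases, when calls can be chained before the final otherwise (a sketch with made-up branch values):
df = df.withColumn(
    "transformed_B",
    F.when(F.col("A") == "a", F.col("B") + 1000)
     .when(F.col("A") == "b", F.col("B") * F.col("B"))
     .otherwise(F.col("B"))  # fallback for any other value of A
)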
EDIT after more clarity on the question:
You can use split on _ and take the first or the second part of it based on your condition.
Is this the expected output?
df.withColumn(
    "city",
    F.when(F.col("A") == 1, F.split("B", "_")[0]).otherwise(F.split("B", "_")[1])
).show()
+---+-----------+--------+
|  A|          B|    city|
+---+-----------+--------+
|  1| NewYork_NY| NewYork|
|  2|   FL_Miami|   Miami|
|  1|      LA_CA|      LA|
|  1| Chicago_IL| Chicago|
|  2|PA_Kutztown|Kutztown|
+---+-----------+--------+
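If you'd rather avoid indexing into the split array, substring_index does the same thing directly (a small variation, not part of the original answer):
df.withColumn(
    "city",
    F.when(F.col("A") == 1, F.substring_index("B", "_", 1))
     .otherwise(F.substring_index("B", "_", -1))
).show()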
UDF approach:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def sub_string(ref_col, city_col):
    # ref_col is the reference column (A) and city_col is the string we want to sub (B)
    if ref_col == 1:
        return city_col[0:-3]
    return city_col[3:]

sub_str_udf = F.udf(sub_string, StringType())
df = df.withColumn("city", sub_str_udf(F.col("A"), F.col("B")))
Also, please look into: remove last few characters in PySpark dataframe column

Related

PySpark: Transform values of given column in the DataFrame

I am new to PySpark and Spark in general.
I would like to apply transformation on a given column in the DataFrame, essentially call a function for each value on that specific column.
I have my DataFrame df that looks like this:
df.show()
+------------+--------------------+
|     version|                body|
+------------+--------------------+
|           1|9gIAAAASAQAEAAAAA...|
|           2|2gIAAAASAQAEAAAAA...|
|           3|3gIAAAASAQAEAAAAA...|
|           1|7gIAKAASAQAEAAAAA...|
+------------+--------------------+
I need to read the value of the body column for each row where the version is 1 and then decrypt it (I have my own logic/function which takes a string and returns a decrypted string). Finally, write the decrypted values in csv format to an S3 bucket.
def decrypt(encrypted_string: str):
    # code that returns the decrypted string
    ...
So, when I do the following, I get the corresponding filtered values to which I need to apply my decrypt function.
df.where(col('version') =='1')\
.select(col('body')).show()
+--------------------+
|                body|
+--------------------+
|9gIAAAASAQAEAAAAA...|
|7gIAKAASAQAEAAAAA...|
+--------------------+
However, I am not clear how to do that. I tried to use collect() but then it defeats the purpose of using Spark.
I also tried using .rdd.map as follows but that did not work.
df.where(col('version') =='1')\
.select(col('body'))\
.rdd.map(lambda x: decrypt).toDF().show()
OR
.rdd.map(decrypt).toDF().show()
Could someone please help with this?
Please try:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

decrypt_udf = udf(decrypt, StringType())
df.where(col('version') == '1').withColumn('body', decrypt_udf('body'))
Got some clue from this post: Pyspark DataFrame UDF on Text Column.
Looks like I can simply get it with the following. I was doing it without a udf earlier, which is why it wasn't working.
dummy_function_udf = udf(decrypt, StringType())
df.where(col('version') == '1')\
.select(col('body')) \
.withColumn('decryptedBody', dummy_function_udf('body')) \
.show()
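To finish the pipeline described in the question (writing the decrypted values out as CSV to S3), something along these lines should work; the bucket path is a placeholder:
decrypted = df.where(col('version') == '1') \
    .withColumn('decryptedBody', dummy_function_udf('body')) \
    .select('decryptedBody')

# "s3://my-bucket/decrypted/" is a placeholder path
decrypted.write.csv("s3://my-bucket/decrypted/", mode="overwrite", header=True)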

How to transform pyspark dataframe 1x9 to 3x3

I'm using a pyspark dataframe.
I have a df which is 1x9, for example:
temp = spark.read.option("sep","\n").csv("temp.txt")
temp :
sam
11
newyork
john
13
boston
eric
22
texas
Without using the Pandas library, how can I transform this into a 3x3 dataframe with columns name, age, city?
Like this:
name,age,city
sam,11,newyork
john,13,boston
I would read the file as an rdd to take advantage of zipWithIndex to add an index to your data.
rdd = sc.textFile("temp.txt")
We can now use truncating division to create an index with which to group records together. Use this new index as the key for the rdd. The corresponding values will be a tuple of the header, which can be computed using the modulus, and the actual value. (Note the index returned by zipWithIndex will be at the end of the record, which is why we use row[1] for the division/mod.)
Next use reduceByKey to add the value tuples together. This will give you a tuple of keys and values (in sequence). Use map to turn that into a Row (to keep column headers, etc).
Finally use toDF() to convert to a DataFrame. You can use select(header) to get the columns in the desired order.
from operator import add
from pyspark.sql import Row
header = ["name", "age", "city"]
df = rdd.zipWithIndex()\
.map(lambda row: (row[1]//3, (header[row[1]%3], row[0])))\
.reduceByKey(add)\
.map(lambda row: Row(**dict(zip(row[1][::2], row[1][1::2]))))\
.toDF()\
.select(header)
df.show()
#+----+---+-------+
#|name|age|   city|
#+----+---+-------+
#| sam| 11|newyork|
#|eric| 22|  texas|
#|john| 13| boston|
#+----+---+-------+

Pyspark add sequential and deterministic index to dataframe

I need to add an index column to a dataframe with three very simple constraints:
start from 0
be sequential
be deterministic
I'm sure I'm missing something obvious, because the examples I'm finding look very convoluted for such a simple task, or use non-sequential, non-deterministic, monotonically increasing IDs. I don't want to zip with index and then have to separate the previously separate columns that are now in a single column, because my dataframes are in the terabytes and it just seems unnecessary. I don't need to partition by anything, nor order by anything, and the examples I'm finding do this (using window functions and row_number). All I need is a simple sequence of integers from 0 to df.count. What am I missing here?
1, 2, 3, 4, 5
What I mean is: how can I add a column with an ordered, monotonically increasing by 1 sequence 0:df.count? (from comments)
You can use row_number() here, but for that you'd need to specify an orderBy(). Since you don't have an ordering column, just use monotonically_increasing_id().
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql import Window

df = df.withColumn(
    "index",
    row_number().over(Window.orderBy(monotonically_increasing_id())) - 1
)
Also, row_number() starts at 1, so you'd have to subtract 1 to have it start from 0. The last value will be df.count - 1.
I don't want to zip with index and then have to separate the previously separated columns that are now in a single column
You can use zipWithIndex if you follow it with a call to map, to avoid having all of the separated columns turn into a single column:
cols = df.columns
df = df.rdd.zipWithIndex().map(lambda row: (row[1],) + tuple(row[0])).toDF(["index"] + cols)
Not sure about the performance, but here is a trick.
Note: toPandas will collect all the data to the driver.
from pyspark.sql import SparkSession
# speed up toPandas using arrow
spark = SparkSession.builder.appName('seq-no') \
.config("spark.sql.execution.arrow.pyspark.enabled", "true") \
.config("spark.sql.execution.arrow.enabled", "true") \
.getOrCreate()
df = spark.createDataFrame([
('id1', "a"),
('id2', "b"),
('id2', "c"),
], ["ID", "Text"])
df1 = spark.createDataFrame(df.toPandas().reset_index()).withColumnRenamed("index","seq_no")
df1.show()
+------+---+----+
|seq_no| ID|Text|
+------+---+----+
|     0|id1|   a|
|     1|id2|   b|
|     2|id2|   c|
+------+---+----+

Pyspark dataframe conversion to pandas drops data?

I have a fairly involved process of creating a pyspark dataframe, converting it to a pandas dataframe, and outputting the result to a flat file. I am not sure at which point the error is introduced, so I'll describe the whole process.
Starting out I have a pyspark dataframe that contains pairwise similarity for sets of ids. It looks like this:
+------+-------+-------------------+
|  ID_A|   ID_B|  EuclideanDistance|
+------+-------+-------------------+
|     1|      1|                0.0|
|     1|      2|0.13103884200454394|
|     1|      3| 0.2176246463836219|
|     1|      4|  0.280568636550471|
...
I'd like to group it by ID_A, sort each group by EuclideanDistance, and only grab the top N pairs for each group. So first I do this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
window = Window.partitionBy(df['ID_A']).orderBy(df_sim['EuclideanDistance'])
result = (df.withColumn('row_num', row_number().over(window)))
I make sure ID_A = 1 is still in the "result" dataframe. Then I do this to limit each group to just 20 rows:
result1 = result.where(result.row_num<20)
result1.toPandas().to_csv("mytest.csv")
and ID_A = 1 is NOT in the resultant .csv file (although it's still there in result1). Is there a problem somewhere in this chain of conversions that could lead to a loss of data?
You are referencing two dataframes in the window of your solution. I'm not sure that is causing your error, but it's worth cleaning up; in any case, you don't need to reference a particular dataframe in a window definition. Try:
window = Window.partitionBy('ID_A').orderBy('EuclideanDistance')
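Putting that together with the rest of the chain from the question, the fix would look roughly like this (a sketch; note that row_num < 20 keeps 19 rows per group, so use <= 20 if exactly 20 are wanted):
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy('ID_A').orderBy('EuclideanDistance')
result = df.withColumn('row_num', row_number().over(window))
result1 = result.where(result.row_num <= 20)
result1.toPandas().to_csv("mytest.csv")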
As David mentioned, you reference a second dataframe "df_sim" in your window function.
I tested the following and it works on my machine (famous last words):
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
import numpy as np
import pandas as pd

# simulate some data
df = pd.DataFrame({'ID_A': np.arange(100) % 5,
                   'ID_B': np.repeat(np.arange(20), 5),
                   'EuclideanDistance': np.random.rand(100) * 5})
# artificially set the distance between a point and itself to 0
df.loc[df['ID_A'] == df['ID_B'], 'EuclideanDistance'] = 0
df = spark.createDataFrame(df)
# end simulation

window = Window.partitionBy(df['ID_A']).orderBy(df['EuclideanDistance'])
output = df.select('*', row_number().over(window).alias('rank')).filter(col('rank') <= 10)
output.show(50)
output.show(50)
The simulation code is there just to make this a self-contained example. You can of course use your actual dataframe and ignore the simulation when you test it. Hope that works!

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating:
(df.groupBy("group")
.agg({"money":"sum"})
.show(100)
)
This will give me:
group  SUM(money#2L)
A       137461285853
B       172185566943
C       271179590646
The aggregation works just fine but I dislike the new column name SUM(money#2L). Is there a way to rename this column into something human readable from the .agg method? Maybe something more similar to what one would do in dplyr:
df %>% group_by(group) %>% summarise(sum_money = sum(money))
Although I still prefer dplyr syntax, this code snippet will do:
import pyspark.sql.functions as sf
(df.groupBy("group")
.agg(sf.sum('money').alias('money'))
.show(100))
It gets verbose.
withColumnRenamed should do the trick. Here is the link to the pyspark.sql API.
df.groupBy("group")\
    .agg({"money": "sum"})\
    .withColumnRenamed("SUM(money)", "money")\
    .show(100)
I made a little helper function for this that might help some people out.
import re
from functools import partial

def rename_cols(agg_df, ignore_first_n=1):
    """changes the default spark aggregate names `avg(colname)`
    to something a bit more useful. Pass an aggregated dataframe
    and the number of aggregation columns to ignore.
    """
    delimiters = "(", ")"
    split_pattern = '|'.join(map(re.escape, delimiters))
    splitter = partial(re.split, split_pattern)
    split_agg = lambda x: '_'.join(splitter(x))[0:-ignore_first_n]
    renamed = map(split_agg, agg_df.columns[ignore_first_n:])
    renamed = zip(agg_df.columns[ignore_first_n:], renamed)
    for old, new in renamed:
        agg_df = agg_df.withColumnRenamed(old, new)
    return agg_df
An example:
gb = (df.selectExpr("id", "rank", "rate", "price", "clicks")
      .groupby("id")
      .agg({"rank": "mean",
            "*": "count",
            "rate": "mean",
            "price": "mean",
            "clicks": "mean",
            })
      )
>>> gb.columns
['id',
'avg(rate)',
'count(1)',
'avg(price)',
'avg(rank)',
'avg(clicks)']
>>> rename_cols(gb).columns
['id',
'avg_rate',
'count_1',
'avg_price',
'avg_rank',
'avg_clicks']
It at least saves people a bit of typing.
It's as simple as:
val maxVideoLenPerItemDf = requiredItemsFiltered.groupBy("itemId").agg(max("playBackDuration").as("customVideoLength"))
maxVideoLenPerItemDf.show()
Use .as in agg to name the new column created.
.alias and .withColumnRenamed both work if you're willing to hard-code your column names. If you need a programmatic solution, e.g. friendlier names for an aggregation of all remaining columns, this provides a good starting point:
grouping_column = 'group'
cols = [F.sum(F.col(x)).alias(x) for x in df.columns if x != grouping_column]
(
    df
    .groupBy(grouping_column)
    .agg(*cols)
)
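A quick usage example with a made-up frame (the data and column names here are hypothetical):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [("A", 1, 10), ("A", 2, 20), ("B", 3, 30)],
    ["group", "money", "clicks"]
)
grouping_column = 'group'
cols = [F.sum(F.col(x)).alias(x) for x in df.columns if x != grouping_column]
df.groupBy(grouping_column).agg(*cols).show()
# +-----+-----+------+
# |group|money|clicks|
# +-----+-----+------+
# |    A|    3|    30|
# |    B|    3|    30|
# +-----+-----+------+
# (row order may vary)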
df = df.groupby('Device_ID').agg(aggregate_methods)
for column in df.columns:
    start_index = column.find('(')
    end_index = column.find(')')
    if start_index != -1 and end_index != -1:
        df = df.withColumnRenamed(column, column[start_index + 1:end_index])
The above code can strip out anything that is outside of the "()". For example, "sum(foo)" will be renamed as "foo".
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
spark = SparkSession.builder.appName('test').getOrCreate()
data = [(1, "siva", 100), (2, "siva2", 200),(3, "siva3", 300),(4, "siva4", 400),(5, "siva5", 500)]
schema = ['id', 'name', 'sallary']
df = spark.createDataFrame(data, schema=schema)
df.show()
+---+-----+-------+
| id| name|sallary|
+---+-----+-------+
|  1| siva|    100|
|  2|siva2|    200|
|  3|siva3|    300|
|  4|siva4|    400|
|  5|siva5|    500|
+---+-----+-------+
df.agg({"sallary": "max"}).withColumnRenamed('max(sallary)', 'max').show()
+---+
|max|
+---+
|500|
+---+
While the previously given answers are good, I think they lack a neat way to deal with dictionary usage in .agg().
If you want to use a dict, which might actually also be dynamically generated because you have hundreds of columns, you can use the following without dealing with dozens of lines of code:
# Your dictionary-version of using the .agg()-function
# Note: The provided logic could actually also be applied to a non-dictionary approach
df = df.groupBy("group")\
       .agg({
           "money": "sum"
           , "...": "..."
       })

# Now do the renaming
newColumnNames = ["group", "money", "..."]  # Provide the names for ALL columns of the new df
df = df.toDF(*newColumnNames)               # Do the renaming
Of course the newColumnNames list can also be dynamically generated. E.g., if you only append columns from the aggregation to your df, you can pre-store newColumnNames = df.columns and then just append the additional names.
Anyhow, be aware that newColumnNames must contain all column names of the dataframe, not only those to be renamed (because .toDF() creates a new dataframe due to Spark's immutable RDDs)!
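A minimal sketch of generating newColumnNames dynamically by stripping the aggregate wrappers from whatever Spark produced (the regex and the "clicks" column are illustrative, not from the original answer):
import re

agg_df = df.groupBy("group").agg({"money": "sum", "clicks": "sum"})
# turn e.g. "sum(money)" into "money"; plain names like "group" are left untouched
newColumnNames = [re.sub(r"^\w+\((.*)\)$", r"\1", c) for c in agg_df.columns]
agg_df = agg_df.toDF(*newColumnNames)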
Another quick little one to add to the mix:
agg_df = df.groupBy('group')\
    .agg({'money': 'sum',
          'moreMoney': 'sum',
          'evenMoreMoney': 'sum'
          })
agg_df = agg_df.select(*(col(i).alias(i.replace("(", '_').replace(')', '')) for i in agg_df.columns))
Just change the alias function to whatever you'd like to name the columns. The above generates sum_money, sum_moreMoney, and so on, since I like seeing the operator in the variable name.