Pyspark SQL Select but with a function?

I am looking at this SQL query:
SELECT
tbl.id as id,
tbl.name as my_name,
tbl.account as new_account_id,
CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', tbl.entry_time)::DATE AS my_time
FROM tbl
I am wondering how I would convert this into a Pyspark dataframe?
Say I loaded tbl as a CSV into Pyspark like:
tbl_dataframe = spark...load('/files/tbl.csv')
Now I want to use SELECT on this dataframe, something like:
final_dataframe = tbl_dataframe.select('id', 'name', ...)
The issue here is:
How do I rename, say, 'name' to 'my_name' with this select?
Is it even possible to apply that CONVERT_TIMEZONE function within a DataFrame select? What's the best/standard approach for this?

How do I rename, say, 'name' to 'my_name' with this select?
Assuming your dataframe looks like this
# +---+----+
# | id|name|
# +---+----+
# | 1| foo|
# | 2| bar|
# +---+----+
There are a few different ways to do this rename:
from pyspark.sql import functions as F

df.select(F.col('name').alias('my_name')) # you select the specific column and give it an alias
# +-------+
# |my_name|
# +-------+
# | foo|
# | bar|
# +-------+
# or
df.withColumn('my_name', F.col('name')) # you create new column with value from old column
# +---+----+-------+
# | id|name|my_name|
# +---+----+-------+
# | 1| foo| foo|
# | 2| bar| bar|
# +---+----+-------+
# or
df.withColumnRenamed('name', 'my_name') # you rename column
# +---+-------+
# | id|my_name|
# +---+-------+
# | 1| foo|
# | 2| bar|
# +---+-------+
Is it even possible to apply that CONVERT_TIMEZONE function within a DataFrame select? What's the best/standard approach for this?
CONVERT_TIMEZONE is not a standard Spark function, but if it's a Hive function that is already registered somewhere in your environment, you can call it through F.expr('convert_timezone(...)').
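For reference, here is a minimal sketch of the whole query in the DataFrame API, assuming the CSV exposes columns named id, name, account and entry_time as in the SQL, and using the built-in from_utc_timestamp (which shifts a UTC timestamp into the given zone) plus a date cast to stand in for CONVERT_TIMEZONE(...)::DATE:
from pyspark.sql import functions as F

final_dataframe = tbl_dataframe.select(
    F.col('id').alias('id'),
    F.col('name').alias('my_name'),
    F.col('account').alias('new_account_id'),
    F.from_utc_timestamp(F.col('entry_time'), 'America/Los_Angeles').cast('date').alias('my_time')  # UTC -> LA, then date
)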

Related

Pyspark scan column against other column or list

Given the example dataframe:
+---+---------------+
| id| log|
+---+---------------+
| 1|Test logX blk_A|
| 2|Test logV blk_B|
| 3|Test logF blk_D|
| 4|Test logD blk_F|
| 5|Test logB blk_K|
| 6|Test logY blk_A|
| 7|Test logE blk_C|
+---+---------------+
I'm trying to label it by comparing the log column with a list (or a df column, I can convert it easily) of the blocks tagged as anomalous.
This means that I need to scan each log line against this list and add a label column.
Given the list:
anomalous_blocks = ['blk_A','blk_C','blk_D']
The expected resulting dataframe would be:
+---+---------------+-----+
| id| log|Label|
+---+---------------+-----+
| 1|Test logX blk_A| True|
| 2|Test logV blk_B|False|
| 3|Test logF blk_D| True|
| 4|Test logD blk_F|False|
| 5|Test logB blk_K|False|
| 6|Test logY blk_A| True|
| 7|Test logE blk_C| True|
+---+---------------+-----+
I tried to think and look for a solution in SQL or Spark that could accomplish this, but came up short.
I thought of using a udf (user defined function) like this:
from pyspark.sql.functions import udf

def check_anomaly(text, anomalies):
    for a in anomalies:
        if a in text:
            return True
    return False

anomaly_matchUDF = udf(lambda x, y: check_anomaly(x, y))
But it takes way too long and doesn't seem like the proper way to go about this.
Any suggestion would be greatly appreciated.
EDIT:
For clarity, the size of the list is way smaller compared to the number of rows/logs.
In other words, given N log lines and a list of M blocks tagged as anomalous
N >> M
EDIT2:
Updated the df to represent the real situation more accurately.
You could use the like (or contains) operator and create a chain of conditions using reduce.
from functools import reduce
from pyspark.sql import functions as func

anomalous_blocks = ['blk_A', 'blk_C', 'blk_D']

label_condition = reduce(lambda a, b: a | b,
                         [func.col('log').like('%'+k+'%') for k in anomalous_blocks]
                         )
# Column<'((log LIKE %blk_A% OR log LIKE %blk_C%) OR log LIKE %blk_D%)'>

data_sdf. \
    withColumn('label', label_condition). \
    show()
# +---+---------------+-----+
# | id| log|label|
# +---+---------------+-----+
# | 1|Test logX blk_A| true|
# | 2|Test logV blk_B|false|
# | 3|Test logF blk_D| true|
# | 4|Test logD blk_F|false|
# | 5|Test logB blk_K|false|
# | 6|Test logY blk_A| true|
# | 7|Test logE blk_C| true|
# +---+---------------+-----+
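An equivalent and slightly shorter variant of the same chained-condition idea is a single rlike with the block IDs joined by |. A minimal sketch, assuming the block IDs contain no regex metacharacters:
from pyspark.sql import functions as func

data_sdf. \
    withColumn('label', func.col('log').rlike('|'.join(anomalous_blocks))). \
    show()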
You can use the isin method on a pyspark.sql.Column to achieve this without needing UDFs (notice that I adapted the contents of your anomalous_blocks list slightly so they match the df's contents exactly; this should be really cheap since you said N >> M):
df = spark.createDataFrame(
    [
        (1, "Test log blk_A"),
        (2, "Test log blk_B"),
        (3, "Test log blk_D"),
        (4, "Test log blk_F"),
        (5, "Test log blk_K"),
        (6, "Test log blk_A"),
        (7, "Test log blk_C")
    ],
    ["id", "log"]
)
anomalous_blocks = ['blk_A','blk_C','blk_D']
# Solution starts here
adapted_anomalous_blocks = ["Test log " + x for x in anomalous_blocks]
output = df.withColumn("Label", df.log.isin(adapted_anomalous_blocks))
output.show()
+---+--------------+-----+
| id| log|Label|
+---+--------------+-----+
| 1|Test log blk_A| true|
| 2|Test log blk_B|false|
| 3|Test log blk_D| true|
| 4|Test log blk_F|false|
| 5|Test log blk_K|false|
| 6|Test log blk_A| true|
| 7|Test log blk_C| true|
+---+--------------+-----+

How to calculate values for a column in a row based on previous row's column's value for a PySpark Dataframe?

I have a column 'val' whose value gets calculated at each row; the next row then takes in that value, applies some logic to it, and the value for that row gets updated as well. It can be shown as follows:
val(x) = f(val(x-1), col_a(x), col_b(x)) where x is the row number (indexed at 0)
val(0) = f(col_a(0), col_b(0)) {some fixed value calculated based on two columns}
val(0) represents the first value in a partition.
[ f here represents some arbitrary function]
I tried using the lag function as follows (for a sample dataframe):
windowSpec = Window.partitionBy("department")
+-------------+----------+------+------+------+
|employee_name|department| a | b | val |
+-------------+----------+------+------+------+
|James |Sales |3000 |2500 |5500 | #val(0) = (a(0) + b(0)) = 5500 [first value within a partition]
|Michael |Sales |4600 |1650 |750 | #val(1) = (a(1) + b(1) - val(0)) = 750
|Robert |Sales |4100 |1100 |4450 | #val(2) = (a(2) + b(2) - val(1)) = 4450
|Maria |Finance |3000 |7000 |xxxx | #....... and so on, this is how I want the calculations to take place.
|James |Finance |3000 |5000 |xxxx |
|Scott |Marketing |3300 |4300 |xxxx |
|Jen |Marketing |3900 |3700 |xxxx |
df = df.withColumn("val",col("a") + col("b") - lag("val",1).over(windowSpec)) #I tried this but it does not have the desired result.
How can I implement this in PySpark?
Tracking the previously calculated value from the same column is hard to do in Spark -- I'm not saying it's impossible, and there certainly are ways (hacks) to achieve it. One way to do it is using an array of structs and the aggregate function.
Two assumptions in your data
There is an ID column that has the sort order of the data - spark does not retain dataframe sorting due to its distributed nature
There is a grouping key for the processing to be optimized
# input data with aforementioned assumptions
data_sdf.show()
# +---+---+-------+---------+----+----+
# | gk|idx| name| dept| a| b|
# +---+---+-------+---------+----+----+
# | gk| 1| James| Sales|3000|2500|
# | gk| 2|Michael| Sales|4600|1650|
# | gk| 3| Robert| Sales|4100|1100|
# | gk| 4| Maria| Finance|3000|7000|
# | gk| 5| James| Finance|3000|5000|
# | gk| 6| Scott|Marketing|3300|4300|
# | gk| 7| Jen|Marketing|3900|3700|
# +---+---+-------+---------+----+----+
from pyspark.sql import functions as func

# create structs with all columns and collect them to an array
# use the array of structs to do the val calcs
# NOTE - keep the ID field at the beginning for the `array_sort` to work as required
arr_of_structs_sdf = data_sdf. \
    withColumn('allcol_struct', func.struct(*data_sdf.columns)). \
    groupBy('gk'). \
    agg(func.array_sort(func.collect_list('allcol_struct')).alias('allcol_struct_arr'))

# function to create struct schema string
struct_fields = lambda x: ', '.join([str(x)+'.'+k+' as '+k for k in data_sdf.columns])

# use `aggregate` to do the val calc
arr_of_structs_sdf. \
    withColumn('new_allcol_struct_arr',
               func.expr('''
                   aggregate(slice(allcol_struct_arr, 2, size(allcol_struct_arr)),
                             array(struct({0}, (allcol_struct_arr[0].a+allcol_struct_arr[0].b) as val)),
                             (x, y) -> array_union(x,
                                                   array(struct({1}, ((y.a+y.b)-element_at(x, -1).val) as val))
                                                   )
                             )
                   '''.format(struct_fields('allcol_struct_arr[0]'), struct_fields('y'))
               )
               ). \
    selectExpr('inline(new_allcol_struct_arr)'). \
    show(truncate=False)
# +---+---+-------+---------+----+----+----+
# |gk |idx|name |dept |a |b |val |
# +---+---+-------+---------+----+----+----+
# |gk |1 |James |Sales |3000|2500|5500|
# |gk |2 |Michael|Sales |4600|1650|750 |
# |gk |3 |Robert |Sales |4100|1100|4450|
# |gk |4 |Maria |Finance |3000|7000|5550|
# |gk |5 |James |Finance |3000|5000|2450|
# |gk |6 |Scott |Marketing|3300|4300|5150|
# |gk |7 |Jen |Marketing|3900|3700|2450|
# +---+---+-------+---------+----+----+----+
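As an aside, for the specific f used in this example (val = a + b minus the previous val), the recurrence unrolls into an alternating cumulative sum, so a plain window can also reproduce the result. A minimal sketch of that shortcut (it does not generalize to an arbitrary f):
from pyspark.sql import functions as func, Window as wd

w_ord = wd.partitionBy('gk').orderBy('idx')
w_cum = w_ord.rowsBetween(wd.unboundedPreceding, 0)

data_sdf. \
    withColumn('rn', func.row_number().over(w_ord) - 1). \
    withColumn('sign', func.when(func.col('rn') % 2 == 0, 1).otherwise(-1)). \
    withColumn('val', func.col('sign') * func.sum(func.col('sign') * (func.col('a') + func.col('b'))).over(w_cum)). \
    drop('rn', 'sign'). \
    show()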

How can we iterate through a column vertically downwards using PySpark?

For instance, in a dataframe where col1 is the name of a column and it has values 1, 2, 3 and so on for every row, how do I iterate through the 10, 20, 30, ... values alone?
Well... Bluntly said, in Spark you just don't iterate. You don't deal with rows in Spark. You just learn a new way of thinking and only deal with columns.
E.g., your example:
df = spark.range(101).toDF("col1")
df.show()
# +----+
# |col1|
# +----+
# | 0|
# | 1|
# | 2|
# | 3|
# | 4|
# | 5|
# | 6|
# | 7|
# | 8|
# | 9|
# | 10|
# | 11|
# | ...|
If you want to get only the rows where col1 = 10, 20, 30, 40, ..., you must see a sequence there. You think about it and create a rule to smart-filter your dataframe:
df = df.filter('col1 % 10 = 0')
df.show()
# +----+
# |col1|
# +----+
# | 0|
# | 10|
# | 20|
# | 30|
# | 40|
# | 50|
# | 60|
# | 70|
# | 80|
# | 90|
# | 100|
# +----+
Row order is never deterministic in Spark. Every action can change row order. Sorting is available, but it's costly and impractical, as the next operation will ruin the order. When you sort, you pull everything onto one machine (only when data is on one node may you, at least temporarily, preserve the order, because normally data is split across many machines and none of them is "first" or "second"). In distributed computing, data should stay distributed as much as possible.
That said, iterating may rarely be needed. There's df.collect() which (same as sorting) collects all rows into one list on one machine (the driver - the weakest machine). This operation is to be avoided because it distorts the nature of distributed computing, but in rare cases it is used; iterating over rows is such an exception. Almost any data operation is possible without iterating. You just search the web, think and learn new ways of doing things.
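For completeness, a minimal sketch of that rare case, assuming the small df from above: collect() pulls everything to the driver as a list of Row objects, while toLocalIterator() streams it partition by partition instead of all at once.
rows = df.collect()                 # list of Row objects on the driver
for row in rows:
    print(row['col1'])

# or, to avoid materializing everything at once:
for row in df.toLocalIterator():
    print(row['col1'])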

How to increment the value by one if the value of the column is not there in Pyspark

I have the below pyspark dataframe:
a = [["1","fawef"],["","esd"],["","rdf"],["2","ddbf"]]
columns = ["id","name"]
df = spark.createDataFrame(data = a, schema = columns)
id   name
1    fawef
     esd
     rdf
2    ddbf
Now my requirement is: if the id column is empty, I need to get the max of the id column, increment that value by 1, and place the result in a new column for that particular row.
Example:
In the above dataframe the second row has an empty id, so I need to take the max of the id column (which is 2) and add 1 to it, giving 3. I then need to place 3 in the second row of the new column.
Output I am expecting:
id   name    new_col
1    fawef   1
     esd     3
     rdf     4
2    ddbf    2
Is there any way to achieve the above output? It would be great.
Incrementing it will be easy if we have the max of the id field and a row_number() wherever the id field is blank.
I used the following data
# +----+----+
# | id|name|
# +----+----+
# | 1|blah|
# |null| yes|
# |null| no|
# | 2|bleh|
# |null|ohno|
# +----+----+
and did the following transformations
from pyspark.sql import functions as func
from pyspark.sql import Window as wd

data_sdf. \
    withColumn('id_rn', func.row_number().over(wd.partitionBy('id').orderBy(func.lit('1')))). \
    withColumn('new_id',
               func.when(func.col('id').isNull(), func.max('id').over(wd.partitionBy(func.lit('1'))) + func.col('id_rn')).
               otherwise(func.col('id'))
               ). \
    show()
# +----+----+-----+------+
# | id|name|id_rn|new_id|
# +----+----+-----+------+
# |null| yes| 1| 3|
# |null| no| 2| 4|
# |null|ohno| 3| 5|
# | 1|blah| 1| 1|
# | 2|bleh| 1| 2|
# +----+----+-----+------+
I created a new field (id_rn) to assign row numbers to the blank values using row_number()
I then added that row number to the max of the id field whenever id is blank
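Note that the question's dataframe has empty strings rather than nulls in id, and the values are strings. A minimal adaptation sketch would first turn the blanks into nulls and cast to int, after which the row_number/max approach above applies unchanged:
from pyspark.sql import functions as func

df_prepped = df.withColumn(
    'id',
    func.when(func.col('id') == '', None).otherwise(func.col('id')).cast('int')  # blank -> null, string -> int
)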

Mismatched feature counts in spark data frame

I am new to Spark and I am trying to clean a relatively large dataset. The problem I have is that the feature values seem to be mismatched in the original dataset. It looks something like this for the first line when I take a summary of the dataset:
+-------+---+---+
|summary|  A|  B|
+-------+---+---+
|  count|  5| 10|
+-------+---+---+
I am trying to find a way to filter based on the row with the lowest count across all features and maintain the ordering.
I would like to have:
+-------+---+---+
|summary|  A|  B|
+-------+---+---+
|  count|  5|  5|
+-------+---+---+
How could I achieve this? Thanks!
Here are two approaches for you to consider:
Simple approach
# Set up the example df
df = spark.createDataFrame([('count',5,10)],['summary','A','B'])
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 10|
# +-------+---+---+
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
@udf(returnType=IntegerType())
def get_row_min(A, B):
    return min([A, B])
df.withColumn('new_A', get_row_min(col('A'), col('B')))\
  .withColumn('new_B', col('new_A'))\
  .drop('A')\
  .drop('B')\
  .withColumnRenamed('new_A', 'A')\
  .withColumnRenamed('new_B', 'B')\
  .show()
# +-------+---+---+
# |summary| A| B|
# +-------+---+---+
# | count| 5| 5|
# +-------+---+---+
Generic approach for indirectly specified columns
# Set up df with an extra column (and an extra row to show it works)
df2 = spark.createDataFrame([('count', 5, 10, 15),
                             ('count', 3, 2, 1)],
                            ['summary', 'A', 'B', 'C'])
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 10| 15|
# | count| 3| 2| 1|
# +-------+---+---+---+
@udf(returnType=IntegerType())
def get_row_min_generic(*cols):
    return min(cols)
exclude = ['summary']
df3 = df2.withColumn('min_val',
                     get_row_min_generic(*[col(col_name) for col_name in df2.columns
                                           if col_name not in exclude]))
exclude.append('min_val') # this could just be specified in the list
# from the beginning instead of appending
new_cols = [col('min_val').alias(c) for c in df2.columns if c not in exclude]
df_out = df3.select(['summary']+new_cols)
df_out.show()
# +-------+---+---+---+
# |summary| A| B| C|
# +-------+---+---+---+
# | count| 5| 5| 5|
# | count| 1| 1| 1|
# +-------+---+---+---+
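As a side note, a minimal sketch of the same row-wise minimum without any UDF, using the built-in least function (assuming the same df2 as above and that every column except 'summary' should be replaced):
from pyspark.sql.functions import least, col

min_col = least(*[col(c) for c in df2.columns if c != 'summary'])  # row-wise minimum across the numeric columns
df2.select('summary', *[min_col.alias(c) for c in df2.columns if c != 'summary']).show()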