PySpark: scan column against another column or list - dataframe

Given the example dataframe:
+---+---------------+
| id| log|
+---+---------------+
| 1|Test logX blk_A|
| 2|Test logV blk_B|
| 3|Test logF blk_D|
| 4|Test logD blk_F|
| 5|Test logB blk_K|
| 6|Test logY blk_A|
| 7|Test logE blk_C|
+---+---------------+
I'm trying to label it by comparing the log with a list (or a df column, I can convert it easily) of the blocks tagged as anomalous.
This means that I need to scan each log line against this list and add the label column.
Given the list:
anomalous_blocks = ['blk_A','blk_C','blk_D']
The expected resulting dataframe would be:
+---+---------------+-----+
| id| log|Label|
+---+---------------+-----+
| 1|Test logX blk_A| True|
| 2|Test logV blk_B|False|
| 3|Test logF blk_D| True|
| 4|Test logD blk_F|False|
| 5|Test logB blk_K|False|
| 6|Test logY blk_A| True|
| 7|Test logE blk_C| True|
+---+---------------+-----+
I tried to think and look for a solution in SQL or Spark that could accomplish this, but came up short.
I thought of using a UDF (user-defined function) like this:
from pyspark.sql.functions import udf

def check_anomaly(text, anomalies):
    for a in anomalies:
        if a in text:
            return True
    return False

anomaly_matchUDF = udf(lambda x, y: check_anomaly(x, y))
But it takes way too long and doesn't seem like the proper way to go about this.
Any suggestion would be greatly appreciated.
EDIT:
For clarity, the size of the list is way smaller compared to the number of rows/logs.
In other words, given N log lines and a list of M blocks tagged as anomalous
N >> M
EDIT2:
Updated the df to represent the real situation more accurately.

You could use the like or contains operator and create a chain of conditions using reduce.
from functools import reduce
from pyspark.sql import functions as func

anomalous_blocks = ['blk_A', 'blk_C', 'blk_D']

label_condition = reduce(lambda a, b: a | b,
                         [func.col('log').like('%' + k + '%') for k in anomalous_blocks]
                         )
# Column<'((log LIKE %blk_A% OR log LIKE %blk_C%) OR log LIKE %blk_D%)'>
data_sdf. \
    withColumn('label', label_condition). \
    show()
# +---+---------------+-----+
# | id| log|label|
# +---+---------------+-----+
# | 1|Test logX blk_A| true|
# | 2|Test logV blk_B|false|
# | 3|Test logF blk_D| true|
# | 4|Test logD blk_F|false|
# | 5|Test logB blk_K|false|
# | 6|Test logY blk_A| true|
# | 7|Test logE blk_C| true|
# +---+---------------+-----+
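The same chain can be built with contains instead of like; as a further sketch, a single rlike regex avoids the reduce altogether (this assumes the block ids contain no regex metacharacters):
from functools import reduce
from pyspark.sql import functions as func

# chained condition using Column.contains instead of like
label_condition = reduce(lambda a, b: a | b,
                         [func.col('log').contains(k) for k in anomalous_blocks])

# or a single regex via rlike (assumes block ids have no regex metacharacters)
label_condition = func.col('log').rlike('|'.join(anomalous_blocks))

data_sdf.withColumn('label', label_condition).show()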

You can use the isin method on a pyspark.sql.Column to achieve this without needing UDFs. (Notice that I changed the contents of your anomalous_blocks list slightly so they match the df's contents exactly. This should be really cheap since you said N >> M.)
df = spark.createDataFrame(
    [
        (1, "Test log blk_A"),
        (2, "Test log blk_B"),
        (3, "Test log blk_D"),
        (4, "Test log blk_F"),
        (5, "Test log blk_K"),
        (6, "Test log blk_A"),
        (7, "Test log blk_C")
    ],
    ["id", "log"]
)
anomalous_blocks = ['blk_A','blk_C','blk_D']
# Solution starts here
adapted_anomalous_blocks = ["Test log " + x for x in anomalous_blocks]
output = df.withColumn("Label", df.log.isin(adapted_anomalous_blocks))
output.show()
+---+--------------+-----+
| id| log|Label|
+---+--------------+-----+
| 1|Test log blk_A| true|
| 2|Test log blk_B|false|
| 3|Test log blk_D| true|
| 4|Test log blk_F|false|
| 5|Test log blk_K|false|
| 6|Test log blk_A| true|
| 7|Test log blk_C| true|
+---+--------------+-----+
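Note that isin does an exact, whole-value match. For the log lines in the question's EDIT2, where the block id is only a substring, one hedged option is to extract the block token first with regexp_extract (the blk_\w+ pattern is an assumption about the id format) and apply isin to that:
from pyspark.sql import functions as F

# extract the block token (assumed to look like "blk_<something>") and test membership
labeled = df.withColumn("Label", F.regexp_extract("log", r"blk_\w+", 0).isin(anomalous_blocks))
labeled.show()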

Related

(spark scala) How to remove nulls in all columns of a dataframe and substitute with default values

I am getting a dataframe which, when printed, is as follows. Essentially it's Array[String] data types, and at times in the database we have arrays of nulls.
+----------+
|newAddress|
+----------+
| null|
| null|
| null|
| null|
| null|
| null|
| [,,]|
| [,,]|
| [,,]|
| [,,]|
| null|
| [,,]|
| [,,]|
| [,,]|
| [,,]|
| [,,]|
| [,,]|
| [,,]|
| [,,]|
+----------+
So I want to write a UDF which scans all columns of the dataframe, and if the datatype is an array (of any type), scans through the array and removes the nulls. If this can be built generically, without needing the column names etc., it would be great.
Any thoughts?
DataFrame has a dtypes method, which returns the column names along with their data types: Array[("Column name", "Data Type")].
You can map over this array, applying a different expression to each column based on its data type, and then pass the mapped list to the select method:
import org.apache.spark.sql.functions.{col, filter}
import spark.implicits._

val df = Seq((1, List[Integer](1, 2, null))).toDF
+---+------------+
| _1| _2|
+---+------------+
| 1|[1, 2, null]|
+---+------------+
df.dtypes
// Array[(String, String)] = Array((_1,IntegerType), (_2,ArrayType(IntegerType,true)))
val cols =
  df.dtypes.map {
    case (c, t) if t.startsWith("ArrayType") => filter(col(c), x => x.isNotNull).as(c)
    case (c, _) => col(c)
  }
df.select(cols:_*).show
+---+------+
| _1| _2|
+---+------+
| 1|[1, 2]|
+---+------+
You can iterate over the schema of the dataframe and use Spark SQL built-in functions to filter the array columns:
import org.apache.spark.sql.functions.{filter, col}
// sample data
import spark.implicits._
val data = Seq((5,List[Integer](1,2,null), List[String](null, null, "a"))).toDF
// get the array columns out of the schema and filter the null values
val dataWithoutNulls = data.schema
  .filter(field => field.dataType.typeName == "array")
  .map(_.name)
  .foldLeft(data) { (df, colName) =>
    df.withColumn(colName, filter(col(colName), c => c.isNotNull))
  }
dataWithoutNulls.show()
// +---+------+---+
// | _1| _2| _3|
// +---+------+---+
// | 5|[1, 2]|[a]|
// +---+------+---+

How to calculate values for a column in a row based on previous row's column's value for a PySpark Dataframe?

I have a column 'val' whose value gets calculated at each row; the next row then takes in that value and applies some logic to it, and the value for that row also gets updated. It can be shown as follows:
val(x) = f(val(x-1), col_a(x), col_b(x)) where x is the row number (indexed at 0)
val(0) = f(col_a(0), col_b(0)) {some fixed value calculated based on two columns}
val(0) represents the first value in a partition.
[ f here represents some arbitrary function]
I tried using the lag function as follows (for a sample dataframe):
windowSpec = Window.partitionBy("department")
+-------------+----------+------+------+------+
|employee_name|department| a | b | val |
+-------------+----------+------+------+------+
|James |Sales |3000 |2500 |5500 | #val(0) = (a(0) + b(0)) = 5500 [first value within a partition]
|Michael |Sales |4600 |1650 |750 | #val(1) = (a(1) + b(1) - val(0)) = 750
|Robert |Sales |4100 |1100 |4450 | #val(2) = (a(2) + b(2) - val(1)) = 4450
|Maria |Finance |3000 |7000 |xxxx | #....... and so on, this is how I want the calculations to take place.
|James |Finance |3000 |5000 |xxxx |
|Scott |Marketing |3300 |4300 |xxxx |
|Jen |Marketing |3900 |3700 |xxxx |
df = df.withColumn("val",col("a") + col("b") - lag("val",1).over(windowSpec)) #I tried this but it does not have the desired result.
How can I implement this in PySpark?
Tracking the previously calculated value from the same column is hard to do in Spark; I'm not saying it's impossible, and there certainly are ways (hacks) to achieve it. One way to do it is using an array of structs and the aggregate function.
Two assumptions about your data:
There is an ID column that has the sort order of the data - Spark does not retain dataframe ordering due to its distributed nature
There is a grouping key for the processing to be optimized
# input data with aforementioned assumptions
data_sdf.show()
# +---+---+-------+---------+----+----+
# | gk|idx| name| dept| a| b|
# +---+---+-------+---------+----+----+
# | gk| 1| James| Sales|3000|2500|
# | gk| 2|Michael| Sales|4600|1650|
# | gk| 3| Robert| Sales|4100|1100|
# | gk| 4| Maria| Finance|3000|7000|
# | gk| 5| James| Finance|3000|5000|
# | gk| 6| Scott|Marketing|3300|4300|
# | gk| 7| Jen|Marketing|3900|3700|
# +---+---+-------+---------+----+----+
from pyspark.sql import functions as func

# create structs with all columns and collect them to an array
# use the array of structs to do the val calcs
# NOTE - keep the ID field at the beginning for the `array_sort` to work as reqd
arr_of_structs_sdf = data_sdf. \
    withColumn('allcol_struct', func.struct(*data_sdf.columns)). \
    groupBy('gk'). \
    agg(func.array_sort(func.collect_list('allcol_struct')).alias('allcol_struct_arr'))

# helper to create the struct schema string, e.g. "y.gk as gk, y.idx as idx, ..."
struct_fields = lambda x: ', '.join([str(x) + '.' + k + ' as ' + k for k in data_sdf.columns])

# use `aggregate` to do the val calc
arr_of_structs_sdf. \
    withColumn('new_allcol_struct_arr',
               func.expr('''
                   aggregate(slice(allcol_struct_arr, 2, size(allcol_struct_arr)),
                             array(struct({0}, (allcol_struct_arr[0].a+allcol_struct_arr[0].b) as val)),
                             (x, y) -> array_union(x,
                                                   array(struct({1}, ((y.a+y.b)-element_at(x, -1).val) as val))
                                                   )
                             )
                   '''.format(struct_fields('allcol_struct_arr[0]'), struct_fields('y'))
               )
               ). \
    selectExpr('inline(new_allcol_struct_arr)'). \
    show(truncate=False)
# +---+---+-------+---------+----+----+----+
# |gk |idx|name |dept |a |b |val |
# +---+---+-------+---------+----+----+----+
# |gk |1 |James |Sales |3000|2500|5500|
# |gk |2 |Michael|Sales |4600|1650|750 |
# |gk |3 |Robert |Sales |4100|1100|4450|
# |gk |4 |Maria |Finance |3000|7000|5550|
# |gk |5 |James |Finance |3000|5000|2450|
# |gk |6 |Scott |Marketing|3300|4300|5150|
# |gk |7 |Jen |Marketing|3900|3700|2450|
# +---+---+-------+---------+----+----+----+

How can I replace the values in one pyspark dataframe column with the values from another column in a sub-section of the dataframe?

I have to perform a group-by and pivot operation on a dataframe's "activity" column, and populate the new columns resulting from the pivot with the sum of the "quantity" column. One of the activity columns, however, has to be populated with the sum of the "cost" column.
Data frame before group-by and pivot:
+----+-----------+-----------+-----------+-----------+
| id | quantity | cost | activity | category |
+----+-----------+-----------+-----------+-----------+
| 1 | 2 | 2 | skiing | outdoor |
| 2 | 0 | 2 | swimming | outdoor |
+----+-----------+-----------+-----------+-----------+
pivot code:
pivotDF = df.groupBy("category").pivot("activity").sum("quantity")
result:
+----+-----------+-----------+-----------+
| id | category | skiing | swimming |
+----+-----------+-----------+-----------+
| 1 | outdoor | 2 | 5 |
| 2 | outdoor | 4 | 7 |
+----+-----------+-----------+-----------+
The problem is that for one of these activities, I need the activity column to be populated with sum("cost") instead of sum("quantity"). I can't seem to find a way to specify this during the pivot operation itself, so I thought maybe I can just exchange the values in the quantity column for the ones in the cost column wherever the activity column value corresponds to the relevant activity. However, I can't find an example of how to do this in a pyspark data frame.
Any help would be much appreciated.
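For reference, the in-place swap described in the question could be sketched with when/otherwise before pivoting (assuming, purely for illustration, that swimming is the activity that should take cost):
from pyspark.sql import functions as F

# replace quantity with cost only where the activity matches, then pivot as before
df_swapped = df.withColumn(
    "quantity",
    F.when(F.col("activity") == "swimming", F.col("cost")).otherwise(F.col("quantity"))
)
pivotDF = df_swapped.groupBy("category").pivot("activity").sum("quantity")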
You can provide more than 1 aggregation after the pivot.
Let's say the input dataframe looks like the following
# +---+---+----+--------+-------+
# | id|qty|cost| act| cat|
# +---+---+----+--------+-------+
# | 1| 2| 2| skiing|outdoor|
# | 2| 0| 2|swimming|outdoor|
# | 3| 1| 2| skiing|outdoor|
# | 4| 2| 4|swimming|outdoor|
# +---+---+----+--------+-------+
Do a pivot and use agg() to provide more than 1 aggregation.
from pyspark.sql import functions as func

data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +---+-------+-----------+----------+-------------+------------+
# | id| cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +---+-------+-----------+----------+-------------+------------+
# | 2|outdoor| null| null| 2| 0|
# | 1|outdoor| 2| 2| null| null|
# | 3|outdoor| 2| 1| null| null|
# | 4|outdoor| null| null| 4| 2|
# +---+-------+-----------+----------+-------------+------------+
Notice the field names: PySpark automatically assigns a suffix based on the alias provided in the aggregations. Use a drop or select to retain the columns you need and rename them as you like (a sketch of this is shown after the next example).
Removing id from the groupBy gives a much more compact result.
data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +-------+-----------+----------+-------------+------------+
# | cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +-------+-----------+----------+-------------+------------+
# |outdoor| 4| 3| 6| 2|
# +-------+-----------+----------+-------------+------------+
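A sketch of the select-and-rename step mentioned above, assuming swimming is the activity that should carry the cost while skiing keeps the quantity:
pivoted = data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty'))

pivoted. \
    select('cat',
           func.col('skiing_qty').alias('skiing'),
           func.col('swimming_cost').alias('swimming')). \
    show()
# +-------+------+--------+
# |    cat|skiing|swimming|
# +-------+------+--------+
# |outdoor|     3|       6|
# +-------+------+--------+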

How can we iterate through a column vertically downwards using PySpark?

For instance, in a dataframe where col1 is the name of a column and it has the values 1, 2, 3 and so on for every row, how do I iterate through the values 10, 20, 30, ... alone?
Well... Bluntly said, in Spark you just don't iterate. You don't deal with rows in Spark. You just learn a new way of thinking and only deal with columns.
E.g., your example:
df = spark.range(101).toDF("col1")
df.show()
# +----+
# |col1|
# +----+
# | 0|
# | 1|
# | 2|
# | 3|
# | 4|
# | 5|
# | 6|
# | 7|
# | 8|
# | 9|
# | 10|
# | 11|
# | ...|
If you want to get only rows where col1 = 10, 20, 30, 40, ... you must see a sequence there. You think about it and create a rule to smart-filter your dataframe:
df = df.filter('col1 % 10 = 0')
df.show()
# +----+
# |col1|
# +----+
# | 0|
# | 10|
# | 20|
# | 30|
# | 40|
# | 50|
# | 60|
# | 70|
# | 80|
# | 90|
# | 100|
# +----+
Row order is not deterministic in Spark; any operation may change it. Sorting is available, but it's costly and impractical, as the next operation will ruin the order. When you sort, you pull everything onto one machine (only when the data is on one node can you, at least temporarily, preserve the order; normally the data is split across many machines and none of them is "first" or "second"). In distributed computing, data should stay distributed as much as possible.
That said, iterating may rarely be needed. There's df.collect(), which (same as sorting) collects all rows into one list on one machine (the driver, usually the weakest machine). This operation is to be avoided, because it distorts the nature of distributed computing, but in rare cases it is used. Iterating over rows is the exception: almost any data operation is possible without iterating. You just search the web, think, and learn new ways of doing things.
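A tiny illustration of that column-oriented mindset, reusing the range dataframe from above: the per-row computation is expressed as a column expression instead of a Python loop over collected rows.
from pyspark.sql import functions as F

df = spark.range(101).toDF("col1")

# to be avoided: collect to the driver and loop in Python
# squares = [row.col1 * row.col1 for row in df.collect()]

# column-oriented equivalent, stays distributed
df = df.withColumn("col1_squared", F.col("col1") * F.col("col1"))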

PySpark: how to compare row by row based on hash from two data frames and group the result

I have the below two data frames, with a hash added as an additional column to identify differences for the same id across both data frames.
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
# identify unmatched rows for the same id from both data frames
df1_un_match_indf2 = df1.join(df2, df1.hash == df2.hash, "leftanti")
df2_un_match_indf1 = df2.join(df1, df2.hash == df1.hash, "leftanti")
# The above lists rows from both data frames, since all hashes for the same id are different
Now I am trying to find the difference in row values for the same id from the df1_un_match_indf2 and df2_un_match_indf1 data frames, so that it shows the differences row by row.
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.join(df4, df3.id == df4.id, "inner")
common_diff.show()
but the result shows the differences like this:
+--------+----------+-----+----+-----+-----------+-------+---+---+----+
|name |department|state|id |hash |name | department|state| id|hash
+--------+----------+-----+----+-----+-----+-----------+-----+---+-----+
|James |Sales |NY |101 | c123|James| Sales1 |null |101| 4df2
|Maria |Finance |CA |102 | d234|Maria| Finance | |102| 5rfg
|Jen |Marketing |NY |103 | df34|Jen | |NY2 |103| 2f34
What I am expecting is:
+-----------------------------------------------------------+-----+--------------+
|name | department | state | id | hash
['James','James']|['Sales','Sales'] |['NY',null] |['101','101']|['c123','4df2']
['Maria','Maria']|['Finance','Finance']|['CA',''] |['102','102']|['d234','5rfg']
['Jen','Jen'] |['Marketing',''] |['NY','NY2']|['102','103']|['df34','2f34']
I tried different ways, but couldn't find the right solution to produce this expected format.
Can anyone give a solution or an idea for this?
Thanks
What you want to use is likely collect_list or maybe collect_set.
Here is an example of how they work:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F

sc = SparkContext("local")
sqlContext = HiveContext(sc)

df = sqlContext.createDataFrame([
    ("a", None, None),
    ("a", "code1", None),
    ("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
    .groupby("id")
    .agg(F.collect_set("code"),
         F.collect_list("name"))
    .show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.union(df4)

(common_diff
    .groupby("id")
    .agg(F.collect_set("name"),
         F.collect_list("department"))
    .show())
If you can't do a union, just use an array:
from pyspark.sql.functions import array

common_diff.select(
    common_diff.id,
    array(
        common_diff.thisState,
        common_diff.thatState
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept
    ).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming columns and using the groupby is likely cleaner and clearer.
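A sketch of that cleaner route, extending the union + groupby above to all of the question's columns (the aliases are just for readability):
from pyspark.sql import functions as F

common_diff = df3.union(df4)

(common_diff
    .groupby("id")
    .agg(F.collect_list("name").alias("name"),
         F.collect_list("department").alias("department"),
         F.collect_list("state").alias("state"),
         F.collect_list("hash").alias("hash"))
    .show(truncate=False))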