How to do intersection of list columns with PySpark dataframe columns?

I have the PySpark dataframe below and I need to create a new column (new_col) containing the items that are common to columns X and Y, excluding the items in Z.
df
id  X              Y                    Z     new_col
1   [12,23,1,24]   [13,412,12,23,24]    [12]  [23,24]
2   [1,2,3]        [2,4,5,6]            []    [2]

If your schema is the following:
df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- X: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Y: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Z: array (nullable = true)
# | |-- element: long (containsNull = true)
and your PySpark version is 2.4+, you can use array_intersect and array_except:
from pyspark.sql.functions import array_except, array_intersect
df=df.withColumn("new_col", array_except(array_intersect("X", "Y"), "Z"))
df.show()
#+---+---------------+---------------------+----+--------+
#|id |X |Y |Z |new_col |
#+---+---------------+---------------------+----+--------+
#|1 |[12, 23, 1, 24]|[13, 412, 12, 23, 24]|[12]|[23, 24]|
#|2 |[1, 2, 3] |[2, 4, 5, 6] |[] |[2] |
#+---+---------------+---------------------+----+--------+

You can use withColumn + a UDF:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

# keep the elements of list1 that also appear in list2
def intersection_function(list1, list2):
    intersection_list = [value for value in list1 if value in list2]
    return intersection_list

udf_intersection = F.udf(intersection_function, ArrayType(IntegerType()))
newdf = df.withColumn("new_col", udf_intersection(df["ListColumn1"], df["ListColumn2"]))
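The UDF above only intersects two columns; to also exclude the items in Z, as the question asks, the same idea extends to a three-argument UDF. This is a minimal sketch (the names intersect_except and udf_intersect_except are mine), assuming the columns are named X, Y and Z as in the question and hold longs:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, LongType

# keep values present in both X and Y but not in Z; treat a null Z as empty
def intersect_except(x, y, z):
    z = z or []
    return [v for v in x if v in y and v not in z]

udf_intersect_except = F.udf(intersect_except, ArrayType(LongType()))
df = df.withColumn("new_col", udf_intersect_except("X", "Y", "Z"))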

Related

Spark - Write a DF with Struct Schema as CSV

I have a Spark DataFrame that I created by reading an EBCDIC data file. Its schema is as follows
scala> myDF.printSchema
root
|-- DF_RECORD: struct (nullable = true)
| |-- ID: string (nullable = true)
| |-- BALANCE_AMT: decimal(15,4) (nullable = true)
I can even get it to show its rows. But when I try to write it as CSV, I get the following error
scala> myDF.write.format("csv").mode("overwrite").option("header", "true").save("/path")
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<ID:string,BALANCE_AMT:decimal(15,4)> data type.;
Is there a column conversion I might have to do before writing?
Spark version 2.4.0-cdh6.2.1
Scala version 2.11.12
Add a selectExpr before writing
scala> myDF.selectExpr("DF_RECORD.*").write.format("csv").mode("overwrite").option("header", "true").save("/path")
Convert it to individual columns by accessing the column names in withColumn:
from pyspark.sql.types import StructType, StructField, StringType, Row, DoubleType
import pyspark.sql.functions as F

schema = StructType([
    StructField("DF_RECORD",
        StructType([
            StructField("ID", StringType()),
            StructField("BALANCE_AMT", DoubleType())
        ])
    )
])

df = spark.createDataFrame([Row(Row("1", 1000.0))], schema=schema)
df.printSchema()
[Out]:
root
|-- DF_RECORD: struct (nullable = true)
| |-- ID: string (nullable = true)
| |-- BALANCE_AMT: double (nullable = true)
df_csv = df \
    .withColumn("ID", F.col("DF_RECORD")["ID"]) \
    .withColumn("BALANCE_AMT", F.col("DF_RECORD")["BALANCE_AMT"]) \
    .drop("DF_RECORD")

df_csv.show()
[Out]:
+---+-----------+
| ID|BALANCE_AMT|
+---+-----------+
| 1| 1000.0|
+---+-----------+
Now, export it as CSV:
df_csv.write.format("csv").mode("overwrite").option("header", "true").save("/content/sample_data/test1.csv")
!cat "/content/sample_data/test1.csv/part-00000-b342e07a-6d41-40b5-afa2-39eeef3b70a2-c000.csv"
[Out]:
ID,BALANCE_AMT
1,1000.0

Struct data type when creating dataframe with createDataFrame in Scala

In PySpark, we can create a struct data type when using createDataFrame, as in the following example where ("b", "c") and ("e", "f") become structs:
df = spark.createDataFrame([
    ["a", ("b", "c")],
    ["d", ("e", "f")]
])
df.printSchema()
# root
# |-- _1: string (nullable = true)
# |-- _2: struct (nullable = true)
# | |-- _1: string (nullable = true)
# | |-- _2: string (nullable = true)
df.show()
# +---+------+
# | _1| _2|
# +---+------+
# | a|{b, c}|
# | d|{e, f}|
# +---+------+
Is there a similar way in Scala - to create struct schema inside createDataFrame, without using org.apache.spark.sql.functions?
For your specific example, you can use tuples and call this flavor of createDataFrame.
val df = spark.createDataFrame(Seq(
  ("a", ("b", "c")),
  ("d", ("e", "f"))
))
df.printSchema()
/*
root
|-- _1: string (nullable = true)
|-- _2: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
*/
df.show()
/*
+---+------+
| _1| _2|
+---+------+
| a|[b, c]|
| d|[e, f]|
+---+------+
*/
Instead of ("b", "c") one can also use "b" -> "c" to create a tuple of length 2.
Preferred method
Tuples can become difficult to manage when dealing with many fields, especially nested fields. Likely, you'll want to model your data using case classes. This also allows you to specify struct field names and types.
case class Person(name: String, age: Int)
case class Car(manufacturer: String, model: String, mileage: Double, owner: Person)
val df = spark.createDataFrame(Seq(
  Car("Toyota", "Camry", 81400.8, Person("John", 37)),
  Car("Honda", "Accord", 152090.2, Person("Jane", 25))
))
df.printSchema()
/*
root
|-- manufacturer: string (nullable = true)
|-- model: string (nullable = true)
|-- mileage: double (nullable = false)
|-- owner: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- age: integer (nullable = false)
*/
df.show()
/*
+------------+------+--------+----------+
|manufacturer| model| mileage| owner|
+------------+------+--------+----------+
| Toyota| Camry| 81400.8|[John, 37]|
| Honda|Accord|152090.2|[Jane, 25]|
+------------+------+--------+----------+
*/

Transposing spark dataframe based on a subcolumn

I have a spark dataframe which looks like this:
root
|-- 0000154d-7585-5eb283ff985c: struct (nullable = true)
| |-- collaborative_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- content_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- curated_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- discovery_score: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- original_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- recipe_id: array (nullable = true)
| | |-- element: long (containsNull = true)
|-- 00005426-2675-68085cd359c7: struct (nullable = true)
| |-- collaborative_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- content_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- curated_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- discovery_score: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- original_rank: array (nullable = true)
| | |-- element: long (containsNull = true)
| |-- recipe_id: array (nullable = true)
| | |-- element: long (containsNull = true)
Each column is a user id, e.g. 0000154d-7585-5eb283ff985c, and each row is made up of 15,000 users (the rows come from JSON files that each contain 15,000 users).
I want to transpose it such that each user id is a row, and each sub-column collaborative_rank, content_rank, curated_rank, discovery_score, original_rank and recipe_id becomes a column, with the array as its value. I'm new to Spark; is there any painless way to do this?
Edit:
For reference, an input .json file I'm reading from looks like this:
{"0000154d-7585-4096-a71a-5eb283ff985c": {"recipe_id": [1, 2, 3], "collaborative_rank": [1, 2, 3], "curated_rank": [1, 2, 3], "discovery_score": [1]}, "00005426-2675-4940-8394-e8085cd359c7": {"recipe_id": [] ... }
etc.
If you don't want to convert it to an RDD and apply a UDF, you can consider stacking the dataframe:
from pyspark.sql.functions import expr

df = spark.read.json(r'C:\stackoverflow\samples\inp.json')

stack_characteristics = str(len(df.columns)) + ',' + ','.join([f"'{v}',`{v}`" for v in df.columns])

df.select(expr(f'''stack({stack_characteristics})''').alias('userId', 'vals')) \
  .select('userId', 'vals.*').show()
+--------------------+------------------+------------+---------------+---------+
| userId|collaborative_rank|curated_rank|discovery_score|recipe_id|
+--------------------+------------------+------------+---------------+---------+
|0000154d-7585-409...| [1, 2, 3]| [1, 2, 3]| [1]|[1, 2, 3]|
|00005426-2675-494...| [1, 2, 3]| [1, 2, 3]| [1]|[1, 2, 3]|
+--------------------+------------------+------------+---------------+---------+
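For illustration, here is a tiny self-contained sketch with two hypothetical column names (u1, u2, not the real user ids) showing what the generated stack expression does:
from pyspark.sql.functions import expr

toy = spark.createDataFrame([((1,), (2,))], "u1 struct<x:bigint>, u2 struct<x:bigint>")
# stack(n, label1, value1, label2, value2, ...) emits one row per original column
toy.select(expr("stack(2, 'u1', `u1`, 'u2', `u2`)").alias('userId', 'vals')) \
   .select('userId', 'vals.*').show()
# +------+---+
# |userId|  x|
# +------+---+
# |    u1|  1|
# |    u2|  2|
# +------+---+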
AFAIK, the code below may solve your problem.
Input JSON considered:
{"0000154d-7585-4096-a71a-5eb283ff985c": {"recipe_id": [1, 2, 3], "collaborative_rank": [1, 2, 3], "curated_rank": [1, 2, 3], "discovery_score": [1] }}
from pyspark.sql import Row

#read an input data
df = spark.read.json("/home/sathya/Desktop/stackoverflo/input.json")

#method to extract keys to columns
def extract_json(row):
    out_array = []
    data_dict = row.asDict()
    for k in data_dict.keys():
        out_array.append(Row(k, data_dict[k][0], data_dict[k][1], data_dict[k][2], data_dict[k][3]))
    return Row(*out_array)

#flatmap columns and extracting the data
rdd = df.rdd.flatMap(extract_json)

#df creation
df1 = spark.createDataFrame(rdd)
df1.selectExpr("_1 as user_id", "_2 as recipe_id", "_3 as collaborative_rank", "_4 as curated_rank", "_5 as discovery_score").show(truncate=False)
/*
+------------------------------------+---------+------------------+------------+---------------+
|user_id |recipe_id|collaborative_rank|curated_rank|discovery_score|
+------------------------------------+---------+------------------+------------+---------------+
|0000154d-7585-4096-a71a-5eb283ff985c|[1, 2, 3]|[1, 2, 3] |[1] |[1, 2, 3] |
+------------------------------------+---------+------------------+------------+---------------+
*/

How do I add a column to a nested struct in a PySpark dataframe?

I have a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
I'd like to add columns within the state struct, that is, create a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
| |-- a: integer (nullable = true)
I tried
df.withColumn('state.a', val).printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
# |-- state.a: integer (nullable = true)
Here is a way to do it without using a udf:
# create example dataframe
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, IntegerType

data = [
    ({'fld': 0},)
]

schema = StructType(
    [
        StructField('state',
            StructType(
                [StructField('fld', IntegerType())]
            )
        )
    ]
)

df = sqlCtx.createDataFrame(data, schema)
df.printSchema()
#root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
Now use withColumn() and add the new field using lit() and alias().
val = 1
df_new = df.withColumn(
    'state',
    f.struct(*[f.col('state')['fld'].alias('fld'), f.lit(val).alias('a')])
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
If you have a lot of fields in the nested struct you can use a list comprehension, using df.schema["state"].dataType.names to get the field names. For example:
val = 1
s_fields = df.schema["state"].dataType.names # ['fld']
df_new = df.withColumn(
    'state',
    f.struct(*([f.col('state')[c].alias(c) for c in s_fields] + [f.lit(val).alias('a')]))
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
References
I found a way to get the field names from the Struct without naming them manually from this answer.
Use a transformation such as the following:
import pyspark.sql.functions as f
df = df.withColumn(
    "state",
    f.struct(
        f.col("state.*"),
        f.lit(123).alias("a")
    )
)
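For reference, the resulting schema should match what the other answers on this question produce (the rebuilt struct becomes non-nullable):
df.printSchema()
# root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)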
Although this is a late answer, the following is supported for PySpark version 2.x.x.
Assuming dfOld already contains state and fld as asked in the question:
from pyspark.sql.functions import col, lit, struct

dfOld = dfOld.withColumn("a", lit("value"))
dfNew = dfOld.select("level1Field1", "level1Field2", struct(col("state.fld").alias("fld"), col("a")).alias("state"))
Reference: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803
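Put together as a runnable sketch (the extra top-level column other is a hypothetical stand-in for level1Field1/level1Field2 above):
from pyspark.sql.functions import col, lit, struct

dfOld = spark.createDataFrame([((1,), "x")], "state struct<fld:int>, other string")
dfOld = dfOld.withColumn("a", lit("value"))
dfNew = dfOld.select("other", struct(col("state.fld").alias("fld"), col("a")).alias("state"))
dfNew.printSchema()
# root
# |-- other: string (nullable = true)
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: string (nullable = false)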
Here's a way to do it without a udf.
Initialize example dataframe:
nested_df1 = (spark.read.json(sc.parallelize(["""[
{ "state": {"fld": 1} },
{ "state": {"fld": 2}}
]"""])))
nested_df1.printSchema()
root
|-- state: struct (nullable = true)
| |-- fld: long (nullable = true)
Spark .read.json imports all integers as long by default.
If state.fld has to be an int, you will need to cast it.
from pyspark.sql import functions as F
nested_df1 = (nested_df1
    .select(F.struct(F.col("state.fld").cast('int').alias("fld")).alias("state")))
nested_df1.printSchema()
root
|-- state: struct (nullable = false)
| |-- fld: integer (nullable = true)
nested_df1.show()
+-----+
|state|
+-----+
| [1]|
| [2]|
+-----+
Finally
Use .select to get the nested columns you want from the existing struct with the "parent.child" notation, create the new column, then re-wrap the old columns together with the new columns in a struct.
val_a = 3
nested_df2 = (nested_df1
    .select(
        F.struct(
            F.col("state.fld"),
            F.lit(val_a).alias("a")
        ).alias("state")
    )
)
nested_df2.printSchema()
root
|-- state: struct (nullable = false)
| |-- fld: integer (nullable = true)
| |-- a: integer (nullable = false)
nested_df2.show()
+------+
| state|
+------+
|[1, 3]|
|[2, 3]|
+------+
Flatten if needed with "parent.*".
nested_df2.select("state.*").printSchema()
root
|-- fld: integer (nullable = true)
|-- a: integer (nullable = false)
nested_df2.select("state.*").show()
+---+---+
|fld| a|
+---+---+
| 1| 3|
| 2| 3|
+---+---+
Spark 3.1+
F.col('state').withField('a', F.lit(1))
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([((1,),)], 'state:struct<fld:int>')
df.printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
df = df.withColumn('state', F.col('state').withField('a', F.lit(1)))
df.printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
You can use the struct function
import pyspark.sql.functions as f
df = df.withColumn(
"state",
f.struct(
f.col("state.fld").alias("fld"),
f.lit(1).alias("a")
)
)
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Adds a null field of type dt at the dotted path nfield, rebuilding each
# enclosing struct on the way back up.
def add_field_in_dataframe(nfield, df, dt):
    fields = nfield.split(".")
    print(fields)
    n = len(fields)
    addField = fields[0]
    if n == 1:
        return df.withColumn(addField, lit(None).cast(dt))

    nestedField = ".".join(fields[:-1])
    sfields = df.select(nestedField).schema[fields[-2]].dataType.names
    print(sfields)
    ac = col(nestedField)
    if n == 2:
        nc = struct(*([ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])]))
    else:
        nc = struct(*([ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])])).alias(fields[-2])
    print(nc)

    n = n - 1
    while n > 1:
        print("n: ", n)
        fields = fields[:-1]
        print("fields: ", fields)
        nestedField = ".".join(fields[:-1])
        print("nestedField: ", nestedField)
        sfields = df.select(nestedField).schema[fields[-2]].dataType.names
        print(fields[-1])
        print("sfields: ", sfields)
        sfields = [s for s in sfields if s != fields[-1]]
        print("sfields: ", sfields)
        ac = col(".".join(fields[:-1]))
        if n > 2:
            print(fields[-2])
            nc = struct(*([ac[c].alias(c) for c in sfields] + [nc])).alias(fields[-2])
        else:
            nc = struct(*([ac[c].alias(c) for c in sfields] + [nc]))
        n = n - 1
    return df.withColumn(addField, nc)
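A minimal usage sketch, assuming a df with a state.fld struct as created in the earlier answers: add a nullable integer field a inside state.
df_with_a = add_field_in_dataframe("state.a", df, IntegerType())
df_with_a.printSchema()
# root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = true)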

Dropping nested column of Dataframe with PySpark

I'm trying to drop some nested columns from structs in a Spark dataframe using PySpark.
I found this for Scala that seems to be doing exactly what I want to, but I'm not familiar with Scala and don't know how to write it in Python.
https://stackoverflow.com/a/39943812/5706548
Example for pyspark:
from pyspark.sql.functions import struct

def drop_col(df, struct_nm, delete_struct_child_col_nm):
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm, df.select("{}.*".format(struct_nm)).columns)
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    return df.withColumn(struct_nm, struct(fields_to_keep))
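A minimal usage sketch on a toy schema (the employee/name/age fields are hypothetical): drop age from the employee struct.
df = spark.createDataFrame([(("John", 30),)], "employee struct<name:string,age:int>")
df = drop_col(df, "employee", "age")
df.printSchema()
# root
# |-- employee: struct (nullable = false)
# | |-- name: string (nullable = true)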
A method I found using PySpark is to first convert the nested column to JSON, and then parse the converted JSON with a new nested schema that has the unwanted columns filtered out.
Suppose I have the following schema and I want to drop d, e and j (a.b.d, a.e, a.h.j) from the dataframe:
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
| | |-- d: string (nullable = true)
| |-- e: struct (nullable = true)
| | |-- f: long (nullable = true)
| | |-- g: string (nullable = true)
| |-- h: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- i: string (nullable = true)
| | | |-- j: string (nullable = true)
|-- k: string (nullable = true)
I used the following approach:
Create a new schema for a by excluding d, e and j. A quick way to do this is to manually select the fields you want from df.select("a").schema and create a new schema from the selected fields using StructType. Or, you can do it programmatically by traversing the schema tree and excluding the unwanted fields, something like:
from pyspark.sql.types import StructType, StructField, ArrayType

def exclude_nested_field(schema, unwanted_fields, parent=""):
    new_schema = []
    for field in schema:
        full_field_name = field.name
        if parent:
            full_field_name = parent + "." + full_field_name
        if full_field_name not in unwanted_fields:
            if isinstance(field.dataType, StructType):
                inner_schema = exclude_nested_field(field.dataType, unwanted_fields, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            elif isinstance(field.dataType, ArrayType):
                new_schema.append(StructField(field.name, ArrayType(field.dataType.elementType)))
            else:
                new_schema.append(StructField(field.name, field.dataType))
    return StructType(new_schema)

new_schema = exclude_nested_field(df.schema["a"].dataType, ["b.d", "e", "h.j"])
Convert the a column to JSON: .withColumn("json", F.to_json("a")).drop("a")
Parse the JSON-converted a column from step 2 with the new schema found in step 1: .withColumn("a", F.from_json("json", new_schema)).drop("json")
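Put together, steps 2 and 3 look like this (a sketch, assuming the df and the new_schema from step 1 above):
from pyspark.sql import functions as F

df = (df
    .withColumn("json", F.to_json("a")).drop("a")       # step 2
    .withColumn("a", F.from_json("json", new_schema))   # step 3
    .drop("json"))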
We can now do it natively with Spark version >= 3.1
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.Column.dropFields.html
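A minimal sketch of what that looks like on the question's schema (Spark 3.1+; a fuller walk-through of the nested and array cases follows further down):
from pyspark.sql import functions as F

# drop the top-level struct field a.e with dropFields (Spark 3.1+)
df = df.withColumn("a", F.col("a").dropFields("e"))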
Although I have no solution for PySpark, maybe it's easy to translate this into Python. Consider a dataframe df with the schema:
root
|-- employee: struct (nullable = false)
| |-- name: string (nullable = false)
| |-- age: integer (nullable = false)
Then if you want to drop, e.g., name, you can do:
val fieldsToKeep = df.select($"employee.*").columns
  .filter(_ != "name") // the nested column you want to drop
  .map(n => "employee." + n)

// overwrite column with subset of fields
df
  .withColumn("employee", struct(fieldsToKeep.head, fieldsToKeep.tail: _*))
Given the dataframe below, the aim is to drop d, e and j.
from pyspark.sql import functions as F
df = spark.createDataFrame([], "a struct<b:struct<c:bigint,d:string>,e:struct<f:bigint,g:string>,h:array<struct<i:string,j:string>>>, k string")
df.printSchema()
# root
# |-- a: struct (nullable = true)
# | |-- b: struct (nullable = true)
# | | |-- c: long (nullable = true)
# | | |-- d: string (nullable = true) # <<--- to be dropped
# | |-- e: struct (nullable = true) # <<--- to be dropped
# | | |-- f: long (nullable = true)
# | | |-- g: string (nullable = true)
# | |-- h: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- i: string (nullable = true)
# | | | |-- j: string (nullable = true) # <<--- to be dropped
# |-- k: string (nullable = true)
e is the easiest:
df = df.withColumn("a", F.col("a").dropFields("e"))
df.printSchema()
# root
# |-- a: struct (nullable = true)
# | |-- b: struct (nullable = true)
# | | |-- c: long (nullable = true)
# | | |-- d: string (nullable = true)
# | |-- h: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- i: string (nullable = true)
# | | | |-- j: string (nullable = true)
# |-- k: string (nullable = true)
In order to drop d, we must go inside b:
df = df.withColumn("a", F.col("a").withField("b", F.col("a.b").dropFields("d")))
df.printSchema()
# root
# |-- a: struct (nullable = true)
# | |-- b: struct (nullable = true)
# | | |-- c: long (nullable = true)
# | |-- h: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- i: string (nullable = true)
# | | | |-- j: string (nullable = true)
# |-- k: string (nullable = true)
j is inside an array, so transform must also be used. It "loops" through every element of the array (here each element is a struct) and transforms it (removes a field).
df = df.withColumn("a", F.col("a").withField(
"h",
F.transform(
F.col("a.h"),
lambda x: x.dropFields("j")
)
))
df.printSchema()
# root
# |-- a: struct (nullable = true)
# | |-- b: struct (nullable = true)
# | | |-- c: long (nullable = true)
# | |-- h: array (nullable = true)
# | | |-- element: struct (containsNull = true)
# | | | |-- i: string (nullable = true)
# |-- k: string (nullable = true)
PySpark version of Raphael's Scala answer.
This runs at a certain depth, discards everything above that depth and filters on the level below it.
def remove_columns(df, root):
    from pyspark.sql.functions import col
    cols = df.select(root).columns
    fields_filter = filter(lambda x: x[0] != "$", cols)  # use your own lambda here.
    fieldsToKeep = list(map(lambda x: root[:-1] + x, fields_filter))
    return df.select(fieldsToKeep)
df = remove_columns(raw_df, root="level1.level2.*")