Spark - Write a DF with Struct Schema as CSV

I have a Spark DataFrame that I created by reading an EBCDIC data file. Its schema is as follows
scala> myDF.printSchema
root
|-- DF_RECORD: struct (nullable = true)
| |-- ID: string (nullable = true)
| |-- BALANCE_AMT: decimal(15,4) (nullable = true)
I can even get it to show its rows. But when I try to write it as CSV, I get the following error
scala> myDF.write.format("csv").mode("overwrite").option("header", "true").save("/path")
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<ID:string,BALANCE_AMT:decimal(15,4)> data type.;
Is there a column conversion I might have to do before writing?
Spark version 2.4.0-cdh6.2.1
Scala version 2.11.12

Add a selectExpr before writing
scala> myDF.selectExpr("DF_RECORD.*").write.format("csv").mode("overwrite").option("header", "true").save("/path")
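The same idea in PySpark, as a minimal sketch (assuming the same DF_RECORD struct):
# PySpark sketch: expand the struct's fields into top-level columns before writing
flat_df = myDF.select("DF_RECORD.*")   # yields ID and BALANCE_AMT as plain columns
flat_df.write.format("csv").mode("overwrite").option("header", "true").save("/path")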

Convert it to individual columns by accessing the column names in withColumn:
from pyspark.sql.types import StructType, StructField, StringType, Row, DoubleType
import pyspark.sql.functions as F

schema = StructType([
    StructField("DF_RECORD",
        StructType([
            StructField("ID", StringType()),
            StructField("BALANCE_AMT", DoubleType())
        ])
    )
])

df = spark.createDataFrame([Row(Row("1", 1000.0))], schema=schema)
df.printSchema()
[Out]:
root
|-- DF_RECORD: struct (nullable = true)
| |-- ID: string (nullable = true)
| |-- BALANCE_AMT: double (nullable = true)
df_csv = df \
    .withColumn("ID", F.col("DF_RECORD")["ID"]) \
    .withColumn("BALANCE_AMT", F.col("DF_RECORD")["BALANCE_AMT"]) \
    .drop("DF_RECORD")
df_csv.show()
[Out]:
+---+-----------+
| ID|BALANCE_AMT|
+---+-----------+
| 1| 1000.0|
+---+-----------+
Now, export it as CSV:
df_csv.write.format("csv").mode("overwrite").option("header", "true").save("/content/sample_data/test1.csv")
!cat "/content/sample_data/test1.csv/part-00000-b342e07a-6d41-40b5-afa2-39eeef3b70a2-c000.csv"
[Out]:
ID, BALANCE_AMT
1, 1000.0
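If the struct had many more fields, listing each withColumn gets tedious; here is a minimal sketch (assuming the same DF_RECORD struct, writing to an illustrative path) that flattens them programmatically using the field names from the schema:
import pyspark.sql.functions as F

# Read the nested field names from the schema and promote each one to a top-level column
nested_cols = df.schema["DF_RECORD"].dataType.names   # ['ID', 'BALANCE_AMT']
df_csv = df.select([F.col("DF_RECORD")[c].alias(c) for c in nested_cols])
df_csv.write.format("csv").mode("overwrite").option("header", "true").save("/content/sample_data/test2.csv")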

Related

filter string with quotes in Spark dataframe column

I have a DF with this data:
+--------+------------------------------------------+
|recType |value |
+--------+------------------------------------------+
|{"id": 1|{"id": 1, "user_id": 100, "price": 50} |
...
I can filter recType with contains, but how do I do it with === and quotes? I get an error every time.
I understand that these columns are strings. If so, the from_json
function can parse them into structs.
import org.apache.spark.sql.types.{StructField, StructType, IntegerType}
import org.apache.spark.sql.functions.from_json

val recTypeSchema = StructType(Array(
  StructField("id", IntegerType, true)
))

val valueSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("user_id", IntegerType, true),
  StructField("price", IntegerType, true)
))

val parsedDf = df
  .withColumn("recType", from_json($"recType", recTypeSchema))
  .withColumn("value", from_json($"value", valueSchema))
parsedDf.printSchema
root
|-- recType: struct (nullable = true)
| |-- id: integer (nullable = true)
|-- value: struct (nullable = true)
| |-- id: integer (nullable = true)
| |-- user_id: integer (nullable = true)
| |-- price: integer (nullable = true)
parsedDf.filter($"recType.id" === 1).show
+-------+------------+
|recType| value|
+-------+------------+
| {1}|{1, 100, 50}|
+-------+------------+
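For PySpark users, a rough equivalent of the same from_json-then-filter approach (a sketch, assuming the same two string columns):
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType

rec_type_schema = StructType([StructField("id", IntegerType(), True)])
value_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("user_id", IntegerType(), True),
    StructField("price", IntegerType(), True)
])

parsed_df = (df
    .withColumn("recType", F.from_json("recType", rec_type_schema))
    .withColumn("value", F.from_json("value", value_schema)))

# After parsing, the nested field can be compared directly, no quote escaping needed
parsed_df.filter(F.col("recType.id") == 1).show()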

how to do intersection of list columns with pyspark dataframe columns?

I have the below pyspark dataframe and I need to create a new column (new_col) containing the items common to columns X and Y, excluding the items in Z.
df
id  X              Y                    Z     new_col
1   [12,23,1,24]   [13,412,12,23,24]    [12]  [23,24]
2   [1,2,3]        [2,4,5,6]            []    [2]
If your schema is the following:
df.printSchema()
#root
# |-- id: long (nullable = true)
# |-- X: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Y: array (nullable = true)
# | |-- element: long (containsNull = true)
# |-- Z: array (nullable = true)
# | |-- element: long (containsNull = true)
and your pyspark version is 2.4+, you can use array_intersect and array_except:
from pyspark.sql.functions import array_except, array_intersect
df=df.withColumn("new_col", array_except(array_intersect("X", "Y"), "Z"))
df.show()
#+---+---------------+---------------------+----+--------+
#|id |X |Y |Z |new_col |
#+---+---------------+---------------------+----+--------+
#|1 |[12, 23, 1, 24]|[13, 412, 12, 23, 24]|[12]|[23, 24]|
#|2 |[1, 2, 3] |[2, 4, 5, 6] |[] |[2] |
#+---+---------------+---------------------+----+--------+
You can use withColumn + udf:
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

def intersection_function(list1, list2):
    intersection_list = [value for value in list1 if value in list2]
    return intersection_list

udf_intersection = F.udf(intersection_function, ArrayType(IntegerType()))
newdf = df.withColumn("new_col", udf_intersection(df["ListColumn1"], df["ListColumn2"]))
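Note the UDF above only intersects two columns; to also drop the items in Z, as the question asks, a hedged sketch with a three-argument UDF (column names X, Y, Z taken from the question) could look like:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, LongType

def intersect_except(x, y, z):
    # items present in both x and y but not in z
    return [v for v in x if v in y and v not in z]

udf_intersect_except = F.udf(intersect_except, ArrayType(LongType()))
newdf = df.withColumn("new_col", udf_intersect_except(df["X"], df["Y"], df["Z"]))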

Schema Conversion from String datatype to Array(Map(Array)) datatype in Pyspark

We are reading data from DynamoDB, so we get the data as a string, but we want to write it out as array(map(array)).
String data:
{"policy_details":[{"cdhid":" 123","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}, {"cdhid":" 1234","p2cid":" NA", "roleDesc":" NA","positionnum":"NA"}]}
Output required:
The string data type needs to be converted to ARRAY(MAP(ARRAY)).
We have tried the below schema:
ArrayType([
    StructField("policy_num", MapType(ArrayType([
        StructField("cdhid", StringType(), True),
        StructField("role_id", StringType(), True),
        StructField("role_desc", StringType(), True)
    ])))
])
and are getting the below issue:
elementType [StructField(cdhid,StringType,true),
StructField(role_id,StringType,true),
StructField(role_desc,StringType,true)] should be an instance of < class 'pyspark.sql.types.DataType' >
Regarding your data, the schema you want is not the one that fits.
The schema of your data is:
from pyspark.sql import types as T

schm = T.StructType([T.StructField("policy_details", T.ArrayType(T.StructType([
    T.StructField("cdhid", T.StringType(), True),
    T.StructField("p2cid", T.StringType(), True),
    T.StructField("roleDesc", T.StringType(), True),
    T.StructField("positionnum", T.StringType(), True),
])), True)])
Then, you just need to use the from_json function.
from pyspark.sql import functions as F
df.show()
+--------------------+
| db_data|
+--------------------+
|{"policy_details"...|
+--------------------+
new_df = df.select(F.from_json("db_data", schm).alias("data"))
new_df.printSchema()
root
|-- data: struct (nullable = true)
| |-- policy_details: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- cdhid: string (nullable = true)
| | | |-- p2cid: string (nullable = true)
| | | |-- roleDesc: string (nullable = true)
| | | |-- positionnum: string (nullable = true)
EDIT: If you want to use MapType, you can replace the schema with:
schm = T.StructType([
    T.StructField(
        "policy_details",
        T.ArrayType(T.MapType(
            T.StringType(),
            T.StringType()
        )),
        True
    )
])
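This schema plugs into the same from_json call as above; a short sketch of how it would be applied:
from pyspark.sql import functions as F

# Same parsing step as before, now producing an array of string-to-string maps
new_df = df.select(F.from_json("db_data", schm).alias("data"))
new_df.printSchema()   # policy_details is now array<map<string,string>>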

reading csv from pyspark specifying schema wrong types

I am trying to output a csv from a pyspark df and then re-read it, but when I specify a schema for a column that is an array, it says that some of the rows are False.
Here is my df
avg(rating) belongs_to_collection budget \
0 2.909946 False 5000000
1 3.291962 False 18000000
2 3.239811 False 8000000
3 3.573318 False 1500000
4 3.516590 False 40000000
genres original_language
0 ['Drama', 'Romance'] en
1 ['Comedy'] en
2 ['Drama', 'Family'] en
3 ['Crime', 'Drama', 'Mystery', 'Thriller'] en
4 ['Crime', 'Drama', 'Thriller'] en
I first output to csv: df.drop('id').toPandas().to_csv('mergedDf.csv',index=False)
I tried reading using df = spark.read.csv('mergedDf.csv',schema=schema), but I get this error: 'CSV data source does not support array<string> data type.;'
So, I tried reading from pandas and then converting to spark df, but it tells me that the column that contains a list has a boolean value.
df = pd.read_csv('mergedDf.csv')
df = spark.createDataFrame(df,schema=schema)
TypeError: field genres: ArrayType(StringType,true) can not accept object False in type <class 'bool'>
However, when I check if some of the rows are == to False, I find that none of them are.
I checked :
df[df['genres']=="False"] and df[df['genres']==False]
Unfortunately, Spark's read csv function doesn't yet support complex datatypes like "array". You would have to handle the logic of casting the string column into an array column yourself.
Use pandas to write the spark dataframe as csv with header.
df.drop('id').toPandas().to_csv('mergedDf.csv',index=False,header=True)
df1 = spark.read.option('header','true').option("inferSchema","true").csv('mergedDf.csv')
df1.printSchema()
df1.show(10,False)
When you read the csv back with Spark, the array column will be converted to string type:
root
|-- avg(rating): double (nullable = true)
|-- belongs_to_collection: boolean (nullable = true)
|-- budget: integer (nullable = true)
|-- genres: string (nullable = true)
|-- original_language: string (nullable = true)
+-----------+---------------------+--------+-----------------------------------------+-----------------+
|avg(rating)|belongs_to_collection|budget |genres |original_language|
+-----------+---------------------+--------+-----------------------------------------+-----------------+
|2.909946 |false |5000000 |['Drama', 'Romance'] |en |
|3.291962 |false |18000000|['Comedy'] |en |
|3.239811 |false |8000000 |['Drama', 'Family'] |en |
|3.573318 |false |1500000 |['Crime', 'Drama', 'Mystery', 'Thriller']|en |
|3.51659 |false |40000000|['Crime', 'Drama', 'Thriller'] |en |
+-----------+---------------------+--------+-----------------------------------------+-----------------+
Split the string column to create an array to get back your original format.
from pyspark.sql.functions import col, regexp_replace, split

df2 = df1.withColumn('genres', split(regexp_replace(col('genres'), r'\[|\]', ''), ',').cast('array<string>'))
df2.printSchema()
root
|-- avg(rating): double (nullable = true)
|-- belongs_to_collection: boolean (nullable = true)
|-- budget: integer (nullable = true)
|-- genres: array (nullable = true)
| |-- element: string (containsNull = true)
|-- original_language: string (nullable = true)
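Note that the split above keeps the single quotes and the leading spaces from the original list literal; a hedged refinement that strips them as well could be:
from pyspark.sql.functions import col, regexp_replace, split

# Remove brackets and quotes, then split on commas followed by optional spaces
df2 = df1.withColumn(
    "genres",
    split(regexp_replace(col("genres"), r"[\[\]']", ""), r",\s*")
)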

How do I add a column to a nested struct in a PySpark dataframe?

I have a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
I'd like to add columns within the state struct, that is, create a dataframe with a schema like
root
|-- state: struct (nullable = true)
| |-- fld: integer (nullable = true)
| |-- a: integer (nullable = true)
I tried
df.withColumn('state.a', val).printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
# |-- state.a: integer (nullable = true)
Here is a way to do it without using a udf:
# create example dataframe
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, IntegerType

data = [
    ({'fld': 0},)
]

schema = StructType(
    [
        StructField('state',
            StructType(
                [StructField('fld', IntegerType())]
            )
        )
    ]
)

df = sqlCtx.createDataFrame(data, schema)
df.printSchema()
#root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
Now use withColumn() and add the new field using lit() and alias().
val = 1
df_new = df.withColumn(
    'state',
    f.struct(*[f.col('state')['fld'].alias('fld'), f.lit(val).alias('a')])
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
If you have a lot of fields in the nested struct you can use a list comprehension, using df.schema["state"].dataType.names to get the field names. For example:
val = 1
s_fields = df.schema["state"].dataType.names # ['fld']
df_new = df.withColumn(
    'state',
    f.struct(*([f.col('state')[c].alias(c) for c in s_fields] + [f.lit(val).alias('a')]))
)
df_new.printSchema()
#root
# |-- state: struct (nullable = false)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
References
I found a way to get the field names from the Struct without naming them manually from this answer.
Use a transformation such as the following:
import pyspark.sql.functions as f
df = df.withColumn(
    "state",
    f.struct(
        f.col("state.*"),
        f.lit(123).alias("a")
    )
)
Although this is a late answer, for pyspark version 2.x.x the following is supported.
Assuming dfOld already contains state and fld as asked in the question:
dfOld = dfOld.withColumn("a", lit("value"))
dfNew = dfOld.select("level1Field1", "level1Field2", struct(col("state.fld").alias("fld"), col("a")).alias("state"))
Reference: https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803
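With only the state struct from the question (no other top-level fields), that pattern boils down to this hedged sketch:
from pyspark.sql.functions import col, lit, struct

# Add the new value as a temporary top-level column, then fold it back into the struct
dfNew = (dfOld
    .withColumn("a", lit(1))
    .select(struct(col("state.fld").alias("fld"), col("a")).alias("state")))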
Here's a way to do it without a udf.
Initialize example dataframe:
nested_df1 = (spark.read.json(sc.parallelize(["""[
{ "state": {"fld": 1} },
{ "state": {"fld": 2}}
]"""])))
nested_df1.printSchema()
root
|-- state: struct (nullable = true)
| |-- fld: long (nullable = true)
Spark .read.json imports all integers as long by default.
If state.fld has to be an int, you will need to cast it.
from pyspark.sql import functions as F
nested_df1 = (nested_df1
    .select(F.struct(F.col("state.fld").cast('int').alias("fld")).alias("state")))
nested_df1.printSchema()
root
|-- state: struct (nullable = false)
| |-- fld: integer (nullable = true)
nested_df1.show()
+-----+
|state|
+-----+
| [1]|
| [2]|
+-----+
Finally
Use .select to get the nested columns you want from the existing struct with the "parent.child" notation, create the new column, then re-wrap the old columns together with the new columns in a struct.
val_a = 3
nested_df2 = (nested_df1
    .select(
        F.struct(
            F.col("state.fld"),
            F.lit(val_a).alias("a")
        ).alias("state")
    )
)
nested_df2.printSchema()
root
|-- state: struct (nullable = false)
| |-- fld: integer (nullable = true)
| |-- a: integer (nullable = false)
nested_df2.show()
+------+
| state|
+------+
|[1, 3]|
|[2, 3]|
+------+
Flatten if needed with "parent.*".
nested_df2.select("state.*").printSchema()
root
|-- fld: integer (nullable = true)
|-- a: integer (nullable = false)
nested_df2.select("state.*").show()
+---+---+
|fld| a|
+---+---+
| 1| 3|
| 2| 3|
+---+---+
Spark 3.1+
F.col('state').withField('a', F.lit(1))
Example:
from pyspark.sql import functions as F
df = spark.createDataFrame([((1,),)], 'state:struct<fld:int>')
df.printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
df = df.withColumn('state', F.col('state').withField('a', F.lit(1)))
df.printSchema()
# root
# |-- state: struct (nullable = true)
# | |-- fld: integer (nullable = true)
# | |-- a: integer (nullable = false)
You can use the struct function
import pyspark.sql.functions as f
df = df.withColumn(
    "state",
    f.struct(
        f.col("state.fld").alias("fld"),
        f.lit(1).alias("a")
    )
)
from pyspark.sql.functions import *
from pyspark.sql.types import *

def add_field_in_dataframe(nfield, df, dt):
    # nfield is the dotted path of the field to add (e.g. "state.a"),
    # df is the input dataframe and dt the DataType of the new (null-filled) field
    fields = nfield.split(".")
    print(fields)
    n = len(fields)
    addField = fields[0]
    if n == 1:
        # top-level column: just add it as a null of the requested type
        return df.withColumn(addField, lit(None).cast(dt))
    nestedField = ".".join(fields[:-1])
    sfields = df.select(nestedField).schema[fields[-2]].dataType.names
    print(sfields)
    ac = col(nestedField)
    if n == 2:
        nc = struct(*([ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])]))
    else:
        nc = struct(*([ac[c].alias(c) for c in sfields] + [lit(None).cast(dt).alias(fields[-1])])).alias(fields[-2])
    print(nc)
    n = n - 1
    while n > 1:
        # walk back up the path, re-wrapping each parent struct around the rebuilt child
        print("n: ", n)
        fields = fields[:-1]
        print("fields: ", fields)
        nestedField = ".".join(fields[:-1])
        print("nestedField: ", nestedField)
        sfields = df.select(nestedField).schema[fields[-2]].dataType.names
        print(fields[-1])
        print("sfields: ", sfields)
        sfields = [s for s in sfields if s != fields[-1]]
        print("sfields: ", sfields)
        ac = col(".".join(fields[:-1]))
        if n > 2:
            print(fields[-2])
            nc = struct(*([ac[c].alias(c) for c in sfields] + [nc])).alias(fields[-2])
        else:
            nc = struct(*([ac[c].alias(c) for c in sfields] + [nc]))
        n = n - 1
    return df.withColumn(addField, nc)
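A hedged usage example against the state/fld dataframe from the earlier answers, adding a null integer field state.a:
from pyspark.sql.types import IntegerType

df_with_a = add_field_in_dataframe("state.a", df, IntegerType())
df_with_a.printSchema()
# state now contains fld plus a nullable integer field a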