Spark Scala - Split DataFrame column into multiple depending on the size of the column - apache-spark-sql

I need to split a column into several columns, depending on the number of fields each record has. For example, if I have the following DF:
+---+-------------------------------------------------+---+
|...|unique_code                                      |...|
+---+-------------------------------------------------+---+
|...|2022-12-31 00:00:00.000000000*_*AAAAA*_*000000000|...|
+---+-------------------------------------------------+---+
|...|2022-12-31 00:00:00.000000000*_*BBBB             |...|
+---+-------------------------------------------------+---+
|...|2022-12-31 00:00:00.000000000*_*CCC*_*1111*_*XX  |...|
+---+-------------------------------------------------+---+
I know that it will have at most 4 fields and at least 1, always in the same order, which is the one in this list:
val uniqueCodeFields = List("col1", "col2", "col3", "col4")
Therefore the resulting DF would be the following:
+---+-----------------------------+-----+---------+----+---+
|...|col1                         |col2 |col3     |col4|...|
+---+-----------------------------+-----+---------+----+---+
|...|2022-12-31 00:00:00.000000000|AAAAA|000000000|NULL|...|
+---+-----------------------------+-----+---------+----+---+
|...|2022-12-31 00:00:00.000000000|BBBB |NULL     |NULL|...|
+---+-----------------------------+-----+---------+----+---+
|...|2022-12-31 00:00:00.000000000|CCC  |1111     |XX  |...|
+---+-----------------------------+-----+---------+----+---+
I developed this, based on https://stackoverflow.com/a/45972636/9025222
chgPivotedDF.withColumn("temp", split(col("unique_code"), "\\*_\\*")).select(
(0 until size(col("temp"))).map(i => col("temp").getItem(i).as(uniqueCodeFields(i))): _*
)
But I am not able to get the length of the "temp" column so that I only loop up to its limit in each case. I get the following error:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Int
(0 until col($"temp")).map(i => col("temp").getItem(i).as(uniqueCodeFields(i))): _*
^
any help is welcome, thanks!
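One possible way around the type mismatch, offered as a sketch rather than a tested Spark answer: the number of output columns has to be fixed when the query is built, so the loop bound cannot come from size(col("temp")), which is a per-row Column and not an Int. Looping over the indices of uniqueCodeFields instead, and letting missing positions become NULL, expresses the same intent. The snippet below shows only that "split, then pad" logic in plain Python on the sample values; split_unique_code is a hypothetical helper name, not a Spark function.

```python
# Plain-Python illustration of "split, then pad to a fixed number of fields".
# split_unique_code is a hypothetical helper, not part of Spark.
UNIQUE_CODE_FIELDS = ["col1", "col2", "col3", "col4"]
SEPARATOR = "*_*"


def split_unique_code(value, fields=UNIQUE_CODE_FIELDS, sep=SEPARATOR):
    """Map each expected field name to its part, or None when it is absent."""
    parts = value.split(sep)  # literal split, so no regex escaping is needed
    if len(parts) > len(fields):
        raise ValueError(f"expected at most {len(fields)} fields, got {len(parts)}")
    # Pad with None so every record yields the same set of columns.
    padded = parts + [None] * (len(fields) - len(parts))
    return dict(zip(fields, padded))


samples = [
    "2022-12-31 00:00:00.000000000*_*AAAAA*_*000000000",
    "2022-12-31 00:00:00.000000000*_*BBBB",
    "2022-12-31 00:00:00.000000000*_*CCC*_*1111*_*XX",
]
for sample in samples:
    print(split_unique_code(sample))
```

In Spark the analogous change would be to map over uniqueCodeFields.zipWithIndex and call col("temp").getItem(i).as(name) for each pair. As far as I know, getItem on an index past the end of the array yields NULL when ANSI mode is off, but that is worth checking on your Spark version.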

Related

How to convert 1 row 4 columns dataframe to 4 rows 2 columns dataframe in pyspark or sql

I have a dataframe which returns the output as
I would like to transpose this into
Can someone help me understand how to write the PySpark code to achieve this result dynamically? I have tried unpivot in SQL, but with no luck.
df =spark.createDataFrame([
(78,20,19,90),
],
('Machines', 'Books', 'Vehicles', 'Plants'))
Create a new array-of-structs column that pairs each column name with its value, then use inline to explode the struct fields into rows. Code below (assumes from pyspark.sql import functions as F):
df.withColumn('tab', F.array(*[F.struct(F.lit(x).alias('Fields'), F.col(x).alias('Count')).alias(x) for x in df.columns])).selectExpr('inline(tab)').show()
+--------+-----+
| Fields|Count|
+--------+-----+
|Machines| 78|
| Books| 20|
|Vehicles| 19|
| Plants| 90|
+--------+-----+
As mentioned in the unpivot-dataframe tutorial, use:
df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
Or to generalise:
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
Full example:
df = spark.createDataFrame(data=[[78,20,19,90]], schema=['Machines','Books','Vehicles','Plants'])
# Hard coded
# df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
# Generalised
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
[Out]:
+--------+-----+
|Fields |Count|
+--------+-----+
|Machines|78 |
|Books |20 |
|Vehicles|19 |
|Plants |90 |
+--------+-----+
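The generalised stack expression above can also be wrapped in a small function, which makes the generated SQL easy to inspect before handing it to selectExpr. This is only a sketch: build_stack_expr is a hypothetical helper name, and quoting the names with single quotes and backticks is my assumption for columns whose names are not plain identifiers; it does not handle names that themselves contain quotes or backticks.

```python
def build_stack_expr(columns, key_name="Fields", value_name="Count"):
    """Build a Spark SQL stack(...) expression that unpivots the given columns."""
    if not columns:
        raise ValueError("need at least one column to unpivot")
    # Each column contributes a pair: its name as a string literal,
    # then a backtick-quoted reference to the column itself.
    pairs = ", ".join(f"'{c}', `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as ({key_name}, {value_name})"


expr = build_stack_expr(["Machines", "Books", "Vehicles", "Plants"])
print(expr)
```

The resulting string would be passed to df.selectExpr(expr) as in the answer. As far as I know, stack expects the value columns to share a compatible type, so mixed-type columns may need a cast first.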

Spark scala create dataframe from text with columns split by delimiter | [duplicate]

This question already has answers here:
Java String split not returning the right values
(4 answers)
Closed 10 months ago.
I am trying to create a spark dataframe from text file in which data is delimited by | symbol.
I have to use Spark with Scala.
The text file has data as below:
John|1234|$2500|giggle
Ross|1344|$5500|Micsoft
Jennifer|5432|$2100|healthcare
val schemaString = "name,employeeid,salary,company"
val fields = schemaString.split(",").map(fieldName => StructField(fieldName,StringType, nullable=true))
val schema = StructType(fields)
val rddView= sc.textFile("/dev/path/*").map(_.split("|")).map{x
=> org.apache.spark.sql.Row(x:_*)}
val rddViewDf = sqlContext.createDataFrame(rddView,schema)
rddViewDf.show()
I expect the values to be mapped to the corresponding columns, but the output is not as expected.
Can someone provide the correct solution in Spark using Scala?
Output I am getting:
+----+----------+------+-------+
|name|employeeid|salary|company|
+----+----------+------+-------+
| J| o| h| n|
| R| o| s| s|
| J| e| n| n|
+----+----------+------+-------+
Expected Output
+--------+----------+------+----------+
|    name|employeeid|salary|   company|
+--------+----------+------+----------+
|    John|      1234| $2500|    giggle|
|    Ross|      1344| $5500|   Micsoft|
|Jennifer|      5432| $2100|healthcare|
+--------+----------+------+----------+
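As pointed out in the comments, your split delimiter is incorrect: String.split takes a regular expression, and | is the regex alternation operator, so it has to be escaped (for example "\\|").
However, you should not be using RDDs anyway: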
scala> spark.read.option("delimiter", "|").csv("data.txt").show()
+--------+----+-----+----------+
| _c0| _c1| _c2| _c3|
+--------+----+-----+----------+
| John|1234|$2500| giggle|
| Ross|1344|$5500| Micsoft|
|Jennifer|5432|$2100|healthcare|
+--------+----+-----+----------+
https://spark.apache.org/docs/latest/sql-data-sources-csv.html
To rename the columns, see "How to read csv without header and name them with names while reading in pyspark?" and translate it to Scala.
Note: Ideally, your employeeId column should be defined as a LongType, not a StringType.
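On the delimiter point in the answer above: Scala's split on a String is Java's String.split, which treats its argument as a regular expression, so a bare | is read as alternation rather than as a literal pipe. The sketch below shows the same distinction with Python's re module, used here only as a stand-in for the Java regex behaviour.

```python
import re

line = "John|1234|$2500|giggle"

# Escaped pipe: matches the literal "|" character.
fields = re.split(r"\|", line)
print(fields)

# Unescaped pipe: an alternation of two empty patterns, so it never
# consumes the literal "|" and cannot yield the four fields.
broken = re.split("|", line)
print(broken == fields)
```

In the Scala code from the question, the literal forms would be split("\\|") or the Char overload split('|').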

How to export a Spark DataFrame with columns holding value lists aggregated with collect_list() to a 3-dimensional Pandas object in PySpark?

I have the DataFrame like this one (How to get the occurence rate of the specific values with Apache Spark)
+-----------+--------------------+------------+-------+
|device | windowtime | values| counts|
+-----------+--------------------+------------+-------+
| device_A|2022-01-01 18:00:00 |[99,100,102]|[1,3,1]|
| device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|
Windowtime is the X axis value, values are the Y axis value, and counts are the Z axis value (to be plotted later, say on a heatmap).
How to export that to Pandas 3d object from PySpark dataframe?
With "2 dimensions", I have
pdf = df.toPandas()
and then I can use that for Bokeh's figure like that:
fig1ADB = figure(title="My 2 graph", tooltips=TOOLTIPS, x_axis_type='datetime')
fig1ADB.line(x='windowtime', y='values', source=source, color="orange")
But I'd like to use something like this:
hm = HeatMap(data, x='windowtime', y='values', values='counts', title='My heatmap (3d) graph', stat=None)
show(hm)
What kind of transformation should I do for that?
I have realized that the approach itself is wrong: there should be no aggregation to lists before exporting to Pandas!
According to the discussion at
https://discourse.bokeh.org/t/cant-render-heatmap-data-for-apache-zeppelins-pyspark-dataframe/8844/8
instead of the values/counts columns grouped into lists, we have a raw table with one line per unique id ('values') and its count ('index'), and each line has its own 'window_time':
+-------------------+------+-----+
|window_time |values|index|
+-------------------+------+-----+
|2022-01-24 18:00:00|999 |2 |
|2022-01-24 19:00:00|999 |1 |
|2022-01-24 20:00:00|999 |3 |
|2022-01-24 21:00:00|999 |4 |
|2022-01-24 22:00:00|999 |5 |
|2022-01-24 18:00:00|998 |4 |
|2022-01-24 19:00:00|998 |5 |
|2022-01-24 20:00:00|998 |3 |
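# Imports assumed by the snippet below
import pandas as pd
from bokeh.layouts import gridplot
from bokeh.models import ColorBar, ColumnDataSource, LogColorMapper
from bokeh.plotting import figure, show
rowIDs = pdf['values']
colIDs = pdf['window_time']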
A = pdf.pivot_table('index', 'values', 'window_time', fill_value=0)
source = ColumnDataSource(data={'x':[pd.to_datetime('Jan 24 2022')] #left most
,'y':[0] #bottom most
,'dw':[pdf['window_time'].max()-pdf['window_time'].min()] #TOTAL width of image
#,'dh':[df['delayWindowEnd'].max()] #TOTAL height of image
,'dh':[1000] #TOTAL height of image
,'im':[A.to_numpy()] #2D array using to_numpy() method on pivotted df
})
color_mapper = LogColorMapper(palette="Viridis256", low=1, high=20)
plot = figure(toolbar_location=None,x_axis_type='datetime')
plot.image(x='x', y='y', source=source, image='im',dw='dw',dh='dh', color_mapper=color_mapper)
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12)
plot.add_layout(color_bar, 'right')
#show(plot)
show(gridplot([plot], ncols=1, plot_width=1000, plot_height=400))
And the result:
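The reshaping described above (one row per timestamp/value pair instead of parallel list columns, then a pivot into a 2D grid) can be sketched without Spark or Pandas. This is only an illustration on the two sample rows from the question; to_long_rows and to_grid are hypothetical helper names, and filling missing cells with 0 mirrors the fill_value=0 used in the pivot_table call.

```python
# Rows in the aggregated shape from the question: parallel lists per window.
aggregated = [
    ("device_A", "2022-01-01 18:00:00", [99, 100, 102], [1, 3, 1]),
    ("device_A", "2022-01-01 18:00:10", [98, 100, 101], [1, 2, 2]),
]


def to_long_rows(rows):
    """Yield one (windowtime, value, count) tuple per list element."""
    for _device, windowtime, values, counts in rows:
        if len(values) != len(counts):
            raise ValueError("values and counts must have the same length")
        for value, count in zip(values, counts):
            yield (windowtime, value, count)


def to_grid(long_rows):
    """Pivot long rows into a grid: one row per value, one column per time."""
    times = sorted({t for t, _, _ in long_rows})
    values = sorted({v for _, v, _ in long_rows})
    cell = {(v, t): c for t, v, c in long_rows}
    # Missing (value, time) combinations become 0.
    grid = [[cell.get((v, t), 0) for t in times] for v in values]
    return values, times, grid


long_rows = list(to_long_rows(aggregated))
values, times, grid = to_grid(long_rows)
for value, row in zip(values, grid):
    print(value, row)
```

The grid plays the role of A.to_numpy() in the Bokeh snippet: rows are the Y axis (values), columns are the X axis (window times), and each cell is the count.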

Scala convert Array to DataFrame Column

I am trying to add an Array of values as a new column to the DataFrame.
Ex:
Let's assume there is an Array(4,5,10) and a dataframe
+----------+-----+
| name | age |
+----------+-----+
| John | 32 |
| Elizabeth| 28 |
| Eric | 41 |
+----------+-----+
My requirement is to add the above array as a new column to the dataframe. My expected output is as follows:
+----------+-----+------+
| name | age | rank |
+----------+-----+------+
| John | 32 | 4 |
| Elizabeth| 28 | 5 |
| Eric | 41 | 10 |
+----------+-----+------+
I am trying to achieve this using rdd and zipWithIndex.
df.rdd.zipWithIndex.map(_.swap).join(array_rdd.zipWithIndex.map(_.swap))
This is resulting in something of this sort.
(0,([John, 32],4))
I want to convert the above RDD back to required dataframe. Let me know how to achieve this.
Are there any alternatives available for achieving the desired result other than using rdd and zipWithIndex? What is the best way to do it?
PS:
Context for better understanding:
I am using the Xpress optimization suite to solve a mathematical problem. Xpress takes its inputs as Arrays and also outputs the result as an Array. I get the input as a DataFrame, extract the columns as Arrays (using collect), and pass them to Xpress. Xpress outputs an Array[Double] as the solution. I want to add this solution back to the dataframe as a column, where every value in the solution array corresponds to the row of the dataframe at the same index, i.e. the value at index 'n' of the output Array corresponds to the 'n'th row of the dataframe.
After the join just map the results to what you are looking for.
You can convert this back to a dataframe after joining the RDDs.
val originalDF = Seq(("John", 32), ("Elizabeth", 28), ("Eric", 41)).toDF("name", "age")
val rank = Array(4, 5, 10)
// convert to Seq first
val rankDF = rank.toSeq.toDF("rank")
val joined = originalDF.rdd.zipWithIndex.map(_.swap).join(rankDF.rdd.zipWithIndex.map(_.swap))
val finalRDD = joined.map{ case (k,v) => (k, v._1.getString(0), v._1.getInt(1), v._2.getInt(0)) }
val finalDF = finalRDD.toDF("id", "name", "age", "rank")
finalDF.show()
/*
+---+---------+---+----+
| id| name|age|rank|
+---+---------+---+----+
| 0| John| 32| 4|
| 1|Elizabeth| 28| 5|
| 2| Eric| 41| 10|
+---+---------+---+----+
*/
The only alternate way that I can think of is to use the org.apache.spark.sql.functions.row_number() window function. This essentially achieves the same thing by adding an increasing, consecutive row number to the dataframe.
The drawback is the large amount of data shuffled into one partition, since we need unique row numbers for all rows in the dataframe. If your data is very large, this can lead to an out-of-memory issue. (Note: this may not apply in your case, since you mentioned you are doing a collect on the data and have not mentioned any memory issues.)
The approach of converting to an rdd and using zipWithIndex is an acceptable solution, but generally converting from dataframe to rdd is not recommended due to the performance difference of using an RDD instead of a dataframe.
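To make the zipWithIndex, swap, and join steps above concrete, here is the same pipeline on plain Python lists. It is only an illustration of the index-keyed join, not Spark code; the explicit sort is there because, as far as I know, an RDD join does not guarantee the order of its output, so the id column is what ties each rank back to its row.

```python
rows = [("John", 32), ("Elizabeth", 28), ("Eric", 41)]
rank = [4, 5, 10]

# zipWithIndex + swap: key each element by its position.
rows_by_index = {i: row for i, row in enumerate(rows)}
rank_by_index = {i: r for i, r in enumerate(rank)}

# Inner join on the shared index, then flatten to (id, name, age, rank).
shared = sorted(rows_by_index.keys() & rank_by_index.keys())
joined = [(i, *rows_by_index[i], rank_by_index[i]) for i in shared]

for record in joined:
    print(record)
```

If the array and the dataframe have different lengths, an inner join like this silently drops the unmatched positions, so checking the two sizes first may be worthwhile.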

Transform several Dataframe rows into a single row

The following is an example Dataframe snippet:
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_lid |trace |message |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1103960793391132675|47c10fda9b40407c998c154dc71a9e8c|[app.py:208] Prediction label: {"id": 617, "name": "CENSORED"}, score=0.3874854505062103 |
|1103960793391132676|47c10fda9b40407c998c154dc71a9e8c|[app.py:224] Similarity values: [0.6530804801919593, 0.6359653379418201] |
|1103960793391132677|47c10fda9b40407c998c154dc71a9e8c|[app.py:317] Predict=s3://CENSORED/scan_4745/scan4745_t1_r0_c9_2019-07-15-10-32-43.jpg trait_id=112 result=InferenceResult(predictions=[Prediction(label_id='230', label_name='H3', probability=0.0), Prediction(label_id='231', label_name='Other', probability=1.0)], selected=Prediction(label_id='231', label_name='Other', probability=1.0)). Took 1.3637824058532715 seconds |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have millions of these log-like structures, and they can all be grouped by trace, which is unique to a session.
I'm looking to transform these sets of rows into single rows, essentially mapping over them. For this example, I would extract "id": 617 from the first row, the values 0.6530804801919593, 0.6359653379418201 from the second row, and the Prediction(label_id='231', label_name='Other', probability=1.0) value from the third row.
Then I would compose a new table having the columns:
| trace | id | similarity | selected |
with the values:
| 47c10fda9b40407c998c154dc71a9e8c | 617 | 0.6530804801919593, 0.6359653379418201 | 231 |
How should I implement this group-map transform over several rows in PySpark?
I've written the below example in Scala for my own convenience, but it should translate readily to Pyspark.
1) Create the new columns in your dataframe via regexp_extract on the "message" field. This will produce the desired values if the regex matches, or empty strings if not:
scala> val dss = ds.select(
| 'trace,
| regexp_extract('message, "\"id\": (\\d+),", 1) as "id",
| regexp_extract('message, "Similarity values: \\[(\\-?[0-9\\.]+, \\-?[0-9\\.]+)\\]", 1) as "similarity",
| regexp_extract('message, "selected=Prediction\\(label_id='(\\d+)'", 1) as "selected"
| )
dss: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> dss.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617| | |
|47c10fda9b40407c998c154dc71a9e8c| |0.6530804801919593, 0.6359653379418201| |
|47c10fda9b40407c998c154dc71a9e8c| | |231 |
+--------------------------------+---+--------------------------------------+--------+
2) Group by "trace" and eliminate the cases where the regex didn't match. The quick and dirty way (shown below) is to select the max of each column, which works because an empty string sorts before any non-empty match, but you might need to do something more sophisticated if you expect to encounter more than one match per trace:
scala> val ds_final = dss.groupBy('trace).agg(max('id) as "id", max('similarity) as "similarity", max('selected) as "selected")
ds_final: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> ds_final.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617|0.6530804801919593, 0.6359653379418201|231 |
+--------------------------------+---+--------------------------------------+--------+
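The two steps above (extract with regexes, then take the max per trace) can be checked on the sample messages with Python's re module before porting them to PySpark. This is a plain-Python sketch, not Spark code; extract and collapse are hypothetical helper names, and the patterns are the ones from the Scala answer with the string escaping adjusted for Python raw strings.

```python
import re
from collections import defaultdict

PATTERNS = {
    "id": re.compile(r'"id": (\d+),'),
    "similarity": re.compile(r"Similarity values: \[(-?[0-9.]+, -?[0-9.]+)\]"),
    "selected": re.compile(r"selected=Prediction\(label_id='(\d+)'"),
}


def extract(message):
    """Return the first capture group per pattern, or '' when it does not match."""
    out = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(message)
        out[name] = match.group(1) if match else ""
    return out


def collapse(rows):
    """Group (trace, message) pairs by trace, keeping the max string per column."""
    grouped = defaultdict(lambda: {name: "" for name in PATTERNS})
    for trace, message in rows:
        for name, value in extract(message).items():
            # '' sorts before any non-empty string, so a match always wins.
            grouped[trace][name] = max(grouped[trace][name], value)
    return dict(grouped)


TRACE = "47c10fda9b40407c998c154dc71a9e8c"
rows = [
    (TRACE, '[app.py:208] Prediction label: {"id": 617, "name": "CENSORED"}, score=0.3874854505062103'),
    (TRACE, "[app.py:224] Similarity values: [0.6530804801919593, 0.6359653379418201]"),
    (TRACE, "[app.py:317] Predict=s3://CENSORED/scan_4745/scan4745_t1_r0_c9_2019-07-15-10-32-43.jpg "
            "trait_id=112 result=InferenceResult(predictions=[Prediction(label_id='230', label_name='H3', "
            "probability=0.0), Prediction(label_id='231', label_name='Other', probability=1.0)], "
            "selected=Prediction(label_id='231', label_name='Other', probability=1.0)). "
            "Took 1.3637824058532715 seconds"),
]

result = collapse(rows)
print(result[TRACE])
```

As in the Scala version, max is a string comparison, so it only gives a sensible answer when each trace has at most one match per column.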
I ended up using something along the lines of
expected_schema = StructType([
    StructField("event_timestamp", TimestampType(), False),
    StructField("trace", StringType(), False),
    ...
])
@F.pandas_udf(expected_schema, F.PandasUDFType.GROUPED_MAP)
# Input/output are both a pandas.DataFrame
def transform(pdf):
    output = {}
    for l in pdf.to_dict(orient='records'):
        x = re.findall(r'^(\[.*:\d+\]) (.*)', l['message'])[0][1]
        ...
    return pd.DataFrame(data=[output])
df.groupby('trace').apply(transform)