How to access data in a Spark Dataset Column - api

I have a DataFrame like this:
+------+---+
|  Name|Age|
+------+---+
|   A-2| 26|
|   B-1| 30|
|   C-3| 20|
+------+---+
scala> p.select("Name", "Age")
res2: org.apache.spark.sql.DataFrame = [Name: string, Age: string]
We can see clearly here that the data in the columns is of type String.
I want to transform the Name column with a split("-")-like method to get only the first part of it (i.e. A, B, C).
But the Column type in Spark doesn't have such a method, so I'm wondering how to get at the string inside the Column so I can perform the split operation.
Does anyone know what I should do?

Use the functions.split method:
df.select(split(col("Name"), "-").getItem(0))
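For example, applied to the sample data from the question (a minimal sketch assuming a spark-shell session, where spark.implicits._ is already in scope):
import org.apache.spark.sql.functions.{col, split}

val p = Seq(("A-2", "26"), ("B-1", "30"), ("C-3", "20")).toDF("Name", "Age")

// getItem(0) keeps only the part before the first "-"
p.select(split(col("Name"), "-").getItem(0).as("Name"), col("Age")).show()
// +----+---+
// |Name|Age|
// +----+---+
// |   A| 26|
// |   B| 30|
// |   C| 20|
// +----+---+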

The split function is available for Spark DataFrames. See the example below.
//Creating test data (assumes spark-shell, where spark.implicits._ is in scope)
import org.apache.spark.sql.functions.{col, split}
val df = Seq(("A-2", 26)
, ("B-1", 30)
, ("C-3", 20)
).toDF("name", "age")
df.withColumn("new_name", split(col("name"),"-")(0)).show(false)
+----+---+--------+
|name|age|new_name|
+----+---+--------+
|A-2 |26 |A       |
|B-1 |30 |B       |
|C-3 |20 |C       |
+----+---+--------+

Related

Finding the value of a column based on 2 other columns

I have a specific problem, where I want to retrieve the value of the bu_id field from id and matched_id.
When there is some value in the matched_id column, bu_id should be the same for that particular id and for the ids in the corresponding matched_id.
When matched_id is blank, bu_id should be the same as id.
input
+---+------------+
|id |matched_id  |
+---+------------+
|0  |7,8         |
|1  |            |
|2  |4           |
|3  |5,9         |
|4  |2           |
|5  |3,9         |
|6  |            |
|7  |0,8         |
|8  |0,7         |
|9  |3,5         |
+---+------------+
output
+---+------------+-----+
|id |matched_id  |bu_id|
+---+------------+-----+
|0  |7,8         |0    |
|1  |            |1    |
|2  |4           |2    |
|3  |5,9         |3    |
|4  |2           |2    |
|5  |3,9         |3    |
|6  |            |6    |
|7  |0,8         |0    |
|8  |0,7         |0    |
|9  |3,5         |3    |
+---+------------+-----+
Can anyone help me figure out how to approach this problem? Thanks in advance.
We should try to use functions exclusively from the pyspark.sql.functions module, because these are optimized for pyspark dataframes (see here), whereas UDFs are not and should be avoided when possible.
To achieve the desired output pyspark dataframe, we can concatenate the "id" and "matched_id" columns together, convert that string into a list of strings using split, cast the result as an array of integers, and take the minimum of the array. We can get away with not having to worry about the blank strings, because they get converted into null and F.array_min drops nulls from consideration. This can be done with the following code (and while it is a little hard to read, it gets the job done):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
    ],
    ["id", "matched_id"]
)

df.withColumn(
    "bu_id",
    F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ",").cast("array<int>"))
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
+---+----------+-----+
Update: in the case of non-numeric strings in columns "id" and "matched_id", we can no longer cast to an array of integers, so we can instead use pyspark functions F.when and .otherwise (see here) to set our new column to the "id" column when "matched_id" is an empty string "", and apply our other longer nested function when "matched_id" is non-empty.
df2 = spark.createDataFrame(
    [
        ("0", "7,8"),
        ("1", ""),
        ("2", "4"),
        ("3", "5,9"),
        ("4", "2"),
        ("5", "3,9"),
        ("6", ""),
        ("7", "0,8"),
        ("8", "0,7"),
        ("9", "3,5"),
        ("x", ""),
        ("x", "y,z"),
    ],
    ["id", "matched_id"]
)

df2.withColumn(
    "bu_id",
    F.when(
        F.col("matched_id") != "",
        F.array_min(F.split(F.concat(F.col("id"), F.lit(","), F.col("matched_id")), ","))
    ).otherwise(F.col("id"))
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
| x| | x|
| x| y,z| x|
+---+----------+-----+
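For completeness, if you happen to be working in Scala (as elsewhere in this thread), a rough equivalent of the numeric-only approach above would be the following untested sketch (assuming Spark 2.4+ for array_min and a dataframe df with the same id and matched_id string columns):
import org.apache.spark.sql.functions.{array_min, col, concat, lit, split}

val result = df.withColumn(
  "bu_id",
  // concat id and matched_id, split on ",", cast to ints (blanks become null),
  // and take the minimum; array_min skips null elements
  array_min(split(concat(col("id"), lit(","), col("matched_id")), ",").cast("array<int>"))
)
result.show()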
To answer this question, I assumed that the logic you are looking to implement is:
If the matched_id column is null, then bu_id should be the same as id.
If the matched_id column is not null, we should consider the values listed in both the id and matched_id columns and bu_id should be the minimum of those values.
The Set-Up
# imports to include
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# making your dataframe
df = spark.createDataFrame(
    [
        ('0', '7,8'),
        ('1', ''),
        ('2', '4'),
        ('3', '5,9'),
        ('4', '2'),
        ('5', '3,9'),
        ('6', ''),
        ('7', '0,8'),
        ('8', '0,7'),
        ('9', '3,5'),
    ],
    ['id', 'matched_id'])

print(df.schema.fields)
df.show(truncate=False)
In this df, both the id and matched_id columns are StringType data types. The code that follows builds off this assumption. You can check the column types in your df by running print(df.schema.fields).
+---+----------+
|id |matched_id|
+---+----------+
|0  |7,8       |
|1  |          |
|2  |4         |
|3  |5,9       |
|4  |2         |
|5  |3,9       |
|6  |          |
|7  |0,8       |
|8  |0,7       |
|9  |3,5       |
+---+----------+
The Logic
To implement the logic for bu_id, we create a function called bu_calculation that defines the logic. Then we wrap the function in a pyspark sql UDF. The bu_id column is then created by passing the columns we need to evaluate (the id and matched_id columns) into the UDF.
# create custom function with the logic for bu_id
def bu_calculation(id_col, matched_id_col):
    id_int = int(id_col)
    # turn the string in the matched_id column into a list and remove empty values from the list
    matched_id_list = list(filter(None, matched_id_col.split(",")))
    if len(matched_id_list) > 0:
        # if matched_id column has values, convert strings to ints
        all_ids = [int(x) for x in matched_id_list]
        # join id column values with matched_id column values
        all_ids.append(id_int)
        # return minimum value
        return min(all_ids)
    else:
        # if matched_id column is empty return the id column value
        return id_int

# apply custom bu_calculation function to pyspark sql udf
# the use of IntegerType() here enforces that the bu_calculation function has to return an int
bu_udf = F.udf(bu_calculation, IntegerType())

# make a new column called bu_id using the pyspark sql udf we created called bu_udf
df = df.withColumn('bu_id', bu_udf('id', 'matched_id'))
df.show(truncate=False)
+---+----------+-----+
|id |matched_id|bu_id|
+---+----------+-----+
|0  |7,8       |0    |
|1  |          |1    |
|2  |4         |2    |
|3  |5,9       |3    |
|4  |2         |2    |
|5  |3,9       |3    |
|6  |          |6    |
|7  |0,8       |0    |
|8  |0,7       |0    |
|9  |3,5       |3    |
+---+----------+-----+
More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html

How to append data to a column value in a dataframe

In Spark, I have a dataframe having a column named goals which holds a numeric value. Here, I just want to append the string "goal" or "goals" to the actual value.
I want to print it as:
if value = 1 then 1 goal
if value = 2 then 2 goals, and so on..
My data looks like this
val goalsDF = Seq(("meg", 2), ("meg", 4), ("min", 3),
("min2", 1), ("ss", 1)).toDF("name", "goals")
goalsDF.show()
+-----+-----+
|name |goals|
+-----+-----+
|meg |2 |
|meg |4 |
|min |3 |
|min2 |1 |
|ss |1 |
+-----+-----+
Expected Output:
+-----+---------+
|name |goals |
+-----+---------+
|meg |2 goals |
|meg |4 goals |
|min |3 goals |
|min2 |1 goal |
|ss |1 goal |
+-----+---------+
I tried the code below but it doesn't work and prints the data as null:
goalsDF.withColumn("goals", col("goals") + lit("goals")).show()
+----+-----+
|name|goals|
+----+-----+
| meg| null|
| meg| null|
| min| null|
|min2| null|
| ss| null|
+----+-----+
Please suggest if we can do this inside .withColumn() without any additional user-defined method.
You should use case when. This is a pyspark example, but you should be able to reference it and use scala. Note that the literal has to be wrapped in F.lit, otherwise it would be read as a column name.
import pyspark.sql.functions as F

DF.withColumn(
    'goals',
    F.when(F.col('goals') == 1, '1 goal')
     .otherwise(F.concat_ws(" ", F.col("goals"), F.lit("goals")))
)
For a Scala example see here: https://stackoverflow.com/a/37108127/5899997
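If you need it directly in Scala, a minimal sketch of the same when/otherwise idea (assuming the goalsDF from the question) could look like this:
import org.apache.spark.sql.functions.{col, concat, lit, when}

goalsDF.withColumn(
  "goals",
  // cast the numeric column to string and append the singular or plural suffix
  when(col("goals") === 1, concat(col("goals").cast("string"), lit(" goal")))
    .otherwise(concat(col("goals").cast("string"), lit(" goals")))
).show()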

Spark SQL - count the number of distinct words in all columns

There is a DataFrame df_titles with one column "title":
+--------------------+
| title|
+--------------------+
| harry_potter_1|
| harry_potter_2|
+--------------------+
I want to know the number of unique terms appearing in the titles, where the terms are delimited by "_", and get something like this:
+--------------------+------+
| term| count|
+--------------------+------+
| harry| 2|
| potter| 2|
| 1| 1|
| 2| 1|
+--------------------+------+
I am thinking of creating a new_df with columns "term" and "count", and for each row in df_titles, splitting the string and inserting [string, 1] into the new_df. Then maybe reduce the new df by "term":
val test = Seq.empty[Term].toDF()
df.foreach(spark.sql("INSERT INTO test VALUES (...)"))
...
But I am stuck with the code. How should I proceed? Is there a better way to do this?
You can use Spark built-in functions such as split and explode to transform your dataframe of titles into a dataframe of terms and then do a simple groupBy. Your code should be:
import org.apache.spark.sql.functions.{col, desc, explode, split}
df_titles
.select(explode(split(col("title"), "_")).as("term"))
.groupBy("term")
.count()
.orderBy(desc("count")) // optional, to have count in descending order
Usually, when you have to perform something over a dataframe, it is better to first try to use a combination of the Spark built-in functions that you can find in the Spark documentation.
Details
Starting from df_titles:
+--------------+
|title |
+--------------+
|harry_potter_1|
|harry_potter_2|
+--------------+
split creates an array of words separated by _:
+-------------------+
|split(title, _, -1)|
+-------------------+
|[harry, potter, 1] |
|[harry, potter, 2] |
+-------------------+
Then, explode creates one line per item in the array created by split:
+------+
|col |
+------+
|harry |
|potter|
|1 |
|harry |
|potter|
|2 |
+------+
.as("term") renames column col to term:
+------+
|term |
+------+
|harry |
|potter|
|1 |
|harry |
|potter|
|2 |
+------+
Then .groupBy("term") with .count() aggregates counts by term; count() is a shortcut for .agg(count("term").as("count")):
+------+-----+
|term |count|
+------+-----+
|harry |2 |
|1 |1 |
|potter|2 |
|2 |1 |
+------+-----+
And finally, .orderBy(desc("count")) orders by count in descending order:
+------+-----+
|term |count|
+------+-----+
|harry |2 |
|potter|2 |
|1 |1 |
|2 |1 |
+------+-----+
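If all you actually need is the total number of unique terms rather than the per-term counts, a small variation on the same idea (a sketch reusing df_titles and the imports above) would be:
df_titles
  .select(explode(split(col("title"), "_")).as("term"))
  .distinct()
  .count()
// returns 4 for the two sample titles (harry, potter, 1, 2)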

How to append new data to an existing Hive table using a Spark data frame and the partitionBy clause

I have a dataframe which I am writing to a Hive table using partitionBy:
val df = Seq(
("day1", 1),
("day1", 2),
("day1", 3),
("day1", 4)
).toDF("day","id")
df.write.partitionBy("day").format("parquet").saveAsTable("mydb.day_table")
spark.sql("select * from mydb.day_table").show
+---+----+
| id| day|
+---+----+
| 1|day1|
| 2|day1|
| 3|day1|
| 4|day1|
+---+----+
spark.sql("show create table mydb.day_table").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+---------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `mydb`.`day_table` (`id` INT, `day` STRING)
USING parquet
OPTIONS (
`serialization.format` '1'
)
PARTITIONED BY (day)
|
+---------------------------------------------------------------------------------------------------------------------------------------+
If I create another dataframe and want to append the content of this data frame to the same table while keeping the partitioning intact, how do I do it? Is this the correct way?
val df2 = Seq(
("day2",5),
("day2",6)
).toDF("day","id")
df2.write.mode("append").partitionBy("day").format("parquet").saveAsTable("mydb.day_table")
I want the output below, but I want the table to remain partitioned by day even after I keep appending data to the original table.
spark.sql("select * from mydb.day_table").show(false)
+---+----+
|id |day |
+---+----+
|1 |day1|
|2 |day1|
|3 |day1|
|4 |day1|
|5 |day2|
|6 |day2|
+---+----+
I am using Scala with Spark-Shell. Thanks.
You have to use the Spark DataFrame writer APIs:
1) partitionBy
2) saveAsTable
df.write.
  format("parquet").
  mode(org.apache.spark.sql.SaveMode.Append).
  partitionBy("first").
  saveAsTable("example")
I see this is what you were doing as well, but you did not post any error.
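For example, applied to the df2 from your question (a sketch assuming the mydb.day_table created above), the append keeps the day partitioning, which you can check with SHOW PARTITIONS:
df2.write.
  format("parquet").
  mode(org.apache.spark.sql.SaveMode.Append).
  partitionBy("day").
  saveAsTable("mydb.day_table")

spark.sql("show partitions mydb.day_table").show(false)
// should list day=day1 and day=day2 after the append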

Apache Spark SQL - Multiple arrays explode and 1:1 mapping

I am pretty new to Apache Spark SQL and trying to achieve the following. I have the following DF which I want to convert to an intermediate DF and then to JSON.
array [a,b,c,d,e] and array [1,2,3,4,5]
Need them to be
a 1
b 2
c 3
I tried the explode option but I get only one array exploded.
Thanks for the assistance..
To join two dataframes in Spark, you will need to use a common column which exists in both dataframes, and since you don't have one, you need to create it. Since version 1.6.0, Spark supports this functionality through the monotonically_increasing_id() function. The following code illustrates this case:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("a","b","c","d","e")
.toDF("val1")
.withColumn("id", monotonically_increasing_id)
val df2 = Seq(1, 2, 3, 4, 5)
.toDF("val2")
.withColumn("id", monotonically_increasing_id)
df.join(df2, "id").select($"val1", $"val2").show(false)
Output:
+----+----+
|val1|val2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
|d |4 |
|e |5 |
+----+----+
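Since the question mentions converting the intermediate DF to JSON afterwards, one option on top of the join above is toJSON, which turns each row into a JSON string (a sketch, assuming the df and df2 defined above):
val joined = df.join(df2, "id").select($"val1", $"val2")
joined.toJSON.show(false)
// e.g. {"val1":"a","val2":1}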
Good luck