I am pretty new to Apache Spark SQL and trying to achieve the following: I have the following DF, which I want to convert to an intermediate DF and then to JSON.
array [a,b,c,d,e] and array [1,2,3,4,5]
Need them to be
a 1
b 2
c 3
I tried the explode option, but only one array gets exploded.
Thanks for the assistance.
To join two dataframes in Spark you need a common column that exists in both of them, and since you don't have one, you need to create it. Since version 1.6.0 Spark provides the monotonically_increasing_id() function for this. The following code illustrates the case:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq("a", "b", "c", "d", "e")
  .toDF("val1")
  .withColumn("id", monotonically_increasing_id)
val df2 = Seq(1, 2, 3, 4, 5)
  .toDF("val2")
  .withColumn("id", monotonically_increasing_id)
df.join(df2, "id").select($"val1", $"val2").show(false)
Output:
+----+----+
|val1|val2|
+----+----+
|a |1 |
|b |2 |
|c |3 |
|d |4 |
|e |5 |
+----+----+
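As an alternative, if both arrays sit in the same row as two array columns (which the DF in the question suggests), you can zip and explode them in one pass without a join. A minimal sketch, assuming Spark 2.4+ for arrays_zip and hypothetical column names letters and numbers:
// A minimal sketch, assuming Spark 2.4+ and that both arrays live in one row
// as array columns "letters" and "numbers" (hypothetical names).
import org.apache.spark.sql.functions._
import spark.implicits._
val arrDf = Seq((Seq("a", "b", "c", "d", "e"), Seq(1, 2, 3, 4, 5))).toDF("letters", "numbers")
arrDf
  .select(explode(arrays_zip($"letters", $"numbers")).as("pair"))
  .select($"pair.letters".as("val1"), $"pair.numbers".as("val2"))
  .show(false)
This avoids relying on monotonically_increasing_id lining up across two dataframes, but it only applies when the arrays are columns of the same row.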
Good luck
Hello, I have nested JSON files of about 400 megabytes with 200k records. I created a solution using PySpark to parse the file and store it in a customized dataframe, but it takes about 5-7 minutes to do this operation, which is very slow.
Here is an example of a JSON file (a small one, but with the same structure as the large ones):
{"status":"success",
"data":{"resultType":"matrix","result":
[{"metric":{"data0":"T" ,"data1":"O"},"values":[[90,"0"],[80, "0"]]},
{"metric":{"data0":"K" ,"data1":"S"},"values":[[70,"0"],[60, "0"]]},
{"metric":{"data2":"J" ,"data3":"O"},"values":[[50,"0"],[40, "0"]]}]}}
Here is the structure of the output dataframe I want:
time | value | data0 | data1 | data2 | data3
90   | "0"   | "T"   | "O"   | nan   | nan
80   | "0"   | "T"   | "O"   | nan   | nan
70   | "0"   | "K"   | "S"   | nan   | nan
60   | "0"   | "K"   | "S"   | nan   | nan
50   | "0"   | nan   | nan   | "J"   | "O"
40   | "0"   | nan   | nan   | "J"   | "O"
and this is the PySpark code I used on the large file to produce the dataframe structure listed above:
from datetime import datetime
import json

import rapidjson
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

from util import schema, meta_date

new_schema = StructType.fromJson(json.loads(schema))

# Count the entries in data.result so the index array can be built later.
with open("largefile.json", "r") as json_file:
    result_count = len(rapidjson.load(json_file)["data"]["result"])

spark = SparkSession.builder.master("spark://IP").getOrCreate()
conf = spark.sparkContext._conf.setAll([
    ("spark.executor.memory", "5g"),
    ("spark.executor.cores", "4"),
    ("spark.driver.memory", "4g"),
])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()

df = spark.read.json("largefile.json")

# Pull each metric field out of data.result into its own array column.
for data_name in meta_date:
    df = df.withColumn(
        data_name, F.expr(f"transform(data.result, x -> x.metric.{data_name})")
    )

# Build an index column and explode it so every result entry becomes a row.
df = (
    df.withColumn("values", F.expr("transform(data.result, x -> x.values)"))
    .withColumn("items", F.array(*[F.lit(x) for x in range(0, result_count)]))
    .withColumn("items", F.explode(F.col("items")))
)

for data_name in meta_date:
    df = df.withColumn(data_name, F.col(data_name).getItem(F.col("items")))

df = (
    df.withColumn("values", F.col("values").getItem(F.col("items")))
    .withColumn("values", F.explode("values"))
    .withColumn("time", F.col("values").getItem(0))
    .withColumn("value", F.col("values").getItem(1))
    .drop("data", "status", "items", "values")
).show()
My machine has 4 cores (8 logical cores) and 16 GB of memory. I'm using standalone mode with a cluster of one master and 2 worker nodes.
Any help on how to speed up this process, either by editing the cluster configuration or by refactoring the transformations in the code, would be appreciated.
What about this? Read the JSON, select the columns with explode, and the result looks like it matches your desired output.
import pyspark.sql.functions as f

# df here is the dataframe from spark.read.json on the file, as in the question.
df.select(f.explode('data.result').alias('result')) \
  .select('result.metric.*', f.explode('result.values').alias('values')) \
  .withColumn('time', f.col('values')[0]) \
  .withColumn('value', f.col('values')[1]) \
  .drop('values') \
  .show(truncate=False)
+-----+-----+-----+-----+----+-----+
|data0|data1|data2|data3|time|value|
+-----+-----+-----+-----+----+-----+
|T |O |null |null |90 |0 |
|T |O |null |null |80 |0 |
|K |S |null |null |70 |0 |
|K |S |null |null |60 |0 |
|null |null |J |O |50 |0 |
|null |null |J |O |40 |0 |
+-----+-----+-----+-----+----+-----+
I have a simple CSV file with columns (id: Integer, Name: String, Gender: String). I want to add a column to this dataframe using withColumn. I am aware that the first parameter is the column name, but I'm not quite sure how to pass the second parameter, which should hold the count grouped by gender. Can someone help me with this?
I have tried df.withColumn("Count", df.groupBy("Gender").count()), but it gives a type mismatch error saying it expects a Column but is getting a DataFrame. I have also tried using col(), expecting it to cast it.
val newDf=df.withColumn("count",col(df.groupBy("Gender").count()))
I expect a new dataframe with count column added.
I think this is what you are looking for: the use of a window function together with the count function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
  (1, "name1", "M"),
  (2, "name2", "M"),
  (3, "name3", "M"),
  (4, "name4", "M"),
  (5, "name5", "F"),
  (6, "name6", "F"),
  (7, "name7", "M")
).toDF("id", "Name", "Gender")

val window = Window.partitionBy("Gender")

df.withColumn("count", count($"Gender").over(window))
  .show(false)
Output:
+---+-----+------+-----+
|id |Name |Gender|count|
+---+-----+------+-----+
|5 |name5|F |2 |
|6 |name6|F |2 |
|1 |name1|M |5 |
|2 |name2|M |5 |
|3 |name3|M |5 |
|4 |name4|M |5 |
|7 |name7|M |5 |
+---+-----+------+-----+
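If you want to stay closer to the groupBy you already tried, another option is to aggregate the counts separately and join them back. A hedged sketch against the same df as above:
// Alternative sketch: aggregate counts per Gender, then join them back onto
// the original rows; this yields the same "count" column as the window version.
val counts = df.groupBy("Gender").count()
df.join(counts, Seq("Gender")).show(false)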
I have loaded a parquet file and created a DataFrame as shown below:
----------------------------------------------------------------------
time | data1 | data2
-----------------------------------------------------------------------
1-40 | [ lion-> 34, bear -> 2 ] | [ monkey -> [9,23], goose -> [4,5] ]
So the data type of the data1 column is a string -> integer map, while the data type of the data2 column is a string -> array map.
I want to explode the above dataframe into the structure below:
------------------------
time | key      | val
------------------------
1-40 | lion     | 34
1-40 | bear     | 2
1-40 | monkey_0 | 9
1-40 | monkey_1 | 23
1-40 | goose_0  | 4
1-40 | goose_1  | 5
I tried to convert both data1 and data2 into the same datatype (a string -> array map) using UDFs in PySpark, and then exploded the column as shown below:
from pyspark.sql.functions import udf, explode, col
from pyspark.sql.types import MapType, ArrayType, IntegerType, StringType

def to_map(col1, col2):
    for i in col1.keys():
        col2[i] = [col1[i]]
    return col2

caster = udf(to_map, MapType(StringType(), ArrayType(IntegerType())))
pm_df = pm_df.withColumn("animals", caster('data1', 'data2'))
pm_df.select('time', explode(col('animals')))
I also tried using Hive SQL, assuming Hive SQL performs better than PySpark UDFs.
rdd = spark.sparkContext.parallelize([[datetime.datetime.now(), {'lion': 34, 'bear': 2}, {'monkey': [9, 23], 'goose':[4,5]} ]])
df = rdd.toDF(fields)
df.createOrReplaceTempView("df")
df = spark.sql("select time, explode(data1), data2 from df")
df.createOrReplaceTempView("df")
df = spark.sql("select time,key as animal,value,posexplode(data2) from df").show(truncate=False)
But I am stuck with the result below and don't know how to merge the split columns as per my requirement. The output of the above Hive SQL is:
+--------------------------+------+-----+---+------+-------+
|time |animal|value|pos|key |value |
+--------------------------+------+-----+---+------+-------+
|2019-06-12 19:23:00.169739|bear |2 |0 |goose |[4, 5] |
|2019-06-12 19:23:00.169739|bear |2 |1 |monkey|[9, 23]|
|2019-06-12 19:23:00.169739|lion |34 |0 |goose |[4, 5] |
|2019-06-12 19:23:00.169739|lion |34 |1 |monkey|[9, 23]|
+--------------------------+------+-----+---+------+-------+
I know that Python UDFs incur a lot of overhead for communication between the Python process and the JVM. Is there any way to achieve the expected result above using built-in functions or Hive SQL?
I would process data1 and data2 separately and then union the resultset:
from pyspark.sql import functions as F
df1 = df.select('time', F.explode('data1').alias('key', 'value'))
>>> df1.show()
#+--------------------+----+-----+
#| time| key|value|
#+--------------------+----+-----+
#|2019-06-12 20:19:...|bear| 2|
#|2019-06-12 20:19:...|lion| 34|
#+--------------------+----+-----+
df2 = df.select('time', F.explode('data2').alias('key', 'values')) \
    .select('time', 'key', F.posexplode('values').alias('pos', 'value')) \
    .select('time', F.concat('key', F.lit('_'), 'pos').alias('key'), 'value')
>>> df2.show()
#+--------------------+--------+-----+
#| time| key|value|
#+--------------------+--------+-----+
#|2019-06-12 20:19:...| goose_0| 4|
#|2019-06-12 20:19:...| goose_1| 5|
#|2019-06-12 20:19:...|monkey_0| 9|
#|2019-06-12 20:19:...|monkey_1| 23|
#+--------------------+--------+-----+
df_new = df1.union(df2)
I have a Dataframe like this:
+------+---+
| Name|Age|
+------+---+
|A-2 | 26|
|B-1 | 30|
|C-3 | 20|
+------+---+
scala> p.select("Name", "Age")
res2: org.apache.spark.sql.DataFrame = [Name: string, Age: string]
We can see clearly here that the data in the columns is of type String.
I want to transform the Name column with a split("-")-like method to get only the first part of it (i.e. A, B, C).
But the Column type in Spark doesn't have such a method, so I'm wondering how to get the 'string' inside the Column so I can perform the split operation.
Does anyone know what I should do?
Use the functions.split method:
import org.apache.spark.sql.functions.{col, split}
df.select(split(col("Name"), "-").getItem(0))
The split function is available for Spark dataframes. See the example below.
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

// Creating test data
val df = Seq(
  ("A-2", 26),
  ("B-1", 30),
  ("C-3", 20)
).toDF("name", "age")

df.withColumn("new_name", split(col("name"), "-")(0)).show(false)
+----+---+--------+
|name|age|new_name|
+----+---+--------+
|A-2 |26 |A |
|B-1 |30 |B |
|C-3 |20 |C |
+----+---+--------+
I have a dataframe which I am writing to a Hive table using partitionBy:
val df = Seq(
  ("day1", 1),
  ("day1", 2),
  ("day1", 3),
  ("day1", 4)
).toDF("day", "id")
df.write.partitionBy("day").format("parquet").saveAsTable("mydb.day_table")
spark.sql("select * from mydb.day_table").show
+---+----+
| id| day|
+---+----+
| 1|day1|
| 2|day1|
| 3|day1|
| 4|day1|
+---+----+
spark.sql("show create table mydb.day_table").show(false)
+---------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+---------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `mydb`.`day_table` (`id` INT, `day` STRING)
USING parquet
OPTIONS (
`serialization.format` '1'
)
PARTITIONED BY (day)
|
+---------------------------------------------------------------------------------------------------------------------------------------+
If I create another dataframe and want to append its content to the same table while keeping the partitioning intact, how do I do it? Is this the correct way?
val df2 = Seq(
  ("day2", 5),
  ("day2", 6)
).toDF("day", "id")
df2.write.mode("append").partitionBy("day").format("parquet").saveAsTable("mydb.day_table")
I want the output below, but I want the table to stay partitioned by day even as I keep appending data to the original table.
spark.sql("select * from mydb.day_table").show(false)
+---+----+
|id |day |
+---+----+
|1 |day1|
|2 |day1|
|3 |day1|
|4 |day1|
|5 |day2|
|6 |day2|
+---+----+
I am using Scala with Spark-Shell. Thanks.
You have to use the Spark DataFrame API methods:
1) partitionBy
2) saveAsTable
df.write
  .format("parquet")
  .mode(org.apache.spark.sql.SaveMode.Append)
  .partitionBy("first")
  .saveAsTable("example")
I see this is what you were doing as well, but you did not post any error.
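As a quick sanity check (assuming the table and data from the question), you can list the partitions after the append; both day=day1 and day=day2 should show up:
// Hedged check against the question's table: SHOW PARTITIONS should list
// day=day1 and day=day2 after the second write.
spark.sql("show partitions mydb.day_table").show(false)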