I have specific problem, where I want to retrieve the value of bu_id field from id and matched_ id.
When there is some value in matched_id column, bu_id should be same as the id for that particular id and ids of corresponding matched_id.
When matched_id is blank, bu_id should be same as id.
input
+---+------------+
|id |matched_id |
+---+------------+
|0 |7,8 |
|1 | |
|2 |4 |
|3 |5,9 |
|4 |2 |
|5 |3,9 |
|6 | |
|7 |0,8 |
|8 |0,7 |
|9 |3,5 |
output
+---+------------+-----+
|id |matched_id |bu_id|
+---+------------+-----+
|0 |7,8 |0 |
|1 | |1 |
|2 |4 |2 |
|3 |5,9 |3 |
|4 |2 |2 |
|5 |3,9 |3 |
|6 | |6 |
|7 |0,8 |0 |
|8 |0,7 |0 |
|9 |3,5 |3 |
Can anyone help me how to approach this problem. Thanks in advance.
We should try to use functions exclusively from the pyspark.sql.functions module because these are optimized for pyspark dataframes (see here), whereas udfs are not and should be avoided when possible.
To achieve the desired output pyspark dataframe, we can concatenate both "id" and "matched_id" columns together, convert the string that into a list of strings using split, cast the result as an array of integers, and take the minimum of the array – and we can get away with not having to worry about the blank strings because they get converted into null, and F.array_min drops nulls from consideration. This can be done with the following line of code (and while it is a little hard to read, it gets the job done):
import pyspark.sql.functions as F
df = spark.createDataFrame(
[
("0", "7,8"),
("1", ""),
("2", "4"),
("3", "5,9"),
("4", "2"),
("5", "3,9"),
("6", ""),
("7", "0,8"),
("8", "0,7"),
("9", "3,5"),
],
["id", "matched_id"]
)
df.withColumn(
"bu_id",
F.array_min(F.split(F.concat(F.col("id"),F.lit(","),F.col("matched_id")),",").cast("array<int>"))
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
+---+----------+-----+
Update: in the case of non-numeric strings in columns "id" and "matched_id", we can no longer cast to an array of integers, so we can instead use pyspark functions F.when and .otherwise (see here) to set our new column to the "id" column when "matched_id" is an empty string "", and apply our other longer nested function when "matched_id" is non-empty.
df2 = spark.createDataFrame(
[
("0", "7,8"),
("1", ""),
("2", "4"),
("3", "5,9"),
("4", "2"),
("5", "3,9"),
("6", ""),
("7", "0,8"),
("8", "0,7"),
("9", "3,5"),
("x", ""),
("x", "y,z")
],
["id", "matched_id"]
)
df2.withColumn(
"bu_id",
F.when(F.col("matched_id") != "", F.array_min(F.split(F.concat(F.col("id"),F.lit(","),F.col("matched_id")),","))).otherwise(
F.col("id")
)
).show()
Output:
+---+----------+-----+
| id|matched_id|bu_id|
+---+----------+-----+
| 0| 7,8| 0|
| 1| | 1|
| 2| 4| 2|
| 3| 5,9| 3|
| 4| 2| 2|
| 5| 3,9| 3|
| 6| | 6|
| 7| 0,8| 0|
| 8| 0,7| 0|
| 9| 3,5| 3|
| x| | x|
| x| y,z| x|
+---+----------+-----+
To answer this question I assumed that the logic you are looking to implement is,
If the matched_id column is null, then bu_id should be the same as id.
If the matched_id column is not null, we should consider the values listed in both the id and matched_id columns and bu_id should be the minimum of those values.
The Set-Up
# imports to include
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
# making your dataframe
df = spark.createDataFrame(
[
('0','7,8'),
('1',''),
('2','4'),
('3','5,9'),
('4','2'),
('5','3,9'),
('6',''),
('7','0,8'),
('8','0,7'),
('9','3,5'),
],
['id', 'matched_id'])
print(df.schema.fields)
df.show(truncate=False)
In this df, both the id and matched_id columns are StringType data types. The code that follows builds-off this assumption. You can check the column types in your df by running print(df.schema.fields)
id
matched_id
0
7,8
1
2
4
3
5,9
4
2
5
3,9
6
7
0,8
8
0,7
9
3,5
The Logic
To implement the logic for bu_id, we created a function called bu_calculation that defines the logic. Then we wrap the function in pyspark sql UDF. The bu_id column is then created by inputing the columns we need to evaluate (the id and matched_id columns) into the UDF
# create custom function with the logic for bu_id
def bu_calculation(id_col, matched_id_col):
id_int = int(id_col)
# turn the string in the matched_id column into a list and remove empty values from the list
matched_id_list = list(filter(None, matched_id_col.split(",")))
if len(matched_id_list) > 0:
# if matched_id column has values, convert strings to ints
all_ids = [int(x) for x in matched_id_list]
# join id column values with matched_id column values
all_ids.append(id_int)
# return minimum value
return min(all_ids)
else:
# if matched_id column is empty return the id column value
return id_int
# apply custom bu_calculation function to pyspark sql udf
# the use of IntegerType() here enforces that the bu_calculation function has to return an int
bu_udf = F.udf(bu_calculation, IntegerType())
# make a new column called bu_id using the pyspark sql udf we created called bu_udf
df = df.withColumn('bu_id', bu_udf('id', 'matched_id'))
df.show(truncate=False)
id
matched_id
bu_id
0
7,8
0
1
1
2
4
2
3
5,9
3
4
2
2
5
3,9
3
6
6
7
0,8
0
8
0,7
0
9
3,5
3
More about the pyspark sql udf function here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.udf.html
I have bellow two data frame with hash added as additional column to identify differences for same id from both data frame
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
#identify unmatched row for same id from both data frame
df1_un_match_indf2=df1.join(df2,df1.hash==df2.hash,"leftanti")
df2_un_match_indf1=df2.join(df1,df2.hash==df1.hash,"leftanti")
#The above case list both data frame, since all hash for same id are different
Now i am trying to find difference of row value against the same id from 'df1_un_match_indf1,df2_un_match_indf1' data frame, so that it shows differences row by row
df3=df1_un_match_indf1
df4=df2_un_match_indf1
common_diff=df3.join(df4,df3.id==df4.id,"inner")
common_dff.show()
but result show difference like this
+--------+----------+-----+----+-----+-----------+-------+---+---+----+
|name |department|state|id |hash |name | department|state| id|hash
+--------+----------+-----+----+-----+-----+-----------+-----+---+-----+
|James |Sales |NY |101 | c123|James| Sales1 |null |101| 4df2
|Maria |Finance |CA |102 | d234|Maria| Finance | |102| 5rfg
|Jen |Marketing |NY |103 | df34|Jen | |NY2 |103| 2f34
What i am expecting is
+-----------------------------------------------------------+-----+--------------+
|name | department | state | id | hash
['James','James']|['Sales','Sales'] |['NY',null] |['101','101']|['c123','4df2']
['Maria','Maria']|['Finance','Finance']|['CA',''] |['102','102']|['d234','5rfg']
['Jen','Jen'] |['Marketing',''] |['NY','NY2']|['102','103']|['df34','2f34']
I tried with different ways, but didn't find right solution to make this expected format
Can anyone give a solution or idea to this?
Thanks
What you want to use is likely collect_list or maybe 'collect_set'
This is really well described here:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
.groupby("id")
.agg(F.collect_set("code"),
F.collect_list("name"))
.show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3=df1_un_match_indf1
df4=df2_un_match_indf1
common_diff = df3.union(df4)
(common_diff
.groupby("id")
.agg(F.collect_set("name"),
F.collect_list("department"))
.show())
If you can do a union just use an array:
from pyspark.sql.functions import array
common_diff.select(
df.id,
array(
common_diff.thisState,
common_diff.thatState
).alias("State"),
array(
common_diff.thisDept,
common_diff.thatDept
).alias("Department")
)
It a lot more typing and a little more fragile. I suggest that renaming columns and using the groupby is likely cleaner and clearer.
I have a column of lists in a spark dataframe.
+-----+----------+
|c1 | c2 |
+-----+----------+
|a |[1, 0, 1, 1] |
|b |[0, 1, 1, 0] |
|c |[1, 1, 0, 0] |
+-----+----------+
How do I convert this into another spark dataframe where each list is turned into a dataframe column? Also each entry from column 'c1' is the name of the new column created. Something like below.
+--------+
|a| b | c|
+--------+
|1 |0| 1 |
|0 |0| 1 |
|1 |1| 0 |
|1 |0| 0 |
+--------+
Note: I did think about following this: Convert Column of List to Dataframe and then taking a transpose of the resultant matrix. But, this creates quite a lot of columns [as the size of the list data I have is pretty huge] and therefore isn't an efficient solution.
Any help is welcome.
import pyspark.sql.functions as F
#Not a part of the solution, only used to generate the data sample
df = spark.sql("select stack(3 ,'a',array(1, 0, 1, 1), 'b',array(0, 1, 1, 0) ,'c',array(1, 1, 0, 0)) as (c1,c2)")
df.groupBy().pivot('c1').agg(F.first('c2')).selectExpr('inline(arrays_zip(*))').show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 0| 1|
| 0| 1| 1|
| 1| 1| 0|
| 1| 0| 0|
+---+---+---+
This can be easily tested for large datasets
df = sql("select id as c1, transform(sequence(1,10000), e -> tinyint(round(rand()))) as c2 from range(10000)")
Just completed a succesfull execution of 10K arrays, 10K elements each, on a VM with 4 cores & 32 GB RAM (Azure Databricks).
Took 5.35 minutes.
Hello I have nested json files with size of 400 megabytes with 200k records.I created a solution using pyspark to parse the file and store in a customized dataframe , but it takes about 5-7 minutes to do this operation which is very slow.
Here is an example of a json file (small one but with same structure as the large ones) :
{"status":"success",
"data":{"resultType":"matrix","result":
[{"metric":{"data0":"T" ,"data1":"O"},"values":[[90,"0"],[80, "0"]]},
{"metric":{"data0":"K" ,"data1":"S"},"values":[[70,"0"],[60, "0"]]},
{"metric":{"data2":"J" ,"data3":"O"},"values":[[50,"0"],[40, "0"]]}]}}
Here is the structure of the output dataframe I want :
time | value |data0 | data1 | data2 | data3
90 | "0" | "T"| "O"| nan | nan
80 | "0" | "T"| "O"| nan | nan
70 | "0" | "K"| "S"| nan | nan
60 | "0" | "K"| "S"| nan | nan
50 | "0" | nan| nan| "J" | "O"
40 | "0" | nan| nan| "J" | "O"
and this is the pyspark code I used to on the large file to produce the structure of the dataframe listed above:
from datetime import datetime
import json
import rapidjson
import pyspark.sql.functions as F
from pyspark.sql.types import StructType
from util import schema ,meta_date
new_schema = StructType.fromJson(json.loads(schema))
with open("largefile.json", "r") as json_file:
result_count = len(rapidjson.load(json_file)["data"]["result"])
spark = SparkSession.builder.master("spark://IP").getOrCreate()
conf = spark.sparkContext._conf.setAll([('spark.executor.memory', '5g'),
('spark.executor.cores', '4'),
('spark.driver.memory', '4g'),
])
spark.sparkContext.stop()
spark = SparkSession.builder.config(conf=conf).getOrCreate()
df = spark.read.json("largefile.json")
for data_name in meta_date:
df = df.withColumn(
data_name, F.expr(f"transform(data.result, x -> x.metric.{data_name})")
)
df = (
df.withColumn("values", F.expr("transform(data.result, x -> x.values)"))
.withColumn("items", F.array(*[F.lit(x) for x in range(0, result_count)]))
.withColumn("items", F.explode(F.col("items")))
)
for data_name in meta_date:
df = df.withColumn(data_name, F.col(data_name).getItem(F.col("items")))
df = (df.withColumn("values", F.col("values").getItem(F.col("items")))
.withColumn("values", F.explode("values"))
.withColumn("time", F.col("values").getItem(0))
.withColumn("value", F.col("values").getItem(1))
.drop("data", "status", "items", "values")).show()
My machine has 4 cores (8 logical cores) and memory 16 gb . I'm using the standalone mode with cluster of master and 2 worker nodes.
Any help on how to speed up this process either by editing the cluster configurations or refactoring the transformations in the code?
What about this? Read json, select columns with explode and it looks like match with your desired result.
df.select(f.explode('data.result').alias('result')) \
.select('result.metric.*', f.explode('result.values').alias('values')) \
.withColumn('time', f.col('values')[0]) \
.withColumn('value', f.col('values')[1]) \
.drop('values') \
.show(truncate=False)
+-----+-----+-----+-----+----+-----+
|data0|data1|data2|data3|time|value|
+-----+-----+-----+-----+----+-----+
|T |O |null |null |90 |0 |
|T |O |null |null |80 |0 |
|K |S |null |null |70 |0 |
|K |S |null |null |60 |0 |
|null |null |J |O |50 |0 |
|null |null |J |O |40 |0 |
+-----+-----+-----+-----+----+-----+
Suppose that we have a csv file which has been imported as a dataframe in PysPark as follows
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("file path and name.csv", inferSchema = True, header = True)
df.show()
output
+-----+----+----+
|lable|year|val |
+-----+----+----+
| A|2003| 5.0|
| A|2003| 6.0|
| A|2003| 3.0|
| A|2004|null|
| B|2000| 2.0|
| B|2000|null|
| B|2009| 1.0|
| B|2000| 6.0|
| B|2009| 6.0|
+-----+----+----+
Now, we want to add another column to df which contains the standard deviation of val based on the grouping on two columns lable and year. So, the output must be as follows:
+-----+----+----+-----+
|lable|year|val | std |
+-----+----+----+-----+
| A|2003| 5.0| 1.53|
| A|2003| 6.0| 1.53|
| A|2003| 3.0| 1.53|
| A|2004|null| null|
| B|2000| 2.0| 2.83|
| B|2000|null| 2.83|
| B|2009| 1.0| 3.54|
| B|2000| 6.0| 2.83|
| B|2009| 6.0| 3.54|
+-----+----+----+-----+
I have the following codes which works for a small dataframe but it does not work for a very large dataframe (with about 40 million rows) which I am working with now.
import pyspark.sql.functions as f
a = df.groupby('lable','year').agg(f.round(f.stddev("val"),2).alias('std'))
df = df.join(a, on = ['lable', 'year'], how = 'inner')
I get Py4JJavaError Traceback (most recent call last) error after running on my large dataframe.
Does anyone knows any alternative way? I hope your way works on my dataset.
I am using python3.7.1, pyspark2.4, and jupyter4.4.0
The join on dataframe causes a lot of data shuffle between executors. In your case, you can do without the join.
Use a window specification to partition data by 'lable' and 'year' and aggregate on the window.
from pyspark.sql.window import *
windowSpec = Window.partitionBy('lable','year')\
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("std", f.round(f.stddev("val").over(windowSpec), 2))