I have a dataframe track_log with the following columns:
item  track_info  Date
----------------------
1     ordered     01/01/19
1     Shipped     02/01/19
1     delivered   03/01/19
I want to get the data as:
item  ordered   Shipped   Delivered
-----------------------------------
1     01/01/19  02/01/19  03/01/19
I need to solve this using PySpark.
I can think of a solution like this:
>>> df.show()
+----+----------+--------+
|item|track_info| date|
+----+----------+--------+
| 1| ordered|01/01/19|
| 1| Shipped|02/01/19|
| 1| delivered|03/01/19|
+----+----------+--------+
>>> from pyspark.sql.functions import collect_list
>>> df_grouped = df.groupBy(df.item).agg(collect_list(df.track_info).alias('grouped_data'))
>>> df_grouped_date = df.groupBy(df.item).agg(collect_list(df.date).alias('grouped_dates'))
>>> df_cols = df_grouped.select(df_grouped.grouped_data).first()['grouped_data']
>>> df_cols.insert(0, 'item')  # list.insert mutates in place and returns None, so keep it on its own line
>>> df_grouped_date.select(df_grouped_date.item, df_grouped_date.grouped_dates[0], df_grouped_date.grouped_dates[1], df_grouped_date.grouped_dates[2]).toDF(*df_cols).show()
+----+--------+--------+---------+
|item| ordered| Shipped|delivered|
+----+--------+--------+---------+
| 1|01/01/19|02/01/19| 03/01/19|
+----+--------+--------+---------+
You can use Spark's pivot function to do this as a one-liner, as below:
>>> df.show()
+----+----------+--------+
|item|track_info| date|
+----+----------+--------+
| 1| ordered|01/01/19|
| 1| Shipped|02/01/19|
| 1| delivered|03/01/19|
+----+----------+--------+
>>> from pyspark.sql.functions import collect_list
>>> pivot_df = df.groupBy('item').pivot('track_info').agg(collect_list('date'))
>>> pivot_df.show()
+----+----------+----------+----------+
|item|   ordered|   Shipped| delivered|
+----+----------+----------+----------+
|   1|[01/01/19]|[02/01/19]|[03/01/19]|
+----+----------+----------+----------+
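If you would rather have plain strings than single-element arrays, one option (a small sketch, assuming each item has at most one date per status) is to aggregate with first() instead of collect_list():
>>> from pyspark.sql.functions import first
>>> pivot_df = df.groupBy('item').pivot('track_info').agg(first('date'))
>>> pivot_df.show()
+----+--------+--------+---------+
|item| ordered| Shipped|delivered|
+----+--------+--------+---------+
|   1|01/01/19|02/01/19| 03/01/19|
+----+--------+--------+---------+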
Related
I have multiple dataframes that I need to concatenate together, row-wise. In pandas, we would typically write: pd.concat([df1, df2]).
This thread, How to concatenate/append multiple Spark dataframes column wise in Pyspark?, appears close, but the answer there:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"),(2, "jill"),(3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
yields the following error when run on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with:
from pyspark.sql.functions import expr

df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
Then you can left-join them on that ID:
df1.join(df2, "unique_id", "left").drop("unique_id")
The final output looks like:
+---+-----+-----+----+
| id| name|secNo|city|
+---+-----+-----+----+
|  1|sammy|  101|  LA|
|  2| jill|  102|  CA|
|  3| john|  103|  DC|
+---+-----+-----+----+
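If the (select null) window expression gives trouble on your Spark version, a common alternative (sketched here, not something from the linked thread) is to derive the row index from monotonically_increasing_id(). Note that a window with no partitionBy pulls all rows into a single partition, so it only suits moderately sized frames:
from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Tag each frame with a positional index, then join on it.
w = Window.orderBy("_mid")
df1_idx = df1.withColumn("_mid", monotonically_increasing_id()) \
             .withColumn("unique_id", row_number().over(w)).drop("_mid")
df2_idx = df2.withColumn("_mid", monotonically_increasing_id()) \
             .withColumn("unique_id", row_number().over(w)).drop("_mid")

df1_idx.join(df2_idx, "unique_id", "inner").drop("unique_id").show()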
I have the two dataframes below, with a hash added as an additional column to identify differences for the same id across both dataframes:
df1 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales     |NY   |101|c123
Maria|Finance   |CA   |102|d234
Jen  |Marketing |NY   |103|df34
df2 =
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales1    |null |101|4df2
Maria|Finance   |     |102|5rfg
Jen  |          |NY2  |103|2f34
# identify the unmatched rows for the same id from both dataframes
df1_un_match_indf2 = df1.join(df2, df1.hash == df2.hash, "leftanti")
df2_un_match_indf1 = df2.join(df1, df2.hash == df1.hash, "leftanti")
# The above returns every row from both dataframes, since all hashes for the same id differ
Now I am trying to find the differences in row values for the same id between df1_un_match_indf2 and df2_un_match_indf1, so that the differences show up row by row:
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.join(df4, df3.id == df4.id, "inner")
common_diff.show()
but the result shows the differences like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|name |department|state|id |hash|name |department|state|id |hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|Sales     |NY   |101|c123|James|Sales1    |null |101|4df2|
|Maria|Finance   |CA   |102|d234|Maria|Finance   |     |102|5rfg|
|Jen  |Marketing |NY   |103|df34|Jen  |          |NY2  |103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I tried different ways but couldn't find the right way to produce this expected format.
Can anyone suggest a solution or an idea?
Thanks
What you want is likely collect_list, or maybe collect_set.
This is illustrated well by the following example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
.groupby("id")
.agg(F.collect_set("code"),
F.collect_list("name"))
.show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case, you need to change your join into a union so that you can group the data:
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.union(df4)
(common_diff
.groupby("id")
.agg(F.collect_set("name"),
F.collect_list("department"))
.show())
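To get closer to the expected row-by-row format with every column paired up, a possible sketch (assuming df3 and df4 share the schema name, department, state, id, hash) is to collect each non-key column after the union; note that collect_list skips nulls, so a null state ends up as a one-element array:
from pyspark.sql import functions as F

common_diff = df3.union(df4)
# Collect both versions of every non-key column into an array per id.
agg_exprs = [F.collect_list(c).alias(c) for c in common_diff.columns if c != "id"]
common_diff.groupBy("id").agg(*agg_exprs).show(truncate=False)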
If you can't do a union, you can instead build arrays from the joined (and renamed) columns:
from pyspark.sql.functions import array

# Here common_diff is assumed to be the join result, with the duplicated
# column names renamed to this*/that* beforehand.
common_diff.select(
    common_diff.id,
    array(
        common_diff.thisState,
        common_diff.thatState
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept
    ).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming the columns and using the groupby is likely cleaner and clearer.
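For completeness, here is a sketch of what the join-then-array route could look like with the columns from the question (the that_ prefix is just an illustrative rename, not part of the original code):
from pyspark.sql import functions as F

# Rename df4's columns so the joined result has unambiguous names.
df4_renamed = df4.select([F.col(c).alias("that_" + c) for c in df4.columns])
joined = df3.join(df4_renamed, df3.id == df4_renamed.that_id, "inner")

joined.select(
    F.array("name", "that_name").alias("name"),
    F.array("department", "that_department").alias("department"),
    F.array("state", "that_state").alias("state"),
    F.array("id", "that_id").alias("id"),
    F.array("hash", "that_hash").alias("hash"),
).show(truncate=False)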
I have data like below
|Id |DateTime                     |products|
|---|-----------------------------|--------|
|1  |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |1       |
I am using Window.unboundedPreceding and Window.unboundedFollowing as below to get the second most recent datetime.
from pyspark.sql import functions as F, Window

sorted_times = Window.partitionBy('Id').orderBy(F.col('DateTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('DateTime')).over(sorted_times).getItem(1))
But I get the results below: the second date is taken from the second row, which is the same as the first row.
|Id |DateTime                     |secondtime                   |Products|
|---|-----------------------------|-----------------------------|--------|
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
Please help me find the second-latest datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list to avoid duplicates:
df3 = data.withColumn(
    "second_recent",
    F.collect_set(F.col('DateTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+---+----------------------------+--------+----------------------------+
#|Id |DateTime                    |products|second_recent               |
#+---+----------------------------+--------+----------------------------+
#|1  |2017-08-24T00:00:00.000+0000|1       |2016-05-24T00:00:00.000+0000|
#|1  |2017-08-24T00:00:00.000+0000|2       |2016-05-24T00:00:00.000+0000|
#|1  |2017-08-24T00:00:00.000+0000|3       |2016-05-24T00:00:00.000+0000|
#|1  |2016-05-24T00:00:00.000+0000|1       |2016-05-24T00:00:00.000+0000|
#+---+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the array in descending order before taking the second most recent:
from pyspark.sql import functions as F, Window

df3 = data.withColumn(
    "second_recent",
    F.sort_array(
        F.collect_set(F.col('DateTime')).over(Window.partitionBy('Id')),
        False
    )[1]
)
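If an Id has only one distinct datetime, the [1] lookup simply returns null under Spark's default (non-ANSI) behaviour. If you prefer an explicit guard, a small sketch using the same imports and window could check the array size first:
df3 = data.withColumn(
    'all_dates',
    F.sort_array(F.collect_set('DateTime').over(Window.partitionBy('Id')), False)
).withColumn(
    'second_recent',
    F.when(F.size('all_dates') > 1, F.col('all_dates')[1])
).drop('all_dates')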
I have recently started working with PySpark, so I don't know many of the details yet.
I am trying to create a BinaryType column in a dataframe, but I'm struggling to do it...
For example, let's take a simple df:
df.show(2)
+----+----+
|col1|col2|
+----+----+
| "1"|null|
| "2"|"20"|
+----+----+
Now I want to have a third column "col3" with BinaryType, like:
+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
| "1"|null|[1 null]|
| "2"|"20"|  [2 20]|
+----+----+--------+
How should I do it?
Try this:
from pyspark.sql import functions as F

a = [('1', None), ('2', '20')]
df = spark.createDataFrame(a, ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2| 20|
+----+----+
df = df.withColumn('col3', F.array(['col1', 'col2']))
df.show()
+----+----+-------+
|col1|col2| col3|
+----+----+-------+
| 1|null| [1,]|
| 2| 20|[2, 20]|
+----+----+-------+
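If a genuine BinaryType column is what you are after (rather than an array), one option, sketched here on the same toy dataframe, is to concatenate the strings and encode the result to bytes:
# coalesce() turns nulls into empty strings so rows with a null col2 still produce bytes.
df = df.withColumn(
    'col3_bin',
    F.encode(
        F.concat_ws(',', F.coalesce('col1', F.lit('')), F.coalesce('col2', F.lit(''))),
        'UTF-8'
    )
)
df.printSchema()  # col3_bin shows up as binary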
Assume I have the following dataframe named table_df in PySpark:
sid | date | label
------------------
1033| 20170521 | 0
1033| 20170520 | 0
1033| 20170519 | 1
1033| 20170516 | 0
1033| 20170515 | 0
1033| 20170511 | 1
1033| 20170511 | 0
1033| 20170509 | 0
.....................
The dataframe table_df contains different IDs in different rows; the above is simply one typical case for a single ID.
For each ID and for each date with label 1, I would like to find the date with label 0 that is the closest and before.
For the above table, with ID 1033, date=20170519, label 1, the date of label 0 that is closest and before is 20170516.
And with ID 1033, date=20170511, label 1, the date of label 0 that is closest and before is 20170509 .
So, finally using groupBy and some complicated operations, I will obtain the following table:
sid | filtered_date |
-------------------------
1033| 20170516 |
1033| 20170509 |
-------------
Any help is highly appreciated. I tried but could not find any smart way to do this.
Thanks
We can use a window partitioned by sid and ordered by date, and take the difference between each row's label and the next row's label; a row survives the filter when its own label is 0 and the next date's label is 1:
df.show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170521| 0|
|1033|20170520| 0|
|1033|20170519| 1|
|1033|20170516| 0|
|1033|20170515| 0|
|1033|20170511| 1|
|1033|20170511| 0|
|1033|20170509| 0|
+----+--------+-----+
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('sid').orderBy('date')
df.withColumn('diff',F.lead('label').over(w) - df['label']).where(F.col('diff') == 1).drop('diff').show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170509| 0|
|1033|20170516| 0|
+----+--------+-----+
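To match the shape asked for in the question (sid plus filtered_date), a final select can rename the surviving date column, roughly like this:
result = (
    df.withColumn('diff', F.lead('label').over(w) - F.col('label'))
      .where(F.col('diff') == 1)
      .select('sid', F.col('date').alias('filtered_date'))
)
result.show()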