Pyspark DataFrame from given input text - apache-spark-sql

I have a text file that contains data in the format below:
'a:1,b:1,d:1,e:1,h:1'
'a:2,e:2,d:2,h:2,f:2'
'c:3,e:3,d:3,h:3,f:3'
'a:4,b:4,c:4,e:4,h:4,f:4,i:4,j:4'
Expected output:
a|b|c|d|e|f|g|h|i|j
1|1| |1|1| |1|1| |
2| | |2|2|2| |2| |
| |3|3|3|3| |3| |
4|4|4| |4|4| |4|4|4
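A possible approach, sketched under the assumption that every line looks exactly like the samples above (the file path below is hypothetical), is to split each line into key:value pairs, explode them, and pivot the keys into columns:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.text("input.txt")                        # hypothetical path
    .withColumn("id", F.monotonically_increasing_id())  # remember which line a pair came from
    .withColumn("pair", F.explode(F.split(F.regexp_replace("value", "'", ""), ",")))
    .withColumn("key", F.split("pair", ":")[0])
    .withColumn("val", F.split("pair", ":")[1])
)

# Pivot the keys into columns; keys missing from a line become null.
result = df.groupBy("id").pivot("key").agg(F.first("val")).orderBy("id").drop("id")
result.show()

Note that pivot only creates columns for keys that actually occur in the data, so a column such as g from the expected output would have to be added explicitly if no line contains it.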

Related

How to use Window.unboundedPreceding, Window.unboundedFollowing on Distinct datetime

I have data like below
+---+----------------------------+--------+
|Id |DateTime                    |products|
+---+----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000|1       |
|1  |2017-08-24T00:00:00.000+0000|2       |
|1  |2017-08-24T00:00:00.000+0000|3       |
|1  |2016-05-24T00:00:00.000+0000|1       |
+---+----------------------------+--------+
I am using Window.unboundedPreceding and Window.unboundedFollowing as below to get the second most recent datetime:
sorted_times = Window.partitionBy('Id').orderBy(F.col('ModifiedTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('ModifiedTime')).over(sorted_times).getItem(1))
But I get the results below: the second date is taken from the second row, which is the same as the first row.
+---+----------------------------+----------------------------+--------+
|Id |DateTime                    |secondtime                  |Products|
+---+----------------------------+----------------------------+--------+
|1  |2017-08-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|1       |
|1  |2017-08-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|2       |
|1  |2017-08-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|3       |
|1  |2016-05-24T00:00:00.000+0000|2017-08-24T00:00:00.000+0000|1       |
+---+----------------------------+----------------------------+--------+
Please help me find the second most recent datetime among the distinct datetimes.
Thanks in advance
Use collect_set instead of collect_list so there are no duplicates:
df3 = data.withColumn(
    "second_recent",
    F.collect_set(F.col('LastModifiedTime')).over(sorted_times)[1]
)
df3.show(truncate=False)
#+-----+----------------------------+--------+----------------------------+
#|VipId|LastModifiedTime |products|second_recent |
#+-----+----------------------------+--------+----------------------------+
#|1 |2017-08-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|2 |2016-05-24T00:00:00.000+0000|
#|1 |2017-08-24T00:00:00.000+0000|3 |2016-05-24T00:00:00.000+0000|
#|1 |2016-05-24T00:00:00.000+0000|1 |2016-05-24T00:00:00.000+0000|
#+-----+----------------------------+--------+----------------------------+
Another way is to use an unordered window and sort the collected array in descending order before taking the second element:
from pyspark.sql import functions as F, Window

df3 = data.withColumn(
    "second_recent",
    F.sort_array(
        F.collect_set(F.col('LastModifiedTime')).over(Window.partitionBy('VipId')),
        False
    )[1]
)
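To try either snippet end to end, sample data with the column names used in the answer's output can be built like this (a minimal sketch; VipId and LastModifiedTime are the names assumed here, so sorted_times from the question would need to partition and order by them as well):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [(1, "2017-08-24T00:00:00.000+0000", 1),
     (1, "2017-08-24T00:00:00.000+0000", 2),
     (1, "2017-08-24T00:00:00.000+0000", 3),
     (1, "2016-05-24T00:00:00.000+0000", 1)],
    ["VipId", "LastModifiedTime", "products"],
)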

Spark SQL - count number of distinct word in all columns

There is a DataFrame df_titles with one column "title":
+--------------------+
| title|
+--------------------+
| harry_potter_1|
| harry_potter_2|
+--------------------+
I want to count the occurrences of each unique term appearing in the titles, where the terms are delimited by "_", and get something like this:
+--------------------+------+
| term| count|
+--------------------+------+
| harry| 2|
| potter| 2|
| 1| 1|
| 2| 1|
+--------------------+------+
I am thinking of creating a new_df with columns "term" and "count", and for each row in df_titles, splitting the string and inserting [string, 1] into new_df. Then maybe reduce the new df by "term":
val test = Seq.empty[Term].toDF()
df.foreach(spark.sql("INSERT INTO test VALUES (...)"))
...
But I am stuck with the code. How should I proceed? Is there a better way to do this?
You can use Spark built-in functions such as split and explode to transform your dataframe of titles into a dataframe of terms and then do a simple groupBy. Your code should be:
import org.apache.spark.sql.functions.{col, desc, explode, split}

df_titles
  .select(explode(split(col("title"), "_")).as("term"))
  .groupBy("term")
  .count()
  .orderBy(desc("count")) // optional, to have count in descending order
Usually, when you have to perform something over a dataframe, it is better to first try a combination of the Spark built-in functions that you can find in the Spark documentation.
Details
Starting from df_titles:
+--------------+
|title |
+--------------+
|harry_potter_1|
|harry_potter_2|
+--------------+
split creates an array of words separated by _:
+-------------------+
|split(title, _, -1)|
+-------------------+
|[harry, potter, 1] |
|[harry, potter, 2] |
+-------------------+
Then, explode creates one line per item in array created by split:
+------+
|col |
+------+
|harry |
|potter|
|1 |
|harry |
|potter|
|2 |
+------+
.as("term") renames column col to term:
+------+
|term |
+------+
|harry |
|potter|
|1 |
|harry |
|potter|
|2 |
+------+
Then .groupBy("term") with .count() aggregates counting by term, count() is a shortcut for .agg(count("term").as("count"))
+------+-----+
|term |count|
+------+-----+
|harry |2 |
|1 |1 |
|potter|2 |
|2 |1 |
+------+-----+
And finally .orderBy(desc("count")) orders the rows by count in descending order:
+------+-----+
|term |count|
+------+-----+
|harry |2 |
|potter|2 |
|1 |1 |
|2 |1 |
+------+-----+
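For reference, since the neighbouring questions use PySpark, a roughly equivalent PySpark sketch of the same split/explode/groupBy approach (assuming df_titles as above):

from pyspark.sql import functions as F

(df_titles
    .select(F.explode(F.split(F.col("title"), "_")).alias("term"))
    .groupBy("term")
    .count()
    .orderBy(F.desc("count"))  # optional, descending count
    .show())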

Spark SQL group, map reduce

I have the following dataset with the name 'data':
+---------+-------------+------+
| name | subject| mark |
+---------+-------------+------+
| Anna| math| 80|
| Vlad| history| 67|
| Jack| art| 78|
| David| math| 71|
| Monica| art| 65|
| Alex| lit| 59|
| Mark| math| 82|
+---------+-------------+------+
I would like to do a map-reduce job.
The result should look like this or similar:
Anna, David : 1
Anna, Mark : 1
David, Mark : 1
Vlad, None : 1
Jack, Monica : 1
Alex, None : 1
I have tried to do the following:
data_new = data.select('name', 'subject')
data_new.show()
+---------+-------------+
| name | subject|
+---------+-------------+
| Anna| math|
| Vlad| history|
| Jack| art|
| David| math|
| Monica| art|
| Alex| lit|
| Mark| math|
+---------+-------------+
data_new.groupBy('name','subject').count().show(10)
However, this command does not give what I need.
You can do a self left join on the subject, use greatest/least so that each unordered pair of names appears only once, keep the distinct pairs, and add a column of 1.
import pyspark.sql.functions as F

result = df.alias('t1').join(
    df.alias('t2'),
    F.expr("t1.subject = t2.subject and t1.name != t2.name"),
    'left'
).select(
    F.concat_ws(
        ', ',
        F.greatest('t1.name', F.coalesce('t2.name', F.lit('None'))),
        F.least('t1.name', F.coalesce('t2.name', F.lit('None')))
    ).alias('pair')
).distinct().withColumn('val', F.lit(1))
result.show()
+------------+---+
| pair|val|
+------------+---+
| Alex, None| 1|
| Anna, David| 1|
| Anna, Mark| 1|
| None, Vlad| 1|
| David, Mark| 1|
|Jack, Monica| 1|
+------------+---+
The process could be:
Group students with the same subject into an array
Call a udf function to create the permutations of the array items
Add a numeric index column for each subject
Call the explode function to create separate rows for each item in the array
Let's do the steps one by one:
Step 1: Grouping
import pyspark.sql.functions as F
grouped_df = data_new.groupBy('subject').agg(F.collect_set('name').alias('students_array'))
Step 2: udf function
from itertools import permutations
from pyspark.sql import types as T

def permutation(df_col):
    # Every ordering of the collected names, deduplicated and sorted
    return sorted(list(p) for p in set(permutations(df_col)))

permutation_udf = F.udf(permutation, T.ArrayType(T.ArrayType(T.StringType())))
grouped_df = grouped_df.select('*', permutation_udf('students_array').alias('permutations'))
Step 3: Create a numeric index column for each subject
from pyspark.sql import Window

grouped_df = grouped_df.withColumn('subject_no', F.row_number().over(Window.orderBy('subject')))
Step 4: Create separate rows
grouped_df.select(grouped_df.subject_no, F.explode(grouped_df.students_array).alias('name')).show(truncate=False)
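Putting the four steps together into a single runnable block (a sketch; it assumes the data_new DataFrame from the question, and the variable names are illustrative):

from itertools import permutations
from pyspark.sql import functions as F, types as T, Window

# Step 1: one array of student names per subject
grouped_df = data_new.groupBy('subject').agg(F.collect_set('name').alias('students_array'))

# Step 2: udf that builds every ordering of each array
permutation_udf = F.udf(lambda names: sorted(list(p) for p in set(permutations(names))),
                        T.ArrayType(T.ArrayType(T.StringType())))
grouped_df = grouped_df.withColumn('permutations', permutation_udf('students_array'))

# Step 3: numeric index per subject
grouped_df = grouped_df.withColumn('subject_no', F.row_number().over(Window.orderBy('subject')))

# Step 4: one row per student
grouped_df.select('subject_no', F.explode('students_array').alias('name')).show(truncate=False)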

convert row to column in spark

I have data like below. I have to display the year_month column values as columns (column-wise). How should I do this? I am new to Spark.
scala> spark.sql("""select sum(actual_calls_count),year_month from ph_com_b_gbl_dice.dm_rep_customer_call group by year_month""")
res0: org.apache.spark.sql.DataFrame = [sum(actual_calls_count): bigint, year_month: string]
scala> res0.show
+-----------------------+----------+
|sum(actual_calls_count)|year_month|
+-----------------------+----------+
| 1| 2019-10|
| 3693| 2018-10|
| 7| 2019-11|
| 32| 2017-10|
| 94| 2019-03|
| 10527| 2018-06|
| 4774| 2017-05|
| 1279| 2017-11|
| 331982| 2018-03|
| 315767| 2018-02|
| 7097| 2017-03|
| 8| 2017-08|
| 3| 2019-07|
| 3136| 2017-06|
| 6088| 2017-02|
| 6344| 2017-04|
| 223426| 2018-05|
| 9819| 2018-08|
| 1| 2017-07|
| 68| 2019-05|
+-----------------------+----------+
only showing top 20 rows
My output should be like this :-
sum(actual_calls_count)|year_month1 | year_month2 | year_month3 and so on..
scala> df.groupBy(lit(1)).pivot(col("year_month")).agg(concat_ws("",collect_list(col("sum")))).drop("1").show(false)
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|2017-02|2017-03|2017-04|2017-05|2017-06|2017-07|2017-08|2017-10|2017-11|2018-02|2018-03|2018-05|2018-06|2018-08|2018-10|2019-03|2019-05|2019-07|2019-10|2019-11|
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|6088 |7097 |6344 |4774 |3136 |1 |8 |32 |1279 |315767 |331982 |223426 |10527 |9819 |3693 |94 |68 |3 |1 |7 |
+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
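For anyone working from PySpark rather than the Scala shell, a roughly equivalent sketch (the alias total for the aggregated column is an assumption made for readability):

from pyspark.sql import functions as F

df = spark.sql("""
    SELECT sum(actual_calls_count) AS total, year_month
    FROM ph_com_b_gbl_dice.dm_rep_customer_call
    GROUP BY year_month
""")

result = (
    df.groupBy(F.lit(1).alias("g"))  # single dummy group so everything lands in one row
      .pivot("year_month")           # one column per distinct year_month value
      .agg(F.first("total"))         # exactly one value per month
      .drop("g")
)
result.show(truncate=False)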

Storing values of multiples columns in pyspark dataframe under a new column

I am importing data from a CSV file where I have columns Reading1 and Reading2 and storing it in a PySpark dataframe.
My objective is to have a new column named Reading whose value is an array containing the values of Reading1 and Reading2. How can I achieve this in PySpark?
+---+----------+----------+
| id| Reading A| Reading B|
+---+----------+----------+
| 01|     0.123|     0.145|
| 02|     0.546|     0.756|
+---+----------+----------+
Desired Output:
+---+--------------+
| id|       Reading|
+---+--------------+
| 01|[0.123, 0.145]|
| 02|[0.546, 0.756]|
+---+--------------+
Try this:
import pyspark.sql.functions as f
df.withColumn('reading',f.array([f.col("reading a"), f.col("reading b")]))
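A small variant of the same idea, using the column names from the table above (a sketch; note that f.array also accepts columns directly, without the list wrapper):

import pyspark.sql.functions as f

result = df.withColumn('Reading', f.array(f.col('Reading A'), f.col('Reading B')))
result.select('id', 'Reading').show(truncate=False)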