How to filter by count after groupby in Pyspark dataframe?

I have a pyspark dataframe like this.
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.show()
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| a|
| 3| b|
| 4| a|
+---+----+
I group this dataframe by the name column.
df.groupBy("name").count().show()
+----+-----+
|name|count|
+----+-----+
| a| 3|
| b| 1|
+----+-----+
Now, after grouping the dataframe, I am trying to keep only the names whose count is lower than 3. For example, here I am looking to get something like this:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+

Try this:
from pyspark.sql import functions as F
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.groupBy("name").count().where(F.col('count') < 3).show()
F is just an alias for the functions module; you can use any identifier you want, but it is conventionally written as F or func.
result:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+
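The same condition can also be written as a SQL expression string, or run through spark.sql with GROUP BY ... HAVING. A minimal sketch of both, assuming the same df and an active SparkSession named spark (the view name people is arbitrary):
# filter accepts a SQL expression string; filter and where are interchangeable
df.groupBy("name").count().filter("count < 3").show()
# equivalent query through the SQL interface
df.createOrReplaceTempView("people")
spark.sql("SELECT name, COUNT(*) AS count FROM people GROUP BY name HAVING COUNT(*) < 3").show()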

Related

How to add (explode) a new column from a list to a Spark Dataframe?

Currently I have a dataframe like below, and I want to add a new column called product_id.
+---+
| id|
+---+
| 0|
| 1|
+---+
The values for product_id are derived from a List[String](); an example of this List can be:
sampleList = List(A, B, C)
For each id in the dataframe, I want to add every product_id:
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
Is there a way to do this?
You can use the crossJoin method.
// toDF on a local collection needs the session implicits in scope (already imported in spark-shell)
import spark.implicits._
val ls1 = List(0, 1)
val df1 = ls1.toDF("id")
val sampleList = List("A", "B", "C")
val df2 = sampleList.toDF("product_id")
val df = df1.crossJoin(df2)
df.show()
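For reference, roughly the same crossJoin idea in PySpark, as a sketch that assumes an active SparkSession named spark:
df1 = spark.range(2)  # column "id" with values 0 and 1
df2 = spark.createDataFrame([("A",), ("B",), ("C",)], ["product_id"])
df1.crossJoin(df2).show()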
Generation of a sample dataframe & list
val sampleList = List("A", "B", "C")
val df = spark.range(2)
df.show()
+---+
| id|
+---+
| 0|
| 1|
+---+
Solution
import org.apache.spark.sql.functions.{explode,array,lit}
val explode_df = df.withColumn("product_id",explode(array(sampleList map lit: _*)))
explode_df.show()
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
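And roughly the same explode/array approach in PySpark, again as a sketch that assumes an active SparkSession named spark:
from pyspark.sql import functions as F

sample_list = ["A", "B", "C"]
df = spark.range(2)
# build an array column of literals, then explode it into one row per element
explode_df = df.withColumn("product_id", F.explode(F.array(*[F.lit(x) for x in sample_list])))
explode_df.show()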

How to set all columns of dataframe as null values

I have a dataframe which has n columns of various datatypes.
I want to have an empty dataframe with the same number of columns/column names.
After creating the columns, is there any way I can set the column values to null?
You can achieve it in the following way.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName('stackoverflow') \
    .getOrCreate()
sc = spark.sparkContext

df1 = sc.parallelize([
    (1, 2, 3), (3, 2, 4), (5, 6, 7)
]).toDF(["a", "b", "c"])
df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 3| 2| 4|
| 5| 6| 7|
+---+---+---+
df2 = df1.select( *[F.lit(None).alias(col) for col in df1.columns])
df2.show()
+----+----+----+
| a| b| c|
+----+----+----+
|null|null|null|
|null|null|null|
|null|null|null|
+----+----+----+
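Note that lit(None) gives every new column a null (void) type. If you also want to keep the original datatypes, one possible sketch is to cast each null literal to the matching field type (same df1 and imports as above):
# cast the null literal to each field's original datatype
df2 = df1.select(*[F.lit(None).cast(f.dataType).alias(f.name) for f in df1.schema.fields])
df2.printSchema()
df2.show()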

Merging two dataframes having the same number of columns

I'm looking for a way to merge two dataframes df1 and df2 without any condition, knowing that df1 and df2 have the same length. For example:
df1:
+--------+
|Index |
+--------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
+--------+
df2
+--------+
|Value |
+--------+
| a|
| b|
| c|
| d|
| e|
| f|
+--------+
The result must be:
+--------+---------+
|Index | Value |
+--------+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
+--------+---------+
Thank you
As you have the same number of rows in both dataframes, you can add a row number to each and join on it:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# row_number() needs an ordered window, so order (rather than partition)
# by the single column to give each row a global position
_w1 = W.orderBy('Index')
_w2 = W.orderBy('Value')
Df1 = df1.withColumn('rn_no', F.row_number().over(_w1))
Df2 = df2.withColumn('rn_no', F.row_number().over(_w2))
Df_final = Df1.join(Df2, 'rn_no', 'left')
Df_final = Df_final.drop('rn_no')
Here is the solution proposed by @dsk and @anky:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
rnum = F.row_number().over(W.orderBy(F.lit(0)))
Df1 = df1.withColumn('rn_no', rnum)
Df2 = df2.withColumn('rn_no', rnum)
DF = Df1.join(Df2, 'rn_no', 'left')
DF = DF.drop('rn_no')
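A different sketch, not taken from either answer above: attach a consecutive index with the RDD zipWithIndex method instead of a global window (assumes the df1/df2 from the question; like the window approach, it relies on the existing row order):
from pyspark.sql import Row

def with_index(df):
    # zipWithIndex pairs every Row with its position in the RDD
    return df.rdd.zipWithIndex().map(
        lambda pair: Row(**pair[0].asDict(), rn_no=pair[1])
    ).toDF()

merged = with_index(df1).join(with_index(df2), 'rn_no', 'inner').drop('rn_no')
merged.show()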
I guess this isn't the same as pandas? I would have thought you could simply say:
df_new=pd.DataFrame()
df_new['Index']=df1['Index']
df_new['Value']=df2['Value']
Mind you, it has been a while since I've used pandas.

Spark - query dataframe based on values from a column in another dataframe

I can think of a few wrong ways to do this, but I am trying to find the best performing way to do this. Let me explain:
Table A
id topScore
A 13
B 24
C 15
Table B
id score
A 6
A 3
A 18
A 8
B 8
B 18
B 26
B 12
C 1
C 4
C 20
C 9
I want to be able to get the top score from Table B without exceeding the score for that id in Table A.
The end result should look like this:
Table c
id score
A 8
B 18
C 9
So, I am thinking all I want to do is basically filter the DF of Table B by saying: for each id, get MAX(TableB.score) where score < TableA.topScore. Is that possible to do?
Hope this snippet helps you:
scala> val tableA = spark.sparkContext.parallelize(List(
| ("A",13),
| ("B",24),
| ("C",15))).toDF("id","topScore")
tableA: org.apache.spark.sql.DataFrame = [id: string, topScore: int]
scala> val tableB = spark.sparkContext.parallelize(List(
| ("A",6),
| ("A",3),
| ("A",18),
| ("A",8),
| ("B",8),
| ("B",18),
| ("B",26),
| ("B",12),
| ("C",1),
| ("C",4),
| ("C",20),
| ("C",9))).toDF("id","topScore")
tableB: org.apache.spark.sql.DataFrame = [id: string, topScore: int]
scala> val tableC = tableB.withColumnRenamed("topScore","topScoreB").withColumnRenamed("id","id1")
scala> tableC.show
+---+---------+
|id1|topScoreB|
+---+---------+
| A| 6|
| A| 3|
| A| 18|
| A| 8|
| B| 8|
| B| 18|
| B| 26|
| B| 12|
| C| 1|
| C| 4|
| C| 20|
| C| 9|
+---+---------+
scala> tableA.join(tableC, tableA("id")===tableC("id1"), "left").filter($"topScore" >= $"topScoreB").select("id","topScoreB").groupBy("id").agg(max($"topScoreB")).show
+---+--------------+
| id|max(topScoreB)|
+---+--------------+
| B| 18|
| C| 9|
| A| 8|
+---+--------------+
Another approach using the window functions.
scala> val dfa = Seq(("A","13"),("B","24"),("C","15")).toDF("id","topscore").withColumn("topscore",'topscore.cast("int")).withColumn("c",lit("a"))
dfa: org.apache.spark.sql.DataFrame = [id: string, topscore: int, c: string]
scala> val dfb = Seq(("A","6"), ("A","3"), ("A","18"), ("A","8"), ("B","8"), ("B","18"), ("B","26"), ("B","12"), ("C","1"), ("C","4"), ("C","20"), ("C","9")).toDF("id","score").withColumn("score",'score.cast("int")).withColumn("c",lit("b"))
dfb: org.apache.spark.sql.DataFrame = [id: string, score: int, c: string]
scala> dfa.unionAll(dfb).withColumn("x", rank().over(Window.partitionBy('c, 'id).orderBy('topscore.desc))).filter('c === "b" and 'x === 2).show
+---+--------+---+---+
| id|topscore| c| x|
+---+--------+---+---+
| A| 8| b| 2|
| B| 18| b| 2|
| C| 9| b| 2|
+---+--------+---+---+
Join both tables by "id", filter "tableB" by "tableA.topScore", and then take "max":
val tableA = List(("A", 13), ("B", 24), ("C", 15)).toDF("id", "topScore")
val tableB = List(("A", 6), ("A", 3), ("A", 18), ("A", 8),
("B", 8), ("B", 18), ("B", 26), ("B", 12),
("C", 1), ("C", 4), ("C", 20), ("C", 9)).toDF("id", "topScore")
// action
val result = tableA.alias("a")
.join(tableB.alias("b"), Seq("id"), "left")
.where($"a.topScore" > $"b.topScore" || $"b.topScore".isNull)
.groupBy("a.id").agg(max($"b.topScore").alias("topScore"))
result.show(false)
Output:
+---+--------+
|id |topScore|
+---+--------+
|A |8 |
|B |18 |
|C |9 |
+---+--------+
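For completeness, roughly the same join-filter-aggregate approach in PySpark, as a sketch that assumes an active SparkSession named spark:
from pyspark.sql import functions as F

tableA = spark.createDataFrame([("A", 13), ("B", 24), ("C", 15)], ["id", "topScore"])
tableB = spark.createDataFrame(
    [("A", 6), ("A", 3), ("A", 18), ("A", 8),
     ("B", 8), ("B", 18), ("B", 26), ("B", 12),
     ("C", 1), ("C", 4), ("C", 20), ("C", 9)], ["id", "score"])

result = (tableA.join(tableB, "id", "left")
          .where(F.col("score") < F.col("topScore"))
          .groupBy("id")
          .agg(F.max("score").alias("score")))
result.show()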

Concatenate two dataframes pyspark

I'm trying to concatenate two dataframes, which look like this:
df1:
+---+---+
| a| b|
+---+---+
| a| b|
| 1| 2|
+---+---+
only showing top 2 rows
df2:
+---+---+
| c| d|
+---+---+
| c| d|
| 7| 8|
+---+---+
only showing top 2 rows
They both have the same number of rows, and I would like to do something like:
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| a| b| c| d|
| 1| 2| 7| 8|
+---+---+---+---+
I tried:
df1=df1.withColumn('c', df2.c).collect()
df1=df1.withColumn('d', df2.d).collect()
But without success; it gives me this error:
Traceback (most recent call last):
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o2804.withColumn.
Is there a way to do that?
Thanks
Here is an example of @Suresh's proposal: add a row_number column to each dataframe and join on it.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df1 = sqlctx.createDataFrame([('a', 'b'), ('1', '2')], ['a', 'b']) \
    .withColumn("row_number", F.row_number().over(Window.partitionBy().orderBy("a")))
df2 = sqlctx.createDataFrame([('c', 'd'), ('7', '8')], ['c', 'd']) \
    .withColumn("row_number", F.row_number().over(Window.partitionBy().orderBy("c")))

df3 = df1.join(df2, df1.row_number == df2.row_number, 'inner') \
    .select(df1.a, df1.b, df2.c, df2.d)
df3.show()