groupby count in pyspark data frame - apache-spark-sql

My data frame looks like -
id age gender category
1 34 m b
1 34 m c
1 34 m b
2 28 f a
2 28 f b
3 23 f c
3 23 f c
3 23 f c
I want my data frame to look like this -
id age gender a b c
1 34 m 0 2 1
2 28 f 1 1 0
3 23 f 0 0 3
I have done -
from pyspark.sql import functions as F
df = df.groupby(['id','age','gender']).pivot('category').agg(F.count('category')).fillna(0)
df.show()
How can I manage this in PySpark? Is there a correct way to do it?

Your code looks fine to me, but when I tried running it, I saw this:
from pyspark.sql.functions import count
df = spark.read.csv('dbfs:/FileStore/tables/txt_sample.txt',header=True,inferSchema=True,sep="\t")
df = df.groupby(['id','age','gender']).pivot('category').agg(count('category')).fillna(0)
df.show()
df:pyspark.sql.dataframe.DataFrame = [id: integer, age: integer ... 5 more fields]
+---+---+------+---+---+---+---+
| id|age|gender| a| b| c| c |
+---+---+------+---+---+---+---+
| 2| 28| f| 1| 1| 0| 0|
| 1| 34| m| 0| 2| 1| 0|
| 3| 23| f| 0| 0| 1| 2|
+---+---+------+---+---+---+---+
This is because of an extra space character after 'c' in the last two rows of the file.
Just trim the spaces using rtrim():
from pyspark.sql.functions import count, rtrim
df = spark.read.csv('dbfs:/FileStore/tables/txt_sample.txt',header=True,inferSchema=True,sep='\t')
df = df.withColumn('Category',rtrim(df['category'])).drop(df['category'])
df = df.groupby(['id','age','gender']).pivot('Category').agg(count('Category')).fillna(0)
df.show()
df:pyspark.sql.dataframe.DataFrame = [id: integer, age: integer ... 4 more fields]
+---+---+------+---+---+---+
| id|age|gender| a| b| c|
+---+---+------+---+---+---+
| 2| 28| f| 1| 1| 0|
| 1| 34| m| 0| 2| 1|
| 3| 23| f| 0| 0| 3|
+---+---+------+---+---+---+
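As a side note, if the stray whitespace comes from the source file itself, a sketch of an alternative (assuming the same file path and tab separator) is to strip it at read time with the CSV reader's ignoreTrailingWhiteSpace option instead of adding an rtrim() column:
from pyspark.sql import functions as F
# strip trailing whitespace while reading, so 'c ' and 'c' land in the same pivot column
df = spark.read.csv('dbfs:/FileStore/tables/txt_sample.txt', header=True,
                    inferSchema=True, sep='\t', ignoreTrailingWhiteSpace=True)
df = df.groupby(['id', 'age', 'gender']).pivot('category').agg(F.count('category')).fillna(0)
df.show()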

Related

PySpark Create Relationship between DataFrame Columns

I am trying to implement logic that builds the relationships between ID and Link, based on the rules below.
Logic -
If id 1 is linked with 2 and 2 is linked with 3, then the relations are 1 -> 2, 1 -> 3, 2 -> 1, 2 -> 3, 3 -> 1, 3 -> 2.
Similarly, if 1 is linked with 4, 4 with 7 and 7 with 5, then the relations are 1 -> 4, 1 -> 5, 1 -> 7, 4 -> 1, 4 -> 5, 4 -> 7, 5 -> 1, 5 -> 4, 5 -> 7, 7 -> 1, 7 -> 4, 7 -> 5.
Input DataFrame -
+---+----+
| id|link|
+---+----+
| 1| 2|
| 3| 1|
| 4| 2|
| 6| 5|
| 9| 7|
| 9| 10|
+---+----+
I am trying to achieve below output-
+---+----+
| Id|Link|
+---+----+
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 3|
| 2| 4|
| 3| 1|
| 3| 2|
| 3| 4|
| 4| 1|
| 4| 2|
| 4| 3|
| 5| 6|
| 6| 5|
| 7| 9|
| 7| 10|
| 9| 7|
| 9| 10|
| 10| 7|
| 10| 9|
+---+----+
I have tried many ways, but none of them work. I have tried the following code as well:
from pyspark.sql.functions import asc
df = spark.createDataFrame([(1, 2), (3, 1), (4, 2), (6, 5), (9, 7), (9, 10)], ["id", "link"])
ids = df.select("Id").distinct().rdd.flatMap(lambda x: x).collect()
links = df.select("Link").distinct().rdd.flatMap(lambda x: x).collect()
combinations = [(id, link) for id in ids for link in links]
df_combinations = spark.createDataFrame(combinations, ["Id", "Link"])
result = df_combinations.join(df, ["Id", "Link"], "left_anti").union(df).dropDuplicates()
result = result.sort(asc("Id"), asc("Link"))
and
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window
df = spark.createDataFrame([(1, 2), (3, 1), (4, 2), (6, 5), (9, 7), (9, 10)], ["id", "link"])
combinations = df.alias("a").crossJoin(df.alias("b")) \
.filter(F.col("a.id") != F.col("b.id"))\
.select(col("a.id").alias("a_id"), col("b.id").alias("b_id"), col("a.link").alias("a_link"), col("b.link").alias("b_link"))
window = Window.partitionBy("a_id").orderBy("a_id", "b_link")
paths = combinations.groupBy("a_id", "b_link") \
.agg(F.first("b_id").over(window).alias("id")) \
.groupBy("id").agg(F.collect_list("b_link").alias("links"))
result = paths.select("id", F.explode("links").alias("link"))
result = result.union(df.selectExpr("id as id_", "link as link_"))
Any help would be much appreciated.
This is not a general approach, but you can use the graphframes package. It can be a struggle to set up, but once it works the solution is simple.
import os
sc.addPyFile(os.path.expanduser('graphframes-0.8.1-spark3.0-s_2.12.jar'))
from graphframes import *
e = df.select('id', 'link').toDF('src', 'dst')
v = e.select('src').toDF('id') \
.union(e.select('dst')) \
.distinct()
g = GraphFrame(v, e)
sc.setCheckpointDir("/tmp/graphframes")
df = g.connectedComponents()
df.join(df.withColumnRenamed('id', 'link'), ['component'], 'inner') \
.drop('component') \
.filter('id != link') \
.show()
+---+----+
| id|link|
+---+----+
| 7| 10|
| 7| 9|
| 3| 2|
| 3| 4|
| 3| 1|
| 5| 6|
| 6| 5|
| 9| 10|
| 9| 7|
| 1| 2|
| 1| 4|
| 1| 3|
| 10| 9|
| 10| 7|
| 4| 2|
| 4| 1|
| 4| 3|
| 2| 4|
| 2| 1|
| 2| 3|
+---+----+
The connectedComponents method returns a component id for each vertex; it is the same for every vertex in a connected group and different across groups (vertices with no connecting edge end up in different components). So within each component you can take the cartesian product of the vertices, excluding each vertex paired with itself.
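To see that last step in isolation, here is a small sketch of the per-component cartesian product, using a made-up (id, component) assignment in place of the real connectedComponents() output:
# toy component assignment of the kind connectedComponents() returns:
# ids 1-4 share component 0, ids 5-6 share component 1
cc = spark.createDataFrame(
    [(1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 1)],
    ['id', 'component'])
# pair every id with every other id in the same component
cc.join(cc.withColumnRenamed('id', 'link'), ['component'], 'inner') \
  .drop('component') \
  .filter('id != link') \
  .orderBy('id', 'link') \
  .show()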
Added answer
Inspired by the above approach, I looked around and found the networkx package.
import networkx as nx
from pyspark.sql import functions as f
df = df.toPandas()
G = nx.from_pandas_edgelist(df, 'id', 'link')
components = [[list(c)] for c in nx.connected_components(G)]
df2 = spark.createDataFrame(components, ['array']) \
.withColumn('component', f.monotonically_increasing_id()) \
.select('component', f.explode('array').alias('id'))
df2.join(df2.withColumnRenamed('id', 'link'), ['component'], 'inner') \
.drop('component') \
.filter('id != link') \
.show()
+---+----+
| id|link|
+---+----+
| 1| 2|
| 1| 3|
| 1| 4|
| 2| 1|
| 2| 3|
| 2| 4|
| 3| 1|
| 3| 2|
| 3| 4|
| 4| 1|
| 4| 2|
| 4| 3|
| 5| 6|
| 6| 5|
| 9| 10|
| 9| 7|
| 10| 9|
| 10| 7|
| 7| 9|
| 7| 10|
+---+----+

Spark Dataframe - Create 12 rows for each cell of a master table

I have a table containing Employee IDs and I'd like to add an additional Month column containing 12 values (one for each month). I'd like to create a new table where there are 12 rows for each ID in my list.
Take the following example:
+-----+
|GFCID|
+-----+
| 1|
| 2|
| 3|
+-----+
+---------+
|Yearmonth|
+---------+
| 202101|
| 202102|
| 202203|
| 202204|
| 202205|
+---------+
My desired output is something on the lines of
ID Month
1 Jan
1 Feb
1 March
2 jan
2 March
and so on. I am using pyspark and my current syntax is as follows:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.join(df2, df.GFCID == df2.Yearmonth, "outer")
df3.show()
And the output is
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| null| 202101|
| 3| null|
| null| 202205|
| null| 202102|
| null| 202204|
| 1| null|
| null| 202203|
| 2| null|
+-----+---------+
I understand this is wrong because there is no common key for the dataframes to join on. I would appreciate your help on this.
Here is your code modified to use the proper join, crossJoin:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.crossJoin(df2)
df3.show()
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| 1| 202101|
| 1| 202102|
| 1| 202203|
| 1| 202204|
| 1| 202205|
| 2| 202101|
| 2| 202102|
| 2| 202203|
| 2| 202204|
| 2| 202205|
| 3| 202101|
| 3| 202102|
| 3| 202203|
| 3| 202204|
| 3| 202205|
+-----+---------+
Another way of doing it without using a join:
from pyspark.sql import functions as F
df2.withColumn("GFCID", F.explode(F.array([F.lit(i) for i in range(1, 13)]))).show()
+---------+-----+
|Yearmonth|GFCID|
+---------+-----+
| 202101| 1|
| 202101| 2|
| 202101| 3|
| 202101| 4|
| 202101| 5|
| 202101| 6|
| 202101| 7|
| 202101| 8|
| 202101| 9|
| 202101| 10|
| 202101| 11|
| 202101| 12|
| 202102| 1|
| 202102| 2|
| 202102| 3|
| 202102| 4|
...
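If what you ultimately want is twelve month-name rows per GFCID, as in the desired output sketched in the question, the same explode idea works with a list of literal month labels (a sketch; the exact label spellings are an assumption):
from pyspark.sql import functions as F
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# one row per (GFCID, month name): explode a literal array of the 12 labels
df.withColumn('Month', F.explode(F.array([F.lit(m) for m in months]))) \
  .select('GFCID', 'Month') \
  .show(36, truncate=False)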

Pyspark crossJoin with specific condition

The crossJoin of two dataframes of 5 rows each gives a dataframe of 25 rows (5*5).
What I want is to do a crossJoin that is "not full".
For example:
df1: df2:
+-----+ +-----+
|index| |value|
+-----+ +-----+
| 0| | A|
| 1| | B|
| 2| | C|
| 3| | D|
| 4| | E|
+-----+ +-----+
The result must be a dataframe with fewer than 25 rows, where each index row is paired with a randomly chosen number of value rows.
It would be something like this:
+-----+-----+
|index|value|
+-----+-----+
| 0| D|
| 0| A|
| 1| A|
| 1| D|
| 1| B|
| 1| C|
| 2| A|
| 2| E|
| 3| D|
| 4| A|
| 4| B|
| 4| E|
+-----+-----+
Thank you
You can try sample(withReplacement, fraction, seed=None) to reduce the number of rows after the cross join.
Example:
spark.sql("set spark.sql.crossJoin.enabled=true")
df.join(df1).sample(False,0.6).show()
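For completeness, a self-contained sketch of the same idea (the column names, the 0.6 fraction and the seed are assumptions); calling crossJoin() explicitly also avoids having to enable spark.sql.crossJoin.enabled:
df = spark.createDataFrame([(i,) for i in range(5)], ['index'])
df1 = spark.createDataFrame([(v,) for v in ['A', 'B', 'C', 'D', 'E']], ['value'])
# build all 25 pairs, then randomly keep roughly 60% of them, so each index
# ends up paired with a random subset of the values
result = df.crossJoin(df1).sample(withReplacement=False, fraction=0.6, seed=42)
result.orderBy('index', 'value').show()
Note that sample() is purely random, so a given index can end up with anywhere from zero to five values.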

Map Spark DF to (row_number, column_number, value) format

I have a Dataframe in the following shape
1 2
5 9
How can I convert it to (row_num, col_num, value) format?
0 0 1
0 1 2
1 0 5
1 1 9
Is there any way to apply some function or any mapper?
Thanks in advance
Check the code below.
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> val colExpr = array(df.columns.zipWithIndex.map(c => struct(lit(c._2).as("col_name"),col(c._1).as("value"))):_*)
colExpr: org.apache.spark.sql.Column = array(named_struct(col_name, 0 AS `col_name`, NamePlaceholder(), a AS `value`), named_struct(col_name, 1 AS `col_name`, NamePlaceholder(), b AS `value`))
scala> df.withColumn("row_number",lit(row_number().over(Window.orderBy(lit(1)))-1)).withColumn("data",explode(colExpr)).select($"row_number",$"data.*").show(false)
+----------+--------+-----+
|row_number|col_name|value|
+----------+--------+-----+
|0 |0 |1 |
|0 |1 |2 |
|1 |0 |5 |
|1 |1 |9 |
+----------+--------+-----+
You can do it by transposing the data as:
from pyspark.sql.functions import *
from pyspark.sql import Window
df = spark.createDataFrame([(1,2),(5,9)],['col1','col2'])
# rename the columns to their positional index: '0', '1', ...
df = df.toDF(*[str(i) for i in range(len(df.columns))])
# transpose the dataframe with stack(): one (col_id, col_value) pair per column
col_list = ','.join([f'{i},`{i}`' for i in df.columns])
n_cols = len(df.columns)
df.withColumn('row_id', lit(row_number().over(Window.orderBy(lit(1))) - 1)) \
  .select('row_id', expr(f'stack({n_cols},{col_list}) as (col_id,col_value)')).show()
+------+------+---------+
|row_id|col_id|col_value|
+------+------+---------+
| 0| 0| 1|
| 0| 1| 2|
| 1| 0| 5|
| 1| 1| 9|
+------+------+---------+
In pyspark, row_number() and posexplode() will be helpful. Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
tst= sqlContext.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)],schema=['col1','col2','col3'])
tst1= tst.withColumn("row_number",F.row_number().over(Window.orderBy(F.lit(1)))-1)
#%%
tst_arr = tst1.withColumn("arr",F.array(tst.columns))
tst_new = tst_arr.select('row_number','arr').select('row_number',F.posexplode('arr'))
results:
In [47]: tst_new.show()
+----------+---+---+
|row_number|pos|col|
+----------+---+---+
| 0| 0| 1|
| 0| 1| 7|
| 0| 2| 80|
| 1| 0| 1|
| 1| 1| 8|
| 1| 2| 40|
| 2| 0| 1|
| 2| 1| 5|
| 2| 2|100|
| 3| 0| 5|
| 3| 1| 8|
| 3| 2| 90|
| 4| 0| 7|
| 4| 1| 6|
| 4| 2| 50|
| 5| 0| 0|
| 5| 1| 3|
| 5| 2| 60|
+----------+---+---+

spark sql spark.range(7).select('*,'id % 3 as "bucket").show // how to understand ('*,'id % 3 as "bucket")

spark.range(7).select('*,'id % 3 as "bucket").show
// result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
spark.range(7).withColumn("bucket",$"id" % 3).show
///result:
+---+------+
| id|bucket|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 0|
| 4| 1|
| 5| 2|
| 6| 0|
+---+------+
I want to know how to read '* and the whole select statement.
Are the two ways above equivalent?
spark.range(7).select('*,'id % 3 as "bucket").show
spark.range(7).select($"*",$"id" % 3 as "bucket").show
spark.range(7).select(col("*"),col("id") % 3 as "bucket").show
val df = spark.range(7)
df.select(df("*"),df("id") % 3 as "bucket").show
These four ways are equivalent. With spark.implicits._ in scope, the Symbol literal 'id and the interpolated string $"id" are both implicitly converted to a Column, the same Column that col("id") or df("id") gives you; '* (like $"*" or col("*")) expands to all existing columns, so each query keeps id and adds id % 3 aliased as "bucket".
// https://spark.apache.org/docs/2.4.4/api/scala/index.html#org.apache.spark.sql.Column