Merging two dataframes having the same number of rows - dataframe

I'm looking for a way to merge two dataframes df1 and df2 without any condition, knowing that df1 and df2 have the same length. For example:
df1:
+--------+
|Index |
+--------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
+--------+
df2
+--------+
|Value |
+--------+
| a|
| b|
| c|
| d|
| e|
| f|
+--------+
The result must be:
+--------+---------+
|Index | Value |
+--------+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
+--------+---------+
Thank you

As you have the same number of rows in both the dataframes, you can do the following:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
_w1 = W.orderBy('index')   # row numbers follow the order of the Index column
_w2 = W.orderBy('value')   # row numbers follow the order of the Value column
Df1 = df1.withColumn('rn_no', F.row_number().over(_w1))
Df2 = df2.withColumn('rn_no', F.row_number().over(_w2))
Df_final = Df1.join(Df2, 'rn_no', 'left')
Df_final = Df_final.drop('rn_no')

Here is the solution proposed by @dsk and @anky:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
rnum = F.row_number().over(W.orderBy(F.lit(0)))
Df1 = df1.withColumn('rn_no', rnum)
Df2 = df2.withColumn('rn_no', rnum)
DF = Df1.join(Df2, 'rn_no', 'left')
DF = DF.drop('rn_no')
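For reference, a minimal end-to-end sketch of that approach with the sample data from the question (assuming an active SparkSession named spark). Note that row_number over a constant ordering relies on Spark preserving the original row order, which is not guaranteed for large, repartitioned data:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df1 = spark.createDataFrame([(i,) for i in range(6)], ["Index"])
df2 = spark.createDataFrame([(v,) for v in "abcdef"], ["Value"])

# Assign the same positional row number to both dataframes, then join on it
rnum = F.row_number().over(W.orderBy(F.lit(0)))
result = (df1.withColumn("rn_no", rnum)
             .join(df2.withColumn("rn_no", rnum), "rn_no", "left")
             .drop("rn_no"))
result.show()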

I guess this isn't the same as pandas? I would have thought you could simply say:
df_new=pd.DataFrame()
df_new['Index']=df1['Index']
df_new['Value']=df2['Value']
Mind you, it has been a while since I've used pandas.
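For completeness, a minimal pandas sketch of that idea (hypothetical data), aligning the two frames purely by position with pd.concat along the column axis:
import pandas as pd

df1 = pd.DataFrame({"Index": [0, 1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"Value": ["a", "b", "c", "d", "e", "f"]})

# Reset the indexes so alignment is purely positional, then concatenate column-wise
df_new = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(df_new)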

Related

Spark Dataframe - Create 12 rows for each cell of a master table

I have a table containing Employee IDs and I'd like to add an additional column for Month containing 12 values (1 for each month). I'd like to create a new table where there are 12 rows for each ID in my list.
Take the following example:
+-----+
|GFCID|
+-----+
| 1|
| 2|
| 3|
+-----+
+---------+
|Yearmonth|
+---------+
| 202101|
| 202102|
| 202203|
| 202204|
| 202205|
+---------+
My desired output is something along the lines of:
ID Month
1 Jan
1 Feb
1 March
2 jan
2 March
and so on. I am using pyspark and my current syntax is as follows:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.join(df2, df.GFCID == df2.Yearmonth, "outer")
df3.show()
And the output is
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| null| 202101|
| 3| null|
| null| 202205|
| null| 202102|
| null| 202204|
| 1| null|
| null| 202203|
| 2| null|
+-----+---------+
I understand this is wrong because there is no common key for the dataframes to join on. I would appreciate your help on this.
Here is your code modified to use the proper join, crossJoin:
data = [["1"], ["2"], ["3"]]
df = spark.createDataFrame(data, ["GFCID"])
df.show()
data2 = [["202101"], ["202102"], ["202203"], ["202204"], ["202205"]]
df2 = spark.createDataFrame(data2, ["Yearmonth"])
df2.show()
df3 = df.crossJoin(df2)
df3.show()
+-----+---------+
|GFCID|Yearmonth|
+-----+---------+
| 1| 202101|
| 1| 202102|
| 1| 202203|
| 1| 202204|
| 1| 202205|
| 2| 202101|
| 2| 202102|
| 2| 202203|
| 2| 202204|
| 2| 202205|
| 3| 202101|
| 3| 202102|
| 3| 202203|
| 3| 202204|
| 3| 202205|
+-----+---------+
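If the Month column should actually show month names, as in the desired output sketched above, the Yearmonth string can be parsed with Spark's date functions. A sketch on top of df3 (this assumes every Yearmonth value is a valid yyyyMM string):
from pyspark.sql import functions as F

df3.withColumn(
    "Month",
    F.date_format(F.to_date(F.col("Yearmonth"), "yyyyMM"), "MMM")
).show()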
Another way of doing it without using a join:
from pyspark.sql import functions as F
df2.withColumn("GFCID", F.explode(F.array([F.lit(i) for i in range(1, 13)]))).show()
+---------+-----+
|Yearmonth|GFCID|
+---------+-----+
| 202101| 1|
| 202101| 2|
| 202101| 3|
| 202101| 4|
| 202101| 5|
| 202101| 6|
| 202101| 7|
| 202101| 8|
| 202101| 9|
| 202101| 10|
| 202101| 11|
| 202101| 12|
| 202102| 1|
| 202102| 2|
| 202102| 3|
| 202102| 4|
...
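If the exploded values should come from df2 rather than a hard-coded 1-12 range, the same explode pattern can be applied the other way round. A sketch, assuming df and df2 as created above and that df2 is small enough to collect to the driver:
from pyspark.sql import functions as F

# Collect the Yearmonth values into a Python list (fine for a small lookup table)
months = [r["Yearmonth"] for r in df2.select("Yearmonth").collect()]

# Attach every Yearmonth to every GFCID without a join
df.withColumn("Yearmonth", F.explode(F.array([F.lit(m) for m in months]))).show()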

How to filter by count after groupby in Pyspark dataframe?

I have a pyspark dataframe like this.
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.show()
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| a|
| 3| b|
| 4| a|
+---+----+
I group by this dataframe by name column.
df.groupBy("name").count().show()
+----+-----+
|name|count|
+----+-----+
| a| 3|
| b| 1|
+----+-----+
Now, after I group by the dataframe, I am trying to filter the names whose count is lower than 3. For example, I am looking to get something like this:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+
try this:
from pyspark.sql import functions as F
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.groupBy("name").count().where(F.col('count') < 3).show()
F is an alias for the functions module; you can use any identifier you want, but it is conventionally written as F or func.
result:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+
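If instead you want to keep the original (id, name) rows for the names occurring fewer than 3 times, a window count is one option. A sketch, reusing df from above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

(df.withColumn("cnt", F.count("*").over(W.partitionBy("name")))
   .where(F.col("cnt") < 3)
   .drop("cnt")
   .show())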

How to add (explode) a new column from a list to a Spark Dataframe?

Currently I have a dataframe like below, and I want to add a new column called product_id.
+---+
| id|
+---+
| 0|
| 1|
+---+
The values for product_id are derived from a List[String](); an example of this List can be:
sampleList = List("A", "B", "C")
For each id in the dataframe, I want to add all product_id values:
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
Is there a way to do this?
You can use the crossJoin method.
import spark.implicits._   // needed for toDF on local Scala collections

val ls1 = List(0, 1)
val df1 = ls1.toDF("id")
val sampleList = List("A", "B", "C")
val df2 = sampleList.toDF("product_id")
val df = df1.crossJoin(df2)
df.show()
Generation of a sample dataframe & list
val sampleList = List("A", "B", "C")
val df = spark.range(2)
df.show()
+---+
| id|
+---+
| 0|
| 1|
+---+
Solution
import org.apache.spark.sql.functions.{explode,array,lit}
val explode_df = df.withColumn("product_id",explode(array(sampleList map lit: _*)))
explode_df.show()
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
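For reference, a PySpark equivalent of the explode approach (a sketch, assuming a Python list sample_list and an active SparkSession named spark):
from pyspark.sql import functions as F

sample_list = ["A", "B", "C"]
df = spark.range(2)  # ids 0 and 1

df.withColumn("product_id",
              F.explode(F.array([F.lit(x) for x in sample_list]))).show()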

How to set all columns of dataframe as null values

I have a dataframe which has n columns with all datatypes.
I want to have an empty dataframe with the same number of columns/column names.
After creating the columns, is there any way I can set the column values to null?
You can achieve it in the following way.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
.appName('stackoverflow')\
.getOrCreate()
sc= spark.sparkContext
df1 = sc.parallelize([
(1, 2, 3), (3,2, 4), (5,6, 7)
]).toDF(["a", "b", "c"])
df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 3| 2| 4|
| 5| 6| 7|
+---+---+---+
df2 = df1.select( *[F.lit(None).alias(col) for col in df1.columns])
df2.show()
+----+----+----+
| a| b| c|
+----+----+----+
|null|null|null|
|null|null|null|
|null|null|null|
+----+----+----+
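One caveat: F.lit(None) produces columns of null (void) type. If the original datatypes should be preserved, each literal can be cast back to its column's type. A sketch based on df1 above:
from pyspark.sql import functions as F

df2 = df1.select(*[F.lit(None).cast(f.dataType).alias(f.name)
                   for f in df1.schema.fields])
df2.printSchema()  # same column names and types as df1, all values null
If an empty (zero-row) dataframe with the same schema is enough, spark.createDataFrame([], df1.schema) also works.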

Use list and replace a pyspark column

Suppose I have a list new_id_acc = [6,8,1,2,4] and a PySpark DataFrame like:
id_acc | name |
10 | ABC |
20 | XYZ |
21 | KBC |
34 | RAH |
19 | SPD |
I want to replace the pyspark column id_acc with the new_id_acc values. How can I achieve this?
I tried and found that lit() can be used for a constant value, but I didn't find anything on how to do it for a list.
After replacement I want my PySpark DataFrame to look like this:
id_acc | name |
6 | ABC |
8 | XYZ |
1 | KBC |
2 | RAH |
4 | SPD |
Probably a long answer, but it works.
df = spark.sparkContext.parallelize([(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'ABC'),(19,'SPD')]).toDF(('id_acc', 'name'))
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| ABC|
| 19| SPD|
+------+----+
new_id_acc = [6,8,1,2,4]
indx = ['ABC','XYZ','KBC','ABC','SPD']
from pyspark.sql.types import *
myschema= StructType([ StructField("indx", StringType(), True),StructField("new_id_ac", IntegerType(), True)])
df1=spark.createDataFrame(zip(indx,new_id_acc),schema = myschema)
df1.show()
+----+---------+
|indx|new_id_ac|
+----+---------+
| ABC| 6|
| XYZ| 8|
| KBC| 1|
| ABC| 2|
| SPD| 4|
+----+---------+
dfnew = df.join(df1, df.name == df1.indx,how='left').drop(df1.indx).select('new_id_ac','name').sort('name').dropDuplicates(['new_id_ac'])
dfnew.show()
+---------+----+
|new_id_ac|name|
+---------+----+
| 1| KBC|
| 6| ABC|
| 4| SPD|
| 8| XYZ|
| 2| ABC|
+---------+----+
The idea is to create a column of consecutive serial/row numbers and then use them to get the corresponding values from the list.
# Creating the requisite DataFrame
from pyspark.sql.functions import row_number,lit, udf
from pyspark.sql.window import Window
valuesCol = [(10,'ABC'),(20,'XYZ'),(21,'KBC'),(34,'RAH'),(19,'SPD')]
df = spark.createDataFrame(valuesCol,['id_acc','name'])
df.show()
+------+----+
|id_acc|name|
+------+----+
| 10| ABC|
| 20| XYZ|
| 21| KBC|
| 34| RAH|
| 19| SPD|
+------+----+
You can create row/serial numbers as done here.
Note that 'A' below is just a dummy value, as we don't need to order the values; we just want the row number.
w = Window().orderBy(lit('A'))
df = df.withColumn('serial_number', row_number().over(w))
df.show()
+------+----+-------------+
|id_acc|name|serial_number|
+------+----+-------------+
| 10| ABC| 1|
| 20| XYZ| 2|
| 21| KBC| 3|
| 34| RAH| 4|
| 19| SPD| 5|
+------+----+-------------+
As a final step, we access the elements of the list provided by the OP using the row number. For this we use a udf.
new_id_acc = [6,8,1,2,4]
mapping = udf(lambda x: new_id_acc[x-1])
df = df.withColumn('id_acc', mapping(df.serial_number)).drop('serial_number')
df.show()
+------+----+
|id_acc|name|
+------+----+
| 6| ABC|
| 8| XYZ|
| 1| KBC|
| 2| RAH|
| 4| SPD|
+------+----+
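An alternative that avoids the Python udf is to turn the list itself into a small mapping DataFrame keyed by the same serial number and join. A sketch, assuming df still has the serial_number column from the window step above (i.e. before id_acc was replaced and serial_number dropped):
# Build (serial_number, id_acc) pairs from the list; enumerate starts at 1 to match row_number
mapping_df = spark.createDataFrame(
    list(enumerate(new_id_acc, start=1)), ["serial_number", "id_acc"])

df_mapped = (df.drop("id_acc")
               .join(mapping_df, "serial_number", "left")
               .drop("serial_number"))
df_mapped.show()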