I have a dataframe with n columns of various datatypes. I want an empty dataframe with the same columns/column names. After creating the columns, is there any way I can set the column values to null?
You can achieve it in the following way:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName('stackoverflow') \
    .getOrCreate()
sc = spark.sparkContext

df1 = sc.parallelize([
    (1, 2, 3), (3, 2, 4), (5, 6, 7)
]).toDF(["a", "b", "c"])
df1.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 3|
| 3| 2| 4|
| 5| 6| 7|
+---+---+---+
df2 = df1.select( *[F.lit(None).alias(col) for col in df1.columns])
df2.show()
+----+----+----+
| a| b| c|
+----+----+----+
|null|null|null|
|null|null|null|
|null|null|null|
+----+----+----+
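Note that F.lit(None) makes every column untyped (NULL type) and keeps the same number of rows as df1. If what you actually want is a truly empty dataframe that preserves the original column names and datatypes, a minimal alternative using the standard API is:
# Same schema as df1 (names and datatypes), zero rows
empty_df = spark.createDataFrame([], df1.schema)
# Equivalent shortcut
empty_df = df1.limit(0)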
I have a pyspark dataframe like this.
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.show()
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| a|
| 3| b|
| 4| a|
+---+----+
I group this dataframe by the name column.
df.groupBy("name").count().show()
+----+-----+
|name|count|
+----+-----+
| a| 3|
| b| 1|
+----+-----+
Now, after grouping the dataframe, I am trying to keep only the names whose count is lower than 3. For example, I am looking to get something like this:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+
try this:
from pyspark.sql import functions as F
data = [("1", "a"), ("2", "a"), ("3", "b"), ("4", "a")]
df = spark.createDataFrame(data).toDF(*("id", "name"))
df.groupBy("name").count().where(F.col('count') < 3).show()
F is just an alias for the functions module; you can use any identifier you want, but by convention it is usually written as F or func.
result:
+----+-----+
|name|count|
+----+-----+
| b| 1|
+----+-----+
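If you prefer, the same filter can also be written as a SQL expression string, which gives the same result without referencing F:
df.groupBy("name").count().filter("count < 3").show()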
Currently I have a dataframe like below, and I want to add a new column called product_id.
+---+
| id|
+---+
| 0|
| 1|
+---+
The values for product_id are derived from a List[String](); an example of this list could be:
sampleList = List("A", "B", "C")
For each id in the dataframe, I want to attach every product_id:
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
Is there a way to do this?
You can use the crossJoin method.
import spark.implicits._  // needed for .toDF on local collections (already in scope in spark-shell)

val ls1 = List(0, 1)
val df1 = ls1.toDF("id")
val sampleList = List("A", "B", "C")
val df2 = sampleList.toDF("product_id")
val df = df1.crossJoin(df2)
df.show()
Generation of a sample dataframe & list
val sampleList = List("A", "B", "C")
val df = spark.range(2)
df.show()
+---+
| id|
+---+
| 0|
| 1|
+---+
Solution
import org.apache.spark.sql.functions.{explode,array,lit}
val explode_df = df.withColumn("product_id",explode(array(sampleList map lit: _*)))
explode_df.show()
+---+----------+
| id|product_id|
+---+----------+
| 0| A|
| 0| B|
| 0| C|
| 1| A|
| 1| B|
| 1| C|
+---+----------+
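For reference, a rough PySpark sketch of the same explode idea, assuming the list and ids from above:
from pyspark.sql import functions as F

sample_list = ["A", "B", "C"]
df = spark.range(2)  # ids 0 and 1

# Build an array column of literals and explode it into one row per element
explode_df = df.withColumn(
    "product_id",
    F.explode(F.array(*[F.lit(x) for x in sample_list]))
)
explode_df.show()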
I have a PySpark dataframe
simpleData = [("person0", 1, 1, 0),
              ("person1", 1, 1, 1),
              ("person2", 1, 0, 0),
              ("person3", 0, 0, 0)]
columns = ['persons_name', 'A', 'B', 'C']
exp = spark.createDataFrame(data=simpleData, schema=columns)
exp.show()
exp.show()
It contains only binary values (0 and 1) and looks like this:
+------------+---+---+---+
|persons_name| A| B| C|
+------------+---+---+---+
| person0| 1| 1| 0|
| person1| 1| 1| 1|
| person2| 1| 0| 0|
| person3| 0| 0| 0|
+------------+---+---+---+
We need to initialize the confusion matrix with zeros, like this:
+---+---+---+---+
| | A| B| C|
+---+---+---+---+
| A| 0| 0| 0|
| B| 0| 0| 0|
| C| 0| 0| 0|
+---+---+---+---+
Now I want to populate the confusion matrix in the following way-
For every row in our dataframe exp, I want to increase the counter of the confusion matrix for all the pairs of columns having values = 1 in dataframe.
For example, for person0, there is only 1 pair of columns, A and B, which have value = 1. So we increase the value of the confusion matrix at (A, B) and (B, A).
This would look like-
+---+---+---+---+
| | A| B| C|
+---+---+---+---+
| A| 0| 1| 0|
| B| 1| 0| 0|
| C| 0| 0| 0|
+---+---+---+---+
For person1, there are 3 pairs of columns, (A, B), (A, C) and (B, C), which have value = 1. So we increase the value of the confusion matrix at (A, B), (B, A), (A, C), (C, A), (B, C), and (C, B).
Now the updated confusion matrix would look like-
+---+---+---+---+
| | A| B| C|
+---+---+---+---+
| A| 0| 2| 1|
| B| 2| 0| 1|
| C| 1| 1| 0|
+---+---+---+---+
There are no such pairs for person2 and person3. So we don't update the confusion matrix.
The final confusion matrix would look like-
+---+---+---+---+
| | A| B| C|
+---+---+---+---+
| A| 0| 2| 1|
| B| 2| 0| 1|
| C| 1| 1| 0|
+---+---+---+---+
How can I achieve this in PySpark?
Imagine your original dataframe as a matrix with columns A, B, C: the confusion matrix is the product of that matrix with its own transpose. In other words, the entry at row A, column B is simply the dot product of columns A and B, with the diagonal zeroed out, so you can run a nested loop over the columns and calculate the dot product for every pair.
In general, the number of columns should be small enough to be manageable on the driver, so you can collect the result into a 2D list or numpy array:
import numpy as np
import pyspark.sql.functions as f
cols = ['A', 'B', 'C']
res = np.array([
    [exp.agg(f.sum(f.col(x) * f.col(y))).first()[0] if x != y else 0 for y in cols]
    for x in cols
])
res
#[[0 2 1]
# [2 0 1]
# [1 1 0]]
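A hedged performance note: the nested loop above launches one Spark job per pair of columns. If that becomes slow, all the pairwise sums can be computed in a single aggregation pass; a minimal sketch, assuming the same exp dataframe and cols list:
import numpy as np
import pyspark.sql.functions as f

cols = ['A', 'B', 'C']

# One aggregation covering every off-diagonal pair of columns
aggs = [f.sum(f.col(x) * f.col(y)).alias(f"{x}_{y}")
        for x in cols for y in cols if x != y]
sums = exp.agg(*aggs).first().asDict()

# Rebuild the matrix locally, with zeros on the diagonal
res = np.array([[sums.get(f"{x}_{y}", 0) for y in cols] for x in cols])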
I'm looking for a way to merge two dataframes df1 and df2 without any condition, knowing that df1 and df2 have the same length. For example:
df1:
+--------+
|Index |
+--------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
+--------+
df2:
+--------+
|Value |
+--------+
| a|
| b|
| c|
| d|
| e|
| f|
+--------+
The result must be:
+--------+---------+
|Index | Value |
+--------+---------+
| 0| a|
| 1| b|
| 2| c|
| 3| d|
| 4| e|
| 5| f|
+--------+---------+
Thank you
As you have the same number of rows in both dataframes, you can add a row number to each and join on it:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
# row_number() needs an ordered window; ordering each frame by its own column
# lines the rows up for this example (both are in ascending order)
_w1 = W.orderBy('index')
_w2 = W.orderBy('value')
Df1 = df1.withColumn('rn_no', F.row_number().over(_w1))
Df2 = df2.withColumn('rn_no', F.row_number().over(_w2))
Df_final = Df1.join(Df2, 'rn_no', 'left')
Df_final = Df_final.drop('rn_no')
Here is the solution proposed by @dsk and @anky:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
rnum = F.row_number().over(W.orderBy(F.lit(0)))
Df1 = df1.withColumn('rn_no', rnum)
Df2 = df2.withColumn('rn_no', rnum)
DF = Df1.join(Df2, 'rn_no', 'left')
DF = DF.drop('rn_no')
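A note on the W.orderBy(F.lit(0)) window: with no partitioning it moves every row into a single partition, which can be a bottleneck on large data. A hedged alternative that keeps the existing row order without that shuffle is to number the rows through the RDD API; a sketch, assuming both dataframes really do have the same row count:
def with_row_index(df, col_name='rn_no'):
    # zipWithIndex numbers the rows in their current order across partitions
    return (df.rdd.zipWithIndex()
              .map(lambda pair: pair[0] + (pair[1],))
              .toDF(df.columns + [col_name]))

DF = with_row_index(df1).join(with_row_index(df2), 'rn_no').drop('rn_no')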
I guess this isn't the same as pandas? I would have thought you could simply say:
df_new=pd.DataFrame()
df_new['Index']=df1['Index']
df_new['Value']=df2['Value']
Mind you, it has been a while since I've used pandas.
I have two dataframes.
One is coming from groupBy and the other is the total summary:
a = data.groupBy("bucket").agg(sum(data.total))
b = data.agg(sum(data.total))
I want to put the total from b onto dataframe a so that I can calculate the % for each bucket.
Do you know what kind of join I should use?
Use .crossJoin: the total from b will be added to every row of dataframe a, and then you can calculate the percentage.
Example:
a.crossJoin(b).show()
#+------+----------+----------+
#|bucket|sum(total)|sum(total)|
#+------+----------+----------+
#| c| 4| 10|
#| b| 3| 10|
#| a| 3| 10|
#+------+----------+----------+
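If the duplicate sum(total) column names get in the way, aliasing the aggregates before the crossJoin makes the percentage straightforward; a minimal sketch, assuming the source dataframe is called data with a total column as in the question:
from pyspark.sql import functions as F

a = data.groupBy('bucket').agg(F.sum('total').alias('bucket_total'))
b = data.agg(F.sum('total').alias('grand_total'))

result = (a.crossJoin(b)
           .withColumn('pct', F.col('bucket_total') / F.col('grand_total') * 100))
result.show()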
Instead of crossJoin you can try using window functions, as shown below.
df.show()
#+-----+------+
#|total|bucket|
#+-----+------+
#| 1| a|
#| 2| a|
#| 3| b|
#| 4| c|
#+-----+------+
import sys
from pyspark.sql.functions import col, lit, sum
from pyspark.sql.window import Window

w = Window.partitionBy(col("bucket"))
w1 = Window.orderBy(lit("1")).rowsBetween(-sys.maxsize, sys.maxsize)
df.withColumn("sum_b", sum(col("total")).over(w)) \
  .withColumn("sum_c", sum(col("total")).over(w1)) \
  .show()
#+-----+------+-----+-----+
#|total|bucket|sum_b|sum_c|
#+-----+------+-----+-----+
#| 4| c| 4| 10|
#| 3| b| 3| 10|
#| 1| a| 3| 10|
#| 2| a| 3| 10|
#+-----+------+-----+-----+
You can also use collect(), since you are only returning a simple scalar to the driver:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'A' as bucket, 5 as value union all select 'B' as bucket, 8 as value")
df_total = spark.sql("select 9 as total")
df=df.withColumn('total',lit(df_total.collect()[0]['total']))
+------+-----+-----+
|bucket|value|total|
+------+-----+-----+
| A| 5| 9|
| B| 8| 9|
+------+-----+-----+
df= df.withColumn('pourcentage', col('total') / col('value'))
+------+-----+-----+-----------+
|bucket|value|total|pourcentage|
+------+-----+-----+-----------+
| A| 5| 9| 1.8|
| B| 8| 9| 1.125|
+------+-----+-----+-----------+
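A hedged caveat on the last step (my reading of the question, not part of the original answer): col('total') / col('value') divides the grand total by each bucket's value, which gives the 1.8 and 1.125 above. If the goal is each bucket's share of the total, the division probably needs to go the other way:
df = df.withColumn('pourcentage', col('value') / col('total') * 100)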