How to pivot by value in PySpark (like pandas)

Here's my input
+----+-----+---+------+----+------+-------+--------+
|year|month|day|new_ts|hour|minute|ts_rank|   label|
+----+-----+---+------+----+------+-------+--------+
|2022|    1|  1|    13|  13|    24|      1|       7|
|2022|    1|  1|    14|  13|    24|      1|       8|
|2022|    1|  2|    15|  13|    24|      1|       7|
|2022|    1|  2|    16|  13|    44|      7|       8|
+----+-----+---+------+----+------+-------+--------+
Here's my output
+----+-----+---+-------+--------+
|year|month|day|      7|       8|
+----+-----+---+-------+--------+
|2022|    1|  1|     13|      14|
|2022|    1|  2|     15|      16|
+----+-----+---+-------+--------+
Here's the pandas code
df_pivot = df.pivot(index=["year","month","day"], columns="label", values="new_ts").reset_index()
What I tried
df_pivot = df.groupBy(["year","month","day"]).pivot("label").value("new_ts")
Note: sorry, I can't show my error message here, because I'm using a cloud solution and it only shows the line of the error, not the error message.

pivot() must be followed by an aggregation (there is no value() method); since each (year, month, day, label) combination has a single new_ts here, first() does the job:

from pyspark.sql.functions import first

df.groupBy("year","month","day").pivot('label').agg(first('new_ts')).show()
+----+-----+---+---+---+
|year|month|day|  7|  8|
+----+-----+---+---+---+
|2022|    1|  1| 13| 14|
|2022|    1|  2| 15| 16|
+----+-----+---+---+---+
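If the distinct label values are known up front, they can optionally be passed to pivot() explicitly, which spares Spark the extra job it otherwise runs to discover them. A minimal sketch, assuming the only labels are 7 and 8 as in the sample:

from pyspark.sql import functions as F

df_pivot = (df.groupBy("year", "month", "day")
              .pivot("label", [7, 8])   # explicit pivot values: skips the distinct-values scan
              .agg(F.first("new_ts")))
df_pivot.show()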

Related

Assign max+1 to the sequence field if a registerNumber record set reappears

I have a dataframe like the one below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 2|
| 11| 11121000| 13| 2|
| 13| 11121000| 14| 2|
| 14| 11121000| 15| 2|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Each record set starts with recType equal to 01 and ends with recType equal to either 13 or 14.
In some cases there are duplicate registerNumber values, which assign a duplicate sequence value to a record set.
In the given dataframe, the registerNumber value 11121000 is duplicated.
I want to assign a new sequence value to the second occurrence of the duplicate registerNumber 11121000, so the output dataframe should look as below:
+-------+--------------+----+-------------+
|recType|registerNumber|mnId| sequence|
+-------+--------------+----+-------------+
| 01| 13578999| 0| 1|
| 11| 13578999| 1| 1|
| 13| 13578999| 2| 1|
| 14| 13578999| 3| 1|
| 14| 13578999| 4| 1|
| 01| 11121000| 5| 2|
| 11| 11121000| 6| 2|
| 13| 11121000| 7| 2|
| 14| 11121000| 8| 2|
| 01| OC387520| 9| 3|
| 11| OC387520| 10| 3|
| 13| OC387520| 11| 3|
| 01| 11121000| 12| 6|
| 11| 11121000| 13| 6|
| 13| 11121000| 14| 6|
| 14| 11121000| 15| 6|
| 01| OC321000| 16| 4|
| 11| OC321000| 17| 4|
| 13| OC321000| 18| 4|
| 01| OC522000| 19| 5|
| 11| OC522000| 20| 5|
| 13| OC522000| 21| 5|
+-------+--------------+----+-------------+
Please guide me on how to approach this problem.
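One possible approach (a sketch, not from the original post), assuming mnId defines the row order and recType is stored as a string (compare to 1 instead of "01" if it is numeric):

from pyspark.sql import functions as F
from pyspark.sql import Window

w_order = Window.orderBy("mnId")  # single partition; fine for a sketch, costly at scale

# 1. Every record set starts at recType == "01", so a running sum of those markers
#    gives each set its own id.
with_set = (df
    .withColumn("is_start", F.when(F.col("recType") == "01", 1).otherwise(0))
    .withColumn("set_id", F.sum("is_start").over(w_order)))

# 2. One row per set, ranked by how many times its registerNumber has appeared so far.
sets = (with_set
    .select("set_id", "registerNumber", "sequence").distinct()
    .withColumn("occurrence",
                F.row_number().over(
                    Window.partitionBy("registerNumber").orderBy("set_id"))))

# 3. First occurrences keep their sequence; later ones get max(sequence) + 1, + 2, ...
max_seq = with_set.agg(F.max("sequence")).first()[0]
firsts = sets.filter("occurrence = 1").withColumn("new_sequence", F.col("sequence"))
dups = (sets.filter("occurrence > 1")
            .withColumn("new_sequence",
                        F.lit(max_seq) + F.row_number().over(Window.orderBy("set_id"))))

mapping = firsts.unionByName(dups).select("set_id", "new_sequence")

result = (with_set.join(mapping, "set_id")
                  .drop("sequence", "is_start", "set_id")
                  .withColumnRenamed("new_sequence", "sequence")
                  .orderBy("mnId"))
result.show(30)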

Sum of row values within a window range in a Spark dataframe

I have a dataframe, shown below, where the count column holds the number of values that have to be added up to get a new column.
+---+----+----+------------+
| ID|date| A| count|
+---+----+----+------------+
| 1| 1| 10| null|
| 1| 2| 10| null|
| 1| 3|null| null|
| 1| 4| 20| 1|
| 1| 5|null| null|
| 1| 6|null| null|
| 1| 7| 60| 2|
| 1| 7| 60| null|
| 1| 8|null| null|
| 1| 9|null| null|
| 1| 10|null| null|
| 1| 11| 80| 3|
| 2| 1| 10| null|
| 2| 2| 10| null|
| 2| 3|null| null|
| 2| 4| 20| 1|
| 2| 5|null| null|
| 2| 6|null| null|
| 2| 7| 60| 2|
+---+----+----+------------+
The expected output is as shown below.
+---+----+----+-----+-------+
| ID|date| A|count|new_col|
+---+----+----+-----+-------+
| 1| 1| 10| null| null|
| 1| 2| 10| null| null|
| 1| 3| 10| null| null|
| 1| 4| 20| 2| 30|
| 1| 5| 10| null| null|
| 1| 6| 10| null| null|
| 1| 7| 60| 3| 80|
| 1| 7| 60| null| null|
| 1| 8|null| null| null|
| 1| 9|null| null| null|
| 1| 10| 10| null| null|
| 1| 11| 80| 2| 90|
| 2| 1| 10| null| null|
| 2| 2| 10| null| null|
| 2| 3|null| null| null|
| 2| 4| 20| 1| 20|
| 2| 5|null| null| null|
| 2| 6| 20| null| null|
| 2| 7| 60| 2| 80|
+---+----+----+-----+-------+
I tried with a window function as follows:
val w2 = Window.partitionBy("ID").orderBy("date")
val newDf = df
  .withColumn("new_col", when(col("A").isNotNull && col("count").isNotNull,
    sum(col("A")).over(Window.partitionBy("ID").orderBy("date")
      .rowsBetween(Window.currentRow - col("count"), Window.currentRow))))
But I am getting the error below:
error: overloaded method value - with alternatives:
(x: Long)Long <and>
(x: Int)Long <and>
(x: Char)Long <and>
(x: Short)Long <and>
(x: Byte)Long
cannot be applied to (org.apache.spark.sql.Column)
It seems like the column value provided inside the window frame is causing the issue.
Any idea how to resolve this error and achieve the requirement, or any alternative solutions?
Any leads appreciated!
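The root cause is that rowsBetween only accepts literal Long offsets, so a per-row count column cannot be used to size the frame. One workaround (a sketch, not from the original post, written in PySpark rather than Scala, and assuming count means the number of trailing non-null A values, current row included, to sum — the sample input and expected output do not fully agree on this) is to collect the non-null A history and slice it:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = (Window.partitionBy("ID").orderBy("date")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

result = (df
    # collect_list skips nulls, so a_hist holds every non-null A seen so far in the partition
    .withColumn("a_hist", F.collect_list("A").over(w))
    # sum the last `count` entries of that history (Spark 2.4+ higher-order functions)
    .withColumn("new_col",
                F.when(F.col("count").isNotNull(),
                       F.expr("aggregate(slice(a_hist, -count, count), 0L, (acc, x) -> acc + x)")))
    .drop("a_hist"))
result.show()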

How can I count the occurrence frequency of records in a Spark dataframe and add it as a new column without affecting the index column?

I'm trying to add a new column named Freq to the given Spark dataframe without affecting the index column or the record order, so that the statistical frequency (the count) is assigned back to the right row/record in the dataframe.
This is my data frame:
+---+-------------+------+------------+-------------+-----------------+
| id| Type|Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
| 0| Sentence| 4014| 198| false| 136|
| 1| contextid| 90| 2| false| 15|
| 2| Sentence| 172| 11| false| 118|
| 3| String| 12| 0| true| 11|
| 4|version-style| 16| 0| false| 13|
| 5| Sentence| 339| 42| false| 110|
| 6|version-style| 16| 0| false| 13|
| 7| url_variable| 10| 2| false| 9|
| 8| url_variable| 10| 2| false| 9|
| 9| Sentence| 172| 11| false| 117|
| 10| contextid| 90| 2| false| 15|
| 11| Sentence| 170| 11| false| 114|
| 12|version-style| 16| 0| false| 13|
| 13| Sentence| 68| 10| false| 59|
| 14| String| 12| 0| true| 11|
| 15| Sentence| 173| 11| false| 118|
| 16| String| 12| 0| true| 11|
| 17| Sentence| 132| 8| false| 96|
| 18| String| 12| 0| true| 11|
| 19| contextid| 88| 2| false| 15|
+---+-------------+------+------------+-------------+-----------------+
I tried the following script unsuccessfully, because of the presence of the index column id:
from pyspark.sql import functions as F
from pyspark.sql import Window
bo = features_sdf.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
sdf2 = (
    bo.na.fill(0).withColumn(
        'Freq',
        F.count("*").over(Window.partitionBy(bo.columns))
    ).withColumn(
        'MaxFreq',
        F.max('Freq').over(Window.partitionBy())
    ).withColumn(
        'MinFreq',
        F.min('Freq').over(Window.partitionBy())
    )
)
sdf2.show()
# bad result: the id column makes every record unique, so Freq = 1 for every row
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
| id| Type|Length|Token_number|Encoding_type|Character_feature|Freq|MaxFreq|MinFreq|
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
| 0| Sentence| 4014| 198| false| 136| 1| 1| 1|
| 1| contextid| 90| 2| false| 15| 1| 1| 1|
| 2| Sentence| 172| 11| false| 118| 1| 1| 1|
| 3| String| 12| 0| true| 11| 1| 1| 1|
| 4|version-style| 16| 0| false| 13| 1| 1| 1|
| 5| Sentence| 339| 42| false| 110| 1| 1| 1|
| 6|version-style| 16| 0| false| 13| 1| 1| 1|
| 7| url_variable| 10| 2| false| 9| 1| 1| 1|
| 8| url_variable| 10| 2| false| 9| 1| 1| 1|
| 9| Sentence| 172| 11| false| 117| 1| 1| 1|
| 10| contextid| 90| 2| false| 15| 1| 1| 1|
| 11| Sentence| 170| 11| false| 114| 1| 1| 1|
| 12|version-style| 16| 0| false| 13| 1| 1| 1|
| 13| Sentence| 68| 10| false| 59| 1| 1| 1|
| 14| String| 12| 0| true| 11| 1| 1| 1|
| 15| Sentence| 173| 11| false| 118| 1| 1| 1|
| 16| String| 12| 0| true| 11| 1| 1| 1|
| 17| Sentence| 132| 8| false| 96| 1| 1| 1|
| 18| String| 12| 0| true| 11| 1| 1| 1|
| 19| contextid| 88| 2| false| 15| 1| 1| 1|
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
If I exclude the index column id the code works, but it somehow messes up the order (because of unwanted sorting/ordering), so the results no longer get assigned to the right record/row, as follows:
+--------+------+------------+-------------+-----------------+----+-------+-------+
| Type|Length|Token_number|Encoding_type|Character_feature|Freq|MaxFreq|MinFreq|
+--------+------+------------+-------------+-----------------+----+-------+-------+
|Sentence| 7| 1| false| 6| 2| 1665| 1|
|Sentence| 7| 1| false| 6| 2| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
+--------+------+------------+-------------+-----------------+----+-------+-------+
In the end I wanted to add this frequency, normalized between 0 and 1 using a simple mathematical formula, as a new feature. During normalization I also ran into problems and got null values. I have already implemented the pandas version, which is easy, but I'm fed up with Spark:
# Statistical preprocessing
def add_freq_to_features(df):
    frequencies_df = df.groupby(list(df.columns)).size().to_frame().rename(columns={0: "Freq"})
    frequencies_df["Freq"] = frequencies_df["Freq"] / frequencies_df["Freq"].sum()  # normalizing to 0..1
    new_df = pd.merge(df, frequencies_df, how='left', on=list(df.columns))
    return new_df

# Apply frequency allocation and merge with extracted features df
features_df = add_freq_to_features(oba)
features_df.head(20)
and it returns the right results, as I expected.
I also tried to literally translate the pandas script using df.groupBy(df.columns).count(), but I couldn't get it to work:
# this is to build the "raw" Freq based on @pltc's answer
sdf2 = (sdf
    .groupBy(sdf.columns)
    .agg(F.count('*').alias('Freq'))
    .withColumn('Encoding_type', F.col('Encoding_type').cast('string'))
)
sdf2.cache().count()
sdf2.show()
Here is the full PySpark code of what we have tried on a simplified example, available in this Colab notebook, based on the answer of @ggordon:
def add_freq_to_features_(df):
    from pyspark.sql import functions as F
    from pyspark.sql import Window

    sdf_pltc = df.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
    print("Before Any Modification")  # only included for debugging purposes
    sdf_pltc.show(5, truncate=0)

    # fill missing values with 0 using `na.fill(0)` before applying count as window function
    sdf2 = (
        sdf_pltc.na.fill(0).withColumn(
            'Freq',
            F.count("*").over(Window.partitionBy(sdf_pltc.columns))
        ).withColumn(
            'MaxFreq',
            F.max('Freq').over(Window.partitionBy())
        ).withColumn(
            'MinFreq',
            F.min('Freq').over(Window.partitionBy())
        )
        .withColumn('id', F.col('id'))
    )
    print("After replacing null with 0 and counting by partitions")  # only included for debugging purposes
    # use orderBy as your last operation, only included here for debugging purposes
    #sdf2 = sdf2.orderBy(F.col('Type').desc(), F.col('Length').desc())
    sdf2.show(5, truncate=False)  # only included for debugging purposes

    sdf2 = (
        sdf2.withColumn('Freq', F.when(
            F.col('MaxFreq') == 0.000000000, 0
        ).otherwise(
            (F.col('Freq') - F.col('MinFreq')) / (F.col('MaxFreq') - F.col('MinFreq'))
        ))  # normalizing between 0 & 1
    )
    sdf2 = sdf2.drop('MinFreq').drop('MaxFreq')
    sdf2 = sdf2.withColumn('Encoding_type', F.col('Encoding_type').cast('string'))
    #sdf2 = sdf2.orderBy(F.col('Type').desc(), F.col('Length').desc())
    print("After normalization, encoding transformation and order by")  # only included for debugging purposes
    sdf2.show(50, truncate=False)
    return sdf2
Sadly, because I'm dealing with big data, I can't hack around it with df.toPandas(); it is too expensive and causes an OOM error.
Any help will be forever appreciated.
The pandas behavior is different because the id field is the DataFrame index, so it does not take part in the "group by all columns" that you do. You can get the same behavior in Spark with one change.
partitionBy takes an ordinary list of strings, so try removing the id column from your partition key list, like this:
bo = features_sdf.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')

# build the key list without id (note: list.remove() mutates in place and returns None,
# so don't assign its result)
partition_columns = [c for c in bo.columns if c != 'id']

sdf2 = (
    bo.na.fill(0).withColumn(
        'Freq',
        F.count("*").over(Window.partitionBy(partition_columns))
    ).withColumn(
        'MaxFreq',
        F.max('Freq').over(Window.partitionBy())
    ).withColumn(
        'MinFreq',
        F.min('Freq').over(Window.partitionBy())
    )
)
That will give you the results you said worked, but it keeps the id field. You will still need to do the division to normalize the frequencies, but this should get you started.
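For the normalization itself, a minimal sketch (not part of the original answer) that mirrors the pandas version, assuming the denominator should be the total row count, exactly like frequencies_df["Freq"].sum(), and reusing bo and sdf2 from above:

total = bo.count()  # same value as frequencies_df["Freq"].sum() in the pandas code
sdf2 = sdf2.withColumn('Freq', F.col('Freq') / F.lit(total))  # per-group count / total rows; id untouched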

Getting Memory Error in PySpark during Filter & GroupBy computation

This is the error:
Job aborted due to stage failure: Task 12 in stage 37.0 failed 4 times, most recent failure: Lost task 12.3 in stage 37.0 (TID 325, 10.139.64.5, executor 20): ExecutorLostFailure (executor 20 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
So is there any alternative, more efficient way to apply these functions without causing an out-of-memory error? I have billions of rows to compute.
Input Dataframe on which filtering is to be done:
+------+-------+-------+------+-------+-------+
|Pos_id|level_p|skill_p|Emp_id|level_e|skill_e|
+------+-------+-------+------+-------+-------+
| 1| 2| a| 100| 2| a|
| 1| 2| a| 100| 3| f|
| 1| 2| a| 100| 2| d|
| 1| 2| a| 101| 4| a|
| 1| 2| a| 101| 5| b|
| 1| 2| a| 101| 1| e|
| 1| 2| a| 102| 5| b|
| 1| 2| a| 102| 3| d|
| 1| 2| a| 102| 2| c|
| 2| 2| d| 100| 2| a|
| 2| 2| d| 100| 3| f|
| 2| 2| d| 100| 2| d|
| 2| 2| d| 101| 4| a|
| 2| 2| d| 101| 5| b|
| 2| 2| d| 101| 1| e|
| 2| 2| d| 102| 5| b|
| 2| 2| d| 102| 3| d|
| 2| 2| d| 102| 2| c|
| 2| 4| b| 100| 2| a|
| 2| 4| b| 100| 3| f|
+------+-------+-------+------+-------+-------+
Filtering Code:
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf
from pyspark.sql import functions as sf
function = udf(lambda item, items: 1 if item in items else 0, IntegerType())
df_result = new_df.withColumn('result', function(sf.col('skill_p'), sf.col('skill_e')))
df_filter = df_result.filter(sf.col("result") == 1)
df_filter.show()
res = df_filter.groupBy("Pos_id", "Emp_id").agg(
    sf.collect_set("skill_p").alias("SkillsMatch"),
    sf.sum("result").alias("SkillsMatchedCount"))
res.show()
This needs to be done on Billions of rows.
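One thing worth trying (a sketch, not a tested fix): the Python UDF forces every row through the Python workers, which is slow and memory-hungry at this scale. Since `item in items` on strings is just a substring test, the built-in Column.contains can replace it and keep the whole job inside the JVM:

from pyspark.sql import functions as sf

# same substring test as the UDF, done natively
df_result = new_df.withColumn(
    'result',
    sf.when(sf.col('skill_e').contains(sf.col('skill_p')), 1).otherwise(0))

df_filter = df_result.filter(sf.col('result') == 1)

res = (df_filter.groupBy('Pos_id', 'Emp_id')
       .agg(sf.collect_set('skill_p').alias('SkillsMatch'),
            sf.sum('result').alias('SkillsMatchedCount')))
res.show()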

PySpark: Add a new column containing the value in one column that corresponds to a value in another column meeting a specified condition

I want to add a new column containing the value from one column that corresponds to a value in another column which meets a specified condition.
For instance, the original DF is as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
|    A|   17|    1|
|    A|   16|    2|
|    A|   18|    2|
|    A|   30|    3|
|    B|   35|    1|
|    B|   34|    2|
|    B|   36|    2|
|    C|   20|    1|
|    C|   30|    1|
|    C|   43|    1|
+-----+-----+-----+
For each group in col1, I need to repeat the value in col2 that corresponds to 1 in col3; and if a group has more than one row with col3 = 1, repeat the minimum such col2 value.
The desired DF is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
|   A|  17|   1|        17|
|   A|  16|   2|        17|
|   A|  18|   2|        17|
|   A|  30|   3|        17|
|   B|  35|   1|        35|
|   B|  34|   2|        35|
|   B|  36|   2|        35|
|   C|  20|   1|        20|
|   C|  30|   1|        20|
|   C|  43|   1|        20|
+----+----+----+----------+
What I have tried:
df3 = df.filter(df.col3 == 1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   B|  35|   1|
|   C|  20|   1|
|   C|  30|   1|
|   C|  43|   1|
|   A|  17|   1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2 I followed the accepted answer in this link: How to find exact median for grouped data in Spark
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
|   A|     17|
|   B|     35|
|   C|     20|
+----+-------+
df_a = df.join(df6, ['col1'], 'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
|   B|  35|   1|     35|
|   B|  34|   2|     35|
|   B|  36|   2|     35|
|   C|  20|   1|     20|
|   C|  30|   1|     20|
|   C|  43|   1|     20|
|   A|  17|   1|     17|
|   A|  16|   2|     17|
|   A|  18|   2|     17|
|   A|  30|   3|     17|
+----+----+----+-------+
Is there a better way than this solution?
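One alternative (a sketch, not from the original post): a single window expression does this in one pass, because min() ignores the nulls that when() produces for rows where col3 is not 1. Assuming the dataframe is named df:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('col1')

# minimum col2 among the rows of the group where col3 == 1; other rows contribute null
df_out = df.withColumn('new_column',
                       F.min(F.when(F.col('col3') == 1, F.col('col2'))).over(w))
df_out.show()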