How to do reshape operation on a single column in pyspark dataframe? - pandas

I have a long pyspark dataframe as shown below:
+------+
|number|
+------+
|12.4 |
|13.4 |
|42.3 |
|33.4 |
|42.3 |
|32.4 |
|44.2 |
|12.3 |
|45.4 |
+------+
Ideally, I want this to be reshaped to an nxn matrix where n is sqrt(length of pyspark dataframe).
While there is a solution by converting it into a numpy array and then reshaping it to nxn matrix but I want that to be done in pyspark. Because my data is super long (about 100 million rows).
So the expected output I am looking for is something along these lines:
+------+------+------+
|12.4 | 13.4 | 42.3 |
|33.4 | 42.3 | 32.4 |
|44.2 | 12.3 | 45.4 |
+------+------+------+
While I was able to do it properly by converting it to pandas then to numpy and then doing the reshape operation. But I want to do this transformation in Pyspark itself. Because the below code works fine for only a few thousand rows.
covarianceMatrix_pd = covarianceMatrix_df.toPandas()
nrows = np.sqrt(len(covarianceMatrix_pd))
covarianceMatrix_pd = covarianceMatrix_pd.to_numpy().reshape((int(nrows),int(nrows)))
covarianceMatrix_pd

One way to do this is using row_number with pivot after we have a count of the dataframe:
from pyspark.sql import functions as F, Window
from math import sqrt
c = int(sqrt(df.count())) #this gives 3
rnum = F.row_number().over(Window.orderBy(F.lit(1)))
out = (df.withColumn("Rnum",((rnum-1)/c).cast("Integer"))
.withColumn("idx",F.row_number().over(Window.partitionBy("Rnum").orderBy("Rnum")))
.groupby("Rnum").pivot("idx").agg(F.first("number")))
out.show()
+----+----+----+----+
|Rnum| 1| 2| 3|
+----+----+----+----+
| 0|12.4|13.4|42.3|
| 1|33.4|42.3|32.4|
| 2|44.2|12.3|45.4|
+----+----+----+----+

Related

Join/Add data to MultiIndex dataframe in pandas

I have some measurement data from different dust analytics.
Two Locations MC174 and MC042
Two fractions PM2.5 and PM10
several analytic results [Cl,Na, K,...]
I created a multicolumn dataframe like this:
| MC174 | MC042 |
| PM2.5 | PM10 | PM2.4 | PM10 |
| Cl | Na| K | Cl | Na| K | Cl | Na| K | Cl | Na| K |
location = ['MC174','MC042']
fraction = ['PM10','PM2.5']
value = [ 'date' ,'Cl','NO3', 'SO4','Na', 'NH4','K', 'Mg','Ca', 'masse','OC_R', 'E_CR','OC_T', 'EC_T']
midx = pd.MultiIndex.from_product([location, fraction,value],names=['location','fraction','value'])
df = pd.DataFrame(columns=midx)
df
and i prepared 4 Dataframes with matching colums for those four locations and fractions.
date | Cl | Na | K |
______________________________
01-01-2021 | 3.1 | 4.3 | 1.0|
... ...
31-12-2021 | 4.9 | 3.8 | 0.8
Now i want to fill the large dataframe with the data from the four locations/fractions:
DF1 -> MainDF[MC174][PM10]
DF2 -> MainDF[MC174][PM2.5]
and so on...
My goal is to have one dataframe with the dates of the year in its index and the multilevel columnstructure i discribed at the top and all the data inside it.
I tried:
main_df['MC174']['PM10'].append(data_MC174_PM10)
pd.concat([main_df['MC174']['PM10'], data_MC174_PM10],axis=0)
main_df.loc[:,['MC174'],['PM10']] = data_MC174_PM10
but the dataframe is never filled.
Thanks in advance!

How to update a dataframe in PySpark with random values from another dataframe?

I have two dataframes in PySpark as below:
Dataframe A: total 1000 records
+-----+
|Name |
+-----+
| a|
| b|
| c|
+-----+
Dataframe B: Total 3 records
+-----+
|Zip |
+-----+
|06905|
|06901|
|06902|
+-----+
I need to add a new column named Zip in Dataframe A and populate the values with a randomly selected value from Dataframe B. So the Dataframe A will look something like this:
+-----+-----+
|Name |Zip |
+-----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06902|
+-----+-----+
I am running this on Azure Databricks and apparently, quinn isn't a module in there. So can't use quinn unfortunately.
If b is small (3 rows), you can just collect it into a Python list and add it as an array column to a. Then you can get a random element using shuffle.
import pyspark.sql.functions as F
df = a.withColumn(
'Zip',
F.shuffle(
F.array(*[F.lit(r[0]) for r in b.collect()])
)[0]
)
df.show()
+----+-----+
|Name| Zip|
+----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06901|
+----+-----+
You can agg the dataframe with zips and collect the values into one array column, then do a cross join and select a random element from the array of zips using for example shuffle on the array before picking the first element:
from pyspark.sql import functions as F
df_result = df_a.crossJoin(
df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
"Zip",
F.expr("shuffle(Zip)[0]")
)
#+----+-----+
#|Name| Zip|
#+----+-----+
#| a|06901|
#| b|06902|
#| c|06901|
#| d|06901|
#+----+-----+

Using PySpark window functions with conditions to add rows

I have a need to be able to add new rows to a PySpark df will values based upon the contents of other rows with a common id. There will eventually millions of ids with lots rows for each id. I have tried the below method which works but seems overly complicated.
I start with a df in the format below (but in reality have more columns):
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
+-------+----------+-------+
Currently I am pivoting this df to get it in the following format:
+-----+------+------+------+
| id | varA | varB | varC |
+-----+------+------+------+
| 1 | 30 | 1 | -9 |
+-----+------+------+------+
On this df I can then use the standard withColumn and when functionality to add new columns based on the values in other columns. For example:
df = df.withColumn("varD", when((col("varA") > 16) & (col("varC") != -9)), 2).otherwise(1)
Which leads to:
+-----+------+------+------+------+
| id | varA | varB | varC | varD |
+-----+------+------+------+------+
| 1 | 30 | 1 | -9 | 1 |
+-----+------+------+------+------+
I can then pivot this df back to the original format leading to this:
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
| 1 | varD | 1 |
+-------+----------+-------+
This works but seems like it could, with millions of rows, lead to expensive and unnecessary operations. It feels like it should be doable without the need to pivot and unpivot the data. Do I need to do this?
I have read about Window functions and it sounds as if they may be another way to achieve the same result but to be honest I am struggling to get started with them. I can see how they can be used to generate a value, say a sum, for each id, or to find a maximum value but have not found a way to even get started on applying complex conditions that lead to a new row.
Any help to get started with this problem would be gratefully received.
You can use pandas_udf for adding/deleting rows/col on grouped data, and implement your processing logic in pandas udf.
import pyspark.sql.functions as F
row_schema = StructType(
[StructField("id", IntegerType(), True),
StructField("variable", StringType(), True),
StructField("value", IntegerType(), True)]
)
#F.pandas_udf(row_schema, F.PandasUDFType.GROUPED_MAP)
def addRow(pdf):
val = 1
if (len(pdf.loc[(pdf['variable'] == 'varA') & (pdf['value'] > 16)]) > 0 ) & \
(len(pdf.loc[(pdf['variable'] == 'varC') & (pdf['value'] != -9)]) > 0):
val = 2
return pdf.append(pd.Series([1, 'varD', val], index=['id', 'variable', 'value']), ignore_index=True)
df = spark.createDataFrame([[1, 'varA', 30],
[1, 'varB', 1],
[1, 'varC', -9]
], schema=['id', 'variable', 'value'])
df.groupBy("id").apply(addRow).show()
which resuts
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
| 1| varA| 30|
| 1| varB| 1|
| 1| varC| -9|
| 1| varD| 1|
+---+--------+-----+

Adding a column to a PySpark dataframe contans standard deviations of a column based on the grouping on two another columns

Suppose that we have a csv file which has been imported as a dataframe in PysPark as follows
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("file path and name.csv", inferSchema = True, header = True)
df.show()
output
+-----+----+----+
|lable|year|val |
+-----+----+----+
| A|2003| 5.0|
| A|2003| 6.0|
| A|2003| 3.0|
| A|2004|null|
| B|2000| 2.0|
| B|2000|null|
| B|2009| 1.0|
| B|2000| 6.0|
| B|2009| 6.0|
+-----+----+----+
Now, we want to add another column to df which contains the standard deviation of val based on the grouping on two columns lable and year. So, the output must be as follows:
+-----+----+----+-----+
|lable|year|val | std |
+-----+----+----+-----+
| A|2003| 5.0| 1.53|
| A|2003| 6.0| 1.53|
| A|2003| 3.0| 1.53|
| A|2004|null| null|
| B|2000| 2.0| 2.83|
| B|2000|null| 2.83|
| B|2009| 1.0| 3.54|
| B|2000| 6.0| 2.83|
| B|2009| 6.0| 3.54|
+-----+----+----+-----+
I have the following codes which works for a small dataframe but it does not work for a very large dataframe (with about 40 million rows) which I am working with now.
import pyspark.sql.functions as f
a = df.groupby('lable','year').agg(f.round(f.stddev("val"),2).alias('std'))
df = df.join(a, on = ['lable', 'year'], how = 'inner')
I get Py4JJavaError Traceback (most recent call last) error after running on my large dataframe.
Does anyone knows any alternative way? I hope your way works on my dataset.
I am using python3.7.1, pyspark2.4, and jupyter4.4.0
The join on dataframe causes a lot of data shuffle between executors. In your case, you can do without the join.
Use a window specification to partition data by 'lable' and 'year' and aggregate on the window.
from pyspark.sql.window import *
windowSpec = Window.partitionBy('lable','year')\
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("std", f.round(f.stddev("val").over(windowSpec), 2))

spark dataframe drop duplicates and keep first

Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes?
Pandas:
df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime', 'flt_flightnumber'], keep='first')
Spark dataframe (I use Spark 1.6.0) doesn't have the keep option
df.orderBy(['actual_datetime']).dropDuplicates(subset=['scheduled_datetime', 'flt_flightnumber'])
Imagine scheduled_datetime and flt_flightnumber are columns 6 ,17. By creating keys based on the values of these columns we can also deduplicate
def get_key(x):
return "{0}{1}".format(x[6],x[17])
df= df.map(lambda x: (get_key(x),x)).reduceByKey(lambda x,y: (x))
but how to specify to keep the first row and get rid of the other duplicates ? What about the last row ?
To everyone saying that dropDuplicates keeps the first occurrence - this is not strictly correct.
dropDuplicates keeps the 'first occurrence' of a sort operation - only if there is 1 partition. See below for some examples.
However this is not practical for most Spark datasets. So I'm also including an example of 'first occurrence' drop duplicates operation using Window function + sort + rank + filter.
See bottom of post for example.
This is tested in Spark 2.4.0 using pyspark.
dropDuplicates examples
import pandas as pd
# generating some example data with pandas, will convert to spark df below
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
print(dfall)
col1 datestr
0 0 2018-01-01
1 1 2018-01-01
2 2 2018-01-01
3 3 2018-01-01
4 4 2018-01-01
0 0 2018-02-01
1 1 2018-02-01
2 2 2018-02-01
3 3 2018-02-01
4 4 2018-02-01
0 0 2018-03-01
1 1 2018-03-01
2 2 2018-03-01
3 3 2018-03-01
4 4 2018-03-01
# first example
# does not give first (based on datestr)
(spark.createDataFrame(dfall)
.orderBy('datestr')
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates NOT based on occurrence of sorted datestr
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-03-01|
| 1|2018-02-01|
| 3|2018-02-01|
| 2|2018-02-01|
| 4|2018-01-01|
+----+----------+
# second example
# testing what happens with repartition
(spark.createDataFrame(dfall)
.orderBy('datestr')
.repartition('datestr')
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates NOT based on occurrence of sorted datestr
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-02-01|
| 1|2018-01-01|
| 3|2018-02-01|
| 2|2018-02-01|
| 4|2018-02-01|
+----+----------+
#third example
# testing with coalesce(1)
(spark
.createDataFrame(dfall)
.orderBy('datestr')
.coalesce(1)
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates based on occurrence of sorted datestr
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 2|2018-01-01|
| 3|2018-01-01|
| 4|2018-01-01|
+----+----------+
# fourth example
# testing with reverse sort then coalesce(1)
(spark
.createDataFrame(dfall)
.orderBy('datestr', ascending = False)
.coalesce(1)
.dropDuplicates(subset = ['col1'])
.show()
)
# dropDuplicates based on occurrence of sorted datestr```
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-03-01|
| 1|2018-03-01|
| 2|2018-03-01|
| 3|2018-03-01|
| 4|2018-03-01|
+----+----------+
window, sort, rank, filter example
# generating some example data with pandas
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
# into spark df
df_s = (spark.createDataFrame(dfall))
from pyspark.sql import Window
from pyspark.sql.functions import rank
window = Window.partitionBy("col1").orderBy("datestr")
(df_s.withColumn('rank', rank().over(window))
.filter(col('rank') == 1)
.drop('rank')
.show()
)
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 3|2018-01-01|
| 2|2018-01-01|
| 4|2018-01-01|
+----+----------+
# however this fails if ties/duplicates exist in the windowing paritions
# and so a tie breaker for the 'rank' function must be added
# generating some example data with pandas, will convert to spark df below
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-01-01' # note duplicates in this dataset
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = pd.concat([df1,df2,df3])
print(dfall)
col1 datestr
0 0 2018-01-01
1 1 2018-01-01
2 2 2018-01-01
3 3 2018-01-01
4 4 2018-01-01
0 0 2018-01-01
1 1 2018-01-01
2 2 2018-01-01
3 3 2018-01-01
4 4 2018-01-01
0 0 2018-03-01
1 1 2018-03-01
2 2 2018-03-01
3 3 2018-03-01
4 4 2018-03-01
# this will fail, since duplicates exist within the window partitions
# and no way to specify ranking style exists in pyspark rank() fn
window = Window.partitionBy("col1").orderBy("datestr")
(df_s.withColumn('rank', rank().over(window))
.filter(col('rank') == 1)
.drop('rank')
.show()
)
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 0|2018-01-01|
| 1|2018-01-01|
| 1|2018-01-01|
| 3|2018-01-01|
| 3|2018-01-01|
| 2|2018-01-01|
| 2|2018-01-01|
| 4|2018-01-01|
| 4|2018-01-01|
+----+----------+
# to deal with ties within window partitions, a tiebreaker column is added
from pyspark.sql import Window
from pyspark.sql.functions import rank, col, monotonically_increasing_id
window = Window.partitionBy("col1").orderBy("datestr",'tiebreak')
(df_s
.withColumn('tiebreak', monotonically_increasing_id())
.withColumn('rank', rank().over(window))
.filter(col('rank') == 1).drop('rank','tiebreak')
.show()
)
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 3|2018-01-01|
| 2|2018-01-01|
| 4|2018-01-01|
+----+----------+
Use window and row_number functions.
Order by ascending or descending to select first or last.
from pyspark.sql import Window
from pyspark.sql import functions as f
window = Window.partitionBy("col1").orderBy("datestr").asc()
df = (df.withColumn('row', f.row_number().over(window))\
.filter(col('row') == 1)
.drop('row')
.show())
I did the following:
dataframe.groupBy("uniqueColumn").min("time")
This will group by the given column, and within the same group choose the one with min time (this will keep the first and remove others)
solution 1
add a new column row num(incremental column) and drop duplicates based the min row after grouping on all the columns you are interested in.(you can include all the columns for dropping duplicates except the row num col)
solution 2:
turn the data-frame into a rdd (df.rdd) then group the rdd on one or more or all keys and then run a lambda function on the group and drop the rows the way you want and return only the row that you are interested in.
One of my friend (sameer) mentioned that below(old solution) didn't work for him.
use dropDuplicates method by default it keeps the first occurance.
You can use a window with row_number:
import pandas as pd
df1 = pd.DataFrame({'col1':range(0,5)})
df1['datestr'] = '2018-01-01'
df2 = pd.DataFrame({'col1':range(0,5)})
df2['datestr'] = '2018-02-01'
df3 = pd.DataFrame({'col1':range(0,5)})
df3['datestr'] = '2018-03-01'
dfall = spark.createDataFrame(pd.concat([df1,df2,df3]))
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col,row_number
window = Window.partitionBy('col1').orderBy(col('datestr'))
dfall.select('*', row_number().over(window).alias('posicion')).show()
dfall.select('*', row_number().over(window).alias('posicion')).where('posicion ==1').show()
+----+----------+--------+
|col1| datestr|posicion|
+----+----------+--------+
| 0|2018-01-01| 1|
| 0|2018-02-01| 2|
| 0|2018-03-01| 3|
| 1|2018-01-01| 1|
| 1|2018-02-01| 2|
| 1|2018-03-01| 3|
| 3|2018-01-01| 1|
| 3|2018-02-01| 2|
| 3|2018-03-01| 3|
| 2|2018-01-01| 1|
| 2|2018-02-01| 2|
| 2|2018-03-01| 3|
| 4|2018-01-01| 1|
| 4|2018-02-01| 2|
| 4|2018-03-01| 3|
+----+----------+--------+
+----+----------+--------+
|col1| datestr|posicion|
+----+----------+--------+
| 0|2018-01-01| 1|
| 1|2018-01-01| 1|
| 3|2018-01-01| 1|
| 2|2018-01-01| 1|
| 4|2018-01-01| 1|
+----+----------+--------+
I just did something perhaps similar to what you guys need, using drop_duplicates pyspark.
Situation is this. I have 2 dataframes (coming from 2 files) which are exactly same except 2 columns file_date(file date extracted from the file name) and data_date(row date stamp). Annoyingly I have rows which are with same data_date (and all other column cells too) but different file_date as they get replicated on every newcomming file with an addition of one new row.
I needed to capture all rows from the new file, plus that one row left over from the previous file. That row is not in the new file. Remaining columns on the right from data_date are same between the two files for the same data_date.
file_1_20190122 - df1
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-22|2019-01-16| <- One row we want to keep where file_date 22nd
| AGGH|2019-01-22|2019-01-17|
| AGGH|2019-01-22|2019-01-18|
| AGGH|2019-01-22|2019-01-19|
| AGGH|2019-01-22|2019-01-20|
| AGGH|2019-01-22|2019-01-21|
| AGGH|2019-01-22|2019-01-22|
file_2_20190123 - df2
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-23|2019-01-17| \/ ALL rows we want to keep where file_date 23rd
| AGGH|2019-01-23|2019-01-18|
| AGGH|2019-01-23|2019-01-19|
| AGGH|2019-01-23|2019-01-20|
| AGGH|2019-01-23|2019-01-21|
| AGGH|2019-01-23|2019-01-22|
| AGGH|2019-01-23|2019-01-23|
This will require us to sort and concat df's, then deduplicate them on all columns but one.
Let me walk you through.
union_df = df1.union(df2) \
.sort(['station_code', 'data_date'], ascending=[True, True])
+------------+----------+----------+
|station_code| file_date| data_date|
+------------+----------+----------+
| AGGH|2019-01-22|2019-01-16| <- keep
| AGGH|2019-01-23|2019-01-17| <- keep
| AGGH|2019-01-22|2019-01-17| x- drop
| AGGH|2019-01-22|2019-01-18| x- drop
| AGGH|2019-01-23|2019-01-18| <- keep
| AGGH|2019-01-23|2019-01-19| <- keep
| AGGH|2019-01-22|2019-01-19| x- drop
| AGGH|2019-01-23|2019-01-20| <- keep
| AGGH|2019-01-22|2019-01-20| x- drop
| AGGH|2019-01-22|2019-01-21| x- drop
| AGGH|2019-01-23|2019-01-21| <- keep
| AGGH|2019-01-23|2019-01-22| <- keep
| AGGH|2019-01-22|2019-01-22| x- drop
| AGGH|2019-01-23|2019-01-23| <- keep
Here we drop already sorted duped rows excluding keys ['file_date', 'data_date'].
nonduped_union_df = union_df \
.drop_duplicates(['station_code', 'data_date', 'time_zone',
'latitude', 'longitude', 'elevation',
'highest_temperature', 'lowest_temperature',
'highest_temperature_10_year_normal',
'another_50_columns'])
And the result holds ONE row with earliest date from DF1 which is not in DF2 and ALL rows from DF2
nonduped_union_df.select(['station_code', 'file_date', 'data_date',
'highest_temperature', 'lowest_temperature']) \
.sort(['station_code', 'data_date'], ascending=[True, True]) \
.show(30)
+------------+----------+----------+-------------------+------------------+
|station_code| file_date| data_date|highest_temperature|lowest_temperature|
+------------+----------+----------+-------------------+------------------+
| AGGH|2019-01-22|2019-01-16| 90| 77| <- df1 22nd
| AGGH|2019-01-23|2019-01-17| 90| 77| \/- df2 23rd
| AGGH|2019-01-23|2019-01-18| 91| 75|
| AGGH|2019-01-23|2019-01-19| 88| 77|
| AGGH|2019-01-23|2019-01-20| 88| 77|
| AGGH|2019-01-23|2019-01-21| 88| 77|
| AGGH|2019-01-23|2019-01-22| 90| 75|
| AGGH|2019-01-23|2019-01-23| 90| 75|
| CWCA|2019-01-22|2019-01-15| 23| -2|
| CWCA|2019-01-23|2019-01-16| 7| -8|
| CWCA|2019-01-23|2019-01-17| 28| -6|
| CWCA|2019-01-23|2019-01-18| 0| -13|
| CWCA|2019-01-23|2019-01-19| 25| -15|
| CWCA|2019-01-23|2019-01-20| -4| -18|
| CWCA|2019-01-23|2019-01-21| 27| -6|
| CWCA|2019-01-22|2019-01-22| 30| 17|
| CWCA|2019-01-23|2019-01-22| 30| 13|
| CWCO|2019-01-22|2019-01-15| 34| 29|
| CWCO|2019-01-23|2019-01-16| 33| 13|
| CWCO|2019-01-22|2019-01-16| 33| 13|
| CWCO|2019-01-22|2019-01-17| 23| 7|
| CWCO|2019-01-23|2019-01-17| 23| 7|
+------------+----------+----------+-------------------+------------------+
only showing top 30 rows
It may not be best suitable answer for this case, but it's the one worked for me.
Let me know, if stuck somewhere.
BTW - if anyone can tell me how to select all columns in a df, except one without listing them in a list - I will be very thankful.
Regards
G
I would try this way:
Assuming your data_df looks like this, and we want to keep the rows with the highest value in col1 per datestr:
col1 datestr
0 2018-01-01
1 2018-01-01
2 2018-01-01
3 2018-01-01
4 2018-01-01
0 2018-02-01
1 2018-02-01
2 2018-02-01
3 2018-02-01
4 2018-02-01
0 2018-03-01
1 2018-03-01
2 2018-03-01
3 2018-03-01
4 2018-03-01
you can do:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('datestr')
data_df = data_df.withColumn("max", F.max(F.col("col1"))\
.over(w))\
.where(F.col('max') == F.col('col1'))\
.drop("max")
this results in:
col1 datestr
4 2018-01-01
4 2018-02-01
4 2018-03-01
Given the below table:
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 2|2018-01-01|
| 3|2018-01-01|
| 4|2018-01-01|
| 0|2018-02-01|
| 1|2018-02-01|
| 2|2018-02-01|
| 3|2018-02-01|
| 4|2018-02-01|
| 0|2018-03-01|
| 1|2018-03-01|
| 2|2018-03-01|
| 3|2018-03-01|
| 4|2018-03-01|
+----+----------+
You can do it in two steps:
Group by the given table based upon the col1 and pick min date.
+----+----------+
|col1| datestr|
+----+----------+
| 0|2018-01-01|
| 1|2018-01-01|
| 2|2018-01-01|
| 3|2018-01-01|
| 4|2018-01-01|
+----+----------+
left Join the resultant table with original table on col1 and min_datestr.
if datasets isnt not large convert to pandas data frame and drop duplicates keeping last or first then convert back.