Equivalent of `takeWhile` for Spark dataframe

I have a dataframe looking like this:
scala> val df = Seq((1,.5), (2,.3), (3,.9), (4,.0), (5,.6), (6,.0)).toDF("id", "x")
scala> df.show()
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
| 4|0.0|
| 5|0.6|
| 6|0.0|
+---+---+
I would like to take the first rows of the data for as long as the x column is nonzero (note that the dataframe is sorted by id, so talking about the first rows makes sense). For this given dataframe, it would give something like this:
+---+---+
| id| x|
+---+---+
| 1|0.5|
| 2|0.3|
| 3|0.9|
+---+---+
I keep only the first 3 rows, since the 4th row has x equal to zero.
For a simple Seq, I can do something like Seq(0.5, 0.3, 0.9, 0.0, 0.6, 0.0).takeWhile(_ != 0.0). So for my dataframe I thought of something like this:
df.takeWhile('x =!= 0.0)
But unfortunately, the takeWhile method is not available for dataframes.
I know that I can transform my dataframe to a Seq to solve my problem, but I would like to avoid gathering all the data to the driver as it will likely crash it.
The take and limit methods allow getting the first n rows of a dataframe, but I can't specify a predicate. Is there a simple way to do this?

Can you guarantee that IDs will be in ascending order? New data is not necessarily guaranteed to be added in a specific order. If you can guarantee the order, then you can use this query to achieve what you want. It's not going to perform well on large data sets, but it may be the only way to achieve what you are interested in.
We'll mark all 0's as '1' and everything else as '0'. We'll then compute a rolling total over the entire data set. Since the running total only increases when a zero is encountered, it partitions the dataset into the sections between zeros.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

val windowSpec = Window.partitionBy().orderBy("id")

df.select(
    col("id"),
    col("x"),
    sum( // running total, which stays 0 for the first partition --> all rows before the first 0
      when(col("x") === lit(0), lit(1)).otherwise(lit(0)) // mark 0's to help partition the data set
    ).over(windowSpec).as("partition")
  )
  .where(col("partition") === lit(0))
  .show()
+---+---+---------+
| id| x|partition|
+---+---+---------+
| 1|0.5| 0|
| 2|0.3| 0|
| 3|0.9| 0|
+---+---+---------+

Related

Compare substrings of different columns

I have a dataframe like this:
+--------------+----------------------------------------------+---------------+
|column_1      |column_2                                      |Required_column|
+--------------+----------------------------------------------+---------------+
|K12B-45-84-6  |K12B-02-36-504, I05O-21-65-312, A301-21-25-363|True           |
|J020-35-2-9   |P12K-05-31-602, M002-22-22-636,L630-51-32-544 |False          |
|L006-85-00-694|M10P-22-94-349,L006-85-00-694, I553-35-12-240 |True           |
|M002-22-36-989|U985-12-45-363, M002-19-14-964                |True           |
+--------------+----------------------------------------------+---------------+
Explanation: column_1 and column_2 are strings;
for ease of understanding, let us call the values in the dataframe "switches".
column_1 always has exactly one switch value per row, but column_2 may contain multiple switch values. The result should be True or False based only on comparing the first 4 characters (e.g. K12B == K12B, see row one).
Note: even though the switch values in column_2 are comma separated, the formatting is not consistent (sometimes there may be a space, or two spaces, etc.).
The hint is that every switch value, in both column_1 and column_2, starts with a letter, so the logic can rely on that.
The aim is to produce Required_column, which is either True or False. The solution is required in PySpark.
Thanks in advance.
Here is a solution using PySpark's substring and contains functions. This way you don't have to worry about the cleanliness of column_2; you just need to be sure that column_1 is clean:
import pyspark.sql.functions as F

data = [
    ("K12B-45-84-6", "K12B-02-36-504, I05O-21-65-312, A301-21-25-363"),
    ("J020-35-2-9", "P12K-05-31-602, M002-22-22-636,L630-51-32-544"),
    ("L006-85-00-694", "M10P-22-94-349,L006-85-00-694, I553-35-12-240"),
    ("M002-22-36-989", "U985-12-45-363, M002-19-14-964")]
columns = ["column_1", "column_2"]
df = spark.createDataFrame(data=data, schema=columns)

df = df.withColumn("Required_column", F.when(
    F.col("column_2").contains(F.substring(F.col("column_1"), 1, 4)), True
).otherwise(False))

df.show()
Output:
+--------------+--------------------+---------------+
| column_1| column_2|Required_column|
+--------------+--------------------+---------------+
| K12B-45-84-6|K12B-02-36-504, I...| true|
| J020-35-2-9|P12K-05-31-602, M...| false|
|L006-85-00-694|M10P-22-94-349,L0...| true|
|M002-22-36-989|U985-12-45-363, ...| true|
+--------------+--------------------+---------------+
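Note that contains also matches the 4-character prefix anywhere inside column_2, not only at the start of a switch. If a strictly prefix-to-prefix comparison is needed, one possible variant is the sketch below, using Spark SQL's higher-order exists function (assumes Spark 2.4+; column names are taken from the question):
import pyspark.sql.functions as F

# Stricter sketch: split column_2 on commas and compare the 4-character
# prefix of each switch against the prefix of column_1.
df = df.withColumn(
    "Required_column",
    F.expr("exists(split(column_2, ','), s -> substring(trim(s), 1, 4) = substring(column_1, 1, 4))")
)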

How to find the uncommon rows between two Pyspark DataFrames? [duplicate]

I have to compare two dataframes to find out the column differences based on one or more key fields, using PySpark, in the most performance-efficient way possible, since I have to deal with huge dataframes.
I have already built a solution for comparing two dataframes using a hash match without key field matching, like data_compare.df_subtract(self.df_db1_hash, self.df_db2_hash),
but the scenario is different if I want to use a key field match.
Note: I have provided a sample expected dataframe. The actual requirement is that any differences in DataFrame 2, in any column, should be retrieved in the output/expected dataframe.
DataFrame 1:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sanjose| romin|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
DataFrame 2:
+------+---------+--------+----------+-------+--------+
|emp_id| emp_city|emp_name| emp_phone|emp_sal|emp_site|
+------+---------+--------+----------+-------+--------+
| 3| Chennai| rahman|9848022330| 45000|SanRamon|
| 1|Hyderabad| ram|9848022338| 50000| SF|
| 2|Hyderabad| robin|9848022339| 40000| LA|
| 4| sandiego| romino|9848022331| 45123|SanRamon|
+------+---------+--------+----------+-------+--------+
Expected dataframe after comparing dataframe 1 and 2
+------+---------+--------+----------+
|emp_id| emp_city|emp_name| emp_phone|
+------+---------+--------+----------+
| 4| sandiego| romino|9848022331|
+------+---------+--------+----------+
The subtract function is what you are looking for. It compares all column values for each row and gives you a dataframe containing the rows that differ from the other dataframe.
df2.subtract(df1).select("emp_id","emp_city","emp_name","emp_phone")
As the API documentation says:
Return a new :class:DataFrame containing rows in this frame but not in another frame.
This is equivalent to EXCEPT in SQL.
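For completeness, a minimal self-contained sketch of the same idea, using the sample rows from the question (subtract and the column names come from the question; the literal data is the sample shown above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

columns = ["emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site"]
df1 = spark.createDataFrame([
    (3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"),
    (1, "Hyderabad", "ram", "9848022338", 50000, "SF"),
    (2, "Hyderabad", "robin", "9848022339", 40000, "LA"),
    (4, "sanjose", "romin", "9848022331", 45123, "SanRamon")], columns)
df2 = spark.createDataFrame([
    (3, "Chennai", "rahman", "9848022330", 45000, "SanRamon"),
    (1, "Hyderabad", "ram", "9848022338", 50000, "SF"),
    (2, "Hyderabad", "robin", "9848022339", 40000, "LA"),
    (4, "sandiego", "romino", "9848022331", 45123, "SanRamon")], columns)

# rows of df2 that do not appear in df1 -- here, the changed row for emp_id 4
df2.subtract(df1).select("emp_id", "emp_city", "emp_name", "emp_phone").show()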

Extract key value from dataframe in PySpark

I have the below dataframe which I have read from a JSON file.
+-----------------------------+-------------------------+--------------------------+----------------------------+
|1                            |2                        |3                         |4                           |
+-----------------------------+-------------------------+--------------------------+----------------------------+
|{"todo":["wakeup", "shower"]}|{"todo":["brush", "eat"]}|{"todo":["read", "write"]}|{"todo":["sleep", "snooze"]}|
+-----------------------------+-------------------------+--------------------------+----------------------------+
I need my output to be the key and value as shown below. How do I do this? Do I need to create a schema?
+---+--------------+
| ID|          todo|
+---+--------------+
|  1|wakeup, shower|
|  2|    brush, eat|
|  3|   read, write|
|  4| sleep, snooze|
+---+--------------+
The key-value which you refer to is a struct. "keys" are struct field names, while "values" are field values.
What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
Input dataframe:
df = spark.createDataFrame(
    [((['wakeup', 'shower'],), (['brush', 'eat'],), (['read', 'write'],), (['sleep', 'snooze'],))],
    '`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID| todo|
# +---+----------------+
# | 1|[wakeup, shower]|
# | 2| [brush, eat]|
# | 3| [read, write]|
# | 4| [sleep, snooze]|
# +---+----------------+
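If you prefer todo as a comma-separated string (as in the requested output) rather than an array, a small follow-up sketch (assumes Spark 2.4+ for array_join):
from pyspark.sql.functions import array_join

df = df.withColumn('todo', array_join('todo', ', '))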
Use from_json to parse each string, then explode to cascade each element onto its own row.
data
df = spark.createDataFrame(
    [('{"todo":"[wakeup, shower]"}', '{"todo":"[brush, eat]"}', '{"todo":"[read, write]"}', '{"todo":"[sleep, snooze]"}')],
    ('value1', 'values2', 'value3', 'value4'))
code
from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # from string to array to individual rows
         .withColumn('todo', translate('todo', "[]", '')))  # remove square brackets
new.show(truncate=False)
outcome
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1 |values2 |value3 |value4 |todo |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
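If you also need the ID column from the requested output, one possible follow-up sketch uses posexplode so the 1-based position of each value becomes the ID (column names pos and col are what posexplode produces):
from pyspark.sql.functions import array, col, flatten, from_json, map_values, posexplode, translate

new = (df
       .select(posexplode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))
       .select((col('pos') + 1).alias('ID'), translate('col', '[]', '').alias('todo')))
new.show(truncate=False)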

Pyspark dataframe conversion to pandas drops data?

I have a fairly involved process of creating a pyspark dataframe, converting it to a pandas dataframe, and outputting the result to a flat file. I am not sure at which point the error is introduced, so I'll describe the whole process.
Starting out I have a pyspark dataframe that contains pairwise similarity for sets of ids. It looks like this:
+------+-------+-------------------+
| ID_A| ID_B| EuclideanDistance|
+------+-------+-------------------+
| 1| 1| 0.0|
| 1| 2|0.13103884200454394|
| 1| 3| 0.2176246463836219|
| 1| 4| 0.280568636550471|
...
I'd like to group it by ID_A, sort each group by EuclideanDistance, and only grab the top N pairs for each group. So first I do this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
window = Window.partitionBy(df['ID_A']).orderBy(df_sim['EuclideanDistance'])
result = (df.withColumn('row_num', row_number().over(window)))
I make sure ID_A = 1 is still in the "result" dataframe. Then I do this to limit each group to just 20 rows:
result1 = result.where(result.row_num<20)
result1.toPandas().to_csv("mytest.csv")
and ID_A = 1 is NOT in the resultant .csv file (although it's still there in result1). Is there a problem somewhere in this chain of conversions that could lead to a loss of data?
You are referencing 2 dataframes in the window of your solution. I'm not sure whether this is causing your error, but it's worth cleaning up. In any case, you don't need to reference a particular dataframe in a window definition; try
window = Window.partitionBy('ID_A').orderBy('EuclideanDistance')
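Applied to the chain from the question, a minimal sketch of the corrected version might look like this (note that row_num < 20 keeps only 19 rows per group; use <= 20 if you want 20):
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window = Window.partitionBy('ID_A').orderBy('EuclideanDistance')
result = df.withColumn('row_num', row_number().over(window))
result.where(result.row_num <= 20).toPandas().to_csv("mytest.csv")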
As David mentioned, you reference a second dataframe "df_sim" in your window function.
I tested the following and it works on my machine (famous last words):
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col, row_number
import pandas as pd
import numpy as np

# simulate some data
df = pd.DataFrame({'ID_A': np.arange(100) % 5,
                   'ID_B': np.repeat(np.arange(20), 5),
                   'EuclideanDistance': np.random.rand(100) * 5})
# artificially set the distance between a point and itself to 0
df.loc[df['ID_A'] == df['ID_B'], 'EuclideanDistance'] = 0
df = spark.createDataFrame(df)
# end simulation
window = Window.partitionBy(df['ID_A']).orderBy(df['EuclideanDistance'])
output = df.select('*', row_number().over(window).alias('rank')).filter(col('rank') <= 10)
output.show(50)
The simulation code is there just to make this a self-contained example. You can of course use your actual dataframe and ignore the simulation when you test it. Hope that works!

Fetching distinct values on a column using Spark DataFrame

Using Spark version 1.6.1, I need to fetch distinct values from a column and then perform some specific transformation on top of them. The column contains more than 50 million records and can grow larger.
I understand that doing a distinct.collect() will bring the results back to the driver program. Currently I am performing this task as below; is there a better approach?
import sqlContext.implicits._

preProcessedData.persist(StorageLevel.MEMORY_AND_DISK_2)

preProcessedData.select(ApplicationId).distinct.collect().foreach(x => {
  val applicationId = x.getAs[String](ApplicationId)
  val selectedApplicationData = preProcessedData.filter($"$ApplicationId" === applicationId)
  // DO SOME TASK PER applicationId
})

preProcessedData.unpersist()
Well, to obtain all the distinct values in a DataFrame you can use distinct. As you can see in the documentation, that method returns another DataFrame. After that, you can create a UDF to transform each record.
For example:
import org.apache.spark.sql.functions.{col, udf}

val df = sc.parallelize(Array((1, 2), (3, 4), (1, 6))).toDF("age", "salary")

// Obtain all the distinct values. If you call show, you should see only {1, 3}.
val distinctValuesDF = df.select(df("age")).distinct

// Define your UDF. In this case it is a simple function, but it can get complicated.
val myTransformationUDF = udf((value: Int) => value / 10)

// Run that transformation "over" your DataFrame
val afterTransformationDF = distinctValuesDF.select(myTransformationUDF(col("age")))
In PySpark, try this:
df.select('col_name').distinct().show()
This solution demonstrates how to transform data with Spark native functions, which are preferable to UDFs. It also demonstrates dropDuplicates, which is more suitable than distinct for certain queries.
Suppose you have this DataFrame:
+-------+-------------+
|country| continent|
+-------+-------------+
| china| asia|
| brazil|south america|
| france| europe|
| china| asia|
+-------+-------------+
Here's how to take all the distinct countries and run a transformation:
df
.select("country")
.distinct
.withColumn("country", concat(col("country"), lit(" is fun!")))
.show()
+--------------+
| country|
+--------------+
|brazil is fun!|
|france is fun!|
| china is fun!|
+--------------+
You can use dropDuplicates instead of distinct if you don't want to lose the continent information:
df
.dropDuplicates("country")
.withColumn("description", concat(col("country"), lit(" is a country in "), col("continent")))
.show(false)
+-------+-------------+------------------------------------+
|country|continent |description |
+-------+-------------+------------------------------------+
|brazil |south america|brazil is a country in south america|
|france |europe |france is a country in europe |
|china |asia |china is a country in asia |
+-------+-------------+------------------------------------+
See here for more information about filtering DataFrames and here for more information on dropping duplicates.
Ultimately, you'll want to wrap your transformation logic in custom transformations that can be chained with the Dataset#transform method.
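As a rough illustration of that pattern, here is a minimal PySpark sketch (DataFrame.transform has been available in PySpark since Spark 3.0; the function name is just for illustration, reusing the "is fun!" example above):
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, concat, lit

def with_fun_countries(df: DataFrame) -> DataFrame:
    # custom transformation: append " is fun!" to each country
    return df.withColumn("country", concat(col("country"), lit(" is fun!")))

df.select("country").distinct().transform(with_fun_countries).show()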
df = df.select("column1", "column2", ...., .., "column N").distinct.[].collect()
In the empty brackets, you can insert a call such as toJSON() if you want the df in JSON format.