I have a dataframe in which I want to replace one column's values with a column from another dataframe. The problem is that the two dataframes have no common column, so I can't do a join. I simply want to drop the column from dataframe 1 and replace it with the corresponding column from dataframe 2. Can you please help me?
I have tried union, but it appends the rows of both dataframes, which is not what I want.
Here is a sample dataframe for reference:
df1 = spark.createDataFrame(
[("USA","Unknown"), ("UK","Unknown"), ("Canada","Unknown")], ["country","count"])
df1.show()
+-------+-------+
|country|  count|
+-------+-------+
|    USA|Unknown|
|     UK|Unknown|
| Canada|Unknown|
+-------+-------+
df2 = spark.createDataFrame(
[("1000",), ("2000",), ("3000",)], ["count"])
df2.show()
+-----+
|count|
+-----+
| 1000|
| 2000|
| 3000|
+-----+
Expected Result: I would like to drop/delete the count column in df1, and append it from df2.
+-------+-----+
|country|count|
+-------+-----+
|    USA| 1000|
|     UK| 2000|
| Canada| 3000|
+-------+-----+
Appreciate any help please. Thanks
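One way to do this (a sketch of my own, assuming the pairing is purely by row position and that ordering by monotonically_increasing_id reflects that position) is to give each dataframe a row index and join on it:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Impose a positional index on each dataframe, drop df1's placeholder column,
# and join the two frames on that index.
w = Window.orderBy(F.monotonically_increasing_id())
df1_idx = df1.drop("count").withColumn("row_id", F.row_number().over(w))
df2_idx = df2.withColumn("row_id", F.row_number().over(w))

result = df1_idx.join(df2_idx, "row_id").drop("row_id")
result.show()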
Related
I have the two dataframes below, with a hash added as an additional column to identify differences for the same id in both dataframes:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen  |           |NY2    |103|2f34
# identify rows whose hash has no match in the other dataframe for the same id
df1_un_match_indf2 = df1.join(df2, df1.hash == df2.hash, "leftanti")
df2_un_match_indf1 = df2.join(df1, df2.hash == df1.hash, "leftanti")
# Both joins above return every row, since all hashes differ for the same id
Now I am trying to find the difference in row values for the same id from the 'df1_un_match_indf2' and 'df2_un_match_indf1' dataframes, so that it shows the differences row by row:
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.join(df4, df3.id == df4.id, "inner")
common_diff.show()
but the result shows the differences like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|name |department|state|id |hash|name |department|state|id |hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|Sales     |NY   |101|c123|James|Sales1    |null |101|4df2|
|Maria|Finance   |CA   |102|d234|Maria|Finance   |     |102|5rfg|
|Jen  |Marketing |NY   |103|df34|Jen  |          |NY2  |103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I tried different ways, but didn't find the right solution to produce this expected format.
Can anyone give a solution or an idea for this?
Thanks
What you want to use is likely collect_list, or maybe collect_set.
Here is a small example of how they work:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
.groupby("id")
.agg(F.collect_set("code"),
F.collect_list("name"))
.show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union, to enable you to group the data:
df3 = df1_un_match_indf2
df4 = df2_un_match_indf1
common_diff = df3.union(df4)
(common_diff
.groupby("id")
.agg(F.collect_set("name"),
F.collect_list("department"))
.show())
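Applied to all of the columns in your example (a sketch; collect_list keeps both values per id, which matches the expected layout above better than collect_set, which would drop duplicates):
from pyspark.sql import functions as F

# Collect each column into a per-id list, so the differing values sit side by side.
agg_cols = [F.collect_list(c).alias(c) for c in ["name", "department", "state", "hash"]]
result = common_diff.groupBy("id").agg(*agg_cols)
result.show(truncate=False)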
If you can't do a union, you can build the arrays from the joined columns instead:
from pyspark.sql.functions import array
common_diff.select(
common_diff.id,
array(
common_diff.thisState,
common_diff.thatState
).alias("State"),
array(
common_diff.thisDept,
common_diff.thatDept
).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming the columns and using the groupby is likely cleaner and clearer.
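If you do go the join route, here is a minimal sketch of that idea end to end; the that_* column names are assumptions introduced here by renaming df4 before the join so both sides can be referenced:
from pyspark.sql import functions as F

# Rename df4's columns so they don't collide with df3's after the join.
df4_renamed = df4.select([F.col(c).alias("that_" + c) for c in df4.columns])

joined = df3.join(df4_renamed, df3.id == df4_renamed.that_id, "inner")

# Pair each original column with its counterpart in a two-element array.
paired = joined.select(
    F.array("name", "that_name").alias("name"),
    F.array("department", "that_department").alias("department"),
    F.array("state", "that_state").alias("state"),
    F.array("id", "that_id").alias("id"),
    F.array("hash", "that_hash").alias("hash"),
)
paired.show(truncate=False)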
I am trying to identify numeric values in a column. I tried the option below to achieve this.
But for '7877177450' it shows non-numeric. In my scenario the IDs can also be numbers with 10+ digits.
How can I make that work?
from pyspark.sql import functions as F

values = [('695435',), ('7877177450',), ('PA-098',), ('asv',), ('23456123',)]
df = sqlContext.createDataFrame(values, ['ID'])
df.show()
df = df.withColumn("Status", F.when(F.col("ID").cast("int").isNotNull(), F.lit("numeric")).otherwise(F.lit("non-numeric")))
df.show()
+----------+
| ID|
+----------+
| 695435|
|7877177450|
| PA-098|
| asv|
| 23456123|
+----------+
+----------+-----------+
| ID| Status|
+----------+-----------+
| 695435| numeric|
|7877177450|non-numeric|
| PA-098|non-numeric|
| asv|non-numeric|
| 23456123| numeric|
+----------+-----------+
You can cast to long instead:
df2 = df.withColumn("Status", F.when((F.col("ID").cast("long").isNotNull()), F.lit("numeric")).otherwise(F.lit("non-numeric")))
int has a maximum value of 2147483647, so it cannot handle values greater than this, and you'll get null.
Or you can use a regular expression:
df2 = df.withColumn("Status",F.when(F.col('ID').rlike('^(\\d)+$'), F.lit("numeric")).otherwise(F.lit("non-numeric")))
I have two dataframes in PySpark as below:
Dataframe A: total 1000 records
+----+
|Name|
+----+
|   a|
|   b|
|   c|
+----+
Dataframe B: Total 3 records
+-----+
|Zip |
+-----+
|06905|
|06901|
|06902|
+-----+
I need to add a new column named Zip in Dataframe A and populate the values with a randomly selected value from Dataframe B. So the Dataframe A will look something like this:
+----+-----+
|Name|  Zip|
+----+-----+
|   a|06901|
|   b|06905|
|   c|06902|
|   d|06902|
+----+-----+
I am running this on Azure Databricks and apparently quinn isn't available as a module there, so unfortunately I can't use quinn.
If b is small (3 rows), you can just collect it into a Python list and add it as an array column to a. Then you can get a random element using shuffle.
import pyspark.sql.functions as F
df = a.withColumn(
'Zip',
F.shuffle(
F.array(*[F.lit(r[0]) for r in b.collect()])
)[0]
)
df.show()
+----+-----+
|Name|  Zip|
+----+-----+
|   a|06901|
|   b|06905|
|   c|06902|
|   d|06901|
+----+-----+
You can aggregate the dataframe with the zips and collect the values into one array column, then do a cross join and pick a random element from the array of zips, for example by using shuffle on the array before taking the first element:
from pyspark.sql import functions as F
df_result = df_a.crossJoin(
df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
"Zip",
F.expr("shuffle(Zip)[0]")
)
#+----+-----+
#|Name|  Zip|
#+----+-----+
#|   a|06901|
#|   b|06902|
#|   c|06901|
#|   d|06901|
#+----+-----+
I have a PySpark dataframe that looks like this:
>>> df1.show(1,False)
+---------------------------+
|col1 |
+---------------------------+
|[this, is, a, sample, text]| => not a fixed number of array elements
+---------------------------+
And a lookup table/df like this
>>> lookup.show()
+------+
|lookup|
+------+
| this|
| is|
| a|
|sample|
+------+
For each row of df1, I need to look up each array element in the lookup dataframe and return true or false, e.g. [T, T, T, T, F].
How can I loop through df1?
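One possible approach (a sketch, assuming the lookup table is small enough to collect to the driver; it uses Spark SQL's transform and array_contains, available since Spark 2.4):
from pyspark.sql import functions as F

# Collect the small lookup column and attach it as an array literal,
# then map every element of col1 to a boolean membership flag.
lookup_vals = [r["lookup"] for r in lookup.collect()]
df_result = (
    df1
    .withColumn("lookup_arr", F.array(*[F.lit(v) for v in lookup_vals]))
    .withColumn("matches", F.expr("transform(col1, x -> array_contains(lookup_arr, x))"))
    .drop("lookup_arr")
)
df_result.show(1, False)
# [this, is, a, sample, text] -> [true, true, true, true, false]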
I have two dataframes, each with one column (both with the same number of entries per column):
df1 =
+-------+
| col1 |
+-------+
| 10 |
+-------+
| 3 |
+-------+
...
df2 =
+-------+
| col2 |
+-------+
| 6 |
+-------+
| 1 |
+-------+
...
I wish to merge them such that the final DataFrame is of the following shape:
df3 =
+-------+-------+
| col1 | col2 |
+-------+-------+
| 10 | 6 |
+-------+-------+
| 3 | 1 |
+-------+-------+
...
But I am not able to do this with the join method, since I am not trying to merge the columns based on a column header. If anybody has any tips on how to achieve this easily, that would be a great help!
One way to do this, if you are able to get your columns as Python lists, is to use the built-in zip function. For example:
list1 = [1,2,3]
list2 = ['foo','baz','bar']
data_tuples = list(zip(list1,list2))
df = spark.createDataFrame(data_tuples)
df.show()
+---+---+
| _1| _2|
+---+---+
| 1|foo|
| 2|baz|
| 3|bar|
+---+---+
However, I'm not sure how this behaves with a big dataset.
Try this:
from pyspark.sql.functions import monotonically_increasing_id

df1 = df1.withColumn("code", monotonically_increasing_id())
df2 = df2.withColumn("code", monotonically_increasing_id())
This way you give them both a code column, which you can use to merge both dataframes in the classic way:
df3 = df2.join(df1, ["code"])
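A small follow-up sketch (my own addition, not part of the answer above): drop the helper column to get the expected two-column layout, keeping in mind that monotonically_increasing_id only pairs rows reliably when both dataframes share the same partitioning.
# Keep only the original columns; the row pairing relies on identical partitioning.
df3 = df1.join(df2, ["code"]).select("col1", "col2")
df3.show()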