How to iterate through an Array value of a Dataframe? - apache-spark-sql

I have a PySpark dataframe that looks like this:
>>> df1.show(1,False)
+---------------------------+
|col1                       |
+---------------------------+
|[this, is, a, sample, text]| => not a fixed number of array elements
+---------------------------+
And a lookup table/df like this
>>> lookup.show()
+------+
|lookup|
+------+
|  this|
|    is|
|     a|
|sample|
+------+
For each row, I need to look up every array element of df1 in the lookup dataframe and return true or false for each one, e.g. [T, T, T, T, F].
How can I loop through df1?
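One way to approach this (a minimal sketch, assuming the lookup table is small enough to collect to the driver and Spark >= 3.1 so that F.transform is available; the flags column name is just illustrative): collect the lookup values into a Python list and map each array element to a boolean with isin.
from pyspark.sql import functions as F

# collect the (small) lookup column into a Python list on the driver
lookup_values = [r["lookup"] for r in lookup.select("lookup").collect()]

# for every element of col1, check membership in the lookup list
result = df1.withColumn(
    "flags",
    F.transform("col1", lambda x: x.isin(lookup_values))
)
result.show(1, False)
# [this, is, a, sample, text] -> [true, true, true, true, false]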

Related

Pyspark replace a column from another dataframe which has one column

I have a dataframe with several columns, and I want to replace one of them with a column from another dataframe. The problem is that there is no common column between the two, so I can't use a join. I simply want to drop the column in df1 and replace it with the corresponding column from the second dataframe. Can you please help me?
I have tried union, but it adds the rows of both dataframes, which is not what I want.
Here is a sample dataframe for reference:
df1 = spark.createDataFrame(
    [("USA", "Unknown"), ("UK", "Unknown"), ("Canada", "Unknown")], ["country", "count"])
df1.show()
+-------+-------+
|country|  count|
+-------+-------+
|    USA|Unknown|
|     UK|Unknown|
| Canada|Unknown|
+-------+-------+
df2 = spark.createDataFrame(
    [("1000",), ("2000",), ("3000",)], ["count"])
df2.show()
+-----+
|count|
+-----+
| 1000|
| 2000|
| 3000|
+-----+
Expected result: I would like to drop the count column in df1 and replace it with the count column from df2.
+-------+-----+
|country|count|
+-------+-----+
|    USA| 1000|
|     UK| 2000|
| Canada| 3000|
+-------+-----+
Appreciate any help please. Thanks
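One possible approach (a sketch, assuming both dataframes have the same number of rows and you are willing to rely on their current row order; the rn column name is just illustrative): give each dataframe a positional index with row_number over monotonically_increasing_id, join on that index, and keep the count column from df2.
from pyspark.sql import functions as F, Window

# note: this window pulls everything into a single partition, which is fine for small data
w = Window.orderBy(F.monotonically_increasing_id())

a = df1.drop("count").withColumn("rn", F.row_number().over(w))  # country only, plus row index
b = df2.withColumn("rn", F.row_number().over(w))                # count, plus row index

result = a.join(b, "rn").drop("rn")
result.show()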

How to do reshape operation on a single column in pyspark dataframe?

I have a long pyspark dataframe as shown below:
+------+
|number|
+------+
|  12.4|
|  13.4|
|  42.3|
|  33.4|
|  42.3|
|  32.4|
|  44.2|
|  12.3|
|  45.4|
+------+
Ideally, I want this reshaped into an n x n matrix, where n is the square root of the length of the dataframe.
There is a solution that converts it to a numpy array and reshapes it into an n x n matrix, but I want this done in PySpark, because my data is very long (about 100 million rows).
So the expected output I am looking for is something along these lines:
+------+------+------+
| 12.4 | 13.4 | 42.3 |
| 33.4 | 42.3 | 32.4 |
| 44.2 | 12.3 | 45.4 |
+------+------+------+
I was able to do it by converting to pandas, then to numpy, and then reshaping, but I want to do this transformation in PySpark itself, because the code below only works for a few thousand rows.
import numpy as np

covarianceMatrix_pd = covarianceMatrix_df.toPandas()
nrows = np.sqrt(len(covarianceMatrix_pd))
covarianceMatrix_pd = covarianceMatrix_pd.to_numpy().reshape((int(nrows), int(nrows)))
covarianceMatrix_pd
One way to do this is to use row_number together with pivot once we have the count of the dataframe:
from pyspark.sql import functions as F, Window
from math import sqrt

c = int(sqrt(df.count()))  # width of the square matrix; this gives 3

# global 1-based row number over the whole dataframe
rnum = F.row_number().over(Window.orderBy(F.lit(1)))

out = (df.withColumn("Rnum", ((rnum - 1) / c).cast("Integer"))  # target matrix row
         .withColumn("idx", F.row_number().over(Window.partitionBy("Rnum").orderBy("Rnum")))  # position within that row
         .groupby("Rnum").pivot("idx").agg(F.first("number")))
out.show()
+----+----+----+----+
|Rnum|   1|   2|   3|
+----+----+----+----+
|   0|12.4|13.4|42.3|
|   1|33.4|42.3|32.4|
|   2|44.2|12.3|45.4|
+----+----+----+----+
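To get just the matrix columns as in the expected output, the helper column can be dropped afterwards, for example:
out.orderBy("Rnum").drop("Rnum").show()
#+----+----+----+
#|   1|   2|   3|
#+----+----+----+
#|12.4|13.4|42.3|
#|33.4|42.3|32.4|
#|44.2|12.3|45.4|
#+----+----+----+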

How to identify numeric values in a dataframe column with 10+ digits using pyspark

I am trying to identify numeric values in a column, and I used the approach below.
But '7877177450' is flagged as non-numeric. In my scenario the IDs can also be numbers with 10+ digits.
How can I make that work?
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit

values = [('695435',), ('7877177450',), ('PA-098',), ('asv',), ('23456123',)]
df = sqlContext.createDataFrame(values, ['ID'])
df.show()
df = df.withColumn("Status", F.when(col("ID").cast("int").isNotNull(), lit("numeric")).otherwise(lit("non-numeric")))
df.show()
+----------+
|        ID|
+----------+
|    695435|
|7877177450|
|    PA-098|
|       asv|
|  23456123|
+----------+
+----------+-----------+
|        ID|     Status|
+----------+-----------+
|    695435|    numeric|
|7877177450|non-numeric|
|    PA-098|non-numeric|
|       asv|non-numeric|
|  23456123|    numeric|
+----------+-----------+
You can cast to long instead:
df2 = df.withColumn("Status", F.when((F.col("ID").cast("long").isNotNull()), F.lit("numeric")).otherwise(F.lit("non-numeric")))
int has a maximum value of 2147483647, so it cannot handle values greater than this, and you'll get null.
Or you can use a regular expression:
df2 = df.withColumn("Status",F.when(F.col('ID').rlike('^(\\d)+$'), F.lit("numeric")).otherwise(F.lit("non-numeric")))

How to update a dataframe in PySpark with random values from another dataframe?

I have two dataframes in PySpark as below:
Dataframe A: total 1000 records
+-----+
| Name|
+-----+
|    a|
|    b|
|    c|
+-----+
Dataframe B: Total 3 records
+-----+
|  Zip|
+-----+
|06905|
|06901|
|06902|
+-----+
I need to add a new column named Zip to Dataframe A and populate it with values randomly selected from Dataframe B. Dataframe A will then look something like this:
+-----+-----+
| Name|  Zip|
+-----+-----+
|    a|06901|
|    b|06905|
|    c|06902|
|    d|06902|
+-----+-----+
I am running this on Azure Databricks, and apparently quinn isn't available there, so unfortunately I can't use quinn.
If b is small (3 rows), you can just collect it into a Python list and add it as an array column to a. Then you can get a random element using shuffle.
import pyspark.sql.functions as F

df = a.withColumn(
    'Zip',
    F.shuffle(
        # collect the (small) dataframe b and turn its zips into an array column
        F.array(*[F.lit(r[0]) for r in b.collect()])
    )[0]  # shuffle the array and pick its first element
)
df.show()
+----+-----+
|Name|  Zip|
+----+-----+
|   a|06901|
|   b|06905|
|   c|06902|
|   d|06901|
+----+-----+
You can agg the dataframe with the zips and collect the values into a single array column, then do a cross join and select a random element from the array of zips, for example by shuffling the array before picking the first element:
from pyspark.sql import functions as F

df_result = df_a.crossJoin(
    df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
    "Zip",
    F.expr("shuffle(Zip)[0]")
)
#+----+-----+
#|Name|  Zip|
#+----+-----+
#|   a|06901|
#|   b|06902|
#|   c|06901|
#|   d|06901|
#+----+-----+
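If you need the random pick to be repeatable across runs, one variation (a sketch, swapping shuffle for a seeded rand; the seed 42 is just an example) is to index into the array with a seeded random position:
from pyspark.sql import functions as F

df_result = df_a.crossJoin(
    df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
    "Zip",
    # rand(42) is seeded, so results are repeatable as long as the input partitioning is unchanged
    F.expr("Zip[cast(floor(rand(42) * size(Zip)) as int)]")
)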

Concatenate two DataFrames via column [PySpark]

I have two DataFrames, each with a single column and the same number of entries per column:
df1 =
+-------+
| col1  |
+-------+
| 10    |
+-------+
| 3     |
+-------+
...
df2 =
+-------+
| col2  |
+-------+
| 6     |
+-------+
| 1     |
+-------+
...
I wish to merge them such that the final DataFrame is of the following shape:
df3 =
+-------+-------+
| col1  | col2  |
+-------+-------+
| 10    | 6     |
+-------+-------+
| 3     | 1     |
+-------+-------+
...
But I am not able to do this with the join method, since I am not merging based on a common key column. If anybody has any tips on how to achieve this easily, that would be greatly helpful!
One way to do this, if you are able to get your columns as Python lists, is to use the zip method. For example:
list1 = [1,2,3]
list2 = ['foo','baz','bar']
data_tuples = list(zip(list1,list2))
df = spark.createDataFrame(data_tuples)
df.show()
+---+---+
| _1| _2|
+---+---+
| 1|foo|
| 2|baz|
| 3|bar|
+---+---+
However, I'm not sure how it behaves with big datasets.
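For completeness, a sketch of getting the two columns as lists from the original dataframes (this assumes both columns fit comfortably on the driver, so it only really suits small data):
list1 = [r[0] for r in df1.select("col1").collect()]
list2 = [r[0] for r in df2.select("col2").collect()]

data_tuples = list(zip(list1, list2))
df3 = spark.createDataFrame(data_tuples, ["col1", "col2"])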
Try this:
from pyspark.sql.functions import monotonically_increasing_id

df1 = df1.withColumn("code", monotonically_increasing_id())
df2 = df2.withColumn("code", monotonically_increasing_id())
This way you give both dataframes a code column, which you can then use to merge them in the usual way:
df3 = df2.join(df1, ["code"])
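You will probably want to drop the helper column afterwards:
df3 = df3.drop("code")
One caveat: monotonically_increasing_id only guarantees increasing IDs, not consecutive ones, and the generated values depend on partitioning, so the two code columns only line up reliably when both dataframes are partitioned the same way (for example when both fit in a single partition).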