How to convert a dataframe to a variable

Is there a direct function to convert a dataframe to a single value and assign it to a variable?
For example, the code below returns this:
>>> partitionRecordCount = spark.sql("select count(*) from mydb.mytable where partition_date = 'yyyymmdd'")
>>> partitionRecordCount.show()
+--------+
|count(1)|
+--------+
| 206157|
+--------+
What I need is the following:
>>> partitionRecordCount
206157
I need the record count as an integer value directly in the variable on the left-hand side, rather than a dataframe. Please advise.

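See the answer to "get value out of dataframe".
For your example you can change it to the line below. Note that collect()[0] returns the first Row, not the integer itself; index into that Row once more to get the value.
partitionRecordCount = partitionRecordCount.collect()[0]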

Try
partitionRecordCount.collect()[0][0]

Related

PySpark: Transform values of given column in the DataFrame

I am new to PySpark and Spark in general.
I would like to apply transformation on a given column in the DataFrame, essentially call a function for each value on that specific column.
I have my DataFrame df that looks like this:
df.show()
+------------+--------------------+
|version | body |
+------------+--------------------+
| 1|9gIAAAASAQAEAAAAA...|
| 2|2gIAAAASAQAEAAAAA...|
| 3|3gIAAAASAQAEAAAAA...|
| 1|7gIAKAASAQAEAAAAA...|
+------------+--------------------+
I need to read the value of the body column for each row where the version is 1 and then decrypt it (I have my own logic/function that takes a string and returns a decrypted string). Finally, I need to write the decrypted values in CSV format to an S3 bucket.
def decrypt(encrypted_string: str) -> str:
    ...  # code that returns the decrypted string
When I do the following, I get the filtered values to which I need to apply my decrypt function.
df.where(col('version') == '1')\
    .select(col('body')).show()
+--------------------+
| body|
+--------------------+
|9gIAAAASAQAEAAAAA...|
|7gIAKAASAQAEAAAAA...|
+--------------------+
However, I am not clear on how to do that. I tried collect(), but that defeats the purpose of using Spark.
I also tried .rdd.map as follows, but that did not work.
df.where(col('version') == '1')\
    .select(col('body'))\
    .rdd.map(lambda x: decrypt).toDF().show()
OR
    .rdd.map(decrypt).toDF().show()
Could someone please help with this?

Please try:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
decrypt_udf = udf(decrypt, StringType())
df.where(col('version') == '1').withColumn('body', decrypt_udf('body'))

I got a clue from this post: Pyspark DataFrame UDF on Text Column.
It looks like I can get it with the following. I was not wrapping the function in a udf earlier, which is why it wasn't working.
dummy_function_udf = udf(decrypt, StringType())
df.where(col('version') == '1')\
    .select(col('body')) \
    .withColumn('decryptedBody', dummy_function_udf('body')) \
    .show()

In Spark, how to check the date format?

How can we check the date format in the code below?
DF = DF.withColumn("DATE", to_date(trim(col("DATE")), "yyyyMMdd"))
Error:
Caused by: java.time.format.DateTimeParseException: Text '2171121' could not be parsed at index 6
Expectation:
If the format is correct, keep the value as it is; otherwise populate null in the same column.

In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp, to_timestamp and to_date fail if the specified datetime pattern is invalid. In Spark 3.0 or earlier, they return NULL. See the Spark SQL migration guide for details.
To switch back to the previous behavior, you can use the configuration below.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
It is also worth reading what changed in the datetime parser since Spark 3.0.
After applying the configuration above, you can use the when() and otherwise() functions to get the desired result.
>>> from pyspark.sql.functions import *
>>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
>>> df = spark.createDataFrame([(20210822,),(1234,)]).toDF("date")
# casting column to string as to_date function will accept string or date or timestamp type columns
>>> df.withColumn("date", when(to_date(df["date"].cast("string"),"yyyyMMdd").isNull(), None).otherwise(df["date"])).show()
+--------+
| date|
+--------+
|20210822|
| null|
+--------+

How to add Extra column with current date in Spark dataframe

I am trying to add a column to my existing PySpark DataFrame using the withColumn method. I want to insert the current date in this column. My source has no date column, so I am adding this current-date column to my DataFrame and saving the DataFrame to my table, so that I can use the column later for tracking purposes.
I am using the code below:
df2=df.withColumn("Curr_date",datetime.now().strftime('%Y-%m-%d'))
Here df is my existing DataFrame, and I want to save df2 as a table with the Curr_date column.
But withColumn expects an existing column or the lit method instead of datetime.now().strftime('%Y-%m-%d').
Could someone please guide me on how to add this date column to my DataFrame?

Use either lit or current_date:
from datetime import datetime
from pyspark.sql import functions as F
df2 = df.withColumn("Curr_date", F.lit(datetime.now().strftime("%Y-%m-%d")))
# OR
df2 = df.withColumn("Curr_date", F.current_date())

current_timestamp() is good, but it is evaluated once per query, so every row gets the same value.
If you prefer the timestamp at which each row is processed, you may use the method below:
from pyspark.sql.functions import expr
df.withColumn('current', expr("reflect('java.time.LocalDateTime', 'now')"))

There is a spark function current_timestamp().
from pyspark.sql.functions import *
df.withColumn('current', date_format(current_timestamp(), 'yyyy-MM-dd')).show()
+----+----------+
|test| current|
+----+----------+
|test|2020-09-09|
+----+----------+

drop record based on multiple columns value using pyspark

I have a pyspark dataframe like below :
I want to keep only one record when two rows have the same values in both the uniq_id and date_time columns.
Expected Output :
I want to achieve this using pyspark.
Thank you

You can group by uniq_id and date_time and use first()
from pyspark.sql import functions as F
df.groupBy("uniq_id", "date_time").agg(F.first("col_1"), F.first("col_2"), F.first("col_3")).show()

I don't see how you would compare an int column with a timestamp one (though it can be done by casting the timestamp to int), but this kind of filtering can be done via
from pyspark.sql import functions as F
# assume you already have your DataFrame
df = df.filter(F.col('first_column_name') == F.col('second_column_name'))
or just
df = df.filter('first_column_name = second_column_name')

Spark DataFrame equivalent to Pandas Dataframe `.iloc()` method?

Is there a way to reference Spark DataFrame columns by position using an integer?
Analogous Pandas DataFrame operation:
df.iloc[:0] # Give me all the rows at column position 0

The closest equivalent of pandas df.iloc is collect(), which brings all rows to the driver as a list of Row objects.
PySpark examples:
X = df.collect()[0]['age']
or
X = df.collect()[0][1] #row 0 col 1

Not really, but you can try something like this:
Python:
df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1]) # I assume [:1] is what you really want
## DataFrame[_1: bigint]
or
df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
Scala
val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)
Note:
Spark SQL doesn't support row indexing, and it is unlikely that it ever will, so it is not possible to index across the row dimension.

You can use it like this in spark-shell.
scala> df.columns
Array[String] = Array(age, name)
scala> df.select(df.columns(0)).show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+

As of Spark 3.1.1 on Databricks, it's a matter of selecting the column of interest and applying limit:
%python
from pyspark.sql.functions import col
retDF = (inputDF
    .select(col(inputDF.columns[0]))
    .limit(100)
)
