I have an issue with Spark SQL: when I typecast a column from string to timestamp, the value becomes NULL. Below are the details:
val df2 = sql("""select FROM_UNIXTIME(UNIX_TIMESTAMP(to_date(LAST_DAY(ADD_MONTHS(CONCAT_WS('-','2018','10','01'),0))),'yyyy-MM-dd'),'yyyyMMdd HH:mm:ss')""")
df2: org.apache.spark.sql.DataFrame = [from_unixtime(unix_timestamp(to_date(last_day(add_months(CAST(concat_ws(-, 2018, 10, 01) AS DATE), 0))), yyyy-MM-dd), yyyyMMdd HH:mm:ss): string]
scala> df2.show
+----------------------------------------------------------------------------------------------------------------------------------------+
|from_unixtime(unix_timestamp(to_date(last_day(add_months(CAST(concat_ws(-, 2018, 10, 01) AS DATE), 0))), yyyy-MM-dd), yyyyMMdd HH:mm:ss)|
+----------------------------------------------------------------------------------------------------------------------------------------+
| 20181001 00:00:00|
+----------------------------------------------------------------------------------------------------------------------------------------+
When I typecast to timestamp explicitly, it doesn't give me the desired result:
val df2 = sql("""select cast(FROM_UNIXTIME(UNIX_TIMESTAMP(to_date(LAST_DAY(ADD_MONTHS(CONCAT_WS('-','2018','10','01'),0))),'yyyy-MM-dd'),'yyyyMMdd HH:mm:ss') as timestamp)""")
df2: org.apache.spark.sql.DataFrame = [CAST(from_unixtime(unix_timestamp(to_date(last_day(add_months(CAST(concat_ws(-, 2018, 10, 01) AS DATE), 0))), yyyy-MM-dd), yyyyMMdd HH:mm:ss) AS TIMESTAMP): timestamp]
scala> df2.show
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|CAST(from_unixtime(unix_timestamp(to_date(last_day(add_months(CAST(concat_ws(-, 2018, 10, 01) AS DATE), 0))), yyyy-MM-dd), yyyyMMdd HH:mm:ss) AS TIMESTAMP)|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
| null|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
Any idea how to resolve this?
Try the following. CAST(... AS TIMESTAMP) only parses strings in the standard 'yyyy-MM-dd HH:mm:ss' style, so your 'yyyyMMdd HH:mm:ss' string casts to NULL; run it through unix_timestamp with the matching pattern first and cast that:
val df2 = spark.sql(
  """select CAST(unix_timestamp(FROM_UNIXTIME(UNIX_TIMESTAMP(to_date(LAST_DAY(ADD_MONTHS(CONCAT_WS('-','2018','10','01'),0))),'yyyy-MM-dd'),'yyyyMMdd HH:mm:ss'),'yyyyMMdd HH:mm:ss') as timestamp) as destination""")
df2.show(false)
df2.printSchema()
+-------------------+
|destination |
+-------------------+
|2018-10-31 00:00:00|
+-------------------+
root
|-- destination: timestamp (nullable = true)
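Alternatively, on Spark 2.2+ the to_timestamp function accepts an explicit pattern, which avoids the second unix_timestamp pass. A minimal sketch (dfAlt is just an illustrative name):
val dfAlt = spark.sql(
  """select to_timestamp(
       FROM_UNIXTIME(UNIX_TIMESTAMP(to_date(LAST_DAY(ADD_MONTHS(CONCAT_WS('-','2018','10','01'),0))),'yyyy-MM-dd'),'yyyyMMdd HH:mm:ss'),
       'yyyyMMdd HH:mm:ss') as destination""")
dfAlt.printSchema() // destination: timestamp (nullable = true)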
I got it working like this, without using any explicit format patterns:
val df2 = sql("""select cast(FROM_UNIXTIME(UNIX_TIMESTAMP(cast(LAST_DAY(ADD_MONTHS(CONCAT_WS('-','2018','12','31'),0)) as timestamp))) as timestamp)""")
scala> df2.show
+--------------------+
|2018-12-31 00:00:...|
+--------------------+
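Since LAST_DAY already returns a DATE, the string round trip can be skipped entirely. A minimal sketch of the same result (df3 is just an illustrative name):
val df3 = sql("""select cast(LAST_DAY(ADD_MONTHS(CONCAT_WS('-','2018','12','31'),0)) as timestamp) as destination""")
df3.show(false)   // 2018-12-31 00:00:00
df3.printSchema() // destination: timestamp (nullable = true)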
I have the following 2 dataframes:
df_a:
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-06-27|   A|
+---+----------+----+
df_b:
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-05-19|   A|
|  1|2021-05-31|   B|
|  1|2021-08-27|   C|
+---+----------+----+
I want to use df_b.code to update df_a.code under the following condition:
use the row from df_b whose date is the latest one prior to df_a.date.
So df_a.code should be updated to 'B', since df_b.date '2021-05-31' is the latest date prior to '2021-06-27'.
I tried:
select a.id, b.code
from df_a left join df_b
on a.id = b.id
and b.date = (select max(b.date) from df_b where id = a.id and date <= a.date)
but I'm getting a 'Correlated scalar sub-queries can only be used in a Filter/Aggregate/Project and a few commands' error.
You can use a window function:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

win = Window.partitionBy(df_a.id).orderBy(df_b.date.desc())

(
    df_a
    .join(df_b, ['id'])
    .filter(df_a.date > df_b.date)
    .withColumn("r", F.row_number().over(win))
    .filter(F.col("r") == 1)
    .select(df_a.id, df_a.date, df_b.code)
).show()
Output:
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-06-27|   B|
+---+----------+----+
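Since the original attempt was plain SQL, the same idea can also stay in Spark SQL by replacing the correlated subquery with row_number() over a window. A sketch (df_a and df_b are registered as temp views first):
df_a.createOrReplaceTempView("df_a")
df_b.createOrReplaceTempView("df_b")
spark.sql("""
  select id, date, code
  from (
    select a.id, a.date, b.code,
           row_number() over (partition by a.id, a.date order by b.date desc) as rn
    from df_a a
    join df_b b
      on a.id = b.id
     and b.date < a.date
  ) t
  where rn = 1
""").show()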
Another approach is to compute the lead date on df2 (the lookup dataframe) first and then join with BETWEEN.
from pyspark.sql import functions as f
from pyspark.sql.window import Window

data1 = [[1, '2021-06-27', 'A']]
data2 = [[1, '2021-05-19', 'A'], [1, '2021-05-31', 'B'], [1, '2021-08-27', 'C']]
cols = ['id', 'date', 'code']

df1 = spark.createDataFrame(data1, cols).withColumn('date', f.col('date').cast('date'))
df2 = spark.createDataFrame(data2, cols).withColumn('date', f.col('date').cast('date'))

w = Window.partitionBy('id').orderBy('date')
df3 = df2.withColumn('date_after', f.lead('date', 1, '2999-12-31').over(w))
df3.show()

df1.alias('a') \
    .join(df3.alias('b'), (f.col('a.id') == f.col('b.id')) & (f.col('a.date').between(f.col('b.date'), f.col('b.date_after'))), 'left') \
    .withColumn('new_code', f.coalesce('b.code', 'a.code')) \
    .select('a.id', 'a.date', 'new_code').toDF('id', 'date', 'code') \
    .show()
+---+----------+----+----------+
| id|      date|code|date_after|
+---+----------+----+----------+
|  1|2021-05-19|   A|2021-05-31|
|  1|2021-05-31|   B|2021-08-27|
|  1|2021-08-27|   C|2999-12-31|
+---+----------+----+----------+
+---+----------+----+
| id|      date|code|
+---+----------+----+
|  1|2021-06-27|   B|
+---+----------+----+
I have a function that takes a LocalDate (it could take any other type) and returns a DataFrame, e.g.:
def genDataFrame(refDate: LocalDate): DataFrame = {
  Seq(
    (refDate, refDate.minusDays(7)),
    (refDate.plusDays(3), refDate.plusDays(7))
  ).toDF("col_A", "col_B")
}
genDataFrame(LocalDate.parse("2021-07-02")) output:
+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
+----------+----------+
I want to apply this function to each element in a dataframe column (which contains, obviously, LocalDate values), such as:
val myDate = LocalDate.parse("2021-07-02")
val df = Seq(
  myDate,
  myDate.plusDays(1),
  myDate.plusDays(3)
).toDF("date")
df:
+----------+
|      date|
+----------+
|2021-07-02|
|2021-07-03|
|2021-07-05|
+----------+
Required output:
+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-05|2021-07-09|
|2021-07-03|2021-06-26|
|2021-07-06|2021-07-10|
|2021-07-05|2021-06-28|
|2021-07-08|2021-07-12|
+----------+----------+
How could I achieve that (without using collect)?
You can always convert your data frame to a lazily evaluated view and use Spark SQL:
import org.apache.spark.sql.functions.col

val df_2 = df.map(x => x.getDate(0).toLocalDate())
  .withColumnRenamed("value", "col_A")
  .withColumn("col_B", col("col_A"))
df_2.createOrReplaceTempView("test")
With that you can create a view like this one:
+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-07-02|
|2021-07-03|2021-07-03|
|2021-07-05|2021-07-05|
+----------+----------+
And then you can use SQL, which I find more intuitive:
spark.sql(s"""SELECT col_A, date_add(col_B, -7) as col_B FROM test
UNION
SELECT date_add(col_A, 3), date_add(col_B, 7) as col_B FROM test""")
.show()
This gives your expected output as a DataFrame:
+----------+----------+
|     col_A|     col_B|
+----------+----------+
|2021-07-02|2021-06-25|
|2021-07-03|2021-06-26|
|2021-07-05|2021-06-28|
|2021-07-05|2021-07-09|
|2021-07-06|2021-07-10|
|2021-07-08|2021-07-12|
+----------+----------+
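Alternatively, the same two-branch expansion can be written with the DataFrame API and union, without a temp view. A sketch that hard-codes the -7/+3/+7 offsets from genDataFrame (out is just an illustrative name):
import org.apache.spark.sql.functions.{col, date_add, date_sub}

val out = df
  .select(col("date").as("col_A"), date_sub(col("date"), 7).as("col_B"))                         // first row pattern
  .union(df.select(date_add(col("date"), 3).as("col_A"), date_add(col("date"), 7).as("col_B"))) // second row pattern
out.show()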
I have two pyspark dataframes as follows:
df1 = spark.createDataFrame(
    ["yes", "no", "yes23", "no3", "35yes", """41no["maybe"]"""],
    "string"
).toDF("location")

df2 = spark.createDataFrame(
    ["yes", "no"],
    "string"
).toDF("location")
I want to check whether the values in the location column of df1 start with the values in the location column of df2, and vice versa.
Something like :
df1.select("location").startsWith(df2.location)
Following is the output I am expecting:
+-------------+
|     location|
+-------------+
|          yes|
|           no|
|        yes23|
|          no3|
+-------------+
Using Spark SQL looks the easiest to me:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')
joined = spark.sql("""
select df1.*
from df1
join df2
on df1.location rlike '^' || df2.location
""")
I'm trying to find an exact string match in a dataframe column from the employee dataframe:
+--------+------------+
|Employee|days_present|
+--------+------------+
|    Alex|  1,2,11,23,|
|    John| 21,23,25,28|
+--------+------------+
I need to find which employees are present on day 2, based on the days_present column.
expected output:
Alex
Below is what I have tried:
df = spark.sql("select * from employee where days_present RLIKE '2'")
df.show()
This returns both Alex & John
Also i would like to find out who are present on 2 & 11, in this case expected ouput is only ALex
We can use the array_intersect function (available from Spark 2.4+) and then check whether the array size is >= 2.
Example:
df.show()
+--------+------------+
|Employee|days_present|
+--------+------------+
|    Alex|   1,2,11,23|
|    John| 21,23,25,28|
+--------+------------+
#DataFrame[Employee: string, days_present: string]
df.withColumn("tmp",split(col("days_present"),",")).\
withColumn("intersect",array_intersect(col("tmp"),array(lit("2"),lit("11")))).\
filter(size("intersect") >= 2).\
drop("tmp","intersect").\
show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#|    Alex|   1,2,11,23|
#+--------+------------+
In spark-sql:
df.createOrReplaceTempView("tmp")
spark.sql("""select Employee,days_present from (select *,size(array_intersect(split(days_present,","),array("2","11")))size from tmp)e where size >=2""").show()
#+--------+------------+
#|Employee|days_present|
#+--------+------------+
#|    Alex|   1,2,11,23|
#+--------+------------+
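If you prefer to stay with RLIKE, anchoring the day at the commas also gives an exact-token match, so '2' no longer matches '21', '23' or '25'. A sketch against the same tmp view:
spark.sql("""
  select Employee, days_present
  from tmp
  where days_present rlike '(^|,)2(,|$)'
    and days_present rlike '(^|,)11(,|$)'
""").show()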
I have a table like
+---+-----+
| id|value|
+---+-----+
|  1|118.0|
|  2|109.0|
|  3|113.0|
|  4| 82.0|
|  5| 60.0|
|  6|111.0|
|  7|107.0|
|  8| 84.0|
|  9| 91.0|
| 10|118.0|
+---+-----+
and would like to aggregate or bin the values into the ranges 0, 10, 20, 30, 40, ..., 80, 90, 100, 110, 120. How can I perform this in SQL, or more specifically in Spark SQL?
Currently I have a lateral view join with the range, but this seems rather clumsy and inefficient.
Quantile discretization is not really what I want; rather, I want a CUT with this fixed range.
Edit:
https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/Binning.scala would perform dynamic binning, but I need this specified range instead.
In the general case, static binning can be performed using org.apache.spark.ml.feature.Bucketizer:
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
  (1, 118.0), (2, 109.0), (3, 113.0), (4, 82.0), (5, 60.0),
  (6, 111.0), (7, 107.0), (8, 84.0), (9, 91.0), (10, 118.0)
).toDF("id", "value")

val splits = (0 to 12).map(_ * 10.0).toArray

val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(splits)

val bucketed = bucketizer.transform(df)
val solution = bucketed.groupBy($"bucket").agg(count($"id") as "count")
Result:
scala> solution.show
+------+-----+
|bucket|count|
+------+-----+
|   8.0|    2|
|  11.0|    4|
|  10.0|    2|
|   6.0|    1|
|   9.0|    1|
+------+-----+
The bucketizer throws errors when values lie outside the defined bins. It is possible to define split points as Double.NegativeInfinity or Double.PositiveInfinity to capture outliers.
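For example, adding infinite outer split points keeps out-of-range values from failing the transform. A sketch (openSplits and openBucketizer are illustrative names):
val openSplits = Array(Double.NegativeInfinity) ++ splits ++ Array(Double.PositiveInfinity)

val openBucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bucket")
  .setSplits(openSplits)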
Bucketizer is designed to work efficiently with arbitrary splits by performing binary search of the right bucket. In the case of regular bins like yours, one can simply do something like:
val binned = df.withColumn("bucket", (($"value" - bin_min) / bin_width) cast "int")
where bin_min is the left endpoint of the lowest bin and bin_width is the bin width.
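For this data the placeholders would be bin_min = 0.0 and bin_width = 10.0, so a worked sketch of the expression above is:
val bin_min = 0.0     // left endpoint of the lowest bin
val bin_width = 10.0  // width of each bin
val binned = df.withColumn("bucket", (($"value" - bin_min) / bin_width) cast "int")
binned.groupBy($"bucket").agg(count($"id") as "count").show() // same counts as the Bucketizer result above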
Try "GROUP BY" with this
SELECT id, (value DIV 10)*10 FROM table_name ;
The following uses the Dataset API in Scala:
df.select(('value divide 10).cast("int") * 10)
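To get per-bin counts rather than one binned value per row, the same expression can serve as the grouping key. A sketch:
df.groupBy((($"value" divide 10).cast("int") * 10).as("bucket"))
  .count()
  .show()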