Spark SQL date conversion and performing calculations - apache-spark-sql

I have created a data frame from a CSV file (as per the screenshot below) and registered it as a table (createOrReplaceTempView). Later, I am trying to perform some calculations on the DOB column to find the employees whose age is more than 30. I am trying to run the below query, but I am getting an error. Please correct or advise me. Thanks!
spark.sql(date_format(to_date(DOB, 'yyyy-dd-mm'), 'dd-MMM-yy') as new_dob).show()

Your DOB column has multiple date formats.
Try:
spark.sql("select coalesce(to_date(DOB, 'MM/dd/yyyy'), to_date(DOB, 'MM-dd-yyyy')) as new_dob from <view_name>").show()

As the previous answer said, the DOB field is in a mixed format, so you need to use coalesce().
Here is a solution in Scala.
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the $"col" syntax (already in scope in the spark-shell)
val df = Seq(
("Adam", "8/14/1933"), ("Bob", "02-02-1948"), ("Cathy", "05-11-1941"), ("Donald", "5/26/1953")
, ("Edward", "07/26/1981"), ("Flora", "04-15-1992"), ("Gio", "3/8/2001")
).toDF("name", "DOB")
df.show
val currentDate = java.sql.Date.valueOf("2021-01-22") // Use currentDate as basis to calculate the age over 30
val df1 = df.withColumn("new_DOB", coalesce(to_date($"DOB", "MM/dd/yyyy"), to_date($"DOB", "MM-dd-yyyy"))) // convert the multi-format DOB field to a date-typed column
  .withColumn("months_in_between", months_between(lit(currentDate), $"new_DOB"))     // not necessary - just shows the calculated number of months in between
  .withColumn("years_in_between", months_between(lit(currentDate), $"new_DOB") / 12) // not necessary - just shows the calculated number of years (months divided by 12)
  .filter(months_between(lit(currentDate), $"new_DOB") / lit(12) > 30.0)             // only keep the records with age over 30
df1.show
Here is the output:
+------+----------+----------+-----------------+-----------------+
| name| DOB| new_DOB|months_in_between| years_in_between|
+------+----------+----------+-----------------+-----------------+
| Adam| 8/14/1933|1933-08-14| 1049.25806452|87.43817204333334|
| Bob|02-02-1948|1948-02-02| 875.64516129|72.97043010750001|
| Cathy|05-11-1941|1941-05-11| 956.35483871|79.69623655916666|
|Donald| 5/26/1953|1953-05-26| 811.87096774|67.65591397833333|
|Edward|07/26/1981|1981-07-26| 473.87096774|39.48924731166667|
+------+----------+----------+-----------------+-----------------+
Here is a solution using Spark SQL Temp View:
import org.apache.spark.sql.functions._
val df = Seq(
("Adam", "8/14/1933"), ("Bob", "02-02-1948"), ("Cathy", "05-11-1941"), ("Donald", "5/26/1953")
, ("Edward", "07/26/1981"), ("Flora", "04-15-1992"), ("Gio", "3/8/2001")
).toDF("name", "DOB")
df.show
df.createOrReplaceTempView("temp_view_0")
val currentDate = "2021-01-22" // Use currentDate as basis to calculate the age over 30
val df1 = spark.sql(s"""
select *
from (select name, DOB, coalesce(to_date(DOB, 'MM/dd/yyyy'), to_date(DOB, 'MM-dd-yyyy')) as new_DOB from temp_view_0) s
where months_between('$currentDate', new_DOB) / 12 > 30
""")
df1.show
Here is the output:
+------+----------+----------+
| name| DOB| new_DOB|
+------+----------+----------+
| Adam| 8/14/1933|1933-08-14|
| Bob|02-02-1948|1948-02-02|
| Cathy|05-11-1941|1941-05-11|
|Donald| 5/26/1953|1953-05-26|
|Edward|07/26/1981|1981-07-26|
+------+----------+----------+

Related

Kusto query to get percentage value of events over time

I have a Kusto / KQL query in Azure Log Analytics that aggregates a count of events over time, e.g.:
customEvents
| where name == "EventICareAbout"
| extend channel = customDimensions["ChannelName"]
| summarize events=count() by bin(timestamp, 1m), tostring(channel)
This gives a good result set of a count of the events in each minute bucket.
But the count on its own is fairly meaningless; what I want to know is whether that count is different from the average over, say, the last hour.
But I'm not even sure how to start constructing something like that.
Any pointers?
There are a couple of ways to achieve this. First, calculate the hourly average as an additional column, then calculate the difference from that hourly average:
let minuteValues = customEvents
| where name == "EventICareAbout"
| extend channel = customDimensions["ChannelName"]
| summarize events=count() by bin(timestamp, 1m), tostring(channel)
| extend Day = startofday(timestamp), hour = hourofday(timestamp);
let hourlyAverage = customEvents
| where name == "EventICareAbout"
| extend channel = customDimensions["ChannelName"]
| summarize events=count() by bin(timestamp, 1m), tostring(channel)
| summarize hourlyAvgEvents = avg(events) by bin(timestamp,1h), tostring(channel)
| extend Day = startofday(timestamp), hour = hourofday(timestamp);
minuteValues
| lookup hourlyAverage on hour, Day
| extend Diff = events - hourlyAvgEvents
Another option is to use the built-in anomaly detection functions (for example, series_decompose_anomalies() over a make-series aggregation).

Select field only if it exists (SQL or Scala)

The input dataframe may not always have all the columns. In SQL or Scala, I want to create a select statement where, even if the dataframe does not have a column, it won't error out and it will only output the columns that do exist.
For example, this statement will work.
Select store, prod, distance from table
+-----+------+--------+
|store|prod |distance|
+-----+------+--------+
|51 |42 |2 |
|51 |42 |5 |
|89 |44 |9 |
If the dataframe looks like the one below, I want the same statement to work, to just ignore what's not there and output only the existing columns (in this case 'store' and 'prod'):
+-----+------+
|store|prod |
+-----+------+
|51 |42 |
|51 |42 |
|89 |44 |
You can keep a list of all the columns you want, either hard-coded or prepared from other metadata, and use intersect:
import org.apache.spark.sql.functions.col

val columnNames = Seq("c1","c2","c3","c4")
df.select(df.columns.intersect(columnNames).map(col): _*).show()
You can make use of the columns method on DataFrame. It would look like this:
val result = if(df.columns.contains("distance")) df.select("store", "prod", "distance")
else df.select("store", "prod")
Edit:
If you have many such columns, you can keep them in an array, for example cols, and filter it:
val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols:_*)
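For example, with the columns from the question (a quick sketch; the cols list is hypothetical and "distance" may or may not exist in df):
import org.apache.spark.sql.functions.col

val cols = Seq("store", "prod", "distance")                   // desired columns
val result = df.select(cols.filter(df.columns.contains(_)).map(col): _*)
result.show()                                                 // shows only the columns that actually exist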
Assuming you use the expanded SQL template, like select a,b,c from tab, you could do something like the below to get the required results.
Get the SQL string and convert it to lowercase.
Split the SQL on space or comma to get the individual words in an array.
Remove "select" and "from" from the above array, as they are SQL keywords.
Now your last index is the table name.
The first index up to the last-but-one contains the list of selected columns.
To get the required columns, just filter the list against df2.columns. The columns that are in the SQL but not in the table will be filtered out.
Now construct the SQL using the individual pieces.
Run it using spark.sql(reqd_sel_string) to get the results.
Check this out
scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]
scala> df2.createOrReplaceTempView("tab2")
scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2
scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)
scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2
scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)
scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)
scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2
scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51 |42 |
|51 |42 |
|89 |44 |
+-----+----+
scala>
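If you need this often, the steps above can be wrapped in a small helper. This is only a sketch, under the same assumption of a simple "select a, b, c from tab" template; the function name is illustrative:
import org.apache.spark.sql.{DataFrame, SparkSession}

def selectExistingColumns(spark: SparkSession, query: String, df: DataFrame): DataFrame = {
  val tokens  = query.toLowerCase.split("[ ,]+").filter(t => t != "select" && t != "from") // words minus the SQL keywords
  val tabName = tokens.last                                                                // the last token is the table name
  val reqCols = tokens.dropRight(1).filter(df.columns.contains(_))                         // keep only columns the dataframe really has
  spark.sql("select " + reqCols.mkString(",") + " from " + tabName)
}

selectExistingColumns(spark, "Select store, prod, distance from tab2", df2).show(false)
Since the query is lowercased, this assumes the dataframe's column names are lowercase as well, as in the example above.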

how to generate unique weekid using weekofyear in hive

I have a table in which I'm just iterating dates over 50 years.
I'm using the values of weekofyear("date") -> week_no_in_this_year.
I would like to create a column using week_no_in_this_year; it should be unique for a week. Name it week_id,
which should be a concatenation of Year + two_digit_week_no_in_this_year + Some_number (to make this id unique for one week). I tried like below:
concat(concat(YEAR,IF(week_no_in_this_year<10,
concat(0,week_no_in_this_year),week_no_in_this_year)),'2') AS week_id.
But I'm facing issues for a few dates in the scenarios below:
SELECT weekofyear("2019-01-01") ;
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2019012
SELECT weekofyear("2019-12-31");
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2020012
One way to do it is with a UDF. Create a Python script and push it to HDFS:
mypy.py
import sys
import datetime

# read "y-m-d" dates from stdin and emit the ISO year followed by the ISO week number
for line in sys.stdin:
    line = line.strip()
    (y, m, d) = line.split("-")
    iso = datetime.date(int(y), int(m), int(d)).isocalendar()  # (ISO year, ISO week, ISO weekday)
    print(str(iso[0]) + str(iso[1]))
In Hive
add file hdfs:/user/cloudera/mypy.py;
select transform("2019-1-1") using "python mypy.py" as (week_id);
INFO : OK
+----------+--+
| week_id |
+----------+--+
| 20191 |
+----------+--+
select transform("2019-12-30") using "python mypy.py" as (week_id)
+----------+--+
| week_id |
+----------+--+
| 20201 |
+----------+--+
1 row selected (33.413 seconds)
This scenario only happens when there is a split between years at the end of a given year (that is, Dec 31) and the week number rolls over to the next year. If we put in a condition for this case, then we get what you expect.
The RIGHT function is the same as SUBSTR(str, -n).
SELECT DTE as Date,
CONCAT(IF(MONTH(DTE)=12 and WEEKOFYEAR(DTE)=1, year(DTE)+1, year(DTE)),
SUBSTR(CONCAT('0', WEEKOFYEAR(DTE)), -2), '2') as weekid
FROM tbl;
Result:
Date        WeekId
2019-01-01  2019012
2019-11-01  2019442
2019-12-31  2020012
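If you are on Spark rather than Hive, the same IF/WEEKOFYEAR logic can be expressed with Spark SQL functions. This is only a sketch, not part of the answer above; it assumes a DataFrame df with a date column named DTE:
import org.apache.spark.sql.functions._

val withWeekId = df.withColumn("week_id",
  concat(
    when(month(col("DTE")) === 12 && weekofyear(col("DTE")) === 1, year(col("DTE")) + 1)
      .otherwise(year(col("DTE"))).cast("string"),         // roll the year forward for the late-December ISO-week-1 case
    lpad(weekofyear(col("DTE")).cast("string"), 2, "0"),   // two-digit week number, left-padded with 0
    lit("2")))                                             // the trailing literal from the question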

SQL dynamic column name

How do I declare a column name that changes?
I take some data from the DB and I am interested in the last 12 months, so I only take events that happened, let's say, in '2016-07', '2016-06' and so on...
Then, I want my table to look like this:
event type | 2016-07 | 2016-06
-------------------------------
A | 12 | 13
B | 21 | 44
C | 98 | 12
How can I achieve the effect where the columns are named using the previous YYYY-MM pattern, keeping in mind that the report with that query can be executed at any time, so the names would change?
Simplified query for the previous month only:
select distinct
count(event),
date_year_month,
event_name
from
data_base
where date_year_month = TO_CHAR(add_months(current_date, -1),'YYYY-MM')
group by event_name, date_year_month
I don't think there is an automated way of pivoting the year-month columns and changing the number of columns in the result dynamically based on the data.
However, if you are looking for a pivoting solution, you can accomplish it using table functions in Netezza.
select event_name, year_month, event_count
from event_counts_groupby_year_month, table(inza.inza.nzlua('
local rows={}
function processRow(y2016m06, y2016m07)
rows[1] = { 201606, y2016m06 }
rows[2] = { 201607, y2016m07 }
return rows
end
function getShape()
columns={}
columns[1] = { "year_month", integer }
columns[2] = { "event_count", double }
return columns
end',
y2016m06, y2016m07));
You could probably build a wrapper on this to dynamically generate the query, based on the year-months present in the table, using a shell script.

Creating custom event schedules. Should I use "LIKE"?

I'm creating a campaign event scheduler that allows for frequencies such as "Every Monday", "May 6th through 10th", "Every day except Sunday", etc.
I've come up with a solution that I believe will work fine (not yet implemented), however, it uses "LIKE" in the queries, which I've never been too fond of. If anyone else has a suggestion that can achieve the same result with a cleaner method, please suggest it!
+----------------------+
| Campaign Table |
+----------------------+
| id:int |
| event_id:foreign_key |
| start_at:datetime |
| end_at:datetime |
+----------------------+
+-----------------------------+
| Event Table |
+-----------------------------+
| id:int |
| valid_days_of_week:string | < * = ALL. 345 = Tue, Wed, Thur. etc.
| valid_weeks_of_month:string | < * = ALL. 25 = 2nd and 5th weeks of a month.
| valid_day_numbers:string | < * = ALL. L = last. 2,7,17,29 = 2nd day, 7th, 17th, 29th,. etc.
+-----------------------------+
A sample event schedule would look like this:
valid_days_of_week = '1357' (Sun, Tue, Thu, Sat)
valid_weeks_of_month = '*' (All weeks)
valid_day_numbers = ',1,2,5,6,8,9,25,30,'
Using today's date (6/25/15) as an example, we have the following information to query with:
Day of week: 5 (Thursday)
Week of month: 4 (4th week in June)
Day number: 25
Therefore, to fetch all of the events for today, the query would look something like this:
SELECT c.*
FROM campaigns AS c
LEFT JOIN events AS e
ON c.event_id = e.id
WHERE
( e.valid_days_of_week = '*' OR e.valid_days_of_week LIKE '%5%' )
AND ( e.valid_weeks_of_month = '*' OR e.valid_weeks_of_month LIKE '%4%' )
AND ( e.valid_day_numbers = '*' OR e.valid_day_numbers LIKE '%,25,%' )
That (untested) query would ideally return the example event above. The "LIKE" queries are what have me worried. I want these queries to be fast.
By the way, I'm using PostgreSQL
Looking forward to excellent replies!
Use arrays:
CREATE TABLE events (id INT NOT NULL, dow INT[], wom INT[], dn INT[]);
CREATE INDEX ix_events_dow ON events USING GIST(dow);
CREATE INDEX ix_events_wom ON events USING GIST(wom);
CREATE INDEX ix_events_dn ON events USING GIST(dn);
INSERT
INTO events
VALUES (1, '{1,3,5,7}', '{0}', '{1,2,5,6,8,9,25,30}'); -- 0 means any
Then query:
SELECT *
FROM events
WHERE dow && '{0, 5}'::INT[]
AND wom && '{0, 4}'::INT[]
AND dn && '{0, 25}'::INT[]
This will allow using the indexes to filter the data.