Calculate number of days excluding Sundays in Hive

I have two timestamps as input. I want to calculate the time difference in hours between those timestamps excluding Sundays.
I can get the number of days using the datediff function in Hive.
I can get the day of the week for a particular date using from_unixtime(unix_timestamp(startdate), 'EEEE').
But I don't know how to combine those functions to achieve this, or whether there is an easier way.
Thanks in Advance.

You can write a custom UDF that takes two columns containing the dates as inputs and returns the number of days between the dates, excluding Sundays.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Date;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
public class IsoYearWeek extends UDF {
public LongWritable evaluate(Text dateString,Text dateString1) throws ParseException { //takes the two columns as inputs
SimpleDateFormat date = new SimpleDateFormat("dd/MM/yyyy");
/* String date1 = "20/07/2016";
String date2 = "28/07/2016";
*/ int count=0;
List<Date> dates = new ArrayList<Date>();
Date startDate = (Date)date.parse(dateString.toString());
Date endDate = (Date)date.parse(dateString1.toString());
long interval = 24 * 60 * 60 * 1000L; // 1 day in millis
long endTime =endDate.getTime() ; // create your endtime here, possibly using Calendar or Date
long curTime = startDate.getTime();
while (curTime <= endTime) {
dates.add(new Date(curTime));
curTime += interval;
}
for(int i=0;i<dates.size();i++){
Date lDate =(Date)dates.get(i);
if(lDate.getDay()==0){
count+=1; //counts the number of sundays in between
}
}
long days_diff = (endDate.getTime()-startDate.getTime())/(24 * 60 * 60 * 1000)-count; //displays the days difference excluding sundays
return new LongWritable(days_diff);
}
}
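Since the question asks for hours rather than days, here is a minimal sketch of the same idea using java.time. It is plain Java, not a Hive UDF; the class and method names are made up for illustration, and it assumes the timestamps are already parsed into LocalDateTime values with no time-zone or daylight-saving handling.

```java
import java.time.DayOfWeek;
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical helper, not part of Hive: sums the hours between two
// timestamps while skipping every portion that falls on a Sunday.
class HoursExcludingSundays {
    static long hoursExcludingSundays(LocalDateTime start, LocalDateTime end) {
        if (!end.isAfter(start)) {
            return 0;
        }
        Duration total = Duration.ZERO;
        LocalDate day = start.toLocalDate();
        while (!day.isAfter(end.toLocalDate())) {
            if (day.getDayOfWeek() != DayOfWeek.SUNDAY) {
                // Clip this calendar day to the [start, end] interval.
                LocalDateTime dayStart = day.atStartOfDay();
                LocalDateTime nextMidnight = day.plusDays(1).atStartOfDay();
                LocalDateTime from = dayStart.isBefore(start) ? start : dayStart;
                LocalDateTime to = nextMidnight.isAfter(end) ? end : nextMidnight;
                total = total.plus(Duration.between(from, to));
            }
            day = day.plusDays(1);
        }
        return total.toHours();
    }
}
```

For the two sample dates in the UDF's comment (20/07/2016 and 28/07/2016, both at midnight), the interval covers eight days, one of which (24 July) is a Sunday, so the sketch returns 168 hours. The same logic could be placed inside a UDF's evaluate method.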

Use Spark, which makes this easier to implement and maintain:
import org.joda.time.format.DateTimeFormat
def dayDiffWithExcludeWeekendAndHoliday(startDate:String,endDate:String,holidayExclusion:Seq[String]) ={
@transient val datePattern="yyyy-MM-dd"
@transient val dateformatter=DateTimeFormat.forPattern(datePattern)
var numWeekDaysValid=0
var numWeekends=0
var numWeekDaysInValid=0
val holidayExclusionJoda=holidayExclusion.map(dateformatter.parseDateTime(_))
val startDateJoda=dateformatter.parseDateTime(startDate)
var startDateJodaLatest=dateformatter.parseDateTime(startDate)
val endDateJoda=dateformatter.parseDateTime(endDate)
while (startDateJodaLatest.compareTo(endDateJoda) !=0)
{
startDateJodaLatest.getDayOfWeek match {
case value if value >5 => numWeekends=numWeekends+1
case value if value <= 5 => holidayExclusionJoda.contains(startDateJodaLatest) match {case value if value == true => numWeekDaysInValid=numWeekDaysInValid+1 case value if value == false => numWeekDaysValid=numWeekDaysValid+1 }
}
startDateJodaLatest = startDateJodaLatest.plusDays(1)
}
Array(numWeekDaysValid,numWeekends,numWeekDaysInValid)
}
spark.udf.register("dayDiffWithExcludeWeekendAndHoliday",dayDiffWithExcludeWeekendAndHoliday(_:String,_:String,_:Seq[String]))
case class tmpDateInfo(startDate:String,endDate:String,holidayExclusion:Array[String])
case class tmpDateInfoFull(startDate:String,endDate:String,holidayExclusion:Array[String],numWeekDaysValid:Int,numWeekends:Int,numWeekDaysInValid:Int)
def dayDiffWithExcludeWeekendAndHolidayCase(tmpInfo:tmpDateInfo) ={
@transient val datePattern="yyyy-MM-dd"
@transient val dateformatter=DateTimeFormat.forPattern(datePattern)
var numWeekDaysValid=0
var numWeekends=0
var numWeekDaysInValid=0
val holidayExclusionJoda=tmpInfo.holidayExclusion.map(dateformatter.parseDateTime(_))
val startDateJoda=dateformatter.parseDateTime(tmpInfo.startDate)
var startDateJodaLatest=dateformatter.parseDateTime(tmpInfo.startDate)
val endDateJoda=dateformatter.parseDateTime(tmpInfo.endDate)
while (startDateJodaLatest.compareTo(endDateJoda) !=0)
{
startDateJodaLatest.getDayOfWeek match {
case value if value >5 => numWeekends=numWeekends+1
case value if value <= 5 => holidayExclusionJoda.contains(startDateJodaLatest) match {case value if value == true => numWeekDaysInValid=numWeekDaysInValid+1 case value if value == false => numWeekDaysValid=numWeekDaysValid+1 }
}
startDateJodaLatest = startDateJodaLatest.plusDays(1)
}
tmpDateInfoFull(tmpInfo.startDate,tmpInfo.endDate,tmpInfo.holidayExclusion,numWeekDaysValid,numWeekends,numWeekDaysInValid)
}
//df way 1
val tmpDF=Seq(("2020-05-03","2020-06-08",List("2020-05-08","2020-06-05"))).toDF("startDate","endDate","holidayExclusion").select(col("startDate").cast(StringType),col("endDate").cast(StringType),col("holidayExclusion"))
tmpDF.as[tmpDateInfo].map(dayDiffWithExcludeWeekendAndHolidayCase).show(false)
//df way 2
tmpDF.selectExpr("*","dayDiffWithExcludeWeekendAndHoliday(cast(startDate as string),cast(endDate as string),cast(holidayExclusion as array<string>)) as resultDays").selectExpr("startDate","endDate","holidayExclusion","resultDays[0] as numWeekDaysValid","resultDays[1] as numWeekends","resultDays[2] as numWeekDaysInValid").show(false)
// spark sql way, works with hive table when configured in hive metastore
tmpDF.createOrReplaceTempView("tmpTable")
spark.sql("select startDate,endDate,holidayExclusion,dayDiffWithExcludeWeekendAndHoliday(startDate,endDate,holidayExclusion) from tmpTable").show(false)

Springboot - Weekdays count query

My Use-cases:
We have an installation schedule entity (check below code) and it has an installation date.
Once installation has completed, after 4 weekdays we will verify the installation status with customers.
Note: (4 weekdays - this count is configurable. So 'X' weekdays)
Weekdays means - Monday to Friday. We don't care about other holidays.
I have a scheduler, it will retrieve these orders after 'X' weekdays - I'm stuck here
I don't know how to make a query for after 'X' weekdays.
My code:
@Entity
@Table(schema = "myschema", name = "installation_dates")
@Getter
@Setter
@NoArgsConstructor
public class InstallDates extends TransEntity implements Serializable {
// other columns
@Column(name = "installation_schedule_datetime")
private LocalDateTime installationScheduleDatetime;//I use this column for calculation
@Formula("getWeekDaysCount(installationScheduleDatetime)")
private int weekDaysCount;
public int getWeekDaysCount(LocalDateTime installationScheduleDatetime) {
int totalWeekDays = 0;
LocalDateTime todayDate = LocalDateTime.now();
while (!installationScheduleDatetime.isAfter(todayDate)) {
switch (installationScheduleDatetime.getDayOfWeek()) {
case SATURDAY:
case SUNDAY:
break;
default:
totalWeekDays++;
break;
}
installationScheduleDatetime = installationScheduleDatetime.plusDays(1);
}
return totalWeekDays;
}
}
Question:
How to make a SQL or JPQL or JPA query for weekdays?
I know it's a very basic question. I am a mobile app developer who recently joined the Springboard team, and it's really hard for me :(
Feel free to give your valuable feedback!
I have the following suggestion, if I understood the problem correctly.
Java:
Take the current date
Find the date of interest: the current date minus 4 workdays (so if today is Friday, subtract 4 days; if today is Monday, subtract 2 days for the weekend and 4 more days for weekdays)
Then write a query that will select all installations that were done on the date of interest.
In pseudo code:
select * from installations where installation_date = <date of interest>;
Date of interest Java code:
public LocalDateTime getDateOfInterest(int workdays) {
LocalDateTime currentDate = LocalDateTime.now();
if (workdays < 1) {
return currentDate;
}
//it will subtract 'X' working days from current date
LocalDateTime result = currentDate;
int addedDays = 0;
while (addedDays < workdays) {
result = result.minusDays(1);
if (!(result.getDayOfWeek() == DayOfWeek.SATURDAY ||
result.getDayOfWeek() == DayOfWeek.SUNDAY)) {
++addedDays;
}
}
return result;
}
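Because the installation column is a LocalDateTime, an equality comparison against a single timestamp would rarely match. One way around this, sketched below under the assumption that weekdays are Monday to Friday as the question states, is to compute the date of interest as a LocalDate and query a half-open range covering that whole day. The class and method names are hypothetical.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical helper for the scheduler: finds the day that lies
// 'workdays' Monday-to-Friday days before 'today'.
class DateOfInterest {
    static LocalDate minusWorkdays(LocalDate today, int workdays) {
        LocalDate result = today;
        int counted = 0;
        while (counted < workdays) {
            result = result.minusDays(1);
            DayOfWeek dow = result.getDayOfWeek();
            // Only Monday to Friday count towards the total.
            if (dow != DayOfWeek.SATURDAY && dow != DayOfWeek.SUNDAY) {
                counted++;
            }
        }
        return result;
    }

    // Half-open range [start of day, start of next day) for a datetime column.
    static LocalDateTime[] dayRange(LocalDate day) {
        return new LocalDateTime[] { day.atStartOfDay(), day.plusDays(1).atStartOfDay() };
    }
}
```

The two range bounds could then be bound to query parameters (named, say, :start and :end, which are illustrative names) in a condition such as installation_schedule_datetime >= :start and installation_schedule_datetime < :end.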
First, let's answer your questions:
No, you cannot call a method from @Formula.
You probably could (see here), but that might depend on your database.
The fact that you use an entity and JPA doesn't mean everything has to be a JPA property.
You could:
Write a get method that calculates it on the fly
Write a getter which sets it lazily.
Use @PostLoad to always set it.
@Entity
@Table(schema = "myschema", name = "installation_dates")
@Getter
@Setter
@NoArgsConstructor
public class InstallDates extends TransEntity implements Serializable {
// other columns
@Column(name = "installation_schedule_datetime")
private LocalDateTime installationScheduleDatetime;//I use this column for calculation
public int getWeekDaysCount() {
int totalWeekDays = 0;
LocalDateTime isdt = this.installationScheduleDatetime;
LocalDateTime todayDate = LocalDateTime.now();
while (!isdt.isAfter(todayDate)) {
switch (isdt.getDayOfWeek()) {
case SATURDAY:
case SUNDAY:
break;
default:
totalWeekDays++;
break;
}
isdt = isdt.plusDays(1);
}
return totalWeekDays;
}
}
Or if you really want it to be a property, you could use the getter to set it lazily.
@Entity
@Table(schema = "myschema", name = "installation_dates")
@Getter
@Setter
@NoArgsConstructor
public class InstallDates extends TransEntity implements Serializable {
// other columns
@Column(name = "installation_schedule_datetime")
private LocalDateTime installationScheduleDatetime;//I use this column for calculation
@Transient // derived value, not a database column
private int weekDaysCount = -1;
public int getWeekDaysCount() {
if (weekDaysCount == -1) {
int totalWeekDays = 0;
LocalDateTime isdt = this.installationScheduleDatetime;
LocalDateTime todayDate = LocalDateTime.now();
while (!isdt.isAfter(todayDate)) {
switch (isdt.getDayOfWeek()) {
case SATURDAY:
case SUNDAY:
break;
default:
totalWeekDays++;
break;
}
isdt = isdt.plusDays(1);
}
weekDaysCount = totalWeekDays;
}
return weekDaysCount;
}
}
Or, if you always want to calculate that value, you can annotate a method with @PostLoad to initialize it (you could even reuse the lazy getter above for it), or move the initialization code into the @PostLoad-annotated method.
@PostLoad
private void initValues() {
getWeekDaysCount();
}
@Formula specifies an expression written in native SQL that is used to read the value of an attribute instead of storing the value in a Column. (https://docs.jboss.org/hibernate/orm/current/javadocs/org/hibernate/annotations/Formula.html)
As for your case, it doesn't look like there's much use in storing weekDaysCount in the DB if it's derived from installationScheduleDatetime. I'd just mark weekDaysCount as @Transient and be done with it (@Formula should be removed).
Another solution would be to leave weekDaysCount non-transient and put your calculations in a @PreUpdate/@PrePersist method. See https://www.baeldung.com/jpa-entity-lifecycle-events for more info on that.

How to use a Spark SQL UDAF to implement window counting with a condition?

I have a table with the columns timestamp, id, and condition, and I want to count the rows for each id per interval, such as every 10 seconds.
If condition is true, the count is incremented; otherwise the previous value is returned.
The UDAF code looks like this:
public class MyCount extends UserDefinedAggregateFunction {
@Override
public StructType inputSchema() {
return DataTypes.createStructType(
Arrays.asList(
DataTypes.createStructField("condition", DataTypes.BooleanType, true),
DataTypes.createStructField("timestamp", DataTypes.LongType, true),
DataTypes.createStructField("interval", DataTypes.IntegerType, true)
)
);
}
@Override
public StructType bufferSchema() {
return DataTypes.createStructType(
Arrays.asList(
DataTypes.createStructField("timestamp", DataTypes.LongType, true),
DataTypes.createStructField("count", DataTypes.LongType, true)
)
);
}
@Override
public DataType dataType() {
return DataTypes.LongType;
}
@Override
public boolean deterministic() {
return true;
}
@Override
public void initialize(MutableAggregationBuffer mutableAggregationBuffer) {
mutableAggregationBuffer.update(0, 0L);
mutableAggregationBuffer.update(1, 0L);
}
@Override
public void update(MutableAggregationBuffer mutableAggregationBuffer, Row row) {
long timestamp = mutableAggregationBuffer.getLong(0);
long count = mutableAggregationBuffer.getLong(1);
long event_time = row.getLong(1);
int interval = row.getInt(2);
if (event_time > timestamp + interval) {
timestamp = event_time - event_time % interval;
count = 0;
}
if (row.getBoolean(0)) {
count++;
}
mutableAggregationBuffer.update(0, timestamp);
mutableAggregationBuffer.update(1, count);
}
@Override
public void merge(MutableAggregationBuffer mutableAggregationBuffer, Row row) {
}
@Override
public Object evaluate(Row row) {
return row.getLong(1);
}
}
Then I submit a SQL query like:
select timestamp, id, MyCount(true, timestamp, 10) over(PARTITION BY id ORDER BY timestamp) as count from xxx.xxx
The result is:
timestamp id count
1642760594 0 1
1642760596 0 2
1642760599 0 3
1642760610 0 2 --duplicate
1642760610 0 2
1642760613 0 3
1642760594 1 1
1642760597 1 2
1642760600 1 1
1642760603 1 2
1642760606 1 4 --duplicate
1642760606 1 4
1642760608 1 5
When the timestamp is repeated, I get 1,2,4,4,5 instead of 1,2,3,4,5
How to fix it?
Another question: when is the merge method of a UDAF executed? I left its implementation empty, but the query runs normally. I added logging to the method but never saw the log output. Is the method really necessary?
There is a similar question: Apache Spark SQL UDAF over window showing odd behaviour with duplicate input
However, row_number() does not have this problem. row_number() is a Hive UDAF, so I tried to create a Hive UDAF as well, but I had the same problem. Why does the terminate() method of the Hive UDAF row_number() return an ArrayList? I created my own UDAF, row_number2(), by copying its code, and it returned a list.
Finally, I solved it with Spark's AggregateWindowFunction:
case class Count(condition: Expression) extends AggregateWindowFunction with Logging {
override def prettyName: String = "myCount"
override def dataType: DataType = LongType
override def children: Seq[Expression] = Seq(condition)
private val zero = Literal(0L)
private val one = Literal(1L)
private val count = AttributeReference("count", LongType, nullable = false)()
private val increaseCount = If(condition, Add(count, one), count)
override val initialValues: Seq[Expression] = zero :: Nil
override val updateExpressions: Seq[Expression] = increaseCount :: Nil
override val evaluateExpression: Expression = count
override val aggBufferAttributes: Seq[AttributeReference] = count :: Nil
}
Then use spark_session.functionRegistry.registerFunction to register it.
"select myCount(true) over(partition by window(timestamp, '10 seconds'), id order by timestamp) as count from xxx"
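A likely explanation for the duplicated counts in the question: when a window has an ORDER BY but no explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which includes every row that shares the current row's timestamp, so tied rows see the same aggregate. The intended row-by-row behaviour can be sketched in plain Java as follows. This is only a simulation of the expected semantics, not Spark code; the class name and the input layout are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a running conditional count per (id, time bucket).
class RunningConditionalCount {
    // Each event is {timestampSeconds, id, condition (1 = true, 0 = false)}.
    // Events are assumed to be sorted by id and then by timestamp.
    static List<Long> runningCounts(long[][] events, long intervalSeconds) {
        Map<String, Long> counts = new HashMap<>();
        List<Long> out = new ArrayList<>();
        for (long[] e : events) {
            // Tumbling bucket aligned to the epoch, e.g. [1642760600, 1642760610).
            long bucket = Math.floorDiv(e[0], intervalSeconds);
            String key = e[1] + ":" + bucket;
            long c = counts.getOrDefault(key, 0L);
            if (e[2] != 0) {
                c++; // only rows whose condition is true advance the count
            }
            counts.put(key, c);
            out.add(c); // every row, including ties, gets its own value
        }
        return out;
    }
}
```

Fed the seven rows for id 1 from the question with a 10-second interval and the condition always true, this yields 1, 2, 1, 2, 3, 4, 5: the count restarts at the 1642760600 bucket and the two rows at 1642760606 receive distinct values.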

Get count for rolling date value using Apache Pig

How can we achieve this using Apache Pig?
File:
A 2014/10/01
A 2014/09/01
A 2014/08/01
A 2014/02/01
The result should be a count of 3 for A, since I want to count the number of records within a rolling window of 30 days between records, grouped by A.
Please find the solution below; you can enhance it further if required. Try it with your input and let me know how it works.
input.txt
A 2014/12/01
A 2014/11/01
A 2014/10/01
A 2014/07/01
A 2014/05/01
A 2014/04/01
B 2014/09/01
B 2014/07/01
B 2014/06/01
B 2014/02/01
C 2014/09/01
C 2014/07/01
C 2014/05/01
Expected output
A 5
B 2
C 0
PigScript:
REGISTER rollingCount.jar;
A = LOAD 'input.txt' Using PigStorage(' ') AS (f1:chararray,f2:chararray);
B = GROUP A BY f1;
C = FOREACH B GENERATE group, mypackage.ROLLINGCOUNT(BagToString($1)) AS rollingCnt;
DUMP C;
Output from the script:
(A,5)
(B,2)
(C,0)
Java Code:
1. Compile the Java code below and package it into a jar file named rollingCount.jar.
2. I wrote this code quickly; you can optimize it if required.
ROLLINGCOUNT.java
package mypackage;
import java.io.*;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import java.text.SimpleDateFormat;
import java.util.concurrent.TimeUnit;
import java.util.*;
public class ROLLINGCOUNT extends EvalFunc<Integer> {
public Integer exec(Tuple input) throws IOException {
//Get the input String from request
String inputString = (String)input.get(0);
Date[] arrayOfDates = getArrayOfDate(inputString);
long diffDays[] = getDaysBetweenList(arrayOfDates);
int rollingCount = getRollingCount(diffDays);
return rollingCount;
}
//Function to convert strings to array of dates
static protected Date[] getArrayOfDate(String inputString)
{
//Get the 1st column, this will be the Id
String ID = inputString.split("_")[0];
//Remove the repeated Ids, because the Id column is duplicated for every record
String modifiedString = inputString.replace(ID+"_","");
//Split the string into multiple columns using '_' as delimiter
String list[] = modifiedString.split("_");
//Convert the string to list of array dates
Date[] dateList = new Date[list.length];
int index=0;
for (String dateString: list)
{
try
{
//Convert the date string to a date object in the given format
SimpleDateFormat dFormat = new SimpleDateFormat("yyyy/MM/dd");
dateList[index++] = dFormat.parse(dateString);
}
catch(Exception e)
{
// error handling goes here
}
}
return dateList;
}
//Function to get difference between two dates
static protected long[] getDaysBetweenList(Date[] arrayOfDate)
{
long diffDays[] = new long[arrayOfDate.length-1];
int cnt=0;
for (int index=0; index<arrayOfDate.length-1;index++)
{
long diff = Math.abs(arrayOfDate[index+1].getTime() - arrayOfDate[index].getTime());
long days = TimeUnit.DAYS.convert(diff, TimeUnit.MILLISECONDS);
diffDays[cnt++] = days;
}
return diffDays;
}
//Function to get the total rolling count
static protected int getRollingCount(long diffDays[])
{
int result =0;
for(int index=0;index<diffDays.length;index++)
{
int cnt =0;
//hardcoded the values of 30 and 31 days, may need to handle Feb month 28 or 29 days
while((index<diffDays.length)&&((diffDays[index]==30)||(diffDays[index]==31)))
{
cnt++;
index++;
}
if(cnt>0)
{
result = result + cnt+1;
}
}
return result;
}
}
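The millisecond arithmetic above can be off by a day when a daylight-saving change falls between two dates. A possible variation, sketched here with java.time and not tied to the Pig EvalFunc API, counts records whose neighbouring record lies within a configurable gap. Note two assumptions that differ from the code above: the class and method names are hypothetical, and a gap counts when it is at most maxGapDays, rather than exactly 30 or 31 days.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-alone variant of the rolling count.
class RollingCountSketch {
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy/MM/dd");

    static int rollingCount(List<String> dateStrings, long maxGapDays) {
        List<LocalDate> dates = new ArrayList<>();
        for (String s : dateStrings) {
            dates.add(LocalDate.parse(s, FMT));
        }
        Collections.sort(dates);
        // Mark both records of every adjacent pair that is close enough.
        boolean[] inRun = new boolean[dates.size()];
        for (int i = 0; i + 1 < dates.size(); i++) {
            long gap = ChronoUnit.DAYS.between(dates.get(i), dates.get(i + 1));
            if (gap <= maxGapDays) {
                inRun[i] = true;
                inRun[i + 1] = true;
            }
        }
        int count = 0;
        for (boolean marked : inRun) {
            if (marked) {
                count++;
            }
        }
        return count;
    }
}
```

With maxGapDays set to 31, the sample input above gives 5 for A, 2 for B and 0 for C, matching the expected output, and the four dates from the question give 3.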

How to select all the days of the week from a given day?

Given a date, I want to get all the other days of that same week, where the week starts on Saturday and ends on Friday.
Model
public class TimeModel
{
    public int ID { get; set; }
    public DateTime Day { get; set; }
}
What I'm currently doing
public class Controller
{
    private ModelContext db = new ModelContext();
    public List<TimeModel> AddDates(DateTime date)
    {
        List<TimeModel> list = new List<TimeModel>();
        // Keep subtracting a day until we reach Saturday
        while (date.DayOfWeek != DayOfWeek.Saturday)
            date = date.AddDays(-1);
        while (date.DayOfWeek != DayOfWeek.Friday)
        {
            // For each date before Friday, find the corresponding model
            // (the one with the same date) and add it to the list
            list.Add(Find(date));
            date = date.AddDays(1);
        }
        list.Add(Find(date)); // add the Friday date to the list
        return list;
    }
}
Note: Not exactly my code, just a simplification of my problem.
To summarize my solution:
a) Subtract given date until Saturday
b) Find model which corresponds to Date
c) Repeat until I reach Friday
d) Add to list once more to include Friday
Is it possible to create a LINQ/SQL statement to simply select the needed models (with regard to Date)?
Here is a sample implementation that gets the current week.
List<TimeModel> list = new List<TimeModel>();
int n = 0;
for (int i = 0; i < 200; i++)
list.Add(new TimeModel{ID = i, Day = DateTime.Now.AddDays(-i)});
var currentDay = new TimeModel() {ID = 0, Day = DateTime.Now};
var previousSaturday = currentDay.Day.Date.AddDays(-(((int)currentDay.Day.DayOfWeek + 1) % 7)); // 0 days back when today is already Saturday
var nextFriday = previousSaturday.AddDays(6);
var currentWeek = list.Where(p => p.Day.Date >= previousSaturday.Date && p.Day.Date <= nextFriday.Date).OrderBy(p => p.Day).ToList(); // compare full dates so weeks spanning New Year also work
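The boundary calculation is the part that is easy to get wrong, so here is the same idea as a small Java sketch (Java rather than C#, with hypothetical class and method names). It finds the Saturday that starts the week containing a given date, and the Friday that ends it.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

// Hypothetical helper: bounds of a week that runs Saturday to Friday.
class SaturdayWeek {
    // Returns the given day itself when it is already a Saturday.
    static LocalDate weekStart(LocalDate day) {
        return day.with(TemporalAdjusters.previousOrSame(DayOfWeek.SATURDAY));
    }

    // The Friday six days after the starting Saturday.
    static LocalDate weekEnd(LocalDate day) {
        return weekStart(day).plusDays(6);
    }
}
```

Once both bounds are known, selecting the models is a single range filter on the date (day >= start and day <= end), which translates directly into a query.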

How do I create a calculated measure that will filter data by days overdue?

I have a field in my fact table called days overdue. I would like to create a set that does the following: if days overdue is between 0 and 29, then '0 - 29 days overdue'; if it is between 30 and 59, then '30 - 59 days overdue'. How would I create this?
We need to know what kind of collection you're using: an array, a linked list, or my favorite for these things, a vector.
If you were using a vector, you would create your own class to be used as a data type, with things like:
import java.util.Vector;

class MyData
{
    String name;
    int daysPastDue; // how you want to compute this is up to you;
                     // I suggest looking into java.util.Date or java.util.Calendar
    public MyData()
    {
        name = "";
        daysPastDue = 0;
    }
}
class DoWork
{
    public void myWork(Vector<MyData> input)
    {
        // Each array is sized for the worst case where every item lands in it
        MyData[] from0To29 = new MyData[input.size()];
        MyData[] from30To59 = new MyData[input.size()];
        int from0To29Count = 0;
        int from30To59Count = 0;
        for (int i = 0; i < input.size(); i++)
        {
            MyData item = input.elementAt(i);
            if (item.daysPastDue <= 29)
            {
                from0To29[from0To29Count] = item;
                from0To29Count++;
            }
            else if (item.daysPastDue >= 30 && item.daysPastDue <= 59)
            {
                from30To59[from30To59Count] = item;
                from30To59Count++;
            }
        }
    }
}
Then you have your two arrays and can output them as you wish. However, I would recommend starting at daysPastDue = 100000, decrementing it, and checking each value against the vector until every item has been listed. That way the items are ordered from most to least past due, and the output shows exactly how long each has been past due.
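If only the bucket label is needed for each record, the grouping can also be reduced to a small function. The following is a minimal sketch with hypothetical names; extending the pattern beyond 59 days and the "not overdue" label for negative values are assumptions, since the question only specifies the first two ranges.

```java
// Hypothetical helper: maps a days-overdue value to its 30-day bucket label.
class OverdueBuckets {
    static String bucketLabel(int daysOverdue) {
        if (daysOverdue < 0) {
            return "not overdue"; // assumed label, not given in the question
        }
        // Integer division finds the lower bound of the 30-day bucket.
        int lower = (daysOverdue / 30) * 30;
        return lower + " - " + (lower + 29) + " days overdue";
    }
}
```

Grouping records by this label then gives one set per range, without maintaining a separate array for each bucket.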