How to use Spark SQL UDAF to implement window counting with condition? - apache-spark-sql

I have a table with columns timestamp, id, and condition, and I want to count the number of rows for each id per interval, such as 10 seconds.
If condition is true, the count is incremented; otherwise the previous value is kept.
The UDAF code looks like:
public class MyCount extends UserDefinedAggregateFunction {
    @Override
    public StructType inputSchema() {
        return DataTypes.createStructType(
                Arrays.asList(
                        DataTypes.createStructField("condition", DataTypes.BooleanType, true),
                        DataTypes.createStructField("timestamp", DataTypes.LongType, true),
                        DataTypes.createStructField("interval", DataTypes.IntegerType, true)
                )
        );
    }

    @Override
    public StructType bufferSchema() {
        return DataTypes.createStructType(
                Arrays.asList(
                        DataTypes.createStructField("timestamp", DataTypes.LongType, true),
                        DataTypes.createStructField("count", DataTypes.LongType, true)
                )
        );
    }

    @Override
    public DataType dataType() {
        return DataTypes.LongType;
    }

    @Override
    public boolean deterministic() {
        return true;
    }

    @Override
    public void initialize(MutableAggregationBuffer mutableAggregationBuffer) {
        mutableAggregationBuffer.update(0, 0L);
        mutableAggregationBuffer.update(1, 0L);
    }

    @Override
    public void update(MutableAggregationBuffer mutableAggregationBuffer, Row row) {
        long timestamp = mutableAggregationBuffer.getLong(0);
        long count = mutableAggregationBuffer.getLong(1);
        long event_time = row.getLong(1);
        int interval = row.getInt(2);
        // start a new bucket when the event falls outside the current interval
        if (event_time > timestamp + interval) {
            timestamp = event_time - event_time % interval;
            count = 0;
        }
        if (row.getBoolean(0)) {
            count++;
        }
        mutableAggregationBuffer.update(0, timestamp);
        mutableAggregationBuffer.update(1, count);
    }

    @Override
    public void merge(MutableAggregationBuffer mutableAggregationBuffer, Row row) {
        // combines two partial aggregation buffers; intentionally left empty
        // (see the question about merge below)
    }

    @Override
    public Object evaluate(Row row) {
        return row.getLong(1);
    }
}
Then I submit a SQL query like:
select timestamp, id, MyCount(true, timestamp, 10) over(PARTITION BY id ORDER BY timestamp) as count from xxx.xxx
The result is:
timestamp   id  count
1642760594  0   1
1642760596  0   2
1642760599  0   3
1642760610  0   2  --duplicate
1642760610  0   2
1642760613  0   3
1642760594  1   1
1642760597  1   2
1642760600  1   1
1642760603  1   2
1642760606  1   4  --duplicate
1642760606  1   4
1642760608  1   5
When the timestamp is repeated, I get 1,2,4,4,5 instead of 1,2,3,4,5
How to fix it?
Another question: when is the merge method of a UDAF executed? I implemented it as empty, yet everything runs normally. I tried adding a log statement inside the method but never saw the log. Is merge really necessary?
There is a similar question: Apache Spark SQL UDAF over window showing odd behaviour with duplicate input
However, row_number() does not have this problem. row_number() is a Hive UDAF, so I tried to create a Hive UDAF of my own, but I ran into the same problem... Also, why does the Hive UDAF row_number()'s terminate() return an ArrayList? I created my UDAF row_number2() by copying its code, and I got a list back as well.

Finally I solved it with Spark's AggregateWindowFunction:
case class Count(condition: Expression) extends AggregateWindowFunction with Logging {
  override def prettyName: String = "myCount"
  override def dataType: DataType = LongType
  override def children: Seq[Expression] = Seq(condition)

  private val zero = Literal(0L)
  private val one = Literal(1L)
  private val count = AttributeReference("count", LongType, nullable = false)()
  private val increaseCount = If(condition, Add(count, one), count)

  override val initialValues: Seq[Expression] = zero :: Nil
  override val updateExpressions: Seq[Expression] = increaseCount :: Nil
  override val evaluateExpression: Expression = count
  override val aggBufferAttributes: Seq[AttributeReference] = count :: Nil
}
Then use spark_session.functionRegistry.registerFunction to register it.
"select myCount(true) over(partition by window(timestamp, '10 seconds'), id order by timestamp) as count from xxx"

Related

Convert String to Date with null safety

I'm using @TypeConverter in Room to convert a string to Date (datetime). Here is the code:
public class DateTimeConverter {
    @TypeConverter
    public static Date stringToDate(String value) {
        DateFormat df = new SimpleDateFormat(Constants.SQLITE_DATE_TIMEFORMAT, Locale.US);
        if (value != null) {
            try {
                return df.parse(value);
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
        return null;
    }

    @TypeConverter
    public static String dateToString(Date value) {
        DateFormat df = new SimpleDateFormat(Constants.SQLITE_DATE_TIMEFORMAT, Locale.US);
        if (value != null) {
            return df.format(value);
        } else {
            return null;
        }
    }
}
@Entity
@TypeConverters(DateTimeConverter::class)
data class Entity(
    var writeDate: Date = Date() // java.util.Date
)
My current issue is that stringToDate receives value = null, which results in Entity.writeDate being null, which causes a run-time exception.
Question:
How do I convert a string to Date with null safety? The value of writeDate in the table is never null, but stringToDate still receives value = null.
Note:
Using SDK > 23, so DateTimeFormatter.ofPattern can't be used.
The value of writeDate in the table is never null, but stringToDate still receives value = null.
Your real issue would be to determine why writeDate is being extracted as null, and that would be within the functions in the classes annotated with @Dao.
However, you could use the following to ensure that nulls are never returned:-
public class DateTimeConverter {
    @TypeConverter
    public static Date stringToDate(String value) {
        Date defaultDate = new Date(0); // 1970-01-01 00:00:00
        DateFormat df = new SimpleDateFormat(Constants.SQLITE_DATE_TIMEFORMAT, Locale.US);
        if (value != null) {
            try {
                return df.parse(value);
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
        return defaultDate;
    }

    @TypeConverter
    public static String dateToString(Date value) {
        DateFormat df = new SimpleDateFormat(Constants.SQLITE_DATE_TIMEFORMAT, Locale.US);
        if (value != null) {
            return df.format(value);
        } else {
            return "1970-01-01 00:00:00";
        }
    }
}
Here's a demonstration that uses the above in conjunction with:-
The @Dao class AllDao :-
@Dao
abstract class AllDao {
    @Insert(onConflict = IGNORE)
    abstract fun insert(entity: Entity)

    /* Delete all rows */
    @Query("DELETE FROM entity")
    abstract fun clear()

    /* get inserted data */
    @Query("SELECT * FROM entity")
    abstract fun getAllFromEntity(): List<Entity>

    /* purposefully get 2 invalid dates (1 rubbish date, 1 NULL) */
    @Query("SELECT -999 AS id,'invaliddate' AS writeDate UNION SELECT -123 AS id, NULL as writeDate")
    abstract fun getMessedUpDate(): List<Entity>
}
The getMessedUpDate query is designed to do as it says and return dates that would come through as null with the TypeConverters in the question, but not with the modified TypeConverters in this answer.
and then using :-
db = TheDatabase.getInstance(this)
dao = db.getAllDao()
dao.clear()
dao.insert(Entity())
var dateInt = (System.currentTimeMillis())
dao.insert(Entity(writeDate = Date(dateInt)))
// note: the Int arithmetic below overflows (100 days in millis exceeds Int.MAX_VALUE),
// so the third row ends up only ~14 hours earlier, as the log below shows
dao.insert(Entity(1000, Date((System.currentTimeMillis()) - (100 /*days*/ * 24 /*hours*/ * 60 /*mins*/ * 60 /*secs*/ * 1000))))
for (e: Entity in dao.getAllFromEntity()) {
    Log.d("DBINFO", "Date is ${e.writeDate} ID is ${e.id}")
}
for (e: Entity in dao.getMessedUpDate()) {
    Log.d("DBINFO", "Date is ${e.writeDate} ID is ${e.id}")
}
The log includes :-
2021-11-06 07:40:38.063 D/DBINFO: Date is Sat Nov 06 07:40:38 GMT+11:00 2021 ID is 1
2021-11-06 07:40:38.064 D/DBINFO: Date is Sat Nov 06 07:40:38 GMT+11:00 2021 ID is 2
2021-11-06 07:40:38.064 D/DBINFO: Date is Fri Nov 05 17:46:12 GMT+11:00 2021 ID is 1000
2021-11-06 07:40:38.069 D/DBINFO: Date is Thu Jan 01 10:00:00 GMT+10:00 1970 ID is -999
2021-11-06 07:40:38.069 D/DBINFO: Date is Thu Jan 01 10:00:00 GMT+10:00 1970 ID is -123
i.e. the 4th and 5th lines have returned the "default date" rather than null.

Calculate number of days excluding sunday in Hive

I have two timestamps as input. I want to calculate the time difference in hours between those timestamps excluding Sundays.
I can get the number of days using the datediff function in Hive.
I can get the day of the week for a particular date using from_unixtime(unix_timestamp(startdate), 'EEEE').
But I don't know how to combine those functions to achieve my requirement; is there any other easy way to achieve this?
Thanks in Advance.
You can write a custom UDF which takes the two columns containing the dates as inputs and counts the difference between the dates excluding Sundays.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Date;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class IsoYearWeek extends UDF {
    // takes the two date columns as inputs
    public LongWritable evaluate(Text dateString, Text dateString1) throws ParseException {
        SimpleDateFormat date = new SimpleDateFormat("dd/MM/yyyy");
        /* String date1 = "20/07/2016";
           String date2 = "28/07/2016"; */
        int count = 0;
        List<Date> dates = new ArrayList<Date>();
        Date startDate = (Date) date.parse(dateString.toString());
        Date endDate = (Date) date.parse(dateString1.toString());
        long interval = 24 * 1000 * 60 * 60; // 1 day in millis
        long endTime = endDate.getTime(); // create your end time here, possibly using Calendar or Date
        long curTime = startDate.getTime();
        while (curTime <= endTime) {
            dates.add(new Date(curTime));
            curTime += interval;
        }
        for (int i = 0; i < dates.size(); i++) {
            Date lDate = (Date) dates.get(i);
            if (lDate.getDay() == 0) {
                count += 1; // counts the number of Sundays in between
            }
        }
        // days difference excluding Sundays
        long days_diff = (endDate.getTime() - startDate.getTime()) / (24 * 60 * 60 * 1000) - count;
        return new LongWritable(days_diff);
    }
}
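To sanity-check the UDF logic locally before packaging the jar, a quick test like this can help (a sketch in Scala; assumes hive-exec and hadoop-common are on the classpath):

import org.apache.hadoop.io.Text

object SundayDiffCheck extends App {
  val udf = new IsoYearWeek
  // 20/07/2016 (a Wednesday) to 28/07/2016 is an 8-day difference that
  // spans one Sunday (24/07), so this should print 7
  println(udf.evaluate(new Text("20/07/2016"), new Text("28/07/2016")))
}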
Alternatively, use Spark, so that it is easier to implement and maintain:
import org.joda.time.format.DateTimeFormat

def dayDiffWithExcludeWeekendAndHoliday(startDate: String, endDate: String, holidayExclusion: Seq[String]) = {
  @transient val datePattern = "yyyy-MM-dd"
  @transient val dateformatter = DateTimeFormat.forPattern(datePattern)
  var numWeekDaysValid = 0
  var numWeekends = 0
  var numWeekDaysInValid = 0
  val holidayExclusionJoda = holidayExclusion.map(dateformatter.parseDateTime(_))
  val startDateJoda = dateformatter.parseDateTime(startDate)
  var startDateJodaLatest = dateformatter.parseDateTime(startDate)
  val endDateJoda = dateformatter.parseDateTime(endDate)
  while (startDateJodaLatest.compareTo(endDateJoda) != 0) {
    startDateJodaLatest.getDayOfWeek match {
      case value if value > 5 => numWeekends = numWeekends + 1
      case value if value <= 5 =>
        if (holidayExclusionJoda.contains(startDateJodaLatest)) numWeekDaysInValid = numWeekDaysInValid + 1
        else numWeekDaysValid = numWeekDaysValid + 1
    }
    startDateJodaLatest = startDateJodaLatest.plusDays(1)
  }
  Array(numWeekDaysValid, numWeekends, numWeekDaysInValid)
}

spark.udf.register("dayDiffWithExcludeWeekendAndHoliday", dayDiffWithExcludeWeekendAndHoliday(_: String, _: String, _: Seq[String]))

case class tmpDateInfo(startDate: String, endDate: String, holidayExclusion: Array[String])
case class tmpDateInfoFull(startDate: String, endDate: String, holidayExclusion: Array[String], numWeekDaysValid: Int, numWeekends: Int, numWeekDaysInValid: Int)

def dayDiffWithExcludeWeekendAndHolidayCase(tmpInfo: tmpDateInfo) = {
  @transient val datePattern = "yyyy-MM-dd"
  @transient val dateformatter = DateTimeFormat.forPattern(datePattern)
  var numWeekDaysValid = 0
  var numWeekends = 0
  var numWeekDaysInValid = 0
  val holidayExclusionJoda = tmpInfo.holidayExclusion.map(dateformatter.parseDateTime(_))
  val startDateJoda = dateformatter.parseDateTime(tmpInfo.startDate)
  var startDateJodaLatest = dateformatter.parseDateTime(tmpInfo.startDate)
  val endDateJoda = dateformatter.parseDateTime(tmpInfo.endDate)
  while (startDateJodaLatest.compareTo(endDateJoda) != 0) {
    startDateJodaLatest.getDayOfWeek match {
      case value if value > 5 => numWeekends = numWeekends + 1
      case value if value <= 5 =>
        if (holidayExclusionJoda.contains(startDateJodaLatest)) numWeekDaysInValid = numWeekDaysInValid + 1
        else numWeekDaysValid = numWeekDaysValid + 1
    }
    startDateJodaLatest = startDateJodaLatest.plusDays(1)
  }
  tmpDateInfoFull(tmpInfo.startDate, tmpInfo.endDate, tmpInfo.holidayExclusion, numWeekDaysValid, numWeekends, numWeekDaysInValid)
}
//df way 1
val tmpDF=Seq(("2020-05-03","2020-06-08",List("2020-05-08","2020-06-05"))).toDF("startDate","endDate","holidayExclusion").select(col("startDate").cast(StringType),col("endDate").cast(StringType),col("holidayExclusion"))
tmpDF.as[tmpDateInfo].map(dayDiffWithExcludeWeekendAndHolidayCase).show(false)
//df way 2
tmpDF.selectExpr("*","dayDiffWithExcludeWeekendAndHoliday(cast(startDate as string),cast(endDate as string),cast(holidayExclusion as array<string>)) as resultDays").selectExpr("startDate","endDate","holidayExclusion","resultDays[0] as numWeekDaysValid","resultDays[1] as numWeekends","resultDays[2] as numWeekDaysInValid").show(false)
// spark sql way, works with hive table when configured in hive metastore
tmpDF.createOrReplaceTempView("tmpTable")
spark.sql("select startDate,endDate,holidayExclusion,dayDiffWithExcludeWeekendAndHoliday(startDate,endDate,holidayExclusion) from tmpTable").show(false)

Update multiple rows using single query in JDO

I have a table like this:
SortOrder  Name    Date
1          Image1  5/6/15
2          Image2  6/8/16
3          Image3  6/8/16
4          Image4  9/8/16
..........
Now if I delete Image2, I want to update the table so that the sort order is again in ordered form, like this:
Updated table:
SortOrder  Name    Date
1          Image1  5/6/15
2          Image3  6/8/16
3          Image4  9/8/16
..........
So how can I make that possible?
This is the class for the table Images:
public class Images extends ApplicationEntity {
    @Column(name = "PROFILE_ID", allowsNull = "false")
    private Profile profile;
    private int sortOrder;
    private boolean active;
    private Date deletedDate;

    public Images() {
        super.setEntity("Images ");
    }

    public Images(Profile profile, int sortOrder, boolean active, Date deletedDate) {
        super();
        this.profile = profile;
        this.sortOrder = sortOrder;
        this.active = active;
        this.deletedDate = deletedDate;
    }

    public Profile getProfile() {
        return profile;
    }

    public int getSortOrder() {
        return sortOrder;
    }

    public void setSortOrder(int sortOrder) {
        this.sortOrder = sortOrder;
    }

    public boolean isActive() {
        return active;
    }

    public void setActive(boolean active) {
        this.active = active;
    }

    public Date getDeletedDate() {
        return deletedDate;
    }

    public void setDeletedDate(Date deletedDate) {
        this.deletedDate = deletedDate;
    }

    @Override
    public String toString() {
        return "Images [profile=" + profile + ", sortOrder=" + sortOrder
                + ", active=" + active + ", deletedDate=" + deletedDate + "]";
    }
}
I tried this query:
String query = "update Images set SORTORDER =((SELECT selected_value FROM (SELECT MAX(SORTORDER) AS selected_value FROM Images where ACTIVE = 0 && PROFILE_Id="+profileId+") AS sub_selected_value) + 1) where PROFILE_Id="+profileId;
But it updates all the sortOrder values to the same value.
I was using this code to update the sortOrder:
int sortOrder = 1;
for (Images file : imagesListFromDB) {
    file.setSortOrder(sortOrder);
    sortOrder++;
}
But it takes a long time; with 8000 images it is really slow, so I thought of updating in a single query. But I am not getting any idea how.
To do it in a single statement you could make use of SQL. Here are a couple of ideas (adapt to your use-case); you use the "?" parameter to set the position at which the deletion happened.
UPDATE IMAGES SET SORTORDER =
(CASE WHEN (SORTORDER <= ?) THEN SORTORDER
ELSE (SORTORDER-1) END)
Or
UPDATE IMAGES SET SORTORDER = SORTORDER-1
WHERE SORTORDER > ?
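For illustration, executing the second statement over plain JDBC could look like this (a sketch; the connection and the deleted row's sortOrder are assumed to be available):

import java.sql.Connection

// Shift every sortOrder above the deleted position down by one.
def closeGap(connection: Connection, deletedSortOrder: Int): Int = {
  val ps = connection.prepareStatement(
    "UPDATE IMAGES SET SORTORDER = SORTORDER - 1 WHERE SORTORDER > ?")
  try {
    ps.setInt(1, deletedSortOrder)
    ps.executeUpdate() // returns the number of rows renumbered
  } finally {
    ps.close()
  }
}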
Using the DataNucleus JDOQL UPDATE extension you could do this (setting the parameter "param" to the sortOrder start point to update):
pm.newQuery("UPDATE mydomain.Images SET this.sortOrder=this.sortOrder-1 WHERE this.sortOrder > :param");
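Executing that bulk update might look like the following (a hedged sketch; the exact execution semantics of the DataNucleus JDOQL UPDATE extension may vary by version):

// Bind the implicit :param and run the bulk update (DataNucleus extension);
// deletedSortOrder is the sortOrder of the row that was removed.
val query = pm.newQuery(
  "UPDATE mydomain.Images SET this.sortOrder = this.sortOrder - 1 " +
  "WHERE this.sortOrder > :param")
query.execute(deletedSortOrder)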

Get count for rolling date value using Apache Pig

How can we achieve this using Apache Pig?
File:
A 2014/10/01
A 2014/09/01
A 2014/08/01
A 2014/02/01
The result should be A count 3, since I want to count the number of records per group (here A) that fall within a rolling window of 30 days between records.
Please find the solution below; I hope you can enhance it further if required. Try executing it with your input and let me know how it works.
input.txt
A 2014/12/01
A 2014/11/01
A 2014/10/01
A 2014/07/01
A 2014/05/01
A 2014/04/01
B 2014/09/01
B 2014/07/01
B 2014/06/01
B 2014/02/01
C 2014/09/01
C 2014/07/01
C 2014/05/01
Expected output:
A 5
B 2
C 0
(The UDF counts every record that belongs to a run of consecutive 30/31-day gaps: for A, the runs {2014/12/01, 2014/11/01, 2014/10/01} and {2014/05/01, 2014/04/01} give 3 + 2 = 5, while the isolated 2014/07/01 contributes nothing.)
PigScript:
REGISTER rollingCount.jar;
A = LOAD 'input.txt' Using PigStorage(' ') AS (f1:chararray,f2:chararray);
B = GROUP A BY f1;
C = FOREACH B GENERATE mypackage.ROLLINGCOUNT(BagToString($1)) AS rollingCnt;
DUMP C;
Output from the script:
(A,5)
(B,2)
(C,0)
Java code:
1. Compile the Java code below and create a jar file named rollingCount.jar.
2. I wrote this code quickly; you can optimize it if required.
ROLLINGCOUNT.java
package mypackage;

import java.io.*;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import java.text.SimpleDateFormat;
import java.util.concurrent.TimeUnit;
import java.util.*;

public class ROLLINGCOUNT extends EvalFunc<Integer> {

    public Integer exec(Tuple input) throws IOException {
        // Get the input String from the request
        String inputString = (String) input.get(0);
        Date[] arrayOfDates = getArrayOfDate(inputString);
        long diffDays[] = getDaysBetweenList(arrayOfDates);
        int rollingCount = getRollingCount(diffDays);
        return rollingCount;
    }

    // Function to convert the string to an array of dates
    static protected Date[] getArrayOfDate(String inputString) {
        // Get the 1st column, this will be the Id
        String ID = inputString.split("_")[0];
        // Remove all the Ids, because they are duplicate columns
        String modifiedString = inputString.replace(ID + "_", "");
        // Split the string into multiple columns using '_' as the delimiter
        String list[] = modifiedString.split("_");
        // Convert the strings to an array of dates
        Date[] dateList = new Date[list.length];
        int index = 0;
        for (String dateString : list) {
            try {
                // Convert the date string to a date object in the given format
                SimpleDateFormat dFormat = new SimpleDateFormat("yyyy/MM/dd");
                dateList[index++] = dFormat.parse(dateString);
            } catch (Exception e) {
                // error handling goes here
            }
        }
        return dateList;
    }

    // Function to get the differences between consecutive dates
    static protected long[] getDaysBetweenList(Date[] arrayOfDate) {
        long diffDays[] = new long[arrayOfDate.length - 1];
        int cnt = 0;
        for (int index = 0; index < arrayOfDate.length - 1; index++) {
            long diff = Math.abs(arrayOfDate[index + 1].getTime() - arrayOfDate[index].getTime());
            long days = TimeUnit.DAYS.convert(diff, TimeUnit.MILLISECONDS);
            diffDays[cnt++] = days;
        }
        return diffDays;
    }

    // Function to get the total rolling count
    static protected int getRollingCount(long diffDays[]) {
        int result = 0;
        for (int index = 0; index < diffDays.length; index++) {
            int cnt = 0;
            // hardcoded to 30 and 31 days; may need to handle February (28 or 29 days)
            while ((index < diffDays.length) && ((diffDays[index] == 30) || (diffDays[index] == 31))) {
                cnt++;
                index++;
            }
            if (cnt > 0) {
                result = result + cnt + 1;
            }
        }
        return result;
    }
}

Advanced LINQ / Lambda Expression

I have this model:
Public Class Tbl_Exercise
    <Key()> Public Property Exercise_ID() As Integer
    Public Property Exercise_Employee_ID() As Integer
    Public Property Exercise_Create_Date() As Date
    <ForeignKey("Tbl_Exercise_Type")> _
    Public Property Exercise_Type_ID() As Integer
    Public Property Exercise_Duration() As Integer
    Public Overridable Property Tbl_Exercise_Type As Tbl_Exercise_Type
End Class
I need to get the sum of the Exercise_Duration for each week of the year. I then need to check if the sum for the week is greater than or equal to 150. If it is, I need to increment another variable (a count). The goal is to display this:
# of weeks you've reached 150: X out of Z
(Where X is the count of weeks greater than or equal to 150 and Z is equal to the total number of weeks in the current year.)
Final
' get number of weeks the exercise goal was reached (greater than or equal to the goal)
Dim exerciseDb = New ExerciseDbContext
Dim exercise = exerciseDb.Tbl_Exercises.Where(Function(x) x.Exercise_Employee_ID = empId)
Dim weeks = exercise.ToList.GroupBy(Function(x) CultureInfo.CurrentCulture.Calendar.GetWeekOfYear(x.Exercise_Create_Date, CalendarWeekRule.FirstDay, DayOfWeek.Sunday))
Dim totalWeeks = 0
For Each week In weeks
    Dim sum = week.Sum(Function(x) x.Exercise_Duration)
    If sum >= 150 Then ' >= so weeks that exactly reach the goal are counted
        totalWeeks += 1
    End If
Next
Debug.Print("over150: " + totalWeeks.ToString)
using System.Globalization;

DateTimeFormatInfo dfi = DateTimeFormatInfo.CurrentInfo;
Calendar cal = dfi.Calendar;
var recap =
    (from e in exercises
     group e by cal.GetWeekOfYear(e.Exercise_Create_Date,
                                  dfi.CalendarWeekRule,
                                  dfi.FirstDayOfWeek)
     into g
     select new
     {
         g.Key,
         Total = g.Sum(x => x.Exercise_Duration)
     }
     into p
     where p.Total >= 150 // "reached 150" means at least 150
     select p)
    .Count();
Here is an example in C#:
public class Exercise
{
    public DateTime CreateDate { get; set; }
    public int Duration { get; set; }
}

class Program
{
    static void Main()
    {
        Exercise[] ex = new Exercise[]
        {
            new Exercise { CreateDate = DateTime.Parse("1/1/2012"), Duration = 160 },
            new Exercise { CreateDate = DateTime.Parse("1/8/2012"), Duration = 160 },
            new Exercise { CreateDate = DateTime.Parse("1/15/2012"), Duration = 160 },
            new Exercise { CreateDate = DateTime.Parse("2/1/2012"), Duration = 100 },
            new Exercise { CreateDate = DateTime.Parse("3/1/2012"), Duration = 75 },
            new Exercise { CreateDate = DateTime.Parse("3/1/2012"), Duration = 80 }
        };

        var weeks = ex.GroupBy(x => CultureInfo.CurrentCulture.Calendar.GetWeekOfYear(x.CreateDate, CalendarWeekRule.FirstDay, DayOfWeek.Sunday));
        int currentweek = CultureInfo.CurrentCulture.Calendar.GetWeekOfYear(DateTime.Now, CalendarWeekRule.FirstDay, DayOfWeek.Sunday);
        // ">= 150" so that weeks hitting the goal exactly are counted
        int over150 = weeks.Where(group => group.Sum(item => item.Duration) >= 150).Count();
        Console.WriteLine(String.Format("# of weeks you've reached 150: {0} out of {1}", over150, currentweek));
    }
}