How to find last 7 days records using pig latin? - apache-pig

I am a beginner in Pig Latin. I need to find the records from the last 7 days in a CSV file that contains the last 4 years of data.
Can anyone help me understand how to do this?

A more generic way is to check, for each line of data, whether it is older than 7 days.
For this, we need to capture the date in each line of data. Let the data be a relation dataSet with a field named date of type chararray.
In Pig 0.11 you can convert the date field from chararray to datetime data type using the ToDate() function, and then get the difference between the CurrentTime() and date using DaysBetween() and filter according to it. For example:
lastSevenDaysRec = FILTER dataSet BY DaysBetween(CurrentTime(), ToDate(date, 'yyyy MM dd')) <= 7;
See the Pig Latin documentation for details on the different date-time functions, and for the format patterns that the ToDate() function accepts.
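To make the filter logic concrete outside Pig, here is a minimal Python sketch of the same idea. The function name last_seven_days, the dict-based records, and the fixed `now` value are assumptions made for illustration; the 'yyyy MM dd' pattern from the answer is written as "%Y %m %d".

```python
from datetime import datetime

def last_seven_days(records, now, pattern="%Y %m %d"):
    """Keep records whose 'date' field is at most 7 whole days before `now`."""
    kept = []
    for rec in records:
        # Parse the text date, then measure its age in whole days.
        age = now - datetime.strptime(rec["date"], pattern)
        if age.days <= 7:
            kept.append(rec)
    return kept

# Hypothetical sample data; `now` is fixed so the result is reproducible.
records = [{"date": "2023 03 01"}, {"date": "2023 03 09"}, {"date": "2019 05 05"}]
recent = last_seven_days(records, now=datetime(2023, 3, 10, 12, 0))
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the filter easy to test.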

Assuming that your set of data is A and there is one line per day, and it has a field named date, you could try something similar to this:
B = GROUP A BY date;
C = ORDER B BY group DESC;
D = LIMIT C 7;
D then holds the records for the seven most recent dates, grouped by date.
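The group, order, and limit steps can be sketched in Python as follows. This is only an illustration: the function name latest_dates and the sample rows are made up, and it assumes the date field is an ISO-style string (yyyy-MM-dd) so that sorting the text sorts chronologically.

```python
from collections import defaultdict

def latest_dates(rows, n=7):
    """Group rows by their 'date' field and return the n newest groups."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["date"]].append(row)
    # Sorting the keys in descending order puts the newest dates first.
    newest = sorted(groups, reverse=True)[:n]
    return [(d, groups[d]) for d in newest]

# Hypothetical sample: one row per day for nine consecutive days.
rows = [{"date": f"2023-03-{day:02d}"} for day in range(1, 10)]
result = latest_dates(rows)
```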

Related

How do I do date diff in a spark sql environment?

I have a table with a creation date and an action date. I'd like to get the number of minutes between the two dates. I looked at the docs and I'm having trouble finding a solution.
%sql
SELECT datediff(creation_dt, actions_dt)
FROM actions
limit 10
This gives me the number of days between the two dates. One record looks like
2019-07-31 23:55:22.0 | 2019-07-31 23:55:21 | 0
How can I get the number of minutes?
As stated in the comments, if you are using Spark or PySpark then the withColumn method is best.
If, however, you are working in the Spark SQL environment, you can use the unix_timestamp() function to get what you need:
select ((unix_timestamp('2019-09-09','yyyy-MM-dd') - unix_timestamp('2018-09-09','yyyy-MM-dd'))/60);
Swap the literal dates for your column names and pass your date pattern as the second parameter.
Both dates are converted into seconds and the difference is taken. We then divide by 60 to get the minutes. For the literal dates above, the result is:
525600.0
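The same arithmetic (difference in seconds, divided by 60) can be checked in plain Python. This is a small sketch, not Spark code; the helper name minutes_between is an assumption, and the pattern uses Python's strptime syntax rather than Spark's.

```python
from datetime import datetime

def minutes_between(later, earlier, pattern="%Y-%m-%d %H:%M:%S"):
    """Return the difference between two timestamp strings in minutes."""
    delta = datetime.strptime(later, pattern) - datetime.strptime(earlier, pattern)
    # total_seconds() plays the role of the unix_timestamp() subtraction.
    return delta.total_seconds() / 60

# The two literal dates from the query above.
year_in_minutes = minutes_between("2019-09-09", "2018-09-09", pattern="%Y-%m-%d")
```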

What is the fastest way to populate table with dates after certain day?

Let's assume that we have the following input parameters:
date [Date]
period [Integer]
The task is the following: build the table which has two columns: date and dayname.
So, if we have date = 2018-07-12 and period = 3 the table should look like this:
date |dayname
-------------------
2018-07-12|THURSDAY
2018-07-13|FRIDAY
2018-07-14|SATURDAY
My solution is the following:
select add_days(date, -1) into previousDay from "DUMMY";
for i in 1..:period do
select add_days(previousDay, i) into nextDay from "DUMMY";
:result.insert((nextDay, dayname(nextDay)));
end for;
But I don't like the loop. I assume it could become a performance problem if I want to put more complicated values into the result table.
What would be a better solution?
Running through a loop and inserting values one by one is most certainly the slowest possible option to accomplish the task.
Instead, you could use SAP HANA's time series feature.
With a statement like
SELECT to_date(GENERATED_PERIOD_START)
FROM SERIES_GENERATE_TIMESTAMP('INTERVAL 1 DAY', '01.01.0001', '31.12.9999')
you could generate a bounded range of valid dates with a given interval length.
In my tests using this approach brought the time to insert a set of dates from ca. 9 minutes down to 7 seconds...
I've written about that some time ago here and here if you want some more examples for that.
In those examples, I even included the use of series tables that allow for efficient compression of timestamp column values.
The Series Data functions include SERIES_GENERATE_DATE, which returns a set of values in the date data type, so you don't have to convert the returned data into the desired date format.
Here is some sample code:
declare d int := 5;
declare dstart date := '01.01.2018';
SELECT generated_period_start FROM SERIES_GENERATE_DATE('INTERVAL 1 DAY', :dstart, add_days(:dstart, :d));
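For comparison, the expected output from the question (date = 2018-07-12, period = 3) can be reproduced with a short Python sketch. This is not HANA code: the function name date_table and the DAY_NAMES list are assumptions for illustration, and the day names are hard-coded so the result does not depend on the system locale.

```python
from datetime import date, timedelta

# date.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
             "FRIDAY", "SATURDAY", "SUNDAY"]

def date_table(start, period):
    """Build (date, dayname) rows for `period` consecutive days from `start`."""
    days = (start + timedelta(days=i) for i in range(period))
    return [(d.isoformat(), DAY_NAMES[d.weekday()]) for d in days]

table = date_table(date(2018, 7, 12), 3)
```

Like the series functions, this produces the whole set in one expression instead of inserting row by row.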

Limiting data on monthly basis from start date to system date dynamically in Tibco spotfire

I've tried limiting data on monthly basis in spotfire and it's working fine.
Now I'm trying to get the records from the start of the current month up to the current date.
For example, if the current date is Sept 21, then I should get the records from Sept 01 to Sept 21 (dynamically).
I have a property control to input the number of months.
The easiest way to do this is with Month and Year. For example, in your visualization:
Right Click > Properties > Data > Limit Data Using Expressions (Edit)
Then, use this expression:
Month([TheDate]) = Month(DateTimeNow()) and Year([TheDate]) = Year(DateTimeNow())
This will limit the data to only those rows with the current Year/Month combination in your data column. Just replace [TheDate] with whatever your date column name is.
In other places, you can wrap this in an IF statement if you'd like. It's redundant in this case, but sometimes helps with readability.
IF(Month([TheDate]) = Month(DateTimeNow()) and Year([TheDate]) = Year(DateTimeNow()),TRUE,FALSE)
@san - Adding to @scsimon's answer: if you would like to limit values precisely to the range from the 1st of the current month to the current date, you could add the expression below to the 'Limit data using expression' section.
[Date]>=date(1&'-'&Month(DateTimeNow())&'-'&year(DateTimeNow())) and [Date]<=DateTimeNow()
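The month-to-date condition can be illustrated in Python as follows. This is only a sketch of the logic, not Spotfire expression syntax; the function name month_to_date and the sample dates are assumptions, and `today` is passed in so the example is reproducible.

```python
from datetime import date

def month_to_date(dates, today):
    """Keep dates from the 1st of today's month up to and including today."""
    month_start = today.replace(day=1)
    return [d for d in dates if month_start <= d <= today]

# Hypothetical sample around the Sept 21 example from the question.
sample = [date(2023, 8, 31), date(2023, 9, 1), date(2023, 9, 21), date(2023, 9, 22)]
kept = month_to_date(sample, today=date(2023, 9, 21))
```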

sqlalchemy select by date column only x newest days

Suppose I have a table MyTable with a column some_date (of date type, of course) and I want to select the newest 3 months of data (or x days).
What is the best way to achieve this?
Please notice that the date should not be measured from today but rather from the date range in the table (which might be older than today).
I need to find the maximum date and compare it to each row. If the difference is less than x days, the row should be returned.
All of this should be done with sqlalchemy and without loading the entire table.
What is the best way of doing it? Must I have a subquery to find the maximum date? How do I select the last X days?
Any help is appreciated.
EDIT:
The following query works in Oracle but seems inefficient (is max calculated for each row?) and I don't think that it'll work for all dialects:
select * from my_table where (select max(some_date) from my_table) - some_date < 10
You can do this in a single query and without resorting to creating datediff.
Here is an example I used for getting everything in the past day:
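from datetime import datetime, timedelta

one_day = timedelta(hours=24)
one_day_ago = datetime.now() - one_day
# Message is the model class; created is its datetime column
Message.query.filter(Message.created > one_day_ago).all()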
You can adapt the timedelta to whatever time range you are interested in.
UPDATE
Upon re-reading your question, it looks like I failed to take into account the fact that you want to compare two dates that are both in the database rather than compare against today's date. I'm pretty sure that this sort of behavior is going to be database specific. In Postgres, you can use straightforward arithmetic.
Operations with DATEs
1. The difference between two DATES is always an INTEGER, representing the number of DAYS difference
DATE '1999-12-30' - DATE '1999-12-11' = INTEGER 19
You may add or subtract an INTEGER to a DATE to produce another DATE
DATE '1999-12-11' + INTEGER 19 = DATE '1999-12-30'
You're probably using timestamps if you are storing dates in Postgres. Doing math with timestamps produces an interval object. SQLAlchemy works with timedeltas as a representation of intervals. So you could do something like:
from datetime import timedelta
one_day = timedelta(hours=24)
Model.query.join(ModelB, Model.created - ModelB.created < one_day)
I haven't tested this exactly, but I've done things like this and they have worked.
I ended up doing two selects: one to get the max date and another to get the data.
Using the datediff recipe from this thread, I added a datediff function and used this query:
q = session.query(MyTable).filter(datediff(max_date, some_date) < 10)
I still don't think this is the best way, but until someone proves me wrong, it will have to do...
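The two-select approach can also be written without a datediff function by computing the cutoff date in Python and filtering on a plain comparison. The sketch below uses the standard-library sqlite3 module rather than SQLAlchemy so that it runs without extra dependencies; the table layout, the ISO-formatted text dates, and the helper name newest_rows are assumptions for illustration. The same two steps should map onto two SQLAlchemy queries.

```python
import sqlite3
from datetime import date, timedelta

def newest_rows(conn, days):
    """Return rows whose some_date is less than `days` days before the newest date."""
    # Select 1: a single aggregate query for the newest date in the table.
    (max_text,) = conn.execute("SELECT MAX(some_date) FROM my_table").fetchone()
    if max_text is None:
        return []  # empty table
    cutoff = date.fromisoformat(max_text) - timedelta(days=days)
    # Select 2: a simple range filter; ISO date strings compare chronologically.
    return conn.execute(
        "SELECT id, some_date FROM my_table WHERE some_date > ? ORDER BY some_date",
        (cutoff.isoformat(),),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, some_date TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, some_date) VALUES (?, ?)",
    [(1, "2019-01-01"), (2, "2019-03-25"), (3, "2019-03-30"), (4, "2019-04-01")],
)
rows = newest_rows(conn, 10)
```

Because the cutoff is an ordinary bound value, the second query is a plain comparison on the date column and does not rely on dialect-specific date arithmetic.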

how to convert a number to date in oracle sql developer

I have an Excel dataset that needs to be imported into a table. One column is a date, but the data is stored in number format, such as 41275. When importing the data, I tried choosing the date format yyyy-mm-dd, which gives the error "not a valid month". I also tried MM/DD/YYYY, which gives the error "day of month must be between 1 and last day of month". Does anyone know what this number is and how I can convert it to a date format when importing it into the database? Thanks!
The expression you are looking for (taking into account the Excel leap-year bug AmmoQ mentions) is:
case
when yourNumberToBeImported <= 59 then date'1899-12-31' + yourNumberToBeImported
else date'1899-12-30' + yourNumberToBeImported
end
Then, you may either
Create a (global) temporary table in your Oracle DB, load your data from the Excel to the table and then reload the data from the temporary table to your target table with the above calculation included.
or you may
Load the data from the Excel to a persistent table in your Oracle DB and create a view over the persistent table which would contain the above calculation.
The number you got is the Excel representation of a date.
Excel stores a date as the number of days counted from a fixed starting date. To be precise:
1 = 1-JAN-1900
2 = 2-JAN-1900
...
30 = 30-JAN-1900
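So, to get your Excel number into an Oracle date, you might want to try something like the following. Note that for any value above 59 the base date has to be 1899-12-30 rather than 1899-12-31, because Excel also counts the nonexistent 29-FEB-1900. For 41275 this gives 1 January 2013:
to_date('1899-12-30','yyyy-mm-dd') + 41275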
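The two-base-date rule from the case expression above can be checked with a short Python sketch. The function name excel_serial_to_date is an assumption for illustration, and rejecting serial 60 (Excel's nonexistent 29 February 1900) is a design choice of this sketch, not something the answers above prescribe.

```python
from datetime import date, timedelta

def excel_serial_to_date(serial):
    """Convert an Excel (1900 date system) day serial to a calendar date."""
    if serial < 1:
        raise ValueError("serial must be 1 or greater")
    if serial == 60:
        # Excel treats 1900 as a leap year; serial 60 is the fictitious 1900-02-29.
        raise ValueError("serial 60 has no real calendar date")
    # Serials up to 59 count from 1899-12-31; later ones are shifted by one day.
    base = date(1899, 12, 31) if serial <= 59 else date(1899, 12, 30)
    return base + timedelta(days=serial)

converted = excel_serial_to_date(41275)
```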