Hive related date conversion - SQL

I am facing an issue while trying to add one year to the current timestamp.
I was able to add the year to the current timestamp, but the time portion is not coming through in the result.
Any help would be greatly appreciated.
I am trying this: select from_unixtime(unix_timestamp());

If you want a timestamp one year later than now, you can do date arithmetic as follows:
select current_timestamp() + interval '1' year
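Note that pure-SQL alternatives like add_months drop the time part, which is exactly the problem described; a hedged comparison (exact output depends on your Hive version):
select add_months(current_timestamp(), 12);        -- e.g. 2022-04-25 (time of day lost)
select current_timestamp() + interval '1' year;    -- e.g. 2022-04-25 08:22:17.948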

You can solve your problem with a Hive UDF.
package com.practice;

import java.sql.Timestamp;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import org.apache.hadoop.hive.ql.exec.UDF;

public class addYearWithTimestamp extends UDF {
    private final SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S");

    public String evaluate(String t, int year) throws ParseException {
        // parse the incoming timestamp string to epoch milliseconds
        long time = formatter.parse(t).getTime();
        Timestamp ts = new Timestamp(time);
        // add the requested number of years, then push the result back into the Timestamp
        Calendar cal = Calendar.getInstance();
        cal.setTime(ts);
        cal.add(Calendar.YEAR, year);
        ts.setTime(cal.getTimeInMillis());
        return ts.toString();
    }
}
After creating addYearWithTimestamp.jar, register it in Hive and create the UDF:
ADD JAR /home/cloudera/Desktop/addYearWithTimestamp.jar;
CREATE TEMPORARY FUNCTION addYear as 'com.practice.addYearWithTimestamp';
Use the UDF:
hive> SELECT addYear(current_timestamp,1);
OK
2021-04-25 08:22:17.948
Time taken: 0.083 seconds, Fetched: 1 row(s)


Pandas DataFrame Time Conversion

How do I convert a stopwatch display value like '01:31:41' into a seconds value? E.g. this reads as 1 minute and 31 seconds, so roughly 91 seconds.

split the string into a list
create a timedelta object without the milliseconds, as they aren't needed
call the timedelta method total_seconds()
from datetime import timedelta

original_time_string = "01:30:12"  # MM:SS:ms, as shown on the stopwatch
list_string = original_time_string.split(":")
# minutes and seconds only; the millisecond field is dropped
print(timedelta(minutes=int(list_string[0]), seconds=int(list_string[1])).total_seconds())  # 90.0
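The same steps generalize to any 'MM:SS:ms' stopwatch value; stopwatch_to_seconds below is a hypothetical helper wrapping them, shown only as a sketch:
from datetime import timedelta

def stopwatch_to_seconds(value):
    # 'MM:SS:ms' -> seconds; the millisecond field is dropped
    minutes, seconds, _ms = (int(p) for p in value.split(":"))
    return timedelta(minutes=minutes, seconds=seconds).total_seconds()

print(stopwatch_to_seconds("01:31:41"))  # 91.0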

How do I repeat BigQueryOperator Dag and pass different dates to my sql file

I have a query I want to run using the BigQueryOperator. Each day it will run for each of the past 21 days. The SQL file stays the same, but the date passed to the file changes. For example, today it will run for today's date, then repeat for yesterday's date, and then for 2 days ago, all the way back to 21 days ago. So it will run on 7/14/2021, and I need to pass that date to my SQL file; then it will run for 7/13/2021, and the date I need to pass to my SQL file is 7/13/2021. How can I have this DAG repeat over a date range and dynamically pass each date to the SQL file?
In the BigQueryOperator, variables are passed in the user_defined_macros section, so I don't know how to change the date I am passing. I thought about looping over an array of dates, but I don't know how to pass each date to the SQL file linked in the BigQueryOperator.
My sql file is 300 lines long, so I included a simple example below, as people seem to ask for one.
DAG
with DAG(
    dag_id,
    schedule_interval='0 12 * * *',
    start_date=datetime(2021, 1, 1),
    template_searchpath='/opt/airflow/dags',
    catchup=False,
    user_defined_macros={"varsToPass": Var1}
) as dag:

    query_one = BigQueryOperator(
        task_id='query_one',
        sql='/sql/something.sql',
        use_legacy_sql=False,
        destination_dataset_table='table',
        write_disposition='WRITE_TRUNCATE'
    )
SQL file
SELECT * FROM table WHERE date = {{CHANGING_DATE}}
Your code is confusing: you describe a repeated pattern of today, today - 1 day, ..., today - 21 days, yet your code shows write_disposition = 'WRITE_TRUNCATE', which means only the LAST query matters because each query erases the result of the previous one. Since no more information was provided, I assume you actually mean to run a single query covering the range from today back to today - 21 days.
Also, you didn't mention whether the date you are referring to is the Airflow execution_date or today's date.
If it's execution_date you don't need to pass any parameters. the SQL needs to be:
SELECT * FROM table WHERE date BETWEEN {{ execution_date - macros.timedelta(days=21) }}
AND {{ execution_date }}
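If you want plain date strings instead of full timestamps, the same window can be written with the ds macro and macros.ds_add (a sketch; whether you need the quotes depends on your column type):
SELECT * FROM table WHERE date BETWEEN '{{ macros.ds_add(ds, -21) }}' AND '{{ ds }}'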
If it's today then you need to pass parameter with params:
from datetime import datetime, timedelta

query_one = BigQueryOperator(
    task_id='query_one',
    sql='/sql/something.sql',
    use_legacy_sql=False,
    destination_dataset_table='table',
    write_disposition='WRITE_TRUNCATE',
    params={
        "end": datetime.utcnow().strftime('%Y-%m-%d'),
        "start": (datetime.utcnow() - timedelta(days=21)).strftime('%Y-%m-%d')
    }
)
Then in the SQL you can use it as:
SELECT * FROM table WHERE date BETWEEN '{{ params.start }}' AND
'{{ params.end }}'
I'd like to point out that if you are not using execution_date, then I don't see the value of passing the date from Airflow at all. You can do it directly in BigQuery by setting the query to:
SELECT *
FROM table
WHERE date BETWEEN DATE_SUB(current_date(), INTERVAL 21 DAY) AND current_date()
If my assumption was incorrect and you want to run 21 queries then you can do that with a loop as you described:
from datetime import datetime, timedelta
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

a = []
for i in range(0, 21):
    a.append(
        BigQueryOperator(
            task_id=f'query_{i}',
            sql='/sql/something.sql',
            use_legacy_sql=False,
            destination_dataset_table='table',
            write_disposition='WRITE_TRUNCATE',  # This is probably wrong, I just copied it from your code.
            params={
                "date_value": (datetime.now() - timedelta(days=i)).strftime('%Y-%m-%d')
            }
        )
    )
    if i != 0:
        a[i - 1] >> a[i]
Then in your /sql/something.sql the query should be:
SELECT * FROM table WHERE date = '{{ params.date_value }}'
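For example, for i = 0 on 2021-07-14, Jinja renders that file as:
SELECT * FROM table WHERE date = '2021-07-14'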
As mentioned, this will create a sequential workflow (query_0 >> query_1 >> ... >> query_20).
Note also that BigQueryOperator is deprecated. You should use BigQueryExecuteQueryOperator, which is available in the Google provider via
from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator
For more information about how to install the Google provider, please see the 2nd part of the following answer.
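For reference, the provider is typically installed with pip (assuming Airflow 2.x):
pip install apache-airflow-providers-google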

Subtracting dates in Pandas

Trying to create a new column in a dataframe that shows the number of days between now and a past date. So far I have the code below, but it returns 'days' plus a time-of-day component. How can I get just the number of days?
import datetime
import pytz

now = datetime.datetime.now(pytz.utc)
excel1['days_old'] = now - excel1['Start Time']
Returns:
92 days 08:08:06.667518
excel1['days_old'] will hold timedeltas. To reduce them to whole-day differences, use the .dt.days accessor, like this:
import datetime
import pytz

now = datetime.datetime.now(pytz.utc)
excel1['days_timedelta'] = now - excel1['Start Time']
excel1['days_old'] = excel1['days_timedelta'].dt.days
Assuming that the Start Time column is of datetime type, run:
(pd.Timestamp.now() - df['Start Time']).dt.days
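One caveat worth noting: subtracting a timezone-aware value from naive datetimes (or vice versa) raises a TypeError in pandas, so match the two sides; a minimal sketch for a tz-aware Start Time column:
import pandas as pd

# use a tz-aware "now" when 'Start Time' is tz-aware (and a naive one when it is naive)
(pd.Timestamp.now(tz='UTC') - df['Start Time']).dt.days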
This also worked for me:
import datetime
import pytz

now = datetime.datetime.now(pytz.utc)
# note: astype('timedelta64[D]') is disallowed in pandas 2.x; use .dt.days there
excel1['days_old'] = (now - excel1['Start Time']).astype('timedelta64[D]')

How to parse Unix timestamp to date string in Kotlin

How can I parse a Unix timestamp to a date string in Kotlin?
For example 1532358895 to 2018-07-23T15:14:55Z
The following should work. It just uses the Java libraries for handling this:
val sdf = java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'")
sdf.timeZone = java.util.TimeZone.getTimeZone("UTC")  // the literal 'Z' is only correct for UTC
val date = java.util.Date(1532358895 * 1000L)  // multiply by a Long to avoid Int overflow
sdf.format(date)
Or with the new Time API:
java.time.format.DateTimeFormatter.ISO_INSTANT
.format(java.time.Instant.ofEpochSecond(1532358895))
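If you need the string in a specific zone rather than UTC, java.time handles that as well (a sketch; the zone and pattern here are illustrative):
import java.time.Instant
import java.time.ZoneId
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX")
    .withZone(ZoneId.of("Europe/Berlin"))  // illustrative zone
println(formatter.format(Instant.ofEpochSecond(1532358895)))  // 2018-07-23T17:14:55+02:00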

Hive FROM_UNIXTIME() with milliseconds

I have seen enough posts where we divide by 1000, or cast, to convert from milliseconds epoch time to a timestamp. I would like to know how we can retain the milliseconds piece in the timestamp too.
1440478800123 - the last 3 digits are milliseconds. How do I convert this to something like YYYYMMDDHHMMSS.sss?
I need to capture the millisecond portion in the converted timestamp as well.
Thanks
select cast(epoch_ms as timestamp)
actually works, because when casting to a timestamp (as opposed to using from_unixtime()), Hive seems to assume an int or bigint is milliseconds, while a floating point type is treated as seconds. That is undocumented as far as I can see, and possibly a bug. I wanted a string which includes the timezone (which can be important, particularly if the server changes to summer/daylight saving time), and I wanted to be explicit about the conversion in case the cast behavior changes. So this gives an ISO-8601-style date (adjust the format string as needed for another format):
select from_unixtime(
    floor(epoch_ms / 1000)
  , printf('yyyy-MM-dd HH:mm:ss.%03dZ', epoch_ms % 1000)
)
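A worked sketch with the value from the question; note that from_unixtime renders in the session's local timezone, and the unquoted Z in the pattern is SimpleDateFormat's timezone letter, so it prints a numeric offset:
select from_unixtime(
    floor(1440478800123 / 1000)
  , printf('yyyy-MM-dd HH:mm:ss.%03dZ', 1440478800123 % 1000)
);
-- e.g. 2015-08-25 05:00:00.123+0000 on a server running in UTC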
Create a Hive UDF in Java:
package com.kishore.hiveudf;

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

@UDFType(stateful = true)
public class TimestampToDateUDF extends UDF {
    public String evaluate(long timestamp) {
        // epoch milliseconds -> formatted string in the server's local timezone;
        // note the corrected pattern: yyyy (year, not week-year YYYY), dd (day of
        // month, not day-of-year DD), and '.' before the SSS milliseconds
        Date date = new Date(timestamp);
        DateFormat formatter = new SimpleDateFormat("yyyyMMddHHmmss.SSS");
        return formatter.format(date);
    }
}
Export it as TimestampToDateUDF.jar, then register it:
hive> ADD JAR /home/kishore/TimestampToDateUDF.jar;
hive> CREATE TEMPORARY FUNCTION toDate AS 'com.kishore.hiveudf.TimestampToDateUDF';
output
select * from tableA;
OK
1440753288123
Time taken: 0.071 seconds, Fetched: 1 row(s)
hive> select toDate(timestamp) from tableA;
OK
20150828091448.123
Time taken: 0.08 seconds, Fetched: 1 row(s)