How to generate a unique week_id using weekofyear in Hive

I have a table in which I'm just iterating over dates spanning 50 years.
Using the value of weekofyear("date") -> week_no_in_this_year.
I would like to create a column from (week_no_in_this_year) that is unique for a week; name it -> week_id.
It should be the concatenation of Year + two_digit_week_no_in_this_year + Some_number (to make this id unique for one week). I tried the following:
concat(concat(YEAR, IF(week_no_in_this_year < 10,
       concat(0, week_no_in_this_year), week_no_in_this_year)), '2') AS week_id
But I'm facing an issue for a few dates, in the scenarios below:
SELECT weekofyear("2019-01-01") ;
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2019012
SELECT weekofyear("2019-12-31");
SELECT concat(concat("2019",IF(1<10, concat(0,1),1)),'2') AS week_id;
Expected Result: 2020012

One way to do it is with a UDF. Create a Python script and push it to HDFS:
mypy.py
import sys
import datetime

# Read dates (Y-M-D) from stdin and print the ISO year followed by the ISO week number.
for line in sys.stdin:
    line = line.strip()
    (y, m, d) = line.split("-")
    iso = datetime.date(int(y), int(m), int(d)).isocalendar()
    print(str(iso[0]) + str(iso[1]))
In Hive
add file hdfs:/user/cloudera/mypy.py;
select transform("2019-1-1") using "python mypy.py" as (week_id);
INFO : OK
+----------+--+
| week_id |
+----------+--+
| 20191 |
+----------+--+
select transform("2019-12-30") using "python mypy.py" as (week_id)
+----------+--+
| week_id |
+----------+--+
| 20201 |
+----------+--+
1 row selected (33.413 seconds)
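The same script can also be applied to a real date column instead of a literal; a quick sketch (tbl and dte here are placeholder names, not from your schema):
-- Hedged sketch: run the transform over a date column; tbl/dte are placeholders.
ADD FILE hdfs:/user/cloudera/mypy.py;
SELECT TRANSFORM(dte) USING 'python mypy.py' AS (week_id)
FROM tbl;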

This scenario only happens when there is a split between years at the very end of a year (that is, Dec 31) and the week number rolls over into the next year. If we add a condition for this case, we get what you expect.
The equivalent of a RIGHT() function is SUBSTR(str, -n).
SELECT DTE AS Date,
       CONCAT(IF(MONTH(DTE) = 12 AND WEEKOFYEAR(DTE) = 1, YEAR(DTE) + 1, YEAR(DTE)),
              SUBSTR(CONCAT('0', WEEKOFYEAR(DTE)), -2),
              '2') AS weekid
FROM tbl;
Result:
Date WeekId
2019-01-01 2019012
2019-11-01 2019442
2019-12-31 2020012
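A sketch of the same logic with LPAD() doing the zero padding instead of the CONCAT/SUBSTR trick (same DTE/tbl names as above; untested against your data, so treat it as a suggestion):
-- Hedged alternative: LPAD() handles the two-digit week padding.
SELECT DTE AS Date,
       CONCAT(IF(MONTH(DTE) = 12 AND WEEKOFYEAR(DTE) = 1, YEAR(DTE) + 1, YEAR(DTE)),
              LPAD(CAST(WEEKOFYEAR(DTE) AS STRING), 2, '0'),
              '2') AS weekid
FROM tbl;
If January dates that fall into the last ISO week of the previous year matter too (e.g. weekofyear("2016-01-01") returns 53), a symmetric IF(MONTH(DTE) = 1 AND WEEKOFYEAR(DTE) >= 52, YEAR(DTE) - 1, ...) branch would be needed as well.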

Related

SQL Having/Where clause to compare MAX from current/another table

I have a table that has date information and is being copied to another table, and I'm trying to perform an incremental load.
date = date format
hour = int
| person | date       | hour |
|--------|------------|------|
| bob    | 2023-01-01 | 1    |
| bill   | 2023-01-02 | 2    |
select * into test.person_copy from
(select * from original.person)
My thought process for the incremental load is to check the max(date) & max(hour) from the original table against the copied table, to identify the gap between the max values in the two tables. However, I'm not entirely sure how to implement the logic, as it doesn't seem straightforward with a WHERE clause. A HAVING clause might make more sense, but also doesn't seem correct:
select * into test.person_copy from
(select * from original.person org
Having max(org.date, org.hour) > (select max(copy.date,copy.hour) from test.person_copy copy)
)
The other variation I had in mind was to use HAVING NOT IN
Having max(org.date, org.hour) NOT IN (select max(copy.date,copy.hour) from test.person_copy copy)
I wasn't sure if the logic is correct. The hour field is of importance, but I can live with just the date fields.
The expected output would be that the logic checks against the existing max(date) and only inserts rows that don't already exist. Example below: the 2023-01-03 row.
| person | date | hour |
|--------|------------|------|
| bob | 2023-01-01 | 1 |
| bill | 2023-01-02 | 2 |
| test | 2023-01-03 | 2 |
I don't have access to a Redshift environment, but the following query should work:
select *
into test.person_copy
from original.person org
where dateadd(hrs, org.hour, org.date) >
(select max(dateadd(hrs, cpy.hour, cpy.date))
from test.person_copy cpy
)
This assumes that when the previous copy was made, the entire set of source rows for that date & hour was copied (so the new incremental load will have all rows for the dates & hours not already copied). It also means you need additional criteria in the SELECT to make sure you include only completed date-hours (i.e., make sure you don't include the rows with hour = 10 while the time is still 10:30).
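Since SELECT ... INTO creates a new table in Redshift, and test.person_copy already has to exist for the subquery to work, the incremental step would more likely be an INSERT ... SELECT; a sketch of that variant (my assumption, same comparison as above):
-- Hedged sketch: incremental insert into the existing copy table.
insert into test.person_copy
select *
from original.person org
where dateadd(hrs, org.hour, org.date) >
      (select max(dateadd(hrs, cpy.hour, cpy.date))
       from test.person_copy cpy);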

How to subtract data from a column in sql

Good morning. I would appreciate your help writing a query in SQL or PostgreSQL. I have a column with the following data:
" | Old 0.00 New 50.2 F20190429093143 | Old 50.20 New 50.2 F20191111151118 | Old 50.20 New 50.2 F20191202110735 | Old 50.20 New 53.2 F20201124173459 | Old 53.2 New 158.63 F20201125093143"
What I want to get are the last 3 updates, that is, to keep only the last 3 " | Old ..." entries and end up with something like this:
| Old 50.20 New 50.2 F20191202110735 | Old 50.20 New 53.2 F20201124173459 | Old 53.2 New 158.63 F20201125093143
I can't think of how to write the query to get that value back.
Assuming you also have some unique key/column in your table, the following hack would achieve what you want:
with entries as (
  select t.id,
         x.entry,
         regexp_split_to_array(x.entry, '\s+') as items
  from horribly_designed_table t
    cross join regexp_split_to_table(t.line, '\s{0,1}\|\s*') as x(entry)
  where nullif(trim(x.entry), '') is not null
), numbered as (
  select id,
         entry,
         to_timestamp(items[5], '"F"yyyymmddhh24miss') as ts,
         row_number() over (partition by id order by to_timestamp(items[5], '"F"yyyymmddhh24miss') desc) as rn
  from entries
)
select id, string_agg(entry, ' | ' order by ts) filter (where rn <= 3) as new_line
from numbered
group by id;
The first CTE splits the line into multiple rows, one for each "entry". It also creates an array of the items for each entry.
The second CTE then uses a window function to calculate a rank based on the last "item" (converted to a timestamp).
And finally, the last 3 entries for each ID are aggregated back into a single line.
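To see what the first CTE produces, here is a minimal sketch run against just a fragment of the sample line (no table needed; same regular expressions as above):
-- Hedged sketch: each " | Old ... F..." fragment becomes one row,
-- and items[5] is the F-timestamp used for ordering.
select x.entry,
       regexp_split_to_array(x.entry, '\s+') as items
from regexp_split_to_table(
       ' | Old 50.20 New 53.2 F20201124173459 | Old 53.2 New 158.63 F20201125093143',
       '\s{0,1}\|\s*') as x(entry)
where nullif(trim(x.entry), '') is not null;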

max function in R not giving desired Date results

I am trying to use the max function to get the records with the latest dates, but it does not give the desired results, as it also returns rows with old dates. Below is the output of the dataframe OA_Output:
Org_ID   ORG_OFFICIAL_NORM_NAME   ORG_IMMEDIATE_PARENT_SRC   ORG_IP_SRC_MD_Date
-------  -----------------------  -------------------------  ------------------
132693   BOLLE INCORPORATED       abc.com                    26-JUN-18
122789   BEE STINGER, LLC         aa.com                     12-Mar-18
344567   CALIBER COMPANY          xyz.com                    16-Feb-16
639876   Maruti                   yy.com                     23-Jun-17
I am running the R code below to get the records with the latest dates:
gautam1 <-
sqldf(
"
SELECT ORG_OFFICIAL_NORM_NAME,ORG_IMMEDIATE_PARENT_SRC
,MAX(ORG_IP_SRC_MD_DATE),ROW_ID
FROM OA_output
where ROW_ID = 1
and ORG_IMMEDIATE_PARENT_SRC like '%exhibit21%'
GROUP BY ORG_IMMEDIATE_PARENT_ORGID
ORDER BY ORG_IMMEDIATE_PARENT_ORGID
" )
In the above code, the max function is not giving the desired results. I am not sure whether there is a date format issue between Oracle and R.
Any help will be appreciated.
Thanks,
Gautam
Can you please use the query below as the SQL and run the code again:
SELECT ORG_OFFICIAL_NORM_NAME,ORG_IMMEDIATE_PARENT_SRC
,MAX(ORG_IP_SRC_MD_DATE),ROW_ID
FROM OA_output
where ROW_ID = 1
and ORG_IMMEDIATE_PARENT_SRC like '%exhibit21%'
GROUP BY ORG_OFFICIAL_NORM_NAME,ORG_IMMEDIATE_PARENT_SRC,ROW_ID
ORDER BY ORG_IMMEDIATE_PARENT_ORGID

SQL dynamic column name

How do I declare a column name that changes?
I take some data from a DB and I am interested in the last 12 months, so I only take events that happened in, let's say, '2016-07', '2016-06' and so on...
Then, I want my table to look like this:
event type | 2016-07 | 2016-06
-------------------------------
A | 12 | 13
B | 21 | 44
C | 98 | 12
How can I achieve the effect where the columns are named using the previous YYYY-MM values, keeping in mind that the report with this query can be executed at any time, so the column names would change?
Simplified query, only for the previous month:
select distinct
count(event),
date_year_month,
event_name
from
data_base
where date_year_month = TO_CHAR(add_months(current_date, -1),'YYYY-MM')
group by event_name, date_year_month
I don't think there is an automated way of pivoting the year-month columns and changing the number of columns in the result dynamically based on the data.
However, if you are looking for a pivoting solution, you can accomplish it using table functions in Netezza.
select event_name, year_month, event_count
from event_counts_groupby_year_month,
     table(inza.inza.nzlua('
       local rows={}
       function processRow(y2016m06, y2016m07)
         rows[1] = { 201606, y2016m06 }
         rows[2] = { 201607, y2016m07 }
         return rows
       end
       function getShape()
         columns={}
         columns[1] = { "year_month", integer }
         columns[2] = { "event_count", double }
         return columns
       end',
     y2016m06, y2016m07));
You could probably build a wrapper around this to dynamically generate the query, based on the year-months present in the table, using a shell script.
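If a fixed, known set of months is acceptable, a plain conditional-aggregation pivot also works (a sketch using the table and column names from your query; the month literals are hard-coded and would still have to be generated by whatever wraps the query, since SQL cannot name columns dynamically):
-- Hedged sketch: static pivot via conditional aggregation over data_base.
select event_name,
       count(case when date_year_month = '2016-07' then event end) as "2016-07",
       count(case when date_year_month = '2016-06' then event end) as "2016-06"
from data_base
where date_year_month in ('2016-07', '2016-06')
group by event_name;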

Creating custom event schedules. Should I use "LIKE"?

I'm creating a campaign event scheduler that allows for frequencies such as "Every Monday", "May 6th through 10th", "Every day except Sunday", etc.
I've come up with a solution that I believe will work fine (not yet implemented), however, it uses "LIKE" in the queries, which I've never been too fond of. If anyone else has a suggestion that can achieve the same result with a cleaner method, please suggest it!
+----------------------+
| Campaign Table |
+----------------------+
| id:int |
| event_id:foreign_key |
| start_at:datetime |
| end_at:datetime |
+----------------------+
+-----------------------------+
| Event Table |
+-----------------------------+
| id:int |
| valid_days_of_week:string | < * = ALL. 345 = Tue, Wed, Thur. etc.
| valid_weeks_of_month:string | < * = ALL. 25 = 2nd and 5th weeks of a month.
| valid_day_numbers:string | < * = ALL. L = last. 2,7,17,29 = 2nd day, 7th, 17th, 29th,. etc.
+-----------------------------+
A sample event schedule would look like this:
valid_days_of_week = '1357' (Sun, Tue, Thu, Sat)
valid_weeks_of_month = '*' (All weeks)
valid_day_numbers = ',1,2,5,6,8,9,25,30,'
Using today's date (6/25/15) as an example, we have the following information to query with:
Day of week: 5 (Thursday)
Week of month: 4 (4th week in June)
Day number: 25
Therefore, to fetch all of the events for today, the query would look something like this:
SELECT c.*
FROM campaigns AS c
LEFT JOIN events AS e
  ON c.event_id = e.id
WHERE
      ( e.valid_days_of_week = '*' OR e.valid_days_of_week LIKE '%5%' )
  AND ( e.valid_weeks_of_month = '*' OR e.valid_weeks_of_month LIKE '%4%' )
  AND ( e.valid_day_numbers = '*' OR e.valid_day_numbers LIKE '%,25,%' )
That (untested) query would ideally return the example event above. The "LIKE" queries are what have me worried. I want these queries to be fast.
By the way, I'm using PostgreSQL
Looking forward to excellent replies!
Use arrays:
CREATE TABLE events (id INT NOT NULL, dow INT[], wom INT[], dn INT[]);
CREATE INDEX ix_events_dow ON events USING GIST(dow);
CREATE INDEX ix_events_wom ON events USING GIST(wom);
CREATE INDEX ix_events_dn ON events USING GIST(dn);
INSERT
INTO events
VALUES (1, '{1,3,5,7}', '{0}', '{1,2,5,6,8,9,25,30}'); -- 0 means any
Then query:
SELECT *
FROM events
WHERE dow && '{0, 5}'::INT[]
  AND wom && '{0, 4}'::INT[]
  AND dn && '{0, 25}'::INT[]
This will allow using the indexes to filter the data.
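A sketch of deriving today's probe values in SQL instead of hard-coding them (assumptions on my part: the question's 1 = Sunday day-of-week numbering, so Postgres EXTRACT(DOW), which returns 0 for Sunday, needs +1, and "week of month" = (day - 1) / 7 + 1):
-- Hedged usage sketch: 0 stays in each probe array so rows stored with {0} ("any")
-- always overlap.
SELECT *
FROM events
WHERE dow && ARRAY[0, EXTRACT(DOW FROM CURRENT_DATE)::int + 1]
  AND wom && ARRAY[0, (EXTRACT(DAY FROM CURRENT_DATE)::int - 1) / 7 + 1]
  AND dn  && ARRAY[0, EXTRACT(DAY FROM CURRENT_DATE)::int];
One caveat worth checking: as far as I know, a GiST index on a plain integer[] column needs the intarray extension's operator class, whereas GIN indexes on arrays work out of the box.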