Create Temporary Function in Hive

I have a query like this:
CREATE TEMPORARY MACRO AAA (input VARCHAR(16))
input ||
SUBSTR(
  LPAD(
    (CASE WHEN SUBSTR(LPAD(input,16,0),15,1) = 1 THEN 2
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 2 THEN 4
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 3 THEN 6
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 4 THEN 8
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 5 THEN 10
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 6 THEN 1
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 7 THEN 3
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 8 THEN 5
          WHEN SUBSTR(LPAD(input,16,0),15,1) = 9 THEN 7
          ELSE 0
     END) +
    (CASE WHEN SUBSTR(LPAD(input,16,0),16,1) = 1 THEN 1
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 2 THEN 2
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 3 THEN 3
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 4 THEN 4
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 5 THEN 5
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 6 THEN 6
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 7 THEN 7
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 8 THEN 8
          WHEN SUBSTR(LPAD(input,16,0),16,1) = 9 THEN 9
          ELSE 0
     END)
  ,3,0), 3, 1);
However, I want to create a temporary function in Hive instead of a macro.
What needs to change, and what is the syntax for creating a temporary function?

A macro is defined directly in HiveQL, but a temporary function has to be backed by a class, so you will have to write your own UDF.
Reference: https://cwiki.apache.org/confluence/display/Hive/HivePlugins
Once you have a jar file containing the custom logic, you can create the function as shown below (tested with apache-hive-4.0.0-alpha-1):
CREATE TEMPORARY FUNCTION aa_fun as 'AAA' using jar 'hdfs://localhost:9000/user/hive/aaa_func.jar';
select aa_fun('fwfrgwre12a3') fun_res, AAA('fwfrgwre12a3') macro_res;
+----------------+----------------+
| fun_res | macro_res |
+----------------+----------------+
| fwfrgwre12a33 | fwfrgwre12a33 |
+----------------+----------------+
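
To make clear what the UDF has to compute, here is a rough Python port of the macro's logic. It is only an illustration of the arithmetic, not the UDF itself (which would be a Java class packaged in the jar); the function name `aaa` and the lookup table names are made up for this sketch, and it assumes LPAD truncates values longer than 16 characters and that non-digit characters fall through to ELSE 0.

```python
# Values from the first CASE expression (character 15 of the padded input).
POSITION_15 = {"1": 2, "2": 4, "3": 6, "4": 8, "5": 10,
               "6": 1, "7": 3, "8": 5, "9": 7}
# Values from the second CASE expression (character 16 of the padded input).
POSITION_16 = {str(d): d for d in range(1, 10)}


def aaa(value: str) -> str:
    """Append the check character the way the AAA macro does (illustrative port)."""
    # LPAD(input, 16, 0): left-pad with '0' to 16 characters (assumed to truncate if longer).
    padded = value.rjust(16, "0")[:16]
    # SUBSTR(..., 15, 1) and SUBSTR(..., 16, 1) are 1-based, so use indexes 14 and 15.
    total = POSITION_15.get(padded[14], 0) + POSITION_16.get(padded[15], 0)
    # LPAD(total, 3, 0) followed by SUBSTR(..., 3, 1) keeps the last digit of the sum.
    check = str(total).rjust(3, "0")[2]
    return value + check


print(aaa("fwfrgwre12a3"))
```

For the input used in the answer, this reproduces the same value as both `fun_res` and `macro_res`.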

Related

Find Max Gradient by Row in For Loop Pandas

I have a 15 x 4 DataFrame and I'm trying to compute the maximum gradient in a North (N) minus South (S) direction for each row, using the "S" and "N" min and max values in the rows below. I'm not sure this is the most Pythonic way to do it. My DataFrame "ms" looks like this:
minSlats minNlats maxSlats maxNlats
0 57839.4 54917.0 57962.6 56979.9
0 57763.2 55656.7 58120.0 57766.0
0 57905.2 54968.6 58014.3 57031.6
0 57796.0 54810.2 57969.0 56848.2
0 57820.5 55156.4 58019.5 57273.2
0 57542.7 54330.6 58057.6 56145.1
0 57829.8 54755.4 57978.8 56777.5
0 57796.0 54810.2 57969.0 56848.2
0 57639.4 54286.6 58087.6 56140.1
0 57653.3 56182.7 57996.5 57975.8
0 57665.1 56048.3 58069.7 58031.4
0 57559.9 57121.3 57890.8 58043.0
0 57689.7 55155.5 57959.4 56440.8
0 57649.4 56076.5 58043.0 58037.4
0 57603.9 56290.0 57959.8 57993.9
My loop structure looks like this:
J = len(ms)
grad = pd.DataFrame()
for i in range(J):
    if ms.maxSlats.iloc[i] > ms.maxNlats.iloc[i]:
        gr = ( ms.maxSlats.iloc[i] - ms.minNlats.iloc[i] ) * -1
        grad[gr] = [i+1, i]
    elif ms.maxNlats.iloc[i] > ms.maxSlats.iloc[i]:
        gr = ms.maxNlats.iloc[i] - ms.minSlats.iloc[i]
        grad[gr] = [i+1, i]
grad = grad.T # need to transpose
print(grad)
I obtain the correct answer but I'm wondering if there is a cleaner way to do this to obtain the same answer below:
grad.T
Out[317]:
0 1
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-3158.8 8 7
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
thank you,

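Use np.where to compute the gradient for every row at once, then keep only the last entry for each duplicated index value (in the loop, a repeated gradient value overwrites the earlier column):
import numpy as np
import pandas as pd

grad = np.where(ms.maxSlats > ms.maxNlats, (ms.maxSlats - ms.minNlats) * -1,
                ms.maxNlats - ms.minSlats)
df = pd.DataFrame({'A': pd.RangeIndex(1, len(ms)+1),
                   'B': pd.RangeIndex(0, len(ms))},
                  index=grad)
df = df[~df.index.duplicated(keep='last')]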
>>> df
A B
-3045.6 1 0
-2463.3 2 1
-3045.7 3 2
-2863.1 5 4
-3727.0 6 5
-3223.4 7 6
-3158.8 8 7
-3801.0 9 8
-1813.8 10 9
-2021.4 11 10
483.1 12 11
-2803.9 13 12
-1966.5 14 13
390.0 15 14
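
If the de-duplication is not actually needed, a possible variation is to keep the gradient as an ordinary column, so that rows with equal gradients cannot collide. The sketch below is self-contained and uses only the first three rows of the question's table as sample data; the helper name `add_gradient` and the column name `grad` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# First three rows of the question's table, used only as sample data.
ms = pd.DataFrame({
    "minSlats": [57839.4, 57763.2, 57905.2],
    "minNlats": [54917.0, 55656.7, 54968.6],
    "maxSlats": [57962.6, 58120.0, 58014.3],
    "maxNlats": [56979.9, 57766.0, 57031.6],
})


def add_gradient(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame with a 'grad' column (hypothetical helper)."""
    grad = np.where(
        frame["maxSlats"] > frame["maxNlats"],
        -(frame["maxSlats"] - frame["minNlats"]),  # south maximum is larger
        frame["maxNlats"] - frame["minSlats"],     # otherwise (ties also land here)
    )
    # Round to one decimal to hide floating-point noise from the subtraction.
    return frame.assign(grad=np.round(grad, 1))


result = add_gradient(ms)
print(result)
```

Unlike the loop, which skips rows where the two maxima are equal, np.where sends ties to the second branch.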

oracle query with groupby clause

I have an upload table as follows:
bulk_upload_hist
id file_name doc_blob upload_date upload_by
10 abc.pdf 12-APR-21 123
11 xyz.pdf 12-APR-21 123
The inventory history table stores the records from each file as follows:
inventory_doc_hist
id upload_id upload_status create_date create_by inv_doc_type
1 10 1 12-APR-21 123 20
2 10 1 12-APR-21 123 20
3 10 0 12-APR-21 123 10
4 10 1 12-APR-21 123 10
4 11 1 12-APR-21 123 20
5 11 0 12-APR-21 123 10
I want my output per bulk upload as follows:
id file_name successful/10 Successful/20 UnSuccessful upload_date upload_by
10 abc.pdf 2 1 1 12-APR-21 123
11 xyz.pdf 1 1 0 12-APR-21 123
What is the best way to do this?

I think you want a join and conditional aggregation:
-- assumes upload_status = 1 means successful and 0 means unsuccessful
select buh.id, buh.file_name,
       sum(case when ih.upload_status = 1 and ih.inv_doc_type = 10 then 1 else 0 end) as num_successful_10,
       sum(case when ih.upload_status = 1 and ih.inv_doc_type = 20 then 1 else 0 end) as num_successful_20,
       sum(case when ih.upload_status = 0 then 1 else 0 end) as num_unsuccessful,
       buh.upload_date, buh.upload_by
from bulk_upload_hist buh join
     inventory_doc_hist ih
     on ih.upload_id = buh.id
group by buh.id, buh.file_name, buh.upload_date, buh.upload_by;
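
As a quick way to experiment with the join plus conditional aggregation, here is a small sketch that loads the question's sample rows into an in-memory SQLite database (not Oracle) from Python. It assumes that upload_status = 1 means success and 0 means failure, and that inventory_doc_hist.upload_id refers to bulk_upload_hist.id; the column aliases are made up. Only the columns needed for the counts are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bulk_upload_hist (
    id INTEGER, file_name TEXT, upload_date TEXT, upload_by INTEGER);
CREATE TABLE inventory_doc_hist (
    id INTEGER, upload_id INTEGER, upload_status INTEGER, inv_doc_type INTEGER);
INSERT INTO bulk_upload_hist VALUES
    (10, 'abc.pdf', '12-APR-21', 123),
    (11, 'xyz.pdf', '12-APR-21', 123);
INSERT INTO inventory_doc_hist VALUES
    (1, 10, 1, 20), (2, 10, 1, 20), (3, 10, 0, 10), (4, 10, 1, 10),
    (4, 11, 1, 20), (5, 11, 0, 10);
""")

# One SUM(CASE ...) per output column: each CASE yields 1 for rows that
# should be counted in that column and 0 otherwise.
query = """
SELECT buh.id, buh.file_name,
       SUM(CASE WHEN ih.upload_status = 1 AND ih.inv_doc_type = 10 THEN 1 ELSE 0 END) AS num_successful_10,
       SUM(CASE WHEN ih.upload_status = 1 AND ih.inv_doc_type = 20 THEN 1 ELSE 0 END) AS num_successful_20,
       SUM(CASE WHEN ih.upload_status = 0 THEN 1 ELSE 0 END) AS num_unsuccessful,
       buh.upload_date, buh.upload_by
FROM bulk_upload_hist buh
JOIN inventory_doc_hist ih ON ih.upload_id = buh.id
GROUP BY buh.id, buh.file_name, buh.upload_date, buh.upload_by
ORDER BY buh.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The counts follow directly from the assumptions above, so they may not match the hand-written expected table in the question if the status codes mean something else.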

Append new column to DF after sum?

I have a sample dataframe below:
sn C1-1 C1-2 C1-3 H2-1 H2-2 K3-1 K3-2
1 4 3 5 4 1 4 2
2 2 2 0 2 0 1 2
3 1 2 0 0 2 1 2
I would like to sum the columns by prefix (C1, H2, K3) and output three new columns with the totals. The final result should be:
sn total_c1 total_h2 total_k3
1 12 5 6
2 4 2 3
3 3 2 3
What I have tried on my original df:
lst = ["C1", "H2", "K3"]
lst2 = ["total_c1", "total_h2", "total_k3"]
for k in lst:
    idx = df.columns.str.startswith(k)
    for j in lst2:
        df[j] = df.iloc[:,idx].sum(axis=1)
df1 = df.append(df, sort=False)
But I kept getting error
IndexError: Item wrong length 35 instead of 36.
I can't figure out how to append the new total columns in the loop to produce my end result.
Any help will be appreciated (or a better suggestion as opposed to a loop). Thank you.

You can use groupby:
# columns of interest
cols = df.columns[1:]
col_groups = cols.str.split('-').str[0]
out_df = df[['sn']].join(df[cols].groupby(col_groups, axis=1)
.sum()
.add_prefix('total_')
)
Output:
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3

Let us try splitting the column names, then grouping by the result with axis=1:
out = df.groupby(df.columns.str.split('-').str[0],axis=1).sum().set_index('sn').add_prefix('Total_').reset_index()
Out[84]:
sn Total_C1 Total_H2 Total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3

Another option, where we create a dictionary to groupby the columns:
mapping = {entry: f"total_{entry[:2]}" for entry in df.columns[1:]}
result = df.groupby(mapping, axis=1).sum()
result.insert(0, "sn", df.sn)
result
sn total_C1 total_H2 total_K3
0 1 12 5 6
1 2 4 2 3
2 3 3 2 3
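
The answers above all pass axis=1 to groupby, which recent pandas releases deprecate (since 2.1, as far as I know). A sketch of the same idea that transposes instead, so the grouping happens along the rows, might look like this; the sample frame is rebuilt from the question, and the variable names `wide` and `totals` are just for this example.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "sn": [1, 2, 3],
    "C1-1": [4, 2, 1], "C1-2": [3, 2, 2], "C1-3": [5, 0, 0],
    "H2-1": [4, 2, 0], "H2-2": [1, 0, 2],
    "K3-1": [4, 1, 1], "K3-2": [2, 2, 2],
})

wide = df.set_index("sn")
totals = (
    wide.T                                        # column names become the row index
    .groupby(lambda name: name.split("-")[0])     # group rows by prefix: C1, H2, K3
    .sum()
    .T                                            # back to one row per sn
    .add_prefix("total_")
    .reset_index()
)
print(totals)
```

Transposing copies the data, so for very wide or very large frames it may be worth measuring before adopting this.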

Code if then statement by only using $ utility

How can I code these 'if' conditions in GAMS?
Set j/1*10/
S/1*6/;
Parameter
b(s,j) export this from excel
U(s,j) export from excel
M(s)/1 100,2 250,3 140,4 120,5 132/ export from excel
;
table b(s,j)
1 2 3 4 5 6 7 8 9 10
1 3 40 23 12 9 52 9 14 89 33
2 0 0 42 0 11 32 11 15 3 7
3 10 20 12 9 5 30 14 5 14 5
4 0 0 0 9 0 3 8 0 13 5
5 0 10 11 32 11 0 3 1 12 1
6 12 20 2 9 15 3 14 5 14 5
;
u(s,j)=0;
u(s,j)$(b(s,j))=1;
Variable delta(j); "binary"
After solving a model I got the value of delta ( suppose delta(1)=1, delta(5)=1). Then Set A is
A(j)$(delta.l(j)=1)=Yes; (A={1,5})
I want to calculate parameter R(s) according to the following :
If there is no j in A(j) s.t. j in u(s,j) then R(s)=M(s)
Else if there is a j in A(j) s.t. j in u(s,j) then R(s)=min{b(s,j): j in A(j) , j in u(s,j) }
Then R(1)=3, R(2)=11,R(3)=5, R(4)=120, R(5)=11,R(6)=12.
Is it possible to code this ' if then ' statement only by $ utility?
Thanks

Following on from the comments, I think this should work for you.
(Create a parameter that mimics your variable delta just for demonstration:)
parameter delta(j);
delta('1') = 1;
delta('5') = 1;
With loop and if/else:
Create parameter R(s). Then, looping over s, check whether the sum of b(s,A) over set A is non-zero (i.e. at least one element of A has a non-zero entry). If so, take the minimum of b(s,A) over the elements of A where b(s,A) is non-zero; otherwise, set R(s) equal to M(s).
Note that the loop is one solution to the issue you were having with mixed dimensions. Also, the $(b(s,A)) condition needs to be on the first argument of smin(.), not on the second.
parameter R(s);
loop(s,
if (sum(A, b(s,A)) ne 0,
R(s) = smin(A$b(s,A), b(s,A));
else
R(s) = M(s);
);
);
With the $ operator only (credit to @Lutz in the comments):
R(s)$(sum(A, b(s,A)) <> 0) = smin(A$b(s,A), b(s,A));
R(s)$(sum(A, b(s,A)) = 0) = M(s);
Gives:
---- 56 PARAMETER R
1 3.000, 2 11.000, 3 5.000, 4 120.000, 5 11.000, 6 12.000
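
As a cross-check of the rule outside GAMS, the same calculation can be sketched in plain Python using the table from the question. This is only a sanity check on the logic, not GAMS code; the function name `compute_r` is made up, and a missing M(s) entry is treated as 0, since GAMS treats unassigned parameter entries as zero.

```python
# b(s, j) from the question's table; list position j-1 holds column j.
b = {
    1: [3, 40, 23, 12, 9, 52, 9, 14, 89, 33],
    2: [0, 0, 42, 0, 11, 32, 11, 15, 3, 7],
    3: [10, 20, 12, 9, 5, 30, 14, 5, 14, 5],
    4: [0, 0, 0, 9, 0, 3, 8, 0, 13, 5],
    5: [0, 10, 11, 32, 11, 0, 3, 1, 12, 1],
    6: [12, 20, 2, 9, 15, 3, 14, 5, 14, 5],
}
M = {1: 100, 2: 250, 3: 140, 4: 120, 5: 132}
A = [1, 5]  # elements j with delta.l(j) = 1


def compute_r(b, M, A):
    """R(s) = min of the non-zero b(s, j) for j in A, else M(s)."""
    R = {}
    for s, row in b.items():
        # Non-zero entries play the role of u(s, j) = 1 in the question.
        candidates = [row[j - 1] for j in A if row[j - 1] != 0]
        R[s] = min(candidates) if candidates else M.get(s, 0)
    return R


R = compute_r(b, M, A)
print(R)
```

With A = {1, 5} this gives the same values as the GAMS output above.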

update table with multiple conditions

I have table as follows:
SYS_ID SUB_NET_ID NODE_NAME NODE_ID NODE_EQ_NO NODE_VAR_NO TEMP_ID EQUIP_TYPE EQ_ID VAR_ID VAR_OBJECT VAR_NAME VAR_SUBSET VAR_SET CALC_VAR_TYPE DATA_TYPE DOF
15 1 BLEND 1 13 21 16 5 0 BLEND DEMAND SELF BLEND_OUT VAR CONTINOUS
15 1 BLEND 1 14 6 16 6 0 BLEND DEMAND BLEND BLEND VAR CONTINOUS
15 1 DEST 2 5 2 4 7 0 DEST DEMAND SELF DEST_IN VAR CONTINOUS
15 1 DEST 2 1 3 4 1 0 DEST DEMAND UNDEF DEST_IN VAR CONTINOUS
15 1 DEST 2 4 6 4 4 0 DEST MFLOW SELF DEST_IN VAR CONTINOUS
15 1 SALK 5 6 5 13 4 0 SALK MFLOW SELF SALK_OUT VAR binary
15 1 SPEN 7 8 4 13 6 0 SPEN MFLOW SELF SPEN_OUT VAR integer
I want to update the column data_type to 1 where data_type is 'continous', to 0 where it is 'binary', and so on. Any suggestions?

Use a CASE expression for that:
UPDATE my_tbl SET data_type =
  CASE LOWER(data_type) -- the sample data mixes upper and lower case
    WHEN 'continous' THEN '1'
    WHEN 'binary' THEN '0'
    -- more options
    ELSE data_type -- to retain original string if no substitute is listed
  END;
Note that the column's type does not change: it stays whatever string type it was before, so the values are stored as the strings '1' and '0', not as numbers.
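
To try this out without touching a real table, here is a small sketch that runs a CASE-based UPDATE in an in-memory SQLite database from Python. SQLite stands in for whatever database is actually used, the table is reduced to two columns, and the mapping ('continous' to '1', 'binary' to '0') follows the question; LOWER() is an assumption added because the sample data mixes upper and lower case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_tbl (node_name TEXT, data_type TEXT)")
conn.executemany(
    "INSERT INTO my_tbl VALUES (?, ?)",
    [("BLEND", "CONTINOUS"), ("SALK", "binary"), ("SPEN", "integer")],
)

# Values without a listed substitute keep their original string (ELSE branch).
conn.execute("""
UPDATE my_tbl SET data_type =
  CASE LOWER(data_type)
    WHEN 'continous' THEN '1'
    WHEN 'binary' THEN '0'
    ELSE data_type
  END
""")

rows = conn.execute(
    "SELECT node_name, data_type, typeof(data_type) FROM my_tbl ORDER BY node_name"
).fetchall()
for row in rows:
    print(row)
```

The typeof() column shows that the updated values are still text; converting the column to a numeric type would be a separate step.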
