I have the same issue mentioned here.
However, the problem is on a Hive database. When I try the solution on my table, which looks like this:
Id  Date        Column1  Column2
1   01/01/2011  5        5    => same as Column1
2   02/01/2011  2        18   => (1 + (Column2 from the previous row)) * (1 + (Column1 from the current row)), i.e. (1+5)*(1+2)
3   03/01/2011  3        76   => (1+18)*(1+3) = 19*4
I get the error
FAILED: SemanticException Recursive cte cteCalculation detected (cycle: ctecalculation -> cteCalculation).
What workaround is possible in this case?
You will have to write a UDF for this.
Below you can see a very (!!) simplified UDF for what you need.
The idea is to store the value from the previous execution in a variable inside the UDF and each time return (stored_value+1)*(current_value+1) and then store it for the next line.
You need to take care of the first value separately, so there is a special case for that.
Also, you have to pass the data to the function already ordered, as it simply goes line by line without considering any order.
You have to add your jar and create a function; let's call it cum_mul.
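For reference, the registration usually looks something like this (the jar path is a placeholder, and the class name assumes the default package, as in the code below):
-- hypothetical path to the compiled UDF jar
ADD JAR /path/to/cum_mul.jar;
CREATE TEMPORARY FUNCTION cum_mul AS 'cum_mul';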
The SQL will be:
select id, date, column1, cum_mul(column1) as column2
from (select id, date, column1 from myTable order by id) a
The code for the UDF:
import org.apache.hadoop.hive.ql.exec.UDF;

public class cum_mul extends UDF {

    private int prevValue;
    private boolean first = true;

    public int evaluate(int value) {
        if (first) {
            // first row: Column2 equals Column1
            this.prevValue = value;
            first = false;
            return value;
        } else {
            // (1 + previous Column2) * (1 + current Column1)
            this.prevValue = (this.prevValue + 1) * (value + 1);
            return this.prevValue;
        }
    }
}
A Hive common table expression (CTE) works as a query-level temp table (syntactic sugar) that is accessible within the whole SQL statement.
Recursive queries are not supported because they introduce multiple stages with massive I/O, which is something the underlying execution and storage engines are not good at. In fact, Hive strictly prohibits recursive references for CTEs and views, hence the error you got.
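For contrast, a non-recursive CTE like the one below is perfectly legal in Hive, since the CTE is referenced only by the outer query and never by itself (table and column names are illustrative):
-- fine: no self-reference inside the CTE
WITH totals AS (
  SELECT id, SUM(column1) AS total
  FROM myTable
  GROUP BY id
)
SELECT * FROM totals;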
My ELT tool imports my data into BigQuery and automatically generates/extends the schema for dynamic nested keys (in the schema below, under properties).
It looks like this:
How can I get the list of nested keys of a repeated record? For example, so that I can group by properties when those items have a given property non-null?
I have tried:
select column_name
from my_schema.INFORMATION_SCHEMA.COLUMNS
where table_name = 'my_table'
But it only lists first-level keys.
From the picture above, I want, as a first step, a SQL query that returns
message
user_id
seeker
liker_id
rateable_id
rateable_type
from_organization
likeable_type
company
existing_attempt
...
My real goal, though, is to group/count my data based on a non-null value of a second-level nested property, properties.filters.[filter_type].
The schema may evolve as our application adds more filters, so this needs to be dynamically generated; I can't just hard-code the list of nested keys.
Note: this is very similar to the question How to extract all the keys in a JSON object with BigQuery, but in my case my data is already in a schema and it's not a JSON object.
EDIT:
Suppose I have a list of such records with nested properties; how do I write a SQL query that adds a field "enabled_filters" which aggregates, for each item, the list of properties for which said property is not null?
Example input (properties.x are dynamic and not known by the programmer)
search_id  properties.filters.school  properties.filters.type
1          MIT                        master
2          Princetown                 null
3          null                       master
Example output
search_id  enabled_filters
1          ["school", "type"]
2          ["school"]
3          ["type"]
Have you looked at COLUMN_FIELD_PATHS? It should give you the paths for all columns.
select field_path from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS where table_name = '<table>'
https://cloud.google.com/bigquery/docs/information-schema-column-field-paths
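To narrow that down to the second-level keys from the question, a filter on field_path should work (the table name and path prefix here mirror the question and are assumptions):
select field_path
from my_schema.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
where table_name = 'my_table'
  and field_path like 'properties.filters.%'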
The field properties is not nested by arrays, only by structures, so a UDF in JavaScript that parses this field should work fast enough.
CREATE TEMP FUNCTION jsonObjectKeys(input STRING, shownull BOOL, fullname BOOL)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
// recursively collect keys; `old` carries the dotted prefix for full names
function test(input, old) {
  var out = [];
  for (let x in input) {
    let te = input[x];
    out = out.concat(
      te == null ? (shownull ? [x + '==null'] : [])
      : typeof te == 'object' ? test(te, old + x + '.')
      : [fullname ? old + x : x]
    );
  }
  return out;
}
return test(JSON.parse(input), "");
""";
with tbl as (
  select struct(1 as alpha, struct(2 as x, 3 as y, [1,2,3] as z) as B) A
  from unnest(generate_array(1, 10*1))
  union all
  select struct(null, struct(null, 1, [999]))
)
select *,
  TO_JSON_STRING(A) as string_output,
  jsonObjectKeys(TO_JSON_STRING(A), true, false) as output1,
  jsonObjectKeys(TO_JSON_STRING(A), false, true) as output2,
  concat('["', array_to_string(jsonObjectKeys(TO_JSON_STRING(A), false, true), '","'), '"]') as output_string,
  jsonObjectKeys(TO_JSON_STRING(A.B), false, true) as output3
from tbl
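Applied to the question's example, a sketch along these lines should produce enabled_filters, assuming the temp function above is defined in the same script and a table my_table with search_id and a properties.filters struct as in the example input:
select
  search_id,
  -- shownull=false skips null leaves; fullname=false returns bare key names
  jsonObjectKeys(TO_JSON_STRING(properties.filters), false, false) as enabled_filters
from my_table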
I am trying to implement a multiple values filter to my database using flutter's moor package.
moor already has a where method that takes an expression and converts it into an SQL statement, like:
(select(exercisesTable)..where((e) => (e.name.equals(name)))).get();
But I need to filter data by more than one value.
After searching the documentation, I found two possible solutions:
Use the CustomExpression class (link):
Expression expression = CustomExpression<bool, BoolType>(" water BETWEEN 4.0 AND 5.0 AND protein BETWEEN 4.0 AND 15.0 AND description LIKE CHESS%");
But I get this error:
SqliteException: near ";": syntax error, SQL logic error
Use custom select statements (link):
I have not tried this because I believe the problem is in the SQL itself, not the moor package.
From the comments of the SingleTableQueryMixin::where method (link):
...
/// If a where condition has already been set before, the resulting filter
/// will be the conjunction of both calls.
...
According to this, you can use something like the following:
Future<List<TableData>> getTableList({int position, int type}) {
  final _select = select(table);
  if (position != null) {
    _select..where((tbl) => tbl.position.equals(position));
  }
  if (type != null) {
    _select..where((tbl) => tbl.type.equals(type));
  }
  return _select.get();
}
You can use the boolean algebra feature to have multiple where conditions.
// find all animals that aren't mammals and have 4 legs
select(animals)..where((a) => a.isMammal.not() & a.amountOfLegs.equals(4));
// find all animals that are mammals or have 2 legs
select(animals)..where((a) => a.isMammal | a.amountOfLegs.equals(2));
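Applying this to the original filter from the question, something like the sketch below should work; it assumes the table has water, protein, and description columns, and that your moor version provides isBetweenValues and like. Note the quotes around the LIKE pattern, which were missing in the original expression and are the likely cause of the syntax error:
// a sketch, not tested: combining range and pattern filters with `&`
Future<List<Exercise>> getFilteredExercises() {
  return (select(exercisesTable)
        ..where((e) =>
            e.water.isBetweenValues(4.0, 5.0) &
            e.protein.isBetweenValues(4.0, 15.0) &
            e.description.like('CHESS%')))
      .get();
}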
I am making quite a complex thing, and I am trying to use tables as keys because I have found that Lua supports it, that is:
{[{1,2}]="Meep"}
The issue is that when I then try to look up the value using the same kind of table, it won't find it.
I have tried looking into this, but I have no clue why it won't work.
local c = {[{1,2}]="Meep"}
print(c[{1,2}],c)
This is what I expect it to print:
"Meep",{[{1,2}]="Meep"}
but what I actually get is:
nil,{[{1,2}]="Meep"}
If, however, I try
local m={1,2}
local c = {[m]="Meep"}
print(c[m],c)
it prints the correct value. Is there a way to avoid that middleman? After all, m == {1,2} will return true.
The problem is that tables in Lua are represented as references. If you compare two different tables, you are comparing those references, so the comparison is only true if both operands are exactly the same table.
t = { 1, 2, 3 }
t2 = { 1, 2, 3 }
print(t == t) -- true
print(t2 == t) -- false
print(t2 == t2) -- true
Because of this, tables are passed to functions by reference.
function f(t)
  t[1] = 5
end
t2 = { 1 }
f(t2)
print(t2[1]) -- 5
To bypass this behavior, you could (as suggested in the comments) serialize the table before using it as a key.
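A minimal sketch of that idea, assuming the keys are flat array-like tables of values that convert cleanly to strings:
-- serialize a flat array-like table into a deterministic string key
local function key(t)
  return table.concat(t, ",")
end

local c = {}
c[key({1, 2})] = "Meep"
print(c[key({1, 2})]) --> Meep: equal contents produce the same string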
Instead of writing a query like
select * from xyz where mydomain IN ('foobar.com', 'www.example.com')
I want to write a function like
select * from xyz where one_of_my_domains(select mydomain as from_site)
But I want to be able to reuse this function for any URL column in any of several tables. Currently, when I use a function like this, I have to predefine what is returned and apply it to the whole FROM part of the SQL statement. Is there any way to generalize a UDF so that I can use it on just one column instead of having it operate over all rows? Here is my code right now; it works, but I have to predefine every output column, which makes it not reusable.
var domains = ['foobar.com', 'www.example.com'];

// The UDF
function has_domain(row, emit) {
  var has_domain = false;
  if (row.to_site !== null && row.to_site !== undefined) {
    for (var i = 0; i < domains.length; i++) {
      if (domains[i] === String.prototype.toLowerCase.call(row.to_site)) {
        has_domain = true;
        break;
      }
    }
  }
  return emit({has_domain: has_domain, trackingEventId: row.trackingEventId, date: row.date, from_site: row.from_site, to_site: row.to_site});
}

// UDF registration
bigquery.defineFunction(
  'has_domain',  // Name used to call the function from SQL
  ['from_site'], // Input column names
  // JSON representation of the output schema
  [{name: 'has_domain', type: 'boolean'}],
  has_domain     // The function reference
);
It might look a little messy, but the query below does exactly what you asked!
Make sure you are in Standard SQL (see Enabling Standard SQL)
CREATE TEMPORARY FUNCTION one_of_my_domains(x STRING, a ARRAY<STRING>)
RETURNS BOOLEAN AS
(x IN (SELECT * FROM UNNEST(a)));
WITH xyz AS (
SELECT 1 AS id, 'foobar.com' AS mydomain UNION ALL
SELECT 2 AS id, 'www.google.com' AS mydomain
),
site AS (
SELECT 'foobar.com' AS domain UNION ALL
SELECT 'www.example.com' AS domain
)
SELECT *
FROM xyz
WHERE one_of_my_domains(mydomain, ARRAY((SELECT domain FROM site)))
You're looking for scalar UDFs using standard SQL. They're much less awkward to use compared to those of legacy SQL.
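If you want the helper to be reusable across queries and tables, standard SQL also lets you persist it in a dataset; a sketch, with mydataset and xyz as placeholders:
-- persisted once, callable from any query on a single column
CREATE OR REPLACE FUNCTION mydataset.one_of_my_domains(x STRING, a ARRAY<STRING>)
RETURNS BOOLEAN AS (x IN UNNEST(a));

SELECT *
FROM mydataset.xyz
WHERE mydataset.one_of_my_domains(mydomain, ['foobar.com', 'www.example.com']);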
I would like to use a Groovy closure to process data coming from a SQL table. For each new row, the computation would depend on what has been computed previously. However, new rows may become available on further runs of the application, so I would like to be able to reload the closure, initialised with the intermediate state it had when the closure was last executed in the previous run of the application.
For example, a closure intending to compute the moving average over 3 rows would be implemented like this:
def prev2Val = null
def prevVal = null
def prevId = null

Closure c = { row ->
    println([prev2Val, prevVal, prevId])
    def latestVal = row['val']
    if (prev2Val != null) {
        def movMean = (prev2Val + prevVal + latestVal) / 3
        sql.execute("INSERT INTO output(id, val) VALUES (?, ?)", [prevId, movMean])
    }
    sql.execute("UPDATE test_data SET processed=TRUE WHERE id=?", [row['id']])
    prev2Val = prevVal
    prevVal = latestVal
    prevId = row['id']
}
test_data has 3 columns: id (an auto-incremented primary key), val, and processed. A moving mean is calculated from the current value and the two previous ones, and inserted into the output table against the id of the previous row. Processed rows are flagged with processed=TRUE.
If all the data was available from the start, this could be called like this:
sql.eachRow("SELECT id, val FROM test_data WHERE processed=FALSE ORDER BY id", c)
The problem comes when new rows become available after the application has already been run. This can be simulated by processing a small batch each time (e.g. using LIMIT 5 in the previous statement).
I would like to be able to dump the full state of the closure at the end of the execution of eachRow (saving the intermediate data somewhere in the database for example) and re-initialise it again when I re-run the whole application (by loading those intermediate variable from the database).
In this particular example, I can do this manually by storing the values of prev2Val, prevVal and prevId, but I'm looking for a generic solution where knowing exactly which variables are used wouldn't be necessary.
Perhaps something like c.getState() which would return [ prev2Val: 1, prevVal: 2, prevId: 6] (for example), and where I could use c.setState([ prev2Val: 1, prevVal: 2, prevId: 6]) next time the application is executed (if there is a state stored).
I would also need to exclude sql from the list. It seems this can be done using c.@sql = null.
I realise this is unlikely to work in the general case, but I'm looking for something sufficiently generic for most cases. I've tried to dehydrate, serialize and rehydrate the closure, as described in this Groovy issue, but I'm not sure how to save and restore all the @ fields in a single operation.
Is this possible? Is there a better way to remember state between executions, assuming the list of variables used by the closure isn't necessarily known in advance?
Not sure this will work in the long run, and you might be better off returning a list containing the values to pass to the closure to get the next set of data, but you can interrogate the binding of the closure.
Given:
def closure = { row ->
    a = 1
    b = 2
    c = 4
}
If you execute it:
closure( 1 )
You can then compose a function like:
def extractVarsFromClosure( Closure cl ) {
    cl.binding.variables.findAll {
        !it.key.startsWith( '_' ) && it.key != 'args'
    }
}
Which when executed:
println extractVarsFromClosure( closure )
prints:
['a':1, 'b':2, 'c':4]
However, any 'free' variables defined in the local binding (without a def) will be in the closure's binding too, so:
fish = 42
println extractVarsFromClosure( closure )
will print:
['a':1, 'b':2, 'c':4, 'fish':42]
But
def fish = 42
println extractVarsFromClosure( closure )
will not print the value of fish.
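Going the other way, restoring a saved state is a matter of pushing the map back into the binding before re-running. A minimal sketch, assuming the closure was declared in a script (so it shares the script's binding) and saved is whatever map you persisted earlier:
// e.g. a map loaded back from the database
def saved = ['a': 1, 'b': 2, 'c': 4]
// push each saved variable back into the closure's binding
saved.each { k, v -> closure.binding.setVariable(k, v) }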