The case: we have an agent which send data to eventhub, we use the event hub as input for the aggregation and calculation using ASA. When the agent has some problems, it will send a value to the eventhub(maybe serval days send an error value), we want to write a value to the output when we receive the agent error value, so how to solve this problem? We cannot using window because it is data triggered.
Assuming you want to write to the same output (sink), you can define replicated output in ASA (same as the existing output as it does not support 2 queries writing to same output). Define another query that just consumes the error events only and writes to second output defined (but to the same physical output location) and this would be a pass-through query "select * ...". hope that helps.
thanks
Venkat
Related
At last BigQuery supports using ; in the queries, so I can write more than one query in one "block", if I seperate them with semicolon.
If I run the code manually, it works. But I cannot schedule that.
When I want to schedule, I have two choices:
(New) Web UI: I must give a destination table. If I don't do it, I could not save the scheduled query. But all my queries are updates and inserts with different "destination tables". Like these:
UPDATE project.exampledataset.a
SET date = current_date()
WHEN TRUE
;
INSERT INTO project.otherdataset.b
SELECT c,d
FROM project.otherdataset.c
So I cannot even make a scheduling in the Web UI.
Classic UI: I tried this, because the official documentary states, that I should leave the "destination table" blank, and Classic UI allows it. I can setup the scheduling, but it doesn't run, when it should. I get the error message in email "Error status: Dataset specified in the query ('') is not consistent with Destination dataset 'exampledataset'."
AIK scripting (and using semicolon) is a very new feature in BigQuery, but I hope someone can help me.
Yes, I know that I could schedule every query one by one, but I would like to resolve it with one big script.
Looks like the scheduled query was defined earlier with destination dataset defined with APPEND/TRUNCATE type transaction. While updating the same scheduled query to a DML query, GUI doesn't show the dataset field / table name to update to NULL. Hence this error is coming considering the previously set dataset and table name in the scheduled query.
Hence the fix is to delete the scheduled query and create it from scratch with DML query option. It worked for me.
Scripting is supported in scheduled query now. However, scripting query, when being scheduled, doesn't support setting a destination table for now. You still need to use DDL/DML to make change to existing table.
E.g.:
CREATE OR REPLACE TABLE destinationTable AS
SELECT *
FROM sourceTable
WHERE date >= maxDate
As of 2022, the BQ Console UI will let you create a new scheduled query without a destination dataset, but it won't let you update a prior SELECT to use DDL/DML block syntax. However, you can use the BigQuery Data Transfer API to update the destinationDatasetId field, via transferconfigs/patch. Use transferconfigs/list to get the configId for a given scheduled query.
Note that you can either use the in-browser API Explorer, if you have the appropriate credentials, or write a programmatic solution. Also seems useful for setting/updating any other fields, including renaming scheduled queries.
I want to execute select * from table1, select * from table2, select * from table3,...select * from table80....(Basically extract data from 80 different tables and send the data to 80 different indexes in Elasticsearch(Kibana).
Is it possible for me to give multiple select * statement in one Query Database Table and then route it to different indexes? If yes, what will be the flow like?
There are a couple approaches you can take to solve this issue.
If your tables are literally table1, table2, etc., you can simply generate 80 flowfiles, each with a unique integer value in an attribute (i.e. table_count) and use GenerateTableFetch and ExecuteSQL to create the queries using this attribute via Expression Language
If the table names are non-sequential (i.e. users, addresses, etc.), you can read from a file listing each on a line or use ListDatabaseTables to query the database for the names. You can then perform simple text processing to split the created flowfile(s) to one per table and continue as above
QueryDatabaseTable doesn't allow incoming connections so it is not possible.
But you can achieve same use case with the following flow
Flow:
1. ListDatabaseTables
2. RouteOnAttribute //*optional* filter only required tables
3. GenerateTableFetch //to generate pages of sql queries and store state
4. RemoteProcessGroup (or) Load balance connection
5. ExecuteSql //run more than one concurrent task if needed
6. further processing
7. PutElasticSearch.
In addition if you don't want to run the flow incrementally then remove GenerateTableFetch processor
Configure ExecuteSql processor select query as
select * from ${db.table.schema}.${db.table.name}
Some useful references:
GenerateTableFetch link1 link2
Incrementally run ExecuteSQL processor without using GenerateTableFetch link
I am currently using a function in SQL Server to get the max-value of a certain column. I Need this value to generate a specific number of dummy files to insert flowfiles that are created later on.
Is there a way of calling this function via a nifi-processor?
By using ExecuteSQL I Always get error like unable to execute SQL select query or the column "ab" was not found, when using select ab.functionname() (ab is the loginname of the db)
In SQL Server I can just use select ab.functionname() and get the desired results.
If there is no possible way of calling this function, is there another way to create #flowfiles dummyfiles to reserve this place for them in the DB so that no one else could insert or use this ids (not autoincremt, because it is not possible) while the flowfiles are getting processed?
I tried using $flowfile.count and the Counterprocessor, but this did not solve the Problem.
It should look like: INSERT INTO table (id,nr) values (max(id)+1,anynumber) for every flowfiles, unfortunately the ExecuteSQL is not able to do this.
Think this conversation can help you:
https://community.hortonworks.com/questions/26170/does-executesql-processor-allow-to-execute-stored.html
Gist:
You can use ExecuteScript or ExecuteProcess to call appropriate script. For example for ExecuteProcess just call sqlplus command. Choose type of command "sqlplus". In command arguments set something like: user_id/password#dbname #"script_path/someScript.sql". In someScript.sql you put something like:
execute spname(param)
You can write your own processor :) Of course it's more difficulty and often unnecessary
I have a set of SQL statements using distributed option of Invantive SQL that extract shipped goods information from Exact Online and create for each serial number shipped a ticket in Freshdesk, together with the consumer as a contact.
This works fine when connected to Exact Online and Freshdesk under one log on code. However, the end user uses a different log on code. In that case the set of SQL statements retrieves data from their test division in Exact Online instead of the correct production division.
When using no distributed option, I can change the division using:
use 123123
Where 123123 is the unique division number in the Exact Online country.
When connected both to Exact Online and Freshdesk, I get a:
itgenuse002: List of partitions could not be determined.
How can I enforce that the set of SQL statements is executed for a specific Exact Online division instead of the default one set at that moment for the log on code?
Sample SQL query that shows the problem:
create or replace table fulladdress#inmemorystorage --STAP 1.
as
select acad.id
, acad.name
, acad.phone
, acad.email
, acad.addressline1 || ' ' || acad.postcode || ' ' || acad.city fulladdress
from ExactonlineREST..Accounts#eolnl acad
where acad.status = 'C'
The use statement shown is for databases with exactly one data container. In that case, there is only one data container that can handle the question and everything runs smooth.
With a distributed query in Invantive SQL, you need to direct the use statement which data container to use. Otherwise, the first data container will try to handle it (in this case probably Freshdesk which has no concept of partitioning). That is similar to appending the data container alias to each tables as in:
select ...
from table#eolnl
join table2#freshdesk
on ...
Here eolnl and freshdesk specify where the tables should looked up.
So, in this case use:
use 123123#eolnl
The same also holds for the set statement.
From your code it seems you have multiple data containers running in your connection. From the user interface you can only set the partitions on the default data container.
However, there is an easy code solution to use. You have to know the alias of the data container you want to set the partition for. The use that alias in a use call (in this sample 123123 is the partition you want to choose):
use 123123#eolnl
Or to use all partitions available:
use all#eolnl
I am using pentaho for data migration testing. I have set a "table input" step where many parts of the query inside "table inputs" are variables. I have been looking for a way to capture that query after it gets executed during runtime.
I was wondering if there is any specific system log variables for sql or is it to do with metadata. need help! Thanks
Maybe the following approach will help:
We assume a transformation reading a CSV file to get the dynamic portion of the SELECT statement (e.g. the columns) and setting the variable columns with it.
The second transformation uses this variable to generate the SELECT statement and store it into the variable sql_statement.
In the main transformation we use ${sql_statement} as the SELECT statement of the table input and write the data to an output file (that's the business process so to say). From the same input we copy the output to another path. There we add the current time as a field (use element "Get system data") and we add the generated SQL statement, join them as a cartesian product and group the result by the sql_statement. That way we can compute the first time and the last time that the statement was used. These results are written to a text file.
The last thing we need is a job calling the three transformations sequentially.
This is a sample output:
sql_statement;min_time;max_time
SELECT my_column FROM test_table;2014/05/08 00:41:21.143;2014/05/08 00:41:21.144
Thank you Marcus! I did some thing similar.
It works. awesome.
I gathered parts of queries from table field where they were kept and formed a full query in javascript. After that full query will be sent as parameter to a transformation that will run and log the query.