Azure Stream Analytics: How to handle more than 5 queries?

I have created one ASA job with one input (Event Hub) and 6 outputs (2 Cosmos DB and 4 Service Bus queues),
and the queries are shown below. ASA allows writing more than 5 queries, but it is giving errors in the activity logs, which also cause a watermark delay.
1: SELECT * INTO CosmosOutput FROM eventhubinput;
2: SELECT id, long, lat, timestamp INTO CosmosOutput1 FROM eventhubinput;
3: SELECT * INTO SB1 FROM eventhubinput WHERE <condition>;
4: SELECT * INTO SB2 FROM eventhubinput WHERE <condition>;
5: SELECT * INTO SB3 FROM eventhubinput WHERE <condition1>;
6: SELECT * INTO SB4 FROM eventhubinput WHERE <condition1>;
Question:
How do I write more than 5 queries in an efficient way? Thanks in advance!

Since you have multiple queries, you could try increasing the Streaming Units setting.
Streaming Units (SUs) represent the computing resources that are allocated to execute a Stream Analytics job. The higher the number of SUs, the more CPU and memory resources are allocated for your job. Choosing the number of required SUs for a particular job depends on the partition configuration for the inputs and the query that's defined within the job.
Of course, more SUs also mean more cost. As another workaround, you could use an Azure Function output to replace some of the queries. For example, I notice you need to push data into different Service Bus outputs with exactly the same conditions. You could combine them into one query and push the same data to an Azure Function as parameters, then configure multiple Service Bus output bindings inside the Azure Function. Alternatively, within ASA itself, a WITH step lets several outputs share one filtered read of the input, as sketched below.
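A minimal sketch of that WITH pattern, assuming queries 3 and 4 really do share an identical <condition> placeholder:
WITH FilteredEvents AS (
    SELECT *
    FROM eventhubinput
    WHERE <condition>
)
SELECT * INTO SB1 FROM FilteredEvents;
SELECT * INTO SB2 FROM FilteredEvents;
The same pattern would cover queries 5 and 6 with <condition1>, so the stream should only need to be filtered once per distinct condition.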

Related

Executing multiple Select * in QueryDatabaseTable in NiFi

I want to execute select * from table1, select * from table2, select * from table3, ... select * from table80 (basically, extract data from 80 different tables and send the data to 80 different indexes in Elasticsearch/Kibana).
Is it possible to give multiple select * statements in one QueryDatabaseTable and then route the results to different indexes? If yes, what would the flow look like?
There are a couple of approaches you can take to solve this issue.
If your tables are literally table1, table2, etc., you can simply generate 80 flowfiles, each with a unique integer value in an attribute (e.g. table_count), and use GenerateTableFetch and ExecuteSQL to create the queries from this attribute via Expression Language (see the sketch after this list).
If the table names are non-sequential (e.g. users, addresses, etc.), you can read from a file listing each on a line, or use ListDatabaseTables to query the database for the names. You can then perform simple text processing to split the created flowfile(s) into one per table and continue as above.
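For example, a sketch of the query the first approach could build, assuming the tables really are named table1 through table80 and each flowfile carries a table_count attribute:
select * from table${table_count}
Both GenerateTableFetch and ExecuteSQL evaluate Expression Language against the incoming flowfile's attributes, so each of the 80 flowfiles produces its own query.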
QueryDatabaseTable doesn't allow incoming connections, so that is not possible.
But you can achieve the same use case with the following flow.
Flow:
1. ListDatabaseTables
2. RouteOnAttribute // optional: filter only the required tables
3. GenerateTableFetch // generates pages of SQL queries and stores state
4. RemoteProcessGroup (or) load-balanced connection
5. ExecuteSQL // run more than one concurrent task if needed
6. further processing
7. PutElasticsearch
In addition, if you don't want to run the flow incrementally, remove the GenerateTableFetch processor.
Configure the ExecuteSQL processor's select query as:
select * from ${db.table.schema}.${db.table.name}
Some useful references:
GenerateTableFetch link1 link2
Incrementally run ExecuteSQL processor without using GenerateTableFetch link

Behavior of IgniteCache.loadCache

I am using IgniteCache.loadCache(null, keyClassName, sqlArray) to load RDBMS data into the cache by running the SQL queries specified by sqlArray.
It looks like loadCache internally runs the sqlArray on a thread pool (each SQL statement is run within a task).
My question is:
Does IgniteCache internally control the parallelism? I have the following scenario:
My data source's maximum connection count is set to 200.
The sqlArray's length is about 1000 since I have a large table:
select * from person where id >= 0 and id <= 20000
...
select * from person where id >= 10000000 and id <= 10020000
If all these 1000 SQL queries run at the same time, connections will become unavailable in the connection pool, which will lead to errors.
The IgniteCache.loadCache method fully relies on the configured CacheStore implementation. It looks like CacheAbstractJdbcStore does support parallelism internally.
By default, the pool size equals the number of available processors, but you are free to change it with the CacheAbstractJdbcStore.setMaximumPoolSize(int) method.
So you'll only run out of connections if you have more than 200 processors available.

Why is Spark SQL adding WHERE 1=0 during load?

I am pretty new to Spark. I have a task to fetch 3M records from SQL Server through the Denodo data platform and write them to S3. On the SQL Server side it is a view over a join of two tables, and the view is time consuming.
Now I am trying to run a Spark command as:
val resultDf = sqlContext.read.format("jdbc")
  .option("driver", "com.denodo.vdp.jdbc.Driver")
  .option("url", url)
  .option("dbtable", "myview")
  .option("user", user)
  .option("password", password)
  .load()
I can see that Spark is sending a query like:
SELECT * FROM myview WHERE 1=0
And this portion is taking more than an hour.
Can anyone please tell me why the where clause is appending here?
Thanks.
If I'm understanding your issue correctly, Spark is sending SELECT * FROM myview WHERE 1=0 to the Denodo Server.
If that is the case, that query should be detected by Denodo as a query with no results due to incompatible conditions in the WHERE clause and the execution should be instantaneous. You can try to execute the same query in Denodo's VQL Shell (available in version 6), Denodo's Administration Tool or any other ODBC/JDBC client to validate that the query is not even sent to the data source. Maybe Spark is executing that query in order to obtain the output schema first?
What version of Denodo are you using?
I see this is an old thread; however, we are experiencing the same issue, though it does not occur all of the time nor on all connections/queries.
A SQOOP command is sent, and the AND (1 = 0) context ('i18n' = 'us_est') is added somewhere. We are using Denodo 7 with the JDBC driver com.denodo.vdp.jdbc.Driver:
select
BaseCurrencyCode, BaseCurrencyName, TermCurrencyCode, TermCurrencyName,
ExchangeAmount, AskRate, BidRate, MidMarketRate, ExchangeRateStartDate,
ExchangeRateEndDate, RecCreateDate, LastChangeDate
from
CurrencyExchange
WHERE
LastChangeDate > '2020-01-21 23:20:15'
And LastChangeDate <= '2020-01-22 03:06:19'
And (1 = 0) context ('i18n' = 'us_est' )

How to configure the data trigger function?

The case: we have an agent which sends data to an Event Hub, and we use the Event Hub as input for aggregation and calculation using ASA. When the agent has a problem, it sends an error value to the Event Hub (perhaps one error value every several days). We want to write a value to the output when we receive the agent's error value, so how do we solve this problem? We cannot use a window because it is data triggered.
Assuming you want to write to the same output (sink), you can define a replicated output in ASA (pointing to the same destination as the existing output, since ASA does not support two queries writing to the same output). Then define another query that consumes only the error events and writes to that second output (the same physical output location); this would be a pass-through query "select * ...", as sketched below. Hope that helps.
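A sketch of what that pass-through query could look like; ErrorOutput is a hypothetical name for the replicated output, and eventhubinput and <error condition> are placeholders for your input and however the agent's error value is detected:
SELECT *
INTO ErrorOutput
FROM eventhubinput
WHERE <error condition>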
thanks
Venkat

How do I display the query time when a query completes in Vertica?

When using vsql, I would like to see how long a query took to run once it completes. For example, when I run:
select count(distinct key) from schema.table;
I would like to see an output like:
5678
(1 row)
total query time: 55 seconds.
If this is not possible, is there another way to measure query time?
In vsql type:
\timing
and then hit Enter. You'll like what you'll see :-)
Repeating that will turn it off.
Regarding the other part of your question:
is there another way to measure query time?
Vertica can log a history of all queries executed on the cluster, which is another source of query time. Before 6.0 the relevant system table was QUERY_REPO; starting with 6.0 it is QUERY_REQUESTS.
Assuming you're on 6.0 or higher, QUERY_REQUESTS.REQUEST_DURATION_MS will give you the query duration in milliseconds.
Example of how you might use QUERY_REQUESTS:
select *
from query_requests
where request_type = 'QUERY'
and user_name = 'dbadmin'
and start_timestamp >= CURRENT_DATE
and request ilike 'select%from%schema.table%'
order by start_timestamp;
The QUERY_PROFILES.QUERY_DURATION_US and RESOURCE_ACQUISITIONS.DURATION_MS columns may also be of interest to you. Here are the short descriptions of those tables in case you're not already familiar:
RESOURCE_ACQUISITIONS - Retains information about resources (memory, open file handles, threads) acquired by each running request for each resource pool in the system.
QUERY_PROFILES - Provides information about queries that have run.
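For example, a sketch of pulling recent query durations from QUERY_PROFILES, mirroring the QUERY_REQUESTS example above (the user_name filter is an assumption):
select query, query_duration_us
from query_profiles
where user_name = 'dbadmin'
order by query_start desc
limit 10;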
I'm not sure how to enable that in vsql or if that's possible, but you could get that information from a script. Here's roughly what it looks like (I used to use Perl):
my $start = time;
system("vsql -c 'select * from table'");
print "total query time: ", time - $start, " seconds\n";
That is, record the time before the vsql call and subtract it from the time afterwards.
The other option is to use some tool like Toad to connect to Vertica instead of using vsql.