Dynamic window creation at run time - Drools Fusion 6 / Esper - dynamic

I need to achieve a dynamic sliding window of length 5, where flight status events from various flights arrive on a single stream.
Based on the flight_id property, a window of length 5 has to be created at run time for each flight, and the average speed has to be maintained individually per flight.
This example in Drools Fusion does not work when I insert multiple flights with different flight ids and speeds into it - http://books.google.co.in/books?id=trrfxX8JCisC&pg=PA136&lpg=PA136&dq=flight+average+speed+example+drools+fusion&source=bl&ots=NpRv7D32Us&sig=6XbWtIQ2T1idGMQRU_hQZgmd8fc&hl=en&sa=X&ei=RBAUU92yIsLkiAenFg&ved=0CDIQ6AEwAQ#v=onepage&q=flight%20average%20speed%20example%20drools%20fusion&f=false
The window gets reset when it detects a new flight id.
Please let me know if there is a solution for this in Drools Fusion or Esper or any other open source CEP.
Thanks in advance.

The link does not work.
Can you perhaps clarify "dynamic windows" and "windows get reset"? It is not clear what that could mean.
In Esper I found an example in the docs in "4.2.6.1. Distinct Events for the Initiating Condition" and rewrote it into something that may match the somewhat fuzzy requirements:
create context Flight initiated by distinct(flightId) FlightEvent
terminated after 5 seconds; // you don't mention when to throw a flight away
context Flight select avg(speed) from FlightEvent.win:length(5);
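If the goal is simply one independent length-5 window and running average per flightId, a keyed (partitioned) context may be a closer fit than the initiating/terminating context above. A minimal sketch, assuming the same FlightEvent type with flightId and speed properties:
// one context partition is created per distinct flightId
create context PerFlight partition by flightId from FlightEvent;
// each partition keeps its own length-5 window and its own average
context PerFlight select flightId, avg(speed) as avgSpeed from FlightEvent.win:length(5);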

Related

Airflow Operator BigQueryTablePartitionExistenceSensor Question

I'm trying to use the BigQueryTablePartitionExistenceSensor operator in Airflow, and I was wondering whether this operator checks that the partition is fully loaded, or whether it can potentially mark success even if the data isn't complete yet.
For example, if my table is partitioned on DAY and the load for 20220420 has started but isn't complete, would this sensor trigger? Or, would it wait until that load step has been completed before marking the sensor to success?
Thanks
The operator will not wait until your data has loaded; it will just check for the existence of the partition value at that moment in time. So if a single row gets inserted into that partition, this sensor would return True. See the sensor code that gets called by this operator.
An idea I've used in the past for similar problems has been to use a sentinel label on the partitioned table to mark a load as "in-progress" or "done".
As has already been answered, it does not await anything except the existence of the partition.
If your data is streamed into partitions, and you have ordered delivery, you can probably add a sensor for the next-day partition — on the assumption that the previous day is complete when events have started streaming into the next.
If the load is managed by the same Airflow instance, I'd suggest using an ExternalTaskSensor on the load job. If not, you might be able to use the more generic SqlSensor and run a custom SQL query on metadata tables to determine whether a partition is complete; perhaps you can add a label or something to the load job that you then query for.
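For the SqlSensor route, one rough sketch is to query BigQuery's partition metadata and only report success once the partition exists and has stopped changing for some quiet period. The project, dataset, and table names below are placeholders, and the 30-minute threshold is an assumption you would tune:
-- returns a row only if partition 20220420 exists and has not been modified recently
SELECT partition_id, total_rows
FROM `my_project.my_dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'my_table'
  AND partition_id = '20220420'
  AND last_modified_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 MINUTE);
Since SqlSensor treats a non-empty result with a truthy first cell as success, the sensor would keep poking until the quiet-period condition holds.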

OptaPlanner example for Capacitated Vehicle Routing with Time Windows?

I am new to OptaPlanner.
I want to build a solution where I will have a number of locations to deliver items to from one single location, and I also want to use openmap distance data for calculating the distances.
Initially I used jsprit, but for more than 300 deliveries it takes more than 8 minutes with 20 threads. That's why I am trying OptaPlanner.
I want to map 1000 deliveries within 1 minute.
Does anyone know of any reference code or reference material I can start from?
Thanks in advance :)
CVRPTW is a standard example: just open the examples app, choose vehicle routing, and then import one of the Belgium datasets with time windows. The code is in the zip too.
To scale to 1k deliveries and especially beyond, you'll want to use "Nearby selection" (see reference manual), which isn't on by default but which makes a huge difference.
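For reference, a rough sketch of what nearby selection looks like in the solver config XML, along the lines of the reference manual; the distance meter class and the distribution size are placeholders you would supply for your deliveries:
<localSearch>
  <unionMoveSelector>
    <changeMoveSelector>
      <entitySelector id="entitySelector1"/>
      <valueSelector>
        <nearbySelection>
          <originEntitySelector mimicSelectorRef="entitySelector1"/>
          <!-- placeholder: your own NearbyDistanceMeter implementation -->
          <nearbyDistanceMeterClass>com.example.DeliveryNearbyDistanceMeter</nearbyDistanceMeterClass>
          <parabolicDistributionSizeMaximum>40</parabolicDistributionSizeMaximum>
        </nearbySelection>
      </valueSelector>
    </changeMoveSelector>
  </unionMoveSelector>
</localSearch>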

Dynamic query using parameters in Tableau

I am trying to use Tableau to create a web dashboard that interacts with a Postgres database with a fair number of rows.
The key here is that the relevant data is within latitude/longitude boundaries, so I'm using Tableau parameters in a custom SQL statement to get what I need, like so:
SELECT id, lat, lng... FROM my_table
WHERE lat >= <Parameters.MIN_LAT> AND lat <= <Parameters.MAX_LAT>
AND lng >= <Parameters.MIN_LNG> AND lng <= <Parameters.MAX_LNG>
LIMIT 10000
I'm setting these parameters using the Tableau JavaScript API based on the boundaries of a Google Maps widget. When the map is moved, I refresh the parameters and the data needs to update as well. This refresh is not done constantly, but frequently enough that long wait times are not acceptable.
Because the lat/lng boundaries are dynamic and the full unfiltered table is very big (~1 GB), I presumed it is impractical to create a data extract. Am I wrong?
Furthermore, when I change some of the in-Tableau filters I'm applying, there is a very long wait, as if it is re-executing the query every time, even if the MIN_LAT, MAX_LAT, ... parameters are unchanged.
What's the best way of resolving this? I'm new to Tableau so sorry if I'm missing something super obvious!
Thanks.
The best way of resolving this is to make a query that returns less information (1 GB is too much; an extract can help group data so that dimensions are presented very quickly, but that's it, and if there is nothing to group it will be very large), which lets you drill down to present more detail in subsequent steps or dashboard levels.
I am thinking of a field in the database that indicates the zoom level at which to present information.
If you are navigating Google Maps, first you see the countries, then the capital cities, then the cities, then the small towns, then the local stores.
The key is the zoom level you are at each time.
You may want to look at the Tableau documentation on drill-downs.
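One way to combine that drill-down idea with the existing parameter approach is to aggregate points into grid cells whose size depends on a zoom-level parameter, so low zoom levels return far fewer rows. A rough sketch in the custom SQL, where ZOOM_PRECISION is a hypothetical integer parameter (decimal places to keep) driven from the map widget, not something from the question:
-- bucket points into a grid; a coarser grid (fewer decimals) at low zoom
SELECT ROUND(lat::numeric, <Parameters.ZOOM_PRECISION>) AS lat_cell,
       ROUND(lng::numeric, <Parameters.ZOOM_PRECISION>) AS lng_cell,
       COUNT(*) AS point_count
FROM my_table
WHERE lat >= <Parameters.MIN_LAT> AND lat <= <Parameters.MAX_LAT>
  AND lng >= <Parameters.MIN_LNG> AND lng <= <Parameters.MAX_LNG>
GROUP BY 1, 2
At high zoom levels you would switch back to the row-level query from the question.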

BigQuery streaming data not available instantly

For the past couple of days, some data I am streaming into BigQuery has not been available instantly (as it normally is) in the BigQuery web UI, even though it was inserted successfully.
My use case consists of inserting thousands of rows using:
bigquery.tabledata().insertAll(...)
The results of the streaming inserts into the table are (I am also checking for insertErrors to be sure, as described here):
BigQuery insert status : {"kind":"bigquery#tableDataInsertAllResponse"}
BigQuery insert errors : null
The total number of rows available in the BigQuery web UI is different from the total inserted.
I would be grateful for any help.
BigQuery project details:
Project ID : favorable-beach-87616
Table : mtp_UA_xxxx_1_20150410
Project dependencies on Google libraries:
compile 'com.google.api-client:google-api-client:1.19.0'
compile 'com.google.http-client:google-http-client:1.19.0'
compile 'com.google.http-client:google-http-client-jackson2:1.19.0'
compile 'com.google.oauth-client:google-oauth-client:1.19.0'
compile 'com.google.oauth-client:google-oauth-client-servlet:1.19.0'
compile 'com.google.apis:google-api-services-bigquery:v2-rev171-1.19.0'
compile 'com.google.api-client:google-api-client:1.17.0-rc'
Many thanks in advance for your help!
When you say the total number of lines available in the web UI, do you mean the number of rows that show up in the 'details' pane on the table, or the number of rows that are returned if you do a SELECT COUNT(*) query?
If the former, that is expected, since that counter only returns the number of rows that have been flushed to long-term storage (as opposed to the short-term storage buffers the streaming data originally gets written to). This is admittedly confusing, and we are working on a fix.
If the latter, i.e. the rows don't show up in a query result, that is more concerning. If that is the case, please let us know and we'll investigate.
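For reference, a count of the form below reads the streaming buffer as well as flushed storage, so it is the more reliable way to check how many rows have actually arrived (legacy SQL syntax of that era; my_dataset is a placeholder, the table name is taken from the question):
SELECT COUNT(*) FROM [favorable-beach-87616:my_dataset.mtp_UA_xxxx_1_20150410]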

how to list job ids from all users?

I'm using the Java API to query for all job ids using the code below
Bigquery.Jobs.List list = bigquery.jobs().list(projectId);
list.setAllUsers(true);
but it doesn't list the job ids that were run by a Client ID for web applications (i.e. Metric Insights); I'm using private key authentication.
Using the command line tool 'bq ls -j' in turn gives me only the Metric Insights job ids but not the ones run with the private key auth. Is there a way to get all of them?
The reason I'm doing this is trying to get better visibility into what queries are eating up our data usage. We have multiple sources of queries: metric insights, in house automation, some done manually, etc.
As of version 2.0.10, the bq client has support for API authorization using service account credentials. You can specify using a specific service account with the following flags:
bq --service_account your_service_account_here@developer.gserviceaccount.com \
--service_account_credential_store my_credential_file \
--service_account_private_key_file mykey.p12 <your_commands, etc>
Type bq --help for more information.
My hunch is that listing jobs for all users is broken, and nobody has mentioned it since there is usually a workaround. I'm currently investigating.
Jordan -- It sounds like you're homing in on what we want to do. For all access that we've allowed into our project/dataset, we want to produce an aggregate report of the "totalBytesProcessed" for all queries executed.
The problem we're struggling with is that we have a handful of distinct Java programs accessing our data, a 3rd party service (Metric Insights), and 7-8 individual users who have query access via the web interface. Fortunately the incoming data only has one source, so explaining the cost for that is simple. For queries, though, I am kinda blind at the moment (and it appears queries will be the bulk of the monthly bill).
It would be ideal if I could get the underlying data for this report with just one listing made with a single top-level auth. With that, I think I can attribute each query to a source from the timestamps and the actual SQL text.
One thing that might make this problem far easier is if there were more information in the job record (or some text adornment in the job_id for queries). I don't see that I can assign my own jobIDs on queries (perhaps I missed it?) and perhaps recording some source information in the job record would be possible? Just thinking out loud now...
There are three INFORMATION_SCHEMA views you can query for this:
region-**.INFORMATION_SCHEMA.JOBS_BY_{USER, PROJECT, ORGANIZATION}
where ** should be replaced by your region.
Example query for JOBS_BY_USER in the eu region:
select
  count(*) as num_queries,
  date(creation_time) as date,
  sum(total_bytes_processed) as total_bytes_processed,
  sum(total_slot_ms) as total_slot_ms_cost
from
  `region-eu.INFORMATION_SCHEMA.JOBS_BY_USER` as jobs_by_user,
  jobs_by_user.referenced_tables
group by
  2
order by 2 desc, total_bytes_processed desc;
Documentation is available at:
https://cloud.google.com/bigquery/docs/information-schema-jobs