NiFi GenerateTableFetch does not store state per database.name

I am testing out NiFi to replace our current ingestion setup, which imports data from multiple MySQL shards of a table and stores it in HDFS.
I am using GenerateTableFetch and ExecuteSQL to achieve this.
Each incoming flow file has a database.name attribute, which DBCPConnectionPoolLookup uses to select the relevant shard.
The issue is this: say I have 2 shards to pull data from, shard_1 and shard_2, for the table accounts, and I have updated_at as the Maximum Value Column. State is not stored for table#updated_at per shard; there is only 1 entry per table in state.
When I check Data Provenance, I see the shard_2 flow file getting dropped without being passed to ExecuteSQL. My guess is that the shard_1 query gets executed first, and then when the shard_2 query comes along, its records are checked against shard_1's updated_at; since that returns empty, the file is dropped.
Has anyone faced this issue? Or am I missing something?

The ability to choose different databases via DBCPConnectionPoolLookup was added after the scheme for storing state in the database fetch processors (e.g., QueryDatabaseTable and GenerateTableFetch). Also, getting the database name differs between RDBMS drivers: it might be in the DatabaseMetaData or the ResultSetMetaData, possibly via getCatalog() or getSchema(), or in neither.
I have written NIFI-5590 to cover this improvement.

Related

Preventing Data Fraud and Deletion in SQL Database

We have an application that is installed on premises for many clients. We are trying to collect information that will be sent to us at a point in the future. We want to ensure that we can detect if any of our data is modified and if any data was deleted.
To prevent data from being modified, we currently hash table rows and send the hashes with the data. However, we are struggling to detect whether data has been deleted. For example, if we insert 10 records into a table and hash each row, the user won't be able to modify a record without us detecting it, but if they drop all the records then we can't distinguish this from the initial installation.
Constraints:
Clients will have admin roles on the DB
The application and DB will be behind a DMZ and won't be able to connect to external services
Clients will be able to profile any SQL commands and replicate any initial setup we do (to clarify, clients can also drop/recreate tables)
Although clients can drop data and tables, there are some sets of data and tables that, if dropped or deleted, would be obvious to us during audits, because they should always be accumulating data, and missing or truncated data would stand out. We want to be able to detect deletion and fraud in the remaining tables.
We're working under the assumption that clients will not be able to reverse engineer our code base or hash/encrypt data themselves
Clients will send us all data collected every month and the system will be audited by us once a year.
Also consider that the client can take backups of the DB or snapshots of a VM in a 'good' state and then roll back to that 'good' state if they want to destroy data. We don't want to do any detection of VM snapshot or DB backup rollbacks directly.
So far the only solution we have is encrypting the install date (which could be modified) and the instance name, then every minute 'incrementing' the encrypted data. When we add data to the system, we hash the data row and stick the hash in the encrypted data, then continue to 'increment' the data. When the monthly data is sent, we'd be able to see if they are deleting data and rolling the DB back to just after installation, because the encrypted value wouldn't have any increments or would have extra hashes that don't belong to any data.
Thanks
Have you looked into Event Sourcing? This could possibly be used with write-once media as secondary storage, if performance is good enough that way. That would then guarantee transaction integrity even against DB or OS admins. I'm not sure whether it's feasible to do Event Sourcing with real write-once media and still keep reasonable performance.
Let's say we have an md5() or similar function in your code, and you want to keep control of modifications to the "id" field of the table "table1". You can do something like:
accumulatedIds = "secretkey-only-in-your-program";
for every record "record" in the table "table1"
accumulatedIds = accumulatedIds + "." + record.id;
update hash_control set hash = md5(accumulatedIds) where table = "table1";
Run this after every authorized change to the information in the table "table1". Nobody could make changes outside of this system without being noticed.
If somebody changes some of the ids, you will notice, because the hash wouldn't be the same.
If somebody wants to recreate your table, then unless he recreates exactly the same information, he wouldn't be capable of producing the hash again, because he doesn't know the "secretkey-only-in-your-program".
If somebody deletes a record, that can also be discovered, because the "accumulatedIds" wouldn't match. The same applies if somebody adds a record.
The user can delete the record in the hash_control table, but he can't reconstruct the hash information properly without the "secretkey...", so you will notice that as well.
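A minimal runnable sketch of that idea in Java, using plain JDBC and the JDK's MessageDigest. The connection handling and the hash_control column name table_name are my assumptions; I used table_name rather than table because table is a reserved word in many databases:

import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HashControl {
    // The secret lives only in the application code, never in the database.
    private static final String SECRET = "secretkey-only-in-your-program";

    // Recomputes the accumulated hash over all ids of table1 and stores it in hash_control.
    public static void updateHash(Connection conn) throws Exception {
        StringBuilder accumulatedIds = new StringBuilder(SECRET);
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id FROM table1 ORDER BY id")) {
            while (rs.next()) {
                accumulatedIds.append(".").append(rs.getString("id"));
            }
        }
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(accumulatedIds.toString().getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE hash_control SET hash = ? WHERE table_name = 'table1'")) {
            ps.setString(1, hex.toString());
            ps.executeUpdate();
        }
    }
}

The ORDER BY keeps the concatenation order deterministic, so the same data always produces the same hash.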
What am I missing??

allowLargeResults in Query job in BigQuery

I'm trying to run a Query job in BigQuery and getting the following error:
Response too large to return. Consider setting allowLargeResults to
true in your job configuration
I understand that I need to set allowLargeResults to True in my job configuration, but then I also have to supply a destination table field.
I don't want to insert the results of the query into a specific table; I only want to process them locally.
How can I manage this situation?
I don't want to insert the results of the query into a specific table; I only want to process them locally.
Wanted to clarify – so you hopefully feel better about using a destination table:
In reality, any query result ends up in some table!
If the result is smaller than 128MB, BigQuery creates a temporary table on your behalf (in a special dataset whose name starts with an underscore, so it is not visible in the Web UI dataset/table navigator).
This temporary table is available for 24 hours and is what query caching uses, or you can even use it yourself; you just need to find out which table was created. You can find it in the API (the destination table, which, as I said above, exists even if you have not set a specific table), or you can find it in the Web UI.
When the result is bigger than 128MB, you must set a destination table. The only drawback in your case is that you need to make sure you delete this table once you no longer need it, otherwise you will be paying for storage.
You can do this either by actually deleting the table, manually (in the UI) or programmatically (API), or by setting an expiration on the table (API).
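As a rough sketch of that flow with the google-cloud-bigquery Java client (the dataset, table, and query are placeholders, and note that allowLargeResults is a legacy SQL option):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class LargeResultsQuery {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Destination table that will hold the large result set.
        TableId destination = TableId.of("my_dataset", "tmp_large_result");

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
                    "SELECT * FROM [my_dataset.big_table]")
                .setUseLegacySql(true)            // allowLargeResults only applies to legacy SQL
                .setAllowLargeResults(true)
                .setDestinationTable(destination)
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                .build();

        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        System.out.println("Query done: " + job.getJobId());

        // Avoid paying for storage once you are done: delete the table
        // (or set an expiration on it instead).
        bigquery.delete(destination);
    }
}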
First of all, if it says the response is too large, it is probably greater than 128MB. You need to make sure that your query is correct and that you really do want to return that much data. Usually people make mistakes in their queries, like join explosions, missing time filters to reduce the data, or missing limits.
Once you are convinced the data really is that large, you need to write it to a table, then export it to GCS, then download it, and then deal with it locally.
https://cloud.google.com/bigquery/docs/exporting-data#exportingmultiple
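And a sketch of the export step from the linked page, again with the Java client (the source table, bucket, and format are placeholders; the wildcard URI lets BigQuery split the export into multiple files):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class ExportToGcs {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // The destination table from the previous query job.
        TableId source = TableId.of("my_dataset", "tmp_large_result");

        ExtractJobConfiguration extract = ExtractJobConfiguration.newBuilder(
                    source, "gs://my-bucket/export/result-*.json")
                .setFormat("NEWLINE_DELIMITED_JSON")
                .build();

        bigquery.create(JobInfo.of(extract)).waitFor();
        // Download the exported files with gsutil or the GCS client and process them locally.
    }
}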

GemFire : CacheLoader : Getting data from external database

CacheLoader: use case
One of the main use cases for GemFire is as a fast cache which holds the most recent data (for example, the last 1 month) while all remaining data sits in the back-end database; that is, GemFire data which is 1 month old is overflowed to the database after 1 month.
However, when a user is looking for data older than 1 month, we need to go to the database and get it.
A cache loader is suitable for doing this on cache misses, fetching the data from the database. Regarding the cache loader, I believe a cache miss is triggered only when we do a get operation on a key and the key is missing.
What I do not understand is that once the data gets overflowed to the back-end, I believe no reference to it exists in GemFire. Also, a user may not know the key; rather than doing a get operation on a key, he might need to execute an OQL query on fields other than the key.
How will a cache miss be triggered when I don't know the key?
Then how does the CacheLoader fit into the overall solution?
Geode will not invoke a CacheLoader during a Query operation.
From the Geode documentation:
The loader is called on cache misses during get operations, and it populates the cache with the new entry value in addition to returning the value to the calling thread.
(Emphasis is my own)
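For completeness, a minimal sketch of what such a loader could look like in Geode (the key/value types and the database lookup are hypothetical):

import org.apache.geode.cache.CacheLoader;
import org.apache.geode.cache.CacheLoaderException;
import org.apache.geode.cache.LoaderHelper;

// Hypothetical loader: on a cache miss for region.get(key), fetch the value
// from the back-end database and let GemFire put it into the region.
public class AccountCacheLoader implements CacheLoader<String, String> {

    @Override
    public String load(LoaderHelper<String, String> helper) throws CacheLoaderException {
        String key = helper.getKey();
        // Placeholder for the real back-end lookup (JDBC, JPA, ...).
        return fetchFromDatabase(key);
    }

    private String fetchFromDatabase(String key) {
        // ... query the database by primary key and map the row to the cached value ...
        return null;
    }

    @Override
    public void close() {
        // Release any database resources held by the loader.
    }
}

As the quoted documentation says, this only fires on get operations; an OQL query will not trigger it.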

BigQuery caching when hitting the table would provide a different result?

As part of our BigQuery solution we have a cron job which checks the latest table created in a dataset and will create more if this table is out of date. This check is done with the following query:
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting BigQuery's internal cache, even though running the query against the underlying table would provide a different result. This caching also occurs if I run the query in the web interface from the Google Cloud Console.
If I specify that the query should not use the cache, via the
queryRequest.setUseQueryCache(false)
flag in the code, then the tests pass correctly.
My understanding was that BigQuery's automatic caching would not occur if running the query against the underlying table would provide a different result. Am I incorrect in this assumption (in which case, when does caching occur), or is this a bug?
Well, the answer to your question is: you have it conceptually wrong. You always need to set the no-cache parameter if you want non-cached data; even in the web UI there are options you need to use. The default is to use the cached version.
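For reference, this is roughly what that looks like with the google-cloud-bigquery Java client, keeping the placeholder dataset name from the question; the queryRequest.setUseQueryCache(false) call shown above achieves the same thing in the older API:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;

public class LatestTableCheck {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
                    "SELECT table_id FROM [dataset.__TABLES_SUMMARY__] "
                  + "WHERE table_id LIKE 'table_root%' "
                  + "ORDER BY creation_time DESC LIMIT 1")
                .setUseLegacySql(true)    // the query above uses legacy SQL bracket syntax
                .setUseQueryCache(false)  // always bypass the cache for this freshness check
                .build();

        for (FieldValueList row : bigquery.query(config).iterateAll()) {
            System.out.println(row.get("table_id").getStringValue());
        }
    }
}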
But fundamentally, you need to change the process and use a more recent feature:
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need to have a cron, as it will automatically create the missing tables.
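A hedged sketch of such an insert with the Java client (the table name, suffix, and row fields are placeholders; the template table table_root must already exist with the desired schema):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class TemplateTableInsert {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Rows stream into "table_root_20181001"; BigQuery creates that table
        // from the template's schema if it does not exist yet.
        InsertAllRequest request = InsertAllRequest
                .newBuilder(TableId.of("dataset", "table_root"))
                .setTemplateSuffix("_20181001")
                .addRow(Map.of("id", 1, "created_at", "2018-10-01T00:00:00Z"))
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        // response.hasErrors() / getInsertErrors() report any per-row streaming failures.
    }
}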
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables

Is data appended to a table or does it overwrite the table if it already exists when streaming data into BigQuery

When streaming data into a BigQuery table, I wonder whether the default is to append the JSON data to the table if it already exists? The API documentation for tabledata().insertAll() is very brief and doesn't mention parameters like configuration.load.writeDisposition as in a load job.
There are no multiple choices here, so there is no default and no overwrite case. Don't forget that BigQuery is a WORM technology (append-only by design). It looks to me like you are not aware of this, as there is no option like UPDATE.
You just set the path parameters (the trio of project, dataset, and table ID), then supply the rows as JSON matching the existing schema, and the data will be appended to the table.
To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis.
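A small sketch of a streaming insert with insertId values, using the Java client (dataset, table, fields, and the ids are placeholders):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import java.util.Map;

public class StreamRows {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Each addRow call appends a row; the first argument is the insertId used
        // for best-effort de-duplication if the same row is re-sent within about a minute.
        InsertAllRequest request = InsertAllRequest
                .newBuilder(TableId.of("my_dataset", "my_table"))
                .addRow("row-0001", Map.of("name", "alice", "score", 10))
                .addRow("row-0002", Map.of("name", "bob", "score", 7))
                .build();

        InsertAllResponse response = bigquery.insertAll(request);
        if (response.hasErrors()) {
            response.getInsertErrors().forEach(
                    (rowIndex, errors) -> System.err.println(rowIndex + ": " + errors));
        }
    }
}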
In case of error you have a short error code that summarizes the error. For help on debugging the specific reason value you receive, see troubleshooting errors.
Also worth reading:
Bigquery internalError when streaming data