GemFire: CacheLoader: getting data from an external database

CacheLoader: use case
One of the main use cases for GemFire is as a fast cache that holds the most recent data (for example, the last month), while all remaining data sits in a back-end database. That is, GemFire data is overflowed to the database once it is more than 1 month old.
However, when a user looks for data older than 1 month, we need to go to the database to get it.
A cache loader is suited to this: on a cache miss it gets the data from the database. As far as I understand, a cache miss is triggered only when we do a get operation on a key and that key is missing.
What I do not understand is this: once the data has been overflowed to the back end, I believe no reference to it exists in GemFire. Also, a user may not know the key to do a get operation on; they might need to execute an OQL query on fields other than the key.
How will a cache miss be triggered when I don't know the key?
How, then, does the cache loader fit into the overall solution?

Geode will not invoke a CacheLoader during a Query operation.
From the Geode documentation:
The loader is called on cache misses during get operations, and it populates the cache with the new entry value in addition to returning the value to the calling thread.
(Emphasis is my own)
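To make the distinction concrete, here is a minimal toy model in Python. It is not the GemFire/Geode API: the `Region` class, `load_from_db`, and `backing_db` are hypothetical names invented for illustration. It only shows the behaviour described in the quote, where a loader runs on a `get` miss and a query looks only at entries already in the cache.

```python
from typing import Callable, Dict, List, Optional


class Region:
    """Toy stand-in for a cache region; NOT the real GemFire/Geode API."""

    def __init__(self, loader: Callable[[str], Optional[dict]]):
        self._entries: Dict[str, dict] = {}
        self._loader = loader

    def get(self, key: str) -> Optional[dict]:
        # A miss on get() invokes the loader and caches the loaded value.
        if key in self._entries:
            return self._entries[key]
        value = self._loader(key)
        if value is not None:
            self._entries[key] = value
        return value

    def query(self, predicate: Callable[[dict], bool]) -> List[dict]:
        # A query only scans what is already cached; the loader is not called.
        return [v for v in self._entries.values() if predicate(v)]


# Hypothetical back-end database holding data that has left the cache.
backing_db = {"order-1": {"id": "order-1", "customer": "alice"}}
loader_calls: List[str] = []


def load_from_db(key: str) -> Optional[dict]:
    loader_calls.append(key)
    return backing_db.get(key)


region = Region(load_from_db)
before = region.query(lambda v: v["customer"] == "alice")  # finds nothing
loaded = region.get("order-1")                             # triggers the loader
after = region.query(lambda v: v["customer"] == "alice")   # now finds the entry
```

In this model, a query on a non-key field can only find the row after something has loaded it by key, which is the gap the question describes.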


How to implement check and set or Read - Update ACID transaction in Aerospike

I have a use case where I have data in Aerospike, and that data now requires frequent updates, but under a transaction, following ACID. The documentation doesn't clearly show how to achieve it.
It should simply require you to set 'EXPECT_GEN_EQUAL' in the write policy, so that the generation of the record is checked prior to applying the write transaction. If the generation doesn't match, you will get an error back and the server will tick the fail_generation stat. The generation is an internal simple counter metadata on a per record basis that gets incremented every time the record is updated.
You of course would need to then first read the record in order to get its current generation.
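The read-then-conditional-write cycle can be sketched with a small in-memory model. This is not the Aerospike client API: `RecordStore`, `write_if_generation`, `GenerationMismatch`, and `increment_balance` are hypothetical names used only to illustrate the generation check, and the retry loop is one possible way to react to a mismatch.

```python
from typing import Dict, Tuple


class GenerationMismatch(Exception):
    """Raised when the stored generation differs from the expected one."""


class RecordStore:
    """In-memory model of records with a per-record generation counter."""

    def __init__(self) -> None:
        self._records: Dict[str, Tuple[int, dict]] = {}

    def read(self, key: str) -> Tuple[int, dict]:
        generation, bins = self._records.get(key, (0, {}))
        return generation, dict(bins)

    def write_if_generation(self, key: str, bins: dict, expected: int) -> None:
        current, _ = self._records.get(key, (0, {}))
        if current != expected:
            raise GenerationMismatch(f"expected {expected}, found {current}")
        # Every successful write increments the generation.
        self._records[key] = (current + 1, dict(bins))


def increment_balance(store: RecordStore, key: str, amount: int,
                      max_retries: int = 5) -> int:
    """Read, modify, and write back only if nobody else wrote in between."""
    for _ in range(max_retries):
        generation, bins = store.read(key)
        bins["balance"] = bins.get("balance", 0) + amount
        try:
            store.write_if_generation(key, bins, generation)
            return bins["balance"]
        except GenerationMismatch:
            continue  # someone else updated the record; re-read and retry
    raise RuntimeError("too many concurrent updates")


store = RecordStore()
new_balance = increment_balance(store, "acct", 10)
```

A write carrying a stale generation is rejected rather than silently overwriting the newer value, which is the property the check-and-set pattern relies on.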
You aren't describing what your operations are. Are you aware that in Aerospike you can do multiple operations on the same record in a single transaction, all under the same record lock? This is the operate() method in all the language clients. For the Go client, it's documented at

How to consistently track all new rows in a SQL database table

What I am trying to do
I am developing a web service, which runs in multiple server instances, all accessing the same RDBMS (PostgreSQL). While the database is needed for persistence, it contains very little data, which is why every server instance has a cache of all the data. Further the application is really simple in that it only ever inserts new rows in rather simple tables and selects that data in a scheduled fashion from all server instances (no updates or changes... only inserts and reads).
The way it is currently implemented
Basically, I have a table which roughly looks like this:
-- further data columns...
The server is doing something like this every couple of seconds (pseudocode):
get all rows with creation_timestamp > lastMaxTimestamp
lastMaxTimestamp = max timestamp for all data just retrieved
insert new rows into application cache
The issue I am running into
The application skips certain rows when updating the caches. I analyzed the issue and figured out that the problem arises in the following way:
1. One server instance creates a new row in the context of a transaction. An id for the new row is retrieved from the associated sequence (id=n) and the creation_timestamp (with value ts_1) is set.
2. Another server does the same in the context of a different transaction. The new row in this transaction gets id=n+1 and a creation_timestamp ts_2 (where ts_1 < ts_2).
3. Transaction 2 finishes before transaction 1.
4. One of the servers executes a "select all rows with creation_timestamp > lastMaxTimestamp". It gets row n+1, but not row n. It sets lastMaxTimestamp to ts_2.
5. Transaction 1 completes.
6. Some time later the server from step 4 executes "select all rows with creation_timestamp > lastMaxTimestamp" again. But since lastMaxTimestamp=ts_2 and ts_2>ts_1, row n will never be read on that server.
Note: CURRENT_TIMESTAMP has the same value during a transaction, which is the transaction start time.
So the application gets inconsistent data into its cache and can't get new rows based on the insertion timestamp OR based on the sequence id. Transaction isolation levels don't really change anything about the situation, since the problem is created in essence by transaction 2 finishing before transaction 1.
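The sequence of events can be reproduced with a small simulation. This is a sketch, not database code: the `rows` list stands in for the rows that are committed and therefore visible to readers, `poll` mirrors the pseudocode above, and the ids and timestamps (1, 2, 100, 101) are made-up values.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (id, creation_timestamp)

# Rows become visible to readers only once their transaction commits.
rows: List[Row] = []


def poll(last_max_ts: int) -> Tuple[List[Row], int]:
    """Select rows with creation_timestamp > last_max_ts and advance the mark."""
    new_rows = [r for r in rows if r[1] > last_max_ts]
    if new_rows:
        last_max_ts = max(r[1] for r in new_rows)
    return new_rows, last_max_ts


# Transaction 1 takes id=1 with timestamp 100 but has not committed yet.
# Transaction 2 takes id=2 with timestamp 101 and commits first.
rows.append((2, 101))
first, last_max = poll(0)

# Transaction 1 commits afterwards, with its earlier timestamp.
rows.append((1, 100))
second, last_max = poll(last_max)

seen = {r[0] for r in first + second}
missed = {r[0] for r in rows} - seen
```

After both polls, `missed` contains the id of the row from transaction 1, because its timestamp is already below the high-water mark when it becomes visible.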
My question
Am I missing something? I am thinking there must be a straightforward way to get all new rows from an RDBMS, but I can't come up with a simple solution... at least not one that is consistent. Extensive locking (e.g. of tables) wouldn't be acceptable for performance reasons. Simply trying to ensure that I get all ids from the sequence seems a) complicated and b) not easily doable, since transactions can be rolled back (which leaves sequence ids unused).
Does anyone have a solution?
After a lot of searching, I found the right keywords to google for: "transaction commit timestamp" leads to all sorts of material on transaction timestamp tracking and system columns like xmin.
This post has some more detailed information:
Questions about Postgres track_commit_timestamp (pg_xact_commit_timestamp)
In short:
You can turn on a PostgreSQL option (track_commit_timestamp) to track the timestamps of commits and compare those instead of the current_timestamp/clock_timestamp values taken inside the transaction.
It seems, though, that the timestamp is only tracked when a transaction is completed, not when it is committed, which makes the solution not bulletproof. There are also further issues to consider, for example transaction id (xmin) rollover.
Logical decoding / replication is something to look into for a proper solution.
Thanks to everyone trying to help me find an answer. I hope this summary is useful to someone in the future.

NiFi GenerateTableFetch does not store state per

I am testing out NiFi to replace our current ingestion setup, which imports data from multiple MySQL shards of a table and stores it in HDFS.
I am using GenerateTableFetch and ExecuteSQL to achieve this.
Each incoming flow file has an attribute which is used by DBCPConnectionPoolLookup to select the relevant shard.
The issue is this: say I have 2 shards to pull data from, shard_1 and shard_2, for the table accounts, with updated_at as the Maximum Value Columns setting. NiFi does not store state for table#updated_at per shard; there is only 1 entry per table in state.
When I check Data Provenance, I see the shard_2 flow file getting dropped without being passed to ExecuteSQL. My guess is that the shard_1 query gets executed first, and when the shard_2 query comes, its records are checked against shard_1's updated_at; since that returns nothing, the file is dropped.
Has anyone faced this issue? Or am I missing something?
The ability to choose different databases via DBCPConnectionPoolLookup was added after the scheme to store state in the database fetch processors (e.g. QueryDatabaseTable, GenerateTableFetch). Also, getting the database name differs between RDBMS drivers: it might be in the DatabaseMetaData or ResultSetMetaData, possibly in getCatalog() or getSchema(), or in neither.
I have written NIFI-5590 to cover this improvement.

The order of records in a regularly updated bigquery database

I am going to be maintaining a local copy of a database on bigquery. I will be using the API and tabledata:list. This database is not my own, and is regularly updated by the maintainers by appending new data (say every hour).
First, can I assume that when this data is appended, it will definitely be added to the end of the database?
Now, let's assume that currently the database has 1,000,000 rows and I am now downloading all of these by paging through tabledata:list. Also, let's assume that the database is updated partway through (with 10,000 rows). By using the page tokens, can I be assured that I will only download the 1m rows present when I started in the order they are in in the database?
Finally, now let's say that I come to update my copy. If I initiate the tabledata:list with a startIndex of 1,000,000 and I use a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?
I suppose all these questions boil down to whether bigquery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.
As there is a column whose values are unique, and I can perform a simple select count(1) from table to get the length of the table, I can of course check that my local copy is complete by comparing the length of my local db with that of the remote, however if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy as the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.
When you append data, we will append to the end of the table data list. However, BigQuery may periodically coalesce data, which does not respect ordering. We have been discussing being able to preserve the ordering, or at least have a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.
If you use page tokens, you are assured of a stable listing. If the table gets updated in the middle of paging through the data, you'll still only see the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.
Reading from a startIndex in this way should work as long as no coalesce has occurred since you last updated your copy.
You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.
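The paging loop can be sketched as follows. This is a minimal sketch, not the BigQuery client library: `fetch_page` is a hypothetical callable that wraps one tabledata:list request and returns a dict with a `rows` list and, while more pages remain, a `pageToken`. `make_fake_fetch` is an invented in-memory stand-in so the loop can run without network access.

```python
from typing import Callable, Dict, List, Optional

Page = Dict[str, object]


def list_all_rows(fetch_page: Callable[[Optional[str]], Page]) -> List[object]:
    """Follow page tokens until the listing is exhausted."""
    rows: List[object] = []
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        rows.extend(page.get("rows", []))
        token = page.get("pageToken")
        if not token:
            return rows


def make_fake_fetch(snapshot: List[object], page_size: int):
    """Hypothetical stand-in for the API: pages over a fixed snapshot."""

    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token else 0
        end = start + page_size
        page: Page = {"rows": snapshot[start:end]}
        if end < len(snapshot):
            page["pageToken"] = str(end)
        return page

    return fetch_page


snapshot = list(range(25))
downloaded = list_all_rows(make_fake_fetch(snapshot, page_size=10))
```

The fake pages over a fixed snapshot, which mirrors the stable listing that page tokens are described as providing.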

Real time analytic processing system design

I am designing a system that should analyze a large number of user transactions and produce aggregated measures (such as trends).
The system should be fast, robust and scalable.
The system is Java-based (on Linux).
The data arrives from a system that generates log files (CSV-based) of user transactions.
That system generates a file every minute. Each file contains the transactions of different users (sorted by time) and may contain thousands of users.
A sample data structure for a CSV file:
10:30:01,user 1,...
10:30:01,user 1,...
10:30:02,user 78,...
10:30:02,user 2,...
10:30:03,user 1,...
10:30:04,user 2,...
The system I am planning should process the files and perform some analysis in real-time.
It has to gather the input, send it to several algorithms and other systems, and store the computed results in a database. The database does not hold the actual input records, only high-level aggregated analysis of the transactions, for example trends.
The first algorithm I am planning to use needs at least 10 records per user to work best; if it cannot find 10 records after 5 minutes, it should use whatever data is available.
I would like to use Storm for the implementation, but I would prefer to leave this discussion in the design level as much as possible.
A list of system components:
1. A task that monitors incoming files every minute.
2. A task that reads each file, parses it, and makes it available to other system components and algorithms.
3. A component that buffers 10 records per user (for no longer than 5 minutes). When 10 records are gathered, or 5 minutes have passed, it is time to send the data to the algorithm for further processing.
Since the requirement is to supply at least 10 records to the algorithm, I thought of using Storm fields grouping (which means the same task gets called for the same user) and tracking the collection of 10 records per user inside the task. Of course, I plan to have several of these tasks, each handling a portion of the users.
There are other components that work on a single transaction. For them I plan on creating other tasks that receive each transaction as it gets parsed (in parallel to the other tasks).
I need your help with #3.
What are the best practices for designing such a component?
It obviously needs to maintain the data for 10 records per user.
A key-value map may help. Is it better to have the map managed in the task itself or to use a distributed cache?
For example Redis, a key-value store (I have never used it before).
Thanks for your help
I have worked with Redis quite a bit, so I'll comment on your idea of using it.
#3 has 3 requirements:
Buffer per user
Buffer for 10 records
Should expire every 5 min
1. Buffer per user:
Redis is just a key-value store. Although it supports a wide variety of data types, they are always values mapped to a STRING key. So you should decide how to identify a user uniquely if you need a per-user buffer, because Redis never returns an error when you overwrite a key with a new value. One solution might be to check for existence before writing.
2. Buffer for 10 records: You can certainly implement a queue in Redis, but restricting its size is left to you, e.g. using LPUSH and LTRIM, or using LLEN to check the length and decide whether to trigger your process. The key associated with this queue should be the one you decided on in part 1.
3. Buffer expires in 5 min: This is the toughest task. In Redis, every key can have an expiry, irrespective of the data type of its value. But the expiry process is silent: by default you are not notified when a key expires, so you will silently lose your buffer if you rely on this property. One workaround is to keep an index that maps a timestamp to the keys that need to be expired at that timestamp. Then, in the background, you can read the index every minute, manually delete each key out of Redis [after reading it], and call your desired process with the buffer data. For such an index you can look at Sorted Sets, where the timestamp is the score and the set members are the keys [the unique per-user key decided in part 1, which maps to a queue] you wish to delete at that timestamp. You can use ZRANGEBYSCORE to read all set members within a specified timestamp range.
Use a Redis List to implement a queue.
Use LLEN to make sure you do not exceed your limit of 10.
Whenever you create a new list, add an entry to the index [Sorted Set] with the score set to the current timestamp + 5 min and the value set to the list's key.
When LLEN reaches 10, remember to read the list, then remove its key from the index [Sorted Set] and from the db [delete the key->list]. Then trigger your process with the data.
Every minute, generate the current timestamp, read the index, and for every key that is due, read the data, then remove the key from the db and trigger your process.
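The steps above can be sketched with an in-memory model. This is not Redis client code: the `UserBuffer` class and its method names are hypothetical, a plain dict of lists stands in for the Redis Lists, and a dict of expiry timestamps stands in for the Sorted Set index. The comments name the Redis commands each step would correspond to.

```python
from typing import Dict, List, Optional


class UserBuffer:
    """In-memory model of the List + Sorted Set index design."""

    def __init__(self, max_records: int = 10, ttl_seconds: int = 300) -> None:
        self.max_records = max_records
        self.ttl_seconds = ttl_seconds
        self.lists: Dict[str, List[str]] = {}  # models one Redis List per user
        self.index: Dict[str, int] = {}        # models the Sorted Set: key -> score

    def add(self, user: str, record: str, now: int) -> Optional[List[str]]:
        key = f"buffer:{user}"
        if key not in self.lists:
            # New list: register its expiry in the index (like ZADD).
            self.lists[key] = []
            self.index[key] = now + self.ttl_seconds
        self.lists[key].append(record)          # like RPUSH
        if len(self.lists[key]) >= self.max_records:  # like LLEN
            return self._flush(key)
        return None

    def _flush(self, key: str) -> List[str]:
        # Read the data, then drop the key from the index and the store.
        self.index.pop(key, None)               # like ZREM
        return self.lists.pop(key)              # like LRANGE followed by DEL

    def sweep(self, now: int) -> Dict[str, List[str]]:
        # Periodic job: flush every buffer whose expiry is due (like ZRANGEBYSCORE).
        due = [k for k, expires_at in self.index.items() if expires_at <= now]
        return {k: self._flush(k) for k in due}


buf = UserBuffer()
full_batch = None
for i in range(10):
    full_batch = buf.add("user1", f"tx{i}", now=i)
for i in range(3):
    buf.add("user2", f"tx{i}", now=0)
too_early = buf.sweep(now=299)
expired = buf.sweep(now=300)
```

In a real deployment the read-then-delete steps would need to be atomic (for example by running them in a single transaction or script), which this single-threaded model does not address.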
This is how I might implement it. There may be a better way to model your data in Redis.
For your requirements 1 & 2: [Apache Flume or Kafka]
For your requirement #3: [Esper Bolt inside Storm. In Redis for accomplishing this you will have to rewrite the Esper Logic.]