Datamodel design for an application using Redis - redis

I am new to redis and I am trying to figure out how redis can be used.
So please let me know if this is a right way to build an application.
I am building an application which has got only one data source. I am planning to run a job on nightly basis to get data into a file.
Now I have a front end application, that needs to render this data in different formats.
Example application use case
Download processed applications by a university on nightly basis.
Display how many applications got approved or rejected.
Display number of applications by state.
Let user search for an application by application id.
Instead of using postgres/mysql like relational database, I am thinking about using redis. I am planning to store data in following ways.
Application id -> Application details
State -> List of application ids
Approved -> List of application ids (By date ?)
Declined -> List of application ids (By date ?)
Is this correct way to store data into redis?
Also if someone queries for all applications in california for a certain date,
I will be able to pull application ids in one call but to get details for each application, do I need to make another request?

Word of caution:
Instead of using postgres/mysql like relational database, I am thinking about using redis.
Why? Redis is an amazing database, but don't use the right hammer for the wrong nail. Use Redis if you need real time performance at scale, but don't try make it replace an RDBMS if that's what you need.
Answer:
Fetching data efficiently from Redis to answer your queries depends on how you'll be storing it. Therefore, to determine the "correct" data model, you first need to define your queries. The data model you proposed is just a description of the data - it doesn't really say how you're planning to store it in Redis. Without more details about the queries, I would store the data as follows:
Store the application details in a Hash (e.g. app:<id>)
Store the application IDs in a per state in Set (e.g. apps:<state>)
Store the approved/rejected applications in two Sorted Sets, the id being the member and the date being the score
Also if someone queries for all applications in california for a certain date, I will be able to pull application ids in one call but to get details for each application, do I need to make another request?
Again, that depends on the data model but you can use Lua scripts to embed this logic and execute it in one call to the database.

First of all you can use a Hash to store structured Data. With Lists (ZSets) and Sets you can create indexes for an ordered or unordered access. (Depending on your requirements of course. Make a list of how you want to access your data).
It is possible to get all data as json of an index in one go with a simple redis script (example using an unordered set):
local bulkToTable = function(bulk)
local retTable = {};
for index = 1, #bulk, 2 do
local key = bulk[index];
local value = bulk[index+1];
retTable[key] = value;
end
return retTable;
end
local functionSet = redis.call("SMEMBERS", "app:functions")
local returnObj = {} ;
for index = 1, #functionSet, 1 do
returnObj[index] = bulkToTable(redis.call("HGETALL", "app:function:" .. functionSet[index]));
returnObj[index]["functionId"] = functionSet[index];
end
return cjson.encode(returnObj);
more information about redis scripts see here : http://www.redisgreen.net/blog/intro-to-lua-for-redis-programmers/

Related

How to properly store a JSON object into a Table?

I am working on a scenario where I have invoices available in my Data Lake Store.
Invoice example (extremely simplified):
{
"business_guid":"b4f16300-8e78-4358-b3d2-b29436eaeba8",
"ingress_timestamp": 1523053808,
"client":{
"name":"Jake",
"age":55
},
"transactions":[
{
"name":"peanut",
"amount":100
},
{
"name":"avocado",
"amount":2
}
]
}
All invoices are stored in ADLS, and can be queried. But, It is my desire to provide access to the same data inside an ALD DB.
I am not an expert on unstructed data: I have RDBMS background. Taking that into consideration, I can only think of 2 possible scenarios:
2/3 tables - invoice, client (could be removed) and transaction. In this scenario, I would have to create an invoice ID to be able to build relationships between those tables
1 table - client info could be normalized into invoice data. But, transactions could (maybe) be defined as an SQL.ARRAY<SQL.MAP<string, object>>
I have mainly 3 questions:
What is the correct way of doing so? Solution 1 seems much better structured.
If I go with solution 1, how do I properly create an ID (probably GUID)? Is it acceptable to require ID creation when working with ADL?
Is there another solution I am missing here?
Thanks in advance!
This type of question is a bit like do you prefer your sauce on the pasta or next to the pasta :). The answer is: it depends.
To answer your 3 questions more seriously:
#1 has the benefit of being normalized that works well if you want to operate on the data separately (e.g., just clients, just invoices, just transactions) and want to the benefits of normalization, get the right indexing, and are not limited by the rowsize limits (e.g., your array of map needs to fit into a row). So I would recommend that approach unless your transaction data is always small and you always access the data together and mainly search on the column data.
U-SQL per se has no understanding of the hierarchy of the JSON document. Thus, you would have to write an extractor that turns your JSON into rows in a way that it either gives you the correlation of the parent to the child (normally done by stepwise downwards navigation with cross apply) and use the key value of the parent data item as the foreign key, or have the extractor generate the key (as int or guid).
There are some sample JSON extractors on the U-SQL GitHub site (start at http://usql.io) that can get you started with the JSON to rowset conversion. Note that you will probably want to optimize the extraction at some point to be JSON Reader based so you process larger docs without loading it into memory.

How to get multiple data from gemfire cacheloader?

We are going to implement gemfire for our project. We are currently syncing gemfire cache with our DB2 database. So, we are facing issue while putting DB data into cache.
To put DB data into region. I have implement com.gemstone.gemfire.cache.CacheLoader and override load method of it. As written in java doc load method will return only one Object. But for our requirement we will have to return multiple VO from load method
public List<CmDvceInvtrGemfireBean> load(LoaderHelper<CmDvceInvtrGemfireBean, CmDvceInvtrGemfireBean> helper)
throws CacheLoaderException
While returining multiple VO in form of List<CmDvceInvtrGemfireBean> gemfire region consider it's as single value.
So, when i invoke,
System.out.println("return COUNT" + cmDvceInvtrRecord.query("SELECT COUNT(*) FROM /cmDvceInvtrRecord"));
It return count of one. But i can see total 7 number of data into it.
So, I want to implement the kind of mechanism that will put all the 7 values as a separate VO in Region
Is there any way to do this using Gemfire CacheLoader?
A CacheLoader was meant to load a value only for a single entry in the GemFire Region on a cache miss. As the Javadoc states...
..creates the value for the desired key..
While a key can map to a multi-valued (e.g. an array/Collection) value, the CacheLoader can only populate a single entry.
You will have to resort to other means of populating the cache with multiple "entries" in a single operation.
Out of curiosity, why do you need (requirement?) to load multiple entries (from the DB) at once? Are you trying to minimize the number of round trips to the DB?
Also, what logic are you using to decide what VO from the DB will be loaded based on the information (i.e. key) provided in the CacheLoader?
For instance, are you somehow trying to predictably select values from the DB based on the CacheLoader key that would subsequently minimize cache misses on future Region.get(key) calls?
Sorry, I don't have a better answer for you right now, but answers to some of these questions may help me give you some ideas for alternatives.
Cheers,
John

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on neo4j and php with about 200k nodes, which every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity like users, companies which holds the nodes of that property. So inside users index resides 130K nodes, and the rest on companies.
With Cypher we quering like this.
START u=node:users('id:*')
RETURN count(u)
And the results are
Returned 1 row.Query took 4080ms
The Server is configured as default with a little tweaks, but 4 sec is too for our needs. Think that the database will grow in 1 month 20K, so we need this query performs very very much.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if is possible to tweak this.
Thanks a lot and sorry for my poor english.
Finaly, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the lucene index to get "aproximate" rows.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you just need this single query most of the time, why not update an integer with the total count on some node somewhere, and maybe update that together with the index insertions, for good measure run an update with the query above every night on it?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are done guarded by write locks:
Transaction tx = db.beginTx();
try {
...
...
tx.acquireWriteLock( countingNode );
countingNode.setProperty( "user_count",
((Integer)countingNode.getProperty( "user_count" ))+1 );
tx.success();
} finally {
tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. In stead, do it like this :
company1-[:IS_ENTITY]->companyentity
Or if you are using 2.0
company1:COMPANY
The second would also allow you automatically update your index in a separate background thread by the way, imo one of the best new features of 2.0
The first method should also proof more efficient, since making a "hop" in general takes less time than reading a property from a node. It does however require you to create a separate index for the entities.
Your queries would look like this :
v2.0
MATCH company:COMPANY
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITIY]->entity
RETURN count(company)

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models on a django/python shell.

Searching Authorize.net CIM Records

Has anyone come up with an elegant way to search data stored on Authorize.net's Customer Information Manager (CIM)?
Based on their XML Guide there doesn't appear to be any search capabilities at all. That's a huge short-coming.
As I understand it, the selling point for CIM is that the merchant doesn't need to store any customer information. They merely store a unique identifier for each and retrieve the data as needed. This may be great from a PCI Compliance perspective, but it's horrible from a flexibility standpoint.
A simple search like "Show me all orders from Texas" suddenly becomes very complicated.
How are the rest of you handling this problem?
The short answer is, you're correct: There is no API support for searching CIM records. And due to the way it is structured, there is no easy way to use CIM alone for searching all records.
To search them in the manner you describe:
Use getCustomerProfileIdsRequest to get all the customer profile IDs you have stored.
For each of the CustomerProfileIds returned by that request, use getCustomerProfileRequest to get the specific record for that client.
Examine each record at that time, looking for the criterion you want, storing the pertinent records in some other structure; a class, a multi-dimensional array, an ADO DataTable, whatever.
Yes, that's onerous. But it is literally the only way to proceed.
The previously mentioned reporting API applies only to transactions, not the Customer Information Manager.
Note that you can collect the kind of data you want at the time of recording a transaction, and as long as you don't make it personally identifiable, you can store it locally.
For example, you could run a request for all your CIM customer profile records, and store the state each customer is from in a local database.
If all you store is the state, then you can work with those records, because nothing ties the state to a specific customer record. Going forward, you could write logic to update the local state record store at the same time customer profile records are created / updated, too.
I realize this probably isn't what you wanted to hear, but them's the breaks.
This is likely to be VERY slow and inefficient. But here is one method. Request an array of all the customer Id's, and then check each one for the field you want... in my case I wanted a search-by-email function in PHP:
$cimData = new AuthorizeNetCIM;
$profileIds = $cimData->getCustomerProfileIds();
$profileIds = $cimData->getCustomerProfileIds();
$array = $profileIds->xpath('ids');
$authnet_cid = null;
/*
this seems ridiculously inefficient...
gotta be a better way to lookup a customer based on email
*/
foreach ( $array[0]->numericString as $ids ) { // put all the id's into an array
$response = $cimData->getCustomerProfile($ids); //search an individual id for a match
//put the kettle on
if ($response->xml->profile->email == $email) {
$authnet_cid = $ids;
$oldCustomerProfile = $response->xml->profile;
}
}
// now that the tea is ready, cream, sugar, biscuits, you might have your search result!
CIM's primary purpose is to take PCI compliance issues out of your hands by allowing you to store customer data, including credit cards, on their server and then access them using only a unique ID. If you want to do reporting you will need to keep track of that kind of information yourself. Since there's no PCI compliance issues with storing customer addresses, etc, it's realistic to do this yourself. Basically, this is the kind of stuff that needs to get flushed out during the design phase of the project.
They do have a new reporting API which may offer you this functionality. If it does not it's very possible it will be offered in the near future as Authnet is currently actively rolling out lots of new features to their APIs.