How to handle "private" data in Elasticsearch via an API

I need some input from anyone who has experience with or knowledge of how Elasticsearch and APIs work.
I have a (very large) database with a lot of data for a lot of different items.
I need to make all of this data searchable through a public API, so that anyone can use it and query the API for data about specific items. I already have Elasticsearch up and running, and have populated an index with all of the data from the database. Both Elasticsearch and the API are working fine.
The challenge I now face is that some of the data in our database is "private" data which must not be publicly searchable. At the same time, this private data must be searchable internally, which means that I need to make the API run in both a public mode and a private (user-authenticated) mode. When a client that has not been authenticated queries the API for some data, the client should only get the public items, whereas the private (user-authenticated) client should get all possible results.
I don't have a problem with the items where all of the data for an item must not be publicly available. I can simply mark them with a flag and make sure that Elasticsearch never returns them to public clients through the API.
The challenge occurs when part of the data for an item is private and part is public. I have thought about stripping off the private data before returning results to a (public) client. This way the private data is not available directly through the API, but it still leaks indirectly. Suppose a client searches for a term that only occurs in a private field. If I strip the private fields from the search result before returning it, the client still gets the document back, revealing that the document was a hit for that query. The query string appears nowhere in the returned document, so the client can infer that the term is somehow associated with the document, and that the association is of a sensitive/private nature.
I have thought about creating two different indices: one that has all the data for all the objects (the private index), and one that has only the publicly available data (where I have stripped the sensitive parts from all the documents). This would work and would be a fairly easy solution to implement, but the downside is that I now have duplicated data in two indices.
Any ideas?

From your description, you clearly need two distinct views of your data:
PUBLIC: a subset of the documents in the collection, with certain fields that should be neither searched nor returned.
PRIVATE: the entire collection, with all fields searchable and visible.
You can accomplish two distinct views of the data by either having:
(1) One index / two queries, one public and one private (you can either implement this yourself, or have Shield manage it opaquely for you).
(2) Two indices / two queries (one public, one private).
In the first case, your public query will filter out private documents as you mention, and will only search and return the publicly visible fields, while the private query will not filter and will search and return all fields. A rough sketch of such a public query follows.
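A minimal sketch, using the Python client's classic body style; the index and field names ("items", "title", "is_private", "private_notes") are placeholders for whatever your mapping actually uses:

from elasticsearch import Elasticsearch

es = Elasticsearch()

def public_search(query_text):
    return es.search(index="items", body={
        "query": {
            "bool": {
                # Only match against public fields...
                "must": {"match": {"title": query_text}},
                # ...and drop documents flagged as fully private.
                "filter": {"term": {"is_private": False}},
            }
        },
        # Hide private fields from the returned hits.
        "_source": {"excludes": ["private_notes"]},
    })

Note that _source filtering only hides fields from the response; to keep private fields from matching at all, the query itself must be restricted to public fields, which is exactly the leak described in the question.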
In the second case, you would actually index your data into two separate indices, and explicitly have the public query run against the public index (containing only the public fields), and the private query run against the private index.
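The indexing side of this dual-index option could look roughly like this; the index names and the set of private fields are again assumptions for illustration:

PRIVATE_FIELDS = {"private_notes", "internal_rating"}  # hypothetical field names

def index_item(es, doc_id, doc):
    # es is an elasticsearch.Elasticsearch client.
    # The full document goes into the private index...
    es.index(index="items_private", id=doc_id, body=doc)
    # ...and a stripped copy into the public index, so private fields
    # are neither searchable nor returnable there.
    public_doc = {k: v for k, v in doc.items() if k not in PRIVATE_FIELDS}
    es.index(index="items_public", id=doc_id, body=public_doc)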
It's true that you can build a mechanism (or use Shield) to accomplish what you need on top of a single index. However, you might want to consider option (2), the public/private indices, if:
You want to reduce the risk of inadvertently exposing your sensitive data through an oversight, or configuration change.
You'd like to reduce the coupling between the public features of your application and the private features of your application.
You anticipate the scaling characteristics of public usage to deviate significantly from private usage.
As an example of this last point, most freemium sites have a very skewed distribution of paying vs non-paying users (say 1 in 10 for the sake of argument).
Not only will you likely need to aggressively replicate your public index at scale, but by stripping your public documents of private fields, you'll also proportionately reduce the resources needed to manage your public shards (and replicas).
This brings up the question of data duplication. In most systems, the search index is not "the system of record" (see discussion). Search indices are more typically used like an external database index, or perhaps a materialized view. Having duplicate data in this scenario is less of an issue when there is a durable backing store representing the latest state.
If, for whatever reason, you are relying on Elasticsearch as "the system of record", then the dual-index route is somewhat trickier: you'll need to choose one index (probably the private one) to represent the ground truth, and then treat the other (the public index) as a downstream view of the private data.

Related

Static instance in the class in distributed system

I was reading a blog post on "What about the code? Why does the code need to change when it has to run on multiple machines?".
I came across one line that I don't understand. Can anyone help me understand it with a simple example?
"There should be no static instances in the class. Static instances hold application data and when a particular server goes down, all the static data/state is lost. The app is left in an inconsistent state."
Assuming: a static instance is an instance that can exist at most once per process or context. E.g. in Java there is at most one copy of a static class per JVM, with all the data (or state) that the class contains.
So the memory model for a static class in a single node/JVM/process is very simple. Since there is a single copy of the data, it is quite straightforward to reason about: one may update the data, and every subsequent reader will see the updated information. This gets a bit more complicated for multithreaded programs, but it is still straightforward compared to distributed systems.
Clearly, in a distributed system every node may have at most one static class with state, which means that if the system contains several nodes, there are several copies of the data.
Having several copies is a problem. It is hard to reason about such a system: every node may hold some unique data, and the data may differ from node to node. How is it synced? What about availability vs. consistency?
For example, take a simple counter. In a single-node system, a static instance may keep the count: if one writer increases the counter, the next reader will see the increased value (assuming the multithreaded part is implemented correctly, which is not that complicated).
The same counter in a distributed system is much more complicated: a writer may write to one node, but a reader may read from another.
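A minimal Python sketch of the counter problem, using separate OS processes to stand in for separate nodes:

from multiprocessing import Process

class Counter:
    value = 0  # "static" state: one copy per process, not one per system

def writer():
    Counter.value += 1
    print("writer sees:", Counter.value)  # prints 1

def reader():
    print("reader sees:", Counter.value)  # prints 0: a different process

if __name__ == "__main__":
    for target in (writer, reader):
        p = Process(target=target)
        p.start()
        p.join()

The reader never observes the writer's increment; two JVMs on different machines holding static state behave the same way.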
Basically, having state on nodes is a hard problem to solve. This is the primary reason to use a distributed storage layer, e.g. HBase, Cassandra, or AWS DynamoDB. All of these storage systems have predictable behaviour, which helps with reasoning about the correctness of programs.
For example, say there are just two servers which accept payments from clients.
Then somebody decided to create a static class to be friendly with multithreading:
public static class Payment
{
    public static decimal Amount;
    public static bool IsMoneyReceived;
    public static string UserName;
}
Then some client, let's call him John, decided to buy something in the shop. John sent money, and the static class holds the data about this purchase. Some service is about to write the data from the Payment class into the database; however, the electricity was turned off. The load balancer knows that the server is not responding and redirects John's requests to another server, which knows nothing about the data in the Payment class.

Choosing a hash function for API call to determine when to update

Context:
In the design of our application, for certain frequently-used APIs, the response will be big (~3-5MB). For example, an API call to get all the profiles of 1000 users.
Moreover, more often than not, the response will stay relatively the same, i.e. unchanged. We want to save the information in the front-end store (e.g., the redux store) as a JSON object, and when the FE calls the BE to retrieve the information, we will pass in a calculated checksum: a hash value of the JSON object, let's say using the MD5 function. The BE will calculate the hash value of the response using MD5 as well, and will only return the response if the hash values differ. Otherwise, it will return something like HTTP status OK.
I wonder what the most appropriate hash function for this type of operation would be, and what the criteria are for choosing one. From what I've searched, there seems to be no universal answer. Indeed, it should be fast, but I feel the time to calculate a hash value is negligible compared to other database operations. The chance of collisions is negligible as well.
Any hash function with a low collision probability is workable for this application. You're not using the hash to secure your data against modification.
That being said, you should avoid MD5 for code-reviewer and boss reasons. It's no longer good for security, and some people don't like seeing it used in any new code.
SHA-224 (from the SHA-2 family) should be fine for performance, security, and boss satisfaction.
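Whichever function you pick, the subtle part is serializing the JSON deterministically, so that the FE and BE hash the same bytes. A minimal Python sketch (SHA-224 via hashlib; the canonicalization choices are assumptions):

import hashlib
import json

def response_checksum(payload):
    # The same logical object must always produce the same bytes:
    # sort keys and fix separators before hashing.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha224(canonical.encode("utf-8")).hexdigest()

In practice it can be simpler to hash the exact serialized response body the BE sends, and have the FE store that hash rather than recompute it from the parsed object.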

Implementing a RMW operation in Redis

I would like to maintain comma-separated lists of entries of the form <ip>:<app>, indexed by an account ID. There would be one such list per user, indexed by their account ID, with the number of users in the millions. This is mainly to track which server in a cluster a user of a certain application is connected to.
Since all servers are written in Java, with Redisson I'm currently doing:
RSet<String> set = client.getSet(accountKey);
and then I can modify the set using the typical Java container APIs supported by Redisson. I basically need three types of updates to these comma-separated lists:
Client connects to a new application = append
Client reconnects with existing application to new endpoint = modify
Client disconnects = remove
A new connection would require a change to a field like:
1.1.1.1:foo,2.2.2.2:bar -> 1.1.1.1:foo,2.2.2.2:bar,3.3.3.3:baz
A reconnect would require an update like:
1.1.1.1:foo,2.2.2.2:bar -> 3.3.3.3:foo,2.2.2.2:bar
A disconnect would require an update like:
1.1.1.1:foo,2.2.2.2:bar -> 2.2.2.2:bar
As mentioned the fields would be keyed by the account ID of the user.
My question is the following: without using Redisson, how can I implement this "directly" on top of Redis commands? The goal is to allow rewriting certain components in a language other than Java. The cluster handles close to a million requests per second.
I'm actually quite curious how Redisson implements RSet under the hood, but I haven't had time to dig into it. I guess one option would be to use Lua, but I've never used it with Redis. Any ideas how to efficiently implement these operations on top of Redis in a manner that is easily supported by multiple languages, i.e. without relying on a specific library?
Having actually thought about the problem properly, it can be solved directly with a Redis hash: the key is the user's account ID, each <app> is a field name, and the value is the endpoint <ip> (see the sketch below).
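A rough sketch of the three updates using plain Redis hash commands (shown here with redis-py; the conns:{account_id} key naming is an assumption). Each update is a single atomic command, so no Lua script is needed:

import redis

r = redis.Redis()

def connect(account_id, app, endpoint):
    # Connect and reconnect are the same operation: HSET creates the
    # field if it is missing and overwrites it if it exists.
    r.hset(f"conns:{account_id}", app, endpoint)

def disconnect(account_id, app):
    r.hdel(f"conns:{account_id}", app)

def endpoints(account_id):
    # All current <app> -> <ip> mappings for the account.
    return r.hgetall(f"conns:{account_id}")

Since HSET, HDEL, and HGETALL are core Redis commands, any client library in any language can issue them, which removes the dependency on Redisson.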

Google App Engine: Is there any security concern with giving away datastore urlsafe entity keys in an API?

I want to give out anonymous IDs to certain entities in a public token-based API.
Is there any reason I shouldn't be using the urlsafe string of entity keys for that, since they are already anonymized (at least in my case, where I'm not using my own data to construct the key)?
Google App Engine and the Datastore are considered safe as long as I'm not handing anyone the key, which I'm not, right?
Thank you.
One part of their documentation says "...The urlsafe keyword parameter uses a websafe-base64-encoded serialized reference, but it's best to think of it as just an opaque unique string...". I think this is what you're referring to when you say it is anonymized.
But a later part of the documentation says "...The string representation of a key looks cryptic, but is not encrypted! It can be converted back to the raw key data, both kind and identifier. If you don't want to expose this data to your users (and allow them to easily guess other entities' keys), then encrypt these strings or use something else...".
You can decode the key yourself via base64 - usually there is no risk in giving it away.
The huge risk is in accepting urlsafe entity keys as parameters and using them to read from the Datastore. An attacker can trick your application into reading arbitrary data from your Datastore project. To my knowledge, this is documented nowhere.
So basically, any variant of this is a no-go in a web server:
def get(params):
    # The urlsafe key comes straight from the request and is attacker-controlled.
    key = ndb.Key(urlsafe=params.key)
    return key.get()
Any key supplied from the outside should never be used with the Datastore as-is, since you cannot be sure you are reading the kind/path you are expecting. This is basically the same scope of risk as SQL injection. One way to guard against this is sketched below.
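A hedged sketch of such a guard, using the legacy GAE NDB client; the Profile kind and the flat-key restriction are hypothetical choices for illustration:

from google.appengine.ext import ndb  # Cloud NDB exposes the same Key API

def get_profile(urlsafe_key):
    key = ndb.Key(urlsafe=urlsafe_key)
    # The key is attacker-controlled: verify the kind (and path) before
    # fetching anything from the Datastore.
    if key.kind() != "Profile" or key.parent() is not None:
        raise ValueError("unexpected key")
    return key.get()

On top of the kind check, you would typically also verify that the authenticated caller is allowed to see the fetched entity before returning it.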

When and how to assign unique id to an entity in DDD?

The best example would be a User entity which needs to be persisted. I have the following candidates for assigning a unique identifier to a user:
Assigning keys provided by the back-end (NDB, MySQL, etc.).
Hand-crafting a unique identifier through some service (like the system clock).
Using properties like emailId.
Taking the simple example of a detail view, we often display a user at some/path/users/{user_id}. If we keep emailId as the unique id, there is a chance that the user changes their email id one day, which breaks the URL.
Which is a better approach to identify the same entity?
Named UUID.
UUID, because it gives the identifier a nice predictable structure, without introducing any semantic implications (like your email id example). Think surrogate key.
Named UUID, because you want the generated id to be deterministic. Deterministic means reproducible: you can move your system to a test environment and replay commands to inspect the results.
It also gives you an extra way to detect duplicated work - what should happen in your system if a create user command is repeated (example: user POSTs the same web form twice). There are various ways that you can guard against this in your intermediate layers, but a really easy way to cover this in your persistence layer (aka in your system of record) is to put a uniqueness constraint on the id. Because running the command a second time produces a "new" user entity with the same id, the persistence layer will object to the duplication, and you can handle things from there.
Thus, you get idempotent command handling even if all of your intermediate guard layers restart during the interval between the duplicated commands.
A named UUID gives you these properties; for instance, you might build the UUID from an identifier for the type of the entity and the id of the command (the duplicated command will have the same id when it is resent), as sketched below.
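A minimal sketch of a named (v5) UUID in Python; the namespace seed and the idea of keying off the command id illustrate the approach rather than prescribe a scheme:

import uuid

# Any fixed UUID works as the namespace, as long as it is stable
# across environments; deriving it from a DNS name is conventional.
USER_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "users.example.com")

def user_id_for(command_id: str) -> uuid.UUID:
    # Deterministic: resending the same command reproduces the same id,
    # so a uniqueness constraint in the store can catch the duplicate.
    return uuid.uuid5(USER_NAMESPACE, command_id)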
You can use transient properties of the user (like the email address) as part of the seed for your named UUID, but only if you have a guarantee that the property won't ever be assigned to someone else. Are you sure vivek@stackoverflow.com won't ever be assigned to another user? If not, it's not a good seed to use.
Back-end key assignment won't detect a collision if a command is duplicated; you would need to rely on some other bit of state to detect it.
The system clock isn't a good choice, because it makes reproducing the same id difficult. A local copy of the system clock can work, if you can reproduce the updates to the local clock in your test environment. But that's a bunch of extra effort you don't want if time isn't already part of your domain model.
https://www.ietf.org/rfc/rfc4122.txt (Section 4.3)
http://www.rfc-editor.org/errata_search.php?rfc=4122&eid=1352 (Errata for the example in the spec)
Generating v5 UUID. What is name and namespace?
I agree with @VoiceOfUnreason, but only partially. We all know that UUIDs are terrible to spell and keep track of. All methods to make UUIDs incremental and meaningful resolve only parts of these issues.
An aggregate is created with some id that is already available to the creating party. Although a UUID can be generated without involving any external components, this is not the only solution. Using an external identity provider like Twitter Snowflake (now retired) is an option too.
It is not very complicated to create a very simple and reliable identity provider that returns an incrementing long value for a given aggregate type name; a sketch follows.
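As one illustration (not the only way to build such a provider), it could be a thin wrapper around an atomic counter, sketched here with Redis INCR and a hypothetical id_seq: key scheme:

import redis

r = redis.Redis()

def next_id(aggregate_type: str) -> int:
    # INCR is atomic, so concurrent callers never receive the same value.
    # Persistence/replication must be configured so the counter survives
    # a failover, since this service is now critical infrastructure.
    return r.incr(f"id_seq:{aggregate_type}")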
Surely, this increases the complexity, and it can only be justified when there is a requirement to generate sequential unique numeric values. The resilience of this service becomes very important and needs to be addressed carefully. But it can be seen as just another critical infrastructure component, and we know that every system has quite a few of those anyway.