Static instances in a class in a distributed system - OOP

I was reading blog posts on "What about the code? Why does the code need to change when it has to run on multiple machines?".
I came across one line which I am not getting; can anyone please help me understand it with a simple example?
"There should be no static instances in the class. Static instances hold application data and when a particular server goes down, all the static data/state is lost. The app is left in an inconsistent state."

Assuming: a static instance is an instance that exists at most once per process or context - e.g. in Java there is at most one copy of a static class, with all the data (or state) that the class contains.
So the memory model for a static class on a single node/JVM/process is very simple. Since there is a single copy of the data, it is quite straightforward to reason about: for example, one writer may update the data and every subsequent reader will see the updated information. This gets a bit more complicated for multithreaded programs, but is still straightforward compared to distributed systems.
In a distributed system, every node may have at most one static class with state, which means that if the system contains several nodes, there are several copies of the data.
Having several copies is a problem. It is hard to reason about such a system - every node may hold some unique data, and the data may differ from node to node. How is it synced? What about availability vs consistency?
For example, take a simple counter. In a single-node system, a static instance may hold the count. If one writer increments the counter, the next reader will see the increased value (assuming the multithreaded part is implemented correctly, which is not that complicated).
The same counter in a distributed system is much more complicated: a writer may write to one node, but a reader may read from another.
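As a minimal sketch of the single-node case (Java; the class and field names are purely illustrative), the counter is just a static field, so every JVM running this class ends up with its own independent copy:

public class PageViewCounter {
    // One copy per JVM/process. On a single node this is easy to reason about;
    // in a cluster, every node keeps its own, diverging count.
    private static final java.util.concurrent.atomic.AtomicLong COUNT =
            new java.util.concurrent.atomic.AtomicLong();

    public static long increment() {
        return COUNT.incrementAndGet(); // thread-safe within this process only
    }

    public static long current() {
        return COUNT.get();
    }
}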
Basically, keeping state on individual nodes is a hard problem. This is the primary reason to use a distributed storage layer, e.g. HBase, Cassandra, or AWS DynamoDB. All of these storage systems have predictable behaviour, which helps in reasoning about the correctness of programs.

For example, suppose there are just two servers which accept payments from clients.
Then somebody decides to create a static class to be friendly with multithreading:
public static class Payment
{
    public static decimal Amount;
    public static bool IsMoneyReceived;
    public static string UserName;
}
Then some client, let's call him John, decides to buy something in the shop. John sends money, and the static class holds the data about this purchase. Some service is about to write the data from the Payment class into the database; however, the electricity goes out. The load balancer sees that the server is not responding and redirects John's requests to another server, which knows nothing about the data in the Payment class.

How to handle "private" data in Elastic Search

I need some input from anyone who might have experience/knowledge of how ElasticSearch works and its APIs.
I have a (very large) database with a lot of data for a lot of different items.
I need to make all of this data searchable through a public API, so that anyone can use it and query the API about data for specific items. I already have ElasticSearch up & running, and have populated an index in ElasticSearch with all of the data from the database. ElasticSearch is working fine and so is the API.
The challenge I now face is that some of the data in our database is "private" data which must not be publicly searchable. At the same time this private data must be searchable internally, which means that I need to make the API run in both a public mode and a private mode (user authenticated). When a client that has not been authenticated queries the API for some data, the client should only get the public items, whereas a private (user-authenticated) client should get all possible results.
I don't have a problem with the items where all the data for one item must not be publicly available. I can simply mark them with a flag and make sure that when I return data to the client through the API they are not returned by ElasticSearch.
The challenge occurs when part of the data for an item is private and part is public. I have thought about stripping off the private data before returning the result to the (public) client. That way the private data is not available directly through the API, but it is still exposed indirectly/implicitly: if the client's query matched data of a private nature, and I strip the private fields from the result before returning it, the client still gets the document back, indicating that it was a hit for that specific query. Yet the query string is nowhere to be found in the returned document, which reveals that the query string is somehow associated with the document and that the association is of a sensitive/private nature.
I have thought about creating two different indices: one that has all the data for all the objects (the private index) and one that has only the publicly available data (where I have stripped the sensitive parts out of every document). This would work and would be a fairly easy solution to implement, but the downside is that I now have duplicated data in two indices.
Any ideas?
From your description, you clearly need two distinct views of your data:
PUBLIC: Subset of the documents in the collection, and certain fields should not be searched or returned.
PRIVATE: Entire collection, all fields searchable and visible.
You can accomplish two distinct views of the data by either having:
1. One index / Two queries, one public and one private (you can either implement this yourself, or have Shield manage this opaquely for you).
2. Two indices / Two queries (one public, one private).
In the first case, your public query will filter out private documents as you mention, and only search/return the publicly visible fields, while the private query will not filter and will search/return all fields.
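As a rough sketch of such a public query with the Elasticsearch Java API (exact builder classes depend on your client version; field names such as visibility, description and private_notes are made up for illustration), you would pair a filter on a visibility flag with source filtering that excludes the private fields:

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class PublicSearch {
    public static SearchSourceBuilder buildPublicQuery(String userInput) {
        // Only match documents flagged as public...
        BoolQueryBuilder query = QueryBuilders.boolQuery()
                .must(QueryBuilders.matchQuery("description", userInput))
                .filter(QueryBuilders.termQuery("visibility", "public"));

        // ...and exclude the private fields from the returned _source.
        return new SearchSourceBuilder()
                .query(query)
                .fetchSource(null, new String[] { "private_notes" });
    }
}

The private query would simply drop the filter clause and the source exclusions.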
In the second case, you would actually index your data into two separate indices, and explicitly have the public query run against the public index (containing only the public fields), and the private query run against the private index.
It's true that you can build a mechanism (or use Shield) to accomplish what you need on top of a single index. However, you might want to consider (2) the public/private indices option if:
You want to reduce the risk of inadvertently exposing your sensitive data through an oversight, or configuration change.
You'd like to reduce the coupling between the public features of your application and the private features of your application.
You anticipate the scaling characteristics of public usage to deviate significantly from private usage.
As an example of this last point, most freemium sites have a very skewed distribution of paying vs non-paying users (say 1 in 10 for the sake of argument).
Not only will you likely need to aggressively replicate your public index at scale, but also by stripping your public documents of private fields, you'll proportionately reduce the resources needed to manage your public shards (and replicas).
This brings up the question of data duplication. In most systems, the search index is not "the system of record" (see discussion). Search indices are more typically used like an external database index, or perhaps a materialized view. Having duplicate data in this scenario is less of an issue when there is a durable backing store representing the latest state.
If, for whatever reason, you are relying on Elasticsearch as "the system of record", then the dual-index route is somewhat trickier: you'll need to choose one index (probably the private one) to represent the ground truth, and then treat the other (public) index as a downstream view of the private data.

Apache JENA TDB files locked after creation with web application

I am using JENA to create a triple store (TDB functionality) with the following code:
public void createTDBFromOWL() {
    Dataset dataset = TDBFactory.createDataset(newTripleStoreLocation);
    dataset.begin(ReadWrite.WRITE);
    try {
        // getting the model inside the transaction
        Model model = dataset.getDefaultModel();
        FileManager fileManager = FileManager.get();
        Model holder = fileManager.readModel(model, newOWLFileLocation);
        // committing dataset
        dataset.commit();
        model.close();
        holder.close();
    } finally {
        dataset.end();
        dataset.close();
    }
}
After I create the triple store, the files created are locked by my application server (Glassfish), and I can't delete them until I manually stop Glassfish and it releases its lock. As shown in the above code, I think I am closing everything, so I don't get why a lock is maintained on the files.
When you call Dataset#close(), the implementation delegates that call to an underlying DatasetGraphBase#close(), which then ultimately delegates to DatasetGraphTDB#_close().
This results in calls to TripleTable#close() and QuadTable#close(). Both of these call (several) NodeTupleTable#close(). Continuing with the indirection, this calls NodeTable#close() and TupleTable#close(). The former is an interface, so we'd need to make a proper guess as to which class is run in your implementation. The latter iterates through a collection of TupleIndex objects and calls close() on each of them. TupleIndex is, also, an interface.
There is only one meaningful hierarchy of descendants from TupleIndex that results in something which can lock a file, which leads us to TupleIndexRecord#close(). We can then follow a particular implementation of RangeIndex called BPlusTree all the way down until we see actual ownership of the MappedByteBuffer.
Ultimately, while reading the implementation of BlockAccessMapped#close(), it seems like the entire hierarchy is closing things properly, down to the final classes, but that this longstanding bug may be the culprit. From the documentation:
once a file has been mapped a number of operations on that file will
fail until the mapping has been released (e.g. delete, truncating to a
size less than the mapped area). However the programmer can't control
accurately the time at which the unmapping takes place --- typically
it depends on the processing of finalization or a PhantomReference
queue.
So there you have it. Despite Jena's best efforts, one cannot yet control when that file will be unmapped in Java. This ends up being the tradeoff for memory-mapped file IO in Java.
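For what it's worth, the behaviour described in that JDK bug is easy to reproduce in isolation. A small, platform-dependent sketch (the failed delete is most visible on Windows; the file name is arbitrary):

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedDeleteDemo {
    public static void main(String[] args) throws Exception {
        File file = new File("mapped-demo.bin");
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            MappedByteBuffer buffer =
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            buffer.put(0, (byte) 42);
        }
        // The channel is closed, but the mapping stays live until the buffer is
        // garbage collected, so the delete can fail (notably on Windows).
        System.out.println("deleted? " + file.delete());
    }
}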

Is it possible to use Bukkit for Minecraft to define a new kind of mob?

I'd like to write a Minecraft mod which adds a new type of mob. Is that possible? I see that, in Bukkit, EntityType is a predefined enum, which leads me to believe there may not be a way to add a new type of entity. I'm hoping that's wrong.
Yes, you can!
I'd direct you to some tutorials on the Bukkit forums. Specifically:
Creating a Meteor Entity
Modifying the Behavior of a Mob or Entity
Disclaimer: the first is written by me.
You cannot truly add an entirely new mob just via Bukkit; you'd have to use Spout to give it a different skin. However, if you simply want a new mob and are content with sharing the skin of another entity, it can be done.
The idea is to inject the EntityType values via Java's Reflection API. It would look something like this:
public static void load() {
    try {
        Method a = EntityTypes.class.getDeclaredMethod("a", Class.class, String.class, int.class);
        a.setAccessible(true);
        // The target object is ignored for a static method, so pass null.
        a.invoke(null, YourEntityClass.class, "Your identifier, can be anything", id_map);
    } catch (Exception e) {
        // Insert handling code here
    }
}
I think the above is fairly straightforward: we get a handle to the private registration method, make it accessible, and invoke it. id_map contains the entity id to map your entity to; 12, for example, is that of a fireball. The mapping can be found in EntityType.class. Note that these ids should not be confused with their packet designations - the two are completely different.
Lastly, you actually need to spawn your entity. MC will continue spawning the default entity, since we haven't removed it from the map, but it's just a matter of calling net.minecraft.server.spawnEntity(your_entity, SpawnReason.CUSTOM).
If you need a skin, I suggest you look into SpoutPlugin. It does require running the Spout client to join such a server, but the possibilities at that point are literally infinite.
Sadly, it would only be possible with client-side mods as well. You could look into Spout (http://www.spout.org/), a client mod which provides an API for server-side plugins to do more on the client, but without doing something client-side, this is impossible.
It's not possible to add new entities, but it is possible to edit entity behaviors. For example, one time I made it so that you could tame iron golems and they followed you around.
You can also sort of achieve custom-looking human entities by accessing player entities and tweaking network packets.
It's expensive, as you need to create a player account that then gets used to act as a mob. You then spawn a named entity and give it the same behaviour/AI as you would an existing mob. Keep in mind, however, that you will need to write the AI yourself (you could borrow code straight from CraftBukkit/Bukkit), and you will need to push the movement and events of this mob to players within sight. Technically speaking, all you're doing is pushing packets to the client from the server about what's actually happening; if you're outside that push list, nothing will happen, and other players will just see you being knocked around by an invisible something :) it's a bit of a mental leap :)
I'm using this concept to create NPCs that act as friendly and factional armies. I've also used mobs themselves as friendly entities (if you belong to a dark faction).
Personally, I'd like to see a future server API that can push model instructions to the client for a server-specific cache, as well as the ability to tell a client where to download mob skins.
It's doable today, but I'd have to create a plugin for the client to achieve this, which is then back to a game of annoyance, especially when Mojang push out a new release and all the plugins take forever to rise with its tide.
In all honesty, this entire ecosystem could be managed more strategically, but right now I think it's just really ad hoc product management (speaking as a former product manager of .NET, I'd love to work on this strategy; it would be such a fun gig).

Many-to-many relationships with objects, where intermediate fields exist?

I'm trying to build a model of the servers and applications at my workplace. A server can host many applications. An application can be hosted across many servers.
Normally I would just have the Host class contain a List&lt;Application&gt;, and the Application class a List&lt;Host&gt;. However, there are a few fields that are specific to a particular host-application relationship. For example, UsedMb represents the amount of disk space used by an application on a host.
I could, of course, have a HostedApplication class representing an intermediate object which would hold the UsedMb field. Both the Host and Application classes would then contain a List&lt;HostedApplication&gt;.
The problem, however, is that an application also needs to know about some aspects of its Host that would be included in the Host class (for example, the hosts are geographically distributed; an application needs to know how many data centres it is hosted in, so it needs to be able to check the DC names of all its hosts).
So instead I could have the HostedApplication class hold references to both the Host object and the Application object it refers to. But then in some cases I will need to loop through all applications (and in other cases, all hosts). Therefore I would need three separate lists (a List&lt;Host&gt;, a List&lt;Application&gt;, and a List&lt;HostedApplication&gt;) to be able to loop through all three as needed.
My basic question is, what is the standard way of dealing with this sort of configuration? All options have advantages and disadvantages. The last option I mentioned seems most correct, but is having three lists overkill? Is there a more elegant solution?
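To make that last option concrete, here is a rough sketch (in Java; the class and field names are only illustrative) of the kind of intermediate object I have in mind:

import java.util.ArrayList;
import java.util.List;

// Association object holding the fields specific to one host/application pair.
class HostedApplication {
    final Host host;
    final Application application;
    long usedMb; // e.g. disk space used by this application on this host

    HostedApplication(Host host, Application application, long usedMb) {
        this.host = host;
        this.application = application;
        this.usedMb = usedMb;
    }
}

class Host {
    final String dcName;
    final List<HostedApplication> hostedApps = new ArrayList<>();
    Host(String dcName) { this.dcName = dcName; }
}

class Application {
    final List<HostedApplication> deployments = new ArrayList<>();

    // The application can reach host details through the association objects,
    // e.g. to count the distinct data centres it is hosted in.
    long dataCentreCount() {
        return deployments.stream().map(d -> d.host.dcName).distinct().count();
    }
}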
Ideally I would be able to talk to you about the problem, but here is a potential solution based on my rough understanding of the requirements (C++ style, with a lot of the implementation left out):
class Host {
public:
    string GeographicLocation() const;
    string DCName() const;
};

class HostAsAppearsToClient : public Host {
    HostAsAppearsToClient(const Host&);
    // Allows Host -> HostAsAppears... conversion
    size_t UsedMB() const;
    void UseMoreMB(size_t);
};

class Client {
    HostAsAppearsToClient* hosts;

    void AddHost(const Host& host) {
        // Reallocate enough size or move a pointer or whatever
        hosts[new_index] = HostAsAppearsToClient(host);
        hosts[new_index].UseMoreMB(56);
    }

    void DoSomething() {
        hosts[index].UsedMB();
        // Gets the MB that that host is using, and all other normal host data if
        // we decide we need it ...
        print(hosts[index].DCName());
    }
};

int main() {
    Host* hosts = new Host[40];
    Client* clients = new Client[30];
    // hosts[2].UsedMB() // Doesn't allow
}
I fully expect that this does not meet your requirements, but please let me know in what way so that I can better understand your problem.
EDIT:
VBA .... unlucky :-P
It is possible to load DLLs in VBA, which would allow you to write and compile your code in any other language and just forward the inputs and outputs through VBA from the UI to the DLL, but I guess it's up to you whether that's worth it. Documentation on how to use a DLL in VBA Excel: link
Good Luck!

Class design for serialization - ideas or patterns?

Let me begin with an illustrative example (assume the implementation is in a statically typed language such as Java or C#).
Assume that you are building a content management system (CMS) or something similar. The data is hierarchically organised into Folders. Each folder has a collection of children; a child may be a Page or a Folder. All items are stored within a root folder. No cycles are allowed. We have an acyclic graph.
The system will have a remote API, and instances of Folder and Page must be serialized/de-serialized across the network. With a typical implementation of Folder, in which a folder's children are held in a List, serialization of the root node would send the entire graph. This is unacceptable for obvious reasons.
I am interested to hear how people have solved this problem in the past.
I have two potential suggestions:
1. Navigation by query: Change the domain model so that the Folder class contains only a list of IDs for each child. To access a child we must query for it. Serialisation is now trivial since the graph ends at a well-defined point. The major downside is that we lose type safety - the ID could be for something other than a folder/child.
2. Stop and re-attach: During serialization, stop whenever we detect a reference to a Folder or Page and send the ID instead. When de-serializing, we must then look up the corresponding object for each ID and re-attach it at the relevant position in the nascent object.
I don't know what kind of API you are trying to build, but your suggestion #1 sounds like it is close to what is recommended for REST style services and APIs. Basically, a Folder object would contain a list of URLs to its children.
The navigation-by-query solution was used for NFS. Reading through your question, it looks to me as if you're trying to implement a kind of file system yourself.
If you're looking specifically into sending objects over the network, there is always CORBA. Aside from that, there are DCOM and the newer WCF. But wait, there is more, like RMI. Furthermore, there are Web Services. I'll stop here now.
Suppose you model the whole tree with every element being a Node, the specialisations of Node being Folder and, umm, Leaf. You have a "root" Node. Nodes have the methods
canHaveChildren()
getChildren()
Leaf nodes have the obvious behaviours (never even need to hit the network)
A Folder's getChildren() fetches the next set of nodes.
I did devise a system with RESTful services along these lines; it seemed reasonably easy to program to.
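A minimal sketch of that Node hierarchy in Java (NodeService is a hypothetical stand-in for whatever remote call fetches a child by ID):

import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

interface NodeService {
    Node fetchNode(String id); // hypothetical remote lookup by ID
}

abstract class Node {
    final String id;
    Node(String id) { this.id = id; }
    abstract boolean canHaveChildren();
    abstract List<Node> getChildren();
}

class Leaf extends Node {                       // e.g. a Page
    Leaf(String id) { super(id); }
    boolean canHaveChildren() { return false; }
    List<Node> getChildren() { return Collections.emptyList(); } // never hits the network
}

class Folder extends Node {
    private final List<String> childIds;        // only the IDs are serialized, not the subtree
    private final NodeService service;

    Folder(String id, List<String> childIds, NodeService service) {
        super(id);
        this.childIds = childIds;
        this.service = service;
    }

    boolean canHaveChildren() { return true; }

    List<Node> getChildren() {
        // Fetch the next level on demand instead of shipping the whole graph.
        return childIds.stream().map(service::fetchNode).collect(Collectors.toList());
    }
}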
I would not do it with the navigation-by-query method, simply because I would like to stick with the domain model where folders contain folders or pages.
Customizing the serialization might also be tricky, bug-prone and difficult to change/understand.
I would suggest that you introduce an object like FolderBrowser in your model which takes an id and gives you a list of the folder's contents. That will make your service operations simpler.
Cheers,
Unmesh
The classical solution is probably to use a proxy pattern, where part of the graph is sent over the network and some of the folders are replaced by proxies that do not have their lists of children populated until they are queried. A round trip to the server takes a significant amount of time, and it will probably result in too many requests if all folders are proxies (this would yield a new request each time the contents of a folder are inspected), so you want some trade-off between the size of each chunk of data and the number of server requests needed in a typical scenario. This is of course application-specific, but sending the contents of all child folders down to, for instance, depth 2 might be a useful strategy...
Long story short: What will probably work best is your solution #1 with the exception that you want to send more than one folder at a time because of the overhead of a round trip to the server...