Best approach for testing RavenDB query speed

I've searched the internet for a while but am not able to find any examples of how to test RavenDB query speed.
What I'm trying to achieve is to compare two session.Query calls and find out which of the two performs better. How can I do that? Thanks.
EDIT:
I'm building an MVC notes app where a user can create an account and save notes. Let's say I have these two classes:
public class SingleNote : ContentPage
{
    public string Header { get; set; }
    public string Note { get; set; }
    public string Category { get; set; }
}
And this one:
public class LoginViewModel
{
    [Required]
    [Display(Name = "Username")]
    public string UserName { get; set; }

    [Required]
    [DataType(DataType.Password)]
    [Display(Name = "password")]
    public string Password { get; set; }
}
Is it best to put a List of the user's SingleNotes in the LoginViewModel and store all the user's notes there, or should I put a property in the SingleNote document that refers to the user?
What I'm then trying to achieve is to test these two different types of queries and see which of them gets the best performance.
SO TO THE RELEVANT QUESTION: can I do some testing of that, comparing these two queries to see which of them performs best?
Case 1: where I have put the property string UserThatOwnsTheDoc in the SingleNote class.
The risk here is that I have to query all the documents in the collection "SingleNotes", which means searching through many documents. Can that be an issue?
var listOfSpecificUsersDocuments =
    RavenSession.Query<SingleNote>()
        .Where(o => o.UserThatOwnsTheDoc == User.Identity.Name)
        .ToList();
Case 2: where I have put the property List<SingleNote> SingleNotes in the LoginViewModel.
In this case I store every note in the user document. The risk here is that the document can grow very large as the list of SingleNotes grows. Can that be an issue?
var userDocumentWhichIncludesAListOfSingleNotes = RavenSession.Load<LoginViewModel>("UserName/1");

I would say what you are interested in is not in fact performance testing, but stress testing. Suppose you complete your performance analysis and find that one way is "faster". Does "faster" actually mean anything? It sure doesn't if your "faster" solution breaks under load.
The type of testing you are looking to do is nontrivial and will require real effort to achieve. Instead of looking at performance in isolation, you need to look at the overall experience for the user. You need to plan a scenario (or a series of scenarios that could run concurrently) that emulates the behavior of the system in the real world.
Taking Twitter as an example, a "simple" stress test might involve 1000 users registering, 800 logging in, 400 posting 100 tweets each, 200 posting 1000 tweets each, and all 1000 searching for users to follow, linking to 100 other users. Do all of this concurrently. Monitor the performance of the various parts of the system: is one action becoming a detriment to the others? Where is the breaking point of the system?
So circling back to your question: the solution here is to build both possible solutions, then run a standard load tester against the MVC end of the application. This will show you how many concurrent users you can support with a good experience, with a slowed-down experience, and eventually with the lights out. For legitimate testing, make sure your application is deployed to a real server running IIS on hardware comparable to a production server (i.e. don't use your Windows 8 machine, use Windows Server 2012; IIS is degraded on consumer editions to prevent them from being used as servers).
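That said, if you first want a rough comparison of the two query shapes in isolation, you can time them with a Stopwatch plus RavenDB's query statistics. A minimal sketch, assuming the RavenDB 2.x/3.x client (the statistics API differs slightly between client versions):

using System;
using System.Diagnostics;
using System.Linq;
using Raven.Client;

// Inside an action that has a RavenSession available:
RavenQueryStatistics stats;
var watch = Stopwatch.StartNew();

var notes = RavenSession.Query<SingleNote>()
    .Statistics(out stats)   // capture server-side statistics for this query
    .Where(o => o.UserThatOwnsTheDoc == User.Identity.Name)
    .ToList();

watch.Stop();

// stats.DurationMilliseconds: time spent on the server executing the query.
// watch.ElapsedMilliseconds:  total time, including network and deserialization.
Console.WriteLine("server: {0} ms, total: {1} ms",
    stats.DurationMilliseconds, watch.ElapsedMilliseconds);

Run each variant many times and compare distributions rather than single runs; the first execution also pays the one-time cost of building the dynamic index. As argued above, though, this still tells you little about behavior under concurrent load.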
Some simple advice for relearning object modeling in a document world instead of a relational world: you are seeking to model transaction boundaries. If X changes, does Y need to change too? Then they should likely be in the same document. Does LoginViewModel need to have a List<Note>? Rephrased: when a user logs in, do I need the list of all associated notes at every single moment? If the answer is "yes, I need them at all times", that is a clear sign the notes belong in the same document. If the answer is "it depends" or "sometimes", that implies they do not belong in the same document. If what you're trying to do feels "hard", that's a pretty clear sign you have a bad document model. Well-modeled non-relational systems can commonly respond to any individual request with a single Load<T> or a single Query<T>. If you need multiple loads that you cannot solve with a single LoadStartingWith<T>, there is likely something wrong; similarly if you need multiple queries.
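To make the transaction-boundary advice concrete, here is a sketch of the embedded model it points toward; UserDocument is an illustrative name, not something from the question:

using System.Collections.Generic;

// If the notes are always needed whenever the user is loaded, keep them
// in one document: one transaction boundary, one Load per request.
public class UserDocument
{
    public string Id { get; set; }          // e.g. "UserDocuments/john"
    public string UserName { get; set; }
    public List<SingleNote> Notes { get; set; }
}

// A request then needs exactly one round trip:
// var user = RavenSession.Load<UserDocument>("UserDocuments/" + User.Identity.Name);

If the notes are only needed sometimes, keep them as separate documents that carry a reference to the user, and query them as in Case 1.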

Related

Static instances in a class in a distributed system

I was reading blog posts on "What about the code? Why does the code need to change when it has to run on multiple machines?"
and I came across one line which I am not getting. Can anyone please help me understand it with a simple example?
"There should be no static instances in the class. Static instances hold application data and when a particular server goes down, all the static data/state is lost. The app is left in an inconsistent state."
Assumption: a static instance is an instance that exists at most once per process or context; e.g. in Java there is at most one copy of a static class, with all the data (or state) that the class contains.
So the memory model for a static class in a single node/JVM/process is very simple. Since there is a single copy of the data, it is quite straightforward to reason about it: for example, one may update the data and every subsequent reader will see the updated information. This is a bit more complicated for multithreaded programs, but it is still straightforward compared to distributed systems.
Clearly, in a distributed system every node may have at most one static class with state, which means that if a system contains several nodes, there are several copies of the data.
Having several copies is a problem. Every node may hold some unique data, and the data may differ from node to node. It is very hard to reason about such data: how is it synced? What about availability vs consistency?
For example, take a simple counter. In a single-node system, a static instance may keep the count. If one writer increases the counter, the next reader will see the increased value (assuming the multithreaded part is implemented correctly, which is not that complicated).
The same counter in a distributed system is much more complicated: a writer may write to one node, but a reader may read from another.
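A minimal C# sketch of the difference (illustrative only):

using System.Threading;

// Fine on one node: every reader observes the writer's increment.
public static class HitCounter
{
    private static long _count;

    public static void Increment() { Interlocked.Increment(ref _count); }
    public static long Read() { return Interlocked.Read(ref _count); }
}

// On two nodes behind a load balancer, each process has its OWN _count:
// an increment handled by node A is invisible to a read served by node B,
// and if node A crashes, its increments are lost with it.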
Basically, having state on nodes is a hard problem to solve. This is the primary reason to use a distributed storage layer, e.g. HBase, Cassandra, or AWS DynamoDB. All these storage systems have predictable behaviour, which helps in reasoning about the correctness of programs.
For example, suppose there are just two servers which accept payments from clients.
Then somebody decided to create a static class, thinking it would be friendly to multithreading:
public static class Payment
{
    public static decimal Amount;
    public static bool IsMoneyReceived;
    public static string UserName;
}
Then some client, let's call him John, decided to buy something in the shop. John sent money, and the static class now holds the data about this purchase. Some service is about to write the data from the Payment class into the database; however, the electricity gets cut. The load balancer sees that the server is not responding and redirects John's requests to another server, which knows nothing about the data in the Payment class.

Elasticsearch testing (unit/integration) best practices in C# using NEST

I've been searching for a while for how I should test my data access layer, without much success. Let me list my concerns:
Unit tests
This guy (here: Using InMemoryConnection to test ElasticSearch) says that:
Although asserting the serialized form of a request matches your
expectations in the SUT may be sufficient.
Is it really worth asserting the serialized form of requests? Do these kinds of tests have any value? It seems unlikely that someone would change a function in a way that alters a serialized request which should not change.
If it is worth it, what is the correct way to assert these requests?
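For what it's worth, one way to capture the serialized form with NEST is an InMemoryConnection combined with DisableDirectStreaming(). Treat this as a sketch; the exact names vary a little between NEST versions, and Person is the same class as in the snippet below:

using System;
using System.Text;
using Elasticsearch.Net;
using Nest;

var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var settings = new ConnectionSettings(pool, new InMemoryConnection())
    .DisableDirectStreaming();            // keep the request bytes around
var client = new ElasticClient(settings);

// The request is serialized but never sent to a real cluster.
var response = client.Search<Person>(s => s
    .Query(q => q.Term(p => p.Id, 1)));

var json = Encoding.UTF8.GetString(response.ApiCall.RequestBodyInBytes);
Assert.IsTrue(json.Contains("\"term\""));  // assert on the serialized form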
Unit tests once again
Another guy (here: ElasticSearch 2.0 Nest Unit Testing with MOQ) shows a good-looking example:
void Main()
{
    var people = new List<Person>
    {
        new Person { Id = 1 },
        new Person { Id = 2 },
    };

    var mockSearchResponse = new Mock<ISearchResponse<Person>>();
    mockSearchResponse.Setup(x => x.Documents).Returns(people);

    var mockElasticClient = new Mock<IElasticClient>();
    mockElasticClient.Setup(x => x
        .Search(It.IsAny<Func<SearchDescriptor<Person>, ISearchRequest>>()))
        .Returns(mockSearchResponse.Object);

    var result = mockElasticClient.Object.Search<Person>(s => s);

    Assert.AreEqual(2, result.Documents.Count());
}

public class Person
{
    public int Id { get; set; }
}
Probably I'm missing something, but I can't see the point of this code snippet. First he mocks an ISearchResponse to always return the people list. Then he mocks an IElasticClient to return this search response for any search request he makes.
Well, it doesn't really surprise me that the assertion holds after that. What is the point of these kinds of tests?
Integration tests
Integration tests make more sense to me for testing a data access layer. So after a little searching I found this package (https://www.nuget.org/packages/elasticsearch-inside/). If I'm not mistaken, it is just an embedded JVM plus Elasticsearch. Is it good practice to use it? Shouldn't I use my already running instance?
If anyone has good experience with testing approaches that I didn't include, I would happily hear about those as well.
Each of the approaches that you have listed may be a reasonable approach to take, depending on exactly what it is you are trying to achieve with your tests. You haven't specified this in your question :)
Let's go over the options that you have listed.
1. Asserting the serialized form of the request to Elasticsearch may be a sufficient approach if you build requests based on a varying number of inputs. You may have tests that provide different input instances and assert the form of the query that will be sent to Elasticsearch for each. These kinds of tests are fast to execute, but they assume that the generated query whose form you are asserting actually returns the results that you expect.
2. This is another form of unit test, one that stubs out the interaction with the Elasticsearch client. The system under test (SUT) in this example is not the client but another component that internally uses the client, so the interaction with the client is controlled through the stub object to return an expected response. The example is contrived: in a real test you wouldn't assert on the results of the client call, as you point out, but rather on the output of the SUT (see the sketch after point 3).
3. Integration/behavioural tests against a known data set within an Elasticsearch cluster may provide the most value and go beyond points 1 and 2: they not only incidentally test the queries generated for a given input, but also test the interaction end to end and check that it produces the expected result. No doubt these types of tests are harder to set up than 1 and 2, but the investment in setup may be outweighed by their benefit for your project.
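To illustrate point 2, here is a contrived sketch in which PersonFinder is a hypothetical component that uses the client internally; the stubs are set up exactly as in your snippet, but the assertion targets the SUT's output:

using System.Linq;
using Nest;

public class PersonFinder
{
    private readonly IElasticClient _client;
    public PersonFinder(IElasticClient client) { _client = client; }

    // The projection below is the behaviour actually under test.
    public int[] FindIds() =>
        _client.Search<Person>(s => s)
               .Documents.Select(p => p.Id).OrderBy(id => id).ToArray();
}

// var finder = new PersonFinder(mockElasticClient.Object);
// CollectionAssert.AreEqual(new[] { 1, 2 }, finder.FindIds());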
So, you need to ask yourself what kinds of tests are sufficient to give you the level of assurance you require that your system is doing what you expect; it may be a combination of all three approaches for different elements of the system.
You may want to check out how the .NET client itself is tested: there are components within the Tests project that spin up an Elasticsearch cluster with different plugins installed, seed it with known generated data, and make assertions on the results. The source is open and licensed under the Apache 2.0 license, so feel free to reuse elements in your project :)
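As a bare-bones starting point for point 3, an integration test against a throwaway index might look like this sketch (written against a pre-7.x NEST API; elasticsearch-inside or a dedicated local instance both work as the target):

using System;
using System.Linq;
using Nest;

var client = new ElasticClient(new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultIndex("people-test"));

// Seed a throwaway index with a known data set.
client.IndexMany(new[] { new Person { Id = 1 }, new Person { Id = 2 } });
client.Refresh("people-test");            // make the documents searchable

var result = client.Search<Person>(s => s.MatchAll());
Assert.AreEqual(2, result.Documents.Count());

client.DeleteIndex("people-test");        // clean up after the test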

How to handle "private" data in Elasticsearch

I need some input from anyone who might have experience with or knowledge about how Elasticsearch and its APIs work.
I have a (very large) database with a lot of data for a lot of different items.
I need to make all of this data searchable through a public API, so that anyone can use it and query the API for data about specific items. I already have Elasticsearch up and running, and have populated an index with all of the data from the database. Elasticsearch is working fine, and so is the API.
The challenge I now face is that some of the data in our database is "private" data which must not be publicly searchable. At the same time this private data must be searchable internally, which means that the API needs to run in both a public mode and a private (user-authenticated) mode. When a client that has not been authenticated queries the API, it should only get the public items, whereas a private (user-authenticated) client should get all possible results.
I don't have a problem with the items where all the data for an item must not be publicly available: I can simply mark them with a flag and make sure that Elasticsearch does not return them to public clients through the API.
The challenge arises when part of the data for an item is private and part is public. I have thought about stripping the private data from a result before returning it to the (public) client. That way the private data is not available directly through the API, but it is still exposed indirectly: if the client searched for a term that only occurs in the private data, the stripped document is still returned as a "hit" even though the query string appears nowhere in it. This reveals that the query string is somehow associated with the document, and that association may itself be sensitive.
I have thought about creating two different indices: one holding all the data for all the objects (the private index) and one holding only the publicly available data (with the sensitive parts stripped from every document). This would work and would be fairly easy to implement, but the downside is that I now have duplicated data in two indices.
Any ideas?
From your description, you clearly need two distinct views of your data:
PUBLIC: Subset of the documents in the collection, and certain fields should not be searched or returned.
PRIVATE: Entire collection, all fields searchable and visible.
You can accomplish two distinct views of the data by either having:
One index / Two queries, one public, and one private (you can either implement this yourself, or have Shield manage this opaquely for you).
Two indices / Two queries (one public, one private)
In the first case, your public query will filter out private documents as you mention, and only search/return the publicly visible fields. While the private query will not filter, and will search/return all fields.
In the second case, you would actually index your data into two separate indices, and explicitly have the public query run against the public index (containing only the public fields), and the private query run against the private index.
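As a rough sketch of the two queries in the first case, in NEST-style C# (the IsPublic and PrivateNotes field names are assumptions, not from your mapping, and the descriptor syntax varies by NEST version):

// Illustrative mapping.
public class Item
{
    public string Text { get; set; }          // public field
    public string PrivateNotes { get; set; }  // sensitive field
    public bool IsPublic { get; set; }
}

// Given an IElasticClient "client" and a user-supplied search string "term":

// PRIVATE view: whole collection, all fields searchable and returned.
var privateResults = client.Search<Item>(s => s
    .Query(q => q.Match(m => m.Field(f => f.Text).Query(term))));

// PUBLIC view: filter out private documents, search only public fields, and
// exclude sensitive fields from the returned _source. Note that excluding a
// field from _source does not stop it from matching, which is why the public
// query must also avoid searching it.
var publicResults = client.Search<Item>(s => s
    .Source(src => src.Excludes(e => e.Field(f => f.PrivateNotes)))
    .Query(q => q
        .Bool(b => b
            .Filter(f => f.Term(t => t.Field(i => i.IsPublic).Value(true)))
            .Must(m => m.Match(mm => mm.Field(i => i.Text).Query(term))))));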
It's true that you can build a mechanism (or use Shield) to accomplish what you need on top of a single index. However, you might want to consider option (2), the public/private indices, if:
You want to reduce the risk of inadvertently exposing your sensitive data through an oversight, or configuration change.
You'd like to reduce the coupling between the public features of your application and the private features of your application.
You anticipate the scaling characteristics of public usage to deviate significantly from private usage.
As an example of this last point, most freemium sites have a very skewed distribution of paying vs non-paying users (say 1 in 10 for the sake of argument).
Not only will you likely need to aggressively replicate your public index at scale, but by stripping your public documents of private fields you will also proportionately reduce the resources needed to manage your public shards (and replicas).
This brings up the question of data duplication. In most systems, the search index is not "the system of record" (see discussion). Search indices are more typically used like an external database index, or perhaps a materialized view. Having duplicate data in this scenario is less of an issue when there is a durable backing store representing the latest state.
If, for whatever reason, you are relying on Elasticsearch as "the system of record", then the dual-index route is somewhat trickier: you'll need to choose one index (probably the private one) to represent the ground truth, and then treat the other (the public index) as a downstream view of the private data.

Porting PHP API over to Parse

I am a PHP dev looking to port my API over to the Parse platform.
Am I right in thinking that you only need cloud code for complex operations? For example, consider the following methods:
// Simple function to fetch a user by id
function getUser($userid) {
    return (SELECT * FROM users WHERE userid=$userid LIMIT 1);
}

// Another simple function; fetches all of a user's allergies (by their user id)
function getAllergies($userid) {
    return (SELECT * FROM allergies WHERE userid=$userid);
}

// Creates a script (story?) about the user using their user id.
// Uses their name and allergies to create the story.
function getScript($userid) {
    $user = getUser($userid);
    $allergies = getAllergies($userid);
    return "My name is {$user->getName()}. I am allergic to {$allergies}";
}
Would I need to implement getUser()/getAllergies() endpoints in Cloud Code? Or can I simply use Parse.Query("User")... thus leaving me with only the getScript() endpoint to implement in cloud code?
Cloud Code is for computation-heavy operations that should not be performed on the client, e.g. handling a large dataset.
It is also for beforeSave/afterSave and similar hooks.
In your example, provided you have set up a reasonable data model, none of the operations require Cloud Code.
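For instance, plain client-side queries cover the first two functions. A sketch using the Parse .NET SDK (the "Allergy" class and "userId" field mirror the question's schema and are assumptions; the PHP SDK has equivalent calls):

using System.Collections.Generic;
using Parse;

// Inside an async method, with userId holding the user's object id:
var user = await ParseUser.Query.GetAsync(userId);

var allergies = await ParseObject.GetQuery("Allergy")
    .WhereEqualTo("userId", userId)
    .FindAsync();

// getScript could also run on the client, or live in Cloud Code so the
// wording can change without redeploying clients:
var script = await ParseCloud.CallFunctionAsync<string>(
    "getScript", new Dictionary<string, object> { { "userId", userId } });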
Your approach sounds reasonable. I tend to put simple queries that will most likely not change on the client side, but it all depends on your scenario. When developing mobile apps I tend to put a lot of code in Cloud Code; I've found that it speeds up my development cycle. For example, if someone finds a bug and it's in Cloud Code: make the fix, run parse deploy, done! The change is available to all mobile environments instantly! If that same code is in my mobile app, it really hurts, because now I have to fix the bug, rebuild, push it to the App Store/Google Play, wait some number of days for it to be approved, and have the users download it... you see where I'm going here.
Take, for example, your query:
SELECT * FROM allergies WHERE userid=$userid
Even though this is a simple query, what if you want to sort it? Maybe add some additional filtering?
These are the kinds of things I think of when deciding where to put the code. Hope this helps!
As a side note, I have also found cloud code very handy when needing to add extra security to my apps.

Many-to-many relationships with objects, where intermediate fields exist?

I'm trying to build a model of the servers and applications at my workplace. A server can host many applications. An application can be hosted across many servers.
Normally I would just have the Host class contain a List<Application>, and the Application class a List<Host>. However, there are a few fields that are specific to the particular host-application relationship. For example, UsedMb represents the amount of disk space used by an application on a host.
I could, of course, have a HostedApplication class representing an intermediate object which would hold the UsedMb field. Both the Host and Application classes would then contain a List<HostedApplication>.
The problem, however, is that an application also needs to know about some aspects of its host that would be included in the Host class (for example, the hosts are geographically distributed, and an application needs to know how many data centres it is hosted in, so it needs to be able to check the DC names of all its hosts).
So instead I could have the HostedApplication class hold references to both the Host object and the Application object it refers to. But then in some cases I will need to loop through all applications (and in other cases, all hosts). I would therefore need three separate lists, a List<Host>, a List<Application>, and a List<HostedApplication>, to be able to loop through all three as needed.
My basic question is: what is the standard way of dealing with this sort of configuration? All options have advantages and disadvantages. The last option I mentioned seems the most correct, but is having three lists overkill? Is there a more elegant solution?
Ideally I would be able to talk to you about the problem, but here is a potential solution based on my rough understanding of the requirements (C++-style, with a lot of the implementation left out):
#include <cstddef>
#include <iostream>
#include <string>

class Host {
public:
    std::string GeographicLocation() const;
    std::string DCName() const;
};

class HostAsAppearsToClient : public Host {
public:
    HostAsAppearsToClient(const Host&);  // allows Host -> HostAsAppears... conversion
    std::size_t UsedMB() const;
    void UseMoreMB(std::size_t);
};

class Client {
    HostAsAppearsToClient* hosts;

    void AddHost(const Host& host) {
        // Reallocate enough size or move a pointer or whatever
        hosts[new_index] = HostAsAppearsToClient(host);
        hosts[new_index].UseMoreMB(56);
    }

    void DoSomething() {
        hosts[index].UsedMB();
        // Gets the MB that that host is using, and all other normal host
        // data if we decide we need it ...
        std::cout << hosts[index].DCName();
    }
};

int main() {
    Host* hosts = new Host[40];
    Client* clients = new Client[30];
    // hosts[2].UsedMB();  // doesn't compile: a plain Host has no usage data
}
I fully expect that this does not meet your requirements, but please let me know in what way so that I can better understand your problem.
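For comparison, the question's last option (an association object holding references to both ends, plus lists on each side) is a common pattern; a minimal sketch in C#, with illustrative names:

using System.Collections.Generic;
using System.Linq;

// The association object owns the relationship-specific fields.
public class HostedApplication
{
    public Host Host { get; set; }
    public Application Application { get; set; }
    public long UsedMb { get; set; }   // belongs to the pairing, not to either end
}

public class Host
{
    public string DcName { get; set; }
    public List<HostedApplication> Applications { get; } = new List<HostedApplication>();
}

public class Application
{
    public List<HostedApplication> Hosts { get; } = new List<HostedApplication>();

    // The application reaches host details through the association:
    public int DataCentreCount() =>
        Hosts.Select(h => h.Host.DcName).Distinct().Count();
}

A third list of all HostedApplication instances is only needed if you iterate over relationships independently of either end; that is bookkeeping rather than overkill.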
EDIT:
VBA... unlucky :-P
It is possible to load DLLs in VBA, which would allow you to write and compile your code in any other language and just forward the inputs and outputs through VBA from the UI to the DLL, but I guess it's up to you whether that's worth it. Documentation on how to use a DLL in VBA Excel: link
Good Luck!