Controlling and monitoring use of BI Engine Reservations - google-bigquery

With the new beta BI Engine Reservations, I've noticed some queries speed up, but others remain unaffected. Will it be possible
- to monitor how the reservation is being used?
- to have some control over how the reservation is used?

When it comes to control, I've seen no indication that you'll have any—the system decides what the most efficient mechanism is (BI Engine, query cache, etc.) and then allocates accordingly. Also, the size of your reservation, usage, and age are factored into what is added and subsequently removed from the BI Engine reservation.
While that may seem frustrating, it's also the selling point: zero-config, automatic acceleration of your dashboards. As Google iterates quickly on these products, I would expect some controls to find their way in eventually.
As a workaround, you could use a separate project for data you want to ensure has access to the full reservation (since BI Engine is project-level).
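If you go that route, the only thing that changes on the client side is which project the queries run in. As a rough sketch (the project ID and table below are placeholders, and whether a given query is actually accelerated still depends on BI Engine's own decisions, as described above), with the google-cloud-bigquery Python client it might look like this:

```python
# Minimal sketch: run queries against a dedicated project that holds the
# BI Engine reservation. The project ID and table are hypothetical.
from google.cloud import bigquery

# Hypothetical project that owns the BI Engine reservation.
client = bigquery.Client(project="my-bi-engine-project")

query = """
    SELECT region, SUM(amount) AS total
    FROM `my-bi-engine-project.dashboards.daily_sales`  -- placeholder table
    GROUP BY region
"""

# Jobs created through this client run in (and are billed to) the project
# above, so they are the ones eligible for its reservation.
for row in client.query(query).result():
    print(row.region, row.total)
```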
As was mentioned elsewhere, there are a handful of metrics that can be viewed in Stackdriver (now Cloud Monitoring), if you enable it. These are all high-level metrics, and they are listed in the documentation:
- Reservation Total Bytes
- Reservation Used Bytes
- Inflight Requests
- Request Count
- Request Execution Times
These likely won't give you all of the information you're looking for, but they can be monitored for patterns.
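If you want to watch those metrics programmatically rather than in the console, something along these lines might work. This is only a sketch: it assumes the google-cloud-monitoring Python client, and the project ID and metric type string are placeholders that should be checked against the BI Engine documentation.

```python
# Minimal sketch (assumptions): reads a BI Engine reservation metric from
# Cloud Monitoring (Stackdriver). The project ID and metric type string are
# placeholders -- confirm the exact metric names in the BI Engine docs.
import time
from google.cloud import monitoring_v3

project_id = "my-bi-engine-project"  # hypothetical project
metric_type = "bigquerybiengine.googleapis.com/reservation/used_bytes"  # assumed name

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# Pull the last hour of data points for the reservation metric.
series = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": f'metric.type = "{metric_type}"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    for point in ts.points:
        print(point.interval.end_time, point.value.int64_value)
```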

You can also use Elasticsearch and Logstash for monitoring and for building a security/auditing environment. The setup is straightforward and works in near real time.
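For illustration, here is a minimal sketch of that idea using the official Elasticsearch Python client; the cluster address, index name, and document shape are all made up for this example:

```python
# Minimal sketch (assumptions): pushing exported monitoring/log entries into
# Elasticsearch with the official Python client (8.x signature assumed), so
# they can be watched in near real time, e.g. via Kibana dashboards.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

log_entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "project": "my-bi-engine-project",  # hypothetical
    "metric": "reservation_used_bytes",
    "value": 123456789,
}

# Each entry becomes a searchable document; alerts and dashboards can then
# be built on top of the index.
es.index(index="bi-engine-monitoring", document=log_entry)
```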

Related

CouchDB backlog performance

I am working on a web service that uses CouchDB as its primary database. The setup runs as follows:
- A small main database for high-level business logic
- Large client databases (~5 GB each) containing homogeneous records
- Couchdb-Lucene indexing these large client databases
With our application, we receive approximately 2-20 posts per second from clients (only about 10 clients), each containing records to be inserted into their database. Each post handler contains logic that needs to look up client details from the main database. We're experiencing an issue where CouchDB GETs and POSTs are taking upwards of 180 s to complete. CouchDB processes are dominating the CPU, so I suspect this is a case of CouchDB not being able to keep up with its indexing.
Any idea on how to optimize performance in a situation like this where we have a steady stream of records that need to be indexed ASAP?
It seems unlikely that this amount of data is enough to overwhelm CouchDB, but it's impossible to say without digging into the actual code of your post handlers and view functions. You might want to profile these in more detail to see where the bottlenecks are (or hire a consultant to help with this).
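As a starting point for that profiling, a rough timing harness around the raw CouchDB HTTP calls can show whether the database or the application code is the slow part. This is only a sketch; the CouchDB URL, database name, and payloads are placeholders (and a real instance may also need authentication):

```python
# Minimal sketch (assumptions): time raw CouchDB HTTP calls with `requests`
# to see whether the slowness is in CouchDB itself or in the post handlers.
import time
import requests

COUCH = "http://localhost:5984"  # assumed CouchDB address
DB = "client_db_001"             # hypothetical client database

def timed(label, fn):
    start = time.perf_counter()
    resp = fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {resp.status_code} in {elapsed:.3f}s")
    return resp

# Compare a single document insert against a bulk insert of many records;
# _bulk_docs usually keeps up with a steady stream far better.
doc = {"type": "record", "payload": "example"}
timed("single insert", lambda: requests.post(f"{COUCH}/{DB}", json=doc))

batch = {"docs": [{"type": "record", "payload": i} for i in range(100)]}
timed("bulk insert", lambda: requests.post(f"{COUCH}/{DB}/_bulk_docs", json=batch))
```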

Storage Use Case "Logging + Images + Metadata"

I have the following use case, for which I'm trying to find the optimal mix of filesystem and database (RDBMS or some flavour of NoSQL). Any advice is welcome.
Client application: will generate logs at intervals of 1-3 seconds. By logs I mean structured log data (about connections, applications used, processes used, screenshots, etc.). Some log data will be structured, some will be unstructured (so the schema can change).
Storage solution: will need to persist all this data very fast. Will sit on one or more servers. It doesn't matter if it's a hybrid solution combining filesystem, RDBMS, and/or any suitable flavour of NoSQL.
Post-processing: the data needs to be queryable, of course. A plain key-value store would not suffice, that's a given (except perhaps for the screenshots).
As a reference, here's a more concrete example:
User runs the client for 2-3 hours (during a "monitoring period"). It sends log data over the wire to the server (storage). Write speed and data accuracy are vital here.
Management system accumulates the data and produces a report on certain characteristics. All log data should be retrievable if needed, but the typical query will be for a set of users in a given monitoring period. Read speed matters less here, but data accuracy and eventually being able to find every log part again are essential.
If I need to give more information, please let me know.
If you prefer to roll your own rather than use logging packages, I would stick with append-only text files. You could certainly encode screenshots in Base64 and keep them in the same file, but I would rather store those separately in the filesystem, with a generated filename recorded in the log.
As for reporting, you can obviously read the files in a text editor, but if you need more sophisticated and regular management reporting, you can build an ETL that loads only the information you report on into an RDBMS. You can always go back and rerun the ETL if you decide later that you want more information.
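To make that concrete, here is a minimal sketch of the append-only approach: line-delimited JSON logs, screenshots written to the filesystem under generated names, and only the filename kept in the log. The paths, record fields, and helper names are all made up for illustration:

```python
# Minimal sketch (assumptions): append-only JSONL log files plus a separate
# screenshot directory, with only generated filenames stored in the log.
import json
import os
import uuid
from datetime import datetime, timezone
from typing import Optional

LOG_DIR = "logs"
SCREENSHOT_DIR = "screenshots"
os.makedirs(LOG_DIR, exist_ok=True)
os.makedirs(SCREENSHOT_DIR, exist_ok=True)

def store_screenshot(png_bytes: bytes) -> str:
    """Write the image to disk and return the generated filename."""
    name = f"{uuid.uuid4().hex}.png"
    with open(os.path.join(SCREENSHOT_DIR, name), "wb") as f:
        f.write(png_bytes)
    return name

def append_log(user_id: str, event: dict, screenshot: Optional[bytes] = None) -> None:
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "event": event,
    }
    if screenshot is not None:
        record["screenshot_file"] = store_screenshot(screenshot)
    # One file per user per day keeps monitoring periods easy to find later,
    # and the files can be re-read by an ETL job into an RDBMS for reporting.
    path = os.path.join(LOG_DIR, f"{user_id}-{datetime.now(timezone.utc):%Y%m%d}.jsonl")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

append_log("user-42", {"type": "connection", "host": "10.0.0.5"})
```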

Is there a way to leverage Hadoop tools to manage parallel REST API calls to external sources?

I am writing software that creates a large graph database. The software needs to access dozens of different REST APIs with millions of total requests. The data will then be processed by the Hadoop cluster. Each of these APIs has rate limits that vary by requests per second, per window, per day, and per user (typically via OAuth).
Does anyone have any suggestions on how I might use either a Map function or another Hadoop-ecosystem tool to manage these queries? The goal would be to leverage the parallel processing in Hadoop.
Because of the varied rate limits, it often makes sense to switch to a different API query while waiting for the first limit to reset. An example would be one API call that creates nodes in the graph and another that enriches the data for that node. I could have the system go out and enrich the data for the new nodes while waiting for the first API limit to reset.
I have tried using SQS queuing on EC2 to manage the various API limits and states (creating a queue for each API call), but have found it to be ridiculously slow.
Any ideas?
It looks like the best option for my scenario will be using Storm, or specifically the Trident abstraction. It gives me the greatest flexibility for both workload management and process management.
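For anyone who ends up rolling their own before (or instead of) adopting Storm/Trident, here is a rough sketch of the interleaving idea described in the question: a token bucket per API, with work switching to whichever API still has quota. The API names, limits, and call_api stub are placeholders, not anything from Storm:

```python
# Minimal sketch (assumptions): a per-API token-bucket scheduler. When one
# API is out of quota, pending work for another API is run instead.
import time
from collections import deque

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: int):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_take(self) -> bool:
        # Refill based on elapsed time, then try to spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical APIs with different limits, each with its own work queue.
buckets = {"create_nodes": TokenBucket(2, 10), "enrich_nodes": TokenBucket(5, 20)}
queues = {"create_nodes": deque(range(5)), "enrich_nodes": deque(range(5))}

def call_api(api: str, item) -> None:
    print(f"calling {api} with {item}")  # placeholder for the real REST call

while any(queues.values()):
    progressed = False
    for api, queue in queues.items():
        if queue and buckets[api].try_take():
            call_api(api, queue.popleft())
            progressed = True
    if not progressed:
        time.sleep(0.1)  # every API is rate limited: back off briefly
```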

On NServiceBus Profiles

I've been trying to find ways to improve our NServiceBus code's performance. I searched and stumbled on the profiles that you can set when running/installing the NServiceBus host.
Currently we're running the NServiceBus host as-is, and I read that by default we are using the "Lite" profile. I've also learnt from this link:
http://docs.particular.net/nservicebus/hosting/nservicebus-host/profiles
that there are Integration and Production profiles. The documentation does not say much. Has anyone tried the Production profile and noticed an improvement in NServiceBus performance, specifically in the speed of consuming messages from the queues?
One major difference between the NSB profiles is how they handle storage of subscriptions.
The Lite, Integration and Production profiles determine how reliable NSB's setup is. For example, the Lite profile uses in-memory subscription storage for all pub/sub registrations. This is a concern because, in order to register a subscriber under the Lite profile, the publisher has to already be running (so the publisher can store the subscriber list in memory). What this means is that if the publisher crashes for any reason (or is taken offline), all the subscription information is lost (until each subscriber is restarted).
So the Lite profile is good if you are running on a developer machine and want to quickly test how your services interact. However, it is just not suitable for other environments.
The integration profile stores subscription information on a local queue. This can be good for simple environments (like QA etc.). However, in a highly distributed environment holding the subscription information in a database is best, hence the production profile.
So, to answer your question, I don't think that by changing profiles you will see a performance gain. If anything, changing from the lite profile to one of the other profiles is likely to decrease performance (because you incur the cost of accessing queue or database storage).
Unless you have tuned the logging yourself, we've seen large improvements from reduced logging alone. The performance of reading off the queues is the same all around. Since the queues are local, you won't gain much from the transport. I would take a look at tuning your handlers and the underlying infrastructure. You may want to look at tuning MSMQ, the disk you are using, etc. Another spot to check is how distributed transactions are behaving, assuming you are using a remote database that requires them.
Another option to increase throughput is to increase the number of threads consuming the queue. This requires a license. If a license is not an option, you can run multiple instances of a single-threaded endpoint, which requires you to shard your work based on message type or something else.
Continuing up the scale, you can then use the Distributor to load balance the work. Again this requires a license, but you'll be able to add more nodes as necessary. All of the options above also apply to that topology.

How to achieve high availability?

My boss wants a system that takes continent-wide catastrophic events into account. He wants to have two servers in the US and two servers in Asia (one login server and one worker server on each continent).
In the event that an earthquake breaks the connection between the two continents, both sides should keep working alone. When the connection is restored, they should sync with each other and return to normal.
- An external cloud system is not allowed, as he has no confidence in it.
- The system should take scalability into account, which means adding new servers should be easy to configure.
- The servers should be load balanced.
- The connection between the servers should be very secure (encrypted and sent over SSL, although SSL already takes care of encryption).
- The system should let one and only one user log in with a given account. (Beware of latency between continents: two users sharing an account may reach both login servers at the same time.)
Please help. I'm already at my wit's end. Thank you in advance.
I imagine that these requirements (if properly analysed) are essentially incompatible, in that they cannot all be satisfied, per the CAP theorem.
If you have several datacentres, even if they are close by, partitions WILL happen. If a partition happens, either availability OR consistency MUST be lost, because either:
- you have a pre-determined "master", which keeps working, and other "slave" DCs which fail (or go read-only). This keeps consistency at the expense of availability.
- OR you lose consistency for the duration of the partition (this means that operations which depend on immediate consistency are also unavailable).
This is incompatible with your requirements, as far as I can see. What your boss wants is clearly impossible. He needs to understand CAP theorem.
Now, in YOUR application's case, you may decide that you can bend the rules and redefine what consistency or availability mean, for convenience, and have a system which degrades into an inconsistent but temporarily acceptable state.
You probably want to get product management to look at the business case for these requirements. Dropping some of them is probably OK. Consistency is a good requirement to keep, as it makes things behave as people expect; that means dropping availability or partition tolerance. Keeping consistency is definitely easier from an engineering perspective.
This is another one of those situations where employers tend not to understand the benefits of using an off-the-shelf solution. If you as a programmer don't really even know where to start with this, then rolling your own is probably going to be a huge money and time sink. There's nothing wrong with not knowing this stuff either; high-availability, failsafe networking that takes catastrophic failure of critical components into consideration is a large problem domain that many people pour a lot of effort and money into. Why not take advantage of what providers have to offer?
Give talking to your boss about using existing cloud providers one more try.
You could contact one of the solid and experienced hosting providers (we use Rackspace) that have data centers in different regions worldwide and get their recommendations based on your requirements.
This will require expert assistance, a large budget, and serious planning.
A better option would be to contact a reputable provider with a global footprint, select a premium solution with a solid SLA backing their service, and let them tailor a solution that comes close to your needs.
Just realize that even the likes of Google, Yahoo, Microsoft and Amazon (to name a few) have, at one time or another, had issues that rendered segments of their systems offline to certain users.