Run Spark streaming in virtual machines

Is there any obvious performance degradation or drawback when deploying a Spark Streaming cluster in a virtualized environment like Xen or KVM? If so, what is the main reason?

The usual caveats about virtualization apply, but none of them is specific to Spark or Spark Streaming.
I don't know of an article that directly addresses this question. However, the Spark petabyte sort benchmark was run on EC2, and the write-up pays close attention to performance: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html

Related

How to get CPU usage for each VM on an ESXi host

I want to get the (cumulative) CPU usage for each VM hosted on a VMware ESXi host.
I tried the PowerCLI command 'Get-VMHost', but it only gives the overall CPU usage of the ESXi host.

For CPU usage, esxtop is a very powerful ESX command, and you have to run it at the CLI. I haven't used PowerCLI, so I'm unsure whether it's available there, but it is definitely available at the CLI, which VMware tries to discourage you from using (see https://kb.vmware.com/s/article/2004746). Documentation for esxtop for vSphere 7.0 (the latest release at the time of writing) is at https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.monitoring.doc/GUID-D89E8267-C74A-496F-B58E-19672CAB5A53.html.
That document is a bit terse, and for getting CPU usage per VM this older esxtop documentation may guide you better: https://www.vmware.com/pdf/esx2_using_esxtop.pdf. In particular, note the different nomenclature of ESXi (and ESX), in which the primary unit of address space and execution is the "world" rather than the "process". To get the CPU usage of a VM, you therefore need the CPU usage of all "worlds" associated with that VM. Some VMs have only a single "world" and others have several, and the number is configurable. esxtop has been around for a very long time, and most likely it still provides the same functionality it did over a decade ago with ESX 2.
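The "sum over all worlds of a VM" step can be sketched in a few lines of Python. This is only an illustration of the aggregation, not a parser for esxtop output: the tuple layout, the VM names, the world IDs, and the percentages are all hypothetical sample values.

```python
from collections import defaultdict


def cpu_usage_per_vm(world_samples):
    """Sum per-world CPU usage into one figure per VM.

    world_samples: iterable of (vm_name, world_id, pct_used) tuples.
    This layout is an assumption for the sketch, not esxtop's format.
    """
    totals = defaultdict(float)
    for vm_name, _world_id, pct_used in world_samples:
        totals[vm_name] += pct_used
    return dict(totals)


# Hypothetical sample values, not real esxtop output.
samples = [
    ("web01", 1001, 12.5),  # first world of VM "web01"
    ("web01", 1002, 7.5),   # second world of the same VM
    ("db01", 2001, 40.0),   # a VM with a single world
]
usage = cpu_usage_per_vm(samples)
print(usage)
```

However the per-world figures are collected, the grouping key is the VM and the world ID only serves to tell the rows apart.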

Redis localhost limitations and costs

I downloaded the Redis server and CLI to my local machine, and they work well.
I would like to know whether I can also use Redis on a production server:
Are there any critical limitations? For example, can I use 100 GB for free? (It will be on my own machine.)
I know that Redis Labs charges a monthly fee, but if I download Redis to my machine and don't use Redis Labs, is it free? (The only cost would be the storage of the machine I am using.)

Redis is open source software, licensed under BSD. That basically means you can do anything you want with it without owing anyone anything.
Redis Labs, the home of open source Redis and the provider of commercial products built on it, offers a wide spectrum of solutions: hosted, as-a-service, downloadable, remotely managed, and so forth. You can (and sometimes should) use them, but that is definitely not a requirement.
Disclaimer: I work at Redis Labs and with the open source project.

How to do Asynchronous inserts in Aerospike using Python Client

I am using Aerospike 3.4 and Python Client 1.0.41.
I am able to achieve only around 1400 writes per second, using synchronous writes from a single thread. Can anyone suggest how to improve the write speed on a single thread? I couldn't find an asynchronous write feature in the Python client.
I have seen benchmark results on the web claiming around 8 lakh (800,000) writes per second on SSD.
My Configuration:
No of nodes:2,
CPUs: 16 per node,
Replication: 2,
Data Persistence: SSD
Thanks,
Dhanasekaran

Updated 2015-07-29:
(1) The Python Aerospike client is fully synchronous at the moment. There appeared to be no firm plans for async support in the discussion at https://discuss.aerospike.com/t/gevent-compatibility-or-async-api/1001
but Ronen has since confirmed below (see comments) that async support is planned for all clients in the future.
(2) Regarding 1.4k TPS, I saw very similar results when hosting Aerospike in a VirtualBox VM and connecting from the physical host. This may be due to VirtualBox's networking overhead. When the client (the Java benchmark) was run on the same VM as the database, throughput went up to about 8k TPS.
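For context, 1400 synchronous writes per second from one thread works out to roughly 0.7 ms per write (1 / 1400 s), which is what you would expect if each write waits for a full network round trip. The sketch below illustrates why overlapping those waits helps. It does not use the Aerospike client at all: `fake_put` is a hypothetical stand-in that just sleeps, and the latency, record count, and pool size are made-up values. It also uses several threads, which departs from the single-thread constraint in the question, and whether the real client can be shared safely across threads is an assumption to verify for your client version.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_put(key, bins, latency_s=0.002):
    """Hypothetical stand-in for a blocking write; sleeps to mimic a round trip."""
    time.sleep(latency_s)
    return key


def write_all(records, workers):
    """Issue blocking writes from a thread pool; return (writes done, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        done = list(pool.map(lambda kv: fake_put(*kv), records))
    return len(done), time.perf_counter() - start


# Made-up records: (key tuple, bins dict), 200 in total.
records = [(("test", "demo", i), {"value": i}) for i in range(200)]

count_serial, t_serial = write_all(records, workers=1)    # one write at a time
count_pooled, t_pooled = write_all(records, workers=16)   # waits overlap

print("serial: %.0f writes/s" % (count_serial / t_serial))
print("pooled: %.0f writes/s" % (count_pooled / t_pooled))
```

With a single worker the total time is about the number of writes times the latency; with a pool, the waits overlap, so throughput scales with the number of in-flight writes until the server or network becomes the limit.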

The good news is that version 4.0 of the C client has been released with asynchronous support: http://www.aerospike.com/download/client/c/notes.html.
Since the Python client wraps the C client, there is a very good chance that the Python client will get this feature soon.

Source code: https://github.com/sean-tan-columbia/aerospike-async-client
I've implemented an asynchronous Aerospike client as an open-source project; the source code is linked above. It has been tested on Aerospike 3.3 with Aerospike Python Client 1.0.38 and Python 2.7.
I only started it recently, so it's not yet mature. Contributions are welcome!

what are the advantages of running docker on a vm?

Docker is an abstraction of the OS (kernel) and below, while a VM is an abstraction of the hardware. What is the point of running Docker on a VM (as on Azure), apart from app portability? Shouldn't providers host Docker directly on the hardware?

Docker doesn't provide effective isolation for kernel-level security exploits (there's only one ring 0, and it's shared across all containers). Thus, one could reasonably wish to have the additional isolation provided by a virtualization mechanism.
Keep in mind that much of Docker's value is not about security, but about containerization -- building and distributing portable applications in such a way as to ensure that coupling between layers occurs only where and how intended.

The advantage of a cloud system like Azure is that you can go online with your credit card and get a machine up and running in a few minutes. This is enabled by that machine being virtual. Also VMs let you share hardware across multiple users with hardware-level isolation.
If everything else were equal, i.e. you didn't need any of the features of a VM, then you would be correct that a physical machine should be used, as it will run more efficiently.

Is Redis a better option for SignalR scale out over SQL Server, and do each support failover?

In David Fowler's blog, SQL Server has been added to the list of scale out providers for service bus.
I am in the process of implementing Redis on our Windows servers. Based on what I know about Redis, I'm guessing it will be significantly faster than using SQL Server - is that a fair assumption?
If so, how does the Windows version of Redis implement fail-over?

Redis is roughly 200x faster than SQL Server, mainly because it's in-memory and its protocol is designed for speed.
If it helps, Redis Cloud is now offered on Windows Azure, and HA is a built-in capability of the service.
Disclosure - I'm the Co-Founder & CTO of Garantia Data, the company behind the Redis Cloud service.

"Based on what I know about Redis, I'm guessing it will be significantly faster than using SQL Server - is that a fair assumption?"
It will be faster than SQL Server because it's optimized for in-memory operations, but speed isn't its only advantage. Its support for advanced data structures offers a great deal of flexibility when dealing with various scenarios.
"If so, how does the Windows version of Redis implement fail-over?"
The download section links to an unofficial Windows port of Redis, which is not meant for production use. The official version of Redis supports replication, and Sentinel provides automatic failover, but it's hard to say what state these features are in on the Windows port. In general I wouldn't recommend running Redis on a Windows machine; instead, run it in a virtual machine with a Linux distribution.
