High speed data acquisition using REST services - WCF

We need to develop a high-speed REST-based WCF service, which will be used for updating 2000 data points, each data point changing every 25 ms. Is it possible to implement such high-speed data acquisition using WCF?

Using WCF, yes. I'm not sure REST is the best architectural style for the type of problem you are trying to solve. I also wonder whether HTTP is appropriate.
Having said that, you might want to look into CoRE (Constrained RESTful Environments), which is an effort to apply REST in highly constrained environments like data acquisition.

Here is how I understand your question: you expect new data values every 25 ms, or 40 times per second. There are 2000 discrete data values in one device, which means the telemetry flow from each device is around 80,000 values per second. You also have multiple devices, so your throughput will go higher than this, e.g. 800,000 updates per second for 10 devices.
In this scenario, I wouldn't expect the service layer to be a constraint, for the simple reason that it is always possible to scale up the service layer by adding more hosts to receive messages and load balancing between them. Where I would be concerned is any place where all transactions must be processed within the same domain. For example, is all this data winding up in one relational database? In that case you may have a problem with transaction throughput.
Another area that seems problematic in your architecture is the device itself. Is one device going to be capable of gathering and sending out values at 80 kHz? Here is where the REST protocol may have too high an overhead. So it is a device constraint, not a server constraint, that might drive you to find a more efficient protocol. This may be a case where writing a custom protocol directly against the socket is warranted, but that depends on your device.
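As a rough illustration of how much leaner a socket-level protocol can be, here is a minimal Python sketch of a device-side sender that packs one 2000-value snapshot into a single binary frame. The frame layout, host name, and port are assumptions made for the example, not an existing protocol:

```python
import socket
import struct
import time

# Hypothetical device-side sender: one binary frame per 25 ms acquisition cycle.
# Assumed frame layout: 8-byte timestamp (double), 4-byte sample count,
# then N little-endian 32-bit floats.
def send_frame(sock, values):
    header = struct.pack("<dI", time.time(), len(values))
    payload = struct.pack(f"<{len(values)}f", *values)
    sock.sendall(header + payload)

if __name__ == "__main__":
    values = [0.0] * 2000                                 # one snapshot of the 2000 data points
    with socket.create_connection(("acquisition.example.com", 9000)) as sock:
        send_frame(sock, values)                          # roughly 8 KB on the wire, no HTTP/XML overhead
```

Compared with issuing per-value (or even per-snapshot) HTTP requests, the framing overhead here is a dozen bytes per cycle, which is the kind of saving a constrained device usually cares about.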

Related

Baselining internal network traffic (corporate)

We are collecting network traffic from switches using Zeek in the form of ‘connection logs’. The connection logs are then stored in Elasticsearch indices via filebeat. Each connection log is a tuple with the following fields: (source_ip, destination_ip, port, protocol, network_bytes, duration) There are more fields, but let’s just consider the above fields for simplicity for now. We get 200 million such logs every hour for internal traffic. (Zeek allows us to identify internal traffic through a field.) We have about 200,000 active IP addresses.
What we want to do is digest all these logs and create a graph where each node is an IP address, and a directed edge (source → destination) represents traffic between two IP addresses. There will be one unique edge for each distinct (port, protocol) tuple. The edge will have properties: average duration, average bytes transferred, and a histogram of log counts by hour of day.
I have tried using Elasticsearch's aggregations and also the newer Transform technique. While both work in theory, and I have tested them successfully on a very small subset of IP addresses, the processes simply cannot keep up with our entire internal traffic. E.g. digesting 1 hour of logs (about 200M logs) using Transform takes about 3 hours.
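For reference, the aggregation we are paging through looks roughly like the sketch below (written against the official Python Elasticsearch client; the index pattern and field names reflect our mapping, and the paging on after_key is omitted):

```python
from elasticsearch import Elasticsearch

# Composite aggregation grouping connection logs into one bucket per directed edge.
es = Elasticsearch("http://localhost:9200")

query = {
    "size": 0,
    "aggs": {
        "edges": {
            "composite": {
                "size": 10000,
                "sources": [
                    {"source_ip": {"terms": {"field": "source_ip"}}},
                    {"destination_ip": {"terms": {"field": "destination_ip"}}},
                    {"port": {"terms": {"field": "port"}}},
                    {"protocol": {"terms": {"field": "protocol"}}},
                ],
            },
            "aggs": {
                "avg_duration": {"avg": {"field": "duration"}},
                "avg_bytes": {"avg": {"field": "network_bytes"}},
            },
        }
    },
}

resp = es.search(index="zeek-conn-*", body=query)
for bucket in resp["aggregations"]["edges"]["buckets"]:
    ...  # each bucket becomes one directed edge; we keep paging on after_key
```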
My question is:
Is post-processing Elasticsearch data the right approach to making this graph? Or is there some product that we can use upstream to do this job? Someone suggested looking into ntopng, but I did not find this specific use case in their product description. (Not sure if it is relevant, but we use ntop's PF_RING product as a front end for Zeek.) Are there other products that do the job out of the box? Thanks.
What problems or root causes are you attempting to elicit with a graph of Zeek east-west traffic?
It seems that a more tailored use case, such as a specific type of authentication, or even a larger problem set such as endpoint access expansion, might be a better use of storage, compute, memory, and your other valuable time and resources, no?
Even if you did want to correlate or group on Zeek data, try to normalize it to OSSEM; then there would be no reason to, say, collect the full tuple when you can collect community-id instead. You could correlate Zeek in the large with Suricata in the small. Perhaps a better data architecture would be VAST.
Kibana, in its latest iterations, does have Graph, and even older versions can leverage the third-party kbn_network plugin. I could see you hitting a wall with 200k active IP addresses and Elasticsearch aggregations, or even summary indexes.
Many orgs will build data architectures beyond the simple serving layer provided by Elasticsearch. What I have heard of is a Kappa architecture streaming into a graph database directly, such as Dgraph, with perhaps just those edges of the graph exposed from a serving layer.
There are other ways of asking questions from IP address data, such as the ML options in AWS SageMaker IP Insights or the Apache Spot project.
Additionally, I'm a huge fan of getting the right data only as the situation arises, although in an automated way, so that the puzzle pieces bubble up for me and I can simply lock them into place. If I were working with Zeek data especially, I could leverage a platform such as Security Onion and its orchestrated Playbook engine to kick off other tasks for me, such as querying out with one of the Velocidex tools, or even cross-correlating using the built-in Sigma sources.

Gathering distributed data into central database

I was assigned to update an existing system for gathering data coming from points of sale and inserting it into a central database. The one that is working now is based on FTP/SFTP transmission, where the information is sent once a day, usually at night. Unfortunately, because of unstable connection links (low-quality 2G/3G modems), some of the files turn out to be broken. With just a few shops connected that way, everything worked smoothly, but as the number of shops grew, errors became more frequent. What is worse, inserting the data into the central database takes up to 12-14 hours (including waiting for the data to be downloaded from all of the shops), and that cannot happen during the working day as it would block the process of creating sale reports and other activities on the database - so we are really tight on processing time here.
The idea my manager suggested is to send the data continuously during the day. Data packages would be significantly smaller, so their transmission and insertion would be much faster; the central server would contain current (almost real-time) data, and the night could be used for long-running database activities like creating backups, rebuilding indexes, etc.
After going through many websites, I found that:
using an ASMX web service is now obsolete and WCF should be used instead
WCF with MSMQ or System.Messaging could be used to safely transmit data, so I don't have to care as much about acknowledging delivery of data, consistency, nodes going offline, etc.
according to http://blogs.msdn.com/b/motleyqueue/archive/2007/09/22/system-messaging-versus-wcf-queuing.aspx WCF queuing is better
there are also other technologies for implementing a message queue, like RabbitMQ, ZeroMQ, etc.
And that is where I become confused. With so many options, can you offer any pros and cons of these technologies?
We have been using .NET with Windows Forms and SQL Server, but if necessary, we could change to something more suitable. I am also a bit afraid of server efficiency. After some calculations, the server would be receiving about 15 packages of data per second (peak). Is that a lot? I know there are many websites without serious server infrastructure that handle hundreds of visitors online and still run smoothly, but a website mainly sends data to the client, whereas here we would be receiving it from the client.
I also found somewhat similar SO question: Middleware to build data-gathering and monitoring for a distributed system
where DDS was mentioned. What do you think about introducing some middleware servers that would cope with the low-quality links to points of sale, so the main server would not be clogged with 1 KB/s transmissions?
I'd be grateful for any help. Thank you in advance!
RabbitMQ can easily cope with thousands of 1 KB messages per second.
As your use case is not about processing real-time data, I'd say you should combine a few messages and send them as a batch. That would be good enough to spread the load over the day.
Since the motivation here is not to process the data in real time, any transport layer would do the job, even FTP/SFTP. While RabbitMQ will work fine here, this isn't its typical use case.
Since you mentioned that one of your concerns is a slow/unreliable network, I'd suggest compressing the files before sending them and, on the receiving end, immediately verifying their integrity. rsync or something similar will probably do a great job of that.
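To make that concrete, here is a minimal sketch of what the shop-side sender could look like, assuming RabbitMQ with the pika client. The host, queue name, and payload format are placeholders, not part of any existing system:

```python
import gzip
import hashlib
import json

import pika

def publish_batch(records):
    """Compress a batch of sale records, attach a checksum, and publish it durably."""
    body = gzip.compress(json.dumps(records).encode("utf-8"))
    checksum = hashlib.sha256(body).hexdigest()

    connection = pika.BlockingConnection(pika.ConnectionParameters("central.example.com"))
    channel = connection.channel()
    channel.queue_declare(queue="pos-sales", durable=True)    # queue survives broker restarts
    channel.basic_publish(
        exchange="",
        routing_key="pos-sales",
        body=body,
        properties=pika.BasicProperties(
            delivery_mode=2,                  # persistent message
            headers={"sha256": checksum},     # receiver verifies integrity before inserting
        ),
    )
    connection.close()
```

The receiving side would recompute the SHA-256 over the body, compare it with the header, and only then decompress and insert the batch into the central database.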
From what I understand, you have basically two problems:
Potential for loss/corruption of call data
Database write performance
The potential for loss/corruption of call data is being caused by a lack of reliability in the transmission of data from client to service.
And it's not clear what is causing the database contention/performance issues, beyond a vague reference to high volumes, so this answer will be more geared towards solving the first problem.
You have correctly identified the need for reliable asynchronous communication transport as a way to address the reliability issues in your current setup.
Looking at MSMQ to deliver this is a valid first step. MSMQ provides reliable communication via a store and forward messaging semantic which comes out of the box and requires very little in the way of configuration.
Unfortunately, while suitable for your needs, MSMQ relies on 2 things:
A reliable network protocol, and
A client service running on both sending and receiving machine.
From your description above, I don't believe 1 exists (the internet is not a reliable network), and you might well struggle with 2 - MSMQ only ships with Windows Server or business/enterprise versions of Windows on the desktop. (* see below)
As a possible solution to the network reliability problem, you could use WCF or a RESTful endpoint (using Nancy or WebApi) to expose one or more service operations over HTTP, which would accept the incoming calls from the client machines. These technologies are quite different, so you'll need to make sure you're making the correct choice early on.
WCF supports WS-ReliableMessaging (one of the WS-* specifications layered on SOAP) out of the box, which allows for reliable web service calls over HTTP; however, it's very config-heavy and not generally a nice framework to work with.
REST is much simpler than WCF in .NET, very lightweight and easy to use. However, for reliable delivery you would have to expose some kind of GET operation (in addition to a POST that allows the client to send data) to be called within a reasonable time-frame to verify the data was committed. The client would have to implement some kind of retry semantic if the result of the GET "acknowledgement" was negative.
Despite requiring two operations rather than one for the WCF route, I would favour the REST approach. I've done plenty of both and find REST services way nicer to work with.
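The POST-plus-acknowledgement pattern is language-agnostic; here is a minimal sketch in Python with the requests library, where the endpoints, batch-id scheme, and back-off policy are all assumptions made for illustration:

```python
import time
import uuid

import requests

def send_with_ack(records, base_url="https://central.example.com/api", retries=5):
    """POST a batch, then poll a GET 'acknowledgement' endpoint and retry on failure."""
    batch_id = str(uuid.uuid4())
    for attempt in range(retries):
        try:
            requests.post(f"{base_url}/batches/{batch_id}", json=records, timeout=30)
            ack = requests.get(f"{base_url}/batches/{batch_id}/status", timeout=10)
            if ack.ok and ack.json().get("committed"):
                return True                   # server confirmed the data was persisted
        except requests.RequestException:
            pass                              # flaky 2G/3G link: fall through and retry
        time.sleep(2 ** attempt)              # simple exponential back-off
    return False
```

Because the client generates the batch id, a retried POST of the same batch can be detected and deduplicated on the server side, which keeps the retry loop safe.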
(*) That's not to say that MSMQ wouldn't work in your ultimate solution, just that it would not be used to address the transmission reliability issue. However, it could still be used to address another of your problems: database write contention. If you were to queue incoming requests once they arrived at the server, they could then be processed by an "offline" process, which could perform the required database operations in a reliable manner. This could be done using MSMQ transactional queues.
In response to comments:
99% of messages are passed from shop to main server, but if some change is needed (price correction, discounts, etc.), that data has to be sent to the shop.
This kind of changes things. Had I understood from the beginning that you had a bidirectional requirement, and seeing as you have managed to establish MSMQ communication, I would have nudged you towards NServiceBus, which is a really, really cool wrapper around MSMQ. The reason is that you appear to have both a one-way and a publish-subscribe requirement, which is supported really nicely by NServiceBus.

Protocol for remote logging of temperature, gas/electricity consumption

So, I'm managing a series of rented holiday homes, which all have dynamic-IP ADSL Internet connections.
We've wanted to keep track of a few types of data, e.g. per-room electricity usage, hot water temperature, thermostat setting, gas usage, network bandwidth usage, etc etc, and keep these centrally so we can perform analytics and graph them in real-time.
I'm comfortable building the hardware required to log these variables every 1-5 seconds and get them into e.g. a Raspberry Pi, but I'm wondering what kind of framework would be suitable for transferring and storing the data on the server side.
My initial thought was something like SNMP, but a) this doesn't seem designed for non-network uses, b) it's not very secure, and c) I'm looking for something agent-to-server (so I don't have to know the IP of the agent, and it'll also traverse NAT, so I can have multiple devices logging different things on the same network.)
My second thought was something using a REST API, but making potentially hundreds of API calls per second via different TCP connections seems a bit wasteful.
I came across Cubism, but this seems to have the same disadvantages as some sort of REST API: there's a lot of redundant data transmitted with every connection if I were to send the data every 5 seconds per sensor.
Names like AMQP and MQTT come up, though none of these seem particularly suited (natively) to travelling over the public Internet without configuring VPNs etc.
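For context, if I did go down the MQTT route, the per-reading publish from an agent would look something like the sketch below (broker host, topic layout, and credentials are placeholders; it assumes the paho-mqtt client with TLS on port 8883):

```python
import json
import time

import paho.mqtt.publish as publish

# Hypothetical agent-side reading pushed every few seconds; topic layout is made up.
reading = {"sensor": "hot_water_temp", "value": 54.2, "ts": time.time()}

publish.single(
    topic="homes/cottage-3/hot_water_temp",
    payload=json.dumps(reading),
    hostname="broker.example.com",
    port=8883,                                            # TLS-protected MQTT port
    auth={"username": "cottage-3", "password": "secret-placeholder"},
    tls={"ca_certs": "/etc/ssl/certs/ca-certificates.crt"},  # typical Debian CA bundle path
    qos=1,                                                # at-least-once delivery
)
```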
Thoughts?
[This doesn't seem like a particularly niche problem, now I think about it - weather logging, share price, etc etc... although this is probably a smaller interval]
I have a geospatial/environmental monitoring background and can tell you something about two major standards which are used today in environmental/infrastructure (electricity and water supply networks) monitoring sensor networks.
Proprietary: most sensors simply store time-series measurements in their own local data format. A server process calls every sensor from time to time to gather the time-series data (in most cases via a simple GPRS uplink), transforms it into an exchange format and then stores it in a centralized database where you can work with the data. One of the industry-leading companies is Kisters AG, with its exchange format ZRXP. So this is simply storing time-series data in an ASCII format (i.e. ZRXP) and importing it into a database by polling the sensor over any connection.
Open geospatial standards: Sensor Observation Service (SOS) and SensorML, which I think fit your needs better, because these are web service specifications, whilst the proprietary approach above is a complete system solution built by one vendor. There is a nearly ready-to-use Java reference implementation of SOS provided by 52°North, which should run easily on a Pi. Although the SOS specification has a very strong geospatial background, that does not mean it can't be adapted for your purpose, I think. At the very least, SensorML should give you some ideas.

Best Way to Transmit LARGE data packages via SOAP web service

We are working with a .NET 3.5 app which is fast approaching legacy status. We have an existing SOAP service which reads records from our database and saves them to a third party MS SQL database, sending all the data rows in a single batch.
This has always worked fine, but recently we've taken on a much larger client than any we've had before, and they are transmitting much larger batches, so much so that the transfers have begun to fail. We've upped the timeout and max memory sizes in IIS and maxed out maxRequestLength in web.config, but we are still bumping up against size problems.
So, I understand that long term, we should consider moving away from SOAP and into WCF, and plans for that are in the works. But in the mean time, we need a short term fix for this new client. And of course, to make the business and sales people happy, we need it kinda quickly.
I'm wondering what the best-practice approach might be. Initially I'm thinking something like this, but I could be thinking inside the box too much:
Establish a benchmark for the number of records over which we don't want to attempt to sync all at once.
Before attempting to save the data, check the number of records against that benchmark.
If it's above it, break the transmission down into segments which are each below that benchmark, e.g. SELECT TOP 10000 * FROM table WHERE sent = false if the benchmark is 10000. Then update sent to true for those records once submitted. Repeat.
Obviously, this will slow the process down, so to handle the user experience we may want to toss in a status bar so they can see the progress. I've sketched the loop I'm picturing below.
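(Sketched here in Python with pyodbc purely for illustration; the real code would of course live in our .NET app, and the table and column names are made up.)

```python
import pyodbc

BATCH_SIZE = 10000   # the benchmark above which we stop trying to sync in one go

def sync_in_batches(conn_str, send_batch):
    """Pull unsent rows in benchmark-sized chunks, push each chunk, then mark it sent."""
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    while True:
        cursor.execute(
            "SELECT TOP (?) id, payload FROM SyncQueue WHERE sent = 0 ORDER BY id",
            BATCH_SIZE,
        )
        rows = cursor.fetchall()
        if not rows:
            break
        send_batch(rows)                     # the existing SOAP call, one chunk at a time
        ids = [row.id for row in rows]
        cursor.execute(
            "UPDATE SyncQueue SET sent = 1 WHERE id IN ({})".format(",".join("?" * len(ids))),
            *ids,
        )
        conn.commit()
    conn.close()
```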
Am I on the right track?
In addition to the comments from John, you should consider if you are solving the problem in the most optimal way.
It looks like you are triggering a one-way sync between two databases by calling a web service. This approach leads to the timeout and memory problems that you are experiencing.
If your goal is to do a one-way sync, you could use a free framework such as Microsoft's Sync Framework: http://msdn.microsoft.com/en-US/sync

How can I handle 200K requests per second in WCF

I need to design a system that can handle 200K requests per second on each machine over HTTP.
The WCF service needs to be hosted in a Windows service.
I wonder whether WCF can handle such a requirement?
What is the best system setup/configuration?
The machine itself is pretty powerful: 32 GB RAM and 8 cores (or more), and it can be upgraded if needed.
Can I handle that volume of requests on a single machine with WCF over HTTP?
Doing this on a single machine is likely to be pretty tough (if indeed it's possible). It would be better to make your system scale horizontally, so you can add lots of machines as required. How you do that will depend on what your system actually needs to do. If it's some simple calculation which requires no persisted state, it shouldn't be too hard. If you've got some interaction with storage of some form which really needs to be read/written on each request, it'll be a lot harder - and choosing your persistence technology is likely to be pretty key to making it all hang together.
Note that there are other benefits to scaling horizontally too - in particular, the ability to upgrade the system without any downtime (if you're careful) and removing a huge single point of failure.
You need to give some more info on this.
Do you get the request and have to process it immediately?
Can you store the request data and delegate the processing to some other thread/process? Is there any way to scale the system out instead of up?
Is this in fact the only piece of infrastructure you can deploy stuff to?
I would start by asking what it is that I want to do during request handling, and then what the bottlenecks are going to be.