WCF: Choosing between increasing maximum array length and splitting the message into smaller packages

Side note: even though the question was posted several months ago, I'm still looking for a good answer, so any feedback is welcome.
While developing WCF Web Services I have encountered the error:
The maximum array length quota (16384) has been exceeded while reading XML data.
like many others and have solved it by modifying the binding configuration.
When looking for answers on the Internet, the solution was almost always to change the binding configuration: set maxArrayLength to its maximum value or switch to Streamed transfer.
In some situations, like in the question WCF sending huge data, people suggest modifying the binding configuration rather than transmitting the data in smaller chunks.
But will maximum values and streamed transfer always work, even in a system where you can never know how large the data will get?
How do you choose between the two options?
Does it depend on what you transfer, e.g. downloading media vs. returning large log output?
The answers I have received so far revolve around the technical aspects of streaming, but what I am looking for are guidelines for the situation described above: how to choose between the two options.

Not all bindings support streaming. The only ones that do are BasicHttpBinding, NetTcpBinding, NetNamedPipeBinding, and WebHttpBinding. Reliable sessions are also not available when streaming.
So why the big deal about streaming for large messages? If you don't use streaming, WCF buffers the entire message in memory, which can exhaust the available resources.
For more information, see this on MSDN: MSDN Large Message Transfers
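To make the two options concrete, here is a minimal configuration sketch in the usual app.config/web.config form; the binding names and quota values are illustrative, not recommendations:

```xml
<bindings>
  <basicHttpBinding>
    <!-- Option 1: stay buffered, but raise the quotas (values are examples only) -->
    <binding name="LargeBuffered" maxReceivedMessageSize="67108864">
      <readerQuotas maxArrayLength="67108864" />
    </binding>
    <!-- Option 2: switch to streamed transfer for large payloads -->
    <binding name="LargeStreamed" transferMode="Streamed"
             maxReceivedMessageSize="67108864" />
  </basicHttpBinding>
</bindings>
```

With Option 1 the whole message is held in a buffer, so the quotas are effectively a per-message memory ceiling; with Option 2 the body is read from the stream as it arrives, which is why the MSDN article above favours it for very large or unbounded payloads.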

Related

"max allowed size 128000 bytes" reached when there are a lot of publisher/subscribers

I'm using distributed pub/sub in an Akka.NET cluster, and I've begun seeing this error when pub/sub grows to approx. 1000 subscribers and 3000 publishers:
max allowed size 128000 bytes, actual size of encoded Akka.Cluster.Tools.PublishSubscribe.Internal.Delta was 325691 bytes
I don't know for sure, but I'm guessing distributed pub/sub is trying to pass the pub/sub list to the other actor systems in the cluster?
Anyway, I'm a little hesitant about boosting size limits because of this post. So what would be a reasonable approach to correcting this?
You may want to look at the distributed pub/sub HOCON settings. Messages in Akka.Cluster.Tools.PublishSubscribe.DistributedPubSub are grouped together and sent as deltas. Two settings may be of interest:
akka.cluster.pub-sub.max-delta-elements = 3000 sets the maximum number of items a single delta message may contain. 3000 is the default value, and you may want to lower it in order to reduce the size of the delta messages (which seems to be the issue in your case).
akka.cluster.pub-sub.gossip-interval = 1s indirectly affects how often gossip messages are sent. The more often they are sent, the smaller each one tends to be, assuming a continuously saturated channel.
If these don't help, you may also think about reducing the size of your custom messages by introducing custom serializers with a smaller payload footprint.
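A minimal HOCON sketch of the two settings mentioned above; the lowered max-delta-elements value is only an example to experiment with, not a recommendation:

```hocon
akka.cluster.pub-sub {
  # Maximum number of elements bundled into one delta message (default: 3000).
  # Lowering it keeps each delta below the transport's frame size limit.
  max-delta-elements = 1000

  # How often the registry is gossiped to other nodes (default: 1s).
  gossip-interval = 1s
}
```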

Object storage for a web application

I am currently working on a website where roughly 40 million documents and images will be served to its users. I need suggestions on the most suitable method for storing content, subject to these requirements:
The system should be highly available, scalable and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd-party object storage providers such as Amazon S3 and CDNs are not an option.
File sizes can vary from 1 MB to 30 MB (however, about 90% of the files are less than 2 MB).
Content retrieval latency is not much of a problem, so indexing or caching is not very important.
I did some research and found the following solutions:
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for the last two years, and the work is still in progress. However, my requirements differ slightly from yours: modifications are not possible (I will try to explain why later), file sizes range from several bytes to several megabytes, and, most importantly, deduplication has to be implemented both at the document and at the block level. If two different users upload the same file to the storage, only one copy of the file should be kept; and if two different files partially overlap, only one copy of the common part should be stored.
But let's focus on your requirements, where deduplication is not a concern. First of all, high availability implies replication. You'll have to store your files in several replicas (typically 2 or 3, though there are techniques to reduce the storage overhead) on independent machines in order to stay alive if one of the storage servers in your backend dies. Also, given your estimate of the data volume, it's clear that all the data won't fit on a single server, so vertical scaling is not an option and you have to consider partitioning. Finally, you need concurrency control to avoid race conditions when two different clients try to write or update the same data simultaneously. This topic is close to the concept of transactions (not ACID literally, but something close). To summarize, these facts mean that you're actually looking for a distributed database designed to store BLOBs.
One of the biggest problems in distributed systems is maintaining the global state of the system. In brief, there are two approaches:
Choose a leader that communicates with the other peers and maintains the global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that the leader becomes a single point of failure. If the leader dies, either some observer must assign the leader role to one of the replicas (the common case for master-slave replication in the RDBMS world), or the remaining peers need to elect a new one (algorithms like Paxos and Raft are designed to address this). Either way, almost all incoming traffic goes through the leader. This leads to "hot spots" in the backend: CPU and IO load are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check the etcd and consul limitations if you are interested).
Avoid global state altogether. Weaken the guarantees to eventual consistency. Disallow in-place updates: if someone wants to edit a file, it is saved as a new file. Use a system organized as a peer-to-peer network: no single peer keeps the full state of the system, so there is no single point of failure. This gives high write throughput and good horizontal scalability (see the placement sketch after this list).
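One common way to place data across peers without any coordinator (my own illustration, not something from the original answer) is consistent hashing: each node owns a set of positions on a hash ring, and an object is stored on the node that follows its key's position. A toy sketch in Java:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy consistent-hash ring: maps an object key to one of the storage nodes
 *  without any central coordinator keeping global state. */
public class Ring {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addNode(String node) {
        // A few virtual nodes per physical node smooth out the distribution.
        for (int v = 0; v < 100; v++) {
            ring.put(hash(node + "#" + v), node);
        }
    }

    public String nodeFor(String objectKey) {
        SortedMap<Long, String> tail = ring.tailMap(hash(objectKey));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            // Use the first 8 bytes of the digest as the position on the ring.
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Replication then just means writing to the next N distinct nodes on the ring, and adding or removing a node only moves the keys adjacent to it.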
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think storing files in a traditional RDBMS is a good option, because an RDBMS is optimized for structured data and strong consistency, and you need neither of those here. You'll also have difficulties with backups and scaling. People usually don't use an RDBMS this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on top of MongoDB. Again, this is a document-oriented database designed to store JSON documents, not BLOBs. MongoDB also had clustering problems for many years and only passed the Jepsen tests in 2017, which may mean that its clustering is not mature yet. Run performance and stress tests if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop an object storage of your own, so consider all the problems I've mentioned above (a sketch of the hash-to-directory idea follows below).
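For what the "directories using a hash" option typically looks like in code, here is a minimal content-addressed layout in Java; the fan-out depth, hash algorithm and class names are assumptions, and the metadata (owner, original filename, replica locations) would live in your database keyed by the returned hash:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Stores a file under a path derived from its content hash, e.g.
 *  <root>/ab/cd/abcd1234...  The returned hash is what you keep in the
 *  metadata database. The root directory is assumed to exist already. */
public class HashedFileStore {
    private final Path root;

    public HashedFileStore(Path root) { this.root = root; }

    public String store(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        Path tmp = Files.createTempFile(root, "upload", ".tmp");
        try (DigestInputStream din = new DigestInputStream(in, sha)) {
            // The digest is updated while the upload is copied to a temp file.
            Files.copy(din, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        String hex = toHex(sha.digest());
        // Two levels of fan-out keep directory sizes manageable with ~40M objects.
        Path dir = root.resolve(hex.substring(0, 2)).resolve(hex.substring(2, 4));
        Files.createDirectories(dir);
        Files.move(tmp, dir.resolve(hex), StandardCopyOption.REPLACE_EXISTING);
        return hex;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

A side effect of addressing files by content hash is that "modifying" a file naturally becomes "upload a new version and repoint the metadata", which lines up with the no-in-place-update approach described above.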
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I have used neither of these solutions, but HDFS looks like overkill, because you become dependent on the Hadoop stack. I have no idea about GlusterFS performance. Always consider the design of a distributed file system: if it has some kind of dedicated "metadata" servers, treat them as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well known outside the Russian part of the Internet, but it's mature and stable, and its performance is excellent. It was developed at Yandex (the Russian search engine) and a lot of Yandex services (Disk, Mail, Music, picture hosting and so on) are built on top of it. I used it in a previous project; it may take your ops some time to get into it, but it's worth it if you're OK with the GPL license.
Ceph. This is a real object storage. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it, so be prepared for vendor lock-in. I have also heard that its configuration is quite complicated. I have never used it in production, so I can't speak to its performance.
Minio. This is an S3-compatible object storage under active development at the moment. I have never used it in production, but it seems well designed.
You may also check the wiki page with the full list of available solutions.
And the last point: I strongly recommend against OpenStack Swift (there are a lot of reasons why, but first of all, Python is just not a good fit for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored, e.g. using a database as an index; and (2) actually storing them, a task that I wish to delegate to "a true file system, which after all specializes in such things."
A database (it "offhand" seems to me...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss over all of those messy filesystem details...

Concurrent page request comparisons

I have been hoping to find out what different server setups equate to, in theory, for concurrent page requests, and the answer always seems to be soaked in voodoo and sorcery. What is an approximate maximum number of concurrent page requests for each of the following setups?
apache+php+mysql(1 server)
apache+php+mysql+caching (like memcached or similar; still one server)
apache+php+mysql+caching+dedicated Database Server (2 servers)
apache+php+mysql+caching+dedicatedDB+loadbalancing(multi webserver/single dbserver)
apache+php+mysql+caching+dedicatedDB+loadbalancing(multi webserver/multi dbserver)
+distributed (amazon cloud elastic) -- I know this one is "as much as you can afford" but it would be nice to know when to move to it.
I appreciate any constructive criticism. I am just trying to figure out when it's time to move from one implementation to the next, because each comes with its own implementation effort, either programming-wise or setup-wise.
In your question you talk about caching, and this is probably one of the most important factors in a web architecture with respect to performance and capacity.
Memcache is useful, but before that you should be ensuring proper HTTP cache directives on your server responses. This does two things: it reduces the number of requests and it speeds up server response times (if you have Apache configured correctly). This can be improved further by using an HTTP accelerator like Varnish and a CDN.
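The question's stack is PHP/Apache, but the cache directives themselves are language-agnostic; purely to make them concrete, here is a sketch as a Java servlet filter (the header value and the idea of applying it globally are illustrative; normally you would vary it per resource type):

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

/** Adds HTTP cache directives so browsers and any Varnish/CDN layer in front
 *  can answer repeat requests without reaching the application at all. */
public class CacheHeaderFilter implements Filter {
    public void init(FilterConfig config) { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // "public" lets shared caches store it; max-age is the freshness lifetime in seconds.
        ((HttpServletResponse) res).setHeader("Cache-Control", "public, max-age=3600");
        chain.doFilter(req, res);
    }

    public void destroy() { }
}
```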
Another factor to consider is whether your system is stateless. Stateless usually means that it doesn't store sessions on the server and reference them on every request. A good system architecture relies on state as little as possible: the less state, the more horizontally scalable the system. Most people introduce state when confronted with personalisation, i.e. serving up different content for different users. In such cases you should first investigate using HTML5 session storage (i.e. store the complete user data in JavaScript on the client, obviously over HTTPS) or, if the data set is smaller, secure JavaScript cookies. That way you can still serve up cached resources and then personalise with JavaScript on the client.
Finally, your stack includes a database tier, another potential bottleneck for performance and capacity. If you only read data from the system, then again it should be quite easy to scale horizontally. If there are both reads and writes, it's typically better to separate the two workloads: keep the writes in one database and serve the reads from another (for example, read-only replicas). You can then scale each with the methods that suit it.
These setups do not spit out a single answer that you can then compare across options; the answer depends on far more factors than you have listed.
Even if they did spit out a single answer, it would be just one metric out of dozens. What makes this the most important metric?
Even worse, none of these alternatives is free. Each carries engineering effort and maintenance overhead, which cannot be analysed without understanding your organisation, your app, and your cost/revenue structure.
Options like AWS not only involve development effort but may "lock you in" to a solution so you also need to be aware of that.
I know this response is not complete, but I am pointing out that this question touches on a large complicated area that cannot be reduced to a single metric.
I suspect you are approaching this from exactly the wrong end. Do not go looking for technologies and then figure out how to use them. Instead profile your app (measure, measure, measure), figure out the actual problem you are having, and then solve that problem and that problem only.
If you understand the problem and you understand the technology options then you should have an answer.
If you have already done this and the problem is concurrent page requests then I apologise in advance, but I suspect not.

How online-game clients are able to exchange data through internet so fast?

Let's imagine a really simple game... We have a labyrinth and two players trying to find the exit in real time over the Internet.
On every move the game client should send the player's coordinates to the server and receive the current coordinates of the other client. How is it possible to make this exchange so fast (as all modern games do)?
OK, we can use memcache or a similar technology to reduce data-mining operations on the server side. We can also use the fastest web server, etc., but we will still have problems with timing.
So, the questions are...
What protocol do game clients usually use to exchange information with the server?
What server technologies exist to solve this problem?
What algorithms are used to deal with delays during the game, etc.?
Usually with Network Interpolation and prediction. Gamedev is a good resource: http://www.gamedev.net/reference/list.asp?categoryid=30
Also check out this one: http://developer.valvesoftware.com/wiki/Source_Multiplayer_Networking
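To show what interpolation means in practice, here is a small Java sketch (the names and the 100 ms delay are my own illustration): the remote player is rendered slightly in the past, blended between the two most recent server snapshots, so movement looks smooth even though updates arrive far less often than frames are drawn.

```java
/** Renders the remote player a little in the past, blending the two most
 *  recent server snapshots so movement looks smooth between updates. */
public class Interpolator {
    static double lerp(double a, double b, double t) {
        return a + (b - a) * t;
    }

    /** prev/next are snapshot positions, prevTime/nextTime their server
     *  timestamps, renderTime = now minus an interpolation delay (e.g. 100 ms). */
    static double positionAt(double prev, double next,
                             long prevTime, long nextTime, long renderTime) {
        if (nextTime == prevTime) return next;
        double t = (renderTime - prevTime) / (double) (nextTime - prevTime);
        t = Math.max(0.0, Math.min(1.0, t));  // clamp: don't extrapolate here
        return lerp(prev, next, t);
    }
}
```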
Use UDP, not TCP.
Use a custom protocol, usually a single byte defining a "command", followed by as few bytes as possible containing the command arguments (see the sketch after this list).
Prediction is used to make the other players' movements appear smooth without having to get an update for every single frame.
Hint: prediction is used anyway to smooth the fast screen update (~60 fps), since the actual game simulation usually runs slower (~25 fps).
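A minimal Java sketch of such a packet, under my own assumed layout (one command byte followed by two 32-bit maze coordinates); the command id, host and port are of course made up:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

/** Sends one "position update" datagram: 1 command byte + two 32-bit coordinates. */
public class PositionSender {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(0x01);   // command: "player moved" (id chosen for illustration)
        out.writeInt(12);      // x coordinate in the maze
        out.writeInt(7);       // y coordinate in the maze
        byte[] payload = buf.toByteArray();   // 9 bytes total

        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("game.example.com"), 7777));
        }
    }
}
```

With UDP a lost packet is simply superseded by the next position update, which is usually preferable to TCP's retransmission delays for this kind of traffic.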
The other answers haven't addressed a couple of important misconceptions in the original post, namely that these games aren't websites and operate quite differently. In particular:
There is little or no "data mining" that needs to be sped up. The fastest online games (e.g. first-person shooters) typically are not saving anything to disk during a match. Slower online games, such as MMOs, may use a database, primarily for storing player information, but for the most part they hold their player and world data in memory, not on disk.
They don't use web servers. HTTP is a relatively slow protocol, and even TCP alone can be too slow for some games. Instead they have bespoke servers written just for that particular game. Often these servers are tuned for low latency rather than throughput, because they typically don't serve up big documents like a web server would, but many tiny messages (measured in bytes rather than kilobytes).
With those two issues covered, your speed problem largely goes away. You can send a message to a server and get a reply in under 100 ms, and you can do that several times per second.

Random-access data object in J2ME

I'm planning to develop a small J2ME utility for viewing local public transport schedules using a mobile phone. The data part for those is mostly a big bunch of numbers representing the times when the buses arrive or leave.
What I'm trying to figure out is what is the best way to store that data. The representation needs to
be reasonably small (because of the persistent storage limitations of a mobile phone)
fit into a constant number of files (for ease of updating the schedule database afterwards over HTTP), i.e. (routes.dat, times.dat, ..., agencies.dat) and not (schedule_111.dat, schedule_112.dat, ...)
allow random access (deserializing the whole data object into memory would just be too much for a mobile phone :))
if there is a library for accessing the data format, a Java implementation should be available
In other words, if you had to squeeze a big part of GTFS-like data into a mobile device, how would you do that?
Google Protocol Buffers seemed like a good candidate for defining the data, but it doesn't offer random access.
What would you suggest?
Persistent storage on J2ME is a tricky business; see this related question for more general background: Best practice for storing large amounts of data with J2ME
In my experience, J2ME persistent storage tends to work best/most reliably with many small records rather than a few monolithic ones. Think about how the program is going to want to access the data, then try to store it in those increments in the J2ME persistent store.
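To make the "many small records" advice concrete, here is a minimal RMS sketch, assuming one record per route with the departure times packed into a byte array (the store name, record layout and class are all illustrative):

```java
import javax.microedition.rms.RecordStore;
import javax.microedition.rms.RecordStoreException;

/** Keeps each route's departure times as one small RMS record, so a single
 *  record can be read (or replaced after an update) without loading the
 *  whole schedule into memory. */
public class ScheduleStore {
    public int saveRoute(byte[] packedTimes) throws RecordStoreException {
        RecordStore rs = RecordStore.openRecordStore("schedules", true);
        try {
            return rs.addRecord(packedTimes, 0, packedTimes.length); // record id = handle
        } finally {
            rs.closeRecordStore();
        }
    }

    public byte[] loadRoute(int recordId) throws RecordStoreException {
        RecordStore rs = RecordStore.openRecordStore("schedules", false);
        try {
            return rs.getRecord(recordId);   // random access by record id
        } finally {
            rs.closeRecordStore();
        }
    }
}
```

The record id returned by addRecord is the random-access handle; you would keep a small route-id-to-record-id index (itself stored as a record) instead of deserializing the whole schedule.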
I'd generally recommend decoupling your client-server protocol for downloading updates from the on-device storage format. You can change the latter with every code update, but you're pretty much stuck supporting a client-server protocol forever, unless you want to break older clients out in the field.
Finally, I know there are some people on the Transit Developers group who have built offline transit apps in J2ME, so it's worth asking for tips there.
I made an app like this, and I used XML files generated with PHP. This let us have a single data provider for three presentation layers:
the J2ME app
a website for mobile phones
a regular website
We used XSLT to convert the XML to HTML on the websites and kXML, a very lightweight pull parser, to do it in the J2ME app. This worked well even on very old phones with black-and-white screens and small amounts of memory.
Besides, on J2ME there is no concept of a file. You have the record store in which you can store information.
This is a link to "mobile" website.
http://mobi.krakow.pl/rozklady/
and here to the app:
http://www.mobi.krakow.pl/rozklady/j2me/rjk.jar
This is in Polish, but I think it's not hard to figure out what's what.
If you want, I can provide more help and advice, or if this is a commercial product, then I think we can figure something out too ;)
I think your issue is requirement 2.
Updating 10MB of data just because 4 digits changed somewhere in the middle of the file seems highly inefficient.
Splitting the data into several files allows for better update granularity that will be well worth the added code complexity (see the sketch below).
Real-time public transport schedules are usually modified one bus/train/tram line at a time.
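One way to get that granularity, purely as an illustration (the manifest format and file naming are made up): ship one data file per route plus a small manifest of versions, and have the client re-download only the routes whose version changed. Sketched here with the CLDC-era collections a MIDP device actually has:

```java
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Vector;

/** Compares the locally stored manifest (route id -> version number) against
 *  the one downloaded from the server and lists the route files worth
 *  re-fetching. The manifest format itself is hypothetical. */
public class UpdatePlanner {
    public static Vector routesToUpdate(Hashtable local, Hashtable remote) {
        Vector stale = new Vector();
        for (Enumeration keys = remote.keys(); keys.hasMoreElements();) {
            String routeId = (String) keys.nextElement();
            Integer remoteVersion = (Integer) remote.get(routeId);
            Integer localVersion = (Integer) local.get(routeId);
            if (localVersion == null || localVersion.intValue() < remoteVersion.intValue()) {
                stale.addElement(routeId);   // e.g. fetch only "route_152.dat"
            }
        }
        return stale;
    }
}
```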