What is a real world use for ConcurrentBag<T>? - .net-4.0

A ConcurrentBag allows multiple threads to add and remove items from the bag. It is possible that a thread will add an item to the bag and then end up taking that same item right back out. The documentation says that ConcurrentBag is unordered, but how unordered is it? On a single thread, the bag acts like a stack. Does unordered mean "not like a linked list"?
What is a real world use for ConcurrentBag?

Because there is no ordering requirement, ConcurrentBag has a performance advantage over ConcurrentStack/ConcurrentQueue. Microsoft implements it using thread-local storage: every thread that adds items does so in its own local list, and when a thread retrieves items they come from its own local storage first. Only when that local storage is empty does the thread steal items from another thread's storage. So instead of one shared list, a ConcurrentBag is a distributed collection of per-thread lists. It is almost lock-free and should scale better under high concurrency.
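You can see the thread-local, stack-like behaviour the question describes with a tiny single-threaded demo (this is observed behaviour, not a documented ordering guarantee):

```csharp
using System;
using System.Collections.Concurrent;

class SingleThreadOrder
{
    static void Main()
    {
        var bag = new ConcurrentBag<int>();
        bag.Add(1);
        bag.Add(2);
        bag.Add(3);

        // On the thread that did the adding, items come back newest-first,
        // because they are taken from that thread's local storage.
        while (bag.TryTake(out int item))
            Console.Write($"{item} ");   // typically prints: 3 2 1
    }
}
```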
Unfortunately, .NET 4.0 had a performance issue with ConcurrentBag (fixed in 4.5); see
http://ayende.com/blog/156097/the-high-cost-of-concurrentbag-in-net-4-0

Bags are really useful for tracking instance counts. For example, if you want to keep a record of which hosts you're servicing web requests for, you can add their IP to the bag when you start servicing the request, and remove it when done.
Using a bag will allow you to tell at a glance which IPs you're currently servicing. It will also let you quickly query whether you're servicing a given IP address.
If you use a set for this rather than a bag, then having multiple concurrent requests from the same IP address will mess up your record-keeping.

Anything where you just need to keep track of what's there and don't need random access or guaranteed order. If you have a thread that adds items to process, and a thread that removes items in order to process them, a concurrent bag would work well if you don't care that they're processed in FIFO order.
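For example, a minimal sketch of that add-and-take pattern (names and the workload are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class BagProducerConsumer
{
    static volatile bool doneAdding;

    static void Main()
    {
        var bag = new ConcurrentBag<int>();

        // Producer: adds items to process; consumption order is irrelevant.
        var producer = Task.Run(() =>
        {
            for (int i = 0; i < 100; i++) bag.Add(i);
            doneAdding = true;
        });

        // Consumer: takes whatever item is cheapest to grab (no FIFO guarantee).
        var consumer = Task.Run(() =>
        {
            while (!doneAdding || !bag.IsEmpty)
            {
                if (bag.TryTake(out int item))
                    Console.WriteLine($"Processing {item}");
                else
                    Thread.Sleep(1); // nothing available yet
            }
        });

        Task.WaitAll(producer, consumer);
    }
}
```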

Thanks to @Chris Jester-Young I came up with a good, real-world scenario that actually applies to a project I'm working on.
Find - Process - Store
Find - threads 1 & 2 are set to find or scrape data (file system, web, etc). These results are stored in ConcurrentBag1.
Process - threads 3 & 4 are set to take out of ConcurrentBag1, clean/transform/process the data and then store the results in ConcurrentBag2.
Store - thread 5 is set to gather results from ConcurrentBag2 and store them in SQL.
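A sketch of that pipeline; in practice each ConcurrentBag is usually wrapped in a BlockingCollection so the downstream stages can block instead of spin. Stage bodies and names are placeholders, and each stage is shown as a single task for brevity:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class FindProcessStore
{
    static void Main()
    {
        // Stage queues: BlockingCollection uses a ConcurrentBag as its backing store,
        // so ordering between stages is not guaranteed (and not needed here).
        var found = new BlockingCollection<string>(new ConcurrentBag<string>());
        var processed = new BlockingCollection<string>(new ConcurrentBag<string>());

        // Find stage (one task here; the scenario above uses two threads).
        var finder = Task.Run(() =>
        {
            for (int i = 0; i < 10; i++) found.Add($"raw-{i}");
            found.CompleteAdding();
        });

        // Process stage: clean/transform the raw data (placeholder transform).
        var processor = Task.Run(() =>
        {
            foreach (var raw in found.GetConsumingEnumerable())
                processed.Add(raw.ToUpperInvariant());
            processed.CompleteAdding();
        });

        // Store stage: persist the results (here we just print them).
        var store = Task.Run(() =>
        {
            foreach (var result in processed.GetConsumingEnumerable())
                Console.WriteLine($"Storing {result}");
        });

        Task.WaitAll(finder, processor, store);
    }
}
```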


Max value size for Redis

I've been trying to build a replay system. Basically, when a player moves, the system saves the player's data (movements, location, animation, etc.) into a JSON file. By the end of a recording, the JSON file may be over 50 MB. I want to save this data into Redis with an expiry (24-48 hours).
My questions are:
Is it bad to save values over 50 MB into Redis with an expiry?
How many values over 50 MB can Redis handle without performance loss?
If players create 500 recordings in 48 hours, could that be bad for Redis?
How many milliseconds does it take to fetch 50 MB of data from Redis on an average VDS/VPS?
Storing a large object (in terms of size) is not good practice. One of the problems is the network: you need to send a 50 MB payload to the Redis server in a single call. Also, if you save the recording as one big object, then to retrieve or update it (a single field, element, etc.) you need to get the 50 MB back from the server, parse it to reach that single field, and then send the whole thing back. That's a serious problem in terms of network usage.
Instead of Redis strings, you may prefer sorted sets or lists, depending on your use case. If you are going to store events with timestamps and query the range of events between two timestamps, then sorted sets may be an ideal solution for you; they are also good for pagination. One crucial drawback is that adding a new element is O(log(N)).
Lists may also be a good fit for your case. You can use LPUSH/RPUSH to add new events, and since Redis lists are implemented as linked lists, adding an element to the beginning or to the end of the list costs the same, O(1), which is great.
Whenever an event happens, you call ZADD or RPUSH/LPUSH to send it to Redis. If you need to query the events later, you can use commands such as ZRANGEBYSCORE or LRANGE, depending on your choice.
While designing your keys, you can use an identifier such as the user ID, just like you mentioned in the comments. With lists or sorted sets you will not have the problems you would have with strings, but choosing the most suitable structure depends on your read/write patterns and business rules.
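For example, a sketch of the per-event sorted-set approach using the StackExchange.Redis client (the key name and event payload are made up for illustration):

```csharp
using System;
using StackExchange.Redis;

class ReplayEvents
{
    static void Main()
    {
        var redis = ConnectionMultiplexer.Connect("localhost");
        IDatabase db = redis.GetDatabase();

        string key = "replay:player:42";          // hypothetical per-player key
        double now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();

        // One small entry per event instead of one 50 MB blob.
        db.SortedSetAdd(key, "{\"action\":\"move\",\"x\":10,\"y\":3}", now);
        db.KeyExpire(key, TimeSpan.FromHours(48));  // expiry on the whole recording

        // Later: fetch only the slice of events you need, by timestamp.
        RedisValue[] slice = db.SortedSetRangeByScore(key, now - 60_000, now);
        foreach (var e in slice) Console.WriteLine(e);
    }
}
```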
Here are some useful links to read:
Redis data types intro
Redis data types
Redis labs documentation about data types

Redis atomic pop and add to sorted set, BRPOPLPUSH equivalent

I have one redis list "waiting" and one redis list "partying".
I have a long-running process that safely blocks until an item appears on the "waiting" list, then pops it and pushes it onto the "partying" list atomically using BRPOPLPUSH. Awesome.
Users in the "waiting" list are repeatedly querying "am I in the partying list yet?", but there is no fast (i.e. better than O(n)) way to check whether a user is in a Redis list. You have to fetch the whole list and loop through it.
So I'm resorting to switching from a Redis list to a Redis sorted set, with the score being the Unix timestamp at which the user joined the "waiting" sorted set. I can do a blocking pop on the lowest score (the user at the head of the queue). Using sorted sets, I can use ZSCORE to check in O(1) time whether someone is in either set, so it's looking hopeful.
How can I perform the nice atomic equivalent of BRPOPLPUSH on sorted sets?
It's like I need a mythical combination of BZPOPMIN and ZADD, a BZPOPMINZADD. If the process dies between those two commands, a user effectively disappears from both sets.
Looking into MULTI/EXEC transactions in Redis, they are not what they appear to be at first glance; they are more like pipelines, in that I can't take the result of the first command (BZPOPMIN) and feed it into the second command (ZADD). I'm also very suspicious of putting the blocking BZPOPMIN inside the MULTI, am I right to be?
How can I perform the nice atomic equivalent of BRPOPLPUSH on sorted sets? 
Sorry, you can't. We actually discussed this when the ZPOP family was added and decided against it: "However I'm not for the BZPOPZADD part, because instead experience with lists shown that this is not a good idea in general, unfortunately, and that that adding safety of message processing may be used other means. The worst thing abut BZPOPZADD and BRPOPLPUSH and so forth are the cascading effects, they create a lot of problems in replication for instance, and our BRPOPLPUSH replication is still not correct in certain ways (we can talk about it if you want)." (ref: https://github.com/antirez/redis/pull/4879#issuecomment-389116241)
I'm very suspicious of putting the blocking BZPOPMIN into the MULTI too, am I right to be?
Definitely, and in any case blocking commands don't actually block inside a MULTI/EXEC transaction; they behave like their non-blocking counterparts and return immediately.
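If polling is acceptable, one common workaround (not the blocking primitive the question asks for) is to make the non-blocking pop-and-move atomic with a small Lua script. A hedged sketch, assuming StackExchange.Redis, Redis 5.0+ for ZPOPMIN, and illustrative key names:

```csharp
using System;
using StackExchange.Redis;

class PopAndParty
{
    const string PopMinAndAdd = @"
        -- Atomically move the lowest-scored member of KEYS[1] to KEYS[2].
        local popped = redis.call('ZPOPMIN', KEYS[1])
        if #popped == 0 then return nil end
        redis.call('ZADD', KEYS[2], popped[2], popped[1])
        return popped[1]";

    static void Main()
    {
        var redis = ConnectionMultiplexer.Connect("localhost");
        IDatabase db = redis.GetDatabase();

        // Poll instead of blocking; the move itself is atomic inside Redis.
        RedisResult result = db.ScriptEvaluate(
            PopMinAndAdd,
            new RedisKey[] { "waiting", "partying" });

        if (!result.IsNull)
            Console.WriteLine($"Moved {result} to the partying set");
    }
}
```

Because the move happens entirely inside the script, a worker crash can no longer lose a user between the two sets; the trade-off is that the client has to poll instead of block.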

Array list or linked list: which one to select?

I was given a problem statement: Suppose you're building an app for restaurants to take customer orders.
Your app needs to store a list of orders. Servers keep adding
orders to this list, and chefs take orders off the list and make them.
It’s an order queue: servers add orders to the back of the queue,
and the chef takes the first order off the queue and cooks it.
Would you use an array or a linked list to implement this queue?
I replied linked list. Lots of inserts are happening (servers
adding orders), which linked lists excel at. You don’t need search
or random access (what arrays excel at), because the chefs always
take the first order off the queue.
Now please advise whether my answer was correct. I was also thinking it might happen that servers put ten items in the queue simultaneously, but at the other end the chef decides to first pick the item that takes the least time to prepare; in that case, which data structure is best?
A linked-list-based queue will be the best option for the question as asked. With a linked-list queue you have the advantage of O(1) operations for insert (servers placing orders) and delete (the chef taking orders). You also get dynamic memory allocation (memory is used only when it is needed).
I have worked on a similar project where we used to combine similar orders, which is what chefs generally do in the real world. For example, 2 French fries from server 1 and 3 French fries from server 2 become a single order of 5 French fries.
Even in a scenario where orders are picked based on the time taken to prepare each item, I still think the linked-list queue will be more useful.
Yes, a linked list should be the way to go.
This is typically because you are not only inserting at one end, you are also removing elements from the other end.
In this linked list you'll have a head and a tail, and you'll keep both pointers available at all times (as you need to remove elements from the tail and add elements at the head).
As orders are added at the head, the head pointer keeps getting updated: no problem here.
For the part where the chef removes an order: you'll have to remove the item from the list and also make the tail pointer point to the second-to-last node, which is now the last node in the list.
The only way I can think of achieving this is by using a doubly linked list, where the tail gets updated every time you remove (or pop) an element from the queue (oops, sorry, linked list).
I guess that's how I was taught to implement a queue: using a doubly linked list.
You can refer to the implementation here.
EDIT:
You can do it with a singly linked list as well. Steps are:
Whenever a new element is added, add it to the back of the list and update the tail pointer.
Whenever the chef needs an element from the linked list, access the head, take out one element, and update head to head->next.
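A minimal sketch of that singly linked queue (illustrative C#, since the question itself is language-agnostic):

```csharp
using System;

// Minimal singly linked queue: enqueue at the tail, dequeue from the head.
class OrderQueue
{
    class Node { public string Order; public Node Next; }

    Node head, tail;

    public void Enqueue(string order)              // server adds an order: O(1)
    {
        var node = new Node { Order = order };
        if (tail == null) head = tail = node;
        else { tail.Next = node; tail = node; }
    }

    public string Dequeue()                        // chef takes the oldest order: O(1)
    {
        if (head == null) throw new InvalidOperationException("No orders waiting");
        string order = head.Order;
        head = head.Next;
        if (head == null) tail = null;
        return order;
    }
}

class Program
{
    static void Main()
    {
        var queue = new OrderQueue();
        queue.Enqueue("French fries");
        queue.Enqueue("Burger");
        Console.WriteLine(queue.Dequeue());   // French fries (FIFO)
    }
}
```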

Best way to handle timeouts on rabbitmq message processing

I am trying to get my head around an issue I have recently encountered and I hope someone will be able to point me in the most reasonable direction of solving it.
I am using the Riak KV store and working with CRDT data, where I have some sort of counter inside each CRDT item stored in the database.
I have a RabbitMQ queue where each message is a request to increase or decrease a certain number of the aforementioned counters.
Finally, I have a group of service workers that listen on the queue and, for each request, try to change the counters accordingly.
The issue I have is as follows: while a single worker is processing a request, it may get stuck for a while on a write operation to the database, say on the second of three counter changes. Its connection with RabbitMQ gets lost (timeout), so the message-request goes back onto the queue (I cannot afford to miss one). It is then picked up by a second worker, which begins all the processing anew. However, the first worker eventually finishes its work, and as a result I have processed a single message twice.
I can split those increments into single actions, but this still leaves me with the same dilemma: a counter's value can still be changed twice if some worker gets stuck on a write operation for a long period.
I have no way to make Riak KV CRDT writes faster, nor can I accept losing a message-request. I need to implement some means of checking whether a request has already been processed.
My initial thought was to use some alternative, quick KV store for recording RabbitMQ message IDs while they are being processed. That way other workers could tell whether they are about to process a message that is already being handled elsewhere.
I could use any help and pointers to materials I can read.
You can't have "exactly-once delivery" semantics. You can only reduce duplicate deliveries or missed deliveries, so it's up to you to decide which misbehavior is the least inconvenient.
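If you accept at-least-once delivery plus deduplication, as the question suggests, the "quick KV store of message IDs" idea can be sketched like this (assuming StackExchange.Redis, a message ID that is stable across redeliveries, and a hypothetical HandleDelivery entry point; none of these names come from the original post):

```csharp
using System;
using StackExchange.Redis;

class Deduplicator
{
    static readonly IDatabase Db =
        ConnectionMultiplexer.Connect("localhost").GetDatabase();

    // Returns true if this worker claimed the message, false if another already did.
    static bool TryClaim(string messageId)
    {
        // SET key value NX EX <ttl>: only the first worker wins; the key
        // expires so the store doesn't grow forever.
        return Db.StringSet($"processed:{messageId}", "1",
                            TimeSpan.FromHours(24), When.NotExists);
    }

    static void HandleDelivery(string messageId, string body)
    {
        if (!TryClaim(messageId))
            return;                    // duplicate redelivery: acknowledge and skip

        // Placeholder for the actual Riak counter updates.
        Console.WriteLine($"Processing {messageId}: {body}");
    }

    static void Main() => HandleDelivery("msg-123", "increment counter A by 2");
}
```

Note the trade-off this picks: if a worker claims a message and then dies before finishing, that message is effectively dropped until the key expires, i.e. you have traded duplicate processing for possible missed processing.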
First of all, are you sure it's the CRDTs that are too slow? Are you using simple counters or counters inside maps? In my experience they are quite fast, although slower than plain key/value writes. You could try:
- having simple CRDTs (no maps) and more CRDT objects, to lower the stress on each (can you split the counters in two?)
- not using CRDTs, but using good old sibling resolution on the client side with simple key/values
- accumulating the count-update orders and applying them in a batch, but then you're accepting an increase in latency, so it's equivalent to increasing the timeout
Can you provide some metrics? For example, how long the updates take, what numbers you'd expect, and whether it's as slow when you have few updates as when you have many.

web app OO concept confusion

This is a concept question regarding "best practice" and "efficient use" of resources.
Specifically, it deals with large data sets in a DB and online web applications, and with moving from a procedural processing approach to a more object-oriented approach.
Take a "list" page, found in almost all CRUD aspects of the application. The list displays a company, address and contact. For the sake of argument, and "proper" relational design, assume we've normalized the data such that a company can have multiple addresses and contacts.
- For our scenario, let's say I have a list of 200 companies, each with 2-10 addresses, and each address has a contact (i.e. any franchise where the 'store' is named 'McDonalds', but there may be multiple addresses under that 'name').
TABLES
companies
addresses
contacts
Up to this point, I'd make a single DB call and use joins to pull back ALL my data, loop over the data and output each line. Some grouping would be done at the application layer to display things in a friendly manner. (This seems like the most efficient way, as the RDBMS did the heavy lifting and there was a minimum of network calls: one to the DB, one from the DB, one HTTP request, one HTTP response.)
Another way of doing this, if you couldn't group at the application layer, is to query for the company list, loop over that, and inside the loop make separate DB call(s) for the addresses and contacts. This is less efficient, because you're making multiple DB calls.
Now, the question, or sticking point... conceptually:
If I have a company object, an address object and a contact object, it seems that in order to achieve the same result you would call a 'getCompanies' method that returns a list, loop over the list, and call 'getAddress' for each, and likewise a 'getContact', passing in the company ID, etc.
In a web app, this means A LOT more traffic from the application layer to the DB, and a lot of smaller DB calls, etc. It seems SERIOUSLY less efficient.
If you then move a fair amount of this logic to the client side, for an AJAX application, you're incurring network traffic ON TOP of the increased internal network overhead.
Can someone please comment on the best ways to approach this? Maybe it's a conceptual thing.
Someone suggested that a 'gateway' is what you use when you access these large data sets, as opposed to smaller, more granular object data, but this doesn't really help my understanding, and I'm not sure it's accurate.
Of course getting everything you need at once from the database is the most efficient. You don't need to give that up just because you want to write your code as an OO model. Basically, you get all the results from the database first, then translate the tabular data into a hierarchical form to fill objects with. "getCompanies" could make a single database call joining addresses and contacts, and return "company" objects that contain populated lists of "addresses" and "contacts". See Object-relational mapping.
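A sketch of that "one joined query, then hydrate objects" idea; the row shape, record types and sample data are illustrative, not from the question, and in real code the flat rows would come from the joined query:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record Row(int CompanyId, string Company, string Address, string Contact);
record Company(int Id, string Name, List<(string Address, string Contact)> Locations);

class Hydrator
{
    // Stand-in for the result of one query joining companies/addresses/contacts.
    static readonly List<Row> Rows = new()
    {
        new Row(1, "McDonalds", "1 Main St", "Alice"),
        new Row(1, "McDonalds", "9 High St", "Bob"),
        new Row(2, "Acme",      "5 Elm Ave", "Carol"),
    };

    // Group the flat rows into a hierarchical object model.
    static List<Company> GetCompanies() =>
        Rows.GroupBy(r => new { r.CompanyId, r.Company })
            .Select(g => new Company(
                g.Key.CompanyId,
                g.Key.Company,
                g.Select(r => (r.Address, r.Contact)).ToList()))
            .ToList();

    static void Main()
    {
        foreach (var c in GetCompanies())
            Console.WriteLine($"{c.Name}: {c.Locations.Count} location(s)");
    }
}
```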
I've dealt with exactly this issue many times. The first and MOST important thing to remember is: don't optimize prematurely. Optimize your code for readability, the DRY principle, etc., then come back and fix the things that are actually slow.
However, specific to this case: rather than iteratively getting the addresses for each company one at a time, pass a list of all the company IDs to the fetcher and get all the addresses for those company IDs, then cache that list of addresses in a map. When you need to fetch an address by address ID, fetch it from that local cache. This is called an IdentityMap. However, like I said, I don't recommend recoding the flow for this optimization until it is needed. Most often there are 10 things on a page, not 100, so you are only saving a few milliseconds by changing the "normal" flow to the optimized flow.
Of course, once you've done this 20 times, writing code in the "optimized flow" becomes more natural, but you also have the experience of when to do it and when not to.
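A minimal sketch of the batch-fetch-plus-IdentityMap idea (the types, names, and in-memory "query" are all illustrative placeholders for a real "WHERE company_id IN (...)" fetch):

```csharp
using System;
using System.Collections.Generic;

// A tiny identity map: batch-load addresses for a set of company IDs once,
// then serve later lookups by address ID from the in-memory cache.
class AddressIdentityMap
{
    readonly Dictionary<int, string> cache = new Dictionary<int, string>();

    // fetchByCompanyIds stands in for one batched query that returns
    // addressId -> address for all requested companies.
    public void LoadFor(IEnumerable<int> companyIds,
                        Func<IEnumerable<int>, Dictionary<int, string>> fetchByCompanyIds)
    {
        foreach (var pair in fetchByCompanyIds(companyIds))
            cache[pair.Key] = pair.Value;
    }

    public string GetByAddressId(int addressId) => cache[addressId];
}

class Demo
{
    static void Main()
    {
        var map = new AddressIdentityMap();

        // One batched "query" for companies 1 and 2 instead of one call per company.
        map.LoadFor(new[] { 1, 2 }, ids => new Dictionary<int, string>
        {
            [10] = "1 Main St",
            [11] = "9 High St",
        });

        Console.WriteLine(map.GetByAddressId(10)); // served from the cache, no extra DB call
    }
}
```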