Best use of NSURLConnection when fetching multiple JSON objects that each depend on the previous one - objective-c

I am querying an API to search for articles in various databases. There are multiple steps involved, and each returns a JSON object. Each step uses an NSURLConnection with a different query string against the API:
step 1: returns a JSON object indicating the status of the query and a record set ID.
step 2: takes the record set ID from step 1 and returns the list of databases that are valid for querying.
step 3: queries each database that was reported ready in step 2 and gets a JSON array of results.
I am confused about the best way of going about this. Is it better to use one NSURLConnection and reopen it in connectionDidFinishLoading: based on which step I am in, or to open a new connection at the end of each previous one?

A couple of observations
Network latency:
The key phenomenon that we need to be sensitive to here (and it sounds like you are) is network latency. Too often we test our apps in an ideal scenario (on the simulator with high-speed internet access, or on a device connected to Wi-Fi). But when you use an app in a real-world scenario, network latency can seriously impact performance, and you'll want to architect a solution that minimizes this.
Simulating sub-optimal, real-world network situations:
By the way, if you're not doing it already, I'd suggest you install the "Network Link Conditioner" which is part of the "Hardware IO Tools" (available from the "Xcode" menu, choose "Open Developer Tool" - "More Developer Tools"). If you install the "Network Link Conditioner", you can then have your simulator simulate a variety of network experiences (e.g. Good 3G connection, Poor Edge connection, etc.).
Minimize network requests:
Anyway, I'd try to figure out how to minimize separate requests that are dependent upon the previous one. For example, I see step 1 and step 2 and wonder if you could merge those two into a single JSON request. Perhaps that's not possible, but hopefully you get the idea. You want to reduce the number of separate requests that have to happen sequentially.
I'd also look at step 3, and those look like they have to be dependent upon step 2, but perhaps you can run a couple of those step 3 requests concurrently, reducing the latency effect there.
In terms of how this would be implemented, I personally use a concurrent NSOperationQueue with some reasonable maxConcurrentOperationCount setting (e.g. 4 or 5: enough to enjoy concurrency and reduce latency, but not so many as to tax either the device or the server) and submit network operations to it. In this case, you'll probably submit step 1 with a completion operation that submits step 2, which in turn has a completion operation that submits a series of step 3 requests, and these step 3 requests can run concurrently.
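To make the control flow concrete, here is a minimal sketch of that chain, written in Python rather than Objective-C purely to keep it short and runnable. The `fetch` function and the field names (`record_set_id`, `databases`, `results`) are hypothetical stand-ins for the real API calls; the thread pool plays the role of an operation queue with a capped concurrent-operation count.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real HTTP call; it returns canned JSON-like dicts
# so the control flow can be run without a network.
def fetch(step, **params):
    if step == "start_query":
        return {"status": "ok", "record_set_id": 42}
    if step == "list_databases":
        return {"databases": ["db_a", "db_b", "db_c"]}
    if step == "query_database":
        return {"database": params["database"],
                "results": [params["database"] + "-hit"]}
    raise ValueError("unknown step: " + step)

def run_search(max_concurrent=4):
    # Steps 1 and 2 are inherently sequential: each needs the previous reply.
    step1 = fetch("start_query")
    step2 = fetch("list_databases", record_set_id=step1["record_set_id"])

    # Step 3 requests are independent of each other, so they can overlap,
    # with at most max_concurrent of them in flight at once.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        replies = pool.map(
            lambda db: fetch("query_database",
                             record_set_id=step1["record_set_id"],
                             database=db),
            step2["databases"],
        )
        return {reply["database"]: reply["results"] for reply in replies}
```

Only the step 3 fan-out benefits from concurrency here; the latency of steps 1 and 2 can only be reduced by merging them into one request.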
In terms of how to make a good network operation object, I might suggest using something like AFNetworking, which already has a decent network operation object (including one that parses JSON), so maybe you can start there.
In terms of re-using an NSURLConnection, generally it's one connection per request. When I have had an app that needed a lengthy exchange of messages with a server (e.g. a chat-like service where the server must be able to send a message to the client whenever it wants), I've done a sockets implementation, but that doesn't seem like the right architecture here.

I would dismiss the first connection and create a new one for each request.
Just, don't ask me why.
BTW, I would understand the question if this were about reusing vs. creating new objects in some performance-sensitive context, like scrolling through a table or animations, or if it happened over tens of thousands of iterations. But you are talking about 3 objects to either create anew or reuse. What is the gain of even thinking about it?


BigQuery streaming inserts taking time

During load testing of our module we found that BigQuery insert calls are taking a long time (3-4 s). I am not sure if this is OK. We are using the Java BigQuery client library, and on average we push 500 records per API call. We are expecting traffic of a million records per second to our module, so BigQuery inserts are the bottleneck in handling it. Currently it is taking hours to push the data.
Let me know if we need more info regarding code or scenario or anything.
Since streaming has a limited payload size (see the Quota policy), it's easier to talk about times, as the payload is limited in the same way for both of us, but I will mention other side effects too.
We measure between 1200 and 2500 ms for each streaming request, and this was consistent over the last month, as you can see in the chart.
We have seen several side effects, though:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all of these we opened cases with paid Google Enterprise Support, but unfortunately they didn't resolve them. It seems the recommended option for these is retrying with exponential backoff; even support told us to do so, which personally doesn't make me happy.
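For reference, retry with exponential backoff can be sketched as below. This is a generic Python illustration, not the BigQuery client's own retry mechanism; `retry_with_backoff`, its parameters, and the jitter formula are assumptions chosen for the example, and `call` stands for whatever function performs one streaming insert.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                       sleep=time.sleep):
    """Run call(); on an exception wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In real code you would catch only the transient error types listed above, and for partial failures resend only the rows that failed rather than the whole payload.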
If the approach you've chosen takes hours, it does not scale and won't scale. You need to rethink it around asynchronous processing. To finish sooner, run multiple workers in parallel; the performance of each streaming request stays the same, but with 10 workers in parallel the total time drops to roughly a tenth.
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you need to distribute insert jobs across a closed network, prioritize them, and consume (run) them. That's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a JSON representation of the row. This was a killer feature for our use case. So we have an API which receives the rows and places them on a tube; this takes just a few milliseconds, so you can achieve a fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can also release a job; Beanstalkd will push this job back onto the tube and make it available to another client.
Beanstalkd clients can be found for most common languages, and a web interface can be useful for debugging.
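The job lifecycle described above (put, reserve, then delete, bury or release) can be sketched with a small in-memory stand-in. The `Tube` class below is hypothetical and is not the API of any Beanstalkd client; it only mirrors the semantics so the producer/consumer flow can be run locally. `insert_row` stands for the function that would perform the actual BigQuery insert.

```python
import json
from collections import deque

class Tube:
    """In-memory stand-in for one Beanstalkd tube (not a real client API)."""
    def __init__(self):
        self.ready = deque()   # jobs waiting for a consumer
        self.reserved = {}     # job id -> body, currently held by a consumer
        self.buried = {}       # failed jobs kept aside for inspection
        self._next_id = 0

    def put(self, body):
        self._next_id += 1
        self.ready.append((self._next_id, body))
        return self._next_id

    def reserve(self):
        if not self.ready:
            return None
        job_id, body = self.ready.popleft()
        self.reserved[job_id] = body
        return job_id, body

    def delete(self, job_id):
        del self.reserved[job_id]

    def bury(self, job_id):
        self.buried[job_id] = self.reserved.pop(job_id)

    def release(self, job_id):
        # Hand the job back so another consumer can pick it up.
        self.ready.append((job_id, self.reserved.pop(job_id)))

def consume(tube, insert_row):
    """Drain the tube: delete jobs that succeed, bury jobs that fail."""
    while True:
        job = tube.reserve()
        if job is None:
            break
        job_id, body = job
        try:
            insert_row(json.loads(body))
            tube.delete(job_id)
        except Exception:
            tube.bury(job_id)
```

With a real Beanstalkd server, several such consumers would run as separate processes against the same tube, which is where the parallel speed-up comes from.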

Best Way to Transmit LARGE data packages via SOAP web service

We are working with a .NET 3.5 app which is fast approaching legacy status. We have an existing SOAP service which reads records from our database and saves them to a third party MS SQL database, sending all the data rows in a single batch.
This has always worked fine, but recently we've taken on a much larger client than any we've had before, and they are transmitting much larger batches, so much so that they have begun to fail. We've upped the time out and max memory sizes in IIS, and maxed out the maxRequestLength in the web.config, but we are still bumping up against size problems.
So, I understand that long term, we should consider moving away from SOAP and into WCF, and plans for that are in the works. But in the meantime, we need a short-term fix for this new client. And of course, to make the business and sales people happy, we need it kinda quickly.
I'm wondering what the best-practice approach might be. Initially I'm thinking something like this, but I could be thinking inside the box too much:
Establish a benchmark for the number of records above which we don't want to attempt to sync all at once.
Before attempting to save the data, check the number of records against that benchmark.
If it's above it, break the transmission down into segments which are each below that benchmark: SELECT TOP 10000 * FROM table WHERE sent = false, etc., if the benchmark is 10000. Then update sent to true for those records once submitted. Repeat.
Obviously, this will slow the process down, so to handle the user experience, we may want to toss in a status bar so they can see the progress.
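The segmenting steps above can be sketched as follows. This is a Python illustration rather than .NET code, and `send_batch` and `on_progress` are hypothetical stand-ins for the SOAP call and the status-bar update.

```python
def batches(records, batch_size=10000):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def sync(records, send_batch, batch_size=10000, on_progress=None):
    """Send records in segments, reporting progress after each one."""
    sent = 0
    for batch in batches(records, batch_size):
        send_batch(batch)      # one service call per segment
        sent += len(batch)     # real code would mark these rows as sent here
        if on_progress:
            on_progress(sent, len(records))
    return sent
```

Marking rows as sent only after `send_batch` returns means a failed segment is simply picked up again on the next run.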
Am I on the right track?
In addition to the comments from John, you should consider if you are solving the problem in the most optimal way.
It looks like you are triggering a one-way sync between two databases by calling a web service. This approach leads to the timeout and memory problems that you are experiencing.
If your goal is to do a one-way sync, you could use a free framework such as Microsoft's Sync Framework:

Good ways to decouple GUIs from SOAP/WS-API update/write calls?

Let's assume we have some configuration GUI that in its current form uses direct DB transactions to submit new configurations for more than one configurable component in a consistent manner.
Now let's move the data (DB) stuff behind some SOAP/WS API. The GUI has no direct DB access anymore. The transactional behaviour must remain, but the API should NOT be designed to explicitly accommodate the GUI form submissions. In fact, I don't even know how the new GUI will work or how the user input will be structured. Therefore I need to provide something like WS-AtomicTransaction on the API server side. However, there are (at least) two caveats:
The GUI is written in PHP: I don't think there is any WS-Transaction support in PHP available.
I don't want to keep DB transactions open on the server side while waiting for additional client requests.
Solutions I can think of:
using Camel's aggregation. However, that would make things more complicated in at least two ways:
You cannot use DB row ids of newly inserted rows in the subsequent calls inside the same transaction. You need to use some sort of symbolic back-referencing because there would be no communication between client and server while processing the aggregated messages.
Call replies would not be immediate (or the immediate and separate reply to each single call would only be some sort of a stub, i.e. not containing any useful information beyond "your message has been attached to TX xyz" -- if that's at all possible in the Camel aggregation case).
the two disadvantages of the previous solution make me think of request batches where possibly the WS standards provide means for referencing call results in subsequent calls inside the batch transaction. Is there any such thing already available? Maybe even as a PHP client?
trying to eliminate lock contention in the database by carefully using row-level locks etc. However, when inserting new elements, my guess is that usually pages and index pages need to be locked by the DB.
maybe some server-side persistence layer using optimistic locking? But again, that would not return any DB IDs back to the client before the final commit if DB writes would be postponed until the commit (don't know if that's possible at all).
What do YOU think?
Transactions are a powerful tool, and we easily get into a thinking pattern in which we see every problem as a nail to hit with this big hammer. I can relate to your confusion because I've experienced it myself. Unfortunately I have no better advice for you than to try not to think in terms of transactions but in terms of atomic API calls.
When I think in terms of transactions, my thought pattern usually goes like this:
start transaction
read (repeat as required)
update (repeat as required)
commit/roll back
It takes some time to realize that we overuse this pattern. Actual conflicts are rare and there are many other ways of dealing with them. Here is one commonly used in APIs:
read and send data to client (atomic API call)
update data (on the client)
send original + updates back to the server (atomic API call)
start transaction (on server)
compare with original from client
if not same, return error (client should retry)
if same, update
The last six points are part of the implementation of the API call.
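A minimal sketch of the server side of that second API call, using Python's built-in `sqlite3` module. The `config` table and the function name are hypothetical, and the compare and update steps are folded into a single conditional UPDATE, which is one common way to implement the check rather than the only one.

```python
import sqlite3

def update_if_unchanged(conn, row_id, original, updated):
    """Apply `updated` only if the stored value still equals `original`."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE config SET value = ? WHERE id = ? AND value = ?",
            (updated, row_id, original),
        )
        # rowcount is 0 when someone else changed the row in the meantime;
        # the API would then return an error and the client should retry.
        return cur.rowcount == 1
```

No database transaction stays open between the client's read and its write; the only state carried across the two calls is the original value the client sends back.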
Ferenc Mihaly

Best practice for inserting and querying data from memory

We have an application that takes real-time data and inserts it into a database. It is online for 4.5 hours a day. We insert data second by second into 17 tables. The user may at any time query any table for the latest second's data and some records in the history...
Handling the feed and insertion is done using a C# console application...
Handling user requests is done through a WCF service...
We figured out that insertion is our bottleneck; most of the time is spent there. We invested a lot of time trying to fine-tune the tables and indices, yet the results were not satisfactory.
Assuming that we have sufficient memory, what is the best practice for inserting data into memory instead of a database? Currently we are using DataTables that are updated and inserted into every second.
A colleague of ours suggested another WCF service instead of the database between the feed handler and the WCF user-request handler. The WCF mid-layer is supposed to be TCP-based and keep the data in its own memory. One may say that the feed handler could deal with user requests itself instead of having a middle layer between 2 processes, but we want to separate things so that if the feed handler crashes we are still able to provide the user with the current records.
We are limited in time, and we want to move everything to memory in a short period. Is having a WCF service in the middle of 2 processes a bad thing to do? I know that the requests add some overhead, but all 3 of these processes (feed handler, in-memory database (WCF), user-request handler (WCF)) are going to be on the same machine, and bandwidth will not be that much of an issue.
Please assist!
I would look into creating a cache of the data (such that you can also reduce database selects), and invalidate data in the cache once it has been written to the database. This way, you can batch up calls to do a larger insert instead of many smaller ones, but keep the data in-memory such that the readers can read it. Actually, if you know when the data goes stale, you can avoid reading the database entirely and use it just as a backing store - this way, database performance will only affect how large your cache gets.
Invalidating data in the cache will be based on whether it has been written to the database or has gone stale, whichever comes last, not first.
The cache layer doesn't need to be complicated, however it should be multi-threaded to host the data and also save it in the background. This layer would sit just behind the WCF service, the connection medium, and the WCF service should be improved to contain the logic of the console app + the batching idea. Then the console app can just connect to WCF and throw results at it.
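A rough sketch of such a cache layer, in Python rather than C# to keep it short. `WriteBehindCache`, `write_batch` and the batch size are hypothetical names and values; the sketch covers only the "readers see data immediately, writes go out in batches" part and leaves out staleness-based invalidation. Here `flush` is triggered by the writer when the buffer fills, whereas in the design described above a background thread would call it.

```python
import threading

class WriteBehindCache:
    """Readers see rows immediately; rows reach the store in batches."""
    def __init__(self, write_batch, batch_size=100):
        self._write_batch = write_batch  # hypothetical bulk-insert callable
        self._batch_size = batch_size
        self._latest = {}                # table name -> most recent row
        self._pending = []               # rows not yet written to the store
        self._lock = threading.Lock()

    def add(self, table, row):
        with self._lock:
            self._latest[table] = row
            self._pending.append((table, row))
            full = len(self._pending) >= self._batch_size
        if full:
            self.flush()

    def latest(self, table):
        with self._lock:
            return self._latest.get(table)

    def flush(self):
        # Swap the buffer out under the lock, then write without holding it,
        # so readers and the feed are not blocked by the database.
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self._write_batch(batch)
        return len(batch)
```

Because readers are served from `_latest`, a slow database only makes the pending buffer grow; it does not delay queries for the latest second.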
Update: the only other thing to say is invest in a profiler to see if you are introducing any performance issues in code that are being masked. Also, profile your database. You mention you need fast inserts and selects - unfortunately, they usually trade-off against each other...
What kind of database are you using? MySQL has a storage engine MEMORY which would seem to be suited to this sort of thing.
Are you using DataTable with DataAdapter? If so, I would recommend that you drop them completely. Insert your records directly using DbCommand. When users request reports, read data using a DataReader, or populate DataTable objects using DataTable.Load(IDataReader).
Storing data in memory carries the risk of losing data in case of crashes or power failures.

monotouch - updating data locally

In my app we serialize the data locally for offline use. To ensure the app is always up to date I fire off an update on launch.
To do this I have a set of WCF services that will get a delta for the requested data. Rather than complicate things I have a service to update events, a service to update stages, a service to update acts, etc., which means I have to daisy-chain these calls in the callbacks so they run one after the other.
The problem with this is that they can take a short while to update, and it seems a bit clunky chaining them like this.
What is the preferred/advised way of updating from multiple services to achieve what I need here?
For Cracklytics (as well as a few other enterprise apps I've worked on), I run two service calls in parallel instead of doing one after the other.
I spent quite a lot of time testing the performance of making calls one-at-a-time vs two-at-a-time vs three-at-a-time, etc, and I got the best results under 2G and 3G by running 2 threads at once. On wireless, I could start up like 8-10 threads together and they would run really fast.
Besides those two calls, Cracklytics also downloads a few charts from Google at the same time as those 2 calls, but I didn't notice any performance impact from that.
For the implementation, I have one main class that keeps track of all the web service classes and controls when they should be started and finished.
Just as important, though, is to figure out when web service calls should be canceled; for example, if you're downloading data for a table but the user moves to another screen, you should cancel the call right away, so it doesn't impact the downloading of data for the next screen.
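As an illustration of that coordinator idea, here is a small Python sketch (not MonoTouch code). `ServiceCoordinator` and its method names are hypothetical. It runs at most two calls at a time and can cancel calls that are still queued; cancelling a call that is already in flight would need cooperation from the call itself, which is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

class ServiceCoordinator:
    """Runs service calls at most max_parallel at a time; can cancel queued ones."""
    def __init__(self, max_parallel=2):
        self._pool = ThreadPoolExecutor(max_workers=max_parallel)
        self._futures = {}

    def start(self, name, call):
        self._futures[name] = self._pool.submit(call)

    def cancel_pending(self):
        # Future.cancel() only succeeds for calls that have not started yet.
        return [name for name, f in self._futures.items() if f.cancel()]

    def results(self):
        # Blocks until every call that was not cancelled has finished.
        out = {name: f.result()
               for name, f in self._futures.items() if not f.cancelled()}
        self._pool.shutdown()
        return out
```

Usage would be one `start` per service (events, stages, acts), a `cancel_pending` when the user leaves the screen, and `results` once the data is needed.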
Hope this helps.