Any way to see incoming buffer/records from SQL Server?

Basically I have a bunch of performance analysis that [given a naive interpretation] claims 70% of the time is spent in synchronization on our web application under heavy load, mostly in SNIReadSyncOverAsync, which the data reader calls internally. (SNIReadSyncOverAsync actually ends up sitting on kernelbase.dll!WaitForSingleObjectEx.) It would be informative to see whether these waits are caller initiated or callee initiated.
Is there a way to see (interpret) this in a Visual Studio Contention or Concurrency Report? Or some other way?
More importantly for my understanding, is there a way to see the incoming buffer that holds data before the data gets consumed by the data reader?

It seems my question was ill-informed.
The DataReader reads one record at a time, but it reads it from the underlying database driver. The database driver reads data from the database in blocks, typically using a buffer of about 8 kilobytes.
If your result records are small and there aren't very many of them, they will all fit in the buffer, and the database driver will be able to feed them all to the DataReader without having to ask the database for more data.
If you fetch a result that is larger than the buffer, you will only be able to read the first part of it; once no more data remains in the network buffer, the DataReader asks SQL Server to send the next block of data.
How much data can be stored in network buffer when datareader is used
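For what it's worth, here is a small sketch of where that buffer comes from on the ADO.NET side: the "Packet Size" connection-string keyword controls the size of the TDS buffer the driver reads into (the default is 8000 bytes), and SqlConnection.PacketSize reports what was actually negotiated. The server, database, and table names below are hypothetical.

    using System;
    using System.Data.SqlClient;

    class PacketSizeDemo
    {
        static void Main()
        {
            // "Packet Size" sets the TDS buffer the driver fills from the network;
            // the default is 8000 bytes. Connection details here are made up.
            var connStr = "Server=.;Database=MyDb;Integrated Security=true;Packet Size=8000";

            using (var conn = new SqlConnection(connStr))
            {
                conn.Open();
                Console.WriteLine($"Negotiated packet size: {conn.PacketSize} bytes");

                using (var cmd = new SqlCommand("SELECT Id, Name FROM dbo.Products", conn))
                using (var reader = cmd.ExecuteReader())
                {
                    // Each Read() consumes one row from the driver's buffer; the driver
                    // only waits on the network (SNIReadSyncOverAsync) when that buffer
                    // is exhausted and the next block has to be fetched.
                    while (reader.Read())
                    {
                        var id = reader.GetInt32(0);
                        var name = reader.GetString(1);
                    }
                }
            }
        }
    }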

Related

Does the size of the data in an API response affect the speed of the application fetching data from that API?

I am building an app where I have to show categories of products. Is there any change in the performance of the app based on the size of the data coming from the API, e.g. if there are 10 categories in the API response vs. 70 categories in that response? Does the size of the response affect the performance of the application?
Of course! Response size drives performance. Imagine a GET API that retrieves all the data from a table. If you have millions of rows, this is going to result in a large chunk of response data. Depending on the maximum response size configured in your web server, the data will either be returned or the web server will simply fail to write it to the wire.
Not only does this result in a large response size, the time spent accessing your data store increases too. I have seen some badly written GET APIs scanning through an entire SQL Server table, millions of rows, resulting in a query that would execute for more than 30 seconds. All this time, the calling application is waiting for the response, degrading its performance. Increase the number of parallel requests and you will run into other problems related to memory consumption, connection pool performance, etc. These are just symptoms; the root cause of the problem is the massive chunk of data being read.
Ideally, an application should extract only the part of the data that it needs to hand over to the user, i.e. it should compose queries in such a manner that filtering, compaction, etc. happen on the database side, not on the application side. The database is the best tool for handling data, not application code.
There is a host of other problems once you start drilling into paging, chunked responses, etc.
This is why API design is an important topic of discussion.
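A rough sketch of pushing filtering and paging down to the database instead of pulling whole tables into the application; the table, columns, and connection string are made up for illustration:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class CategoryQueries
    {
        // Returns one page of categories rather than the whole table,
        // so the response size stays bounded no matter how many rows exist.
        public static List<string> GetCategoryPage(string connStr, int pageIndex, int pageSize)
        {
            var results = new List<string>();
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                @"SELECT Name
                  FROM dbo.Categories
                  ORDER BY Name
                  OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY", conn))
            {
                cmd.Parameters.AddWithValue("@offset", pageIndex * pageSize);
                cmd.Parameters.AddWithValue("@pageSize", pageSize);

                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                        results.Add(reader.GetString(0));
                }
            }
            return results;
        }
    }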

How to enrich events using a very large database with azure stream analytics?

I'm in the process of analyzing Azure Stream Analytics to replace a stream processing solution based on NiFi with some REST microservices.
One step is the enrichment of sensor data from a very large database of sensors (>120 GB).
Is it possible with Azure Stream Analytics? I tried with a very small subset of the data (60 MB) and couldn't even get it to run.
The job logs give me warnings about memory usage being too high. I tried scaling to 36 streaming units to see if it was even possible, to no avail.
What strategies do I have to make it work?
If I deterministically (via a hash function) partition the input stream into N partitions by ID, and partition the database using the same hash function (so that an ID on the stream and the same ID in the database end up in the same partition), can I make this work? Do I need to create several separate Stream Analytics jobs to be able to do that?
I suppose I can use 5 GB chunks, but I could not get it to work with an ADLS Gen2 data lake. Does it really only work with Azure SQL?
Stream Analytics supports reference datasets of up to 5 GB. Please note that large reference datasets come with the downside of making job/node restarts very slow (up to 20 minutes for the reference data to be distributed; restarts may be user initiated, triggered by service updates, or caused by various errors).
If you can downsize those 120 GB to 5 GB (keeping only the columns and rows you need, and converting to types that are smaller in size), then you should be able to run that workload. Sadly we don't support partitioned reference data yet. This means that, as of now, if you have to use ASA and can't reduce those 120 GB, you will have to deploy one distinct job for each subset of stream/reference data.
Now I'm surprised you couldn't get 60 MB of reference data to run; if you have details on what exactly went wrong, I'm happy to provide guidance.
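If you do end up splitting the workload into several jobs, here is a sketch of the deterministic partitioning described in the question (the hash choice and partition count are assumptions; the same function has to be applied to the sensor IDs in the reference data and to the IDs on the incoming stream so matching rows land in the same job):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    static class IdPartitioner
    {
        // Maps a sensor ID to one of N partitions. MD5 is used only because it is a
        // stable, platform-independent hash (string.GetHashCode is not stable across
        // processes); any stable hash function would do.
        public static int GetPartition(string sensorId, int partitionCount)
        {
            using (var md5 = MD5.Create())
            {
                byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(sensorId));
                int value = BitConverter.ToInt32(hash, 0) & int.MaxValue; // force non-negative
                return value % partitionCount;
            }
        }
    }

    // Usage idea: split the 120 GB reference set into N slices of <= 5 GB with
    // GetPartition(id, N), and route each stream event to job number GetPartition(id, N).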

Does Bigtable write operations to the log for every single operation or in batches?

I was wondering how Google's Bigtable stays persistent. When a write operation comes in, the tablet server updates the in-memory "hashmap", and the operation is also written to a log file. This way, if the tablet server dies, a new tablet server can read all recent operations and be "equal" to the dead tablet.
This makes sense, but doesn't it slow things down to write every operation to a log server rather than in batches (because it is written to disk)?
Let's take each of these questions in turn.
Does Bigtable write operations to the log for every single operation or in batches?
Bigtable writes every single operation to the persistent log as it comes in, not in batches. In other words, it's synchronous rather than asynchronous: by the time the server responds to the client, the data has already been written to a log (which is durable and replicated), not just to memory.
If a storage system only wrote to memory and flushed to a log in batches, it would lose any data that was only in memory if the server crashed after accepting some writes but before flushing them to the log.
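A toy sketch of that ordering, not Bigtable's actual implementation, just the log-then-apply pattern being described:

    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    class ToyTabletServer
    {
        private readonly FileStream _log;
        private readonly Dictionary<string, string> _memTable = new Dictionary<string, string>();

        public ToyTabletServer(string logPath)
        {
            _log = new FileStream(logPath, FileMode.Append, FileAccess.Write);
        }

        public void Write(string key, string value)
        {
            // 1. Append the mutation to the durable log and force it to disk...
            byte[] entry = Encoding.UTF8.GetBytes(key + "\t" + value + "\n");
            _log.Write(entry, 0, entry.Length);
            _log.Flush(true); // flush to the physical disk, not just the OS cache

            // 2. ...and only then apply it in memory and acknowledge the client.
            // If this process dies, a replacement can replay the log to rebuild _memTable.
            _memTable[key] = value;
        }
    }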
This makes sense, but doesn't it slow things down to write every operation to a log server rather than in batches (because it is written to disk)?
The distributed file system behind Bigtable (formerly Google File System, now Colossus) is much faster than typical file systems, even though it's distributed and each write is replicated.
In benchmarks using YCSB, Google Cloud Bigtable has demonstrated single-digit-millisecond latency on both reads and writes, even at the tail.

SQL FILESTREAM and Connection Pooling

I am currently enhancing a product to support web delivery of large file-content.
I would like to store it in the database, and whether I choose FILESTREAM or BLOB storage, the following question still holds.
My WCF method will return a stream, meaning that the file stream will remain open while the content is read by the client. If the connection is slow, then the stream could be open for some time.
Question: connection pooling assumes that connections are held exclusively, and only for a short period of time. Am I correct in assuming that, given a connection pool of finite size, there could be a contention problem if slow network connections are used to download files?
Under this assumption, I really want to use FILESTREAM and open the file directly from the file system rather than over the SQL connection. However, if the database is remote, I will have no choice but to pull the content over the SQL connection (until I have a local cache of the file, anyway).
I realise I have other options, such as buffering the stream on the server, but that has implications as well. I wish, at this time, to discuss only the issues relating to returning a stream obtained from a DB connection.
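For reference, this is roughly what the SQL-connection path looks like with SqlFileStream; the transaction, and therefore the pooled connection, must stay open while the FILESTREAM data is read, which is exactly the contention concern, so the sketch below copies into memory and releases the connection before returning (table, column, and connection-string names are made up):

    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using System.IO;

    static class FileStore
    {
        // Reads FILESTREAM content into memory so the SQL connection goes back to
        // the pool before a (possibly slow) client starts reading the stream.
        public static Stream GetFileContent(string connStr, int fileId)
        {
            var buffer = new MemoryStream();
            using (var conn = new SqlConnection(connStr))
            {
                conn.Open();
                using (var tx = conn.BeginTransaction())
                {
                    string path = null;
                    byte[] txContext = null;
                    using (var cmd = new SqlCommand(
                        "SELECT Content.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() " +
                        "FROM dbo.Files WHERE Id = @id", conn, tx))
                    {
                        cmd.Parameters.AddWithValue("@id", fileId);
                        using (var reader = cmd.ExecuteReader())
                        {
                            if (reader.Read())
                            {
                                path = reader.GetString(0);
                                txContext = (byte[])reader[1];
                            }
                        }
                    }

                    if (path != null)
                    {
                        // Win32 streaming access to the FILESTREAM data, valid only
                        // while the transaction is open.
                        using (var sqlStream = new SqlFileStream(path, txContext, FileAccess.Read))
                        {
                            sqlStream.CopyTo(buffer);
                        }
                    }
                    tx.Commit();
                }
            }
            buffer.Position = 0;
            return buffer; // the connection is already back in the pool
        }
    }

Of course, buffering the whole file in memory is itself one of the server-buffering options with implications I mentioned above.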

Best practice for inserting and querying data from memory

We have an application that takes real-time data and inserts it into a database. It is online for 4.5 hours a day. We insert data second by second into 17 tables. The user may at any time query any table for the latest second of data and some records from the history...
Handling the feed and insertion is done using a C# console application...
Handling user requests is done through a WCF service...
We figured out that insertion is our bottleneck; most of the time is taken there. We invested a lot of time trying to fine-tune the tables and indexes, yet the results were not satisfactory.
Assuming that we have sufficient memory, what is the best practice for inserting data into memory instead of a database? Currently we are using DataTables that are updated and inserted into every second.
A colleague of ours suggested another WCF service, instead of the database, between the feed handler and the WCF user-requests handler. The WCF mid-layer is supposed to be TCP-based, and it keeps the data in its own memory. One might say that the feed handler could deal with user requests directly instead of having a middle layer between the two processes, but we want to separate things, so if the feed handler crashes we are still able to provide the user with the current records.
We are limited in time, and we want to move everything to memory in a short period. Is having a WCF service in the middle of two processes a bad thing to do? I know that the requests add some overhead, but all three processes (feed handler, in-memory database (WCF), user-request handler (WCF)) are going to be on the same machine, and bandwidth will not be much of an issue.
Please assist!
I would look into creating a cache of the data (such that you can also reduce database selects), and invalidate data in the cache once it has been written to the database. This way, you can batch up calls to do a larger insert instead of many smaller ones, but keep the data in-memory such that the readers can read it. Actually, if you know when the data goes stale, you can avoid reading the database entirely and use it just as a backing store - this way, database performance will only affect how large your cache gets.
Invalidating data in the cache should be based on whether it has been written to the database or has gone stale, whichever comes last, not first.
The cache layer doesn't need to be complicated; however, it should be multi-threaded so that it can host the data and also save it in the background. This layer would sit just behind the WCF service (the connection medium), and the WCF service should be extended to contain the logic of the console app plus the batching idea. Then the console app can just connect to WCF and throw results at it.
Update: the only other thing to say is to invest in a profiler to see whether you are introducing any performance issues in code that are being masked. Also, profile your database. You mention you need fast inserts and selects; unfortunately, those usually trade off against each other...
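A minimal sketch of the cache-plus-background-batching idea described above (the class shape, flush interval, and bulk-insert delegate are assumptions, not your existing code):

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;

    class FeedCache<T>
    {
        private readonly ConcurrentQueue<T> _pending = new ConcurrentQueue<T>();  // awaiting insert
        private readonly List<T> _recent = new List<T>();                         // served to readers
        private readonly object _recentLock = new object();
        private readonly Action<IReadOnlyList<T>> _bulkInsert;                    // your batched DB insert
        private readonly Timer _flushTimer;

        public FeedCache(Action<IReadOnlyList<T>> bulkInsert, TimeSpan flushInterval)
        {
            _bulkInsert = bulkInsert;
            _flushTimer = new Timer(_ => Flush(), null, flushInterval, flushInterval);
        }

        // Called by the feed handler every second.
        public void Add(T record)
        {
            _pending.Enqueue(record);
            lock (_recentLock) _recent.Add(record);   // stale entries should be pruned here too
        }

        // Called by the user-request handler; never touches the database.
        public T[] Snapshot()
        {
            lock (_recentLock) return _recent.ToArray();
        }

        // Runs in the background: one large insert instead of many small ones.
        private void Flush()
        {
            var batch = new List<T>();
            while (_pending.TryDequeue(out var item)) batch.Add(item);
            if (batch.Count > 0) _bulkInsert(batch);
        }
    }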
What kind of database are you using? MySQL has a MEMORY storage engine which would seem to be suited to this sort of thing.
Are you using DataTable with a DataAdapter? If so, I would recommend that you drop them completely. Insert your records directly using a DbCommand. When users request reports, read the data using a DataReader, or populate DataTable objects using DataTable.Load(IDataReader).
Storing data in memory has the risk of losing data in case of crashes or power failures.
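A rough sketch of the DbCommand / DataTable.Load approach suggested above; the table, columns, and query are hypothetical:

    using System;
    using System.Data;
    using System.Data.SqlClient;

    static class TickRepository
    {
        // Insert one record directly, without a DataAdapter round-trip.
        public static void Insert(string connStr, DateTime tickTime, decimal value)
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "INSERT INTO dbo.Ticks (TickTime, Value) VALUES (@t, @v)", conn))
            {
                cmd.Parameters.AddWithValue("@t", tickTime);
                cmd.Parameters.AddWithValue("@v", value);
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }

        // Populate a DataTable for a report straight from a DataReader.
        public static DataTable GetLastSeconds(string connStr, int seconds)
        {
            var table = new DataTable();
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "SELECT TickTime, Value FROM dbo.Ticks WHERE TickTime >= DATEADD(second, @s, SYSDATETIME())", conn))
            {
                cmd.Parameters.AddWithValue("@s", -seconds);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    table.Load(reader);
                }
            }
            return table;
        }
    }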