SQL Server select with large varchar columns take time to load - sql

I am trying to run a simple select query and it has column called instructions with varchar(8000) in the select column list. The table has
90,000 records and it took my SQL server management studio console to 10 seconds to return and display the full table data
SELECT id, name, instructions, etc.... FROM TABLE;
however when i remove the instructions from the select list it took only a 1 second to execute and display the result. Can any one please help me to understand the theory behind this
Thanks
Keth

There are some obvious things here that impact the time, and a few more subtle ones around it. The topic of the underlying storage of SQL Server and how it stores / retrieves this data is a book in itself, of which there are many. (I'd personally recommend Kalen Delaney but everyone will have their own preference and I appreciate we should keep away from subjectivity on SO).
90k rows of instructions potentially have to be marshalled across your network connection if you were connected from another machine than the server.
The SSMS console itself, has to display these, which itself takes time.
depending on the size of what you are reading vs your buffer cache and other queries being executed you could be putting pressure on your cache and generating more physical IO load for the server as a whole.
As mentioned in comments, more data is being read, but does this mean more is being read from the disk? This one is far more subtle when looked at in detail.
In terms of the disk IO issue, depending on when the instructions are placed in the row and the settings for the column around inlining of data. It might be that the instructions for the row are stored inline with the row, which means no additional disk IO is actually occurring to read them vs not read them, its more a case of whether SQL Server bothers to decode the value from the page in memory.
The varchar(8000) though might not be inline with the rest of the data, it could be on a row_overflow_data page, sometimes referred to as short large object (SLOB), in which case the instruction field itself stores a pointer where the data is stored, and when you read the instructions it causes SQL Server to have to read another entirely random page (and extent) elsewhere on the disk per row.
Depending how / when instructions are added, you could see a huge level of fragmentation / lack of contiguous extents being allocated for these instructions, although depending on the IO subsystem, this may be immaterial to the problem.
There are a lot of unknowns at this point which makes it harder to give anything definitive - you are in the 'it depends' area of the DB, which would need a lot more specifics and investigation to be able to point at a specific cause, vs the more general (and not entirely complete) list above.
As Tim Biegeleisen mentioned, do not read the instructions unless you need to.

Related

Why Select SQL queries on tables with blobs are slow, even when the blob is not selected?

SELECT queries on tables with BLOBs are slow, even if I don't include the BLOB column. Can someone explain why, and maybe how to circumvent it? I am using SQL Server 2012, but maybe this is more of a conceptual problem that would be common for other distributions as well.
I found this post: SQL Server: select on a table that contains a blob, which shows the same problem, but the marked answer doesn't explain why is this happening, neither provides a good suggestion on how to solve the problem.
If you are asking for a way to solve the performance drag, there are a number of approaches that you can take. Adding indexes to your table should help massively provided you aren't simply selecting the entire recordset. Creating views over the table may also assist. It's also worth checking the levels of index fragmentation on the table as this can cause poor performance and could be addressed with a regular maintenance job. The suggestion of creating a linked table to store the blob data is also a genuinely good one.
However, if your question is asking why it's happening, this is because of the fundamentals of the way MS SQL Server functions. Essentially your database, and all databases on the server and split into pages, 8kb chunks of data with a 96-byte header. Each page representing what is possible in a single I/O operation. Pages are collected contained and grouped within Exents, 64kb collections of eight contiguous pages. SQL Server therefore uses sixteen Exents per megabyte of data. There are a few differing page types, a data page type for example won't contain what are termed "Large Objects". This include the data types text, image, varbinary(max), xml data, etc... These also are used to store variable length columns which exceed 8kb (and don't forget the 96 byte header).
At the end of each page will be a small amount of free space. Database operations obviously shift these pages around all the time and free space allocations can grow massively in a database dealing with large amounts of I/O and random record access / modification. This is why free space on a database can grow massively. There are tools available within the management suite to allow you to reduce or remove free space and basically this re-organizes pages and exents.
Now, I may be making a leap here but I'm guessing that the blobs you have in your table exceed 8kb. Bear in mind if they exceed 64kb they will not only span multiple pages but indeed span multiple exents. The net result of this will be that a "normal" table read will cause massive amounts of I/O requests. Even if you're not interested in the BLOB data, the server may have to read through the pages and exents to get the other table data. This will only be compounded as more transactions make pages and exents that make up a table to become non-contiguous.
Where "Large Objects" are used, SQL Server writes Row-Overflow values which include a 24bit pointer to where the data is actually stored. If you have several columns on your table which exceed the 8kb page size combined with blobs and impacted by random transactions, you will find that the majority of the work your server is doing is I/O operations to move pages in and out of memory, reading pointers, fetching associated row data, etc, etc... All of which represents serious overhead.
I got a suggestion then, have all the blobs in a separate table with an identity ID, then only save the identity ID in your main table
it could be because - maybe SQL cannot cache the table pages as easily, and you have to go to the disk more often. I'm no expert as to why though.
A lot of people frown at BLOBS/images in databases - In SQL 2012 there is some sort of compromise where you can configure the DB to keep objects in a file structure, not in the actual DB anymore - you might want to look for that

Fine tuning oracle query with pipelined function

I have a query (that powers an Oracle Application Express Report) that I was told by our users was executing "slowly" or at an unacceptable speed (wasn't given an actual load time for the page and the query is the only thing on the page).
The query involves many tables and actually references a pipelined function which identifies the currently logged-in users to our website and returns a custom "table" of records they have permission to based upon a custom security scheme we have.
My main question is around Oracle's caching of queries and how they could be affected by our setup.
When I took the query out of the webpage and ran it in Sql Developer (and manually specified a user ID to simulate a logged-in user to the website), the performance went from 71 seconds to 19 seconds to .5 seconds. Clearly, Oracle is utilizing its caching mechanism to make subsequent runs faster.
How is this affected by?:
The fact that different users will get different tables from the
pipe-lined function (all the same columns, just different number of
rows and the values in the rows). Does the pipe-lining prevent
caching from working? Am I only seeing caching because I'm running
a very isolated test?
Further more - is caching easily influenced by the number of people using the system? I'm not sure how "much" can get cached. Therefore, if we have 50 concurrent users that are accessing different parts of the website that are loading different queries all day long, is it likely that oracle won't be able to cache many/any of them because it is constantly seeing different request for queries?
Sorry my question isn't very technical.
I'm a developer who has been asked to help out in this seemingly DBA question.
Also, this is complicated because I can't really determine what the actual load times are since our users don't report that level of detail.
Any thoughts on:
how I can determine if this query is actually slow?
what the average processing time would be?
and how to proceed with fine tuning if it is a problem?
Thanks!
It doesn't sound like this has anything to do with APEX, pipelined table functions, or query caching. It sounds like you are describing the effects of plain old data caching (most likely at the database level but potentially at the operating system and disk subsystem layers).
As a very basic overview, data is stored in rows, rows are stored in blocks (most commonly 8 kb in size), blocks are stored in extents (generally a few MB in size), and extents roll up to segments (i.e. a table). Oracle maintains a buffer cache where the most recently accessed blocks are stored. When you run a query, Oracle figures out which blocks it needs to read in order to get your data (this is the query plan). It then looks to see whether those blocks are in the buffer cache or whether they have to be read from disk. Obviously, reading a block from cache is much more efficient than reading it off the disk since RAM is much faster than disk. If you run the same query with the same set of bind variable values multiple times in a row, you'll be accessing the same set of blocks each time but more and more of the blocks you care about are going to be in the cache. So you'd generally expect that the second and third time that you call the query, you'll see faster performance.
If you run the query with a different set of bind variable values, if the second set of bind variable values causes Oracle to access many of the same blocks, those executions will benefit from the data the prior test cached. Otherwise, you'd be back to square 1 potentially reading all the data you need off disk. Most likely, you'll see some combination of the two.
Remember as well that it is not just Oracle that is caching data. Frequently, the operating system will be caching the most active pieces of the underlying Oracle data files. And the I/O subsystem will be caching the most recently accessed data as well. So even if Oracle thinks that it needs to go out to fetch a block because it is not in the database's buffer cache, the file system or the I/O subsystem may have cached that data so it may not require an actual physical read off of disk. These other caches behave similarly where running the same query multiple times in a row is likely to cause the cache to be "warm" and improve the performance of the later runs.

monetdb in the cloud, scalability, amazon s3

i have recently discovered MonetDB and i am evaluating it for an internal project, so probably my questions are from a really newbie point of view. Maybe someone could point me to a site and/or document where i could find more info (i haven't found too much googling)
regarding scalability, correct me please if i am wrong, but what i understand is that if i need to scale, i would launch more server instances and discover them from the control node, is it right?
is there any limit on the number of servers?
the other point is about storage, is it possible to use amazon S3 to back MonetDB readonly instances?
update we would need to store a massive amount of Call Detail Records from different sources, on a read-only basis. We would aggregate/reduce that data for the day-to-day operation, accessing the bigger tables only when the full detail is required.
We would store the historical data as well to perform longer-term analysis. My concern is mostly about memory, disk storage wouldn't be the issue i think; if the hot dataset involved in a report/analysis eats up the whole memory space (fast response times needed, not sure about how memory swapping would impact), i would like to know if i can scale somehow instead of reingeneering the report/analysis process (maybe i am biased by the horizontal scaling thing :-) )
thanks!
You will find advantages of monetdb easily on net so let me highlight some disadvantages
1. In monetdb deleting rows does not free up the space
Solution: copy data in other table,drop existing table, and rename the other table
2. Joins are little slower
3. We can can not give table name as dynamic variable
Eg: if you have table name stored in one main table then you can't make a query like "for each (select tablename from mytable) select data from tablename)" the sql
You can't make functions with tablename as variable argument.
But it is still damn fast and can store large amount of data.

What Database for extensive logfile analysis?

The task is to filter and analyze a huge amount of logfiles (around 8TB) from a finished research project. The idea is to fill a database with the data to be able to run different analysis tasks later.
The values are stored comma separated. In principle the values are tuples of up to 5 values:
id, timestamp, type, v1, v2, v3, v4, v5
In a first try using MySQL I used one table with one log entry per row. So there is no direct relation between the log values. The downside here is slow querying of subsets.
Because there is no relation I looked into alternatives like NoSQL databases, and column based tables like hbase or cassandra seemed to be a perfect fit for this kind of data. But these systems are made for huge distributed systems, which we not have. In our case the analysis will run on a single machine or perhaps some VMs.
Which kind of database would fit this task? Is it worth to setup a single machine instance with hadoop+hbase... or is this all a bit over-sized?
What database would you choose to do high-performance logfile analysis?
EDIT: Maybe out of my question it is not clear that we cannot spend money for cloud services or new hardware. The Question is if there are benefits in using noSQL approaches instead of mySQL (especially for this data). If there are none, or if they are so small that the effort of setting up a noSQL system is not worth the benefit we can use our ESXi infrastructure and MySQL.
EDIT2: I'm still having the Problem here. I did further experiments with MySQL and just inserted a quarter of all available data. The insert is now running for over 2 days and is not yet finished. Currently there are 2,147,483,647 rows in my single table db. With indeces this takes 211,2 GiB of disk space. And this is just a quarter of all logging data...
A query of the form
SELECT * FROM `table` WHERE `timestamp`>=1342105200000 AND `timestamp`<=1342126800000 AND `logid`=123456 AND `unit`="UNIT40";
takes 761 seconds to complete, in this case returning one row.
There is a combined index on timestamp, logid, unit.
So I think this is not the way to go, because later in analysis I will have to get all entries in a time range and compare the datapoints.
I read bout MongoDB and Redis, but the problem with them is, that they are in Memory databases.
In the later analyzing process there will a very small amount of concurrent database access. In fact the analyzing will be run from one single machine.
I do not need redundancy. I would be able to regenerate the database in case of a failure.
When the database is once completely written, there would also be no need to update or add further row.
What do you think about alternatives like Redis, MongoDB and so on. When I get this right, i would need RAM in the dimension of my data...
Is this task even somehow possible with a single node system or with maybe two nodes?
well i personally would prefer the faster solution, as you said you need a high-perfomance analysis. the problem is, if you have to setup a whole new system to do so and the performance-improvement would be minor in relation to the additional effort you'd need, then stay with SQL.
in our company, we have a quite small Database containing not even half a GB of Data on the VM. the problem now is, as soon as you use a VM, you will have major performance issues, when opening the Database on VM you can go for a coffee in the meantime ;)
But if the time until the Database is loaded to cache is not so important it doesn't matter. It all depends on how much faster you think the new System will be, and how much effort you will have to put in it, but as i said i'd prefer the faster solution if you have to go for "high-performance analysis"

Is there any performance reason to use powers of two for field sizes in my database?

A long time ago when I was a young lat I used to do a lot of assembler and optimization programming. Today I mainly find myself building web apps (it's alright too...). However, whenever I create fields for database tables I find myself using values like 16, 32 & 128 for text fields and I try to combine boolean values into SET data fields.
Is giving a text field a length of 9 going to make my database slower in the long run and do I actually help it by specifying a field length that is more easy memory aligned?
Database optimization is quite unlike machine code optimization. With databases, most of the time you want to reduce disk I/O, and wastefully trying to align fields will only make less records fit in a disk block/page. Also, if any alignment is beneficial, the database engine will do it for you automatically.
What will matter most is indexes and how well you use them. Trying tricks to pack more information in less space can easily end up making it harder to have good indexes. (Do not overdo it, however; not only do indexes slow down INSERTs and UPDATEs to indexed columns, they also mean more work for the planner, which has to consider all the possibilities.)
Most databases have an EXPLAIN command; try using it on your selects (in particular, the ones with more than one table) to get a feel for how the database engine will do its work.
The size of the field itself may be important, but usually for text if you use nvarchar or varchar it is not a big deal. Since the DB will take what you use. the follwoing will have a greater impact on your SQL speed:
don't have more columns then you need. bigger table in terms of columns means the database will be less likely to find the results for your queries on the same disk page. Notice that this is true even if you only ask for 2 out of 10 columns in your select... (there is one way to battle this, with clustered indexes but that can only address one limited scenario).
you should give more details on the type of design issues/alternatives you are considering to get additional tips.
Something that is implied above, but which can stand being made explicit. You don't have any way of knowing what the computer is actually doing. It's not like the old days when you could look at the assembler and know pretty well what steps the program is going to take. A value that "looks" like it's in a CPU register may actually have to be fetched from a cache on the chip or even from the disk. If you are not writing assembler but using an optimizing compiler, or even more surely, bytecode on a runtime engine (Java, C#), abandon hope. Or abandon worry, which is the better idea.
It's probably going to take thousands, maybe tens of thousands of machine cycles to write or retrieve that DB value. Don't worry about the 10 additional cycles due to full word alignments.