Can anyone please explain "storing" vs "indexing" in databases? - sql

What is storing and what is indexing a field when it comes to searching?
Specifically I am talking about MySQL or SOLR.
Is there any thorough article about this? I have searched without luck.
Thanks

Storing information in a database just means writing the information to a file.
Indexing a database involves looking at the data in a table and creating an 'index', which is then used to perform a more efficient lookup in the table when you want to retrieve the stored data.
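A minimal, runnable sketch of the difference, using SQLite (the table and column names are made up for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

    # Storing: the rows are simply written to the table.
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [("a@example.com",), ("b@example.com",)],
    )

    # Indexing: a separate lookup structure is built so that searches on
    # email no longer have to scan every row.
    conn.execute("CREATE INDEX idx_users_email ON users (email)")

    # The query plan now shows the index being used instead of a full scan.
    print(conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?",
        ("a@example.com",),
    ).fetchall())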

From Wikipedia:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records. The disk space required to store the index is typically less than that required by the table (since indexes usually contain only the key-fields according to which the table is to be arranged, and excludes all the other details in the table), yielding the possibility to store indexes in memory for a table whose data is too large to store in memory.
Storing is just putting data in the tables.

Storing vs. indexing is a SOLR concept.
In SOLR, a field that is stored but not indexed cannot be searched on or sorted by. It can, however, be retrieved in the results of a query that matches on an indexed field.
In MySQL, by contrast, you can also search and sort on unindexed fields: it will just be slower, but still possible (unlike SOLR).

Storing data is just storing data somewhere so you can retrieve it later. Where indexing comes in is retrieving parts of the data efficiently. Wikipedia explains the idea quite well.

Storing is just that: saving the data to disk (or wherever) so that the database can retrieve it later on demand.
Indexing means creating a separate data structure that lets the database locate and retrieve that data faster than reading the entire database (or the entire table) and examining every record until it finds what you asked for. Databases generally use balanced-tree (B-tree) indices, which are an extension of the concept of a binary tree. Look up binary trees on Google/Wikipedia for a more in-depth understanding of how this works. A rough sketch of the idea follows below.
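As a rough sketch of the idea (not how a real B-tree is laid out on disk), an index can be thought of as a sorted structure of (key, row location) pairs that can be binary-searched instead of scanning every record:

    import bisect

    # Hypothetical "table": rows stored in arrival order (a full scan is O(n)).
    rows = [(42, "carol"), (7, "alice"), (19, "bob")]

    # Hypothetical "index": (key, row position) pairs kept sorted by key,
    # so a lookup is a binary search, O(log n), instead of a scan.
    index = sorted((key, pos) for pos, (key, _) in enumerate(rows))
    keys = [k for k, _ in index]

    def lookup(key):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return rows[index[i][1]]
        return None

    print(lookup(19))  # (19, 'bob') found without scanning every row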

Data
L1. This
L2. Is
L3. My Data
And the index is
This -> L1
Is -> L2
My -> L3
Data -> L3
The data/index analogy holds for books as well.
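The same toy index can be built programmatically; here is a small sketch that constructs an inverted index (word -> line labels) from data like the example above:

    from collections import defaultdict

    data = {"L1": "This", "L2": "Is", "L3": "My Data"}

    index = defaultdict(list)
    for label, text in data.items():
        for word in text.split():
            index[word].append(label)

    print(dict(index))
    # {'This': ['L1'], 'Is': ['L2'], 'My': ['L3'], 'Data': ['L3']}

    # Finding every line containing "Data" is now a single dictionary lookup
    # instead of a scan over all the stored lines.
    print(index["Data"])  # ['L3']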

Related

How to find out how much space a SQL Server table uses?

Is it possible to get the amount of space on disk that a particular table uses? Let's say I have a million users stored in my table and I want to know how much space is required to store all of them and/or a single one.
Update:
I'm planning to use Redis to cache some fields from one particular table in memory for quick retrieval. So I need to calculate approximately how much space it will take, and thus whether it will fit in memory or not. It obviously depends on the data types used in the table, but if a table consists of several dozen fields it would take too much time to work this out column by column.
There is exactly such an answer for MySQL, though it's not suitable for SQL Server: How can you determine how much disk space a particular MySQL table is taking up? You can check it to see what I mean.
If you have SSMS, you can right-click on the table in the Object Explorer, go to Properties, and then look at the Storage page. The Data space field shows the size of the data in that table, though it probably does not include some of the table's overhead costs.
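Alternatively, similar numbers can be queried with the built-in sp_spaceused procedure, which reports the row count, reserved space, data size and index size for a table. A minimal sketch using pyodbc (the connection string and table name are placeholders):

    import pyodbc

    # Placeholder connection string; adjust driver, server and database.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=MyDb;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()

    # sp_spaceused reports rows, reserved, data, index_size and unused space.
    cursor.execute("EXEC sp_spaceused N'dbo.Users'")
    print(cursor.fetchone())

    conn.close()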
This is really an extended comment, because it does not directly answer the question.
For most purposes, you can just take the size of each column, add them together, and multiply by the number of rows. This lowballs the estimate, but it is reasonable, and (depending on how you handle the types) it might be a fair estimate of the size of an export of the data.
That said, the storage of tables is a difficult matter. Here are some of the factors you need to take into account:
The size of individual fields. This is made slightly more difficult because some types have varying sizes, so those are entirely data dependent.
The number of pages occupied by the table (or, equivalently, how full each data page is). Note that this can vary, depending on how full each page is.
The number of pages occupied by "overflow" data types, such as varchar(max).
Whether or not the data pages are compressed or encrypted.
The indexes for the table.
How full each index page is.
And, no doubt, I've left out a bunch of other relevant internal details (here is a place to start on page layouts).
In other words, there isn't a simple answer. Equivalent tables on two different systems could occupy very different amounts of space. This is true even of the "same" table on the same system at different times.
The general answer when working with databases is that you need a lot more space than number of rows * row size -- I seem to recall using a factor of 3 at one point. In general, storage is pretty cheap, so this is usually not the limiting factor when using a database.
We would need to see your full database schema, with tables and columns and all fields' data types. Without those pieces of information it's just a lucky guess. Here is a helpful cheat sheet of the sizes of each data type: https://www.connectionstrings.com/sql-server-2012-data-types-reference/
Then you just have to do the math and calculate the space needed for X, where X is your number of records.
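A rough back-of-the-envelope sketch of that math, with made-up column types and sizes (it ignores page overhead, fill factor and indexes, as discussed in the answer above):

    # Hypothetical per-column sizes in bytes (int = 4, datetime = 8,
    # nvarchar columns averaging some assumed number of bytes, etc.).
    columns = {"id": 4, "created_at": 8, "name": 60, "email": 80}

    rows = 1_000_000
    row_size = sum(columns.values())   # bytes per row (data only)
    raw_estimate = row_size * rows     # lower-bound estimate

    print(f"~{raw_estimate / 1024**2:.0f} MB of raw data")
    # A safety factor (e.g. the 3x suggested above) covers indexes and overhead.
    print(f"~{raw_estimate * 3 / 1024**2:.0f} MB with a 3x safety factor")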

How to model data for a CouchDB geocoder

I am working on a CouchDB based geocoding application using a large national dataset that is supplied relationally. There are some 250 million records split over 9 tables (The ER Diagram can be viewed at http://bit.ly/1dlgZBt). I am quite new to nosql document databases and CouchDB in particular and am considering how to model this. I have currently loaded the data into a CouchDB database per table with a type field indicating which kind of record it is. The _id attribute is set to be the primary key for table [A] and [C], for everything else it is auto-generated by Couch. I plan on setting up Lucene with Couch for indexing and full text search. The X and Y Point coordinates are all stored in table [A] but to find these I will need to search using data in [Table E], [Tables B, C & D combined] and/or [Table I] with the option of filtering results based on data in [Table F].
My original intention was to create a single CouchDB database which would combine all of these tables into a single structure with [Table A] as the root and all related tables nested under this. I would then build my various search indexes on this and also setup a spatial index using GeoCouch for reverse geocoding. However I have read articles that suggest view collation as an alternative approach.
An important factor here I guess is reads vs writes. The plan is that this data will never be updated, only read. Data is released every quarter at which time the existing DB would be blown away and a new DB created.
I would welcome any suggestions for how best to setup and organise this from any experienced Couch or related document database users.
Many thanks in advance for any assistance.
guygrange,
While I am far from an expert in document database design, the key thing to recognize about document DBs is that everything is about making your queries fast by keeping all of the necessary information in a single document. Hence, you need to look at your queries and how you expect to access this data. For example, I can easily imagine a geocoding application not needing access to everything in each table for its most frequent queries. Hence, to save on bandwidth, you would make a main document that holds the information you most frequently care about, along with a key to the rest of the data. You could then fetch the remaining data with that key and merge the dictionaries for easy handling in your client code.
Anon,
Andrew
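To make the suggestion above concrete, here is a small sketch (with entirely made-up field names) of what a "main" geocoding document plus a reference to the less frequently needed data might look like:

    # Main document: what most geocoding queries need, kept small.
    main_doc = {
        "_id": "addr:123456",            # primary key from Table A
        "type": "address",
        "search_text": "10 Downing Street, London",
        "x": 530047.0,                   # point coordinates from Table A
        "y": 179951.0,
        "detail_id": "detail:123456",    # key of the document holding the rest
    }

    # Secondary document: the rarely needed columns from the other tables.
    detail_doc = {
        "_id": "detail:123456",
        "type": "address_detail",
        "classification": "residential", # e.g. filter data from Table F
        "history": [],                   # anything else from Tables B-I
    }

    # On a hit, fetch detail_doc via main_doc["detail_id"] and merge:
    merged = {**detail_doc, **main_doc}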

What's the fastest way to copy data from one table to another in Django?

I have two models -
ChatCurrent - (which stores the messages for the current active chats)
ChatArchive - (which archives the messages for the chats that have ended)
The reason I'm doing this is so that the ChatCurrent table always has a minimal number of entries, making queries against it fast (I don't know if this works; please let me know if I've got this wrong).
So I basically want to copy (cut) data from the ChatCurrent to the ChatArchive model. What would be the fastest way to do this? From what I've read online, it seems that I might have to execute a raw SQL query; if you would be kind enough to state the query, I'd be grateful.
Additional details -
Both the models have the same schema.
My opinion is that, these days, there is no reason to denormalize a database in this way to improve performance. Indexes, or partitioning plus indexes, should be enough.
Also, if for semantic reasons you prefer to have two tables (models), such as Chat and ChatHistory (or ChatCurrent and ChatArchive, as you describe), and to manage them with Django, I think the right way to keep things consistent is to create a ToArchive() method on ChatCurrent. This method moves chat entries to the historical chat model. You can run this operation in the background, for example by handing the swap off to a Celery task, so that online users don't have to wait on the request. Inside the Celery task, the fastest way to copy the data is raw SQL. Remember that you can also encapsulate the SQL in a stored procedure.
Edited to include reply to your comment
You can call ToArchive() from the ChatCurrent.save() method:
class ChatCurrent(models.Model):
    closed = models.BooleanField()

    def save(self, *args, **kwargs):
        super(ChatCurrent, self).save(*args, **kwargs)
        if self.closed:
            self.ToArchive()

    def ToArchive(self):
        from django.db import connection, transaction
        cursor = connection.cursor()
        cursor.execute("insert into blah blah")  # your raw INSERT ... SELECT goes here
        transaction.commit_unless_managed()  # needed only on older Django versions
        #self.delete()  # if needed (perhaps deleted in the raw SQL)
Try something like this:
INSERT INTO "ChatArchive" ("column1", "column2", ...)
SELECT "column1", "column2", ...
FROM "ChatCurrent" WHERE yourCondition;
and then just
DELETE FROM "ChatCurrent" WHERE yourCondition;
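One caution worth adding: run the INSERT and the DELETE in a single transaction, so a failure between the two statements cannot lose rows. A sketch using Django's transaction API, assuming both tables share the same columns and a hypothetical closed flag as the condition:

    from django.db import connection, transaction

    def archive_closed_chats():
        with transaction.atomic():
            with connection.cursor() as cursor:
                cursor.execute(
                    'INSERT INTO "ChatArchive" (column1, column2) '
                    'SELECT column1, column2 FROM "ChatCurrent" WHERE closed = true'
                )
                cursor.execute('DELETE FROM "ChatCurrent" WHERE closed = true')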
What you are trying to do is table partitioning.
Most databases support this feature without the need for manual bookkeeping.
Partitioning will also yield much better results than manually moving parts of the data to a different table. By using partitioning you avoid:
- Data inconsistency, which is easy to introduce because you will move records in bulk and then delete many of them from the source table. It's easy to make a mistake and copy only a portion of the data.
- A performance drop: moving the data around, and the associated transaction overhead, will generally negate any benefit you gained from reducing the size of the ChatCurrent table.
For a really quick rundown: table partitioning lets you tell the database that parts of the data are stored and retrieved together, which significantly speeds up queries because the database knows it only has to look at a specific part of the data set. Example: chats from the current day, last hour, last month, etc. You can additionally store each partition on a different drive; that way you can keep your current chatter on a fast SSD and your history on regular, slower disks.
Please refer to your database manual to know the details about how it handles partitioning.
Example for PostgreSQL: http://www.postgresql.org/docs/current/static/ddl-partitioning.html
Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for leading columns of indexes, reducing index size and making it more likely that the heavily-used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of sequential scan of that partition instead of using an index and random access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE NO INHERIT and DROP TABLE are both far faster than a bulk operation. These commands also entirely avoid the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
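For a concrete feel, here is a minimal sketch of declarative range partitioning (available in PostgreSQL 10 and later; older versions use the inheritance approach from the linked docs), with a hypothetical chat_message table partitioned by month and psycopg2 as the driver:

    import psycopg2

    conn = psycopg2.connect("dbname=chat user=postgres")  # placeholder DSN
    cur = conn.cursor()

    # Parent table: rows are routed to a partition based on created_at.
    cur.execute("""
        CREATE TABLE chat_message (
            room_id    bigint      NOT NULL,
            body       text        NOT NULL,
            created_at timestamptz NOT NULL
        ) PARTITION BY RANGE (created_at);
    """)

    # One partition per month; recent partitions could live in a tablespace
    # on fast SSDs, older ones on slower disks.
    cur.execute("""
        CREATE TABLE chat_message_2024_01 PARTITION OF chat_message
            FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
    """)

    conn.commit()
    cur.close()
    conn.close()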
# Copies an EmailDetail row into CopyEmailDetail attribute by attribute;
# assumes both models define the same fields.
def copyRecord(self, recordId):
    emailDetail = EmailDetail.objects.get(id=recordId)
    copyEmailDetail = CopyEmailDetail()
    for field in emailDetail.__dict__.keys():
        copyEmailDetail.__dict__[field] = emailDetail.__dict__[field]
    copyEmailDetail.save()
    logger.info("Record Copied %d" % copyEmailDetail.id)
As per the above answers, don't copy the data over.
If you really want two separate tables to query, store all your chats in a single table (preferably using the database techniques mentioned here), and then have Current and Archive tables whose objects simply point to Chat objects.

Is O(1) access to a database row possible?

I have a table which uses an auto-increment field (ID) as its primary key. The table is append-only and no row will ever be deleted. The table has been designed to have a constant row size.
Hence, I expected to have O(1) access time when using any value as the ID, since it is easy to compute the exact position to seek to in the file (ID * row_size); unfortunately that is not the case.
I'm using SQL Server.
Is it even possible ?
Thanks
"Hence, I expected to have O(1) access time using any value as ID since it is easy to compute exact position to seek in file (ID*row_size)"
Ah. No. Even without deletions, auto-increment does not guarantee that there are no holes in the ID sequence. Holes = seek via index. Ergo: your assumption is wrong.
I guess the thing that matters to you is the performance.
Databases use indexes to access records which are written on the disk.
Usually this is done with B+ tree indexes, whose lookup cost is log_b(n), where the branching factor b of the internal nodes is typically between 100 and 200 (optimized to the block size).
This is still, strictly speaking, logarithmic performance, but given a decent number of records, say a few million, the leaf nodes can be reached in 3 to 4 steps. Together with all the overhead for query planning, session initiation, locking, etc. (which you would have anyway if you need a multiuser, ACID-compliant data management system), this is, for all practical purposes, comparable to constant time.
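The arithmetic behind "reached in 3 to 4 steps" is easy to check:

    import math

    rows = 5_000_000          # "a few million" records
    for b in (100, 200):      # typical B+ tree branching factors
        height = math.ceil(math.log(rows, b))
        print(f"branching factor {b}: about {height} levels to reach a leaf")
    # branching factor 100: about 4 levels to reach a leaf
    # branching factor 200: about 3 levels to reach a leaf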
The good news is that an indexed read is O(log(n)), which for large values of n gets pretty close to O(1). That said, in this context O notation is not very useful, and actual timings are far more meaningful.
Even if it were possible to address rows directly, your query would still have to go through the client and server protocol stacks and carry out various lookups and memory allocations before it could give the result you want. It seems like you are expecting something that isn't even practical. What is the real problem here? Is SQL Server not fast enough for you? If so there are many options you can use to improve performance but directly seeking an address in a file is not one of them.
Not possible. SQL Server organizes data into a tree-like structure based on key and index values; an "index" in the DB sense is more like a reference book's index and not like an indexed data structure like an array or list. At best, you can get logarithmic performance when searching on an indexed value (PKs are generally treated as an index). Worst-case is a table scan for a non-indexed column, which is linear. Until the database gets very large, the seek time of a well-designed query against a well-designed table will pale in comparison to the time required to send it over the network or even a named pipe.

Lucene Indexing

I would like to use Lucene for indexing a table in an existing database. I have been thinking the process is like:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size and if a hit is found for a search, fetch the appropriate Field from Document
Don't store the Field and if a hit is found for a search, query the data base to get the relevant information out
I realize there may not be a one size fits all answer ...
For sure, your system will be more responsive if you store everything in Lucene. Stored fields do not affect query time; they only make your index bigger, and probably not that much bigger if only a small portion of the rows contain a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase the index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put significant stress on the GC;
stored fields are likely to increase reader reopen time.
The final answer, of course, is: it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
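A sketch of that "retrieve it from the original data store" pattern (the table, column names and the search_ids() helper are hypothetical; the search itself would come from Lucene and return only primary keys):

    import sqlite3

    def fetch_hits(db_path, hit_ids):
        """Fetch full rows for the primary keys returned by the index search."""
        if not hit_ids:
            return []
        conn = sqlite3.connect(db_path)
        placeholders = ",".join("?" for _ in hit_ids)
        rows = conn.execute(
            "SELECT id, title, big_text_column FROM documents "
            f"WHERE id IN ({placeholders})",
            hit_ids,
        ).fetchall()
        conn.close()
        return rows

    # hit_ids = search_ids("some query")   # hypothetical wrapper around Lucene
    # results = fetch_hits("app.db", hit_ids)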
When adding a row from the database to Lucene, you can decide whether each column actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing unnecessary data to the inverted index.
Likewise, you can decide whether a column's value will ever need to be retrieved from the index. If not, you don't need to use Store.YES for it.