What data is actually stored in a B-tree database in CouchDB? - indexing

I'm wondering what is actually stored in a CouchDB database B-tree. CouchDB: The Definitive Guide says that the database B-tree is used in an append-only manner and that a database is stored in a single B-tree (besides the per-view B-trees).
So my guess is that the items appended to the database file are document revisions, not whole documents:
            +------|###|--- ...
            |
     +------|###|------+--- ... ---+
     |         |       |           |
  +------+  +------+  +------+   +------+
  | doc1 |  | doc2 |  | doc1 |...| doc1 |
  | rev1 |  | rev1 |  | rev2 |   | rev7 |
  +------+  +------+  +------+   +------+
Is it true?
If it is, how is the current revision of a document determined from such a B-tree?
Doesn't that mean CouchDB needs a separate "view" database to index the current revisions of documents and preserve O(log n) access? Wouldn't that lead to race conditions while building such an index? (As far as I know, CouchDB uses no write locks.)

The database file on disk is append-only; however, the B-tree is conceptually modified in place. When you update a document:
1. Its leaf node is written (via an append to the DB file).
2. Its parent node is rewritten to reference the new leaf (via an append, of course).
3. Repeat step 2 until you have rewritten the root node.
When the root node is written, that is effectively the point at which the new revision is "committed." To find a document, you start at the end of the file, read the root node, and work down to your doc id. The latest revision is always reachable this way.
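To make that concrete, here is a deliberately simplified, hypothetical Python sketch of the append-only pattern (it is not CouchDB's actual on-disk format): the whole tree is collapsed into a single root node, an id-to-offset map that is re-appended after every document write, and reads locate the newest root by scanning back from the end of the file.

import json, os

def _encode(record):
    return (json.dumps(record) + "\n").encode("utf-8")

def _latest_root(path):
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        lines = f.read().splitlines()
    for line in reversed(lines):              # the newest root is nearest the end
        record = json.loads(line)
        if record["type"] == "root":
            return record
    return None

def put(path, doc):
    """Append the new document revision, then append an updated root node."""
    root = _latest_root(path) or {"type": "root", "index": {}}
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as f:
        f.write(_encode({"type": "doc", "doc": doc}))
        root["index"][doc["_id"]] = offset    # the root now points at the new leaf
        f.write(_encode(root))                # appending the root "commits" the write

def get(path, doc_id):
    """Find the newest root, then follow its pointer to the latest revision."""
    root = _latest_root(path)
    with open(path, "rb") as f:
        f.seek(root["index"][doc_id])
        return json.loads(f.readline())["doc"]

put("toy.db", {"_id": "doc1", "_rev": "1-a", "value": "old"})
put("toy.db", {"_id": "doc1", "_rev": "2-b", "value": "new"})
print(get("toy.db", "doc1"))                  # always returns the latest revision

The same idea scales to a real multi-level B-tree: only the nodes along the updated path are re-appended, and whichever root was written last defines the current state of the database.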

CouchDB does not store diffs. When you update a document, it appends the whole new document with a new _rev and the same _id as the old version. The old version is removed during compaction.
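You can see the revision mechanics from the client side through CouchDB's HTTP API. This sketch assumes a local CouchDB at http://localhost:5984, a database named mydb, and an existing document doc1 (all placeholders):

import requests

base = "http://localhost:5984/mydb"

doc = requests.get(f"{base}/doc1").json()          # current version, including its _rev
doc["counter"] = doc.get("counter", 0) + 1         # change something

# The update sends (and CouchDB appends) the WHOLE document, not a diff;
# the old _rev must be included so CouchDB can detect conflicting writes.
resp = requests.put(f"{base}/doc1", json=doc).json()
print(resp["rev"])                                 # the new revision id

# Old revisions stay in the file until compaction (may require admin credentials):
requests.post(f"{base}/_compact", headers={"Content-Type": "application/json"})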

Related

How to query bin level LUT on Aerospike 5.5?

Assume you have a set as follows:
+-------+-------+
| PK    | myBin |
+-------+-------+
| "1"   | 24    |
+-------+-------+
1 row in set (1 secs)
How to get the LUT metadata for PK=1 for bin myBin?
NOTE: I'm looking for bin level LUT and not row level
Bin-level LUTs are not exposed to clients for backward-compatibility reasons, so you cannot query them from the client. Also, bin-level LUTs are not always maintained; they are maintained only under certain XDR configurations.
A workaround is to write an additional bin with the timestamp of the update along with the regular bin update. If you have only a few bins for which you need to know the update timestamp, this is a reasonable workaround with some overhead.
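For example, a minimal sketch of that workaround with the Aerospike Python client (the namespace test, set demo, and the extra bin name myBinLUT are assumptions for illustration):

import time
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
key = ("test", "demo", "1")

# Write the real bin and a companion "last update time" bin in the same put().
client.put(key, {"myBin": 24, "myBinLUT": int(time.time())})

# Later: record-level metadata (gen, ttl) comes back in `meta`,
# while the per-bin update time is just our own bin value.
(_, meta, bins) = client.get(key)
print(meta, bins["myBinLUT"])

client.close()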

How to cache a sql database mapping independent of an application

I have an application that receives messages from a database via the write-ahead logs, and every row looks something like this:
| id | prospect_id | school_id | something | something else |
|----|-------------|-----------|-----------|----------------|
| 1  | 5           | 10        | who       | cares          |
| 2  | 5           | 11        | what      | this           |
| 3  | 6           | 10        | is        | blah           |
Eventually, I will need to query the database for the mapping between prospect_id and school name. The query results are in the 10,000s. The schools table has a name column that I can query with a simple join. However, I want to store this information somewhere on my server where it is easily accessible by my application. I need this information:
stored locally so I can access it quickly
capable of being updated once a day asynchronously
independent of the application, so that when it's deployed or restarted, it doesn't need to be queried again.
What can be done? What are some options?
EDIT
Is pickle a good idea? https://datascience.blog.wzb.eu/2016/08/12/a-tip-for-the-impatient-simple-caching-with-python-pickle-and-decorators/
What are the limitations of pickle? The results of the SQL query might be in the 10,000s.
The drawback of using pickle is that it is a Python-specific protocol. If you intend for other programming languages to read this file, the tooling might not exist to read it and you would be better off storing it in something like a JSON or XML file. If you will only be reading it with Python, then pickle is fine.
Here are a few options you have:
Load the data from SQL into a global variable at application startup (the SQL data can be stored locally; it doesn't have to be on an external system).
Use pickle to serialize/deserialize the data from a file when needed (see the sketch after this list).
Load the data into Redis, an in-memory caching system.
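A minimal sketch of the pickle option with a once-a-day refresh; query_prospect_school_map() is a hypothetical helper that runs the SQL join and returns a {prospect_id: school_name} dict, and the cache path is just an example. Because the file lives on disk, it survives deploys and restarts, and 10,000s of entries is tiny for pickle:

import os
import pickle
import time

CACHE_FILE = "/var/cache/myapp/prospect_school_map.pkl"   # example path
MAX_AGE = 24 * 60 * 60                                    # refresh once a day

def load_mapping():
    # Reuse the cached file if it exists and is less than a day old.
    fresh = (os.path.exists(CACHE_FILE)
             and time.time() - os.path.getmtime(CACHE_FILE) < MAX_AGE)
    if fresh:
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    mapping = query_prospect_school_map()    # hypothetical SQL join, returns a dict
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(mapping, f)
    return mapping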

How to perform DNS lookup with multiple questions?

The DNS standard allows specifying more than one question per query (I mean inside a single DNS packet). I'm writing a Snort plugin for DNS analysis and I need to test whether it behaves properly when there's a DNS query containing multiple questions.
DNS packet structure looks like this:
  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|          <ACTUAL QUESTIONS GO HERE>           |
|                                               |
|                      ...                      |
|                                               |
So if QDCOUNT is greater than 1, there can be multiple DNS questions in a single query.
How can I perform such a query using Linux tools? dig domain1.example domain2.example creates just 2 separate queries with 1 question each. host and nslookup seem to allow querying only 1 name at a time.
See this question for the full details: Requesting A and AAAA records in single DNS query
In short: no, in practice nobody today sends multiple questions in a single query. The behaviour was never clearly defined and raises a lot of questions (for example, there is only a single return code, so what do you do if one of two questions fails and the other succeeds?).
It would have been useful for requesting A and AAAA records at the same time (instead of the deprecated ANY), but it basically does not exist today.
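That said, for generating test traffic for a Snort plugin you can craft such a packet by hand. Here is a sketch using only Python's struct and socket modules; the target 127.0.0.1:53 and the transaction ID are placeholders, and most real servers will ignore or reject the second question:

import socket
import struct

def encode_qname(name):
    # 'domain1.example' -> 0x07 'domain1' 0x07 'example' 0x00
    out = b"".join(bytes([len(label)]) + label.encode("ascii")
                   for label in name.rstrip(".").split("."))
    return out + b"\x00"

def build_query(names, qtype=1, qclass=1):        # qtype 1 = A, qclass 1 = IN
    # Header: ID, flags (only RD set), QDCOUNT = len(names), AN/NS/ARCOUNT = 0
    header = struct.pack("!HHHHHH", 0x1234, 0x0100, len(names), 0, 0, 0)
    questions = b"".join(encode_qname(n) + struct.pack("!HH", qtype, qclass)
                         for n in names)
    return header + questions

packet = build_query(["domain1.example", "domain2.example"])   # QDCOUNT = 2

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2)
sock.sendto(packet, ("127.0.0.1", 53))     # assumed: resolver / sensor under test
try:
    reply, _ = sock.recvfrom(4096)
    print(reply.hex())
except socket.timeout:
    print("no reply (many servers ignore or reject QDCOUNT > 1)")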
You can retrieve all the records from a zone using a single AXFR request, and then parse out the ones you want.
dig @127.0.0.1 domain.com. AXFR
or
nslookup -query=AXFR domain.com 127.0.0.1
Typically AXFR requests are refused except for slave servers, so you will need to whitelist IPs that are allowed to make this request. (In bind this is done with the allow-transfer option).
This won't work for the OP's use case of making a Snort plugin that checks QDCOUNT, but it does kind of solve the problem of getting answers for multiple names with a single DNS request.
source: serverfault: How to request/acquire all records from a DNS?
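If you'd rather drive the zone transfer from a script, here is a sketch using the dnspython package (assuming the server at 127.0.0.1 allows transfers from your IP; domain.com is a placeholder):

import dns.query
import dns.zone

# Pull the whole zone with AXFR, then iterate over every name in it.
zone = dns.zone.from_xfr(dns.query.xfr("127.0.0.1", "domain.com"))
for name in zone.nodes:
    print(zone[name].to_text(name))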

Database model to describe IT environment [closed]

I'm looking at writing a Django app to help document fairly small IT environments. I'm getting stuck on how best to model the data, as the number of attributes per device can vary, even between devices of the same type. For example, a SAN will have 1 or more arrays and 1 or more volumes. The arrays will have attributes such as Name, RAID Level, Size, and Number of disks, while the volumes will have attributes of Size and Name. Different SANs will have a different number of arrays and volumes.
Same goes for servers, each server could have a different number of disks/partitions, all of which will have attributes of Size, Used space, etc, and this will vary between servers.
Another device type may be a switch, which won't have arrays or volumes, but will have a number of network ports, some of which may be gigabit, others 10/100, others 10Gigabit, etc.
Further, I would like the ability to add device types in the future without changing the model. A new device type may be a phone system, which will have its own unique attributes which may vary between different phone systems.
I've looked into EAV database designs but it seems to get very complicated very quickly, and I'm unclear on whether it's the best way to go about this. I was thinking something along the lines of the model as shown in the picture.
http://i.stack.imgur.com/ZMnNl.jpg
A bonus would be the ability to create 'snapshots' of environments at a particular time, making it possible to view changes to the environment over time. Adding a date column to the attributes table may be a way to solve this.
For the record, this app won't need to scale very much (at most 1000 devices), so massive scalability isn't a big concern.
Since your attributes are per model instance and differ for each instance, I would suggest going with a completely free schema:
from django.db import models

class ITEntity(models.Model):
    name = models.CharField(max_length=255)

class ITAttribute(models.Model):
    name = models.CharField(max_length=255)
    value = models.CharField(max_length=255)
    entity = models.ForeignKey(ITEntity, related_name="attrs", on_delete=models.CASCADE)
This is a very simple model and you can do the rest, like templates (i.e. switch template, router template, etc.), in your app code - it's much more straightforward than using a complicated model like EAV (I do like EAV, but this does not seem like the use case for it).
Adding history is also simple - just add a timestamp to ITAttribute. When changing an attribute, create a new one instead. Then, when fetching an attribute, pick the one with the latest timestamp. That way you can always get a point-in-time view of your environment.
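A sketch of that history idea, assuming a created = models.DateTimeField(auto_now_add=True) column is added to ITAttribute (the helper names below are just examples):

def set_attr(entity, name, value):
    # "Changing" an attribute appends a new row instead of updating in place.
    entity.attrs.create(name=name, value=value)

def current_attr(entity, name):
    # The latest value wins.
    return entity.attrs.filter(name=name).order_by("-created").first()

def attr_as_of(entity, name, when):
    # Point-in-time view: the latest value written on or before `when`.
    return (entity.attrs.filter(name=name, created__lte=when)
                        .order_by("-created")
                        .first())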
If you are more comfortable with something along the lines of the image you posted, below is a slightly modified version (sorry I can't upload an image, don't have enough rep).
      +-------------+
      | Device Type |
      |-------------|
  +-->| type        |
  |   +-------------+
  |
  |   +---------------+     +--------------------+     +-----------+
  +---| Device        |----<| DeviceAttributeMap |>----| Attribute |
  |   |---------------|     |--------------------|     |-----------|
  |   | name          |     | Device             |     | name      |
  |   | DeviceType    |     | Attribute          |     +-----------+
  |   | parent_device |     | value              |
  |   | Site          |     +--------------------+
  |   +---------------+
  |
  |   +-------------+
  |   | Site        |
  |   |-------------|
  +-->| location    |
      +-------------+
I added a linker table DeviceAttributeMap so you can have more control over an Attribute catalog, allowing queries for devices with the same Attribute but differing values. I also added a field in the Device model named parent_device, intended as a self-referential foreign key that captures a device's relationship to its parent device. You'll likely want to make this field optional; to make the foreign key parent_device optional in Django, set the field's null and blank attributes to True.
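For example, a sketch of that Device model (DeviceType and Site are assumed to be models defined elsewhere in the app):

from django.db import models

class Device(models.Model):
    name = models.CharField(max_length=255)
    device_type = models.ForeignKey("DeviceType", on_delete=models.PROTECT)
    site = models.ForeignKey("Site", on_delete=models.PROTECT)
    parent_device = models.ForeignKey(
        "self",                   # self-referential foreign key
        null=True, blank=True,    # null/blank make the field optional
        related_name="children",
        on_delete=models.SET_NULL,
    )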
You could try a document based NoSQL database, like MongoDB. Each document can represent a device with as many different fields as you like.
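A quick sketch with pymongo, assuming a local MongoDB and hypothetical database/collection names; note how the two device documents carry completely different fields:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
devices = client["it_inventory"]["devices"]         # hypothetical db/collection

devices.insert_one({
    "type": "SAN",
    "name": "san01",
    "arrays": [{"name": "array1", "raid_level": 5, "size_gb": 2048, "disks": 8}],
    "volumes": [{"name": "vol1", "size_gb": 512}],
})
devices.insert_one({
    "type": "switch",
    "name": "sw01",
    "ports": [{"speed": "1G", "count": 24}, {"speed": "10G", "count": 4}],
})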

Why does an insert query occasionally take so long to complete?

This is a pretty simple problem. Inserting data into the table normally works fine, except that occasionally the insert query takes a few seconds. (I am not trying to bulk insert data.) So I set up a simulation of the insert process to find out why the insert query occasionally takes more than 2 seconds to run. Joshua suggested that the index file may be being adjusted; I removed the id (primary key field), but the delay still happens.
I have a MyISAM table: daniel_test_insert (this table starts completely empty):
create table if not exists daniel_test_insert (
    id        int unsigned auto_increment not null,
    value_str varchar(255) not null default '',
    value_int int unsigned default 0 not null,
    primary key (id)
)
I insert data into it, and sometimes an insert query takes more than 2 seconds to run. There are no reads on this table - only writes, in serial, by a single-threaded program.
I ran the exact same query 100,000 times to find out why it occasionally takes a long time. So far, it appears to be a random occurrence.
This query for example took 4.194 seconds (a very long time for an insert):
Query: INSERT INTO daniel_test_insert SET value_int=12345, value_str='afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f' - ran for 4.194 seconds
status               | duration | cpu_user  | cpu_system | context_voluntary | context_involuntary | page_faults_minor
starting             | 0.000042 |  0.000000 |   0.000000 |                 0 |                   0 |                 0
checking permissions | 0.000024 |  0.000000 |   0.000000 |                 0 |                   0 |                 0
Opening tables       | 0.000024 |  0.001000 |   0.000000 |                 0 |                   0 |                 0
System lock          | 0.000022 |  0.000000 |   0.000000 |                 0 |                   0 |                 0
Table lock           | 0.000020 |  0.000000 |   0.000000 |                 0 |                   0 |                 0
init                 | 0.000029 |  0.000000 |   0.000000 |                 1 |                   0 |                 0
update               | 4.067331 | 12.151152 |   5.298194 |            204894 |               18806 |            477995
end                  | 0.000094 |  0.000000 |   0.000000 |                 8 |                   0 |                 0
query end            | 0.000033 |  0.000000 |   0.000000 |                 1 |                   0 |                 0
freeing items        | 0.000030 |  0.000000 |   0.000000 |                 1 |                   0 |                 0
closing tables       | 0.125736 |  0.278958 |   0.072989 |              4294 |                 604 |              2301
logging slow query   | 0.000099 |  0.000000 |   0.000000 |                 1 |                   0 |                 0
logging slow query   | 0.000102 |  0.000000 |   0.000000 |                 7 |                   0 |                 0
cleaning up          | 0.000035 |  0.000000 |   0.000000 |                 7 |                   0 |                 0
(This is an abbreviated version of the SHOW PROFILE command, I threw out the columns that were all zero.)
Now the update step shows an incredible number of context switches and minor page faults. Opened_Tables increases by about 1 per 10 seconds on this database, so it is not running out of table_cache space.
Stats:
MySQL 5.0.89
Hardware: 32 Gigs of RAM / 8 cores @ 2.66GHz; RAID 10 SCSI hard disks (SCSI II???)
I have had the hard drives and raid controller queried: No errors are being reported.
CPUs are about 50% idle.
iostat -x 5 (reports less than 10% utilization for harddisks)
top reports a load average of about 10 for 1 minute (normal for our db machine)
Swap space has 156k used (32 gigs of ram)
I'm at a loss to find out what is causing this performance lag. This does NOT happen on our low-load slaves, only on our high load master. This also happens with memory and innodb tables. Does anyone have any suggestions? (This is a production system, so nothing exotic!)
I have noticed the same phenomenon on my systems. Queries which normally take a millisecond will suddenly take 1-2 seconds. All of my cases are simple, single-table INSERT/UPDATE/REPLACE statements, never SELECTs. No load, locking, or thread build-up is evident.
I had suspected that it's due to clearing out dirty pages, flushing changes to disk, or some hidden mutex, but I have yet to narrow it down.
Also Ruled Out
Server load -- no correlation with high load
Engine -- happens with InnoDB/MyISAM/Memory
MySQL Query Cache -- happens whether it's on or off
Log rotations -- no correlation in events
The only other observation I have at this point is derived from the fact I'm running the same db on multiple machines. I have a heavy read application so I'm using an environment with replication -- most of the load is on the slaves. I've noticed that even though there is minimal load on the master, the phenomenon occurs more there. Even though I see no locking issues, maybe it's Innodb/Mysql having trouble with (thread) concurrency? Recall that the updates on the slave will be single threaded.
MySQL Version 5.1.48
Update
I think I have a lead on the problem in my case. I noticed this phenomenon on some of my servers more than on others. Looking at what differed between the servers, and tweaking things around, I was led to the MySQL InnoDB system variable innodb_flush_log_at_trx_commit.
I found the doc a bit awkward to read, but innodb_flush_log_at_trx_commit can take the values 1, 2, or 0:
For 1, the log buffer is flushed to the log file at every commit, and the log file is flushed to disk at every commit.
For 2, the log buffer is flushed to the log file at every commit, and the log file is flushed to disk approximately every 1-2 seconds.
For 0, the log buffer is flushed to the log file every second, and the log file is flushed to disk every second.
Effectively, in the order (1, 2, 0), as reported and documented, you are supposed to get increasing performance in exchange for increased risk.
Having said that, I found that the servers with innodb_flush_log_at_trx_commit=0 were performing worse (i.e. having 10-100 times more "long updates") than the servers with innodb_flush_log_at_trx_commit=2. Moreover, things immediately improved on the bad instances when I switched it to 2 (note you can change it on the fly).
So, my question is, what is yours set to? Note that I'm not blaming this parameter, but rather highlighting that its context is related to this issue.
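For reference, a sketch of checking and changing the setting on the fly with the PyMySQL driver (connection details are placeholders; SET GLOBAL requires the SUPER privilege and does not persist across a server restart):

import pymysql

conn = pymysql.connect(host="127.0.0.1", user="root", password="...", autocommit=True)
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit'")
    print(cur.fetchone())    # e.g. ('innodb_flush_log_at_trx_commit', '0')
    # Change it on the fly:
    cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 2")
conn.close()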
I had this problem using InnoDB tables (and InnoDB indexes are even slower to rewrite than MyISAM's).
I suppose you are doing multiple other queries on some other tables, so the problem would be that MySQL has to handle disk writes in files that get larger and needs to allocate additional space to those files.
If you use MYISAM tables I strongly suggest using
LOAD DATA INFILE 'file-on-disk' INTO TABLE `tablename`
command; MyISAM is sensationally fast with this (even with primary keys), the file can be formatted as CSV, and you can specify the column names (or put NULL as the value for the auto-increment field).
See the MySQL documentation for LOAD DATA INFILE.
The first tip I would give you is to disable the autocommit functionality and then commit manually:
LOCK TABLES a WRITE;
... DO INSERTS HERE
UNLOCK TABLES;
This benefits performance because the index buffer is flushed to disk only once, after all INSERT statements have completed. Normally, there would be as many index buffer flushes as there are INSERT statements.
But probably the best you can do, if it is possible in your application, is a bulk insert with one single statement.
This is done via vector binding and it's the fastest way you can go.
Instead of:
INSERT INTO tableName VALUES ()
do:
INSERT INTO tableName VALUES (),(),(),() ... (n)
But consider this option only if parameter vector binding is possible with the MySQL driver you're using.
Otherwise I would tend toward the first possibility and LOCK the table for every 1,000 inserts. Don't lock it for 100k inserts, because you'll get a buffer overflow.
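A sketch of the multi-row/batched approach from Python with the PyMySQL driver (connection details are placeholders); PyMySQL, like MySQLdb, batches executemany() on a plain INSERT ... VALUES statement into multi-row INSERTs, which is the VALUES (),(),() form described above:

import pymysql

rows = [("afjdaldjsf aljsdfl ajsdfljadfjalsdj fajd as f", 12345)] * 1000

conn = pymysql.connect(host="127.0.0.1", user="app", password="...", database="test")
with conn.cursor() as cur:
    # Sent as multi-row INSERT ... VALUES (...),(...),... statements by the driver.
    cur.executemany(
        "INSERT INTO daniel_test_insert (value_str, value_int) VALUES (%s, %s)",
        rows,
    )
conn.commit()
conn.close()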
Can you create one more table with 400 (not null) columns and run your test again? If the number of slow inserts becomes higher, this could indicate MySQL is wasting time writing your records. (I don't know how it works, but it may be allocating more blocks, or moving something to avoid fragmentation... I really don't know.)
We upgraded to MySQL 5.1 and during this event the Query cache became an issue with a lot of "Freeing items?" thread states. We then removed the query cache.
Either the upgrade to MySQL 5.1 or removing the query cache resolved this issue.
FYI, to future readers.
-daniel
We hit exactly the same issue and reported here:
http://bugs.mysql.com/bug.php?id=62381
We are using 5.1.52 and don't have a solution yet. We may need to turn the query cache off to avoid this performance hit.
If you are doing multiple insertions in a loop, then please take a break after each iteration using PHP's sleep() function (it takes the time in seconds).
Read this on Myisam Performance:
http://adminlinux.blogspot.com/2010/05/mysql-allocating-memory-for-caches.html
Search for:
'The MyISAM key block size The key block size is important' (minus the single quotes); this could be what's going on. I think some of these types of issues were fixed in 5.1.
Can you check the stats on the disk subsystem? Is the I/O saturated? This sounds like internal DB work flushing stuff to disk/log.
To check if your disk is behaving badly, and if you're in Windows, you can create a batch cmd file that creates 10,000 files:
@echo OFF
FOR /L %%G IN (1, 1, 10000) DO TIME /T > out%%G.txt
Save it in a temp dir as, for example, test.cmd.
Enable command extensions by running CMD with the /E:ON parameter:
CMD.exe /E:ON
Then run the batch file and see whether the timestamps in the first and last out files differ by seconds or minutes.
On Unix/Linux you can write a similar shell script.
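For example, a rough cross-platform equivalent in Python (the file names mirror the batch file above; this is only a crude probe of disk write latency):

import time

start = time.time()
for i in range(1, 10001):
    # One tiny file per iteration, like "TIME /T > outN.txt" in the batch file.
    with open(f"out{i}.txt", "w") as f:
        f.write(time.strftime("%H:%M:%S") + "\n")
print(f"created 10000 files in {time.time() - start:.1f} seconds")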
By any chance is there an SSD drive in the server? Some SSD drives suffer from 'stutter', which could cause your symptom.
In any case, I would try to find out if the delay is occurring in MySQL or in the disk subsystem.
What OS is your server, and what file system is the MySQL data on?