I have an application that receives messages from a database via the write-ahead logs, and every row looks something like this:
| id | prospect_id | school_id | something | something else |
|----|-------------|------------|-----------|----------------|
| 1 | 5 | 10 | who | cares |
| 2 | 5 | 11 | what | this |
| 3 | 6 | 10 | is | blah |
Eventually, I will need to query the database for the mapping between prospect_id and school name. The query results are in the tens of thousands. The schools table has a name column that I can query in a simple join. However, I want to store this information somewhere on my server that would be easily accessible by my application. I need this information:
- stored locally so I can access it quickly
- capable of being updated once a day, asynchronously
- independent of the application, so that when it's deployed or restarted it doesn't need to be queried again.
What can be done? What are some options?
EDIT
Is pickle a good idea? https://datascience.blog.wzb.eu/2016/08/12/a-tip-for-the-impatient-simple-caching-with-python-pickle-and-decorators/
What are the limitations of pickle? The results of the SQL query might be in the tens of thousands of rows.
The drawback of using pickle is that it is a Python-specific format. If you intend for other programming languages to read this file, the tooling might not exist to read it and you would be better off storing it in something like a JSON or XML file. If you will only be reading it with Python, then pickle is fine.
Here are a few options you have:
Load the data from SQL into a global value when the application starts up (the SQL data can be stored locally; it doesn't have to be on an external system).
Use pickle to serialize/deserialize the data to and from a file when needed (see the sketch below).
Load the data into Redis, an in-memory caching system.
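If you go the pickle route, here is a minimal sketch of the idea, assuming a hypothetical load_mapping_from_db() helper that runs the join against the schools table. The file survives application restarts, and a daily cron job (or any scheduler) can simply call the refresh function:

import os
import pickle

CACHE_PATH = "/var/cache/myapp/prospect_school_map.pkl"  # hypothetical location

def refresh_cache():
    """Run the SQL join and overwrite the local cache file (call this once a day)."""
    mapping = load_mapping_from_db()  # hypothetical helper: returns {prospect_id: school_name}
    tmp_path = CACHE_PATH + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(mapping, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, CACHE_PATH)  # atomic swap so readers never see a half-written file

def load_cache():
    """Load the mapping at application startup; tens of thousands of entries is tiny for pickle."""
    with open(CACHE_PATH, "rb") as f:
        return pickle.load(f)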
The DNS standard allows for specifying more than one question per query (meaning inside a single DNS packet). I'm writing a Snort plugin for DNS analysis and I need to test whether it behaves properly when there's a DNS query containing multiple questions.
DNS packet structure looks like this:
 0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                      ID                       |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|QR|   Opcode  |AA|TC|RD|RA|   Z    |   RCODE   |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    QDCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ANCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    NSCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|                    ARCOUNT                    |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|           <ACTUAL QUESTIONS GO HERE>          |
|                                               |
|                      ...                      |
|                                               |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
So if QDCOUNT is greater than 1, there can be multiple DNS questions in a single query.
How can I perform such a query using Linux tools? dig domain1.example domain2.example creates just two separate queries with one question each. host and nslookup seem to allow querying only one name at a time.
See this question for the full details: Requesting A and AAAA records in single DNS query
In short: no, in practice nobody today puts multiple questions in a single query. The behaviour was never clearly defined and raises a lot of questions (for example, there is only a single return code, so what do you do for two questions if one fails and the other does not?).
It would have been useful for requesting A and AAAA records at the same time (instead of the deprecated ANY), but it basically does not exist today.
You can retrieve all the records from a zone using a single AXFR request, and then parse out the ones you want.
dig @127.0.0.1 domain.com. AXFR
or
nslookup -query=AXFR domain.com 127.0.0.1
Typically AXFR requests are refused except for secondary (slave) servers, so you will need to whitelist the IPs that are allowed to make this request. (In BIND this is done with the allow-transfer option.)
This won't work for the OP's use case of making a Snort plugin that checks QDCOUNT, but it does kind of solve the problem of sending multiple questions in a single DNS request.
source: serverfault: How to request/acquire all records from a DNS?
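If you need to generate such a query yourself to exercise the Snort plugin (rather than relying on stock Linux tools), here is a rough Python sketch that hand-builds a packet with QDCOUNT=2 and sends it over UDP. The two names and the 127.0.0.1 resolver are placeholders; point it at whatever server your sensor is watching:

import socket
import struct

def encode_qname(name):
    """Encode a domain name as DNS labels: length-prefixed parts, null-terminated."""
    parts = name.rstrip(".").split(".")
    return b"".join(struct.pack("B", len(p)) + p.encode("ascii") for p in parts) + b"\x00"

def build_query(names, qtype=1, qclass=1):
    """Build a raw DNS query whose QDCOUNT equals len(names) (QTYPE 1 = A, QCLASS 1 = IN)."""
    header = struct.pack(
        ">HHHHHH",
        0x1234,      # ID (arbitrary)
        0x0100,      # flags: standard query, recursion desired
        len(names),  # QDCOUNT > 1 when several names are given
        0, 0, 0,     # ANCOUNT, NSCOUNT, ARCOUNT
    )
    questions = b"".join(encode_qname(n) + struct.pack(">HH", qtype, qclass) for n in names)
    return header + questions

# Placeholder names and resolver; replace with whatever the sensor is watching.
packet = build_query(["domain1.example", "domain2.example"])
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(2)
sock.sendto(packet, ("127.0.0.1", 53))
try:
    response, _ = sock.recvfrom(4096)
    print("got %d bytes back" % len(response))
except socket.timeout:
    print("no response (many servers ignore or reject QDCOUNT > 1)")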
I have a saved query in BigQuery but it's too big to export as CSV. I don't have permission to export to a new table, so is there a way to run the query from the bq CLI and export from there?
From the CLI you can't directly access your saved queries, as they are a UI-only feature for now, but, as explained here, there is a feature request for that.
If you just want to run it once to get the results you can copy the query from the UI and just paste it when using bq.
Using the docs example query you can try the following with a public dataset:
QUERY="SELECT word, SUM(word_count) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
bq query "$QUERY" > results.csv
The output of cat results.csv should be:
+---------------+-------+
| word | count |
+---------------+-------+
| dispraisingly | 1 |
| praising | 8 |
| Praising | 4 |
| raising | 5 |
| dispraising | 2 |
| raisins | 1 |
+---------------+-------+
Just replace the QUERY variable with your saved query.
Also, take into account whether you are using standard or legacy SQL, and set the --use_legacy_sql flag accordingly.
Reference docs here.
Despite what you may have understood from the official documentation, you can get large query results from bq query, but there are multiple details you have to be aware of.
To start, here's an example. I got all of the rows of the public table usa_names.usa_1910_2013 from the public dataset bigquery-public-data by using the following commands:
total_rows=$(bq query --use_legacy_sql=false --format=csv "SELECT COUNT(*) AS total_rows FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" | xargs | awk '{print $2}');
bq query --use_legacy_sql=false --max_rows=$((total_rows + 1)) --format=csv "SELECT * FROM \`bigquery-public-data.usa_names.usa_1910_2013\`;" > output.csv
The result of this command was a CSV file with 5552454 lines, with the first two containing header information. The number of rows in this table is 5552452, so it checks out.
Here's where the caveats come into play:
Regardless of what the documentation might seem to say when it comes to query download limits specifically, those limits seem to only apply to the Web UI, meaning bq is exempt from them;
At first, I was using Cloud Shell to run the bq command, but the number of rows was so big that streaming the result set into it killed the Cloud Shell instance! I had to use a Compute Engine instance with at least the resources of an n1-standard-4 (4 vCPUs, 16 GiB RAM), and even with all of that RAM the query took 10 minutes to finish (note that the query itself runs server-side; it's just a problem of buffering the results);
I'm manually copy-pasting the query itself, as there doesn't seem to be a way to reference saved queries directly from bq;
You don't have to use Standard SQL, but you have to specify max_rows, because otherwise it'll only return you 100 rows (100 is the current default value of this argument);
You'll still be facing the usual quotas & limits associated with BigQuery, so you might want to run this as a batch job or not, it's up to you. Also, don't forget that the maximum response size for a query is 128 MiB, so you might need to break the query into multiple bq query commands in order to not hit this size limit. If you want a public table that's big enough to hit this limitation during queries, try the samples.wikipedia one from bigquery-public-data dataset.
I think that's about it! Just make sure you're running these commands on a beefy machine and after a few tries it should give you the result you want!
P.S.: There's currently a feature request to increase the size of CSVs you can download from the Web UI. You can find it here.
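If buffering the whole result set through bq is the problem, another option (not covered above) is to page through the rows with the BigQuery Python client library and write the CSV yourself. A rough sketch, assuming the google-cloud-bigquery package is installed and credentials are already configured; the public table stands in for your saved query:

import csv
from google.cloud import bigquery

QUERY = "SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_2013`"  # paste your saved query here

client = bigquery.Client()
rows = client.query(QUERY).result(page_size=10000)  # the iterator fetches pages lazily

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([field.name for field in rows.schema])  # header row
    for row in rows:
        writer.writerow(list(row.values()))

Because the iterator pulls one page at a time, the client never has to hold the full result set in memory at once.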
I am completely desperate about a performance difference and I have absolutely no clue WHY it exists.
Overview
VMware Workstation v11 on my local computer. I gave the VM just 2 cores and 4 GB of memory.
Hyper-V Server 2012 R2 with two 6-core Xeons (older ones) and 64 GB of memory. Only this VM is running, with all of the hardware assigned to it.
According to a CPU benchmark I ran in each VM, the VM on Hyper-V should be about 5x faster than my local one.
I stripped my code down to just this one operation, which I put in a WHILE loop to simulate parallel queries - normally this is done by a webserver.
Code
DECLARE @cnt INT = 1;

WHILE @cnt <= 1000
BEGIN
    BEGIN TRANSACTION Trans
        UPDATE [Test].[dbo].[NumberTable]
        SET Number = Number + 1
        OUTPUT deleted.*
    COMMIT TRANSACTION Trans
    SET @cnt = @cnt + 1;
END
When I execute this in SSMS it needs:
VMware Workstation: 43s
Hyper-V Server: 59s
...which is about 1.4x slower, although the system should be at least 4x faster.
Some facts
the DB is the same - backed up and restored
the table has just 1 row and 13 fields
the table has 3 indexes, none of them is "Number"
logged in user is 'SA'
OS is identical
SQL Server version is identical (same iso)
installed SQL Server features are the same
to be sure Hyper-V is not the bottleneck, I also installed VMware ESXi v6 on another server with even less power - the result is nearly identical to the Hyper-V machine
settings in SSMS should be identical - checked it twice
execution plan is identical on each machine - just execution time is different
the more loops I choose, the bigger the relative time difference becomes
ADDED: when I comment out the OUTPUT line to suppress printing the row (and each value), my VMware Workstation does it in under 1 s while Hyper-V needs 5 s. When I increase the loop count to 2000, my VMware Workstation again stays under 1 s, while the Hyper-V version needs 10 s!
When running the full code from a local webserver the difference is about 0.8 s versus about 9 s! ...no, I have not forgotten the '0.'!!
Can you give me a hint what the hell is going on, or what else I could check?
EDIT
I tested the code above without the OUTPUT line and with 10,000 passes. The client statistics on both systems look identical, except for the time statistics:
VMware Workstation:
+-------------------------------+------+--+------+--+-----------+
| Time statistics | (1) | | (2) | | (3) |
+-------------------------------+------+--+------+--+-----------+
| Client processing time | 2328 | | 1084 | | 1706.0000 |
| Total execution time | 2343 | | 1098 | | 1720.5000 |
| Wait time on server replies | 15 | | 14 | | 14.5000 |
+-------------------------------+------+--+------+--+-----------+
Hyper-V:
+-------------------------------+-------+--+------+--+------------+
| Time statistics | (1) | | (2) | | (3) |
+-------------------------------+-------+--+------+--+------------+
| Client processing time | 55500 | | 1250 | | 28375.0000 |
| Total execution time | 55718 | | 1328 | | 28523.0000 |
| Wait time on server replies | 218 | | 78 | | 148.0000 |
+-------------------------------+-------+--+------+--+------------+
(1) : 10,000 passes without OUTPUT
(2) : 1,000 passes with OUTPUT
(3) : mean
EDIT (for HLGEM)
I compared both execution plans and indeed there are two differences:
fast system:
<QueryPlan DegreeOfParallelism="1" CachedPlanSize="24" CompileTime="0" CompileCPU="0" CompileMemory="176">
<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="104842" EstimatedPagesCached="26210" EstimatedAvailableDegreeOfParallelism="2" />
slow system:
<QueryPlan DegreeOfParallelism="1" CachedPlanSize="24" CompileTime="1" CompileCPU="1" CompileMemory="176">
<OptimizerHardwareDependentProperties EstimatedAvailableMemoryGrant="524272" EstimatedPagesCached="655341" EstimatedAvailableDegreeOfParallelism="10" />
Did you check the hardware fully?
It looks as if the OUTPUT operator spends some time returning the data to you.
https://msdn.microsoft.com/en-us/library/ms177564%28v=sql.120%29.aspx
Time differences depend on many things. A local server may be faster because you are not sending data through a full network pipeline. Other work happening concurrently on each server may affect speed.
Typically in dev there is little or no other workload, so things can be faster than on prod, where there are thousands of users trying to do things at the same time. This is why load testing is important if you have a large system.
You don't mention indexing but that too can be different on different servers (even when it is supposed to be the same!). So at least check that.
Look at the execution plans and see if you can find the difference. Outdated statistics can also result in a less-than-optimal execution plan.
Does one of the servers run applications other than the database? That could be limiting the amount of memory the server has available for the database to use.
Honestly, this is a huge topic and there are many, many things you should be checking. If you are doing this kind of analysis, I would suggest you buy a performance tuning book and read through it to figure out what things can affect this. This is not something that can easily be answered by a question on the Internet; you need to get some in-depth knowledge.
Query speed has little to do with CPU/memory speed, especially queries that update data.
Query speed is mainly limited by disk I/O speed, which is at least 1000 times slower than CPU/RAM speed. Making queries faster is ultimately about avoiding unnecessary disk I/O, but your query must read and write every row.
The VM box (probably) uses a virtual drive that is mapped to a file on disk and there is probably some effort required to keep the two aligned, possibly even asynchronously, while other processes are running and contending with the drive.
Maybe your workstation has less contention or a simpler virtual file system etc.
I'm looking at writing a Django app to help document fairly small IT environments. I'm getting stuck at how best to model the data as the number of attributes per device can vary, even between devices of the same type. For example, a SAN will have 1 or more arrays, and 1 or more volumes. The arrays will then have an attribute of Name, RAID Level, Size, Number of disks, and the volumes will have attributes of Size and Name. Different SANs will have a different number of arrays and volumes.
Same goes for servers, each server could have a different number of disks/partitions, all of which will have attributes of Size, Used space, etc, and this will vary between servers.
Another device type may be a switch, which won't have arrays or volumes, but will have a number of network ports, some of which may be gigabit, others 10/100, others 10Gigabit, etc.
Further, I would like the ability to add device types in the future without changing the model. A new device type may be a phone system, which will have its own unique attributes which may vary between different phone systems.
I've looked into EAV database designs but it seems to get very complicated very quickly, and I'm unclear on whether it's the best way to go about this. I was thinking something along the lines of the model as shown in the picture.
http://i.stack.imgur.com/ZMnNl.jpg
A bonus would be the ability to create 'snapshots' of environments at a particular time, making it possible to view changes to the environment over time. Adding a date column to the attributes table may be a way to solve this.
For the record, this app won't need to scale very much (at most 1000 devices), so massive scalability isn't a big concern.
Since your attributes are per model instance and differ for each instance, I would suggest going with a completely free schema:
from django.db import models

class ITEntity(models.Model):
    name = models.CharField(max_length=255)

class ITAttribute(models.Model):
    name = models.CharField(max_length=255)
    value = models.CharField(max_length=255)
    entity = models.ForeignKey(ITEntity, related_name="attrs", on_delete=models.CASCADE)
This is a very simple model and you can do the rest, like templates (i.e. switch template, router template, etc.), in your app code - it's much more straightforward than using a complicated model like EAV (I do like EAV, but this does not seem to be the use case for it).
Adding history is also simple - just add a timestamp to ITAttribute. When changing an attribute, create a new one instead. When fetching an attribute, pick the one with the latest timestamp. That way you can always have a point-in-time view of your environment.
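A rough sketch of that history idea, as a variant of the model above with a created timestamp added (the field and helper names are illustrative, not from the original answer):

from django.db import models
from django.utils import timezone

class ITAttribute(models.Model):
    name = models.CharField(max_length=255)
    value = models.CharField(max_length=255)
    created = models.DateTimeField(default=timezone.now)  # never update rows, only add new ones
    entity = models.ForeignKey("ITEntity", related_name="attrs", on_delete=models.CASCADE)

def attribute_as_of(entity, attr_name, when):
    """Return the value that was current for attr_name at the given point in time."""
    latest = (
        entity.attrs.filter(name=attr_name, created__lte=when)
        .order_by("-created")
        .first()
    )
    return latest.value if latest else None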
If you are more comfortable with something along the lines of the image you posted, below is a slightly modified version (sorry I can't upload an image, don't have enough rep).
+-------------+
| Device Type |
|-------------|
| type        |-------+
+-------------+       |
                      ^
+---------------+     +--------------------+     +-----------+
| Device        |----<| DeviceAttributeMap |>----| Attribute |
|---------------|     |--------------------|     |-----------|
| name          |     | Device             |     | name      |
| DeviceType    |     | Attribute          |     +-----------+
| parent_device |     | value              |
| Site          |     +--------------------+
+---------------+
                      v
+-------------+       |
| Site        |       |
|-------------|       |
| location    |-------+
+-------------+
I added a linker table DeviceAttributeMap so you have more control over an Attribute catalog, allowing queries for devices with the same Attribute but differing values. I also added a field in the Device model named parent_device, intended as a self-referential foreign key to capture a device's relationship to its parent device. You'll likely want to make this field optional. To make the foreign key parent_device optional in Django, set the field's null and blank attributes to True.
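For reference, here is a rough sketch of how the diagram could map to Django models. The table and field names follow the diagram; the max_length values and on_delete choices are my own assumptions:

from django.db import models

class DeviceType(models.Model):
    type = models.CharField(max_length=100)

class Site(models.Model):
    location = models.CharField(max_length=255)

class Device(models.Model):
    name = models.CharField(max_length=255)
    device_type = models.ForeignKey(DeviceType, on_delete=models.CASCADE)
    # Self-referential and optional (null/blank), as discussed above.
    parent_device = models.ForeignKey("self", null=True, blank=True, on_delete=models.SET_NULL)
    site = models.ForeignKey(Site, on_delete=models.CASCADE)

class Attribute(models.Model):
    name = models.CharField(max_length=100)

class DeviceAttributeMap(models.Model):
    device = models.ForeignKey(Device, on_delete=models.CASCADE)
    attribute = models.ForeignKey(Attribute, on_delete=models.CASCADE)
    value = models.CharField(max_length=255)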
You could try a document-based NoSQL database, like MongoDB. Each document can represent a device, with as many different fields as you like.
I'm wondering what is actually stored in a CouchDB database B-tree. CouchDB: The Definitive Guide says that the database B-tree is used for append-only operations and that a database is stored in a single B-tree (besides the per-view B-trees).
So I guess the data items that are appended to the database file are revisions of documents, not the whole documents:
             +---------|###| ...
             |         |
   +-------|###|-------+       ...  ---+
   |         |         |               |
+------+  +------+  +------+        +------+
| doc1 |  | doc2 |  | doc1 |  ...   | doc1 |
| rev1 |  | rev1 |  | rev2 |        | rev7 |
+------+  +------+  +------+        +------+
Is that true?
If it is, then how is the current revision of a document determined based on such a B-tree?
Doesn't it mean that CouchDB needs a separate "view" database for indexing the current revisions of documents to preserve O(log n) access? Wouldn't that lead to race conditions while building such an index? (As far as I know, CouchDB uses no write locks.)
The database file on disk is append-only; however, the B-tree is conceptually modified in place. When you update a document:
Its leaf node is written (via append to the DB file)
Its parent node is re-written to reference the new leaf (via append of course)
Repeat step 2 until you update the root node
When the root node is written, that is effectively when the newer revision is "committed." To find a document, you start at the end of the file, get the root node, and work down to your doc id. The latest revision will always be accessible this way.
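To make that concrete, here is a toy Python sketch of the idea (it is not CouchDB's actual on-disk format): every update appends the new document body and then a new "root" that points at the latest body of each document, and readers locate the newest root at the end of the log. Real CouchDB rewrites only the changed B-tree path rather than a full root table:

import json

class AppendOnlyStore:
    def __init__(self):
        self.log = []   # stands in for the append-only database file
        self.root = {}  # doc id -> index of its latest body in the log

    def update(self, doc_id, body):
        self.log.append(("doc", doc_id, body))                # 1. append the new leaf
        self.root = dict(self.root, **{doc_id: len(self.log) - 1})
        self.log.append(("root", json.dumps(self.root)))      # 2. append the new root: the "commit"

    def get(self, doc_id):
        # Scan backwards for the most recent root, then jump straight to the body.
        for entry in reversed(self.log):
            if entry[0] == "root":
                return self.log[json.loads(entry[1])[doc_id]][2]
        raise KeyError(doc_id)

store = AppendOnlyStore()
store.update("doc1", {"rev": 1})
store.update("doc1", {"rev": 2})
print(store.get("doc1"))  # {'rev': 2}; the rev-1 body still sits earlier in the log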
CouchDB does not store diffs. When you update a document, it appends the whole new document with a new _rev and the same _id as the old version. The old version is removed during compaction.