For example, is this noticeably slower:
DELETE FROM [table] WHERE [REOID] IN ( 1, 1, 1, 2, 2, 1, 3, 5)
Than this:
DELETE FROM [table] WHERE [REOID] IN ( 1, 2, 3, 5)
SQL Server 2008 R2.
Thanks!
Most engines eliminate duplicates in a constant IN list at the parsing stage.
Such a query will parse marginally slower than one with a deduplicated list, but it will produce the same plan, and in most real-world scenarios you will hardly notice any difference.
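If you want to verify this on SQL Server, a quick sketch (using the hypothetical table and column from the question; in SSMS you can also just compare the two estimated plans):
SET SHOWPLAN_XML ON;
GO
DELETE FROM [table] WHERE [REOID] IN (1, 1, 1, 2, 2, 1, 3, 5);
DELETE FROM [table] WHERE [REOID] IN (1, 2, 3, 5);
GO
SET SHOWPLAN_XML OFF;
GO
-- With SHOWPLAN on, the statements are compiled but not executed; both
-- should return the same plan, with the IN list collapsed to (1, 2, 3, 5).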
I am trying to aggregate multiple petabytes (around 7 PB) of data from one BigQuery table into another BigQuery table.
I have (partition_key, clusterkey1, clusterkey2, col1, col2, val),
where partition_key is used for BigQuery partitioning and the cluster keys are used for clustering.
For example
(timestamp1, timestamp2, 0, 1, 2, 1)
(timestamp3, timestamp4, 0, 1, 2, 7)
(timestamp31, timestamp22, 2, 1, 2, 2)
(timestamp11, timestamp12, 2, 1, 2, 3)
should result in
(0, 1, 2, 8)
(2, 1, 2, 5)
I want to aggregate val based on (clusterkey2, col1, col2), across all values of partition_key and clusterkey1.
What is a feasible way to do this?
Should I write a custom loader and read all the data from it line by line, or is there a native way to do this?
Depending on where / how you are executing this, you can do it by writing a simple SQL script and defining the target output, for example:
SELECT clusterkey2
, col1
, col2
, sum(val) as val
from table
group by clusterkey2, col1, col2
This will get you the desired results.
From here you can do a few things, but they are mostly all outlined here in the documentation:
https://cloud.google.com/bigquery/docs/writing-results#writing_query_results
Specifically from the above you are looking to set the destination table.
One thing to note: you may want to include the partition key in the WHERE clause to help narrow down your data if you do not want aggregate results for the whole table.
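If you are running it as a script, one native option is to materialize the result with DDL instead; a minimal sketch (the dataset and table names are placeholders):
CREATE OR REPLACE TABLE mydataset.aggregated_result AS
SELECT clusterkey2
, col1
, col2
, sum(val) as val
from mydataset.source_table
group by clusterkey2, col1, col2;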
There is a data set as shown below.
When the input for event_type is 4, 1, 2, 3, for example, I would like to get 3, 999, 3, 9 from cnt_stamp, in this order. I created the SQL below, but it seems to always return 999, 3, 9, 3 regardless of the order of the input.
How can I fix the SQL to achieve this? Thank you for taking your time, and please let me know if you have any questions.
SELECT `cnt_stamp` FROM `stm_events` WHERE `event_type` in (4,1,2,3)
Add ORDER BY FIELD(event_type, 4, 1, 2, 3) to your query. It should look like:
SELECT cnt_stamp FROM stm_events WHERE event_type in (4,1,2,3) ORDER BY FIELD(event_type, 4, 1, 2, 3);
It cannot work by default because the data is sorted in ascending order. If you want the results in your custom order, it is better to create an extra column to index the ordering.
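If FIELD() is not available (it is MySQL-specific), a portable sketch of the same idea using CASE:
SELECT cnt_stamp FROM stm_events
WHERE event_type IN (4, 1, 2, 3)
-- map each event_type to its position in the desired input order
ORDER BY CASE event_type WHEN 4 THEN 1 WHEN 1 THEN 2 WHEN 2 THEN 3 WHEN 3 THEN 4 END;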
I have some data ordered like so:
date, uid, grouping
2018-01-01, 1, a
2018-01-02, 1, a
2018-01-03, 1, b
2018-02-01, 2, x
2018-02-05, 2, x
2018-02-01, 3, z
2018-03-01, 3, y
2018-03-02, 3, z
And I wanted a final form like:
uid, path
1, "a-a-b"
2, "x-x"
3, "z-y-z"
but running something like
select
a.uid
,concat(grouping) over (partition by date, uid) as path
from temp1 a
Doesn't seem to want to play well with SQL or Google BigQuery (which is the specific environment I'm working in). Is there an easy enough way to get the groupings concatenated that I'm missing? I imagine there's a way to brute force it by including a bunch of if-then statements as custom columns, then concatenating the result, but I'm sure that will be a lot messier. Any thoughts?
You are looking for string_agg() (note that grouping is a reserved keyword in BigQuery, hence the backticks):
select a.uid, string_agg(`grouping`, '-' order by date) as path
from temp1 a
group by a.uid;
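With the sample data above this returns 1 as "a-a-b", 2 as "x-x", and 3 as "z-y-z"; the order by date inside string_agg() keeps each path in date order.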
I want to store sets in a such a way that I can query for sets that are a superset of, subset of, or intersect with another set.
For example, if my database has the sets { 1, 2, 3 }, { 2, 3, 5 }, { 5, 10, 12} and I query it for:
Sets which are supersets of { 2, 3 } it should give me { 1, 2, 3 }, { 2, 3, 5 }
Sets which are subsets of { 1, 2, 3, 4 } it should give me { 1, 2, 3 }
Sets which intersect with { 1, 10, 20 } it should give me { 1, 2, 3 }, { 5, 10, 12}
Since some sets are unknown in advance (your comment suggests they come from the client as a search criteria), you cannot "precook" the set relationships into the database. Even if you could, that would represent a redundancy and therefore an opportunity for inconsistencies.
Instead, I'd do something like this:
CREATE TABLE "SET" (
ELEMENT INT, -- Or whatever the element type is.
SET_ID INT,
PRIMARY KEY (ELEMENT, SET_ID)
)
Additional suggestions:
Note how the ELEMENT field is at the primary key's leading edge. This serves the queries below better than PRIMARY KEY (SET_ID, ELEMENT) would. You can still add the latter if desired, but if you don't, then you should also...
Cluster the table (if your DBMS supports it), which means that the whole table is just a single B-Tree (and no table heap). That way, you maximize the performance of the queries below, minimize storage requirements, and improve cache effectiveness.
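To make the queries below concrete, here is hypothetical sample data for the three sets from the question (multi-row VALUES syntax varies slightly by DBMS):
INSERT INTO "SET" (ELEMENT, SET_ID) VALUES
(1, 1), (2, 1), (3, 1), -- set 1 = {1, 2, 3}
(2, 2), (3, 2), (5, 2), -- set 2 = {2, 3, 5}
(5, 3), (10, 3), (12, 3); -- set 3 = {5, 10, 12}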
You can then find IDs of sets that are equal to or supersets of (for example) set {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID
HAVING COUNT(*) = 2; -- 2 = the number of elements in the searched-for set {2, 3}
And sets that intersect {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE ELEMENT IN (2, 3)
GROUP BY SET_ID;
And sets that are equal to or are subsets of {2, 3} like this:
SELECT SET_ID
FROM "SET"
WHERE SET_ID NOT IN (
SELECT SET_ID
FROM "SET" S2
WHERE S2.ELEMENT NOT IN (2, 3)
)
GROUP BY SET_ID;
"Efficient" can mean a lot of things, but the normalized way would be to have an Items table with all the possible elements and a Sets table with all the sets, and an ItemsSets lookup table. If you have sets A and B in your Sets table, queries like (doing this for clarity rather than optimization... also "Set" is a bad name for a table or field, given it is a keyword)
SELECT itemname FROM Items i
WHERE i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'A')
AND i.itemname IN
(SELECT itemname FROM ItemsSets isets WHERE isets.setname = 'B')
That, for instance, is the intersection of A and B (you can almost certainly speed this up as a JOIN; see the sketch below; again, "efficient" can mean a lot of things, and you'll want an architecture that allows a query like that). Similar queries can be written to compute the difference or the complement, to test for equality, and so on.
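For instance, the same intersection written as a join (same hypothetical tables as above):
SELECT i.itemname
FROM Items i
JOIN ItemsSets a ON a.itemname = i.itemname AND a.setname = 'A'
JOIN ItemsSets b ON b.itemname = i.itemname AND b.setname = 'B';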
Now, I know you asked about efficiency, and this is a horribly slow way to query, but it is the only reliably scalable architecture for the tables to do this, and the query was just an easy one to show how the tables are built. You can do all sorts of crazy things to, say, cache intersections, or store multiple items of a set in one field and process that, or what have you. But don't. Cached info will eventually get stale; static limits on how many items fit in a field will be exceeded; ad-hoc members of new tuples will be misinterpreted.
Again, "efficient" can mean a lot of different things, but ultimately an information architecture you as a programmer can understand and reason about is going to be the most efficient.
Suppose I have a table with a column that takes values from 1 to 10. I need to select rows with all values except 9 and 10. Will there be a difference (performance-wise) when I use this query:
SELECT * FROM tbl WHERE col NOT IN (9, 10)
and this one?
SELECT * FROM tbl WHERE col IN (1, 2, 3, 4, 5, 6, 7, 8)
Use "IN" as it will most likely make the DBMS use an index on the corresponding column.
"NOT IN" could in theory also be translated into an index usage, but in a more complicated way which DBMS might not "spend overhead time" using.
When it comes to performance you should always profile your code (i.e. run your queries few thousand times and measure each loops performance using some kind of stopwatch. Sample).
But here I highly recommend using the first query, for better future maintainability. The logic is that you need all records except 9 and 10. If you add the value 11 to your table and use the second query, the logic of your application will break, which will of course lead to a bug.
Edit: I remember this question being tagged as php, which is why I provided a sample in php, but I might be mistaken. I guess it won't be hard to rewrite that sample in the language you're using.
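In that spirit, a minimal server-side stopwatch, assuming MySQL 5.6+ and the table from the question:
-- NOW(6) has microsecond precision; repeat each variant many times for a fair test
SET @t0 = NOW(6);
SELECT * FROM tbl WHERE col NOT IN (9, 10);
SELECT TIMESTAMPDIFF(MICROSECOND, @t0, NOW(6)) AS elapsed_us;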
I have seen Oracle have trouble optimizing some queries with NOT IN if columns are nullable. If you can write your query either way, IN is preferred as far as I'm concerned.
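A minimal illustration of why NULLs complicate NOT IN (standard three-valued logic, hypothetical values):
-- If the list (or a subquery result) contains a NULL, NOT IN matches nothing:
-- col <> NULL evaluates to UNKNOWN, so no row ever passes the filter.
SELECT * FROM tbl WHERE col NOT IN (9, 10, NULL); -- returns zero rows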
For a list of constants, MySQL will internally expand your code to:
SELECT * FROM tbl WHERE ((col <> 9 and col <> 10))
The same happens for the other one, with = eight times instead.
So yes, the first one will be faster: fewer comparisons to do. The chance that this is measurable is negligible, though; the overhead of a handful of constant comparisons is nothing compared to the general overhead of parsing SQL and retrieving data.
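If you want to see the rewrite yourself, a sketch assuming MySQL (EXPLAIN EXTENDED is the pre-5.7 spelling; later versions include the same information with plain EXPLAIN):
EXPLAIN EXTENDED SELECT * FROM tbl WHERE col NOT IN (9, 10);
SHOW WARNINGS; -- the Note row shows the expanded query text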
"IN" statement works internally like a serie of "OR" statements.
For example:
SELECT * FROM tbl WHERE col IN (1, 2, 3)
It is equivalent to
SELECT * FROM tbl WHERE col = 1 OR col = 2 OR col = 3
"OR" statements could cause some performance issues as explained in this article:
https://bertwagner.com/2018/02/20/or-vs-union-all-is-one-better-for-performance/
When you use a NOT IN statement, it is all the same, but the result is logically negated. BUT, you could write an equivalent query that performs much better. In your example:
SELECT * FROM tbl WHERE col NOT IN (9, 10)
It is equivalent to
SELECT * FROM tbl WHERE col <> 9 AND col <> 10
With an "AND" statement, the database stop analizing when one of all conditionals its false, so, its much better in performance than "OR" used in "IN" statement.