Suppose I have a table of relationships, like in a directed graph. For some pairs of ids both the 1->2 and 2->1 relations are present; for others only one direction is. Some nodes appear in only one column.
a b
1 2
2 1
1 3
4 1
5 2
Now I want to work with it as an undirected graph: grouping and filtering should consider both columns. For example, filter out node 5 and count the neighbors of the remaining nodes:
node neighbor_count
1 3
2 1
3 1
4 1
Is it possible to compose queries in such a way that first column a is used and then column b is used in the same manner?
I know it is achievable by doubling the table:
select a,count(distinct(b))
from
(select * from grap
union all
select b as a, a as b from grap)
where (not a in (5,6,7)) and (not b in (5,6,7))
group by a;
However, the real tables are quite large (10^9 - 10^10 pairs). Would the union require additional disk usage? A single scan through the base table is already quite slow for me. Are there better ways to do this?
(Currently database is sqlite, but the less platform specific the answer the better)
The union all result exists only for the duration of the query. Does it use more disk space? Not permanently.
If the processing of the query requires saving the data out to disk, then it will use more temporary storage for intermediate results.
I would suggest, though, that if you want an undirected graph with this representation, you add the additional pairs that are not already in the table. This will use more disk space, but you won't have to play games with queries.
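A minimal sketch of that approach in SQLite (via Python's sqlite3; the table name grap and the sample data follow the question). The NOT EXISTS check means the insert is safe to re-run and skips pairs already present in both directions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE grap (a INTEGER, b INTEGER)")
con.executemany("INSERT INTO grap VALUES (?, ?)",
                [(1, 2), (2, 1), (1, 3), (4, 1), (5, 2)])

# Add each missing reversed pair, so every edge appears in both directions.
con.execute("""
    INSERT INTO grap (a, b)
    SELECT g.b, g.a
    FROM grap g
    WHERE NOT EXISTS (SELECT 1 FROM grap g2
                      WHERE g2.a = g.b AND g2.b = g.a)
""")

# Now the undirected query needs no UNION ALL: column a alone sees every edge.
counts = con.execute("""
    SELECT a AS node, COUNT(DISTINCT b) AS neighbor_count
    FROM grap
    WHERE a NOT IN (5) AND b NOT IN (5)
    GROUP BY a
""").fetchall()
```

With the sample data this yields the neighbor counts from the question: node 1 has 3 neighbors, nodes 2, 3 and 4 have 1 each.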
For my application I have a table with these three columns: user, item, value
Here's some sample data:
user item value
---------------------
1 1 50
1 2 45
1 23 35
2 1 88
2 23 44
3 2 12
3 1 27
3 5 76
3 23 44
What I need to do is, for a given user, perform simple arithmetic against everyone else's values.
Let's say I want to compare user 1 against everyone else. The calculation looks something like this:
first_user second_user result
1 2 SUM(ABS(50-88) + ABS(35-44))
1 3 SUM(ABS(50-27) + ABS(45-12) + ABS(35-44))
This is currently the bottleneck in my program. For example, many of my queries are starting to take 500+ milliseconds, with this algorithm taking around 95% of the time.
I have many rows in my database and it is O(n^2) (it has to compare all of user 1's values against everyone else's matching values)
I believe I have only two options for how to make this more efficient. First, I could cache the results. But the resulting table would be huge because of the NxN space required, and the values need to be relatively fresh.
The second way is to make the algorithm much quicker. I searched for "postgres SIMD" because I think SIMD sounds like the perfect solution to optimize this. I found a couple related links like this and this, but I'm not sure if they apply here. Also, they seem to both be around 5 years old and relatively unmaintained.
Does Postgres have support for this sort of feature? Where you can "vectorize" a column or possibly import or enable some extension or feature to allow you to quickly perform these sorts of basic arithmetic operations against many rows?
I'm not sure where you get O(n^2) for this. You need to look up the rows for user 1 and then read the data for everyone else. Assuming there are few items and many users, this would be essentially O(n), where "n" is the number of rows in the table.
The query could be phrased as:
select t1.user, t.user, sum(abs(t.value - t1.value))
from t left join
t t1
on t1.item = t.item and
t1.user <> t.user and
t1.user = 1
group by t1.user, t.user;
For this query, you want an index on t(item, user, value).
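A quick way to sanity-check the query and the arithmetic against the sample data (SQLite via Python's sqlite3; note that user is quoted because it is a reserved word in Postgres, where this would ultimately run):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE t ("user" INTEGER, item INTEGER, value INTEGER)')
con.execute('CREATE INDEX t_item_user_value ON t (item, "user", value)')
con.executemany("INSERT INTO t VALUES (?, ?, ?)",
                [(1, 1, 50), (1, 2, 45), (1, 23, 35),
                 (2, 1, 88), (2, 23, 44),
                 (3, 2, 12), (3, 1, 27), (3, 5, 76), (3, 23, 44)])

rows = con.execute("""
    SELECT t1."user", t."user", SUM(ABS(t.value - t1.value))
    FROM t LEFT JOIN t t1
      ON t1.item = t.item AND t1."user" <> t."user" AND t1."user" = 1
    GROUP BY t1."user", t."user"
""").fetchall()

# Keep only matched pairs (rows for user 1 itself come back with a NULL first_user).
result = {(a, b): s for (a, b, s) in rows if a is not None}
```

This reproduces the worked example: ABS(50-88) + ABS(35-44) = 47 for user 2, and ABS(50-27) + ABS(45-12) + ABS(35-44) = 65 for user 3.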
I have read that normal indexes in (at least Oracle) databases are basically B-tree structures, and hence store records relative to appropriate root nodes: records 'less than' the root are stored in the left portion of the tree, while records 'greater than' the root are stored in the right portion. It is this storage approach that makes scans faster, since tree traversal reduces the depth and breadth to be searched.
However, when creating indexes or tuning the performance of a where clause, most guides advise prioritizing the columns compared for equality (IN or =) before moving to the columns with inequality conditions (NOT IN, <>). What is the reason for this advice? Should it not be as feasible to determine that a given value does not exist as it is to determine that it exists, using tree traversal?
Do indexes not work with negation?
The issue is locality within the index. If you have two columns, with letters in col1 and numbers in col2, then an index might look like:
Ind col1 col2
1 A 1
2 A 1
3 A 1
4 A 2
5 B 1
6 B 1
7 B 2
8 B 3
9 B 3
10 C 2
11 C 3
(ind is the position in the index. The record locator is left out.)
If you are looking for col1 = 'B', then you can seek to position 5 and scan the index through position 9. If you are looking for col1 <> 'B', then you have to scan positions 1-4, skip past the 'B' entries, and resume scanning at position 10 -- two separate ranges instead of one. This becomes worse with IN and NOT IN.
An additional factor is that if a relative handful of records satisfy the equality condition, then almost all records will fail it -- and indexes are often not useful when almost all records need to be read. One sometimes-exception to this is clustered indexes.
Oracle has better index optimizations than most databases -- it will do multiple scans starting in different locations. Even so, an inequality is often much less useful for an index.
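The effect is easy to observe in SQLite (a sketch; the exact EXPLAIN QUERY PLAN wording is version-dependent, so treat the strings as illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# payload keeps the index from covering SELECT *, so the planner's choice is visible
con.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER, payload TEXT)")
con.execute("CREATE INDEX t_col1_col2 ON t (col1, col2)")

def plan(where):
    rows = con.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM t WHERE {where}").fetchall()
    return " ".join(r[-1] for r in rows)

eq_plan = plan("col1 = 'B'")   # equality: a SEARCH using the index
ne_plan = plan("col1 <> 'B'")  # inequality: falls back to a full table SCAN
```

Here eq_plan reports a search using the index, while ne_plan reports a plain scan of the table.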
This is my NEWSPAPER table.
Feature Section Page
National News A 1
Sports D 1
Editorials A 12
Business E 1
Weather C 2
Television B 7
Births F 7
Classified F 8
Modern Life B 1
Comics C 4
Movies B 4
Bridge B 2
Obituaries F 6
Doctor Is In F 6
When I run this query
select feature,section,page from NEWSPAPER
where section = 'F'
order by page;
It gives this output
Doctor Is In F 6
Obituaries F 6
Births F 7
Classified F 8
But in Kevin Loney's Oracle 10g Complete Reference the output is like this
Obituaries F 6
Doctor Is In F 6
Births F 7
Classified F 8
Please help me understand how this is happening.
If you need reliable, reproducible ordering when two values in your ORDER BY clause's first column are the same, you should always provide another, secondary column to order by. You might be able to assume that tied rows sort in the order they were entered (almost always the case, in my experience) or in index order, but you should never rely on it: the SQL standard does not specify any default ordering for ties, and unless the engine you are using specifically documents such behavior -- and even then -- I would not depend on it.
Your query, if you wanted alphabetical sorting by feature within each page, should be:
SELECT feature,section,page FROM NEWSPAPER
WHERE section = 'F'
ORDER BY page, feature;
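The tie-broken query can be checked directly (a sketch in SQLite via Python's sqlite3; the rows are deliberately inserted in a different order to show the explicit sort keys win):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE NEWSPAPER (feature TEXT, section TEXT, page INTEGER)")
# Insertion order differs from the desired output order on purpose.
con.executemany("INSERT INTO NEWSPAPER VALUES (?, ?, ?)",
                [("Births", "F", 7), ("Classified", "F", 8),
                 ("Obituaries", "F", 6), ("Doctor Is In", "F", 6)])

rows = con.execute("""
    SELECT feature, section, page FROM NEWSPAPER
    WHERE section = 'F'
    ORDER BY page, feature
""").fetchall()
```

Within page 6, 'Doctor Is In' now reproducibly sorts before 'Obituaries', regardless of how the rows were inserted.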
In relational databases, tables are sets and are unordered. The order by clause is used primarily for output purposes (and a few other cases such as a subquery containing rownum).
This is a good place to start. The SQL standard does not specify what has to happen when the keys on an order by are the same. And this is for good reason. Different techniques can be used for sorting. Some might be stable (preserving original order). Some methods might not be.
Focus on whether the same rows are in the sets, not their ordering. By the way, I would consider this an unfortunate example. The book should not have ambiguous sorts in its examples.
When you use the SELECT statement to query data from a table, the order in which rows appear in the result set may not be what you expect.
In some cases, the rows appear in the order in which they are physically stored in the table. However, if the query optimizer uses an index to process the query, the rows appear in the index key order. For this reason, the order of rows in the result set is undetermined and unpredictable.
The query optimizer is a built-in software component in the database
system that determines the most efficient way for an SQL statement to
query the requested data.
I have a database with about five possible index columns, all of which are useful in different ways. Let's call them System, Source, Heat, Time, and Row. Using System and Row together will make a unique key, and if sorted by System-Row the database will also be sorted for any combination of the five index variables (in the order I listed them above).
My problem is that I use all combinations of these columns: sometimes I want to JOIN each System-Row to the next System-(Row+1), sometimes I want to GROUP or WHERE by System-Source-Heat, sometimes I want to look at all entries of System-Source WHERE Time is in a specific window, etc.
Basically, I want an index structure that functions similarly to every possible permutation of those five indexes (in the correct order, of course), without actually making every permutation (although I am willing to do so if necessary). I'm doing statistics / analytics, not traditional database work, so the size of the index and speed of creating / updating it is not a concern; I only care about speeding my improvised queries as I tend to think them up, run them, wait 5-10 minutes, and then never use them again. Thus my main concern is reducing the "wait 5-10 minutes" to something more like "wait 1-2 minutes."
My sorted data would look something like this:
Sys So H Ti R
1 1 0 .1 1
1 1 1 .2 2
1 1 1 .3 3
1 1 2 .3 4
1 2 0 .5 5
1 2 0 .6 6
1 2 1 .8 7
1 2 2 .8 8
EDIT: It may simplify things a bit that System virtually always needs to be the first column for any of the other 4 columns to be in sorted order.
If you are ONLY concerned with SELECT speed and don't care about INSERT, then you can materialize ALL the combinations as INDEXED views. You only need 24 times the storage of the original table, making one table and 23 INDEXED VIEWs of 5 columns each.
e.g.
create table data (
id int identity primary key clustered,
sys int,
so int,
h float,
ti datetime,
r int);
GO
create view dbo.data_v1 with schemabinding as
select sys, so, h, ti, r
from dbo.data;
GO
create unique clustered index cix_data_v1 on data_v1(sys, h, ti, r, so)
GO
create view dbo.data_v2 with schemabinding as
select sys, so, h, ti, r
from dbo.data;
GO
create unique clustered index cix_data_v2 on data_v2(sys, ti, r, so, h)
GO
-- and so on and so forth, keeping "sys" anchored at the front
Do note, however
Q. Why isn't my indexed view being picked up by the query optimizer for use in the query plan? (search within linked article)
If space IS an issue, then the next best thing is to create individual indexes on each of the 4 columns, leading with system, i.e. (sys,ti), (sys,r) etc. These can be used together if it will help the query, otherwise it will revert to a full table scan.
Sorry for taking a while to get back to this, I had to work on something else for a few weeks. Anyway, after trying a bunch of things (including everything suggested here, even the brute-force "make an index for every permutation" method), I haven't found any indexing method that significantly improves performance.
However, I HAVE found an alternate, non-indexing solution: selecting only the rows and columns I'm interested in into intermediary tables, and then working with those instead of the complete table (so I use about 5 mil rows of 6 cols instead of 30 mil rows of 35 cols). The initial select and table creation is a bit slow, but the steps after that are so much faster I actually save time even if I only run it once (and considering how often I change things, it's usually much more than once).
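That workflow can be sketched as follows (SQLite via Python's sqlite3; the table, column names, and filter are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (sys INTEGER, so INTEGER, h REAL, ti REAL, r INTEGER)")
con.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 0, 0.1, 1), (1, 1, 1, 0.2, 2), (1, 1, 1, 0.3, 3),
                 (1, 2, 0, 0.5, 5), (2, 1, 2, 0.3, 4)])

# Materialize only the rows and columns the analysis needs, then index that.
con.execute("""
    CREATE TABLE work AS
    SELECT sys, so, ti, r FROM data WHERE sys = 1
""")
con.execute("CREATE INDEX work_sys_so ON work (sys, so)")

n = con.execute("SELECT COUNT(*) FROM work").fetchone()[0]
```

All subsequent improvised queries then run against work, which is a fraction of the size of data and carries its own narrow index.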
I have a suspicion that the reason for this vast improvement will be obvious to most SQL users (probably something about pagefile size), and I apologize if so. My only excuse is that I'm a statistician trying to teach myself how to do this as I go, and while I'm pretty decent at getting what I want done to happen (eventually), my understanding of the mechanics of how it's being done are distressingly close to "it's a magic black box, don't worry about it."
What is the shortest or fastest SQL select query or SQL procedure to crawl a social graph. Imagine we have this table:
UId FriendId
1 2
2 1
2 4
1 3
5 7
7 5
7 8
5 9
9 7
We have two subsets of people here. I'm talking about a SQL query or procedure to which, if we pass:
Uid = 4, it returns the result set rows with uid: {1, 2, 3}
or if
Uid = 9, it returns the result set rows with uid: {5, 7, 8}
Sorry for my poor English.
So you want to get all friends of someone, including n-th degree friends? I don't think that is possible without recursion.
How you can do that is explained here:
https://inviqa.com/blog/graphs-database-sql-meets-social-network
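If your database supports recursive CTEs (SQLite 3.8.3+, PostgreSQL, SQL Server), the whole connected component can be collected in a single query. A sketch against the sample data, assuming the table is named t with columns UId and FriendId:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (UId INTEGER, FriendId INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(1, 2), (2, 1), (2, 4), (1, 3),
                 (5, 7), (7, 5), (7, 8), (5, 9), (9, 7)])

def component(start):
    # UNION (not UNION ALL) de-duplicates, which also stops cycles.
    # The CASE picks whichever endpoint of the edge is the "other" node.
    rows = con.execute("""
        WITH RECURSIVE reach(uid) AS (
            SELECT :start
            UNION
            SELECT CASE WHEN t.UId = reach.uid THEN t.FriendId ELSE t.UId END
            FROM t JOIN reach ON t.UId = reach.uid OR t.FriendId = reach.uid
        )
        SELECT uid FROM reach WHERE uid <> :start
    """, {"start": start}).fetchall()
    return sorted(r[0] for r in rows)
```

With the sample data, component(4) gives [1, 2, 3] and component(9) gives [5, 7, 8], matching the question.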
If you are storing your values in an adjacency list, the easiest way I've found to crawl it is to translate it into a graphing language and query that. For example, if you were working in PHP, you could use the Image_GraphViz package. Or, if you want to use AJAX, you might consider cytoscapeweb. Both work well.
In either case, you'd SELECT * FROM mytable and feed all the records into the graph package as nodes. This means outputting them in dot or GraphML (or other graphing language). Then you can easily query them.
If you don't wish to translate the dataset, consider storing it as nested sets. Nested sets, though a bit of a pain to maintain, are much better than adjacency lists for the kind of queries you are looking to do.
If you are storing your values in an adjacency list and you only need a fixed degree, you can chain INNER JOINs, stepping from each node to its friends. For example, for friends-of-friends:
Select t1.UId, t2.FriendId, t3.FriendId
FROM t t1
INNER JOIN t t2 ON t2.UId = t1.FriendId
INNER JOIN t t3 ON t3.UId = t2.FriendId
WHERE t1.UId = 4
This query is like a DFS with a fixed depth.