A more efficient way to sum the difference between columns in postgres? - sql

For my application I have a table with these three columns: user, item, value
Here's some sample data:
user  item  value
-----------------
   1     1     50
   1     2     45
   1    23     35
   2     1     88
   2    23     44
   3     2     12
   3     1     27
   3     5     76
   3    23     44
What I need to do is, for a given user, perform simple arithmetic against everyone else's values.
Let's say I want to compare user 1 against everyone else. The calculation looks something like this:
first_user  second_user  result
         1            2  SUM(ABS(50-88) + ABS(35-44))
         1            3  SUM(ABS(50-27) + ABS(45-12) + ABS(35-44))
This is currently the bottleneck in my program. For example, many of my queries are starting to take 500+ milliseconds, with this algorithm taking around 95% of the time.
I have many rows in my database, and the algorithm is O(n^2): it has to compare all of user 1's values against everyone else's matching values.
I believe I have only two options for how to make this more efficient. First, I could cache the results. But the resulting table would be huge because of the NxN space required, and the values need to be relatively fresh.
The second way is to make the algorithm much quicker. I searched for "postgres SIMD" because I think SIMD sounds like the perfect solution to optimize this. I found a couple related links like this and this, but I'm not sure if they apply here. Also, they seem to both be around 5 years old and relatively unmaintained.
Does Postgres have support for this sort of feature? Where you can "vectorize" a column or possibly import or enable some extension or feature to allow you to quickly perform these sorts of basic arithmetic operations against many rows?

I'm not sure where you get O(n^2) for this. You need to look up the rows for user 1 and then read the data for everyone else. Assuming there are few items and many users, this would be essentially O(n), where "n" is the number of rows in the table.
The query could be phrased as:
-- "user" is a reserved word in Postgres; quoting it is the safe option
select t1."user", t."user", sum(abs(t.value - t1.value))
from t left join
     t t1
     on t1.item = t.item and
        t1."user" <> t."user" and
        t1."user" = 1
group by t1."user", t."user";
For this query, you want an index on t(item, user, value).
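A minimal sketch of creating that index (the table and column names follow the question; "user" needs quoting here because it is a reserved word in Postgres):
create index t_item_user_value_idx on t (item, "user", value);
Because the index covers every column the query touches, Postgres can answer the join from the index alone (an index-only scan) without reading the table itself.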

Related

sql use different columns for same query (directed graph as undirected)

Suppose I have a table of relationships like in a directed graph. For some pairs of ids there are both 1->2 and 2->1 relations, for others there are not. Some nodes are only present in one column.
a  b
1  2
2  1
1  3
4  1
5  2
Now I want to work with it as an undirected graph: grouping and filtering with both columns taken into account. For example, filter out node 5 and count the neighbors of the remaining nodes:
node  neighbor_count
   1               3
   2               1
   3               1
   4               1
Is it possible to compose queries in such a way that first column a is used and then column b is used in the same manner?
I know it is achievable by doubling the table:
select a, count(distinct b)
from
  (select * from grap
   union all
   select b as a, a as b from grap) as g
where a not in (5,6,7) and b not in (5,6,7)
group by a;
However, the real tables are quite large (10^9 to 10^10 pairs). Would the union require additional disk usage? A single scan through the database is already quite slow for me. Are there better ways to do this?
(Currently the database is SQLite, but the less platform-specific the answer, the better.)
The union all is generated only for the duration of the query. Does it use more disk space? Not permanently.
If the processing of the query requires saving the data out to disk, then it will use more temporary storage for intermediate results.
I would suggest, though, that if you want an undirected graph with this representation, you add the additional pairs that are not already in the table. This will use more disk space, but you won't have to play games with queries.
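A minimal sketch of that backfill, assuming the table is named grap(a, b) as in the question:
insert into grap (a, b)
select b, a
from grap g1
where not exists
  (select 1 from grap g2 where g2.a = g1.b and g2.b = g1.a);
After this runs, every edge is stored in both directions, so subsequent queries can group and filter on column a alone.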

Access SQL - Add Row Number to Query Result for a Multi-table Join

What I am trying to do is fairly simple. I just want to add a row number to a query. Since this is in Access, it is a bit more difficult than in other SQL dialects, but under normal circumstances it is still doable using solutions such as DCount or Select Count(*); see, for example: How to show row number in Access query like ROW_NUMBER in SQL or Access SQL how to make an increment in SELECT query
My Issue
My issue is I'm trying to add this counter to a multi-join query that orders by fields from numerous tables.
Troubleshooting
My code is a bit ridiculous (19 fields, seven of which are long expressions, from 9 different joined tables, ordered by fields from 5 of those tables). To keep things simple, I have a simplified example query below:
Example Query
SELECT DCount("*","Requests_T","[Requests_T].[RequestID]<=" & [Requests_T].[RequestID]) AS counter, Requests_T.RequestHardDeadline AS Deadline, Requests_T.RequestOverridePriority AS Priority, Requests_T.RequestUserGroup AS [User Group], Requests_T.RequestNbrUsers AS [Nbr of Users], Requests_T.RequestSubmissionDate AS [Submitted on], Requests_T.RequestID
FROM (((Requests_T
INNER JOIN ENUM_UserGroups_T ON ENUM_UserGroups_T.UserGroups = Requests_T.RequestUserGroup)
INNER JOIN ENUM_RequestNbrUsers_T ON ENUM_RequestNbrUsers_T.NbrUsers = Requests_T.RequestNbrUsers)
INNER JOIN ENUM_RequestPriority_T ON ENUM_RequestPriority_T.Priority = Requests_T.RequestOverridePriority)
ORDER BY Requests_T.RequestHardDeadline, ENUM_RequestPriority_T.DisplayOrder DESC , ENUM_UserGroups_T.DisplayOrder, ENUM_RequestNbrUsers_T.DisplayOrder DESC , Requests_T.RequestSubmissionDate;
If the code above selects a field from a table not included, I apologize; just trust that the field comes from somewhere (lol, i.e. one of the other joins I excluded to simplify the query). A great example of this is the .DisplayOrder fields used in the ORDER BY expression. These come from tables that simply determine the display "priority" of an enum. Example: Requests_T.RequestOverridePriority displays to the user as a combobox with the options "Low", "Med", and "High". So in a table, I assign numerical priorities of "1", "2", and "3" to these options, respectively. Thus when ENUM_RequestPriority_T.DisplayOrder DESC is used in the order by, all "High" priority requests display above "Med" and "Low" ones. The same holds true for ENUM_UserGroups_T.DisplayOrder and ENUM_RequestNbrUsers_T.DisplayOrder.
I'd also prefer NOT to use DCount, for efficiency, and rather do something like:
(select count(*) from Requests_T t2 where t2.RequestID <= Requests_T.RequestID) as counter
Due to the "Order By" expression, however, my 'counter' doesn't actually count my resulting rows sequentially, since both of my examples are tied to the RequestID.
Example Results
Based on my actual query results, I've made an example result of the query above.
Counter  Deadline    Priority  User_Group  Nbr_of_Users  Submitted_on  RequestID
5        12/01/2016  High      IT          2-4           01/01/2016    5
7        01/01/2017  Low       IT          2-4           05/06/2016    8
10                   Med       IT          2-4           07/13/2016    11
15                   Low       IT          10+           01/01/2016    16
8                    Low       IT          2-4           01/01/2016    9
2                    Low       IT          2-4           05/05/2016    2
The query is displaying my results in the proper order (those with the nearest deadline at the top, then those with the highest priority, then user group, then # of users, and finally, if all else is equal, sorted by submission date). However, my "Counter" values are completely wrong! The counter field should simply increment by 1 for each new row. Thus if displaying a single request on a form for a user, I could say
"You are number: Counter [associated to RequestID] in the
development queue."
Meanwhile my results:
Aren't sequential (notice the first four display sequentially, but then the final two rows don't)! Even though the final two rows are lower in priority than the records above them, they ended up with a lower Counter value simply because they had the lower RequestID.
They don't start at "1" and increment +1 for each new record.
Ideal Results
Thus my ideal result from above would be:
Counter  Deadline    Priority  User_Group  Nbr_of_Users  Submitted_on  RequestID
1        12/01/2016  High      IT          2-4           01/01/2016    5
2        01/01/2017  Low       IT          2-4           05/06/2016    8
3                    Med       IT          2-4           07/13/2016    11
4                    Low       IT          10+           01/01/2016    16
5                    Low       IT          2-4           01/01/2016    9
6                    Low       IT          2-4           05/05/2016    2
I'm spoiled by PL/SQL and other software where this would be automatic, lol. This is driving me crazy! Any help would be greatly appreciated.
FYI - I'd prefer an SQL option over VBA if possible. VBA is very much welcomed and will definitely get an up vote and my huge thanks if it works, but I'd like to mark an SQL option as the answer.
Unfortunately, MS Access doesn't have the very useful ROW_NUMBER() function that other database engines do. So we are left to improvise.
Because your query is so complicated and MS Access does not support common table expressions, I recommend you follow a two step process. First, name that query you already wrote IntermediateQuery. Then, write a second query called FinalQuery that does the following:
SELECT i1.field_primarykey, i1.field2, ... , i1.field_x,
       (SELECT COUNT(*) FROM IntermediateQuery i2
        WHERE i2.field_primarykey <= i1.field_primarykey) AS Counter
FROM IntermediateQuery i1
ORDER BY Counter;
The unfortunate side effect of this is the more data your table returns, the longer it will take for the inline subquery to calculate. However, this is the only way you'll get your row numbers. It does depend on having a primary key in the table. In this particular case, it doesn't have to be an explicitly defined primary key, it just needs to be a field or combination of fields that is completely unique for each record.
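If the counter needs to follow the display order rather than the primary-key order (which was the original complaint), one hedged variant is to count on a sort expression instead. This sketch assumes you add a single precomputed column, hypothetically called SortKey, to IntermediateQuery that encodes the full ORDER BY, and it breaks ties with the primary key:
SELECT i1.field_primarykey,
       (SELECT COUNT(*) FROM IntermediateQuery i2
        WHERE i2.SortKey < i1.SortKey
           OR (i2.SortKey = i1.SortKey
               AND i2.field_primarykey <= i1.field_primarykey)) AS Counter
FROM IntermediateQuery i1
ORDER BY Counter;
Each row's Counter is then the number of rows that sort at or before it, which is exactly a row number in display order.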

Do indexes work in NOT IN or <> clause?

I have read that normal indexes in (at least Oracle) databases are basically B-tree structures, and hence store records relative to an appropriate root node. Records 'less than' the root are iteratively stored in the left portion of the tree, while records 'greater than' the root are stored in the right portion. It is this storage approach that allows faster scans via tree traversal, since depth and breadth are reduced.
However, while creating indexes or performance-tuning a where clause, most guides advise prioritizing first the columns compared with equality (IN or = clauses) and only then moving to the columns with inequality clauses (NOT IN, <>). What is the reason for this advice? Should it not be as feasible to establish, via tree traversal, that a given value does not exist as it is to establish that it does?
Do indexes not work with negation?
The issue is locality within the index. If you have two columns, with letters in col1 and numbers in col2, then an index might look like:
Ind  col1  col2
 1    A     1
 2    A     1
 3    A     1
 4    A     2
 5    B     1
 6    B     1
 7    B     2
 8    B     3
 9    B     3
10    C     2
11    C     3
(Ind is the position in the index; the record locator is left out.)
If you are looking for col1 = 'B', you can find position 5 and then scan the index through position 9. If you are looking for col1 <> 'B', you have to scan from the start of the index up to the first 'B', then skip over the 'B' range and resume scanning from the first record after it. This gets worse with IN and NOT IN, which can fragment the index into many such ranges.
An additional factor is that if only a relative handful of records satisfies the equality condition, then almost all records fail it -- and indexes are often not useful when almost all records need to be read. A sometimes-exception to this is clustered indexes.
Oracle has better index optimizations than most databases -- it will do multiple scans starting in different locations. Even so, an inequality is often much less useful for an index.
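A quick way to see this yourself, sketched with Postgres-style EXPLAIN (Oracle's equivalent is EXPLAIN PLAN FOR followed by a SELECT from DBMS_XPLAN.DISPLAY; the table name t is assumed):
create index t_col1_idx on t (col1);
explain select * from t where col1 = 'B';   -- typically an index (range) scan
explain select * from t where col1 <> 'B';  -- typically a full scan, since most rows qualify
The planner's choice also depends on statistics: if 'B' accounted for nearly all rows, the inequality would become the selective predicate instead.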

Index structure to maximize speed across any combination of index columns

I have a database with about five possible index columns, all of which are useful in different ways. Let's call them System, Source, Heat, Time, and Row. Using System and Row together will make a unique key, and if sorted by System-Row the database will also be sorted for any combination of the five index variables (in the order I listed them above).
My problem is that I use all combinations of these columns: sometimes I want to JOIN each System-Row to the next System-(Row+1), sometimes I want to GROUP or WHERE by System-Source-Heat, sometimes I want to look at all entries of System-Source WHERE Time is in a specific window, etc.
Basically, I want an index structure that functions similarly to every possible permutation of those five indexes (in the correct order, of course), without actually making every permutation (although I am willing to do so if necessary). I'm doing statistics / analytics, not traditional database work, so the size of the index and speed of creating / updating it is not a concern; I only care about speeding my improvised queries as I tend to think them up, run them, wait 5-10 minutes, and then never use them again. Thus my main concern is reducing the "wait 5-10 minutes" to something more like "wait 1-2 minutes."
My sorted data would look something like this:
Sys  So  H  Ti  R
  1   1  0  .1  1
  1   1  1  .2  2
  1   1  1  .3  3
  1   1  2  .3  4
  1   2  0  .5  5
  1   2  0  .6  6
  1   2  1  .8  7
  1   2  2  .8  8
EDIT: It may simplify things a bit that System virtually always needs to be included as the first column to make any of the other 4 columns in sorted order.
If you are ONLY concerned with SELECT speed and don't care about INSERT speed, then you can materialize ALL the combinations as INDEXED VIEWs. You only need 24 times the storage of the original table: one table plus 23 INDEXED VIEWs of 5 columns each.
e.g.
create table data (
    id int identity primary key clustered,
    sys int,
    so int,
    h float,
    ti datetime,
    r int);
GO
create view dbo.data_v1 with schemabinding as
select sys, so, h, ti, r
from dbo.data;
GO
create unique clustered index cix_data_v1 on data_v1(sys, h, ti, r, so)
GO
create view dbo.data_v2 with schemabinding as
select sys, so, h, ti, r
from dbo.data;
GO
create unique clustered index cix_data_v2 on data_v2(sys, ti, r, so, h)
GO
-- and so on and so forth, keeping "sys" anchored at the front
Do note, however
Q. Why isn't my indexed view being picked up by the query optimizer for use in the query plan? (search within linked article)
If space IS an issue, then the next best thing is to create individual indexes on each of the 4 columns, leading with system, i.e. (sys,ti), (sys,r) etc. These can be used together if it will help the query, otherwise it will revert to a full table scan.
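A sketch of that compact alternative, reusing the data table defined above:
create index ix_data_sys_so on data (sys, so);
create index ix_data_sys_h  on data (sys, h);
create index ix_data_sys_ti on data (sys, ti);
create index ix_data_sys_r  on data (sys, r);
The optimizer can intersect these narrower indexes when a query filters on several columns at once, though that is usually slower than a single index whose column order matches the query exactly.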
Sorry for taking a while to get back to this, I had to work on something else for a few weeks. Anyway, after trying a bunch of things (including everything suggested here, even the brute-force "make an index for every permutation" method), I haven't found any indexing method that significantly improves performance.
However, I HAVE found an alternate, non-indexing solution: selecting only the rows and columns I'm interested in into intermediary tables, and then working with those instead of the complete table (so I use about 5 mil rows of 6 cols instead of 30 mil rows of 35 cols). The initial select and table creation is a bit slow, but the steps after that are so much faster I actually save time even if I only run it once (and considering how often I change things, it's usually much more than once).
I have a suspicion that the reason for this vast improvement will be obvious to most SQL users (probably something about pagefile size), and I apologize if so. My only excuse is that I'm a statistician trying to teach myself how to do this as I go, and while I'm pretty decent at getting what I want done to happen (eventually), my understanding of the mechanics of how it's being done is distressingly close to "it's a magic black box, don't worry about it."
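A minimal sketch of that intermediate-table approach, in the same SQL Server dialect as the earlier answer (the subset table name and filter values are hypothetical):
select sys, so, h, ti, r
into analysis_subset              -- SELECT ... INTO creates and fills the new table
from dbo.data
where sys in (1, 2, 3);           -- keep only the rows under study
create index ix_subset_sys_so_h on analysis_subset (sys, so, h);
-- run the improvised queries against analysis_subset instead of dbo.data
The speedup mostly comes from scanning a table with a fraction of the original's rows and columns, so every ad-hoc query afterward touches far fewer pages.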

A simple SQL Select query to crawl all connected people in a social graph?

What is the shortest or fastest SQL select query or SQL procedure to crawl a social graph? Imagine we have this table:
UId  FriendId
  1   2
  2   1
  2   4
  1   3
  5   7
  7   5
  7   8
  5   9
  9   7
We have two subsets of people here. I'm talking about a SQL query or procedure to which, if we pass
Uid = 4, it returns the result set rows with uid: {1, 2, 3}
or, if
Uid = 9, it returns the result set rows with uid: {5, 7, 8}
Sorry for my poor English.
So you want to get all friends of someone, including n-th degree friends? I don't think it is possible without recursion.
How you can do that is explained here:
https://inviqa.com/blog/graphs-database-sql-meets-social-network
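For engines with recursive CTE support (SQLite 3.8.3+ and Postgres, for instance; syntax differs elsewhere), a minimal sketch looks like this. The table name friends(uid, friendid) is an assumption, and edges are followed in both directions because the sample data is not fully symmetric:
with recursive reachable(id) as (
    select 4                         -- the starting uid
    union                            -- union (not union all) deduplicates, which also stops cycles
    select case when f.uid = r.id then f.friendid else f.uid end
    from friends f
    join reachable r on f.uid = r.id or f.friendid = r.id
)
select id from reachable where id <> 4;
For the sample data this returns {1, 2, 3}, matching the expected output for Uid = 4.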
If you are storing your values in an adjacency list, the easiest way I've found to crawl it is to translate it into a graphing language and query that. For example, if you were working in PHP, you could use the Image_GraphViz package. Or, if you want to use AJAX, you might consider cytoscapeweb. Both work well.
In either case, you'd SELECT * FROM mytable and feed all the records into the graph package as nodes. This means outputting them in dot or GraphML (or other graphing language). Then you can easily query them.
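For instance, a rough way to emit GraphViz dot straight from SQL (|| concatenation as in SQLite; the friends table name is again an assumption):
select 'digraph G {'
union all
select '  ' || uid || ' -> ' || friendid || ';' from friends
union all
select '}';
Feeding that output to the dot command then renders the components visually.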
If you don't wish to translate the dataset, consider storing it as nested sets. Nested sets, though a bit of a pain to maintain, are much better than adjacency lists for the kind of queries you are looking to do.
If you are storing your values in an adjacency list and only need a fixed n-th degree, you can simply chain self-joins on the UIDs. For example, to walk two hops out from each user (again assuming a friends(uid, friendid) table):
SELECT f1.uid, f2.uid, f3.uid
FROM friends f1
INNER JOIN friends f2 ON f2.uid = f1.friendid
INNER JOIN friends f3 ON f3.uid = f2.friendid;
This query is like a DFS with a fixed depth.