What is the use case that makes EAVT index preferable to EATV? - indexing

From what I understand, EATV (which Datomic does not have) would be a great fit for as-of queries. On the other hand, I see no use case for EAVT.

This is analogous to row/primary-key access. From the docs: "The EAVT index provides efficient access to everything about a given entity. Conceptually this is very similar to row access style in a SQL database, except that entities can possess arbitrary attributes rather than being limited to a predefined set of columns."
The immutable time/history side of Datomic is a motivating use case for it, but in general, it's still optimized around typical database operations, e.g. looking up an entity's attributes and their values.
Update:
Datomic stores datoms (in segments) in the index tree. So you navigate to a particular E's segment using the tree and then retrieve the datoms about that E in the segment, which are EAVT datoms. From your comment, I believe you're thinking of this as the navigation of more b-tree like structures at each step, which is incorrect. Once you've navigated to the E, you are accessing a leaf segment of (sorted) datoms.

You are not looking for a single value at a specific point in time. You are looking for a set of values up to a specific point in time T. History is kept on a per-value basis (not a per-attribute basis).
For example: assert X, retract X, then assert X again. These are three distinct facts over three distinct transactions. You need to compute that X was added, then removed, and then possibly added again at some point.
You can do this with SQL:
create table Datoms (
  E  bigint          not null,
  A  bigint          not null,
  V  varbinary(1536) not null,
  T  bigint          not null,
  Op bit             not null  -- assert (1) / retract (0)
)
select E, A, V
from Datoms
where E = 1 and T <= 42
group by E, A, V
having 0 < sum(case Op when 1 then +1 else -1 end)
The fifth component Op of the datom tells you whether the value is asserted (1) or retracted (0). By summing over this value (as +1/-1) we arrive at either 1 or 0.
Asserting the same value twice does nothing, and you always retract the old value before you assert a new value. The last part is a prerequisite for the algorithm to work out so nicely.
With an EAVT index, this is a very efficient query and it's quite elegant. You can build a basic Datomic-like system in just 150 lines of SQL like this. It is the same pattern repeated for any permutation of the EAVT index that you want.
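To make that last point concrete, here is a hedged sketch (SQL Server flavored, matching the toy table above; the index names are made up) of the four sort orders as covering indexes, mirroring Datomic's EAVT/AEVT/AVET/VAET; a large V column may need to be truncated or hashed to stay within index key-size limits:
create index IX_Datoms_EAVT on Datoms (E, A, V, T) include (Op);  -- everything about an entity
create index IX_Datoms_AEVT on Datoms (A, E, V, T) include (Op);  -- one attribute across entities
create index IX_Datoms_AVET on Datoms (A, V, E, T) include (Op);  -- lookup by attribute value
create index IX_Datoms_VAET on Datoms (V, A, E, T) include (Op);  -- reverse (reference) lookups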

Related

SQL Query for Search Page

I am working on a small project for an online databases course and I was wondering if you could help me out with a problem I am having.
I have a web page that is searching a movie database and retrieving specific columns using a movie initial input field, a number input field, and a code field. These will all be converted to strings and used as user input for the query.
Below is what I tried before:
select A.CD, A.INIT, A.NBR, A.STN, A.ST, A.CRET_ID, A.CMNT, A.DT
from MOVIE_ONE A
where A.INIT = :init
AND A.CD = :cd
AND A.NBR = :num
The way the page must search is in three different cases:
(initial and number)
(code)
(initial and number and code)
The cases have to be independent, so if certain fields are empty but a certain case is fulfilled, the search still goes through. It also must be in one query. I am stuck on how to implement the cases.
The parameters in the query are taken from the Java parameters in the method found in an SQLJ file.
If you could possibly provide some aid on how I can go about this problem, I'd greatly appreciate it!
Consider wrapping the equality expressions in NVL (synonymous with COALESCE), so that if a parameter input is blank, the corresponding column is checked against itself. Also, be sure to kick the a-b-c table aliasing habit.
SELECT m.CD, m.INIT, m.NBR, m.STN, m.ST, m.CRET_ID, m.CMNT, m.DT
FROM MOVIE_ONE m
WHERE m.INIT = NVL(:init, m.INIT)
AND m.CD = NVL(:cd, m.CD)
AND m.NBR = COALESCE(:num, m.NBR)
To demonstrate, consider the DB2 fiddles below, where each case can be checked by adjusting the VALUES CTE parameters, all running on the same exact data.
Case 1
WITH
i(init) AS (VALUES('db2')),
c(cd) AS (VALUES(NULL)),
n(num) AS (VALUES(53)),
cte AS
...
Case 2
WITH
i(init) AS (VALUES(NULL)),
c(cd) AS (VALUES(2018)),
n(num) AS (VALUES(NULL)),
cte AS
...
Case 3
WITH
i(init) AS (VALUES('db2')),
c(cd) AS (VALUES(2018)),
n(num) AS (VALUES(53)),
cte AS
...
However, do be aware that the fiddle runs slightly different SQL due to the nature of the data (i.e., doubles and dates), but the query reflects the same concept, with NVL matching expressions on both sides.
SELECT *
FROM cte, i, c, n
WHERE cte.mytype = NVL(i.init, cte.mytype)
AND YEAR(CAST(cte.mydate AS date)) = NVL(c.cd, YEAR(CAST(cte.mydate AS date)))
AND ROUND(cte.mynum, 0) = NVL(n.num, ROUND(cte.mynum, 0));
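One caveat worth adding (my own note, not part of the original answer): NVL only falls through when the bind value is actually NULL, so if the Java/SQLJ side passes empty strings for unfilled fields, wrapping the character parameters in NULLIF first keeps the pattern working. A sketch, assuming :init and :cd arrive as strings and :num as a nullable number:
SELECT m.CD, m.INIT, m.NBR, m.STN, m.ST, m.CRET_ID, m.CMNT, m.DT
FROM MOVIE_ONE m
WHERE m.INIT = NVL(NULLIF(:init, ''), m.INIT)  -- empty string treated as "not supplied"
  AND m.CD   = NVL(NULLIF(:cd, ''), m.CD)
  AND m.NBR  = COALESCE(:num, m.NBR)           -- numeric parameter: NULL check only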

Matching observations based on similarity of categorical variables

I was wondering if someone has a good way to match two observations based on categorical (non-ordinal) variables.
The exercise I am working on is matching mentees with mentors based on interests and other characteristics that are (non-ordinal or ordinal) categorical variables.
Variable: Possible Values
Sport: “Baseball”, “Football”, “Basketball” (…)
Marital Status: “Single, no kids”, “Single, young kids”, “Married, no kids”, “Married, young kids”, (…)
Job Level: 1, 2, 3, 4, 5, 6
Industry: “Retail”, “Finance”, “Wholesale”, (…)
There are also indicators for whether each variable is important to the person. I understand I could force marital status into one or two ordinal variables like (“Single”, “Married”, “Widowed”) and (“no kids”, “young kids”, “grown kids”). But I don’t know how to handle industry and sport, as there is no logical order to them. My plan was originally to use a clustering technique to find a match between the mentor and mentee sets based on the shortest distance for the given points. But that would ignore the fact that people can decide whether a variable is important to them or not (“Yes”, “No”).
Now I am thinking of just brute-forcing the logic with nested IF statements that check whether there is a perfect match based on the importance flags and the actual values, and ELSE check whether there is a record that matches on all but one category, etc. This seems very inefficient, so I was hoping that if someone has come across a similar problem, I could find a better way to handle this.
Would it make sense to create two variables, one for the importance sequence (e.g. "YesNoYesNoNo") and one for the interests (e.g. "BasketballSingleNokids6Retail"), and then employ fuzzy matching?
Best regards,
One approach would be to decide first on which variables you must have an exact match, do a cartesian join on those, then generate a score based on other non-mandatory matches and output records where the score exceeds a threshold. The more mandatory matches you require, the better the query will perform.
E.g.
%let MATCH_THRESHOLD = 2; /* At least this many optional variables must match */
proc sql;
create table matches as
select * from mentors a inner join mentees b
/* Mandatory matches */
on a.m_var1 = b.m_var1
and a.m_var2 = b.m_var2
and ...
/* Optional threshold-based matches: each equality evaluates to 1 (true) or 0 (false) */
where (a.o_var1 = b.o_var1)
+ (a.o_var2 = b.o_var2)
+ ...
>= &MATCH_THRESHOLD;
quit;
Going further - if you have inconsistently entered data, you could use soundex or edit distance matching rather than exact matching for the optional conditions. If some optional matches are worth more than others, you can weight their contribution to the score.
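As a rough illustration of that weighting idea (a sketch only: the weights are made up and the m_var/o_var columns are the same hypothetical ones as above), the score can be built inside the same proc sql step and filtered with the CALCULATED keyword:
proc sql;
create table weighted_matches as
select a.*, b.*,
(a.o_var1 = b.o_var1) * 3   /* e.g. a shared industry counts triple */
+ (a.o_var2 = b.o_var2) * 1 /* e.g. a shared sport counts once */
as score
from mentors a inner join mentees b
on a.m_var1 = b.m_var1
where calculated score >= &MATCH_THRESHOLD;
quit;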

Spatial SQL query showing parcels containing centroid of building

I am trying to write a query that selects parcels containing the centroid of buildings with a certain building code (bldg_code = 3).
The parcels are listed in the table "city.zoning", which contains columns for the PIN, geometry, and area of each parcel. The table "buildings" contains columns for bldg_type and bldg_code, indicating the building type and its corresponding code. The building type of interest for this query has a bldg_code of 3.
So far I've developed a query that shows parcels that interact with the building type of interest:
select a.*
from city.zoning a, username.buildings b
where b.bldg_code = 3 and sdo_anyinteract(a.geom,b.geom) = 'TRUE';
Any ideas?
You can use SDO_GEOM.SDO_CENTROID (documentation) to find the centroid of a geometry.
Note that the centroid provided by this function is the mathematical centroid only and may not always lie inside the geometry, for example if your polygon is L-shaped. SpatialDB Adviser has a good article on this.
If this isn't a problem for you and you don't need that level of accuracy, just use the built-in function; but if you do consider this a problem (as I did in the past), SpatialDB Adviser has a standalone PL/SQL package that correctly calculates centroids.
Depending on your performance needs, you could calculate the centroids on the fly and use them in your query directly, or alternatively add a centroid column to the table and compute and cache the values with application code (best case) or a trigger (worst case).
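A minimal sketch of that cached-column approach (assuming a new centroid column and an illustrative tolerance of 0.005; adjust the tolerance to match your layer):
ALTER TABLE username.buildings ADD (centroid SDO_GEOMETRY);

UPDATE username.buildings b
SET b.centroid = SDO_GEOM.SDO_CENTROID(b.geom, 0.005);
COMMIT;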
Your query would look something like this:
SELECT a.*
FROM city.zoning a
JOIN username.buildings b ON sdo_contains(a.geom, b.centroid) = 'TRUE'
WHERE b.bldg_code = 3
Note that this is using SDO_CONTAINS on the basis of the a.geom column being spatially indexed and a new column b.centroid that has been added and populated (note - query not tested). If the zoning geometry is not spatially indexed, then you would need to use SDO_GEOM.RELATE, or index the centroid column and invert the logic to use SDO_INSIDE.
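For completeness, a sketch of that inverted form (assuming the centroid column has been populated, registered in USER_SDO_GEOM_METADATA, and spatially indexed; untested, like the query above):
SELECT a.*
FROM city.zoning a
JOIN username.buildings b ON sdo_inside(b.centroid, a.geom) = 'TRUE'
WHERE b.bldg_code = 3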

Select pair of rows that obey a rule

I have a big table (1M rows) with the following columns:
source, dest, distance.
Each row defines a link (from A to B).
I need to find the distance between a pair of nodes via another node.
An example:
If I want to find the distance between A and B,
If I find a node x and have:
x -> A
x -> B
I can add these distances and have the distance between A and B.
My question:
How can I find all such nodes x and get their distances to A and B?
My purpose is to select the min value of distance.
P.S.: A and B are just one connection (I need to do this for 100K connections).
Thanks!
As Andomar said, you'll need the Dijkstra's algorithm, here's a link to that algorithm in T-SQL: T-SQL Dijkstra's Algorithm
Assuming you want to get the path from A to B with many intermediate steps, it is impossible to do in plain SQL for an indefinite number of steps. Simply put, it lacks the expressive power; see http://en.wikipedia.org/wiki/Expressive_power#Expressive_power_in_database_theory . As Andomar said, load the data into a process and use Dijkstra's algorithm.
This sounds like the traveling salesman problem.
From a SQL syntax standpoint: CONNECT BY PRIOR would build the tree you're after using START WITH, and you can limit the number of levels it traverses; however, doing so will not guarantee the minimum.
I may get downvoted for this, but I find this an interesting problem. I wish that this could be a more open discussion, as I think I could learn a lot from this.
It seems like it should be possible to achieve this by doing multiple select statements - something like SELECT id FROM mytable WHERE source = 'A' ORDER BY distance ASC LIMIT 1. Wrapping something like this in a while loop, and replacing 'A' with an id variable, would do the trick, no?
For example (A is source, B is final destination):
DECLARE var_id as INT
WHILE var_id != 'B'
BEGIN
SELECT id INTO var_id FROM mytable WHERE source="A" ORDER BY distance ASC LIMIT 1
SELECT var_id
END
Wouldn't something like this work? (The code is sloppy, but the idea seems sound.) Comments are more than welcome.
Join the table to itself with destination joined to source. Add the distance from the two links. Insert that as a new link with left side source, right side destination and total distance if that isn't already in the table. If that is in the table but with a shorter total distance then update the existing row with the shorter distance.
Repeat this until you get no new links added to the table and no updates with a shorter distance. Your table now contains a link for every possible combination of source and destination with the minimum distance between them. It would be interesting to see how many repetitions this would take.
This will not track the intermediate path between source and destination but only provides the shortest distance.
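A rough sketch of that iteration (my own PostgreSQL-flavored illustration; the links(source, dest, distance) table name is an assumption, and you would rerun both statements until neither affects any rows):
-- Add pairs that become reachable through one intermediate node
INSERT INTO links (source, dest, distance)
SELECT a.source, b.dest, MIN(a.distance + b.distance)
FROM links a
JOIN links b ON a.dest = b.source
WHERE NOT EXISTS (
    SELECT 1 FROM links l
    WHERE l.source = a.source AND l.dest = b.dest
)
GROUP BY a.source, b.dest;

-- Shorten existing pairs when a two-hop path beats the stored distance
UPDATE links l
SET distance = sub.best
FROM (
    SELECT a.source, b.dest, MIN(a.distance + b.distance) AS best
    FROM links a
    JOIN links b ON a.dest = b.source
    GROUP BY a.source, b.dest
) sub
WHERE sub.source = l.source
  AND sub.dest = l.dest
  AND sub.best < l.distance;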
IIUC this should do it, but I'm not sure it is really viable (performance-wise) due to the large number of rows involved and the self-join:
SELECT
t1.source AS A,
t1.dest AS x,
t2.dest AS B,
t1.distance + t2.distance AS total_distance
FROM
big_table AS t1
JOIN
big_table AS t2 ON t1.dest = t2.source
WHERE
t1.source = 'insert source (A) here' AND
t2.dest = 'insert destination (B) here'
ORDER BY
total_distance ASC
LIMIT
1
The above snippet will work for the case in which you have two rows in the form A->x and x->B, but not for other combinations (e.g. A->x and B->x). Extending it to cover all four combinations should be trivial (e.g. create a view that duplicates each row and swaps source and dest).
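If it helps, a hedged sketch of that view (my own addition; the view name is made up, and it simply treats every link as usable in either direction):
CREATE VIEW big_table_undirected AS
SELECT source, dest, distance FROM big_table
UNION
SELECT dest AS source, source AS dest, distance FROM big_table;
Running the self-join above against this view instead of big_table covers all four orderings of the two links.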

INT vs VARCHAR in search

Which one of the following queries will be faster and more optimal (and why):
SELECT * FROM items WHERE w = 320 AND h = 200 (w and h are INT)
SELECT * FROM items WHERE dimensions = '320x200' (dimensions is VARCHAR)
Here are some actual measurements. (Using SQLite; may try it with MySQL later.)
Data = All 1,000,000 combinations of w, h ∈ {1...1000}, in randomized order.
CREATE TABLE items (id INTEGER PRIMARY KEY, w INTEGER, h INTEGER)
Average time (of 20 runs) to execute SELECT * FROM items WHERE w = 320 and h = 200 was 5.39±0.29 µs.
CREATE TABLE items (id INTEGER PRIMARY KEY, dimensions TEXT)
Average time to execute SELECT * FROM items WHERE dimensions = '320x200' was 5.69±0.23 µs.
There is no significant difference, efficiency-wise.
But
There is a huge difference in terms of usability. For example, if you want to calculate the area and perimeter of the rectangles, the two-column approach is easy:
SELECT w * h, 2 * (w + h) FROM items
Try to write the corresponding query for the other way.
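For instance (my own sketch, assuming SQLite as in the measurements above), the same area and perimeter query against the single-column format has to parse the string apart first:
SELECT CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER)
       * CAST(substr(dimensions, instr(dimensions, 'x') + 1) AS INTEGER)      AS area,
       2 * (CAST(substr(dimensions, 1, instr(dimensions, 'x') - 1) AS INTEGER)
       + CAST(substr(dimensions, instr(dimensions, 'x') + 1) AS INTEGER))     AS perimeter
FROM items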
Intuitively, if you do not create INDEXes on those columns, integer comparison seems faster.
In an integer comparison, you directly compare 32-bit values for equality using simple operators.
Strings, on the other hand, are character arrays and have to be compared character by character, which is slower.
However, another point is that in the 2nd query you have one field to compare, while in the 1st query you have two. If you have 1,000,000 records and no indexes on the columns, that means up to 1,000,000 string comparisons in the worst case (if, unluckily, the last record is the one you're looking for, or it is not found at all).
On the other hand, if you have 1,000,000 records and all of them have w = 320, then you'll be comparing them on h too. That means 2,000,000 comparisons. However, if you create INDEXes on those fields, IMHO the two will be almost identical, since the VARCHAR will be hashed (which takes O(1) constant time) and compared using an INT comparison, taking O(log n) time.
Conclusion: it depends. Prefer indexes on searchable columns, and use ints.
Probably the only way to know that is to run it. I would suspect that if all columns used are indexed, there would be basically no difference. If INT is 4 bytes, it will be almost the same size as the string.
The one wrinkle is in how VARCHAR is stored. If you used a fixed-length string type, it might be faster than VARCHAR, but mostly because your SELECT * needs to go fetch it.
The huge advantage of using INT is that you can do much more sophisticated filtering. That alone should be a reason to prefer it. What if you need a range, or just width, or you want to do math on width in the filtering? What about constraints based on the columns, or aggregates?
Also, when you get the values into your programming language, you won't need to parse them before using them (which takes time).
EDIT: Some other answers are mentioning string compares. If indexed, there won't be many string compares done. And it's possible to implement very fast compare algorithms that don't need to loop byte-by-byte. You'd have to know the details of what mysql does to know for sure.
The second query, as the chances of matching the exact string are smaller (which means a smaller set of records but with greater cardinality).
In the first query, the chances of matching the first column are higher and more rows are potentially matched (lower cardinality).
Of course, this assumes indexes are defined for both scenarios.
The first one, because it is faster to compare numeric data.
It depends on the data and the available indexes. But it is quite possible for the VARCHAR version to be faster, because searching a single index can be faster than searching two. If the combination of values provides a unique (or "mostly" unique) result while each individual H/W value has multiple entries, then it could narrow things down to a much smaller set using the single index.
On the other hand, if you have a multi-column index on the two integer columns, that would likely be the most efficient.
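For reference, an illustrative sketch (not from the original answers) of those two indexing options:
-- Composite index on the two integer columns
CREATE INDEX idx_items_w_h ON items (w, h);
-- Single index on the string column
CREATE INDEX idx_items_dimensions ON items (dimensions);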