Extremely slow query using CONNECT BY Oracle 10 - sql

I have a table challenge containing about 12000 rows. Every point connects to the four points around it, for example 100 connects to 99 101 11 and 189. I tried this with a small scale table and it worked just fine but as I increased the size of the table the query became exponentially slower and now it won't even finish. Here's my query
SELECT level, origin, destination
FROM challenge
WHERE destination = 2500
START WITH origin = 1
CONNECT BY NOCYCLE PRIOR destination = origin;
Any advice on how to optimize this query would be greatly appreciated.

So you're finding every path from node 1 to node 2500 in a degree-4 graph (rectangular lattice?) of thousands of nodes. I expect there'll be quite a lot of them. Did the challenge just ask you to count them? Because I think the point was that you have to figure out how many there are by doing math, not brute force computation.
For example, if it's a 50x50 rectangular grid with node 1 and node 2500 in opposite corners, then the minimum path length is 100 steps. A path of 100 steps will have 50 of them horizontal and 50 of them vertical, and they can come in any order. Figure out how many ways you can arrange a string of 50 H's and 50 V's and you might find it's a number that even the mighty Oracle will have a bit of a problem with. (Generating the rows, that is. Doing the calculation just requires large integer arithmetic, which Oracle can probably do quite quickly once you tell it the formula.)
And your query is actually worse than that. It doesn't ask only for minimum-length paths. So it will also return all the paths of length 102 that take a step away from the destination somewhere along the way. And paths of length 104 that take 2 backward steps. And paths of length 2498 that visit almost all of the nodes! Counting those paths is more complicated than counting the short paths because you have to exclude the ones that cross themselves.

Related

Optimizing Geometry.STIntersect Query

I am attempting to move simple Geoprocessing routines from ESRI based processes to SQL Server. My assumption is that it will be far more efficient. For my initial test I am working on an intersection routine to associate overlapping linear data.
In my WCASING table I have 1610 records. I am trying to associate these Casings with their associated mains. I have ~277,000 Mains.
I am running the query below to get a general sense of how long it will take to find individual matches. This query returned 5 valid intersections in 40 seconds.
SELECT Top 5 [WCASING].[OBJECTID] As CasingOBJECTID,
[WPUMPPRESSUREMAIN].[OBJECTID] AS MainObjectID, [WCASING].[Shape]
FROM [dbo].[WPUMPPRESSUREMAIN]
JOIN [WCASING]
ON [WCASING].[Shape].STIntersects([WPUMPPRESSUREMAIN].[Shape]) = 1
My Primary questions;
Will this process faster depending on the search order
Finding 'A' inside of 'B' vs
Finding 'B' inside of 'A'
Initial return on 5 records from these datasets is that it does not matter
Will this process faster, if I first buffer to limit to a smaller main set and then search
Can I use SQL Server Tuning to work with Geometry based queries
I will be working on these processes over the next few weeks. In the meantime I would greatly appreciate insight and pointers to white papers associated with these tuning options. Thus far I have not found a great resource.
Thank You,
Rick

Storing trillions of document similarities

I wrote a program to compute similarities among a set of 2 million documents. The program works, but I'm having trouble storing the results. I won't need to access the results often, but will occasionally need to query them and pull out subsets for analysis. The output basically looks like this:
1,2,0.35
1,3,0.42
1,4,0.99
1,5,0.04
1,6,0.45
1,7,0.38
1,8,0.22
1,9,0.76
.
.
.
Columns 1 and 2 are document ids, and column 3 is the similarity score. Since the similarity scores are symmetric I don't need to compute them all, but that still leaves me with 2000000*(2000000-1)/2 ≈ 2,000,000,000,000 lines of records.
A text file with 1 million lines of records is already 9MB. Extrapolating, that means I'd need 17 TB to store the results like this (in flat text files).
Are there more efficient ways to store these sorts of data? I could have one row for each document and get rid of the repeated document ids in the first column. But that'd only go so far. What about file formats, or special database systems? This must be a common problem in "big data"; I've seen papers/blogs reporting similar analyses, but none discuss practical dimensions like storage.
DISCLAIMER: I don't have any practical experience with this, but it's a fun exercise and after some thinking this is what I came up with:
Since you have 2.000.000 documents you're kind of stuck with an integer for the document id's; that makes 4 bytes + 4 bytes; the comparison seems to be between 0.00 and 1.00, I guess a byte would do by encoding the 0.00-1.00 as 0..100.
So your table would be : id1, id2, relationship_value
That brings it to exactly 9 bytes per record. Thus (without any overhead) ((2 * 10^6)^2)*9/2bytes are needed, that's about 17Tb.
Off course that's if you have just a basic table. Since you don't plan on querying it very often I guess performance isn't that much of an issue. So you could go 'creative' by storing the values 'horizontally'.
Simplifying things, you would store the values in a 2 million by 2 million square and each 'intersection' would be a byte representing the relationship between their coordinates. This would "only" require about 3.6Tb, but it would be a pain to maintain, and it also doesn't make use of the fact that the relations are symmetrical.
So I'd suggest to use a hybrid approach, a table with 2 columns. First column would hold the 'left' document-id (4 bytes), 2nd column would hold a string of all values of documents starting with an id above the id in the first column using a varbinary. Since a varbinary only takes the space that it needs, this helps us win back some space offered by the symmetry of the relationship.
In other words,
record 1 would have a string of (2.000.000-1) bytes as value for the 2nd column
record 2 would have a string of (2.000.000-2) bytes as value for the 2nd column
record 3 would have a string of (2.000.000-3) bytes as value for the 2nd column
etc
That way you should be able to get away with something like 2Tb (inc overhead) to store the information. Add compression to it and I'm pretty sure you can store it on a modern disk.
Off course the system is far from optimal. In fact, querying the information will require some patience as you can't approach things set-based and you'll pretty much have to scan things byte by byte. A nice 'benefit' of this approach would be that you can easily add new documents by adding a new byte to the string of EACH record + 1 extra record in the end. Operations like that will be costly though as it will result in page-splits; but at least it will be possible without having to completely rewrite the table. But it will cause quite bit of fragmentation over time and you might want to rebuild the table once in a while to make it more 'aligned' again. Ah.. technicalities.
Selecting and Updating will require some creative use of SubString() operations, but nothing too complex..
PS: Strictly speaking, for 0..100 you only need 7 bytes, so if you really want to squeeze the last bit out of it you could actually store 8 values in 7 bytes and save another ca 300Mb, but it would make things quite a bit more complex... then again, it's not like the data is going to be human-readable anyway =)
PS: this line of thinking is completely geared towards reducing the amount of space needed while remaining practical in terms of updating the data. I'm not saying it's going to be fast; in fact, if you'd go searching for all documents that have a relation-value of 0.89 or above the system will have to scan the entire table and even with modern disks that IS going to take a while.
Mind you that all of this is the result of half an hour brainstorming; I'm actually hoping that someone might chime in with a neater approach =)

Oracle APEX Tree Hierarchy Limitations/Issues

I am using Oracle APEX 4.2.2 and have constructed a Tree region based off a view.
Now when I take this query (see below) and run this query say in Oracle SQL Developer - all is fine but when I place this same query within the page in Oracle APEX based off a Tree region - all saves correctly but when I run this query, no records/tree is displayed at all.
Now the underlying view can change in record size but for the example I am talking about here, I have just over 6000 records that I need to build a Oracle Tree hierarchy from.
One thing I have noticed is that if I reduce the record size to say 500 rows, the tree displays perfectly.
Questions:
1) Now is there a limitation that I am not aware of as I really need to get this going based on whether there are 500 records or 6000 records?
2) Is 6000 rows too many for a tree hierarchy representation?
3) Could it possibly be because that Oracle APEX 4.2.2 is now using js for building trees and there causing issues due to the quantity of data?
4) Is there a means of reducing the depth of the tree records so that I can still at least display something to the user?
My query is something like:
SELECT case when connect_by_isleaf = 1 then 0
when level = 1 then 1
else -1
end as status,
level,
c as title,
null as icon,
c as value,
null as tooltip,
null as link
FROM t
start with p IS NULL
CONNECT BY NOCYCLE PRIOR c = p;
Also I've noticed that if I try and run the query in SQL Workshop, it doesn't work there either unless I reduce the record size down to say 500 records.
I asked about using IE because the 'too large tree' issue especially plays up in IE. I've seen this issue pass by and asked about a couple of times already. The conclusion was simply that there isn't much to be done about it and generally the browser(s) don't cope too well with a tree with such a large dataset. Usually the issue isn't there or is minimal in ff or chrome though, and ie is mostly not playing ball, and my guess is that this has to do with memory and dom manipulation.
1) Now is there a limitation that I am not aware of as I really need
to get this going based on whether there are 500 records or 6000
records?
No limitation.
2) Is 6000 rows too many for a tree hierarchy representation?
Probably, yes.
3) Could it possibly be because that Oracle APEX 4.2.2 is now using js
for building trees and there causing issues due to the quantity of
data?
Trees are being built with jstree since 4.0 (don't know about 3.2). Apex puts out a global variable in the tree region which holds all the data. The initialization of the widget will then create the complete ul-li list structure. Part of the issue might be that there are so many nodes to begin with, and then how this is ran through jstree, and the huge amount of dom manipulation occuring. I'm not sure if this would go better with the newer release of jstree (apex version is 0.9.9 while 1.x has been released for a while now).
4) Is there a means of reducing the depth of the tree records so that
I can still atleast display something to the user?
If you want to limit the depth you can limit the query by using level in the where clause. eg
WHERE level <= 3
Alternative options will probably be non-apex solutions. Dynamic trees, ajax for the tree nodes, another plugin,... I haven't really explored those as I haven't had to deal with such a big tree yet.
I experienced, that the number of displayable tree nodes depends also on the text lenghts in you tree (e.g. nodes and tooltips). The shorter the texts, the more nodes your tree can display. However, it makes a difference of maybe 50 nodes, so it won't solve your problem, as it didn't solve mine.
My mediocre educated guess is, that this ul-li is limited in size.
I built in a drop-down prefilter, so the user has to narrow down what she/he wants to have displayed.

SDK2 query for counting: which is more efficient?

I have an app that is displaying metrics about defects in a project.
I have the option of making one query that returns all the defects, and from that I can break out about four different metrics (How many defects escaped QA in 90 days, 180 days, and then the same metrics again but only counting sev1/sev2 defects).
I could make four queries and limit the results to one so that I just get a count for each. Or I could make one query that encompass them all (all defects that escaped QA in 180 days) and then count up the difference.
I'm figuring worst case, the number of defects that escaped QA in the last six months will generally be less than 100, certainly less 500 worst case.
Which would you do-- four queryies with one result each, or one single query that on average might return 50, perhaps worst case 500?
And I guess the key question is-- where are the inflections points? Perhaps I have more metrics tomorrow (who knows, 8?) and a different average defect counts. Is there a rule of thumb I could use to help choose which approach?
Well I would probably make the series of four queries and use the result count. If you are expecting 500 defects that will end up being three queries each with 200 defects anyways.
The solution where you do each individual query and use the total result count would be safe with even a very large amount of defects. Plus I usually find it to be a bad plan to think that I know the data sets that an App will be dealing with. Most of my Apps end up living much longer and being used on larger datasets than I intended.
The max page size is 200, so it sounds like you'd be requesting between 1 and 3 pages to get all the data vs. 4 queries with a page size of 1 and using the TotalResultCount...
You'd definitely have less aggregation code to write if you use the multi query approach (letting the server do the counting for you based on your supplied filters).
I'd guess the 4 independent queries might be faster but it would be interesting to hear back your experimental results...

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has k balls is <= 1/ek! (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or 2 and is this max-load formula right and how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10 (N) = log_b (N)/log_b (10)= (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.
After some more research and trial-and-error I think I can provide something part way to to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = ((e/k)**k)/e
then by accumulating the probability of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
So if my hash function is working well on my set of data then I should expect the maxmium number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values (there are ways around this. For example you could use multiple hash functions, but slows down hashing and I don't know much about this).
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions and means that adding and searching the hash table is a little slower.
I hope this helps.