Time Complexity for Spark/Distributed Algorithms

If we have a given time complexity for some sequential algorithm, how can we express the time complexity of the same algorithm implemented in Spark (distributed version), assuming that we have 1 master node and 3 worker nodes in the cluster?
Similarly, how can we express O(n^2) time complexity for a Spark algorithm?
Moreover, how can we express space complexity in HDFS with a replication factor of 3?
Thanks in advance!

Let's ignore orchestration and communication time (which is often not realistic: when sorting the whole dataset, for example, the operation cannot simply be "split" across different partitions).
Let's make another convenient assumption: the data is perfectly partitioned among the 3 workers, so every node holds n/3 of the data.
That said, I think we can consider an O(n^2) algorithm as the sum of three O((n/3)^2) partial computations running in parallel (hence a final O((n/3)^2)). The same goes for any other complexity: O(n^2 log n) becomes O((n/3)^2 log(n/3)).
As for the replication factor in Hadoop, given the assumptions above, since the operations are executed in parallel among replicas (which are distinct from partitions), the time complexity is the same as an execution on a single "replica". Space-wise, a replication factor of 3 only multiplies the stored data by a constant factor, so the asymptotic space complexity stays O(n).
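For intuition, here is a minimal PySpark sketch of the partitioning argument above, assuming one partition per worker and a hypothetical pairwise-comparison step as the O(n^2) work. It is an illustration, not a complete algorithm: pairs spanning two partitions are deliberately ignored, which is exactly where the communication cost we set aside would come back in.

    # Minimal sketch: an O(n^2)-style pairwise step applied per partition, so each
    # of the 3 workers only does O((n/3)^2) work on its own slice of the data.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioned-quadratic").getOrCreate()
    sc = spark.sparkContext

    n = 3_000
    rdd = sc.parallelize(range(n), numSlices=3)  # assumption: one partition per worker

    def pairwise_count(partition):
        # Hypothetical O(k^2) work on a partition of k ~ n/3 elements:
        # count the pairs (i, j) with i < j inside this partition only.
        items = list(partition)
        count = 0
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                count += 1
        yield count

    # Partitions are processed in parallel; the driver only sums 3 small results.
    # Cross-partition pairs are ignored here; handling them would require a
    # shuffle or cartesian step, which is where communication cost returns.
    per_partition = rdd.mapPartitions(pairwise_count).collect()
    print(per_partition, sum(per_partition))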

Related

Oracle SQL or PLSQL scale with load

Suppose I have a query (it has joins on multiple tables) and assume it is tuned and optimized. This query runs on the target database/tables with N records, returns R records, and takes time T. Now the load gradually increases: say the target records become N2, the result is R2 records, and the time taken is T2. Assuming I have allocated enough memory to Oracle, the load ratio L2/L1 will be close to the time ratio T2/T1, i.e. a proportional increase in load will result in a proportional increase in execution time.
For this question let's say L2 = 5*L1, meaning the load has increased 5 times. Then the time taken by this query would also be about 5 times, or a little more, right? So, to reduce this proportional growth in time, do we have options in Oracle, like the parallel hint, etc.?
In Java we split the job across multiple threads, and with 2 times the load and 2 times the worker threads we complete in almost the same time. So with increasing load we increase the number of worker threads and handle scaling reasonably well. Is such a thing possible in Oracle, or does Oracle take care of this in the back end and scale by splitting the load internally into parallel processing? I have multi-core processors here. I will experiment with it, but expert opinion would help.
No. Query algorithms do not necessarily grow linearly.
You should probably learn something about algorithms and complexity, but many algorithms used in a database are super-linear. For instance, sorting a set of rows has a complexity of O(n log n), meaning that if you double the data size, the time taken for sorting more than doubles.
The same is true of various join algorithms.
On the other hand, if your query is looking up a few rows using a b-tree index, then the complexity is O(log n) -- this is sublinear, so index lookups grow more slowly than the size of the data.
So, in general, you cannot assume that increasing the size of the data by some factor has a proportional (linear) effect on the time.
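To put rough numbers on this, here is a small back-of-envelope Python sketch (illustrative sizes only, no real database involved) comparing how O(n), O(n log n) and O(log n) work grows when the data grows 5 times, as in the question's L2 = 5*L1 scenario.

    import math

    # Back-of-envelope growth ratios for 5x more data (the question's L2 = 5*L1 case):
    # linear work scales by exactly 5, a sort-like O(n log n) step scales by slightly
    # more than 5, and a b-tree lookup O(log n) barely moves.
    n1, n2 = 1_000_000, 5_000_000

    linear = n2 / n1
    nlogn = (n2 * math.log2(n2)) / (n1 * math.log2(n1))
    logn = math.log2(n2) / math.log2(n1)

    print(f"O(n):       x{linear:.2f}")  # x5.00
    print(f"O(n log n): x{nlogn:.2f}")   # roughly x5.6
    print(f"O(log n):   x{logn:.2f}")    # roughly x1.1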

Neo4j scalability and indexing

An argument in favor of graph dbms with native storage over relational dbms made by neo4j (also in the neo4j graph databases book) is that "index-free adjacency" is the most efficient means of processing data in a graph (due to the 'clustering' of the data/nodes in a graph-based model).
Based on some benchmarking I've performed, where 3 nodes are sequentially connected (A->B<-C) and, given the id of A, I'm querying for C, the scalability is clearly O(n) when testing the same query on databases with 1M, 5M, 10M and 20M nodes - which is reasonable (with my limited understanding), considering I am not limiting my query to 1 node only, hence all nodes need to be checked for a match. HOWEVER, when I index the queried node property, the execution time for the same query is relatively constant.
The figure shows execution time by database node count before and after indexing. The orange plot is an O(N) reference line, while the blue plot shows the observed execution times.
Based on these results I'm trying to figure out where the advantage of index-free adjacency comes in. Is this advantageous when querying with a limit of 1 for deep(er) links? E.g. depth of 4 in A->B->C->D->E, and querying for E given A. Because in this case we know that there is only one match for A (hence no need to brute force through all the other nodes not part of this sub-network).
As this is highly dependent on the query, I'm listing an example of the Cypher query below for reference (where I'm matching entity labeled node with id of 1, and returning the associated node (B in the above example) and the secondary-linked node (C in the above example)):
MATCH (:entity{id:1})-[:LINK]->(result_assoc:assoc)<-[:LINK]-(result_entity:entity) RETURN result_entity, result_assoc
UPDATE / ADDITIONAL INFORMATION
This source states: "The key message of index-free adjacency is, that the complexity to traverse the whole graph is O(n), where n is the number of nodes. In contrast, using any index will have complexity O(n log n)." This statement explains the O(n) results before indexing. I guess the O(1) performance after indexing is equivalent to hash-list performance(?). I am not sure why the complexity with any other index would be O(n log n), if even a hash list has a worst case of O(n).
From my understanding, the index-free aspect is only pertinent to adjacent nodes (that's why it's called index-free adjacency). What your plots are demonstrating is that once you have found A, the additional time to find C is negligible; the question of whether to use an index or not only concerns finding the initial queried node A.
Finding A without an index takes O(n), because every node in the database has to be scanned; with an index, it's effectively like a hash list and takes O(1) (no clue why the book says O(n log n) either).
Beyond that, finding the adjacent nodes is not that hard for Neo4j, because they are linked directly to A, whereas in a relational model the linkage is not as explicit - a join, which is expensive, and then a scan/filter are required. So to truly see the advantage, one should compare the performance of graph DBs and relational DBs while varying the depth of the relations/links. It would also be interesting to see how a query performs when the number of neighbours of the entity nodes is increased (i.e., the graph becomes denser) - does Neo4j rely on the graph never being too dense? Otherwise the problem of looking through the neighbours to find the right one repeats itself.
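One rough way to run that depth comparison is to time the same anchored query at increasing path lengths. The sketch below uses the official neo4j Python driver; the connection details and the :entity/:LINK names are assumptions carried over from the question's example query, and the chain direction follows the A->B->C->D->E example above.

    # Rough sketch: time a fixed-start traversal at increasing depths. Only the
    # lookup of the anchor node should depend on the total number of nodes n;
    # the per-hop traversal cost depends on the local neighbourhood.
    import time
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def timed_traversal(depth):
        # Fixed-length path of `depth` :LINK hops from the indexed start node.
        query = (
            "MATCH (:entity {id: 1})-[:LINK*%d]->(end) "
            "RETURN count(end) AS c" % depth
        )
        with driver.session() as session:
            start = time.perf_counter()
            record = session.run(query).single()
            return record["c"], time.perf_counter() - start

    for depth in (1, 2, 4, 8):
        count, elapsed = timed_traversal(depth)
        print(f"depth {depth}: {count} end nodes in {elapsed * 1000:.1f} ms")

    driver.close()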

How to efficiently union non-overlapping sets in redis?

I have a use case where I know for a fact that some sets I have materialized in my Redis store are disjoint. Some of my sets are quite large; as a result, their SUNION or SUNIONSTORE takes quite a lot of time. Does Redis provide any functionality for handling such unions?
Alternatively, if there is a way to add elements to a set in Redis without checking for uniqueness before each insert, it could solve my issue.
Actually, there is no need for such a feature, because of the relative cost of the operations.
When you build Redis objects (such as sets or lists), the cost is not dominated by the data structure management (hash table or linked lists), because the amortized complexity of individual insertion operations is O(1). The cost is dominated by the allocation and initialization of all the items (i.e. the set objects or the list objects). When you retrieve those objects, the cost is dominated by the allocation and formatting of the output buffer, not by the access paths in the data structure.
So bypassing the uniqueness property of the sets does not bring a significant optimization.
To optimize a SUNION command if the sets are disjoint, the best is to replace it by a pipeline of several SMEMBERS commands to retrieve the individual sets (and build the union on client side).
Optimizing a SUNIONSTORE is not really possible, since disjoint sets are a worst case for its performance. The performance is dominated by the number of resulting items, so the fewer items the sets have in common, the longer the response time.
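As a concrete illustration of the SMEMBERS suggestion, here is a minimal redis-py sketch (the key names are assumptions) that pipelines the SMEMBERS calls and concatenates the replies client-side; since the sets are known to be disjoint, no deduplication is needed.

    # Minimal sketch: fetch several known-disjoint sets in one round trip and
    # build the union on the client side, instead of calling SUNION.
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)
    keys = ["set:a", "set:b", "set:c"]  # assumption: these sets are disjoint

    pipe = r.pipeline(transaction=False)
    for key in keys:
        pipe.smembers(key)
    replies = pipe.execute()  # one round trip for all SMEMBERS calls

    # Disjoint sets: a plain concatenation is already the union, so we skip the
    # per-element uniqueness checks that a server-side SUNION would perform.
    union = [member for members in replies for member in members]
    print(len(union))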

What is the mathematical relationship between "no. of rows affected" and "execution time" of a sql query?

The query remains constant i.e it will remain the same.
e.g. a select query takes 30 minutes if it returns 10000 rows.
Would the same query take 1 hour if it has to return 20000 rows?
I am interested in knowing the mathematical relation between the number of rows (N) and the execution time (T), keeping other parameters constant (K).
i.e. T = N*K, or
T = N*K + C, or
some other formula?
I am reading http://research.microsoft.com/pubs/76556/progress.pdf in case it helps. If anybody understands it before me, please do reply. Thanks...
Well, that is a good question :), but there is no exact formula, because it depends on the execution plan.
The SQL query optimizer could choose a different execution plan for a query that returns a different number of rows.
I guess that if the execution plan is the same for both queries and you have "lab" conditions, then the time growth could be linear. You should research SQL execution plans and statistics further.
Take the very simple example of reading every row in a single table.
In the worst case, you will have to read every page of the table from your underlying storage, and the worst case for that is having to do a random seek for each page. The seek time will dominate all other factors, so you can estimate the total time.
time ~= seek time x number of data pages
Assuming your rows are of a fairly regular size, then this is linear in the number of rows.
However, databases do a number of things to try to avoid this worst case. For example, in SQL Server table storage is often allocated in extents of 8 consecutive pages, and a hard drive has a much faster streaming IO rate than random IO rate. If you have a clustered index, reading the pages in cluster order tends to produce far more streaming IO than random IO.
The best case time, ignoring memory caching, is (8KB is the SQL Server page size)
time ~= 8KB * number of data pages / streaming IO rate in KB/s
This is also linear in the number of rows.
As long as you do a reasonable job managing fragmentation, you could reasonably extrapolate linearly in this simple case. This assumes your data is much larger than the buffer cache. If not, you also have to worry about the cliff edge where your query changes from reading from buffer to reading from disk.
I'm also ignoring details like parallel storage paths and access.
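As a rough illustration of those two bounds, here is a back-of-envelope sketch. The page size is SQL Server's 8KB, while the rows-per-page, seek time, and streaming rate are illustrative assumptions, not measurements; both estimates are linear in the number of pages, and hence in the number of rows.

    # Back-of-envelope sketch of the worst case (one random seek per page) and the
    # best case (pure streaming read) described above.
    PAGE_KB = 8                    # SQL Server page size
    ROWS = 20_000_000
    ROWS_PER_PAGE = 100            # assumption: regular-sized rows
    pages = ROWS / ROWS_PER_PAGE

    SEEK_TIME_S = 0.005            # assumption: ~5 ms per random seek
    STREAMING_KB_PER_S = 150_000   # assumption: ~150 MB/s sequential throughput

    worst_case_s = SEEK_TIME_S * pages                     # seek time x number of pages
    best_case_s = (PAGE_KB * pages) / STREAMING_KB_PER_S   # 8KB x pages / streaming rate

    print(f"{pages:.0f} pages: worst ~{worst_case_s:.0f} s, best ~{best_case_s:.1f} s")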

Complexity of adding n entries to a database

What is the complexity, in big O notation, of adding n entries to a database that has m entries and i indexes in MySQL, and afterwards committing?
Inserting into a MyISAM table without indexes takes O(n) (linear) time.
Inserting into an InnoDB table, and into any index, takes O(n log m) time (assuming m >> n), since InnoDB tables and indexes are B-trees: each of the n inserted records costs O(log m) against the m already existing records.
Overall time is the sum of these values.
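For a rough feel of how those terms combine, here is a toy Python comparison (unitless operation counts, not timings, and the sizes are made up) of O(n) appends versus O(n log m) B-tree inserts multiplied across the indexes; treating the clustered index plus i secondary indexes as separate B-trees is an assumption of this sketch.

    import math

    # Toy comparison of the asymptotic costs above: n new rows into a table that
    # already holds m rows, with i secondary indexes (unitless operation counts).
    n, m, i = 100_000, 10_000_000, 3

    myisam_no_index = n                               # O(n): append-only inserts
    innodb_with_indexes = n * (1 + i) * math.log2(m)  # O(n log m) per B-tree:
                                                      # assumed clustered index
                                                      # plus i secondary indexes
    print(myisam_no_index, round(innodb_with_indexes))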
That would depend on the number of indexes you have in your tables, among other factors.
Each individual operation in a database has a different complexity. For example, the time complexity for B-Tree search operations is O(log n), and the time for an actual search depends on whether a table scan takes place, which is O(n).
I would imagine that you could build quite a complex equation for what you are describing. You would have to account for each operation individually, and I'm not sure that can be done in a deterministic way, given database systems' propensity for deciding ad hoc how they will execute things using query plans, etc.