How is the estimated key count calculated in Scylla? How does compaction strategy or RF affect it?

I can get the estimated number of rows using the API family/metrics/estimated_row_count.
I wonder how accurate this number is and how it could be affected by compaction strategy and RF.
Say I actually have N rows; I guess that with RF = 3 it will report roughly 3*N rows?

The number should be the unreplicated one, so it's not a function of the RF.
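
For reference, here is a minimal sketch of reading that estimate over a node's REST API. The exact path, the host, the default REST port 10000 and the keyspace:table identifier are all assumptions based on the metric name mentioned in the question, so check your Scylla version's REST API listing and adjust them to your deployment.

import requests

# Assumed host/port and an assumed "keyspace:table" identifier; the path is a
# guess built around the estimated_row_count metric mentioned in the question.
url = "http://localhost:10000/column_family/metrics/estimated_row_count/myks:mytable"

resp = requests.get(url)
resp.raise_for_status()
print(resp.json())  # per-node, unreplicated estimate for that table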

Detect sharp drops in time series data (Pandas)

I have a time series DataFrame which is indexed by "Cycles" and I'd like to know if there's a way to detect sharp, prolonged drops.
The data is smoothed using a rolling mean over 10k cycles. Also, the index is not "symmetric", meaning the cycle count doesn't increase by the same amount from one row to the next.
Index(Cycle) RawData DecimatedData
4 9.032 9.363
6 9.721 9.359
10 9.831 9.363
11 8.974 9.352
15 9.143 9.354
Even after decimating there are many "small" drops along the way which I'd like to ignore.
I've tried using df['DecimatedData'].diff() with different windows but it hasn't yielded good results.
I was thinking of keeping a counter that increases while .diff() keeps returning negative values, and flagging a drop once it passes a certain limit. But I think there might be a more elegant way of doing this.
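
One way to make the counter idea concrete is to label runs of consecutive negative differences and flag a run once it is both long and deep enough. A rough pandas sketch, where MIN_RUN and MIN_TOTAL_DROP are made-up thresholds you would tune for your data:

import pandas as pd

# Made-up thresholds -- tune for your data.
MIN_RUN = 5           # consecutive falling steps before a drop counts as "prolonged"
MIN_TOTAL_DROP = 0.5  # cumulative decline over the run that counts as "sharp"

def flag_prolonged_drops(df, col="DecimatedData",
                         min_run=MIN_RUN, min_total_drop=MIN_TOTAL_DROP):
    """Mark rows that are inside a sharp, prolonged drop of df[col]."""
    diffs = df[col].diff()
    falling = diffs < 0

    # Give every run of consecutive falling/non-falling rows its own label.
    run_id = (falling != falling.shift()).cumsum()

    # Length of the current falling run and the cumulative drop within it.
    run_len = falling.astype(int).groupby(run_id).cumsum()
    run_drop = (-diffs.clip(upper=0)).groupby(run_id).cumsum()

    return falling & (run_len >= min_run) & (run_drop >= min_total_drop)

# Usage: df["sharp_drop"] = flag_prolonged_drops(df)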

Hash tables Time Complexity Confusion

I just started learning about hash dictionaries. Currently we are implementing a hash dictionary with separate chaining, where each bucket is a linked list. The book posed this problem and I am having a lot of trouble figuring it out. Imagine we have an initial table size of 10, i.e. 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out? (Assume a pointer access is one unit of time.)
It poses three scenarios:
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
My initial thoughts had me really confused. I couldn't quite figure out how to know the length of a given chain for an insertion. Assuming a chain of length k (I thought), the loop walking the whole chain costs k pointer accesses. In each iteration the insert checks whether the current node's key equals the key being inserted (if it exists, overwrite it), so either 2k units of time if not found, or 2k+1 if found. Then it does 5 pointer accesses to prepend the element. So a single insert is 2k+5 or 2k+1, and the first scenario is O(kn) for n insertions. A lookup seems to be 2k+1 or 2k, so O(k) for one lookup. I don't have a clue how to approach the other two scenarios. Some help would be great. Once again, to clarify: k isn't mentioned in the problem. The only facts given are an initial size of 10 and the information in the scenarios, so k can't appear in the answers for the time complexity of n insertions or 1 lookup.
If you have a hash dictionary, then insert, delete and search each take O(n) time for one key in the worst case (all keys land in the same bucket, as sketched below). For n insertions that is O(n^2), no matter what the size of your table is.
|--------|
|element1| -> element2 -> element3 -> element4 -> element5
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
| null |
|--------|
Now for the average case:
Scenario one has a fixed table size (say m slots), so the load factor is n/m. Therefore one insert is O(1 + n/m): 1 for the hash computation and n/m for walking the chain.
For the 2nd and 3rd scenarios it should be O(1 + n/m + 1) and O(1 + n/2m) respectively.
As for your confusion, ask yourself what the expected chain length is for a random set of keys. The answer is that we can't be sure at all.
That's where the idea of load factor comes into play to define the average case: we give each slot equal probability of forming a chain when the number of keys exceeds the slot count.
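
To make the worst case vs. average case contrast concrete, here is a tiny sketch; the "bad hash" that sends every key to bucket 0 is deliberately pathological, while Python's built-in hash plays the role of a near-uniform hash.

from collections import defaultdict

def chain_lengths(keys, m, hash_fn):
    """Distribute keys into m buckets using hash_fn and report each chain's length."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[hash_fn(k) % m].append(k)
    return [len(buckets[i]) for i in range(m)]

keys = range(1000)
print(chain_lengths(keys, 10, lambda k: 0))  # worst case: [1000, 0, 0, ...] -> O(n) per operation
print(chain_lengths(keys, 10, hash))         # near-uniform: ~n/m per chain  -> O(1 + n/m) per operation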
Imagine we have an initial table size of 10 ie 10 buckets. If we want to know the time complexity for n insertions and a single lookup, how do we figure this out?
When we talk about time complexity, we're looking at the steepness of the n-vs-time-for-operation curve as n approaches infinity. In the case above, you're saying there are only ten buckets, so, assuming the hash function scatters the insertions across the buckets with near-uniform distribution (as it should), n insertions will result in 10 lists of roughly n/10 elements each.
During each insertion, you can hash to the correct bucket in O(1) time. Now - a crucial factor here is whether you want your hash table implementation to protect you against duplicate insertions.
If you simply trust there will be no duplicates, or the hash table is allowed to have duplicates (e.g. C++'s unordered_multiset), then the insertion itself can be done without inspecting the existing bucket content, at an accessible end of the bucket's list (i.e. using a head or tail pointer), also in O(1) time. That means the overall time per insertion is O(1), and the total time for n insertions is O(n).
If the implementation must identify and avoid duplicates, then for each insertion it has to search along the existing linked list, whose length is related to n by a constant #buckets factor (1/10) and grows linearly during insertion from 1 to 1/10 of the final number of elements, so on average is n/2/10, which, removing constant factors, simplifies to n. In other words, each insertion is O(n).
Presumably the question intends to ask the time for a single lookup done after all elements are inserted: in that case you have the 10 linked lists of ~n/10 length, so the lookup will hash to one of those lists and then on average have to look half way along the list before finding the desired value: that's roughly n/20 elements searched, but as /20 is a constant factor it can be dropped, and we can say the average complexity is O(n).
A hash dictionary that does not resize, what is the time complexity for n insertions and 1 lookup?
Well, we discussed that above with our hash table size stuck at 10.
A hash dictionary that resizes by 1 when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
Say the table has 100 buckets and 80 elements, and you insert an 81st element: it resizes to 101, and the load factor is then about .802 - should it immediately resize again, or wait until the next insertion? Anyway, ignoring that: each resize operation involves visiting, rehashing (unless the elements or nodes cache their hash values), and "rewiring" the linked lists for all existing elements. That's O(s), where s is the size of the table at that point in time, and you're doing it once or twice (depending on your answer to the "immediately resize again" behaviour above) for s values from 1 to n, so s averages n/2, which simplifies to n. The insertion itself may or may not involve another walk of the bucket's linked list (you could optimise to search while resizing). Regardless, the overall time complexity is O(n^2).
The lookup then takes O(1), because the resizing has kept the load factor below a constant amount (i.e. the average linked list length is very, very short, even ignoring the empty buckets).
A hash dictionary that resizes by doubling the table size when the load factor exceeds .8, what is the time complexity for n insertions and 1 lookup?
If you consider the resultant hash table with n elements inserted, about half the elements will have been inserted without needing to be rehashed, about a quarter will have been rehashed once, an eighth rehashed twice, a sixteenth rehashed 3 times, a 32nd rehashed 4 times: if you sum up that series - 1/4 + 2/8 + 3/16 + 4/32 + 5/64 + 6/128... - it approaches 1 as n goes to infinity. In other words, the average amount of repeated rehashing/relinking work done per element in the final table doesn't increase with n - it's constant. So, the total time to insert is simply O(n). Then, because the load factor is kept below 0.8 - a constant rather than a function of n - the lookup time is O(1).
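
Here is a small sketch of the doubling scenario, assuming the 0.8 load factor from the question; the rehash_ops counter shows the total rehashing work staying proportional to n (a small constant per element) rather than growing like n^2.

class ChainedDict:
    """Separate chaining with the 'double when load factor exceeds 0.8' policy."""

    def __init__(self, size=10, max_load=0.8):
        self.buckets = [[] for _ in range(size)]
        self.count = 0
        self.max_load = max_load
        self.rehash_ops = 0  # elements moved across all resizes

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._resize(2 * len(self.buckets))

    def _resize(self, new_size):
        new_buckets = [[] for _ in range(new_size)]
        for bucket in self.buckets:
            for key, value in bucket:    # every existing element is rehashed once
                new_buckets[hash(key) % new_size].append((key, value))
                self.rehash_ops += 1
        self.buckets = new_buckets

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        raise KeyError(key)

d = ChainedDict()
n = 100_000
for i in range(n):
    d.put(i, i)
print(d.rehash_ops / n)  # stays a small constant as n grows, so n inserts cost O(n) in total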

BigQuery. Long execution time on small datasets

I created a new Google Cloud project and set up a BigQuery database. I tried different queries and they all take too long to execute. Currently we don't have a lot of data, so I expected high performance.
Below are some examples of queries and their execution time.
Query #1 (Job Id bquxjob_11022e81_172cd2d59ba):
select date(installtime) regtime
,count(distinct userclientid) users
,sum(fm.advcost) advspent
from DWH.DimUser du
join DWH.FactMarketingSpent fm on fm.date = date(du.installtime)
group by 1
The query failed after 1 hour+ with the error "Query exceeded resource limits. 14521.457814668494 CPU seconds were used, and this query must use less than 12800.0 CPU seconds."
Query execution plan: https://prnt.sc/t30bkz
Query #2 (Job Id bquxjob_41f963ae_172cd41083f):
select fd.date
,sum(fd.revenue) adrevenue
,sum(fm.advcost) advspent
from DWH.FactAdRevenue fd
join DWH.FactMarketingSpent fm on fm.date = fd.date
group by 1
Execution took 59.3 sec, 7.7 MB processed, which is too slow.
Query Execution plan: https://prnt.sc/t309t4
Query #3 (Job Id bquxjob_3b19482d_172cd31f629)
select date(installtime) regtime
,count(distinct userclientid) users
from DWH.DimUser du
group by 1
Execution time was 5.0 sec elapsed, 42.3 MB processed, which is not terrible but should be faster for such small volumes of data.
Tables used :
DimUser - Table size 870.71 MB, Number of rows 2,771,379
FactAdRevenue - Table size 6.98 MB, Number of rows 53,816
FactMarketingSpent - Table size 68.57 MB, Number of rows 453,600
The question is: what am I doing wrong that makes the query execution time so long? If everything is OK, I would be glad to hear any advice on how to reduce execution time for such simple queries. If anyone from Google reads my question, I would appreciate it if the job IDs were checked.
Thank you!
P.s. Previously I had experience using BigQuery for other projects and the performance and execution time were incredibly good for tables of 50+ TB size.
Posting the same reply I've given in the GCP Slack workspace:
Your first two queries both look like you have one particular worker that is overloaded. You can see this because, in the compute section, the max time is very different from the avg time. This could be for a number of reasons, but I can see that you are joining a table of 700k+ rows (looking at the 2nd input) to a table of ~50k (looking at the first input). This is not good practice; you should switch it so the larger table is the left-most table. See https://cloud.google.com/bigquery/docs/best-practices-performance-compute?hl=en_US#optimize_your_join_patterns
You may also have a heavy skew in your join keys (e.g. 90% of rows are on 1/1/2020, or NULL). Check this.
For the third query, that time is expected; try an approximate count instead to speed it up. Also note that BQ starts to get better if you perform the same query over and over, so this will get quicker.
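
As a sketch of the "approximate count" suggestion for the third query, here it is rewritten with APPROX_COUNT_DISTINCT and run through the BigQuery Python client. The dataset, table and column names are copied from the question, and the client assumes credentials and a default project are already configured.

from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials and a default project are set up

sql = """
SELECT DATE(installtime) AS regtime,
       APPROX_COUNT_DISTINCT(userclientid) AS users
FROM DWH.DimUser
GROUP BY 1
"""

for row in client.query(sql).result():
    print(row.regtime, row.users)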

Algorithm to decide if true based on 0% - 100% frequency threshold

Sorry if this is a duplicate question. I did a search but wasn't sure exactly what to search for.
I'm writing an app that performs a scan. When the scan is complete we need to decide if an item was found or not. Whether or not the item is found is decided by a threshold that the user can set: 0% of the time, 25% of the time, 50% of the time, 75% of the time or 100% of the time.
Obviously if the user chooses 0% or 100% we can just use true/false, but I'm drawing a blank on how this should work for the other thresholds.
I assume I'd need to store and increase some value every time a monster is found.
Thanks for any help in advance!
As nix points out, it sounds like you want to generate a random number and threshold it based on the percentage of the time you wish to have 'found' something.
You need to be careful that the range you select and the way you threshold achieve the desired result, as does the distribution of the random number generator you use. When dealing in percentages, an obvious approach is to generate 1 of 100 uniformly distributed options, e.g. 0-99, and check that the number is less than your percentage.
A quick check shows that you will never get a number less than 0, so 0% achieves the expected result; you will always get a number less than 100, so 100% achieves the expected result; and there are 50 options (0-49) less than 50 out of the 100 options (0-99), so 50% achieves the expected result as well.
A subtly different approach, given that the user can only choose ranges in 25% increments, would be to generate numbers in the range 0-3 and return True if the number is less than the percentage / 25. If you were to store the user-selection as a number from 0-4 (0: 0%, 1: 25% .. 4: 100%) this might be even simpler.
Various approaches to pseudo-random number generation in Objective-C are discussed here: Generating random numbers in Objective-C.
Note that mention is made of the uniformity of the random numbers potentially being sensitive to the range depending on the method you go with.
To be confident you can always do some straight-forward testing by calling your function a large number of times, keeping track of the number of times it returns true and comparing this to the desired percentage.
Generate a random number between 0 and 99. If the number is less than the threshold, an item is found. Otherwise, no item is found.
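
A minimal sketch of that 0-99 comparison, written in Python for brevity; the same comparison carries over directly to whatever uniform random source you use in Objective-C.

import random

def item_found(threshold_percent):
    """True roughly threshold_percent% of the time (threshold in 0..100)."""
    return random.randrange(100) < threshold_percent

# Straightforward sanity check, as suggested above: the observed rate should
# approach the configured percentage over many trials.
trials = 100_000
hits = sum(item_found(25) for _ in range(trials))
print(hits / trials)  # ~0.25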

Trending 100 million+ rows

I have a system which records some measured values every second. What is the best way to store trend data which are values corresponding to a specific second?
1 day = 86,400 seconds
1 month = 2,592,000 seconds
Around 1000 values to keep track of every second.
Currently there are 50 tables grouping the trend data for 20 columns each. These tables contain more than 100 million rows.
TREND_TIME datetime (clustered_index)
TREND_DATA1 real
TREND_DATA2 real
...
TREND_DATA20 real
Have you considered RRDTool? It provides a round-robin database, or circular buffer, for time series data. You can store data at whatever interval you like, then define consolidation points and a consolidation function (for example sum, min, max, avg) for a given period: 1 second, 5 seconds, 2 days, etc. Because it knows what consolidation points you want, it doesn't need to store all the data points once they've been aggregated.
Ganglia and Cacti use this under the covers and it's quite easy to use from many languages.
If you do need all the datapoints, consider using it just for the aggregation.
I would change the data-saving approach: instead of saving 'raw' data as individual values, I would buffer 5-20 minutes of data in an array (in memory, on the BL side), compress that array using an LZ-based algorithm, and then store it in the database as binary data. Also, it would be nice to save max/min/avg/etc. info for that binary chunk.
When you want to process the data you can work chunk after chunk, which keeps a low memory profile for your application. This approach is a little more complex but very scalable in terms of memory/processing.
Hope this helps.
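
A rough sketch of that chunking idea, assuming samples arrive as (epoch_seconds, value) pairs; pack_chunk and the summary fields are illustrative names, and the blob would go into a binary column alongside the summary columns.

import json
import zlib

def pack_chunk(samples):
    """samples: list of (epoch_seconds, value) pairs covering 5-20 minutes."""
    values = [v for _, v in samples]
    summary = {
        "start": samples[0][0],
        "end": samples[-1][0],
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }
    blob = zlib.compress(json.dumps(samples).encode("utf-8"))
    return summary, blob   # summary -> regular columns, blob -> binary column

def unpack_chunk(blob):
    """Decompress a stored chunk back into a list of [epoch_seconds, value] pairs."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))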
Is the problem the database schema?
One second mapping to many trends most obviously suggests a separate table with a foreign key to a seconds table. Alternatively, if the "many trend values" are represented by columns rather than rows, you can always append the columns to the seconds table and incur null values.
Have you tried that? Was performance poor?