Speed-up search in pandas python - dataframe

I am trying to perform an analysis on some data, but it needs to run much faster.
These are the steps that I follow; please recommend any solutions that you think might speed up the processing time.
ts is a datetime object and the "time" column in Data is in epoch time. Note that Data might include up to 500000 records.
Data = pd.DataFrame(RawData) # (RawData is a list of lists)
Data.loc[:, 'time'] = pd.to_datetime(Data.loc[:, 'time'], unit='s')
I find the index of the first row in Data which has a time object greater than my ts as follows:
StartIndex = Data.loc[:, 'time'].searchsorted(ts)
StartIndex is usually very low and is found within a few records from the beginning; however, I have no idea whether the size of Data affects finding this index.
Now we get to the hard part: within Data there is a column called "PDNumber", and I have two other variables called Max_1 and Min_1. I have to find the index of the first row in which the "PDNumber" value goes above Max_1 or drops below Min_1, searching from StartIndex through the end of the dataframe; whichever happens first, the search stops there and the found index is called SecondStartIndex. We then have another two variables called Max_2 and Min_2, and we have to search the "PDNumber" column again, from SecondStartIndex onward, for the index of the first row that goes above Max_2 or drops below Min_2; this index is called ThirdIndex.
Right now, I use a for loop that walks through Data, incrementing the index by 1 at each step, to see whether I have reached SecondStartIndex, and once it is reached I use a while loop (with a counter) over the rest of the dataframe to find ThirdIndex.
Any suggestions on speeding up the process time?
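For what it's worth, here is a rough vectorized sketch of the two-stage search described above, using boolean masks instead of explicit loops. It assumes Data, ts, Max_1/Min_1 and Max_2/Min_2 as defined in the question, works with positional (0-based) indices, and treats each search start as inclusive; shift by one if that is not what you want.
import numpy as np

start_index = Data['time'].searchsorted(ts)
pd_values = Data['PDNumber'].to_numpy()

# First row at or after start_index where PDNumber leaves the Max_1/Min_1 band
mask1 = (pd_values[start_index:] > Max_1) | (pd_values[start_index:] < Min_1)
second_start_index = start_index + np.argmax(mask1) if mask1.any() else None

# From there on, first row where PDNumber leaves the Max_2/Min_2 band
if second_start_index is not None:
    mask2 = (pd_values[second_start_index:] > Max_2) | (pd_values[second_start_index:] < Min_2)
    third_index = second_start_index + np.argmax(mask2) if mask2.any() else None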

Related

Multi-process Pandas Groupby Failing to do Anything After Processes Kick Off

I have a fairly large pandas dataframe that is about 600,000 rows by 50 columns. I would like to perform a groupby.agg(custom_function) to get the resulting data. The custom_function takes the first non-null value in the series, or returns null if all values in the series are null. (My dataframe is hierarchically sorted by data quality: the first occurrence of a unique key has the most accurate data, but if the first occurrence has null data I want to take the values from the second occurrence, and so on.)
I have found the basic groupby.agg(custom_function) syntax to be slow, so I have implemented multiprocessing to speed up the computation. When this code is applied to a dataframe about 10,000 rows long, the computation takes a few seconds; however, when I try to use the entirety of the data, the process seems to stall out. Multiple processes kick off, but memory and CPU usage stay about the same and nothing gets done.
Here is the trouble portion of the code:
# Create list of individual dataframes to feed map/multiprocess function
grouped = combined.groupby(['ID'])
grouped_list = [group for name, group in grouped]
length = len(grouped)
# Multi-process execute single pivot function
print('\nMulti-Process Pivot:')
with concurrent.futures.ProcessPoolExecutor() as executor:
    with tqdm.tqdm(total=length) as progress:
        futures = []
        for df in grouped_list:
            future = executor.submit(custom_function, df)
            future.add_done_callback(lambda p: progress.update())
            futures.append(future)
        results = []
        for future in futures:
            result = future.result()
            results.append(result)
I think the issue has something to do with the multi-processing (maybe queuing up a job this large is the issue?). I don't understand why a fairly small job creates no issues for this code, but increasing the size of the input data seems to hang it up rather than just execute more slowly. If there is a more efficient way to take the first value in each column per unique ID, I'd be interested to hear it.
Thanks for your help.
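As an aside on the last point: if the goal is simply the first non-null value in each column per unique ID, pandas' built-in groupby(...).first() may already do this without a custom aggregation. A minimal sketch with made-up data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'a':  [np.nan, 10.0, 3.0, 4.0],
    'b':  ['x', 'y', None, 'z'],
})

# first() returns the first non-null entry of each column within each group
result = df.groupby('ID').first()
# ID 1 -> a=10.0 (NaN skipped), b='x'
# ID 2 -> a=3.0,                b='z' (None skipped)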

Limiting the number of rows returned by `.where(...)` in pytables

I am dealing with tables having up to a few billion rows, and I do a lot of "where(numexpr_condition)" lookups using pytables.
We managed to optimise the HDF5 format so that a simple where-query over 600 million rows finishes in under 20s (we are still trying to work out how to make this faster, but that's another story).
However, since that is still too slow for playing around, I need a way to limit the number of results in a query like this simple example (the foo column is, of course, indexed):
[row['bar'] for row in table.where('(foo == 234)')]
So this would return, let's say, 100 million entries and take 18s, which is way too slow for prototyping and playing around.
How would you limit the result to, let's say, 10,000 rows?
The database-like equivalent query would be roughly:
SELECT bar FROM row WHERE foo==234 LIMIT 10000
Using the stop= argument is not the way, since it simply takes the first n rows and applies the condition to them. So in the worst case, if the condition is not fulfilled, I get an empty array:
[row['bar'] for row in table.where('(foo == 234)', stop=10000)]
Slicing the list comprehension is also not the right way, since it first builds the whole list and then applies the slice, which of course is no speed gain at all:
[row['bar'] for row in table.where('(foo == 234)')][:10000]
However, the iterator produces its rows on demand as the list comprehension exhausts it, so there is surely a way to hack this together. I just could not find a suitable way of doing it.
Btw. I also tried using zip and range to force a StopIteration:
[row['bar'] for _, row in zip(range(10000), table.where('(foo == 234)'))]
But this gave me repeated numbers of the same row.
Since it’s an iterable and appears to produce rows on demand, you should be able to speed it up with itertools.islice.
rows = [row['bar'] for row in itertools.islice(table.where('(foo == 234)'), 10000)]
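A small, PyTables-free illustration of why this helps: islice pulls items from the underlying iterator lazily and stops after the requested count instead of exhausting it first (the generator below is just a stand-in for table.where):
import itertools

def rows_on_demand():
    # stand-in for table.where(...): yields items lazily, one at a time
    for i in range(1_000_000):
        yield i

first_10k = list(itertools.islice(rows_on_demand(), 10000))
print(len(first_10k))  # 10000 -- the generator was only advanced 10000 times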

redis scan returns empty results but nonzero cursor

I have a Redis database with a few million keys. Sometimes I need to query keys by pattern, e.g. 2016-04-28:*, for which I use SCAN. The first call should be
scan 0 match 2016-04-28:*
It would then return a bunch of keys and the next cursor, or 0 if the iteration is complete.
However, if I run a query and there are no matching keys, SCAN still returns a non-zero cursor but an empty set of keys. This keeps happening on every successive call, so the iteration does not seem to end for a really long time.
Redis docs say that
SCAN family functions do not guarantee that the number of elements returned per call are in a given range. The commands are also allowed to return zero elements, and the client should not consider the iteration complete as long as the returned cursor is not zero.
So I can't just stop when I get an empty set of keys.
Is there a way I can speed things up?
You'll always need to complete the scan (i.e. get cursor == 0) to be sure there are no matches. You can, however, use the COUNT option to reduce the number of iterations. The default value of 10 keeps each call fast; if empty replies are a common scenario with your match pattern, start increasing COUNT with every empty reply (e.g. double it or use powers of two, but put a max cap on it just in case) to make Redis "search harder" for keys. By doing so you'll be saving on network round trips, so it should "speed things up".
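For illustration, a rough sketch of that loop with the redis-py client (the connection setup, growth factor and cap are made up; adjust them to your setup):
import redis  # redis-py client, assumed to be available

r = redis.Redis()

def scan_keys(pattern, count=10, max_count=10000):
    """Scan until the returned cursor is 0, growing COUNT on every empty reply."""
    cursor, matched = 0, []
    while True:
        cursor, keys = r.scan(cursor=cursor, match=pattern, count=count)
        if keys:
            matched.extend(keys)
        else:
            count = min(count * 2, max_count)  # make Redis "search harder"
        if cursor == 0:
            return matched

keys = scan_keys('2016-04-28:*')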

How is insertion for a Singly Linked List and Doubly Linked List constant time?

Thinking about it, I thought the time complexity of insertion and search should be the same for any data structure, because to insert you first have to search for the location where you want to insert, and then you have to insert.
According to here: http://bigocheatsheet.com/, for a linked list, search is linear time but insertion is constant time. I understand how searching is linear (start from the front, then keep going through the nodes on the linked list one after another until you find what you are searching for), but how is insertion constant time?
Suppose I have this linked list:
1 -> 5 -> 8 -> 10 -> 8
and I want to insert the number 2 after the number 8, would I have to first search for the number 8 (search is linear time) and then take an extra two steps to insert it (so is insertion still linear time)?
# insert y after x in python
def insert_after(x, y):
    search_for(x)   # linear-time search for the node x
    y.next = x.next
    x.next = y
Edit: Even for a doubly linked list, shouldn't it still have to search for the node first (which is linear time), and then insert?
So if you already have a reference to the node you want to insert after, then insertion is O(1). Otherwise, it is search_time + O(1). It is a bit misleading, but on Wikipedia there is a chart that explains it a bit better.
Contrast this with a dynamic array, where inserting at the beginning is Θ(n).
Just for emphasis: the website you reference is referring to the actual act of inserting, given that we already know where we want to insert.
Time to insert = time to update a constant number of pointers = O(1) = constant time.
The time to insert the data is not the same as the time to insert the data at a particular location; the complexity quoted is the time to insert the data only.
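To make the distinction concrete, here is a minimal sketch (names made up) showing that the insertion step itself only touches a couple of pointers, however long the list is; only locating the insertion point costs O(n):
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def insert_after(node, value):
    # constant time: two pointer updates, regardless of list length
    node.next = Node(value, node.next)

head = Node(1, Node(5, Node(8, Node(10))))   # 1 -> 5 -> 8 -> 10
eight = head.next.next                       # assume we already hold this reference (finding it is O(n))
insert_after(eight, 2)                       # now 1 -> 5 -> 8 -> 2 -> 10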

How to check if a value is in an array without the use of a For loop?

Is this possible and/or recommended? Currently, the issue I'm having is processing time: this code checks an array of ~40 items for a value, and once it finds it we set a boolean. This same for loop is called up to 20 times, so I was wondering whether there is a way I could optimize this code so that I don't need several for loops checking for a single answer.
Here's an example of the code
For i = 0 to iCount 'iCount up to 40
    If name = UCase((m_Array(i, 1))) Then
        <logic>
    End If
Next
Above is an example of what I'm looking at. This little chunk of code checks the array, which is prepopulated before this function runs and usually holds around 30-40 items. With this being called up to 20 times, I feel I could reduce the time it takes to run if I could find another way to do it without the use of so many for loops.
LINQ provides a Contains extension method, which returns a Boolean, but that won't work for multidimensional arrays. Even if it did work, if performance is the concern, then Contains wouldn't help much since, internally, all the Contains method does is loop through the items until it finds the matching item.
One way to make it faster, is to use an Exit For statement to exit the loop once the first matching item is found. At least then it won't continue searching through the rest of the items after it finds the one for which it was looking:
For i = 0 to iCount 'iCount up to 40
    If name = UCase((m_Array(i, 1))) Then
        ' logic...
        Exit For
    End If
Next
If you don't want it to have to search through the array at all, you would need to index your data. The simplest way to index your data is with a hash table. The Dictionary class is an easy-to-use implementation of a hash table. However, in the end, a hash table (just like any other indexing method) will only help performance if the situation is right. In this situation, where the array only contains 40 or so items, it's quite possible that a hash table will be slower. The only way to know for sure is to test it both ways and see if it makes any difference.
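The idea is language-independent; a quick sketch in Python (a HashSet(Of String) or Dictionary in VB.NET works the same way, and the names below are made up): build the lookup structure once, and every later check is a hash lookup rather than a scan.
# Build the lookup once...
names = ["ALPHA", "BETA", "GAMMA"]        # stand-in for the m_Array(i, 1) column
name_set = {n.upper() for n in names}

# ...then each check is an average O(1) hash lookup instead of a linear scan.
print("BETA" in name_set)    # True
print("DELTA" in name_set)   # False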
You're currently searching the list repeatedly and doing something if a member has certain properties. Instead, you could check an item's properties once, when you add it to the list, and perform the logic then. Rather than repeating the test on every search, it is done only once. No searching at all is better than even the fastest search.