Best way to parallelize multi-table function in Python (using Pandas) - pandas

I have this function below that iterates through every row of a data frame (using pandas apply) and determines what values are valid from a prediction-probability matrix (L2) by referencing another data frame (GST) to obtain the valid values for a given row. The function just returns the row back with the maximum valid probability assigned to the previously blank value for that row (Predicted Level 2) in the data frame passed to the function (test_x2)
Not a terribly complex function and it works fine on smaller datasets but when I scale to like 3-5 million records, it starts to take way too long. I tried using the multiprocessing module as well as dask/numba but nothing was able to improve the runtime (not sure if this is just due to the fact the function is not vectorizable).
My question is two fold:
1) Is there a better way to write this? (I'm guessing there is)
2) If not, what parallel computing strategies could work with this type of function? I've already tried a number of different python options but I'm just leaning more towards running the larger datasets on totally separate machines at this point. Feel free to provide any suggested code to parallelize something like this. Thanks in advance for any guidance provided.
l2 = MNB.predict_proba(test_x)
l2_classes = MNB.classes_
L2 = pd.DataFrame(l2, columns = MNB.classes_)
test_x2["Predicted Level 2"] = ""
def predict_2(row):
s = row["Predicted Level 1"]
s = GST.loc[s,:]
s.reset_index(inplace = True)
Valid_Level2s = s["GST Level 2"].tolist()
p2 = L2.ix[row.name, Valid_Level2s]
max2 = p2.idxmax(axis = 1)
output = row["Predicted Level 2"] = max2
return row
test_x2 = test_x2.apply(predict_2, axis = 1)

Related

Building relationships in Neo4j via Py2neo is very slow

We have 5 different types of nodes in database. Largest one has ~290k, the smallest is only ~3k. Each node type has an id field and they are all indexed. I am using py2neo to build relationship, but it is very slow (~ 2 relationships inserted per second)
I used pandas read from a relationship csv, iterate each row to create a relationship wrapped in transaction. I tried batch out 10k creation statements in one transaction, but it does not seem to improve the speed a lot.
Below is the code:
df = pd.read_csv(r"C:\relationship.csv",dtype = datatype, skipinitialspace=True, usecols=fields)
df.fillna('',inplace=True)
def f(node_1 ,rel_type, node_2):
try:
tx = graph.begin()
tx.evaluate('MATCH (a {node_id:$label1}),(b {node_id:$label2}) MERGE (a)-[r:'+rel_type+']->(b)',
parameters = {'label1': node_1, 'label2': node_2})
tx.commit()
except Exception as e:
print(str(e))
for index, row in df.iterrows():
if(index%1000000 == 0):
print(index)
try:
f(row["node_1"],row["rel_type"],row["node_2"])
except:
print("error index: " + index)
Can someone help me what I did wrong here. Thanks!
You state that there are "5 different types of nodes" (which I interpret to mean 5 node labels, in neo4j terminology). And, furthermore, you state that their id properties are already indexed.
But your f() function is not generating a Cypher query that uses the labels at all, and neither does it use the id property. In order to take advantage of your indexes, your Cypher query has to specify the node label and the id value.
Since there is currently no efficient way to parameterize the label when performing a MATCH, the following version of the f() function generates a Cypher query that has hardcoded labels (as well as a hardcoded relationship type):
def f(label_1, id_1, rel_type, label_2, id_2):
try:
tx = graph.begin()
tx.evaluate(
'MATCH' +
'(a:' + label_1 + '{id:$id1}),' +
'(b:' + label_2 + '{id:$id2}) ' +
'MERGE (a)-[r:'+rel_type+']->(b)',
parameters = {'id1': id_1, 'id2': id_2})
tx.commit()
except Exception as e:
print(str(e))
The code that calls f() will also have to be changed to pass in both the label names and the id values for a and b. Hopefully, your df rows will contain that data (or enough info for you to derive that data).
If your aim is for better performance then you will need to consider a different pattern for loading these, i.e. batching. You're currently running one Cypher MERGE statement for each relationship and wrapping that in its own transaction in a separate function call.
Batching these by looking at multiple statements per transaction or per function call will reduce the number of network hops and should improve performance.

VTK cutter output

I am seeking a solution of connecting all the lines that have the same slope and share a common point. For example, after I load a STL file and cut it using a plane, the cutter output includes the points defining the contour. Connecting them one by one forms a (or multiple) polyline. However, some lines can be merged when their slopes are the same and they share a common point. E.g., [[0,0,0],[0,0,1]] and [[0,0,1],[0,0,2]] can be represented by one single line [[0,0,0],[0,0,2]].
I wrote a function that can analyse all the lines and connect them if they can be merged. But when the number of lines are huge, this process is slow. I am thinking in the VTK pipeline, is there a way to do the line merging?
Cheers!
plane = vtk.vtkPlane()
plane.SetOrigin([0,0,5])
plane.SetNormal([0,0,1])
cutter = vtk.vtkCutter()
cutter.SetCutFunction(plane)
cutter.SetInput(triangleFilter.GetOutput())
cutter.Update()
cutStrips = vtk.vtkStripper()
cutStrips.SetInputConnection(cutter.GetOutputPort())
cutStrips.Update()
cleanDataFilter = vtk.vtkCleanPolyData()
cleanDataFilter.AddInput(cutStrips.GetOutput())
cleanDataFilter.Update()
cleanData = cleanDataFilter.GetOutput()
print cleanData.GetPoint(0)
print cleanData.GetPoint(1)
print cleanData.GetPoint(2)
print cleanData.GetPoint(3)
print cleanData.GetPoint(4)
The output is:
(0.0, 0.0, 5.0)
(5.0, 0.0, 5.0)
(10.0, 0.0, 5.0)
(10.0, 5.0, 5.0)
(10.0, 10.0, 5.0)
Connect the above points one by one will form a polyline representing the cut result. As we can see, the line [point0, point1] and [point1, point2] can be merged.
Below is the code for merging the lines:
Assume that the LINES are represented by list: [[(p0),(p1)],[(p1),(p2)],[(p2),(p3)],...]
appended = 0
CurrentLine = LINES[0]
CurrentConnectedLine = CurrentLine
tempLineCollection = LINES[1:len(LINES)]
while True:
for HL in tempLineCollection:
QCoreApplication.processEvents()
if checkParallelAndConnect(CurrentConnectedLine, HL):
appended = 1
LINES.remove(HL)
CurrentConnectedLine = ConnectLines(CurrentConnectedLine, HL)
processedPool.append(CurrentConnectedLine)
if len(tempLineCollection) == 1:
processedPool.append(tempLineCollection[0])
LINES.remove(CurrentLine)
if len(LINES) >= 2:
CurrentLine = LINES[0]
CurrentConnectedLine = CurrentLine
tempLineCollection = LINES[1:len(LINES)]
appended = 0
else:
break
Solution:
I figured out a way of further accelerating this process using some vtk data structure. I found out that a polyline line will be stored in a cell, which can be checked by using GetCellType(). Since the point order for a polyline is sorted already, We do not need to search globally which lines are colinear with the current one. For each point on the polyline, I just need to check the point[i-1], point[i], point[i+1]. And if they are colinear, the end of the line will be updated to the next point. This process continues until the end of the polyline is reached. The speed increases by a huge amount compared with the global search approach.
Not sure if it is the main source of slowness (depends on how many positive hits on the colinearity you have), but removing items from a vector is costly (O(n)), since it requires reorganizing the rest of the vector, you should avoid it. But even without hits on colinearity, the LINES.remove(CurrentLine) call is surely slowing things down and there isn't really any need for it - just leave the vector untouched, write the final results to a new vector (processedPool) and get rid of the LINES vector in the end. You can modify your algorithm by making a bool array (vector), initialized at "false" for each item, then when you remove a line, you don't actually remove it, but only mark it as "true" and you skip all lines for which you have "true", i.e. something like this (I don't speak python so the syntax is not accurate):
wasRemoved = bool vector of the size of LINES initialized at false for each entry
for CurrentLineIndex = 0; CurrentLineIndex < sizeof(LINES); CurrentLineIndex++
if (wasRemoved[CurrentLineIndex])
continue // skip a segment that was already removed
CurrentConnectedLine = LINES[CurrentLineIndex]
for HLIndex = CurrentLineIndex + 1; HLIndex < sizeof(LINES); HLIndex++:
if (wasRemoved[HLIndex])
continue;
HL = LINES[HLIndex]
QCoreApplication.processEvents()
if checkParallelAndConnect(CurrentConnectedLine, HL):
wasRemoved[HLIndex] = true
CurrentConnectedLine = ConnectLines(CurrentConnectedLine, HL)
processedPool.append(CurrentConnectedLine)
wasRemoved[CurrentLineIndex] = true // this is technically not needed since you won't go back in the vector anyway
LINES = processedPool
BTW, the really correct data structure for LINES to use for that kind of algorithm would be a linked list, since then you would have O(1) complexity for removal and you wouldn't need the boolean array. But a quick googling showed that that's not how lists are implemented in Python, also don't know if it would not interfere with other parts of your program. Alternatively, using a set might make it faster (though I would expect times similar to my "bool array" solution), see python 2.7 set and list remove time complexity
If this does not do the trick, I suggest you measure times of individual parts of the program to find the bottleneck.

Accessing intermediate results from a tensorflow graph

If I have a complex calculation of the form
tmp1 = tf.fun1(placeholder1,placeholder2)
tmp2 = tf.fun2(tmp1,placeholder3)
tmp3 = tf.fun3(tmp2)
ret = tf.fun4(tmp3)
and I calculate
ret_vals = sess.run(ret,feed_dict={placeholder1: vals1, placeholder2: vals2, placeholder3: vals3})
fun1, fun2 etc are possibly costly operations on a lot of data.
If I run to get ret_vals as above, is it possible to later or at the same time access the intermediate values as well without re-running everything up to that value? For example, to get tmp2, I could re-run everything using
tmp2_vals = sess.run(tmp2,feed_dict={placeholder1: vals1, placeholder2: vals2, placeholder3: vals3})
But this looks like a complete waste of computation? Is there a way to access several of the intermediate results in a graph after performing one run?
The reason why I want to do this is for debugging or testing or logging of progress when ret_vals gets calculated e.g. in an optimization loop. Every step where I run the ret_vals calculations is costly but I want to see some of the intermediate results that were calculated.
If I do something like
tmp2_vals, ret_vals = sess.run([tmp2, ret], ...)
does this guarantee that the graph will only get run once (instead of one time for tmp2 and one time for ret) like I want it?
Have you looked at tf.Print? This is an identity op with printing funciton. You can insert it in your graph right after tmp2 to get the value of it. Note that the default setting only allows you to print the first n values of the tensor, you can modify the value n by giving attribute first_n to the op.

error in LDA in r: each row of the input matrix needs to contain at least one non-zero entry

I am a starter in text mining topic. When I run LDA() over a huge dataset with 996165 observations, it displays the following error:
Error in LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, :
Each row of the input matrix needs to contain at least one non-zero entry.
I am pretty sure that there is no missing values in my corpus and also. The table of "DocumentTermMatrix" and "simple_triplet_matrix" is:
table(is.na(dtm[[1]]))
#FALSE
#57100956
table(is.na(dtm[[2]]))
#FALSE
#57100956
A little confused how "57100956" comes. But as my dataset is pretty large, I don't know how to check why does this error occurs. My LDA command is:
ldaOut<-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
Can anyone provide some insights? Thanks.
In my opinion the problem is not the presence of missing values, but the presence of all 0 rows.
To check it:
raw.sum=apply(table,1,FUN=sum) #sum by raw each raw of the table
Then you can delete all raws which are all 0 doing:
table=table[raw.sum!=0,]
Now table should has all "non 0" raws.
I had the same problem. The design matrix, dtm, in your case, had rows with all zeroes because dome documents did not contain certain words (i.e. their frequency was zero). I suppose this somehow causes a singular matrix problem somewhere along the line. I fixed this by adding a common word to each of the documents so that every row would have at least one non-zero entry. At the very least, the LDA ran successfully and classified each of the documents. Hope this helps!

h5py selective read in

I have a problem regarding a selective read-in routine while using h5py.
f = h5py.File('file.hdf5','r')
data = f['Data']
I have several positive values in the 'Data'- dataset and also some placeholders with -9999.
How I can get only all positive values for calculations like np.min?
np.ma.masked_array creates a full copy of the array and all the benefits from using h5py are lost ... (regarding memory usage). The problem is, that I get errors if I try to read data sets that exceed 100 millions of values per data set using data = f['Data'][:,0]
Or if this is not possible is something like that possible?
np.place(data[...], data[...] <= -9999, float('nan'))
Thanks in advance
You could use:
mask = f['Data'] >= 0
data = f['Data'][mask]
although I am not sure how much memory the mask calculation itself uses.