How to use multiprocessing with an openpyxl workbook? - openpyxl

I need to process an Excel file that contains many sheets, each holding a large amount of data. Loading the file with openpyxl already takes a long time, so I want to analyze each sheet in a separate process using multiprocessing.
A simplified version of the code looks like this:
import multiprocessing as mp
import openpyxl
def LoadEx():
    wb=openpyxl.load_workbook('example.xlsx')
    sheetnames=wb.get_sheet_names()
    return sheetnames, wb
def job(sheet,wb):
    gs=wb.get_sheet_by_name(sheet)
    for i in range(10):
        if gs.cell(row=i,column=2).value=='Target':
            gs.cell(row=i,column=3).value='OK'
if __name__=='__main__':
    sheetnames,wb=LoadEx()
    pool=mp.Pool()
    for sheetname in sheetnames:
        res=pool.apply_async(job, (sheetname,wb))
    pool.close()
    pool.join()
    wb.save('example_output.xlsx')
However, the file 'example_output.xlsx' does not seem to contain the changes made in job().
What should I do to get the benefit of multiprocessing in this case?
Could someone help me? Thanks.

You can do it using multiprocessing, but it comes at a cost.
Each process you start gets its own copy of the global wb.
With 4 processes, therefore, you need enough memory to hold 4 copies of your Workbook.
Because each process works on a copy, its changes stay in that copy.
You have to merge those changes back into a single Workbook.
Copying whole Worksheets could be time consuming.
To overcome the pickling error, I changed from queueing ws objects to queueing wsDiff objects.
Instead of writing to the ws copy, the changes are aggregated in a list of wsDiff objects.
As a bonus, copying them to the target wb is faster.
Timings (cpu_count=2, 10 Worksheets, workload: def ws_job(...
Processes   without mp       2                4
Time        0:00:21.260746   0:00:10.214942   0:00:07.097369
This working example matches the job() function given in the Question:
import multiprocessing as mp
import queue, os, time
import random as rd
import openpyxl
class wsDiff(object):
    # One pending cell change; small and picklable, unlike a Worksheet
    def __init__(self, row, column, value):
        self.row = row
        self.column = column
        self.value = value
def ws_job(wb, ws_idx):
    ws = wb.worksheets[ws_idx]
    print('pid %s: process (%s)' % (os.getpid(), ws.title))
    # *** DO SOME STUFF HERE***
    # Simulate workload
    time.sleep(rd.randrange(1, 4))
    diff = []
    for i in range(1, 11):
        if ws.cell(row=i, column=2).value == 'Target':
            #ws.cell(row=i, column=3).value = 'OK'
            diff.append( wsDiff(i, 3, 'OK') )
    return diff
def job(fq, q, wb):
    while True:
        try:
            ws_idx = fq.get_nowait()
        except queue.Empty:
            print('pid %s: exit job' % os.getpid())
            exit(0)
        q.put((ws_job(wb, ws_idx), ws_idx))
        time.sleep(0.1)
def writer(q, wb):
    print('start writer_handler')
    while True:
        try:
            diff, i_ws = q.get()
        except ValueError:
            print('writer ValueError exit(1)')
            exit(1)
        if diff is None:
            wb.save('../test/example_output.xlsx')
            exit(0)
        ws = wb.worksheets[i_ws]
        print('pid %s: update sheet %s from diff' % (os.getpid(), ws.title))
        for d in diff:
            ws.cell(row=d.row, column=d.column).value = d.value
def mpRun():
    wb = openpyxl.load_workbook('../test/example.xlsx')
    f_q = mp.Queue()
    for i in range(len(wb.worksheets)):
        f_q.put(i)
    w_q = mp.Queue()
    w_p = mp.Process(target=writer, args=(w_q, wb))
    w_p.start()
    time.sleep(0.1)
    pool = [mp.Process(target=job, args=(f_q, w_q, wb)) for p in range(os.cpu_count() + 2)]
    for p in pool:
        p.start()
        time.sleep(0.1)
    for p in pool:
        p.join()
    time.sleep(0.2)
    # Terminate Process w_p after all Sheets done
    w_q.put((None, None))
    w_p.join()
    print('EXIT __main__')
if __name__ == '__main__':
    mpRun()
Tested with Python:3.4.2 - openpyxl:2.4.1 - LibreOffice:4.3.3.2

Unfortunately, Workbooks are not suited for multiprocessing as there is a lot of shared state.
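One way to live with that shared state is to keep the Workbook in the parent process and hand the workers only plain cell values, so that nothing but small tuples crosses the process boundary. The following is a minimal sketch of that idea, not openpyxl's own API: find_updates and collect_updates are hypothetical helper names, and each sheet is assumed to have been read into a list of row tuples beforehand.

```python
from concurrent.futures import ProcessPoolExecutor


def find_updates(rows):
    """Return (row, column, value) edits for one sheet given as row tuples."""
    updates = []
    # Worksheet rows are 1-based, so start counting at 1.
    for row_idx, row in enumerate(rows, start=1):
        if len(row) >= 2 and row[1] == 'Target':   # column 2 holds the marker
            updates.append((row_idx, 3, 'OK'))     # edit destined for column 3
    return updates


def collect_updates(sheets, executor_cls=ProcessPoolExecutor):
    """Map sheet title -> list of edits, computing one sheet per task.

    sheets is a dict of title -> list of row tuples (plain, picklable data).
    """
    titles = list(sheets)
    with executor_cls() as executor:
        # map() yields results in input order, so zip pairs them correctly.
        results = list(executor.map(find_updates, (sheets[t] for t in titles)))
    return dict(zip(titles, results))
```

In the parent, the returned edits could then be applied with ws.cell(row=r, column=c).value = v before a single wb.save(...). On platforms that start workers by spawning, the call to collect_updates belongs under an if __name__ == '__main__': guard.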

Related

Correct way of passing dataframe to ray

I am trying to do the simplest thing with Ray, but no matter what I do it just never releases memory and fails.
The usage case is simply
read parquet files to DF -> pass to pool of actors -> make changes to DF -> return DF
@ray.remote
class Main_func:
    def calculate(self,data):
        #do some things with the DF
        return df.copy(deep=True) # one of many attempts to fix the problem, but didn't work
cpus = 24
actors = []
for _ in range(cpus):
    actors.append(Main_func.remote())
from ray.util import ActorPool
pool = ActorPool(actors)
import os
arr = os.listdir("/some/files")
def to_ray():
    try:
        filename = arr.pop(0)
        pf = ParquetFile("/some/files/" + filename)
        df = pf.to_pandas()
        pool.submit(lambda a,v:a.calculate.remote(v),df.copy(deep=True))
    except Exception as e:
        print(e)
for _ in range(cpus):
    to_ray()
while(True):
    res = pool.get_next_unordered()
    write('./temp/' + random_filename, res,compression='GZIP')
    del res
    to_ray()
I have tried other ways of doing the same thing, such as submitting manually rather than using the map command, but whatever I do it always locks up memory and fails after a few hundred dataframes.
Does each task need to preserve state across different files? If not, Ray's task abstraction should simplify things:
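import os
import pandas as pd
import ray
ray.init()
@ray.remote
def read_and_write(path):
    df = pd.read_parquet(path)
    # ... do things
    df.to_parquet("./temp/...")
# os.listdir returns bare names, so join them with the directory
arr = [os.path.join("/some/files", name) for name in os.listdir("/some/files")]
results = ray.get([read_and_write.remote(path) for path in arr])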

Python multiprocessing how to update a complex object in a manager list without using .join() method

I started programming in Python about 2 months ago and I've been struggling with this problem in the last 2 weeks.
I know there are many similar threads to this one but I can't really find a solution which suits my case.
I need a main process, which interacts with Telegram, and another process, buffer, which takes the complex objects received from the main process and updates them.
I'd like to do this in a simpler and smoother way.
At the moment the objects are not being updated because multiprocessing is used without the join() method.
I then tried multithreading instead, but it gives me compatibility problems with Pyrogram, a framework which I am using to interact with Telegram.
I have reproduced the structure of my project in a reduced form below, so that it shows the same error I am getting and makes it easier for everyone to help.
a.py
class A():
    def __init__(self, length = -1, height = -1):
        self.length = length
        self.height = height
b.py
from a import A
class B(A):
    def __init__(self, length = -1, height = -1, width = -1):
        super().__init__(length = -1, height = -1)
        self.length = length
        self.height = height
        self.width = width
    def setHeight(self, value):
        self.height = value
c.py
class C():
    def __init__(self, a, x = 0, y = 0):
        self.a = a
        self.x = x
        self.y = y
    def func1(self):
        if self.x < 7:
            self.x = 7
d.py
from c import C
class D(C):
    def __init__(self, a, x = 0, y = 0, z = 0):
        super().__init__(a, x = 0, y = 0)
        self.a = a
        self.x = x
        self.y = y
        self.z = z
    def func2(self):
        self.func1()
main.py
from b import B
from d import D
from multiprocessing import Process, Manager
from buffer import buffer
if __name__ == "__main__":
    manager = Manager()
    lizt = manager.list()
    buffer = Process(target = buffer, args = (lizt, )) #passing the list as a parameter
    buffer.start()
    #can't invoke buffer.join() here because I need the below code to keep running while the buffer process takes a few minutes to end an instance passed in the list
    #hence I can't wait the join() function to update the objects inside the buffer but i need objects updated in order to pop them out from the list
    import datetime as dt
    t = dt.datetime.now()
    #library of kind of multithreading (pool of 4 processes), uses asyncio lib
    #this while was put to reproduce the same error I am getting
    while True:
        if t + dt.timedelta(seconds = 10) < dt.datetime.now():
            lizt.append(D(B(5, 5, 5)))
            t = dt.datetime.now()
"""
#This is the code which looks like the one in my project
#main.py
from pyrogram import Client #library of kind of multithreading (pool of 4 processes), uses asyncio lib
from b import B
from d import D
from multiprocessing import Process, Manager
from buffer import buffer
if __name__ == "__main__":
    api_id = 1234567
    api_hash = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    app = Client("my_account", api_id, api_hash)
    manager = Manager()
    lizt = manager.list()
    buffer = Process(target = buffer, args = (lizt, )) #passing the list as a parameter
    buffer.start()
    #can't invoke buffer.join() here because I need the below code to run at the same time as the buffer process
    #hence I can't wait the join() function to update the objects inside the buffer
    @app.on_message()
    def my_handler(client, message):
        lizt.append(complex_object_conatining_message)
"""
buffer.py
def buffer(buffer):
    print("buffer was defined")
    while True:
        if len(buffer) > 0:
            print(buffer[0].x) #prints 0
            buffer[0].func2() #this changes the class attribute locally in the class instance but not in here
            print(buffer[0].x) #prints 0, but I'd like it to be 7
            print(buffer[0].a.height) #prints 5
            buffer[0].a.setHeight(10) #and this has the same behaviour
            print(buffer[0].a.height) #prints 5 but I'd like it to be 10
            buffer.pop(0)
This is the whole code about the problem I am having.
Literally every suggestion is welcome, hopefully constructive, thank you in advance!
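The behaviour shown in buffer.py is how manager list proxies are documented to work: indexing the proxy hands back a copy of the stored object, so mutating that copy changes nothing on the manager side until the object is assigned back into the list. A minimal sketch of the fetch, mutate, write-back pattern follows; update_item is a hypothetical helper name, not part of multiprocessing.

```python
def update_item(shared, index, mutate):
    """Apply mutate() to shared[index] and store the result back.

    With a manager.list() proxy, shared[index] returns a copy, so the
    final assignment is what makes the change visible to other processes.
    """
    item = shared[index]    # fetch (a copy, if shared is a proxy)
    mutate(item)            # change the local object
    shared[index] = item    # write back through the container
```

In buffer.py this could be called as update_item(buffer, 0, lambda obj: obj.func2()). The lambda runs in the calling process, so it does not need to be picklable; only the stored object does.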
In the end I had to solve this problem a different way: using asyncio, as the framework itself does.
This solution offers everything I was looking for:
-complex objects are updated
-it avoids the problems of multiprocessing (in particular with join())
It is also:
-lightweight: before, I had 2 Python processes, 1) about 40K and 2) about 75K
The single process is now about 30K (and it's also faster and cleaner)
Here's the solution; I hope it will be as useful to someone else as it was to me:
The classes are omitted because this solution updates complex objects without any problem
main.py
from pyrogram import Client
import asyncio
import time
from firstWorker import firstWorker
def cancel_tasks():
    #get all task in current loop
    tasks = asyncio.Task.all_tasks()
    for t in tasks:
        t.cancel()
try:
    buffer = []
    firstWorker(buffer) #this one is the old buffer.py file and function
    #the missing loop and loop method are explained in the next piece of code
except KeyboardInterrupt:
    print("")
finally:
    print("Closing Loop")
    cancel_tasks()
firstWorker.py
import asyncio
from pyrogram import Client
from secondWorker import secondWorker
def firstWorker(buffer):
    print("First Worker Executed")
    api_id = 1234567
    api_hash = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
    app = Client("my_account", api_id, api_hash)
    @app.on_message()
    async def my_handler(client, message):
        print("Message Arrived")
        buffer.append(complex_object_conatining_message)
        await asyncio.sleep(1)
    app.run(secondWorker(buffer)) #here is the trick: I changed the
                                  #method run() of the Client class
                                  #inside the Pyrogram framework
                                  #since it was a loop itself.
                                  #In this way I added another task
                                  #to the existing loop in order to
                                  #let both of them run together.
secondWorker.py
import asyncio
async def secondWorker(buffer):
    while True:
        if len(buffer) > 0:
            print(buffer.pop(0))
        await asyncio.sleep(1)
The resources to understand the asyncio used in this code can be found here:
Asyncio simple tutorial
Python Asyncio Official Documentation
This tutorial about how to fix classical Asyncio errors
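The core of this answer, a producer and a consumer sharing one event loop instead of two processes, can be sketched without Pyrogram. In this minimal sketch the names producer, consumer and main are hypothetical, an asyncio.Queue stands in for the shared list, and the None sentinel is an assumption used to stop the consumer.

```python
import asyncio


async def producer(queue, items):
    """Stand-in for the message handler: push objects as they 'arrive'."""
    for item in items:
        await queue.put(item)
        await asyncio.sleep(0)      # yield control to the other task
    await queue.put(None)           # sentinel: nothing more will arrive


async def consumer(queue):
    """Stand-in for secondWorker: update each object in place."""
    done = []
    while True:
        item = await queue.get()    # waits without blocking the loop
        if item is None:
            return done
        item['x'] = max(item['x'], 7)   # same object the producer created
        done.append(item)


async def main(items):
    queue = asyncio.Queue()
    # Both coroutines run in the same loop, so no copies are made.
    _, done = await asyncio.gather(producer(queue, items), consumer(queue))
    return done
```

Because both tasks live in one process, the consumer receives the very objects the producer created, which is why complex objects update without any write-back step.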

TensorFlow eval inbetween two queues

My goal is as follows:
1). Use tf.train.string_input_producer and tf.TextLineReader to read lines from files.
2). Convert the resulting tensors containing the files' lines into ordinary strings using eval to do preprocessing before batching (TensorFlow's limited string operations are insufficient for my purposes)
3). Convert these preprocessed strings back to tensors (presumably using tf.constant ?)
4). Use tf.train.batch on the resulting tensors.
The following code is a simplified version of what I'm working on.
The "After batch" print statement gets executed, but then the REPL hangs on the print statement with the final eval.
From what I've read, I have a feeling this is because
threads = tf.train.start_queue_runners(coord = coord, sess = sess)
needs to be run after calling tf.train.batch. But if I do this, then the REPL will of course hang on the first eval
evalue = value.eval(session = sess)
needed to do the preprocessing.
What is the best way to convert back and forth between tensors and their values inbetween queues? (I'm really hoping I can do this without preprocessing my data files beforehand.)
import tensorflow as tf
import os
def process(string):
    return string.upper()
def main():
    sess = tf.Session()
    filenames = tf.constant(["test_data/" + f for f in os.listdir("./test_data")])
    filename_queue = tf.train.string_input_producer(filenames)
    file_reader = tf.TextLineReader()
    key, value = file_reader.read(filename_queue)
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord = coord, sess = sess)
    evalue = value.eval(session = sess)
    proc_value = process(evalue)
    tensor_value = tf.constant(proc_value)
    batch = tf.train.batch([tensor_value], batch_size = 2, capacity = 2)
    print "After batch."
    print batch.eval(session = sess)
We discussed a slightly different approach, which I think achieves what you need here:
Converting TensorFlow tutorial to work with my own data
Not sure what file formats you are reading, but the above example reads CSVs row-by-row and packs them into randomized batches.
If you are reading from a CSV then, in a nutshell, instead of returning value from file_reader.read(filename_queue) immediately, you could do the pre-processing first and return THAT instead, something like this:
rDefaults = [['a'] for row in range((ROW_LENGTH))]
_, value = reader.read(filename_queue)
whole_row = tf.decode_csv(value, record_defaults=rDefaults)
cell1 = tf.slice(whole_row, [0], [1]) # one specific cell that contains a string
cell2 = tf.slice(whole_row, [1], [2]) # another cell that contains a string
# do some processing on cell1 and cell2
return cell1, cell2

Multiprocessing shared numpy array

I need to share a numpy array between processes to store some results in it. I'm not quite sure whether what I have done so far is correct. This is my simplified code.
from multiprocessing import Process, Lock, Array
import numpy as np
def worker(shared,lock):
    numpy_arr = np.frombuffer(shared.get_obj())
    # do some work ...
    with lock:
        for i in range(10):
            numpy_arr[0] += 1
        numpy_arr += 1
    return
if __name__ == '__main__':
    jobs = []
    lock = Lock()
    shared_array = Array('d', 1000000)
    for process in range(4):
        p = Process(target=worker, args=(shared_array,lock))
        jobs.append(p)
        p.start()
    for process in jobs:
        process.join()
    m = np.frombuffer(shared_array.get_obj())
    np.save('data', m)
    print (m[:5])
With this code I obtain the expected results, but again, I'm not sure this is the correct way. Finally, what is the difference between multiprocessing.Array and multiprocessing.sharedctypes.Array?
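On the second question: in CPython, multiprocessing.Array is a thin wrapper that calls multiprocessing.sharedctypes.Array with the default context, so both return the same kind of synchronized wrapper. That wrapper already carries its own lock, available through get_lock(), which makes a separate Lock optional. A minimal sketch using only the standard library follows; the helper name add_one is hypothetical, and in the question's code np.frombuffer(shared.get_obj()) would take the place of plain indexing.

```python
import multiprocessing as mp
from multiprocessing import sharedctypes


def add_one(shared, count):
    """Increment the first `count` slots while holding the array's own lock."""
    with shared.get_lock():         # lock created together with the array
        for i in range(count):
            shared[i] += 1


# Both calls build the same kind of lock-protected wrapper.
a = mp.Array('d', 4)
b = sharedctypes.Array('d', 4)
same_type = type(a) is type(b)

# lock=False returns the raw ctypes array, which has no get_lock().
raw = mp.Array('d', 4, lock=False)
```

Passing lock=False only makes sense when the processes never write to the same elements, or when access is synchronized some other way.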

return a list from class object

I am using the multiprocessing module to generate 35 dataframes, which I expect will save time. The problem is that the class does not return anything. I expect the list of dataframes to be returned from self.dflist.
Here is how the dfnames list is created.
urls=[]
fnames=[]
dfnames=[]
for x in xrange(100,3600,100):
    y = str(x)
    i = y.zfill(4)
    filename='DCHB_Town_Release_'+i+'.xlsx'
    url = "http://www.censusindia.gov.in/2011census/dchb/"+filename
    urls.append(url)
    fnames.append(filename)
    dfnames.append((filename, 'DCHB_Town_Release_'+i))
This is the class that uses the dfnames generated by above code.
import pandas as pd
import multiprocessing
class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        self.dflist=list()
        self.jobs=list()
        self.dfnames=dfnames
    def dframe_create(self, filename, dfname):
        print 'abc', filename, dfname
        dfname=pd.read_excel(filename)
        self.dflist.append(dfname)
        print self.dflist
        return self.dflist
    def mp(self):
        for f,d in self.dfnames:
            p = multiprocessing.Process(target=self.dframe_create, args=(f,d))
            self.jobs.append(p)
            p.start()
        #return self.dflist
        for j in self.jobs:
            j.join()
            print '%s.exitcode = %s' % (j.name, j.exitcode)
This class when called like this...
dflist=[]
jobs=[]
x=mydf1(dflist, jobs, dfnames)
y=x.mp()
This prints self.dflist correctly, but does not return anything.
I can collect all the dataframes sequentially, but in order to save time I need to use multiple processes simultaneously to generate the dataframes and add them to a list.
In your case I would prefer to write as little code as possible and use Pool:
import pandas as pd
import logging
import multiprocessing
def dframe_create(filename):
    try:
        return pd.read_excel(filename)
    except Exception as e:
        logging.error("Something went wrong: %s", e, exc_info=1)
        return None
p = multiprocessing.Pool()
# map over the plain filenames; dfnames holds (filename, name) tuples
excel_files = p.map(dframe_create, fnames)
for f in excel_files:
    if f is not None:
        print 'Ready to work'
    else:
        print ':('
Prints the self.dflist correctly. But does not return anything.
That's because you don't have a return statement in the mp method, e.g.
def mp(self):
    ...
    return self.dflist
It's not entirely clear what your issue is. However, you have to take some care here, because you can't just pass ordinary objects or lists across processes: each process would change its own copy. That's why multiprocessing offers manager objects, which funnel every modification through a single manager process, so you don't get tripped up when two processes try to make a change at the same time (and only one update survives).
That is, you have to use a list created by a multiprocessing Manager.
class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        self.dflist = multiprocessing.Manager().list() # perhaps should be multiprocessing.Manager().list(dflist or ())
        self.jobs = list()
        self.dfnames = dfnames
However, you have a bigger problem: the whole point of multiprocessing is that the processes may run and finish out of order, so relying on the order of a list like this is doomed to fail. You should use a Manager dict instead, so that each DataFrame is stored unambiguously under its name.
class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        self.dfdict = multiprocessing.Manager().dict()
        ...
    def dframe_create(self, filename, dfname):
        print 'abc', filename, dfname
        df = pd.read_excel(filename)
        self.dfdict[dfname] = df
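Since workers can finish in any order, another way to keep each result tied to its input, without a shared dict, is to have the worker return the name together with the result and let the parent build the mapping. A minimal sketch, assuming a hypothetical load function that stands in for pd.read_excel; load_named and load_all are illustrative names as well.

```python
import multiprocessing


def load(filename):
    """Hypothetical stand-in for pd.read_excel(filename)."""
    return 'contents of ' + filename


def load_named(pair):
    """Worker: take (filename, dfname) and return (dfname, loaded data)."""
    filename, dfname = pair
    return dfname, load(filename)


def load_all(dfnames, pool_cls=multiprocessing.Pool):
    """Load every file in a worker and return a dict keyed by dfname."""
    with pool_cls() as pool:
        # Results may arrive in any order; the key travels with each one.
        return dict(pool.imap_unordered(load_named, dfnames))
```

Called as load_all(dfnames) with the list of (filename, name) tuples built in the question, this returns one dict in the parent process, so no state has to be shared between the workers.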