How to clear DataFrame memory in pandas?

I am converting a fixed-width file to a delimited file ('|' delimiter) using pandas' read_fwf method. My input file ("infile.txt") is around 16 GB with 9.9 million records. While creating the dataframe, the process occupies almost three times that memory (around 48 GB) before it writes the output file. Can someone help me improve the logic below and throw some light on where this extra memory comes from? (I know seq_id, fname and loadDatetime will occupy some space, but it should only be a couple of GB.)
Note:
I am processing multiple files of similar size in a loop, one after the other, so I have to clear the memory before the next file takes over.
'''infile.txt'''
1234567890AAAAAAAAAA
1234567890BBBBBBBBBB
1234567890CCCCCCCCCC
'''test_layout.csv'''
FIELD_NAME,START_POS,END_POS
FIELD1,0,10
FIELD2,10,20
'''test.py'''
import datetime
import pandas as pd
import csv
from collections import OrderedDict
import gc
seq_id = 1
fname= 'infile.txt'
loadDatetime = '04/10/2018'
in_layout = open("test_layout.csv","rt")
reader = csv.DictReader(in_layout)
boundries, col_names = [], []
for row in reader:
    boundries.append((int(str(row['START_POS']).strip()), int(str(row['END_POS']).strip())))
    col_names.append(str(row['FIELD_NAME']).strip())
dataf = pd.read_fwf(fname, quoting=3, colspecs = boundries, dtype = object, names = col_names)
len_df = len(dataf)
'''Used pair of key, value tuples and OrderedDict to preserve the order of the columns'''
mod_dataf = pd.DataFrame(OrderedDict((('seq_id',[seq_id]*len_df),('fname',[fname]*len_df))), dtype=object)
ldt_ser = pd.Series([loadDatetime]*len_df,name='loadDatetime', dtype=object)
dataf = pd.concat([mod_dataf, dataf],axis=1)
alldfs = [mod_dataf]
del alldfs
gc.collect()
mod_dataf = pd.DataFrame()
dataf = pd.concat([dataf,ldt_ser],axis=1)
dataf.to_csv("outfile.txt", sep='|', quoting=3, escapechar='\\' , index=False, header=False,encoding='utf-8')
''' Release Memory used by DataFrames '''
alldfs = [dataf]
del ldt_ser
del alldfs
gc.collect()
dataf = pd.DataFrame()
I used the garbage collector, del on the dataframes, and re-initialised them to clear the memory they used, but the total memory taken by the dataframes is still not released.
Inspired by https://stackoverflow.com/a/49144260/2799214
'''OUTPUT'''
1|infile.txt|1234567890|AAAAAAAAAA|04/10/2018
1|infile.txt|1234567890|BBBBBBBBBB|04/10/2018
1|infile.txt|1234567890|CCCCCCCCCC|04/10/2018

I had the same problem as you using https://stackoverflow.com/a/49144260/2799214
I found a solution using gc.collect() by splitting my code into different methods within a class. For example:
class A:
    def __init__(self):
        # your code
        pass

    def first_part_of_my_code(self):
        # your code
        # I want to clear my dataframe
        del my_dataframe
        gc.collect()
        my_dataframe = pd.DataFrame()  # not sure whether this line really helps
        return my_new_light_dataframe

    def second_part_of_my_code(self):
        # my code
        # same principle
        pass
So when the program calls these methods, the garbage collector clears the memory once the program leaves the method.
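Applying the same idea to the original loop, a minimal sketch would be to push the whole per-file conversion into one function and read the fixed-width file in chunks (chunksize is a standard read_fwf/read_csv option), so only one chunk is ever resident and all locals are freed when the function returns. The boundries/col_names come from the question's layout parsing; the chunk size is arbitrary and the whole thing is an untested sketch, not the asker's code:
import gc
import pandas as pd

def convert_file(fname, boundries, col_names, seq_id, loadDatetime, outfile):
    '''Stream one fixed-width file to a pipe-delimited file in chunks, so only
    one chunk lives in memory at a time; all locals are freed when we return.'''
    reader = pd.read_fwf(fname, quoting=3, colspecs=boundries, dtype=object,
                         names=col_names, chunksize=500000)
    for i, chunk in enumerate(reader):
        chunk.insert(0, 'seq_id', seq_id)      # constant columns, no extra concat
        chunk.insert(1, 'fname', fname)
        chunk['loadDatetime'] = loadDatetime
        chunk.to_csv(outfile, sep='|', quoting=3, escapechar='\\', index=False,
                     header=False, encoding='utf-8', mode='w' if i == 0 else 'a')
    del reader
    gc.collect()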

Related

'poorly' organized csv file

I have a CSV file that I have to do some data processing on, and it's a bit of a mess. It's about 20 columns wide, but there are multiple datasets stacked one after another in the same columns; see the dummy file below.
I'm trying to import each sub-file into a separate pandas dataframe, but I'm not sure of the best way to parse the CSV other than hard-coding how many rows to read for each block. Any suggestions? I guess I could find where the breaks between blocks are (loop through the entire file, locate them, and then read each block), but that doesn't seem very efficient. I have lots of CSV files like this to read.
import pandas as pd
nrows = 20
skiprows = 0 #but this only reads in the first block
df = pd.read_csv(csvfile, nrows=nrows, skiprows=skiprows)
Below is a dummy example:
TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
0.128671067,0.56740537,0.689995486,0.518493779,0.94916205
0.214026742,0.176948186,0.883636252,0.732258971,0.463732841
0.769415726,0.960761306,0.401863804,0.41823372,0.812081565
0.529750933,0.360314266,0.461615009,0.387516958,0.136616263
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
0.897559206,0.659759362,0.691643603,0.155601111,0.713735399
0.860188233,0.805013656,0.772153733,0.809025634,0.257632085
0.844167809,0.268060979,0.015993504,0.95131982,0.321210766
0.86288383,0.236599974,0.279435193,0.311005146,0.037592509
0.938348876,0.941851279,0.582434058,0.900348616,0.381844182
0.344351819,0.821571854,0.187962046,0.218234588,0.376122331
0.829766776,0.869014514,0.434165111,0.051749472,0.766748447
0.327865017,0.938176948,0.216764504,0.216666543,0.278110502
0.243953506,0.030809033,0.450110334,0.097976735,0.762393831
0.484856452,0.312943244,0.443236377,0.017201097,0.038786057
0.803696521,0.328088545,0.764850865,0.090543472,0.023363909
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
0.021809601,0.406001002,0.072701623,0.964640184,0.023427393
0.406226618,0.421944527,0.413150342,0.337243905,0.515996389
0.829989793,0.168974332,0.246064043,0.067662474,0.851182924
0.812736737,0.667154845,0.118274705,0.484017732,0.052666038
0.215947395,0.145078319,0.484063281,0.79414799,0.373845815
0.497877968,0.554808367,0.370429652,0.081553316,0.793608698
0.607612542,0.424703584,0.208995066,0.249033837,0.808169709
0.199613478,0.065853429,0.77236195,0.757789625,0.597225697
0.044167285,0.1024231,0.959682778,0.892311813,0.621810775
0.861175219,0.853442735,0.742542086,0.704287769,0.435969078
0.706544823,0.062501379,0.482065481,0.598698867,0.845585046
0.967217599,0.13127149,0.294860203,0.191045015,0.590202032
0.031666757,0.965674812,0.177792841,0.419935921,0.895265056
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.306849588,0.177454423,0.538670939,0.602747137,0.081221293
0.729747557,0.11762043,0.409064884,0.051577964,0.666653287
0.492543468,0.097222882,0.448642979,0.130965724,0.48613413
0.0802024,0.726352481,0.457476151,0.647556514,0.033820374
0.617976299,0.934428994,0.197735831,0.765364856,0.350880707
0.07660401,0.285816636,0.276995238,0.047003343,0.770284864
0.620820688,0.700434525,0.896417099,0.652364756,0.93838793
0.364233925,0.200229902,0.648342989,0.919306736,0.897029239
0.606100716,0.203585366,0.167232701,0.523079381,0.767224301
0.616600448,0.130377791,0.554714839,0.468486555,0.582775753
0.254480861,0.933534632,0.054558237,0.948978985,0.731855548
0.620161044,0.583061202,0.457991555,0.441254272,0.657127968
0.415874646,0.408141761,0.843133575,0.40991199,0.540792744
0.254903429,0.655739954,0.977873649,0.210656057,0.072451639
0.473680525,0.298845701,0.144989283,0.998560665,0.223980961
0.30605008,0.837920854,0.450681322,0.887787908,0.793229776
0.584644405,0.423279153,0.444505314,0.686058204,0.041154856
from io import StringIO
import pandas as pd
data ="""
TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
"""
df = pd.read_csv(StringIO(data), header=None)
start_marker = 'TIME'
grouper = (df.iloc[:, 0] == start_marker).cumsum()
groups = df.groupby(grouper)
frames = [gr.T.set_index(gr.index[0]).T for _, gr in groups]
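Since everything was read with header=None, the values in each resulting frame are still strings. A small hedged follow-up (assuming every value below the embedded header rows is numeric) would be:
frames = [f.astype(float) for f in frames]   # cast each block's body to floats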

Getting an IndexError when trying to run pcolormesh from a pandas DataFrame

I'm trying to generate a pcolormesh plot from a large dataset, where the columns are in units of hertz, the rows are individual files, and the body is an array of magnitude values per file for each frequency. My DataFrame gets constructed correctly with the correct labels, but when I pass it to pcolormesh it throws the exception "arrays used as indices must be of integer (or boolean) type". The code I am attaching reflects a conversion of the frequency array to an integer array using .astype(int). Note that if I convert the PSD_array (the magnitudes) to integers, it DOES work (but isn't helpful); otherwise it doesn't like it. I also played around with other pcolormesh plots using decimals as the body of the DataFrame and they worked fine.
Ideas would be lovely, I'll keep working on it.
Code (note: specific file paths in calls are redacted):
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from natsort import natsorted

def file_List():
    files = [file for file in os.listdir('###')]
    file_list = []
    for file in files:
        file_list += [file]
    #print(natsorted(file_list))
    #print(len(file_list))
    return(natsorted(file_list))

###Using Fast Fourier Transform, take in file list generated from
###file_List program and perform FFT on each file.
###We read the files by adding them to the file directory. Could improve
###by making an overarching program that runs everything with an input
###file directory.
def FFT():
    '''
    runs through FFT for the files in a file list as determined by the
    file_List() program.
    '''
    file_list = file_List()  #runs file_List() program and saves
                             #the list of files as a variable.
    df = pd.DataFrame()
    freq_array = np.empty((0,204800))
    PSD_array = np.empty((0,204800))
    print(len(file_list))
    count = 0
    for file in file_list:
        while count < 10:
            file_read = pd.read_csv('###'+file, skiprows=22, sep='\t')
            df = pd.DataFrame(file_read, columns=['X_Value','Acceleration'])
            #print(df.head())
            q = df['Acceleration']  #data set input
            n = len(df['Acceleration'])  #number of data points
            dt = 2/len(df['X_Value'])
            f_hat = np.fft.fft(q,n)  #Runs FFT
            PSD = f_hat * np.conj(f_hat) / n  #Power Spectral Density
            freq = (1/(dt*n)) * np.arange(n)  #Creates x axis of frequencies
            freq_array = np.append(freq_array, np.array([freq]), axis=0)
            PSD_array = np.append(PSD_array, np.array([PSD]), axis=0)
            count += 1
    #trans_freq = np.transpose(freq_array)
    #trans_PSD = np.transpose(PSD_array)
    print(freq_array)
    int_freq = freq_array.astype(int)
    print(int_freq)
    #PSD_int = PSD_array.astype(int)
    PSD_df = pd.DataFrame(PSD_array, index=np.arange(len(PSD_array)), columns=int_freq[0])
    #print(np.arange(len(PSD_array)))
    print(PSD_df)
    return(PSD_df)

def heatmap(df):
    '''
    Constructs a heatmap given an input dataframe
    '''
    plt.pcolormesh(df)
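One detail worth noting as an aside (an assumption on my part, not a verified diagnosis of this exact traceback): f_hat * np.conj(f_hat) is complex-valued, so the body of PSD_df ends up with a complex128 dtype. A minimal sketch that keeps the magnitudes as real floats and hands pcolormesh a plain array instead of the labelled frame would be:
import numpy as np
import matplotlib.pyplot as plt

PSD_df = FFT()
plt.pcolormesh(np.real(PSD_df.values))  # real-valued ndarray, not the labelled DataFrame
plt.show()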

Correct way of passing dataframe to ray

I am trying to do the simplest thing with Ray, but no matter what I do it just never releases memory and fails.
The usage case is simply
read parquet files to DF -> pass to pool of actors -> make changes to DF -> return DF
import os
import ray
from ray.util import ActorPool
from fastparquet import ParquetFile, write  # ParquetFile/write appear to be fastparquet's API

@ray.remote
class Main_func:
    def calculate(self, data):
        #do some things with the DF
        return df.copy(deep=True)  # <- one of many attempts to fix the problem, but didnt work

cpus = 24
actors = []
for _ in range(cpus):
    actors.append(Main_func.remote())
pool = ActorPool(actors)

arr = os.listdir("/some/files")

def to_ray():
    try:
        filename = arr.pop(0)
        pf = ParquetFile("/some/files/" + filename)
        df = pf.to_pandas()
        pool.submit(lambda a, v: a.calculate.remote(v), df.copy(deep=True))
    except Exception as e:
        print(e)

for _ in range(cpus):
    to_ray()

while(True):
    res = pool.get_next_unordered()
    write('./temp/' + random_filename, res, compression='GZIP')
    del res
    to_ray()
I have tried other ways of doing the same thing, manually submitting rather than using the map command, but whatever I do it always locks up memory and fails after a few hundred dataframes.
Does each task need to preserve state across different files? If not, Ray's task abstraction should simplify things:
import os
import ray
import pandas as pd

ray.init()

@ray.remote
def read_and_write(path):
    df = pd.read_parquet(path)
    # ... do things
    df.to_parquet("./temp/...")

arr = os.listdir("/some/files")
results = ray.get([read_and_write.remote(path) for path in arr])
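As a side note not taken from the answer above: if memory is still a concern, the number of in-flight tasks can be capped with ray.wait so only a handful of DataFrames exist at once. A sketch, reusing read_and_write and arr from above and an arbitrary limit of 8:
pending = []
for path in arr:
    if len(pending) >= 8:                          # arbitrary in-flight limit
        done, pending = ray.wait(pending, num_returns=1)
        ray.get(done)                              # surface any errors from finished tasks
    pending.append(read_and_write.remote(path))
ray.get(pending)                                   # drain the remaining tasks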

how to make a memory efficient multiple dimension groupby/stack using xarray?

I have a large time series of np.float64 with a 5-min frequency (size is ~2,500,000 points, i.e. ~24 years).
I'm using Xarray to represent it in-memory and the time-dimension is named 'time'.
I want to group-by 'time.hour' and then 'time.dayofyear' (or vice versa) and remove both of their means from the time-series.
In order to do that efficiently, I need to reorder the time-series into a new xr.DataArray with the dimensions ['hour', 'dayofyear', 'rest'].
I wrote a function that plays with the GroupBy objects of Xarray and manages to do just that, although it takes a lot of memory to do so...
I have a machine with 32GB RAM and I still get a MemoryError from numpy.
I know the code works because I used it on an hourly re-sampled version of my original time-series. So here's the code:
def time_series_stack(time_da, time_dim='time', grp1='hour', grp2='dayofyear'):
    """Takes a time-series xr.DataArray object and reshapes it using
    grp1 and grp2. output is an xr.Dataset that includes the reshaped DataArray,
    its datetime-series and the grps."""
    import xarray as xr
    import numpy as np
    import pandas as pd
    # try to infer the freq and put it into attrs for later reconstruction:
    freq = pd.infer_freq(time_da[time_dim].values)
    name = time_da.name
    time_da.attrs['freq'] = freq
    attrs = time_da.attrs
    # drop all NaNs:
    time_da = time_da.dropna(time_dim)
    # group grp1 and concat:
    grp_obj1 = time_da.groupby(time_dim + '.' + grp1)
    s_list = []
    for grp_name, grp_inds in grp_obj1.groups.items():
        da = time_da.isel({time_dim: grp_inds})
        s_list.append(da)
    grps1 = [x for x in grp_obj1.groups.keys()]
    stacked_da = xr.concat(s_list, dim=grp1)
    stacked_da[grp1] = grps1
    # group over the concatenated da and concat again:
    grp_obj2 = stacked_da.groupby(time_dim + '.' + grp2)
    s_list = []
    for grp_name, grp_inds in grp_obj2.groups.items():
        da = stacked_da.isel({time_dim: grp_inds})
        s_list.append(da)
    grps2 = [x for x in grp_obj2.groups.keys()]
    stacked_da = xr.concat(s_list, dim=grp2)
    stacked_da[grp2] = grps2
    # numpy part:
    # first, loop over both dims and drop NaNs, append values and datetimes:
    vals = []
    dts = []
    for i, grp1_val in enumerate(stacked_da[grp1]):
        da = stacked_da.sel({grp1: grp1_val})
        for j, grp2_val in enumerate(da[grp2]):
            val = da.sel({grp2: grp2_val}).dropna(time_dim)
            vals.append(val.values)
            dts.append(val[time_dim].values)
    # second, we get the max of the vals after the second groupby:
    max_size = max([len(x) for x in vals])
    # we fill NaNs and NaT for the remainder of them:
    concat_sizes = [max_size - len(x) for x in vals]
    concat_arrys = [np.empty((x)) * np.nan for x in concat_sizes]
    concat_vals = [np.concatenate(x) for x in list(zip(vals, concat_arrys))]
    # 1970-01-01 is the NaT for this time-series:
    concat_arrys = [np.zeros((x), dtype='datetime64[ns]')
                    for x in concat_sizes]
    concat_dts = [np.concatenate(x) for x in list(zip(dts, concat_arrys))]
    concat_vals = np.array(concat_vals)
    concat_dts = np.array(concat_dts)
    # finally, we reshape them:
    concat_vals = concat_vals.reshape((stacked_da[grp1].shape[0],
                                       stacked_da[grp2].shape[0],
                                       max_size))
    concat_dts = concat_dts.reshape((stacked_da[grp1].shape[0],
                                     stacked_da[grp2].shape[0],
                                     max_size))
    # create a Dataset and DataArrays for them:
    sda = xr.Dataset()
    sda.attrs = attrs
    sda[name] = xr.DataArray(concat_vals, dims=[grp1, grp2, 'rest'])
    sda[time_dim] = xr.DataArray(concat_dts, dims=[grp1, grp2, 'rest'])
    sda[grp1] = grps1
    sda[grp2] = grps2
    sda['rest'] = range(max_size)
    return sda
So for the 2,500,000-item time-series, numpy throws the MemoryError, and I'm guessing this is my memory bottleneck. What can I do to solve this?
Would Dask help me? And if so, how can I implement it?
Like you, I ran it without issue when inputting a small time series (10,000 long). However, when inputting a 100,000-long time series xr.DataArray, the grp_obj2 for loop ran away and used all the memory of the system.
This is what I used to generate the time series xr.DataArray:
import numpy as np
import xarray as xr

n = 10**5
times = np.datetime64('2000-01-01') + np.arange(n) * np.timedelta64(5, 'm')
data = np.random.randn(n)
time_da = xr.DataArray(data, name='rand_data', dims=('time'), coords={'time': times})
# time_da.to_netcdf('rand_time_series.nc')
As you point out, Dask would be a way to solve it, but I can't see a clear path at the moment...
Typically, the way to approach this kind of problem with Dask would be to:
1. Make the input a dataset read from a file (like NetCDF). This does not load the file into memory but lets Dask pull data from disk one chunk at a time.
2. Define all calculations with dask.delayed or dask.futures methods for the entire body of code, up until writing the output. This is what allows Dask to read a small chunk of data, process it, and then write it.
3. Calculate one chunk of work and immediately write the output to a new dataset file. Effectively you end up streaming one chunk of input to one chunk of output at a time (but also threaded/parallelized); a rough sketch of this pattern follows the list.
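A minimal sketch of that chunked read-compute-write flow, assuming the series was saved to 'rand_time_series.nc' as in the snippet above (file names and chunk size are illustrative, and I haven't run this on the full 2.5M-point series):
import xarray as xr

ds = xr.open_dataset('rand_time_series.nc', chunks={'time': 100000})  # lazy, Dask-backed
anom = ds['rand_data'] - ds['rand_data'].mean('time')                 # still lazy
anom.to_dataset(name='rand_data_anom').to_netcdf('anomalies.nc')      # computed chunk by chunk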
I tried importing Dask and breaking the input time_da xr.DataArray into chunks for Dask to work on but it didn't help. From what I can tell, the line stacked_da = xr.concat(s_list, dim=grp1) forces Dask to make a full copy of stacked_da in memory and much more...
One workaround to this is to write stacked_da to disk then immediately read it again:
##For group1
xr.concat(s_list, dim=grp1).to_netcdf('stacked_da1.nc')
stacked_da = xr.load_dataset('stacked_da1.nc')
stacked_da[grp1] = grps1
##For group2
xr.concat(s_list, dim=grp2).to_netcdf('stacked_da2.nc')
stacked_da = xr.load_dataset('stacked_da2.nc')
stacked_da[grp2] = grps2
However, the file size for stacked_da1.nc is 19MB and stacked_da2.nc gets huge at 6.5GB. This is for time_da with 100,000 elements... so there's clearly something amiss...
Originally, it sounded like you want to subtract the mean of the groups from the time series data. It looks like the Xarray docs have an example for that: http://xarray.pydata.org/en/stable/groupby.html#grouped-arithmetic
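For reference, a minimal version of that grouped-arithmetic pattern applied to this series (using the time_da from the snippet above; a sketch of the docs example, not a drop-in replacement for the reshaping function):
hourly_mean = time_da.groupby('time.hour').mean('time')
doy_mean = time_da.groupby('time.dayofyear').mean('time')
anom = time_da.groupby('time.hour') - hourly_mean       # remove the hourly mean
anom = anom.groupby('time.dayofyear') - doy_mean        # then the day-of-year mean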
The key is to group once, loop over the groups, then group again on each of the groups and append the result to a list.
Next I concat and use pd.MultiIndex.from_product for the groups.
No memory problems and no Dask needed, and it only takes a few seconds to run.
Here's the code, enjoy:
def time_series_stack(time_da, time_dim='time', grp1='hour', grp2='month',
                      plot=True):
    """Takes a time-series xr.DataArray objects and reshapes it using
    grp1 and grp2. output is a xr.Dataset that includes the reshaped DataArray
    , its datetime-series and the grps. plots the mean also"""
    import xarray as xr
    import pandas as pd
    # try to infer the freq and put it into attrs for later reconstruction:
    freq = pd.infer_freq(time_da[time_dim].values)
    name = time_da.name
    time_da.attrs['freq'] = freq
    attrs = time_da.attrs
    # drop all NaNs:
    time_da = time_da.dropna(time_dim)
    # first grouping:
    grp_obj1 = time_da.groupby(time_dim + '.' + grp1)
    da_list = []
    t_list = []
    for grp1_name, grp1_inds in grp_obj1.groups.items():
        da = time_da.isel({time_dim: grp1_inds})
        # second grouping:
        grp_obj2 = da.groupby(time_dim + '.' + grp2)
        for grp2_name, grp2_inds in grp_obj2.groups.items():
            da2 = da.isel({time_dim: grp2_inds})
            # extract datetimes and rewrite time coord to 'rest':
            times = da2[time_dim]
            times = times.rename({time_dim: 'rest'})
            times.coords['rest'] = range(len(times))
            t_list.append(times)
            da2 = da2.rename({time_dim: 'rest'})
            da2.coords['rest'] = range(len(da2))
            da_list.append(da2)
    # get group keys:
    grps1 = [x for x in grp_obj1.groups.keys()]
    grps2 = [x for x in grp_obj2.groups.keys()]
    # concat and convert to dataset:
    stacked_ds = xr.concat(da_list, dim='all').to_dataset(name=name)
    stacked_ds[time_dim] = xr.concat(t_list, 'all')
    # create a multiindex for the groups:
    mindex = pd.MultiIndex.from_product([grps1, grps2], names=[grp1, grp2])
    stacked_ds.coords['all'] = mindex
    # unstack:
    ds = stacked_ds.unstack('all')
    ds.attrs = attrs
    return ds
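A quick usage sketch with the synthetic series generated earlier, assuming you then want the per-cell means to subtract from the original series (not taken from the answer itself):
ds = time_series_stack(time_da, time_dim='time', grp1='hour', grp2='dayofyear')
cell_means = ds['rand_data'].mean('rest')   # mean per (hour, dayofyear) cell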

return a list from class object

I am using the multiprocessing module to generate 35 dataframes; I hope this will save time. The problem is that the class does not return anything. I expect the list of dataframes to be returned from self.dflist.
Here is how the dfnames list is created:
urls = []
fnames = []
dfnames = []
for x in xrange(100, 3600, 100):
    y = str(x)
    i = y.zfill(4)
    filename = 'DCHB_Town_Release_' + i + '.xlsx'
    url = "http://www.censusindia.gov.in/2011census/dchb/" + filename
    urls.append(url)
    fnames.append(filename)
    dfnames.append((filename, 'DCHB_Town_Release_' + i))
This is the class that uses the dfnames generated by above code.
import pandas as pd
import multiprocessing

class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        self.dflist = list()
        self.jobs = list()
        self.dfnames = dfnames

    def dframe_create(self, filename, dfname):
        print 'abc', filename, dfname
        dfname = pd.read_excel(filename)
        self.dflist.append(dfname)
        print self.dflist
        return self.dflist

    def mp(self):
        for f, d in self.dfnames:
            p = multiprocessing.Process(target=self.dframe_create, args=(f, d))
            self.jobs.append(p)
            p.start()
        #return self.dflist
        for j in self.jobs:
            j.join()
            print '%s.exitcode = %s' % (j.name, j.exitcode)
When this class is called like this...
dflist=[]
jobs=[]
x=mydf1(dflist, jobs, dfnames)
y=x.mp()
...it prints self.dflist correctly, but does not return anything.
I can collect all dataframes sequentially, but in order to save time I need to use multiple processes simultaneously to generate dataframes and add them to a list.
In your case I prefer to write as little code as possible and use Pool:
import pandas as pd
import logging
import multiprocessing

def dframe_create(filename):
    try:
        return pd.read_excel(filename)
    except Exception as e:
        logging.error("Something went wrong: %s", e, exc_info=1)
        return None

p = multiprocessing.Pool()
excel_files = p.map(dframe_create, fnames)  # fnames is the plain filename list from the question

for f in excel_files:
    if f is not None:
        print 'Ready to work'
    else:
        print ':('
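If you still want the name-to-dataframe mapping from the question, a small hedged follow-up (relying on Pool.map preserving input order, and on dfnames being built in the same loop as fnames) could pair the results back up:
# dfnames holds (filename, dfname) tuples, in the same order as fnames
dfdict = {name: df for (filename, name), df in zip(dfnames, excel_files) if df is not None}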
Prints the self.dflist correctly. But does not return anything.
That's because you don't have a return statement in the mp method, e.g.
def mp(self):
    ...
    return self.dflist
It's not entirely clear what your issue is; however, you have to take some care here in that you can't just share plain objects/lists across processes. That's why there are special proxy objects (which lock while they make modifications to a list), so you don't get tripped up when two processes try to make a change at the same time (and you only end up with one update).
That is, you have to use a list from a multiprocessing Manager.
class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        manager = multiprocessing.Manager()
        self.dflist = manager.list()  # perhaps should be manager.list(dflist or ())
        self.jobs = list()
        self.dfnames = dfnames
However, you have a bigger problem: the whole point of multiprocessing is that the processes may run/finish out of order, so keeping two parallel lists like this is doomed to fail. You should use a Manager dict so that each DataFrame is saved unambiguously against its name.
class mydf1():
    def __init__(self, dflist, jobs, dfnames):
        manager = multiprocessing.Manager()
        self.dfdict = manager.dict()
        ...

    def dframe_create(self, filename, dfname):
        print 'abc', filename, dfname
        df = pd.read_excel(filename)
        self.dfdict[dfname] = df