Exception appending DataFrame chunk with string values to large HDF5 file using pandas - pandas

An exception happens while appending pandas.DataFrame() with string values (numerical values are OK) to HDF5 storage after filesize is greater than approximately 47 GiB. Neither minimal size of string, number of records, number of columns is not important. Filesize is important.
The bottom of the exception trace:
File "..\..\hdf5-1.8.14\src\H5FDsec2.c", line 822, in H5FD_sec2_write
file write failed: time = Tue Aug 18 18:26:17 2015
, filename = 'large_file.h5', file descriptor = 4, errno = 22, error message = 'Invalid argument', buf = 0000000066A40018, total write size = 262095, bytes this sub-write = 262095, bytes actually written = 18446744073709551615, offset = 47615949533
the code to reproduce:
import numpy as np
import pandas as pd
for i in range(200):
df = pd.DataFrame(np.char.mod('random string object (%f)', np.random.rand(5000000,3)), columns=('A','B','C'))
print('writing chunk №', i, '...', end='', flush=True)
with pd.HDFStore('large_file.h5') as hdf:
# Construct unique index
try:
nrows = hdf.get_storer('df').nrows
except:
nrows = 0
df.index = pd.Series(df.index) + nrows
# Append the dataframe to the storage. Exception hppens here
hdf.append('df', df, format='table')
print('done')
Environment:
Windows7 x64 machine, python 3.4.3, pandas 0.16.2, pytables 3.2.0, HDF5 1.8.14.
The question is how to fix the problem if it is located in python code above or how to avoid it if related to HDF5. Thanks.

Related

Pandas Making multiple HTTP requests

I have below code that reads from a csv file a number of ticker symbols into a dataframe.
Each ticker calls the Web Api returning a dafaframe df which is then attached to the last one until complete. The code works , but when a large number of tickers is used the code slows down tremendously. I understand I can use multiprocessing and threads to speed up my code but dont know where to start and what would be the most suited in my particular case.
What code should I use to get my data into a combined daframe in the fastest possible manner?
import pandas as pd
import numpy as np
import json
tickers=pd.read_csv("D:/verhuizen/pensioen/MULTI.csv",names=['symbol','company'])
read_str='https://financialmodelingprep.com/api/v3/income-statement/AAPL?limit=120&apikey=demo'
df = pd.read_json (read_str)
df = pd.DataFrame(columns=df.columns)
for ind in range(len(tickers)):
read_str='https://financialmodelingprep.com/api/v3/income-statement/'+ tickers['symbol'][ind] +'?limit=120&apikey=demo'
df1 = pd.read_json (read_str)
df=pd.concat([df,df1], ignore_index=True)
df.set_index(['date','symbol'], inplace=True)
df.sort_index(inplace=True)
df.to_csv('D:/verhuizen/pensioen/MULTI_out.csv')
The code provided works fine for smaller data sets, but when I use a large number of tickers (>4,000) at some point I get the below error. Is this because the web api gets overloaded or is there another problem?
Traceback (most recent call last):
File "D:/Verhuizen/Pensioen/Equity_Extractor_2021.py", line 43, in <module>
data = pool.starmap(download_data, enumerate(TICKERS, start=1))
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 276, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00C33E30>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
It keeps giving the same error (for a larger amount of tickers)
code is exactly as provided:
def download_data(pool_id, symbols):
df = []
for symbol in symbols:
print("[{:02}]: {}".format(pool_id, symbol))
#do stuff here
read_str = BASEURL.format(symbol)
df.append(pd.read_json(read_str))
#df.append(pd.read_json(fake_data(symbol)))
return pd.concat(df, ignore_index=True)
It failed again with the pool.map, but one strange thing I noticed. Each time it fails it does so around 12,500 tickers (total is around 23,000 tickers) Similar error:
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 21, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x078D1BF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
I get the tickers also from a API call https://financialmodelingprep.com/api/v3/financial-statement-symbol-lists?apikey=demo (I noticed it does not work without subscription), I wanted to attach the data it as a csv file but I dont have sufficient rights. I dont think its a good idea to paste the returned data here...
I tried adding time.sleep(0.2) before return as suggested, but again I ge the same error at ticker 12,510. Strange everytime its around the same location. As there are multiple processes going on I cannot see at what point its breaking
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_naive.py", line 24, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x00F32C90>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
Process finished with exit code 1
Something very very strange is going on , I have split the data in chunks of 10,000 / 5,000 / 4,000 and 2,000 and each time the code breaks approx 100 tickers from the end. Clearly there is something going on that not right
import time
import pandas as pd
import multiprocessing
# get tickers from your csv
df=pd.read_csv('D:/Verhuizen/Pensioen/All_Symbols.csv',header=None)
# setting the Dataframe to a list (in total 23,000 tickers)
df=df[0]
TICKERS=df.tolist()
#Select how many tickers I want
TICKERS=TICKERS[0:2000]
BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
def download_data(symbol):
print(symbol)
# do stuff here
read_str = BASEURL.format(symbol)
df = pd.read_json(read_str)
#time.sleep(0.2)
return df
if __name__ == "__main__":
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
data = pool.map(download_data, TICKERS)
df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
df.to_csv('D:/verhuizen/pensioen/Income_2000.csv')
In this particular example the code breaks at position 1,903
RPAI
Traceback (most recent call last):
File "C:/Users/MLUY/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/Equity_testing.py", line 27, in <module>
data = pool.map(download_data, TICKERS)
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\MLUY\AppData\Local\Programs\Python\Python37-32\lib\multiprocessing\pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x0793EAF0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
First optimization is to avoid concatenate your dataframe at each iteration.
You can try something like that:
url = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
df = []
for symbol in tickers["symbol"]:
read_str = url.format(symbol)
df.append(pd.read_json(read_str))
df = pd.concat(df, ignore_index=True)
If it's not sufficient, we will see to use async, threading or multiprocessing.
Edit:
The code below can do the job:
import pandas as pd
import numpy as np
import multiprocessing
import time
import random
PROCESSES = 4 # number of parallel process
CHUNKS = 6 # one process handle n symbols
# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
"XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
"TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
"AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
"MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
"CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]
# create a list of n sublist
TICKERS = [TICKERS[i:i + CHUNKS] for i in range(0, len(TICKERS), CHUNKS)]
BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
def fake_data(symbol):
dti = pd.date_range("1985", "2020", freq="Y")
df = pd.DataFrame({"date": dti, "symbol": symbol,
"A": np.random.randint(0, 100, size=len(dti)),
"B": np.random.randint(0, 100, size=len(dti))})
time.sleep(random.random()) # to simulate network delay
return df.to_json()
def download_data(pool_id, symbols):
df = []
for symbol in symbols:
print("[{:02}]: {}".format(pool_id, symbol))
# do stuff here
# read_str = BASEURL.format(symbol)
# df.append(pd.read_json(read_str))
df.append(pd.read_json(fake_data(symbol)))
return pd.concat(df, ignore_index=True)
if __name__ == "__main__":
with multiprocessing.Pool(PROCESSES) as pool:
data = pool.starmap(download_data, enumerate(TICKERS, start=1))
df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
In this example, I split the list of tickers into sublists for each process retrieves data for multiple symbols and limits overhead due to create and destroy processes.
The delay is to simulate the response time from the network connection and highlight the multiprocess behaviour.
Edit 2: simpler but naive version for your needs
import pandas as pd
import multiprocessing
# get tickers from your csv
TICKERS = ["BCDA", "WBAI", "NM", "ZKIN", "TNXP", "FLY", "MYSZ", "GASX", "SAVA", "GCE",
"XNET", "SRAX", "SINO", "LPCN", "XYF", "SNSS", "DRAD", "WLFC", "OILD", "JFIN",
"TAOP", "PIC", "DIVC", "MKGI", "CCNC", "AEI", "ZCMD", "YVR", "OCG", "IMTE",
"AZRX", "LIZI", "ORSN", "ASPU", "SHLL", "INOD", "NEXI", "INR", "SLN", "RHE-PA",
"MAX", "ARRY", "BDGE", "TOTA", "PFMT", "AMRH", "IDN", "OIS", "RMG", "IMV",
"CHFS", "SUMR", "NRG", "ULBR", "SJI", "HOML", "AMJL", "RUBY", "KBLMU", "ELP"]
BASEURL = "https://financialmodelingprep.com/api/v3/income-statement/{}?limit=120&apikey=demo"
def download_data(symbol):
print(symbol)
# do stuff here
read_str = BASEURL.format(symbol)
df = pd.read_json(read_str)
return df
if __name__ == "__main__":
with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
data = pool.map(download_data, TICKERS)
df = pd.concat(data).set_index(["date", "symbol"]).sort_index()
Note about pool.map: for each symbol in TICKERS, create a process and call function download_data.

concatenate results after multiprocessing

I have a function which is creating a data frame by doing multiprocessing on a df:-
Suppose if I am having 10 rows in my df so the function processor will process all 10 rows separately. what I want is to concatenate all the output of the function processor and make one data frame.
def processor(dff):
"""
reading data from a data frame and doing all sorts of data manipulation
for multiprocessing
"""
return df
def main(infile, mdebug):
global debug
debug = mdebug
try:
lines = sum(1 for line in open(infile))
except Exception as err:
print("Error {} opening file: {}").format(err, infile)
sys.exit(2000)
if debug >= 2:
print(infile)
try:
dff = pd.read_csv(infile)
except Exception as err:
print("Error {}, opening file: {}").format(err, infile)
sys.exit(2000)
df_split = np.array_split(dff, (lines+1))
cores = multiprocessing.cpu_count()
cores = 64
# pool = Pool(cores)
pool = Pool(lines-1)
for n, frame in enumerate(pool.imap(processor, df_split), start=1):
if frame is not None:
frame.to_csv('{}'.format(n))
pool.close()
pool.join()
if __name__ == "__main__":
args = parse_args()
"""
print "Debug is: {}".format(args.debug)
"""
if args.debug >= 1:
print("Running in debug mode: "), args.debug
main(infile=args.infile, mdebug=args.debug)
you can use either the data frame constructor or concat to solve your problem. the appropriate one to use depends on details of your code that you haven't included
here's a more complete example:
import numpy as np
import pandas as pd
# create dummy dataset
dff = pd.DataFrame(np.random.rand(101, 5), columns=list('abcde'))
# process data
with Pool() as pool:
result = pool.map(processor, np.array_split(dff, 7))
# put it all back together in one dataframe
result = np.concat(result)

How to calculate tf-idf when working on .txt files in python 3.7?

I have books in pdf and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc on those books. So I converted them into .txt files and was trying to get tf-idf scores. Previously I performed tf-idf on a CSV file, so I made some changes in that code and tried to use it for .txt file. But I am unsuccessful in my attempt.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
This code is working until print ('Non Zero Count :', cvec_count.shape) and printing:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it is giving error:
ZeroDivisionError: division by zero
Even if I run this code with ignoring ZeroDivisionError, still it is wrong as it is not counting any frequencies.
I have no idea how to work around .txt file. What is the proper way to work on .txt file for NLP tasks?
Thanks in advance!
You are getting the error because data variable is empty or wrong type. Just opening the text file is not enough. You have to read the contents into a string variable and then do the preprocessing on that variable. Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
data = file.read()

Use pandas to read the csv file with several uncertain factors

I have asked the related question of string in: Find the number of \n before a given word in a long string. But this method cannot solve the complicate case I happened to. Thus I want to find out a solution of Pandas here.
I have a csv file (I just represent as a string):
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I want to use the pandas:
value = pandas.read_csv(csvfile, sep = '\t', skiprows = 3).set_index('maturity')
to obtain the table like:
and set the first columan maturity as index.
But there are several uncertain factors in the csvfile:
1..set_index('maturity'), the key maturity
of index is included in the row key: maturity. Then I should find the row key: xxxx and obtain the string xxxx
2.skiprows = 3: the number of skipped rows before the title:
is uncertain. The csvfile can be something like:
'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I should find the row number of title (namely the row beginning with xxxx found in the rowkey: xxxx).
3.sep = '\t': the csvfile may use space as separator like:
csvfile = 'Idnum Id\nkey: maturity\n2\nmaturity para1 para2\n1Y 0 0\n2Y 0 0'
So is there any general code of pandas to deal with the csvfile with above uncertain factors?
Actually the string:
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
is from a StringIO: data
data.getvalue() = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
I am not familiar with this structure and even I want to obtain a table of original data without any edition by using:
value = pandas.read_csv(data, sep = '\t')
There will be a error.
You can read the file line by line, collecting the necessary information and then pass the remainder to pd.read_csv with the appropriate arguments:
from io import StringIO
import re
import pandas as pd
with open('data.csv') as fh:
key = next(filter(lambda x: x.startswith('key:'), fh)).lstrip('key:').strip()
header = re.split('[ \t]+', next(filter(lambda x: x.startswith(key), fh)).strip())
df = pd.read_csv(StringIO(fh.read()), header=None, names=header, index_col=0, sep=r'\s+')
Example for data via StringIO:
fh = StringIO('Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0')
key = next(filter(lambda x: x.startswith('key:'), fh)).lstrip('key:').strip()
header = re.split('[ \t]+', next(filter(lambda x: x.startswith(key), fh)).strip())
df = pd.read_csv(fh, header=None, names=header, index_col=0, sep=r'\s+')
If you do not mind reading the csv file twice you can try doing something like:
from io import StringIO
csvfile = 'Idnum\tId\nkey:maturity\n2\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', error_bad_lines=False, header=None)
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows)
same for you other example:
csvfile = 'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', error_bad_lines=False, header=None)
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows)
This assumes that you know the sep of the file
Also if you want to find the key:
csvfile = 'Idnum\tId\nkey:maturity\n2\n\n\n\n\n\nmaturity\tpara1\tpara2\n1Y\t0\t0\n2Y\t0\t0'
data = pd.read_csv(StringIO(csvfile), sep='\t', error_bad_lines=False, header=None)
key = [x.replace('key:','') for x in data[0] if x.find('key')>-1]
skiprows = len(data)
pd.read_csv(StringIO(csvfile), sep='\t', skiprows=skiprows).set_index(key)

How to clear Dataframe memory in pandas?

I am converting fixedwidth file to delimiter file ('|' delimiter) using pandas read_fwf method. My input file ("infile.txt") is around 16GB and 9.9 Million records, while creating a dataframe it is occupying almost 3times of memory(around 48GB) before it creates outputfile. Can someone help me in impoving below logic please and through somelight where this extra memory is from (I know 'seq_id, fname and loaddatime will occupy some space it should in couple of GBs only).
Note:
I am processing multiple files(similar size files) in loop one after the other. so i have to clear the memory before next file takes over.
'''infile.txt'''
1234567890AAAAAAAAAA
1234567890BBBBBBBBBB
1234567890CCCCCCCCCC
'''test_layout.csv'''
FIELD_NAME,START_POS,END_POS
FIELD1,0,10
FIELD2,10,20
'''test.py'''
import datetime
import pandas as pd
import csv
from collections import OrderedDict
import gc
seq_id = 1
fname= 'infile.txt'
loadDatetime = '04/10/2018'
in_layout = open("test_layout.csv","rt")
reader = csv.DictReader(in_layout)
boundries, col_names = [[],[]]
for row in reader:
boundries.append(tuple([int(str(row['START_POS']).strip()) , int(str(row['END_POS']).strip())]))
col_names.append(str(row['FIELD_NAME']).strip())
dataf = pd.read_fwf(fname, quoting=3, colspecs = boundries, dtype = object, names = col_names)
len_df = len(dataf)
'''Used pair of key, value tuples and OrderedDict to preserve the order of the columns'''
mod_dataf = pd.DataFrame(OrderedDict((('seq_id',[seq_id]*len_df),('fname',[fname]*len_df))), dtype=object)
ldt_ser = pd.Series([loadDatetime]*len_df,name='loadDatetime', dtype=object)
dataf = pd.concat([mod_dataf, dataf],axis=1)
alldfs = [mod_dataf]
del alldfs
gc.collect()
mod_dataf = pd.DataFrame()
dataf = pd.concat([dataf,ldt_ser],axis=1)
dataf.to_csv("outfile.txt", sep='|', quoting=3, escapechar='\\' , index=False, header=False,encoding='utf-8')
''' Release Memory used by DataFrames '''
alldfs = [dataf]
del ldt_ser
del alldfs
gc.collect()
dataf = pd.DataFrame()
I used garbage collector , del dataframe and initialised to clear memory used but still total memory is not released from dataframe.
Inspired by https://stackoverflow.com/a/49144260/2799214
'''OUTPUT'''
1|infile.txt|1234567890|AAAAAAAAAA|04/10/2018
1|infile.txt|1234567890|BBBBBBBBBB|04/10/2018
1|infile.txt|1234567890|CCCCCCCCCC|04/10/2018
I had the same problem as you using https://stackoverflow.com/a/49144260/2799214
I found a solution using gc.collect() by splitting my code in different methods within a class. For example:
Class A:
def __init__(self):
# your code
def first_part_of_my_code(self):
# your code
# I want to clear my dataframe
del my_dataframe
gc.collect()
my_dataframe = pd.DataFrame() # not sure whether this line really helps
return my_new_light_dataframe
def second_part_of_my_code(self):
# my code
# same principle
So When the program call the methods, The garbage collector clear the memory once the program leaves the method.