MemoryError when opening CSV file with pandas

I'm trying to open a CSV file with pandas, but I'm getting a MemoryError. The file is around 300 MB. Everything works fine when I use a smaller file.
I am using Windows 10 with 64 GB of RAM. I already tried to change the custom VM options in PyCharm ("Help" >> "Edit Custom VM Options") and set higher memory limits, but it still doesn't work.
import pandas as pd
df = pd.read_csv('report_OOP_Full.csv')
# I tried to add the following line, but it doesn't help
# df.info(memory_usage='deep')
MemoryError: Unable to allocate 344. MiB for an array with shape (14, 3216774) and data type float64
Process finished with exit code 1

This may not be the most efficient way, but give it a go.
Reduce or increase the chunk size depending on your RAM availability.
chunks = pd.read_csv('report_OOP_Full.csv', chunksize=10000)
i = 0
chunk_list = []
for chunk in chunks:
    i += 1
    chunk_list.append(chunk)
df = pd.concat(chunk_list, sort=True)
If this doesn't work, try this:
chunks = pd.read_csv('report_OOP_Full.csv', chunksize=10000)
i = 0
chunk_list = []
for chunk in chunks:
    if i >= 10:
        break
    i += 1
    chunk_list.append(chunk)
df1 = pd.concat(chunk_list, sort=True)

chunks = pd.read_csv('report_OOP_Full.csv', skiprows=100000, chunksize=10000)
i = 0
chunk_list = []
for chunk in chunks:
    if i >= 10:
        break
    i += 1
    chunk_list.append(chunk)
df2 = pd.concat(chunk_list, sort=True)

df3 = pd.concat([df1, df2], sort=True)
skiprows is calculated from how many rows the previous dataframe has already read in (10 chunks × 10,000 rows = 100,000).
The first loop breaks after 10 chunks are loaded and stores them as df1; the file is then read again starting at chunk 11, stored as df2, and the two parts are concatenated.
I understand that you're working with some big data. I encourage you to take a look at this function I found; the link below explains how it works.
Credit for this function is here: credit
import numpy as np

def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.uint8).min and c_max < np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.uint16).min and c_max < np.iinfo(np.uint16).max:
                    df[col] = df[col].astype(np.uint16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.uint32).min and c_max < np.iinfo(np.uint32).max:
                    df[col] = df[col].astype(np.uint32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                elif c_min > np.iinfo(np.uint64).min and c_max < np.iinfo(np.uint64).max:
                    df[col] = df[col].astype(np.uint64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
This will make sure your dataframe uses as little memory as possible while you're working with it.
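For example, here is an untested sketch that combines the chunked reading above with this function, downcasting each chunk before concatenating so the peak memory stays lower (file name and chunksize taken from the question; chunksize is arbitrary):

import pandas as pd

# Downcast each chunk as it is read, then concatenate.
chunk_list = []
for chunk in pd.read_csv('report_OOP_Full.csv', chunksize=100000):
    chunk_list.append(reduce_mem_usage(chunk))

df = pd.concat(chunk_list, sort=True)
df.info(memory_usage='deep')

One caveat with this per-chunk approach: different chunks can end up with different downcast dtypes (their min/max values differ), in which case concat will upcast those columns again, so the saving may be smaller than the per-chunk printouts suggest.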

I guess another way would be to load only the rows that have the same value in the first column (in this case a string, one letter). I don't know if that is possible. For example:
A 4 5 6 3
A 3 4 5 7
A 2 1 4 9
A 1 1 8 7
B 1 2 3 1
B 2 2 3 3
C 1 2 1 2
First open a dataframe with only the rows starting with "A", later do the same with "B", "C" and so on. I don't know if that's possible, but it could help.
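A possible way to do that without loading the whole file is to read in chunks and keep only the matching rows. This is an untested sketch; the column name 'group' is a placeholder for whatever the real first column is called:

import pandas as pd

def read_rows_for_value(path, value, column='group', chunksize=100000):
    # Read only the rows whose `column` equals `value`, one chunk at a time.
    kept = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        kept.append(chunk[chunk[column] == value])
    return pd.concat(kept, ignore_index=True)

# e.g. only the "A" rows first, then "B", and so on
df_a = read_rows_for_value('report_OOP_Full.csv', 'A')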

Related

How to add new columns to pandas data frame using .loc

I am doing some very simple daily stock calculations in a data frame (e.g. SMA, VWAP, RSI, etc.). After I upgraded to Anaconda 3.0, my code stopped working and gives the following error. I don't have much experience in coding and need some help.
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['RSI', 'ZONE'], dtype='object'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
The code follows.
import yfinance as yf
import pandas as pd

def convert_to_dataframe_daily(data):
    window = 10
    window20 = 20
    window50 = 50
    window100 = 100
    window200 = 200
    ema_time = 8
    #data = yf.download("googl", period="30d", interval="5m")
    #data = yf.download('TSLA', period='30d', interval='5m')
    pd.set_option('display.max_columns', None)
    #calculation for VWAP
    volumeC = data['Volume']
    priceC = data['Close']
    df = data.assign(VWAP=((volumeC * priceC).cumsum() / volumeC.cumsum()).ffill())
    #Convert the timezone to Chicago central
    #df.index = pd.DatetimeIndex(df.index.tz_convert('US/Central')) # aware--> aware
    #reset the dataframe index and separate time
    df.reset_index(inplace=True)
    #df.index.intersection
    #df2 = df[df.index.isin(dts)]
    #df['Date'] = pd.to_datetime(df['Datetime']).dt.date
    #df['Time'] = pd.to_datetime(df['Datetime']).dt.time
    # calculate stochastic
    df['low5'] = df['Low'].rolling(5).min()
    df['high5'] = df['High'].rolling(5).max()
    #k = 100 * (c - l) / (h - l)
    df['K'] = (df['Close'] - df['low5']) / (df['high5'] - df['low5'])
    #s.reindex([1, 2, 3])
    columns = df.columns.values.tolist()
    #df[columns[index]]
    #df = pd.DataFrame(np.random.randn(8, 4),index=dates, columns=['A', 'B', 'C', 'D'])
    df = df.loc[:, ('Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE')]
    #df = df.reindex(['Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE'])
    df['RSI'] = calculate_rsi(df)
    filter_Z1 = df['K'] <= 0.1
    filter_Z2 = (df['K'] > 0.1) & (df['K'] <= 0.2)
    filter_Z3 = (df['K'] > 0.2) & (df['K'] <= 0.3)
    filter_Z4 = (df['K'] > 0.3) & (df['K'] <= 0.4)
    filter_Z5 = (df['K'] > 0.4) & (df['K'] <= 0.5)
    filter_Z6 = (df['K'] > 0.5) & (df['K'] <= 0.6)
    filter_Z7 = (df['K'] > 0.6) & (df['K'] <= 0.7)
    filter_Z8 = (df['K'] > 0.7) & (df['K'] <= 0.8)
    filter_Z9 = (df['K'] > 0.8) & (df['K'] <= 0.9)
    filter_Z10 = (df['K'] > 0.9) & (df['K'] <= 1)
    #plug in stochastic zones
    df['ZONE'].where(-filter_Z1, 'Z1', inplace=True)
    df['ZONE'].where(-filter_Z2, 'Z2', inplace=True)
    df['ZONE'].where(-filter_Z3, 'Z3', inplace=True)
    df['ZONE'].where(-filter_Z4, 'Z4', inplace=True)
    df['ZONE'].where(-filter_Z5, 'Z5', inplace=True)
    df['ZONE'].where(-filter_Z6, 'Z6', inplace=True)
    df['ZONE'].where(-filter_Z7, 'Z7', inplace=True)
    df['ZONE'].where(-filter_Z8, 'Z9', inplace=True)
    df['ZONE'].where(-filter_Z9, 'Z9', inplace=True)
    df['ZONE'].where(-filter_Z10, 'Z10', inplace=True)
    df = df['Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE']
    return df

data = yf.download('ba', period='500d', interval='1d')
df = convert_to_dataframe_daily(data)
print(df)
A few lines need to be tweaked.
Instead of
df = df.loc[:, ('Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE')]
use
df = df[['Date','Open','High','Low', 'Close','Volume','VWAP','K']]
before
df['ZONE'].where(-filter_Z1, 'Z1', inplace=True)
...
put a line
df['ZONE'] = 0
The line before return df should be changed to
df = df[['Date','Open','High','Low', 'Close','Volume','VWAP','K','RSI', 'ZONE']]
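Putting those tweaks together, the relevant part of the function might look like this. This is an untested sketch: calculate_rsi is the poster's own helper, the filter_Z definitions stay as in the question, and I've also swapped the unary - for ~, pandas' element-wise inverse, which is what .where() expects with boolean masks:

    # select only the columns that already exist at this point
    df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'K']]

    df['RSI'] = calculate_rsi(df)

    filter_Z1 = df['K'] <= 0.1
    filter_Z2 = (df['K'] > 0.1) & (df['K'] <= 0.2)
    # ... filters Z3 to Z10 exactly as in the question ...

    # create ZONE before .where() fills it
    df['ZONE'] = 0
    df['ZONE'].where(~filter_Z1, 'Z1', inplace=True)
    df['ZONE'].where(~filter_Z2, 'Z2', inplace=True)
    # ... and so on for Z3 to Z10 ...

    df = df[['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'VWAP', 'K', 'RSI', 'ZONE']]
    return df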

Python 3: speeding up computation over a dataframe

I have a dataframe (df) as follows:
id  date        t_slot  dayofweek  label
1   2021-01-01  2       0          1
1   2021-01-02  3       1          0
2   2021-01-01  4       6          1
...
The data frame is very large (6 million rows). t_slot takes values from 1 to 6, and dayofweek from 0 to 6.
For each row, I want to get the following rates over that id's records in the 3 months before the row's date:
- the rate of label == 1 when t_slot is 1 to 4 and dayofweek is 0 to 4
- the rate of label == 1 when t_slot is 5 to 6 and dayofweek is 0 to 4
- the rate of label == 1 when t_slot is 1 to 4 and dayofweek is 5 to 6
- the rate of label == 1 when t_slot is 5 to 6 and dayofweek is 5 to 6
I have used a loop to compute the rates, but it is very slow. Do you have a faster way to compute it? My code is copied below:
def get_time_slot_rate(df):
    import numpy as np
    if len(df) == 0:
        return np.nan, np.nan, np.nan, np.nan
    else:
        work = df.loc[df['dayofweek'] < 5]
        weekend = df.loc[df['dayofweek'] >= 5]
        if len(work) == 0:
            work_14, work_56 = np.nan, np.nan
        else:
            work_14 = len(work.loc[(work['time_slot'] < 5) * (work['label'] == 1)]) / len(work)
            work_56 = len(work.loc[(work['time_slot'] > 5) * (work['label'] == 1)]) / len(work)
        if len(weekend) == 0:
            weekend_14, weekend_56 = np.nan, np.nan
        else:
            weekend_14 = len(weekend.loc[(weekend['time_slot'] < 5) * (weekend['label'] == 1)]) / len(weekend)
            weekend_56 = len(weekend.loc[(weekend['time_slot'] > 5) * (weekend['label'] == 1)]) / len(weekend)
    return work_14, work_56, weekend_14, weekend_56

import datetime as d_t

lst_id = list(df['id'])
lst_date = list(df['date'])
lst_t14_work = []
lst_t56_work = []
lst_t14_weekend = []
lst_t56_weekend = []
for i in range(len(lst_id)):
    if i % 100 == 0:
        print(i)
    d_date = lst_date[i]
    dt = d_t.datetime.strptime(d_date, '%Y-%m-%d')
    month_step = relativedelta(months=3)
    pre_date = str(dt - month_step).split(' ')[0]
    df_s = df.loc[(df['easy_id'] == lst_easy[i])
                  & ((df['delivery_date'] >= pre_date)
                     & (df['delivery_date'] < d_date))].reset_index(drop=True)
    work_14_rate, work_56_rate, weekend_14_rate, weekend_56_rate = get_time_slot_rate(df_s)
    lst_t14_work.append(work_14_rate)
    lst_t56_work.append(work_56_rate)
    lst_t14_weekend.append(weekend_14_rate)
    lst_t56_weekend.append(weekend_56_rate)
I could only fix your function and it's completely untested, but here we go:
Import only once by putting the imports at the top of your .py.
A try/except block can be cheaper than an if/else check when the exception is rare.
True and False equal 1 and 0 respectively in Python.
Don't multiply boolean selectors; combine them with & and invert them with the ~ operator.
Create as few copies as possible.
import numpy as np

def get_time_slot_rate(df):
    # much faster than counting
    if df.empty:
        return np.nan, np.nan, np.nan, np.nan
    # assuming df['label'] is either 0 or 1, keep only the label == 1 rows
    df = df.loc[df['label'] == 1]
    # create boolean selectors to be inverted with '~'
    # (dayofweek 0-4 = weekday, time_slot 1-4 = first group, as in the question)
    weekdays = df['dayofweek'] < 5
    slot_selector = df['time_slot'] < 5
    # cast to a plain int so dividing by zero raises ZeroDivisionError as intended;
    # note: because df was filtered to label == 1 above, these counts include only label == 1 rows
    weekday_count = int(weekdays.sum())
    try:
        work_14 = len(df.loc[weekdays & slot_selector]) / weekday_count
        work_56 = len(df.loc[weekdays & ~slot_selector]) / weekday_count
    except ZeroDivisionError:
        work_14 = work_56 = np.nan
    weekend_count = int((~weekdays).sum())
    try:
        weekend_14 = len(df.loc[~weekdays & slot_selector]) / weekend_count
        weekend_56 = len(df.loc[~weekdays & ~slot_selector]) / weekend_count
    except ZeroDivisionError:
        weekend_14 = weekend_56 = np.nan
    return work_14, work_56, weekend_14, weekend_56
The rest of your script doesn't really make sense, see my comments:
for i in range(len(lst_id)):
    if i % 100 == 0:
        print(i)
    d_date = date[i]
    # what is d_t ?
    dt = d_t.datetime.strptime(d_date, '%Y-%m-%d')
    month_step = relativedelta(months=3)
    pre_date = str(dt - month_step).split(' ')[0]
    df_s = df.loc[(df['easy_id'] == lst_easy[i])
                  & (df['delivery_date'] >= pre_date)
                  & (df['delivery_date'] < d_date)].reset_index(drop=True)
    # is it df or df_s ?
    work_14_rate, work_56_rate, weekend_14_rate, weekend_56_rate = get_time_slot_rate(df)
If your date column is a datetime object then you can compare dates directly (no need for strings).
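For example, a minimal untested sketch of that idea, assuming the column is called 'date' as in the question:

import pandas as pd

# Parse once, then compare Timestamps directly instead of strings.
df['date'] = pd.to_datetime(df['date'])

row_date = df['date'].iloc[0]                  # any reference date
pre_date = row_date - pd.DateOffset(months=3)  # three months earlier

window = df.loc[(df['date'] >= pre_date) & (df['date'] < row_date)]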

Vectorized alternative to iterrows: Semantic Analysis

Hi, I'm currently doing semantic tweet analysis and want to improve my code's running time with NumPy vectorization.
I tried enhancing my code for a while but was not successful in doing so.
Could I just move the logic inside the loop into a function and apply it via numpy.vectorize?
ss = SentimentIntensityAnalyzer()
for index, row in tw_list["full_text"].iteritems():
    score = ss.polarity_scores(row)
    neg = score["neg"]
    neu = score["neu"]
    pos = score["pos"]
    comp = score["compound"]
    if neg > pos:
        tw_list.loc[index, "sentiment"] = "negative"
    elif pos > neg:
        tw_list.loc[index, "sentiment"] = "positive"
    else:
        tw_list.loc[index, "sentiment"] = "neutral"
    tw_list.loc[index, "neg"] = neg
    tw_list.loc[index, "neu"] = neu
    tw_list.loc[index, "pos"] = pos
    tw_list.loc[index, "compound"] = comp
Instead of iterating over the rows of the dataframe, you can make use of the apply function.
def get_sentiments(text):
    score = ss.polarity_scores(text)
    neg = score["neg"]
    neu = score["neu"]
    pos = score["pos"]
    comp = score["compound"]
    if neg > pos:
        sentiment = "negative"
    elif pos > neg:
        sentiment = "positive"
    else:
        sentiment = "neutral"
    return sentiment, neg, neu, pos, comp

# Series.apply has no result_type argument, so expand each returned tuple into a row of columns
tw_list[["sentiment", "neg", "neu", "pos", "comp"]] = tw_list["full_text"].apply(
    lambda text: pd.Series(get_sentiments(text),
                           index=["sentiment", "neg", "neu", "pos", "comp"]))
This should give an improvement in performance.
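If building a pd.Series per row turns out to be slow on a large tweet list, one alternative (a sketch under the same assumptions, reusing the get_sentiments function above) is to collect plain tuples and build all five columns in one go:

import pandas as pd

# Apply once per row, collect plain tuples, then attach all five columns at once.
results = [get_sentiments(text) for text in tw_list["full_text"]]
sentiment_df = pd.DataFrame(results,
                            columns=["sentiment", "neg", "neu", "pos", "comp"],
                            index=tw_list.index)
tw_list = pd.concat([tw_list, sentiment_df], axis=1)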

What is the best approach for replacing null values in a pandas dataframe: apply or a for loop?

I have a dataset with 100000 records for which I need to replace null values based on multiple columns.
I have tried two approaches:
#First Approach
# Missing value treatment
start_time = time.time()
data['date_of_last_rech_data_6'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_6','total_rech_data_6','max_rech_data_6']))) else x['date_of_last_rech_data_6'], axis = 1)
data['total_rech_data_6'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_6','total_rech_data_6','max_rech_data_6']))) else x['total_rech_data_6'], axis = 1)
data['max_rech_data_6'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_6','total_rech_data_6','max_rech_data_6']))) else x['max_rech_data_6'], axis = 1)
data['date_of_last_rech_data_7'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_7','total_rech_data_7','max_rech_data_7']))) else x['date_of_last_rech_data_7'], axis = 1)
data['total_rech_data_7'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_7','total_rech_data_7','max_rech_data_7']))) else x['total_rech_data_7'], axis = 1)
data['max_rech_data_7'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_7','total_rech_data_7','max_rech_data_7']))) else x['max_rech_data_7'], axis = 1)
data['date_of_last_rech_data_8'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_8','total_rech_data_8','max_rech_data_8']))) else x['date_of_last_rech_data_8'], axis = 1)
data['total_rech_data_8'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_8','total_rech_data_8','max_rech_data_8']))) else x['total_rech_data_8'], axis = 1)
data['max_rech_data_8'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_8','total_rech_data_8','max_rech_data_8']))) else x['max_rech_data_8'], axis = 1)
data['date_of_last_rech_data_9'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_9','total_rech_data_9','max_rech_data_9']))) else x['date_of_last_rech_data_9'], axis = 1)
data['total_rech_data_9'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_9','total_rech_data_9','max_rech_data_9']))) else x['total_rech_data_9'], axis = 1)
data['max_rech_data_9'] = data.apply(lambda x: 0 if(np.all(pd.isnull(['date_of_last_rech_data_9','total_rech_data_9','max_rech_data_9']))) else x['max_rech_data_9'], axis = 1)
end_time = time.time()
print(end_time-start_time)
Time taken by this snippet is 152.52092480659485 seconds.
#Second Approach
start_time = time.time()
for i in range(0,len(data)):
# Missing value treatment for the month of June
if pd.isnull((data['date_of_last_rech_data_6'][i]) and (data['total_rech_data_6'][i]) and (data['max_rech_data_6'][i]) ):
data['date_of_last_rech_data_6'][i]=0
data['total_rech_data_6'][i]=0
data['max_rech_data_6'][i]=0
# Missing value treatment for the month of July
if pd.isnull((data['date_of_last_rech_data_7'][i]) and (data['total_rech_data_7'][i]) and (data['max_rech_data_7'][i]) ):
data['date_of_last_rech_data_7'][i]=0
data['total_rech_data_7'][i]=0
data['max_rech_data_7'][i]=0
# Missing value treatment for the month of August
if pd.isnull((data['date_of_last_rech_data_8'][i]) and (data['total_rech_data_8'][i]) and (data['max_rech_data_8'][i]) ):
data['date_of_last_rech_data_8'][i]=0
data['total_rech_data_8'][i]=0
data['max_rech_data_8'][i]=0
# Missing value treatment for the month of September
if pd.isnull((data['date_of_last_rech_data_9'][i]) and (data['total_rech_data_9'][i]) and (data['max_rech_data_9'][i]) ):
data['date_of_last_rech_data_9'][i]=0
data['total_rech_data_9'][i]=0
data['max_rech_data_9'][i]=0
end_time = time.time()
print(end_time-start_time)
Time taken by this code is 223.60794281959534 seconds. But this code sometimes runs and sometimes just hangs and stops the kernel.
Is there a better approach to do this?
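As a side note, the lambdas in the first approach call pd.isnull on the literal column-name strings rather than on x[...], so the 0 branch likely never triggers. A vectorized possibility, sketched here and untested against the real data (column names taken from the question), is to build one boolean mask per month and assign with .loc:

import pandas as pd

# For each month, find rows where all three columns are null,
# then fill those rows in a single vectorized assignment.
for month in ['6', '7', '8', '9']:
    cols = ['date_of_last_rech_data_' + month,
            'total_rech_data_' + month,
            'max_rech_data_' + month]
    all_null = data[cols].isnull().all(axis=1)
    data.loc[all_null, cols] = 0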

Subset two consecutive event occurrence in pandas

I'm trying to get a subset of my data whenever there are consecutive occurrences of two events in that order. The events are time-stamped. So every time there is a run of 2's followed by a run of 3's, I want to subset those rows into a dataframe and append it to a dictionary. The following code does that, but I have to apply it to a very large dataframe of more than 20 million observations, and it is extremely slow using iterrows. How can I make this fast?
df = pd.DataFrame({'Date': [101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122],
                   'Event': [1,1,2,2,2,3,3,1,3,2,2,3,1,2,3,2,3,2,2,3,3,3]})
dfb = pd.DataFrame(columns = df.columns)
C = {}
f1 = 0
for index, row in df.iterrows():
    if ((row['Event'] == 2) & (3 not in dfb['Event'].values)):
        dfb = dfb.append(row)
        f1 = 1
    elif ((row['Event'] == 3) & (f1 == 1)):
        dfb = dfb.append(row)
    elif 3 in dfb['Event'].values:
        f1 = 0
        C[str(dfb.iloc[0,0])] = dfb
        del dfb
        dfb = pd.DataFrame(columns = df.columns)
        if row['Event'] == 2:
            dfb = dfb.append(row)
            f1 = 1
    else:
        f1 = 0
        del dfb
        dfb = pd.DataFrame(columns = df.columns)
Edit: The desired output is basically a dictionary of the subsets shown in the image: https://i.stack.imgur.com/ClWZs.png
If you want to accelerate it, you should vectorize your code. You could try it like this (df is the same as in your code):
vec = df.copy()
vec['Event_y'] = vec['Event'].shift(1).fillna(0).astype(int)
vec['Same_Flag'] = float('nan')
vec.Same_Flag.loc[(vec['Event_y'] == vec['Event']) & (vec['Event'] != 1)] = 1
vec.dropna(inplace=True)
vec.loc[:, ('Date', 'Event')]
Output is:
    Date  Event
3    104      2
4    105      2
6    107      3
10   111      2
18   119      2
20   121      3
21   122      3
I think that's close to what you need. You could improve based on that.
I don't understand why dates 104, 105 and 107 are not counted.