pandas apply for performance

I have a pandas apply function that runs inference over a 10k-row CSV of strings:
account messages
0 th_account Forgot to tell you Evan went to sleep a little...
1 th_account Hey I heard your buying a house I m getting ri...
2 th_account They re releasing a 16 MacBook
3 th_account 5 cups of coffee today I may break the record
4 th_account Apple Store Items in order W544414717 were del...
The function takes about 17 seconds to run.
I'm working on a text classifier and was wondering if there is a quicker way to write it.
def _predict(messages):
    results = []
    for message in messages:
        message = vectorizer.transform([message])
        message = message.toarray()
        results.append(model.predict(message))
    return results
df["pred"] = _predict(df.messages.values)
The vectorizer is a TfidfVectorizer and the model is a GaussianNB model from sklearn.
I need to loop through every message in the CSV and perform a prediction to be shown in a new column.

You can try the built-in apply function in pandas. Parts of its machinery are implemented in C/Cython, but each row still invokes your Python function under the GIL, so it will still be slow.
def _predict(row):
    """row is one row of the dataframe;
    each row of the dataframe returns one result
    """
    vector = vectorizer.transform([row["messages"]])
    vector = vector.toarray()
    return model.predict(vector)[0]

df["pred"] = df.apply(_predict, axis=1)
You can run the following code to evaluate the time.
df.head().apply(_predict, axis=1)
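If apply is still too slow, another option worth mentioning (a sketch, assuming the TfidfVectorizer and GaussianNB objects described in the question) is to drop the per-row loop entirely: both transform and predict accept a whole batch at once, so the column can be vectorized and predicted in one call.

# Sketch: vectorize and predict the whole column at once.
# `vectorizer`, `model`, and `df` are assumed to be the objects from the question.
X = vectorizer.transform(df["messages"].tolist())  # sparse matrix, shape (n_messages, n_features)
df["pred"] = model.predict(X.toarray())            # GaussianNB needs a dense array

Note that densifying 10k TF-IDF rows can use a fair amount of memory if the vocabulary is large, so this trades memory for speed.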

Related

Dask process hangs after warning about "full garbage collection took x% cpu time recently (threshold: y%)"

I'm using Dask to process a massive dataset and eventually build a model for a classification task, and I'm running into problems. I hope I can get some help.
Main Task
I'm working with clinical notes. Each clinical note has a note type associated with it. However, over 60% of the notes are of type *Missing*. I'm trying to train a classifier on the notes that are labeled and run inference on the notes that have the missing type.
Data
I'm working with 3 years' worth of clinical notes. The total data size is ~1.3 TB. These were pulled from a database using PySpark (I have no control over this process) and are organized as year/month/partitions.parquet. The root directory is raw_data. The number of partitions within each month varies (e.g., one of the months has 2620 partitions). The total number of partitions is over 50,000.
Machine
Cores: 64
Memory: 1TB
The machine is shared with others, so I won't be able to use all of its hardware resources at any given time.
Code
As a first step towards building the model, I want to preprocess the data and do some EDA. I'm using the package TextDescriptives, which uses spaCy, to get some basic information about the text.
def replace_empty(text, replace=np.nan):
    """
    Replace empty notes with NaNs, which can be removed later
    """
    if pd.isnull(text):
        return text
    elif text.isspace() or text == '':
        return replace
    return text

def fix_ws(text):
    """
    Replace multiple carriage returns with a single newline
    and multiple newlines with a single newline
    """
    text = re.sub('\r', '\n', text)
    text = re.sub('\n+', '\n', text)
    return text

def replace_empty_part(df, **kwargs):
    return df.apply(replace_empty)

def fix_ws_part(df, **kwargs):
    return df.apply(fix_ws)

def fix_missing_part(df, **kwargs):
    return df.apply(lambda t: '*Missing*' if t == 'Unknown at this time' else t)

def extract_td_metrics(text, spacy_model):
    try:
        doc = spacy_model(text)
        metrics_df = td.extract_df(doc)[cols]
        return metrics_df.squeeze()
    except Exception:
        return pd.Series([np.nan for _ in range(len(cols))], index=cols)

def extract_metrics_part(df, **kwargs):
    spacy_model = spacy.load('en_core_web_sm', disable=['tok2vec', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
    spacy_model.add_pipe('textdescriptives')
    return df.apply(extract_td_metrics, spacy_model=spacy_model)

client = Client(n_workers=32)
notes_df = dd.read_parquet(single_month)
notes_df['Text'] = notes_df['Text'].map_partitions(replace_empty_part, meta='string')
notes_df = notes_df.dropna()
notes_df['Text'] = notes_df['Text'].map_partitions(fix_ws_part, meta='string')
notes_df['NoteType'] = notes_df['NoteType'].map_partitions(fix_missing_part, meta='string')
metrics_df = notes_df['Text'].map_partitions(extract_metrics_part)
notes_df = dd.concat([notes_df, metrics_df], axis=1)
notes_df = notes_df.dropna()
notes_df = notes_df.repartition(npartitions=4)
notes_df.to_parquet(processed_notes, schema={'NoteType': pa.string(), 'Text': pa.string()}, write_index=False)
All of this code was tested on a small sample with Pandas to make sure it works, and on Dask (on the same sample) to make sure the results matched. When I run this code on only a single month's worth of data, after running for a few seconds the process just hangs, outputting a stream of warnings of this type:
timestamp - distributed.utils_perf - WARNING - full garbage collections took 35% CPU time recently (threshold: 10%)
The machine is in a secure enclave, so I don't have copy/paste and I'm typing everything out here. After some research I came across two threads, here and here. While there wasn't a direct solution in either one of them, suggestions included disabling Python garbage collection using gc.disable() and starting a clean environment with dask freshly installed. Neither of these helped me. I'm wondering if I can modify my code so that this problem doesn't happen. There is no way to load all this data into memory and use Pandas directly.
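One modification that might be worth trying (a sketch only, not a confirmed fix): collapse the separate per-column map_partitions passes into a single function applied once per partition, so each partition is touched once and the scheduler holds far fewer intermediate task results, which tends to ease garbage-collection pressure.

# Sketch: one preprocessing pass per partition instead of three separate ones.
# `replace_empty`, `fix_ws`, `single_month`, and the column names come from the
# question; the consolidation itself is only a suggestion.
def preprocess_part(df):
    df = df.copy()
    df['Text'] = df['Text'].apply(replace_empty)
    df = df.dropna(subset=['Text'])
    df['Text'] = df['Text'].apply(fix_ws)
    df['NoteType'] = df['NoteType'].where(
        df['NoteType'] != 'Unknown at this time', '*Missing*')
    return df

notes_df = dd.read_parquet(single_month).map_partitions(preprocess_part)

Repartitioning into fewer, larger partitions before the spaCy step (rather than only at the end) may also reduce the number of small objects the workers have to track, which is usually what drives these GC warnings.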
Thanks.

Numpy.dot hangs my program, I assume it is a memory problem

I have two numpy arrays: A with shape (512,) and B with shape (3000, 512). One function calls
C = np.dot(B, A) and my program hangs without any error output.
I'm on Python 3.7.3 and numpy 1.16.2.
But that code runs fine if I call c = np.dot(B, A) manually with suitable input, or when the length of B is around 50.
I don't know what the difference between the two ways of calling it is.
I found the answer. It was because of the process memory limit. My program was using 20 GB of RAM while running, and when numpy needed more memory for its work the system hung without any error or warning; but when I called that function manually, it ran in another process and got more RAM for its work.
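If you want to confirm that it really is a memory issue, one quick check (a sketch using psutil, which is not part of the original code) is to log the available memory right before the call:

import numpy as np
import psutil

# toy arrays with the same shapes as in the question
A = np.random.rand(512)
B = np.random.rand(3000, 512)

avail = psutil.virtual_memory().available
print(f"available RAM before np.dot: {avail / 1e9:.1f} GB")
C = np.dot(B, A)
print("done, C has shape", C.shape)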
When A comes first, you need to use the transpose of B. Interestingly, I did not have to change the shape of A. This does not seem consistent to me, but it works.
import numpy as np

A = np.array([i for i in range(512)])  # shape (512,)
B = np.random.rand(3000, 512)          # shape (3000, 512)
C1 = B.dot(A)                          # shape (3000,)
B = B.transpose()                      # shape (512, 3000)
C2 = A.dot(B)                          # shape (3000,)
C2 = C2.transpose()                    # no-op on a 1-D array
print(np.all(np.equal(C1, C2)))        # verify that the results are the same

Using Dask Delayed on Small/Partitioned Dataframes

I am working with time series data formatted so that each row is a single ID/time/data observation. This means the rows don't correspond one-to-one with IDs; each ID has many rows across time.
I am trying to use dask delayed to have a function run on an entire ID sequence (it makes sense that the operation should be able to run on each individual ID at the same time, since they don't affect each other). To do this I am first looping through each of the ID tags, pulling/locating all the data for that ID (with .loc in pandas, so it is a separate "mini" df), then delaying the function call on the mini df, adding a column with the delayed values, and appending it to a list of all the mini dfs. At the end of the for loop I want to call dask.compute() on all the mini dfs at once, but for some reason the mini dfs' values are still delayed. Below is some pseudocode for what I just described.
I have a feeling that this may not be the best way to go about it, but it's what made sense at the time and I can't understand what's wrong, so any help would be very much appreciated.
Here is what I am trying to do:
list_of_mini_dfs = []
for id in big_df:
    curr_df = big_df.loc[big_df['id'] == id]
    curr_df['new value 1'] = dask.delayed(myfunc)(args1)
    curr_df['new value 2'] = dask.delayed(myfunc)(args2)  # same func as previous line
    list_of_mini_dfs.append(curr_df)

list_of_mini_dfs = dask.delayed(list_of_mini_dfs).compute()
Then I concat all the mini dfs into a new big df.
As you can see from the code, I have to reach into my big/overall dataframe to pull out each ID's sequence of data, since it is interspersed throughout the rows. I want to be able to call a delayed function on that single ID's data and then return the values from the function call into the big/overall dataframe.
Currently this method is not working: when I concat all the mini dataframes back together, the two values I have delayed are still delayed, which leads me to think that it is due to the way I am delaying a function within a df and then trying to compute the list of dataframes. I just can't see how to fix it.
Hopefully this was relatively clear and thank you for the help.
IIUC you are trying to do a sort of transform using dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np

# generate big_df
dates = pd.date_range(start='2019-01-01',
                      end='2020-01-01')
l = len(dates)
out = []
for i in range(1000):
    df = pd.DataFrame({"ID": [i] * l,
                       "date": dates,
                       "data0": np.random.randn(l),
                       "data1": np.random.randn(l)})
    out.append(df)

big_df = pd.concat(out, ignore_index=True)\
           .sample(frac=1)\
           .reset_index(drop=True)
Now you want to apply your function fun to the columns data0 and data1.
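The answer leaves fun unspecified; for this sketch, assume it is any per-group reduction that returns one value per data column, for example:

def fun(g):
    # placeholder per-group function: the mean of each data column
    return g.mean()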
Pandas
out = big_df.groupby("ID")[["data0", "data1"]]\
            .apply(fun)\
            .reset_index()

df_pd = pd.merge(big_df, out, how="left", on="ID")
Dask
df = dd.from_pandas(big_df, npartitions=4)

out = df.groupby("ID")[["data0", "data1"]]\
        .apply(fun, meta={'data0': 'f8',
                          'data1': 'f8'})\
        .rename(columns={'data0': 'new_values0',
                         'data1': 'new_values1'})\
        .compute()  # here you need to compute, otherwise you'll get NaNs

df_dask = dd.merge(df, out,
                   how="left",
                   left_on=["ID"],
                   right_index=True)
The dask version is not necessarily faster than the pandas one, in particular if your df fits in RAM.

Why is there a train inside train in lines 2 & 3?

I am a beginner in Python. This is an excerpt of code from https://github.com/minsuk-heo/kaggle-titanic/blob/master/titanic-solution.ipynb (line no. 12). I was trying to understand a bar chart with it:
def bar_chart(feature):
    survived = train[train['Survived']==1][feature].value_counts()
    dead = train[train['Survived']==0][feature].value_counts()
    df = pd.DataFrame([survived, dead])
    df.index = ['Survived', 'Dead']
    df.plot(kind='bar', stacked=True, figsize=(10, 5))
@Pranjal, try to learn a Python module first (here, pandas) before jumping into any challenge (say, Kaggle's Titanic).
To answer your question, consider the lines you asked for -
Line 2: survived = train[train['Survived']==1][feature].value_counts()
Line 3: dead = train[train['Survived']==0][feature].value_counts()
The train['Survived']==1 expression results in a boolean (True/False) pandas Series: it is True where the column Survived equals 1, and False otherwise. Once the Series is generated, it is fed to the outer train[...], and only the rows that map to True are kept; the others are dropped. Next you select only the feature column from the resulting dataframe, and value_counts() returns an object containing the counts of its unique values. Line 3 proceeds similarly.
Tip: no permanent change happens to the train dataframe here.
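To make the masking step concrete, here is a toy example (made-up data, not the actual Titanic set):

import pandas as pd

train = pd.DataFrame({'Survived': [1, 0, 1, 0],
                      'Sex': ['female', 'male', 'female', 'male']})

mask = train['Survived'] == 1                 # boolean Series: True, False, True, False
survived = train[mask]['Sex'].value_counts()  # counts of each unique value among the kept rows
print(survived)                               # female    2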

Randomly selecting different percentages of data in Python

Python beginner here. I have a dataset with 101 rows which I have imported into Python (as a CSV file) using Pandas. I essentially want to randomly generate a number between 0 and 1 and, based on the result, randomly select the percent equivalent from the dataset. So, for instance, a randomly generated number of 0.89 would require 89% of the data to be selected.
I also want to specify different percentages such that I have, for instance, 89%, 8% and 3% of the data randomly selected at once. This is so I can make different assumptions based on X% of data that has been selected (for instance, for 3% of rows selected print('A'), etc.). I finally want to simulate the whole thing several times and store the results.
I have been experimenting with different bits of code, such as df.sample(frac=0.89), etc., but I'm not sure how to extend this to select different percentages at the same time.
My current code is:
import random
import pandas as pd
df = pd.read_csv(r'R_100.csv', encoding='cp1252')
df_1 = df['R_MD'].sample(frac=0.8889)
Total = df['PR_MD'].sum()
print(df_1, 'Total=', Total)
Any advice is much appreciated. Thanks in advance.
Here is something you can do; you need a function so that you can do this every time.
import pandas as pd
df = pd.read_csv(r'R_100.csv', encoding='cp1252')
After you read the dataframe
def frac(dataframe, fraction, other_info=None):
    """Returns a fraction of the data"""
    return dataframe.sample(frac=fraction)
Here other_info could be a specific column name. Then call the function however many times you want:
df_1 = frac(df, 0.3)
It will return a new dataframe that you can use for anything you want. Since I infer from your example that you are taking the sum of a column, you could use it something like this:
import random

def random_gen():
    """Generates a random fraction between 0 and 1"""
    # random.random() gives a float in [0, 1), matching the intent of
    # picking a random percentage (randint(0, 1) would only give 0 or 1)
    return random.random()

def print_sum(column_name):
    """Prints the sum"""
    # call random_gen() to give out a number
    rand_num = random_gen()
    # pass the number as the fraction parameter to frac()
    df_tmp = frac(df, rand_num)
    print(df_tmp[str(column_name)].sum())
Or if you want
but I'm not sure how to extend this to select different percentages at the same time.
Then just change print_sum as follows:
def print_sum(column_name):
    """Returns the results for 10 iterations"""
    # list to store all the results
    results = []
    # select a different random fraction on each iteration,
    # or loop over a list of the specific fractions you want
    for i in range(10):
        # generate a random fraction
        fracr = random_gen()
        # pass the number as the fraction parameter to frac()
        df_tmp = frac(df, fracr)
        results.append(df_tmp[str(column_name)].sum())
    return results
Hope this helps! Feedback is much appreciated :)
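On the "select different percentages at the same time" part: one option (a sketch; the 89/8/3 split is just the example from the question, and df is the frame read from R_100.csv) is to shuffle the rows once and slice the shuffled frame into consecutive, non-overlapping pieces:

import pandas as pd

df = pd.read_csv(r'R_100.csv', encoding='cp1252')    # as read earlier in this answer

fractions = [0.89, 0.08, 0.03]                       # should sum to at most 1
shuffled = df.sample(frac=1).reset_index(drop=True)  # shuffle all rows once

parts, start = [], 0
for f in fractions:
    stop = start + int(round(f * len(shuffled)))
    parts.append(shuffled.iloc[start:stop])          # e.g. 89%, then 8%, then 3% of the rows
    start = stop

print([len(p) for p in parts])                       # sizes of the three random groups

Repeating the whole shuffle-and-slice inside a loop would give you the repeated simulations you mentioned.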