Speeding up Sympy's solveset calculation for a large array of variables - pandas

I'm trying to create a parametrization of points in space from a specific point according to a specific inequality.
I'm doing it with SymPy's solveset method; the calculation returns an interval of the parameter t that represents all points between those in my dataframe.
Sadly, performing a solveset call over 13 sets of values (i.e. 13 iterations) leads to an overall execution time of over 20 seconds, and over 1 second of calculation time per set.
The code:
from sympy import *
from sympy import S
from sympy.solvers.solveset import solveset, solveset_real
import pandas as pd
import time
t=symbols('t',positive=True)
p1x,p1y,p2x,p2y=symbols('p1x p1y p2x p2y')
centerp=[10,10]
radius=5
data={'P1X':[0,1,2,3,1,2,3,1,2,3,1,2,3],'P1Y':[3,2,1,0,1,2,3,1,2,3,1,2,3],'P2X':[3,8,2,4,1,2,3,1,2,3,1,2,3],'P2Y':[3,9,10,7,1,2,3,1,2,3,1,2,3],'result':[0,0,0,0,0,0,0,0,0,0,0,0,0]}
df=pd.DataFrame(data)
parameterized_x=p1x+t*(p2x-p1x)
parameterized_y=p1y+t*(p2y-p1y)
start_whole_process=time.time()
overall_time=0
for index,row in df.iterrows():
    parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
    parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
    expr=sqrt((parameterized_x-centerp[0])**2+(parameterized_y-centerp[1])**2)-radius
    start=time.time()
    df.at[index,'result']=solveset(expr>=0,t,domain=S.Reals)
    end=time.time()
    overall_time=overall_time+end-start
end_whole_process=time.time()
I need to know if there's a way to reduce the calculation time, or whether there is another package that can evaluate a specific inequality over large quantities of data without having to wait minutes upon minutes.

There is one big mistake in your current approach that needs to be fixed first. Inside your for loop you did:
parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
expr=sqrt((parameterized_x-centerp[0])**2+(parameterized_y-centerp[1])**2)-radius
This is wrong: SymPy expressions cannot be modified in place. As a result, expr is exactly the same for each row, namely:
# sqrt((p1x + t*(-p1x + p2x) - 10)**2 + (p1y + t*(-p1y + p2y) - 10)**2) - 5
Then, solveset tries to solve the same expression on each row. Because this expression still contains the symbols p1x, p1y, p2x, p2y in addition to t, solveset takes a long time trying to compute the solution, eventually producing the same answer for each row:
# ConditionSet(t, sqrt((p1x + t*(-p1x + p2x) - 10)**2 + (p1y + t*(-p1y + p2y) - 10)**2) - 5 >= 0, Complexes)
Remember: every operation you apply to a SymPy expression creates a new SymPy expression. So, the above code has to be modified to:
px_expr = parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
py_expr = parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
expr=sqrt((px_expr-centerp[0])**2+(py_expr-centerp[1])**2)-radius
In doing so, expr is different for each row, as expected. Then, solveset computes a different solution for each row, and it is much, much faster.
Here is your full example:
from sympy import *
from sympy.solvers.solveset import solveset, solveset_real
import pandas as pd
import time
t=symbols('t',positive=True)
p1x,p1y,p2x,p2y=symbols('p1x p1y p2x p2y')
centerp=[10,10]
radius=5
data={'P1X':[0,1,2,3,1,2,3,1,2,3,1,2,3],'P1Y':[3,2,1,0,1,2,3,1,2,3,1,2,3],'P2X':[3,8,2,4,1,2,3,1,2,3,1,2,3],'P2Y':[3,9,10,7,1,2,3,1,2,3,1,2,3],'result':[0,0,0,0,0,0,0,0,0,0,0,0,0]}
df=pd.DataFrame(data)
parameterized_x=p1x+t*(p2x-p1x)
parameterized_y=p1y+t*(p2y-p1y)
start_whole_process=time.time()
overall_time=0
for index,row in df.iterrows():
    px_expr = parameterized_x.subs([[p1x,row['P1X']],[p2x,row['P2X']]])
    py_expr = parameterized_y.subs([[p1y,row['P1Y']],[p2y,row['P2Y']]])
    expr=sqrt((px_expr-centerp[0])**2+(py_expr-centerp[1])**2)-radius
    df.at[index,'result']=solveset(expr>=0,t,domain=S.Reals)
end_whole_process=time.time()
print("end_whole_process - start_whole_process", end_whole_process - start_whole_process)

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model by using a pandas rolling window. The problem with this method is that pandas only lets you create a window that runs from t=0-x up to t=0 for your rolling window, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but these 2 weeks are not the 2 weeks immediately before the corresponding shipment; they are the 2 weeks starting at t=-4 weeks and ending at t=-2 weeks.
You would imagine that this could be solved by using the same string of code but changing the window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other type of denotation of this specific window does not seem to work.
It seems like pandas does not offer a solution to this problem, so we made a workaround with the following solution:
import numpy as np
import pandas as pd

def time_shift_week(df):
    def _avg_score_interval_func(series):
        current_time = series.index[-1]
        result = series[(series.index > (current_time - pd.Timedelta(value=4, unit='w')))
                        & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
        return result.mean() if len(result) > 0 else 0.0
    temp_df = (
        df.groupby(by=["supplier", "timestamp"], as_index=False)
        .aggregate({"score": np.mean})
        .set_index('timestamp')
    )
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])
        ["score"]          # the aggregated column is still named 'score'
        .apply(lambda x:
            x
            .rolling(window='30D', closed='both')
            .apply(_avg_score_interval_func)
        ))
    return temp_df.reset_index()
This results in a new df in which we find the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
import numpy as np
import pandas as pd

# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()

# Rolling average of score with the custom window.
# closed="left" means the current row is excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
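To turn that into the feature column on the original frame, one possible follow-up (a sketch; the column name is only illustrative) is to merge the per-(supplier, timestamp) result back:
# Merge the custom-window average back onto the original rows.
feature = avg_score.rename("avg_score_t-4w_to_t-2w")
df = df.merge(feature, left_on=["supplier", "timestamp"], right_index=True, how="left")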

pd.to_timedelta('1D') works while pd.to_timedelta('D') fails. Issue when converting frequency to timedelta

I'm encountering a silly problem I guess.
Currently, I'm using pd.infer_freq to get the frequency of a dataframe index. Afterwards, I use pd.to_timedelta() to convert this frequency to a timedelta object, to be added to another date.
This works quite fine, except when the dataframe index has a frequency which can be expressed as a single time unit (e.g. 1 day, or 1 minute).
To be more precise,
freq = pd.infer_freq(df.index)
# let's say it gives '2D' because data index is spaced on that interval
timedelta = pd.to_timedelta(freq)
works, while
freq = pd.infer_freq(df.index)
# let's say it gives 'D' because data index is spaced on that interval
timedelta = pd.to_timedelta(freq)
fails and returns
ValueError: unit abbreviation w/o a number
This could work if I supplied '1D' instead of 'D' though.
I could try to check if the first character of the freq string is numeric, and add '1' otherwise, but that seems quite cumbersome.
Is anyone aware of a better approach?
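A minimal sketch of the check described above, together with pandas' own offset parser to_offset (which accepts a bare 'D'; the resulting offset can be added directly to a Timestamp), could look like this:
import pandas as pd
from pandas.tseries.frequencies import to_offset

freq = pd.infer_freq(df.index)   # e.g. 'D' or '2D'

# Workaround described in the question: prepend '1' when the
# frequency string has no leading number.
if not freq[0].isdigit():
    freq = '1' + freq
timedelta = pd.to_timedelta(freq)

# Alternative: let pandas parse the bare frequency string itself;
# some_date + to_offset('D') advances the date by one day.
offset = to_offset(pd.infer_freq(df.index))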

T-test on the means pandas

I'm working with the MovieLens dataset and I would like to do a t-test on the mean rating values of the male and female users.
import pandas as pd
from scipy.stats import ttest_ind
users_table_names= ['user_id','age','gender','occupation','zip_code']
users= pd.read_csv('ml-100k/u.user', sep='|', names= users_table_names)
ratings_table_names= ['user_id', 'item_id','rating','timestamp']
ratings= pd.read_csv('ml-100k/u.data', sep='\t', names=ratings_table_names)
rating_df= pd.merge(users, ratings)
males = rating_df[rating_df['gender']=='M']
females = rating_df[rating_df['gender']=='F']
ttest_ind(males.rating, females.rating)
And I get the following result:
Ttest_indResult(statistic=-0.27246234775012407, pvalue=0.7852671011802962)
Is this the correct way to do this operation? The results seem a bit odd.
Thank you in advance!
With your code you are performing a two-sided t-test under the assumption that the populations have identical variances, since you haven't specified the parameter equal_var and it defaults to True in scipy's ttest_ind().
So you can state your statistical test as:
Null hypothesis (H0): there is no difference between the values recorded for males and females, or in other words, the means are similar (µMale == µFemale).
Alternative hypothesis (H1): there is a difference between the values recorded for males and females, or in other words, the means are not similar (both the situations where µMale > µFemale and µMale < µFemale, or simply µMale != µFemale).
The significance level is an arbitrary choice for your test, such as 0.05. If you obtain a p-value smaller than your significance level, you can reject the null hypothesis (H0) and consequently accept the alternative hypothesis (H1).
In your results, the p-value is ~0.78, so you cannot reject H0: the data gives no evidence that the mean ratings differ.
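As a quick illustration of that decision rule (0.05 here is just the arbitrary significance level mentioned above):
from scipy.stats import ttest_ind

alpha = 0.05
stat, pvalue = ttest_ind(males.rating, females.rating)
if pvalue < alpha:
    print("Reject H0: the mean ratings differ")
else:
    print("Fail to reject H0: no evidence that the mean ratings differ")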
Considering the standard deviations of the samples shown below, you could also run the test with equal_var=False (Welch's t-test, which does not assume equal variances):
>> males.rating.std()
1.1095557786889139
>> females.rating.std()
1.1709514829100405
>> ttest_ind(males.rating, females.rating, equal_var = False)
Ttest_indResult(statistic=-0.2654398046364026, pvalue=0.7906719538136853)
Which leads to the same conclusion: the null hypothesis (H0) cannot be rejected.
If you use the statsmodels ttest_ind(), you also get the degrees of freedom used in the t-test:
>> import statsmodels.api as sm
>> sm.stats.ttest_ind(males.rating, females.rating, alternative='two-sided', usevar='unequal')
(-0.2654398046364028, 0.790671953813685, 42815.86745494558)
What exactly did you find odd about your results?

Using Python, how can you make a time wheel include all 24 wedges if your data doesn't have data in each category?

Using code from David Dale (Time Wheel in python3 pandas), my data is fairly large but has a few hours with no data, and consequently the corresponding wedges are not shown in the time wheel. So the wheel is missing wedges, and it looks wrong even though it technically is not.
I have searched the proposed questions when asking this question and tried to understand the code well enough to alter it, but cannot.
David Dale code copied from the link:
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm

def pie_heatmap(table, cmap=cm.hot, vmin=None, vmax=None, inner_r=0.25, pie_args={}):
    n, m = table.shape
    vmin = table.min().min() if vmin is None else vmin
    vmax = table.max().max() if vmax is None else vmax

    centre_circle = plt.Circle((0,0), inner_r, edgecolor='black', facecolor='white', fill=True, linewidth=0.25)
    plt.gcf().gca().add_artist(centre_circle)
    norm = mpl.colors.Normalize(vmin=vmin, vmax=vmax)
    cmapper = cm.ScalarMappable(norm=norm, cmap=cmap)

    for i, (row_name, row) in enumerate(table.iterrows()):
        labels = None if i > 0 else table.columns
        wedges = plt.pie([1] * m, radius=inner_r+float(n-i)/n, colors=[cmapper.to_rgba(x) for x in row.values],
                         labels=labels, startangle=90, counterclock=False, wedgeprops={'linewidth':-1}, **pie_args)
        plt.setp(wedges[0], edgecolor='white', linewidth=1.5)
        wedges = plt.pie([1], radius=inner_r+float(n-i-1)/n, colors=['w'], labels=[row_name], startangle=-90, wedgeprops={'linewidth':0})
        plt.setp(wedges[0], edgecolor='white', linewidth=1.5)
The code works well - thanks Dave - but I need it to make the time wheel with 24 wedges regardless of whether data exists or not for that wedge. Thanks for any help!
Update: I wrote a script to make it work for me. data is my 7x24 data table (7 days by 24 hours each). hours is a list of the 24 hours of the day, i.e. ['00:00','01:00','02:00','03:00' ... '23:00']. It is a list of strings because that is how the data arrives.
blankhours=pandas.DataFrame(0,index=np.arange(0,24),columns=np.arange(1))  # to get 0s
shape=data.shape
if shape[1] < 24:                                  # check the number of columns
    for h in hours:
        hour=0                                     # counter
        if h not in data.columns.values:           # see if it is in what should be the complete list
            data.insert(hour,h,blankhours)         # insert it since it wasn't there
            hour+=1                                # increment counter
data=data.sort_index(axis=1)                       # sort the final dataframe by column headers
Hopefully that helps someone...
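A shorter alternative (just a sketch, assuming hours already holds all 24 column labels in the desired order and that empty wedges should read 0) is to let reindex add the missing columns in one call:
# Add any missing hour columns filled with 0 and put them in hour order.
data = data.reindex(columns=hours, fill_value=0)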

Apply function with pandas dataframe - POS tagger computation time

I'm very confused about the apply function in pandas. I have a big dataframe where one column is a column of strings. I'm then using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply the way I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
string num_nouns
0 'cat' 1
1 'two cats' 1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row),1)
which required an adjustment to my function:
from collections import Counter

def tagger_nouns(x):
    list_of_lists = st.tag(x.split())
    flat = [y for z in list_of_lists for y in z]
    parts_of_speech = [row[1] for row in flat]
    c = Counter(parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
I'm using the Stanford tagger with the left-3-words model, but I have a big problem with computation time. I'm noticing that it's calling the .jar file again and again (Java keeps opening and closing in the task manager); maybe that's unavoidable, but it's really taking far too long to run. Any way I can speed it up?
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)
string num_words
0 'cat' 1
1 'two cats' 2
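The noun count follows the same pattern (a sketch using the tagger_nouns function from the question). As for the run time, the repeated JVM start-up is usually the bottleneck; if st is NLTK's StanfordPOSTagger wrapper, tagging every row in a single tag_sents call avoids restarting Java per row:
# Same apply pattern with the question's noun-counting function.
df['num_nouns'] = df['string'].apply(tagger_nouns)

# Batched variant: one tagger call for all rows, then count noun tags per row.
tagged = st.tag_sents([s.split() for s in df['string']])
df['num_nouns'] = [sum(1 for _, pos in sent if pos.startswith('NN')) for sent in tagged]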