Pandas and Fuzzy Match

Currently I have two data frames. I am trying to get a fuzzy match of client names using fuzzywuzzy's process.extractOne function. When I run the following script on sample data I get good results and no errors, but when I run it on my actual data frames I get both an AttributeError and a TypeError. I am not able to provide the data for security reasons, but if anyone can figure out why I am getting these errors based on the script provided, I would be much obliged.
import numpy as np
from fuzzywuzzy import process

names2 = list(dftr3['Common Name'])
names3 = dict(zip(names2, names2))

def get_fuzz_match(row):
    # return the best match above the cutoff, or NaN if nothing scores high enough
    match = process.extractOne(row['CLIENT_NAME'], choices=names3.keys(), score_cutoff=80)
    if match:
        return names3[match[0]]
    return np.nan

dfmi4['Match Name'] = dfmi4.apply(get_fuzz_match, axis=1)
I know not having some examples makes this more difficult to troubleshoot, so I will answer any question and edit the post to help this process along. The specific errors are:
1. AttributeError: 'dict_keys' object has no attribute 'items'
2. TypeError: expected string or buffer

The AttributeError is straightforward and to be expected, I think. Fuzzywuzzy's process.extract function, which does most of the actual work in process.extractOne, uses a try:... except: clause to determine whether to process the choices parameter as dict-like or list-like. I think you are seeing the exception because the TypeError is raised during the except: clause.
The TypeError is trickier to pin down, but I suspect it occurs somewhere in the StringProcessor class, used in the processor module, again called by extract, which uses several string methods and doesn't catch exceptions. So it seems likely that your apply call is passing something that is not a string. Is it possible that you have any empty cells?
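If empty or non-string cells are indeed the culprit, here is a minimal sketch of a workaround, reusing the frames and column names from the question (the isinstance guard and the plain-list choices are my additions, not part of the original post):
import numpy as np
from fuzzywuzzy import process

choice_list = list(names3.keys())   # a real list, not a dict_keys view

def get_fuzz_match(row):
    name = row['CLIENT_NAME']
    # skip NaN / non-string cells, which would otherwise raise the TypeError
    if not isinstance(name, str):
        return np.nan
    match = process.extractOne(name, choice_list, score_cutoff=80)
    return names3[match[0]] if match else np.nan

dfmi4['Match Name'] = dfmi4.apply(get_fuzz_match, axis=1)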

Transforming Python Classes to Spark Delta Rows

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediary and end results were previously stored in an SQL database using sqlalchemy, but we need to move them to Delta tables.
After lots of investigation, I've made the first part work for the binary file parsing, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes,FileDelta.getSchema())
Where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
Now I'm trying to do the same for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() init method generates multiple pandas DataFrames as fields.
I'm getting errors as soon as the executor nodes get to that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on building the Delta adaptation. Is there maybe an easier way to make this work?
I read in [1] that we could switch to the Pandas on Spark API, but to me that seems to be something to do within the package methods themselves. Is that maybe the solution, to rewrite the entire package and its parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example, but it's hard to reproduce since the package code is so complex.
After testing, it turns out that the problem lies in the serialization when writing the output (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()), but gets a pandas.Series object and does not know how to unpickle it.
If you explicitly tell the UDF how to transform it, the UDF works.
def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    dict = {}
    for key, value in vars(spectrum).items():
        if type(value) == pd.Series:
            dict[key] = value.tolist()
        elif type(value) == pd.DataFrame:
            dict[key] = value.to_dict("list")
        else:
            dict[key] = value
    return dict

spectrumparser = F.udf(lambda inputDict: getSpectrumDict(inputDict), SpectrumDelta.getSchema())
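For completeness, here is a minimal sketch of how such UDFs might be wired into a Structured Streaming query. The paths are hypothetical, and this assumes both that the binaryFile format is available as a streaming file source in your Spark version and that the two UDFs are chained this way; neither is stated in the original setup:
# hypothetical paths; `content` is the binary payload column of the binaryFile source
parsed = (spark.readStream.format("binaryFile").load("/data/incoming")
          .withColumn("fileDict", fileparser(F.col("content")))
          .withColumn("spectrumDict", spectrumparser(F.col("fileDict"))))

query = (parsed.writeStream
         .format("delta")
         .option("checkpointLocation", "/data/checkpoints/spectra")
         .start("/data/delta/spectra"))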

Unable to overwrite a Column Value using Pandas

I'm planning to overwrite a field value using pandas, but that does not seem to work. Am I missing anything in the code below?
for row_no in range(df.shape[0]):
    rowIndex = df.index[row_no]
    if re.search('Fex|Process|PIP|VIP|Generic|Mobility', df.loc[rowIndex].VPC_Sub_Cat, re.I):
        print(df.loc[rowIndex].Headline)
        print(df.loc[rowIndex].VPC_Sub_Cat)
        print(df.loc[rowIndex].Final_Result)
        df.loc[rowIndex].Final_Result = 0
        print(df.loc[rowIndex].Final_Result)
        break
The output that I get after running this piece of code is:
This is the description of the issue...
VPC-Generic
1
1
Also, can I achieve the same thing using a function and applying that to the data frame? Kindly let me know.
df.loc[rowIndex].Final_Result is equivalent (in most situations) to
df.loc[rowIndex]['Final_Result']
This causes a chained assignment (see the warning in the pandas indexing docs):
Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
And then, from Returning a view versus a copy:
But it turns out that assigning to the product of chained indexing has inherently unpredictable results.
So use df.loc[rowIndex, 'Final_Result'] = 0 to make sure the value is assigned to the original DataFrame in a single indexing operation, rather than to an intermediate copy.
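A minimal sketch of both fixes, reusing the column names from the question (the vectorized version is my addition, shown as one way to avoid the explicit loop):
# single-step .loc assignment instead of chained indexing
df.loc[rowIndex, 'Final_Result'] = 0

# or, without the loop: build a boolean mask and assign in one shot
mask = df['VPC_Sub_Cat'].str.contains('Fex|Process|PIP|VIP|Generic|Mobility', case=False, na=False)
df.loc[mask, 'Final_Result'] = 0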

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts,
it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF).
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, it takes a matrix of token counts (the output of CountVectorizer), not the raw strings. The correct usage is more like:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform(excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform(excerpts_token_counts)
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the scikit-learn documentation for TfidfTransformer).
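As an aside (not part of the original answer), scikit-learn also provides TfidfVectorizer, which combines both steps and accepts the raw strings directly:
from sklearn.feature_extraction.text import TfidfVectorizer

# equivalent to CountVectorizer followed by TfidfTransformer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(excerpts)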

@NLconstraint with vectorized constraint JuMP/Julia

I am trying to solve a problem involving the equating of sums of exponentials.
This is how I would do it hardcoded:
@NLconstraint(m, exp(x[25]) == exp(x[14]) + exp(x[18]))
This works fine with the rest of the code. However, when I try to do it for an arbitrary set of equations like the above I get an error. Here's my code:
@NLconstraint(m, [k=1:length(LHSSum)], sum(exp.(LHSSum[k][i]) for i=1:length(LHSSum[k])) == sum(exp.(RHSSum[k][i]) for i=1:length(RHSSum[k])))
where LHSSum and RHSSum are arrays containing arrays of the elements that need to be exponentiated and then summed over. That is LHSSum[1]=[x[1],x[2],x[3],...,x[n]]. Where x[i] are variables of type JuMP.Variable. Note that length(LHSSum)=length(RHSSum).
The error returned is:
LoadError: exp is not defined for type Variable. Are you trying to build a nonlinear problem? Make sure you use @NLconstraint/@NLobjective.
So a simple solution would be to do all the exponentiating and summing outside of the @NLconstraint macro, so that the input is a scalar. However, this too presents a problem, since exp(x) is not defined when x is of type JuMP.Variable; exp expects something of type Real. This is strange, since I am able to calculate exponentials just fine when the function is called within an @NLconstraint(). I.e. when I write the line @NLconstraint(m, exp(x) == exp(z) + exp(y)) instead of the earlier line, no errors are thrown.
Another thing I thought to do would be a Taylor series expansion, but this too presents a problem since it goes into @NLconstraint land for powers greater than 2, and then I get stuck with the same vectorization problem.
So I feel stuck. I feel like if JuMP allowed for the vectorized evaluation of @NLconstraint like it does for @constraint, this would not even be an issue. Another fix would be if JuMP implemented its own exp function to allow for the exponentiation of the JuMP.Variable type. However, as it is, I don't see a way to solve this problem in general using the JuMP framework. Do any of you have any solutions to this problem? Any clever workarounds that I am missing?
I'm confused why i isn't used in the expressions you wrote. Do you mean:
@NLconstraint(m, [k = 1:length(LHSSum)],
    sum(exp(LHSSum[k][i]) for i in 1:length(LHSSum[k]))
    ==
    sum(exp(RHSSum[k][i]) for i in 1:length(RHSSum[k])))

In django, how do I sort a model on a field and then get the last item?

Specifically, I have a model that has a field like this
pub_date = models.DateField("date published")
I want to be able to easily grab the object with the most recent pub_date. What is the easiest/best way to do this?
Would something like the following do what I want?
Edition.objects.order_by('pub_date')[:-1]
obj = Edition.objects.latest('pub_date')
You can also simplify things by putting get_latest_by in the model's Meta, then you'll be able to do
obj = Edition.objects.latest()
See the docs for more info. You'll probably also want to set the ordering Meta option.
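A minimal sketch of that Meta setup, assuming the Edition model from the question (the ordering value is just an example):
from django.db import models

class Edition(models.Model):
    pub_date = models.DateField("date published")

    class Meta:
        get_latest_by = "pub_date"
        ordering = ["-pub_date"]

# the field name can now be omitted
obj = Edition.objects.latest()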
Harley's answer is the way to go for the case where you want the latest according to some ordering criteria for particular Models, as you do, but the general solution is to reverse the ordering and retrieve the first item:
Edition.objects.order_by('-pub_date')[0]
Note:
Normal Python lists accept negative indexes, which signify an offset from the end of the list rather than from the beginning like a positive number. However, QuerySet objects will raise "AssertionError: Negative indexing is not supported." if you use a negative index, which is why you have to do what insin said: reverse the ordering and grab the 0th element.
Be careful of using
Edition.objects.order_by('-pub_date')[0]
as you might be indexing an empty QuerySet. I'm not sure what the correct Pythonic approach is, but the simplest would be to wrap it in an if/else or try/except:
try:
    last = Edition.objects.order_by('-pub_date')[0]
except IndexError:
    # Didn't find anything...
    last = None
But, as Harley said, when you're ordering by date, latest() is the djangonic way to do it.
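As an aside (not from the original answers), newer Django versions (1.6+) also provide QuerySet.first(), which returns None for an empty queryset instead of raising:
# returns the newest Edition, or None if the table is empty
last = Edition.objects.order_by('-pub_date').first()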
This has already been answered, but for further reference, this is what the Django Book has to say about slicing data in QuerySets:
Note that negative slicing is not supported:
>>> Publisher.objects.order_by('name')[-1]
Traceback (most recent call last):
...
AssertionError: Negative indexing is not supported.
This is easy to get around, though. Just change the order_by()
statement, like this:
>>> Publisher.objects.order_by('-name')[0]
Refer to the link for more details. Hope that helps!