Create a new column based on specific character from existing column fail : 'str' object has no attribute 'str' - pandas

I hope you can help me. I'm trying to classify some products based on their size: 40ML or other.
Here is my piece of code:
1. Dataframe creation
test = {'Name':['ProductA 40ML','ProductB 100ML','ProductC 40ML','ProductD 100ML']}
df1=pd.DataFrame(test)
2. Function built for classification
def size_class(row):
    if row['Name'].str.contains('40ML'):
        val = '40ML'
    else:
        val = 'other'
    return val
df1['size_classification'] = df1.apply(size_class, axis=1)
Error message:
However the function returns the following error: AttributeError: 'str' object has no attribute 'str'
Question
Would you please be able to help me fix this one? I had a look at existing issues but couldn't find any answer addressing this.

A few things were missed in your implementation:
For membership tests on a plain Python string, the in operator is the idiomatic choice. contains belongs to the pandas .str accessor, which exists on a Series, not on a single string. See the Membership test operations documentation and this Stack Overflow question: Does Python have a string 'contains' substring method?
With axis=1, apply passes each row to your function, so row['Name'] is already a plain str, and calling .str on it raises the AttributeError. You don't need to apply the function to the whole data frame; apply it to the relevant column only.
When apply is called on a single column, the function receives each cell's value separately. In your case that value is a string, so no conversion is needed.
So, the code that fixes your bugs is:
import pandas as pd
test = {'Name':['ProductA 40ML','ProductB 100ML','ProductC 40ML','ProductD 100ML']}
df1=pd.DataFrame(test)
def size_class(row):
    if '40ML' in row:
        val = '40ML'
    else:
        val = 'other'
    return val
df1['size_classification'] = df1['Name'].apply(size_class)
print(df1.head())
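As a possible alternative, the .str accessor from the question does work when it is called on the whole column rather than on a single cell. A minimal sketch, assuming the same df1 as above (the map step is just one way to turn the boolean mask into labels):

```python
import pandas as pd

test = {'Name': ['ProductA 40ML', 'ProductB 100ML', 'ProductC 40ML', 'ProductD 100ML']}
df1 = pd.DataFrame(test)

# .str.contains works on a Series and returns one boolean per row
is_40ml = df1['Name'].str.contains('40ML')

# map the booleans to the two labels
df1['size_classification'] = is_40ml.map({True: '40ML', False: 'other'})
print(df1)
```

This avoids calling a Python function once per row, which may matter on larger frames.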

Related

Can anyone tell me what's wrong with my code (I am a newbie in programming, pls do cooperate )

I am trying to write code that calculates the HCF of two numbers, but I get either an error or an empty list as my answer.
I was expecting the HCF. My idea was to get the factors of the two given numbers, find the factors common to both, and then take the maximum of those.
For future reference, do not attach screenshots. Instead, copy your code and put it into a code block, because Stack Overflow supports code blocks. To start a code block, write three backticks like ``` and to end it write three more backticks. If you add a language name such as python or javascript after the first three backticks, syntax highlighting will be enabled. I would also write a more descriptive title that accurately describes the problem at hand. It would look like so:
Title: How to print from 1-99 in python?
for i in range(1,100):
    print(i)
To answer your question, it seems that your HCF list is empty, and Python's max function raises a ValueError when its argument (here, the HCF list) is empty. From inspecting your code, this happens because the two if conditions that must be satisfied before anything is added to HCF are never satisfied.
So it could be that hcf2[x] is not in hcf and hcf[x] is not in hcf2.
What I would do is extract the logic for finding the factors of each number into a function, then use built-in Python functions to find the common elements between the lists. Like so:
num1 = int(input("Num 1:")) # inputs
num2 = int(input("Num 2:")) # inputs
numberOneFactors = []
numberTwoFactors = []
commonFactors = []
# defining a function that finds the factors and returns it as a list
def findFactors(number):
    temp = []
    for i in range(1, number+1):
        if number%i==0:
            temp.append(i)
    return temp
numberOneFactors = findFactors(num1) # populating factors 1 list
numberTwoFactors = findFactors(num2) # populating factors 2 list
# to find common factors we can use the inbuilt python set functions.
commonFactors = list(set(numberOneFactors).intersection(numberTwoFactors))
# the intersection method finds the common elements in a set.
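To finish the job, the HCF is the largest of the common factors. The sketch below wraps the same idea in functions (find_factors and hcf_by_factors are names made up for this example) and cross-checks the result against math.gcd from the standard library, assuming both inputs are positive integers:

```python
import math

def find_factors(number):
    # every i in 1..number that divides number evenly
    return [i for i in range(1, number + 1) if number % i == 0]

def hcf_by_factors(a, b):
    # common factors of a and b; 1 is always in the set for positive inputs
    common = set(find_factors(a)).intersection(find_factors(b))
    return max(common)

print(hcf_by_factors(12, 18))  # 6
print(hcf_by_factors(12, 18) == math.gcd(12, 18))  # True
```

For large numbers math.gcd is likely the better choice, since listing every factor takes time proportional to the number itself.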

How do I access dataframe column value within udf via scala

I am attempting to add a column to a dataframe, using a value from a specific column (let's assume it's an id) to look up its actual value from another df.
So I set up a lookup def
def lookup(id: String): String = {
  return lookupdf.select("value")
    .where(s"id = '$id'").as[String].first
}
The lookup def works if I test it on its own by passing an id string, it returns the corresponding value.
But I’m having a hard time finding a way to use it within the “withColumn” function.
dataDf
  .withColumn("lookupVal", lit(lookup(col("someId"))))
It properly complains that I'm passing in a column instead of the expected string. The question is: how do I give it the actual value from that column?
You cannot access another dataframe from withColumn. Think of withColumn as only being able to access data at the level of a single record of dataDf.
Instead, use a join like:
val resultDf = lookupDf.select("value", "id")
  .join(dataDf, lookupDf("id") === dataDf("id"), "right")

Selecting in DataFrame without typing 'INDEX', but by calling user-defined variable as existing index/value

I'm looking for a way to select a specific part of a DataFrame. This works as follows:
df = gpd.read_file("path_to_file")
df.set_index(['OBJECTID'], inplace=True)
Polygon = df.loc[['81207'], 'geometry']
(the code continues with other operations using the same 'OBJECTID' in different GeoDataFrame; this is needed to not lose geometry of points and/or polygons as only one geometry type can be linked to a GeoDataFrame)
This gives the correct output. However, the same process will be incorporated in a function to receive similar output for a user-defined input of the 'OBJECTID'. I'm therefore looking for a way to select data based on a user-defined variable: OBJECTID = 81207. How can an index be called by using a variable?
Any suggestions are welcome.
Thanks in advance.
Example of what I would like to achieve:
def Building(OBJECTID):
    OBJECTID = OBJECTID
    print("Building with OBJECTID:", OBJECTID)
    Polygon = df.loc['OBJECTID'] #OBJECTID defined in function
    Points = df2.loc['OBJECTID'] #OBJECTID defined in function
    return (Polygon, Points)
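SOLVED:
This can be done by formatting the variable as a string, so that the value held by the variable (rather than the literal text 'OBJECTID') is used as the index label.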
def Building(OBJECTID):
    print("Building with ObjectID:", OBJECTID)
    Polygon = df.loc['{}'.format(OBJECTID)]
    Points = df2.loc['{}'.format(OBJECTID)]
    return Polygon, Points
Hope this is helpful for others.
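A small illustration of why the string conversion matters, using a plain pandas DataFrame as a stand-in for the GeoDataFrame (the data and the get_building name are invented for this sketch). It assumes, as in the question, that the index labels are strings such as '81207':

```python
import pandas as pd

# stand-in data; the index labels are strings, as in the question
df = pd.DataFrame(
    {"geometry": ["polygon_a", "polygon_b"]},
    index=pd.Index(["81207", "81208"], name="OBJECTID"),
)

def get_building(object_id):
    # str() turns a user-supplied int into the string label the index uses;
    # df.loc['OBJECTID'] would instead look for the literal label 'OBJECTID'
    return df.loc[str(object_id), "geometry"]

print(get_building(81207))
```

Here str(object_id) does the same job as '{}'.format(OBJECTID) in the solution above.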

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int # column number
    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only consecutive columns are supported.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the splat operator (...), giving:
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])

TfidfTransformer.fit_transform( dataframe ) fails

I am trying to build a TF/IDF transformer (maps sets of words into count vectors) based on a Pandas series, in the following code:
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts )
This fails with the following message:
ValueError: could not convert string to float: "I'm trying to work out, in general terms..."
Now, "excerpts" is a Pandas Series consisting of a bunch of text strings excerpted from StackOverflow posts, but when I look at the dtype of excerpts, it says object. So, I reason that the problem might be that something is inferring the type of that Series to be float. So, I tried several ways to make the Series have dtype str:
I tried forcing the column types for the dataframe that includes "excerpts" to be str, but when I look at the dtype of the resulting Series, it's still object
I tried casting the entire dataframe that includes "excerpts" to dtypes str using Pandas.DataFrame.astype(), but the "excerpts" stubbornly have dtype object.
These may be red herrings; the real problem is with fit_transform. Can anyone suggest some way whereby I can see which entries in "excerpts" are causing problems or, alternatively, simply ignore them (leaving out their contribution to the TF/IDF)?
I see the problem. I thought that tf_idf_transformer.fit_transform takes an array-like of text strings as its source argument. Instead, I now understand that it takes a matrix of token counts of shape (n_samples, n_features), such as the output of CountVectorizer. The correct usage is more like:
count_vect = CountVectorizer()
excerpts_token_counts = count_vect.fit_transform( excerpts)
tf_idf_transformer = TfidfTransformer()
return tf_idf_transformer.fit_transform( excerpts_token_counts )
Sorry for my confusion (I should have looked at "Sample pipeline for text feature extraction and evaluation" in the TfidfTransformer documentation for sklearn).
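For what it's worth, scikit-learn also provides TfidfVectorizer, which combines the CountVectorizer and TfidfTransformer steps. A minimal sketch on made-up excerpts (the sample strings are placeholders, not the original data), assuming default settings for all three classes:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# placeholder excerpts standing in for the real Series of post texts
excerpts = [
    "I'm trying to work out, in general terms, how this works",
    "how do I convert a string to a float",
    "general question about string handling",
]

# two-step version: raw text -> token counts -> TF-IDF weights
counts = CountVectorizer().fit_transform(excerpts)
two_step = TfidfTransformer().fit_transform(counts)

# one-step version: raw text -> TF-IDF weights
one_step = TfidfVectorizer().fit_transform(excerpts)

print(two_step.shape == one_step.shape)
```

Both results are sparse matrices with one row per excerpt and one column per vocabulary term.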