Replacing a column value in a DataFrame with map vs. replace in pandas: what is the difference?

I can replace a couple of values in the column 'qualify' with True or False as follows, and it works just fine:
df['qualify'] = df['qualify'].map({'yes':True, 'np':False})
But if I use the same approach to change a name in another column, it changes that name and turns every other value in the column into NaN.
df['name'] = df['name'].map({'dick':'Harry'})
Of course, using replace does the job. But I need to understand why map() does not behave as I expect in the second case.
df['name']=df['name'].replace('dick','Harry')
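A minimal sketch of the difference, using a small made-up DataFrame (the names and values are illustrative, not from the question). When given a dict, Series.map looks up every value in the dict and yields NaN for any value that has no key, whereas Series.replace only changes the values that match and leaves the rest alone. The first call appeared to work only because the dict happened to cover the values in that column.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({"name": ["tom", "dick", "sam"]})

# map(): every value is looked up in the dict; values without a key become NaN.
mapped = df["name"].map({"dick": "Harry"})

# replace(): only matching values change; everything else is kept.
replaced = df["name"].replace({"dick": "Harry"})

# map() with a fallback: fill the gaps from the original column.
mapped_with_fallback = df["name"].map({"dick": "Harry"}).fillna(df["name"])

print(mapped.tolist())
print(replaced.tolist())
print(mapped_with_fallback.tolist())
```

If map is preferred for other reasons, the fillna fallback shown last is one way to keep the unmapped values.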

Related

Error when filtering pandas dataframe by column value

I am having a problem with filtering a pandas dataframe. I am trying to filter a dataframe based on column values being equal to a specific list but I am getting a length error.
I tried every possible way of filtering a dataframe but got nowhere. Any help would be appreciated, thanks in advance.
Here is my code:
for ind in df_hourly.index:
    timeslot = df_hourly['date_parsed'][ind][0:4] # List value to filter
    filtered_df = df.loc[df['timeslot'] == timeslot]
Error: ValueError: ('Lengths must match to compare', (5696,), (4,))
Above image: df. Below image: df_hourly.
In the above image, the dataframe I want to filter is shown. Specifically, I want to filter according to the "timeslot" column.
The below image shows the dataframe that contains the values I want to filter by, specifically the "date_parsed" column. In the first line of my code, I iterate through every row in this dataframe and assign the first 4 elements of the list value in df_hourly["date_parsed"] to a variable; later in the code, I try to filter the above dataframe by that variable.
When you compare a column using ==, pandas tries to compare value by value: does the first item equal the first item, the second item the second, and so on. This is why you receive this error: pandas expects both sides to have the same shape, and here a column of 5696 rows is compared against a list of 4 items.
If you want to check whether each value is contained in a list, you can use .isin (see the documentation):
df.loc[df['timeslot'].isin(timeslot)]
Depending on what timeslot is exactly, you might need to take timeslot.values or something like that (it is hard to tell without an example of your dataframe).
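As a rough illustration of the .isin approach, here is a sketch with made-up stand-ins for the frames in the question (the column contents and the `value` column are assumptions, since the real data was only shown as images):

```python
import pandas as pd

# Hypothetical stand-in for df; the real frame has 5696 rows.
df = pd.DataFrame({
    "timeslot": ["08:00", "09:00", "10:00", "11:00", "12:00"],
    "value": [1, 2, 3, 4, 5],
})

# Hypothetical stand-in for df_hourly['date_parsed'][ind][0:4]: a list of 4 values.
timeslot = ["09:00", "11:00", "13:00", "14:00"]

# isin() checks, row by row, whether the value is a member of the list,
# so the lengths of the column and the list do not need to match.
filtered_df = df.loc[df["timeslot"].isin(timeslot)]
print(filtered_df)
```

Only the rows whose timeslot appears in the list are kept; list entries that never occur in the column are simply ignored.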

Change Column Values in a Dataframe column using Pandas

The data type of the column is object, but I still cast it to string using astype(str). I even used temp['Injury Severity'].str.strip() to remove spaces from the column values.
I want to replace all "Fatal(0)", "Fatal(1)", ... with only "Fatal", so I used temp['Injury Severity'] = temp['Injury Severity'].replace('Fatal(0)','Fatal',inplace = True).
But it did not work. I also tried temp.loc[temp['Injury Severity'] == 'Fatal(0)','Injury Severity'] = temp['Injury Severity'].replace('Fatal(0)','Fatal',inplace = True)
In addition, I tried str.replace, but that did not work either. Lastly, I also used regex = True, but no change was observed; the values still remain the same.
I think it is solved. It seems that the values had leading and trailing spaces. Thanks a lot for the help, everyone!
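Try this:
temp['Injury Severity'].replace('Fatal(0)','Fatal',inplace = True)
No need to assign it again: with inplace = True, replace returns None, so assigning the result back overwrites the column with None.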
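For the general case ("Fatal(0)", "Fatal(1)", and so on, possibly with stray spaces), one possible sketch is to strip whitespace and then use a regular expression. The sample values below are made up, and the pattern assumes the number in parentheses is always made of digits. The parentheses are escaped because they are regex metacharacters.

```python
import pandas as pd

# Hypothetical data; note the stray spaces around some values.
temp = pd.DataFrame(
    {"Injury Severity": [" Fatal(0)", "Fatal(1) ", "Non-Fatal", "Fatal(12)"]}
)

# Strip leading/trailing whitespace first, otherwise exact matches fail.
cleaned = temp["Injury Severity"].astype(str).str.strip()

# Collapse "Fatal(<digits>)" to "Fatal"; other values are left untouched.
temp["Injury Severity"] = cleaned.str.replace(r"^Fatal\(\d+\)$", "Fatal", regex=True)

print(temp["Injury Severity"].tolist())
```

This version assigns the result back to the column instead of relying on inplace.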

Style.format column in dataframe to absolute values

I have the following problem. I have a column in a dataframe (let's call it df['Price']) and I need to format it to two decimal places, but for negative values I need the minus sign gone, since I already have conditional formatting that colors the negative values red.
df.style.format({'Price': '{:,.2f}'})
This is the generic formatting, which works fine, but how do I change it to solve my problem? I basically need to pass the absolute values of the column to the formatter instead of the actual values.
You can pass a callable to .format as well – see https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Formatting-Values.
This should do the trick:
df.style.format({'Price': lambda value: f'{abs(value):,.2f}'})
From the pandas documentation you can pass a callable as formatter. Therefore you can just take the absolute value.
df.style.format({'Price': lambda x: f"{abs(x):,.2f}"})
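The same idea written as a named function, which can be easier to test than a lambda. The function name `format_abs_price` is made up for this sketch; it only does the string formatting and does not depend on pandas.

```python
def format_abs_price(value: float) -> str:
    """Format the magnitude of value with thousands separators and two decimals."""
    return f"{abs(value):,.2f}"


print(format_abs_price(-1234.5))  # 1,234.50
print(format_abs_price(1234.5))   # 1,234.50
```

It can then be passed in place of the lambda, for example df.style.format({'Price': format_abs_price}).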

How do I access dataframe column value within udf via scala

I am attempting to add a column to a dataframe, using a value from a specific column (let's assume it's an id) to look up its actual value from another df.
So I set up a lookup def
def lookup(id: String): String = {
  // Returns the value whose id matches the given string
  lookupdf.select("value")
    .where(s"id = '$id'").as[String].first
}
The lookup def works if I test it on its own by passing an id string, it returns the corresponding value.
But I'm having a hard time finding a way to use it within the "withColumn" function.
dataDf
  .withColumn("lookupVal", lit(lookup(col("someId"))))
It rightly complains that I'm passing in a column instead of the expected string. The question is: how do I give it the actual value from that column?
You cannot access another dataframe from within withColumn. Think of it this way: withColumn can only access data at the level of a single record of dataDf.
Please use a join instead, like this (note that column equality in Spark is written ===, not ==):
val resultDf = lookupDf.select("value", "id")
  .join(dataDf, lookupDf("id") === dataDf("id"), "right")

Reading in non-consecutive columns using XLSX.gettable?

Is there a way to read in a selection of non-consecutive columns of Excel data using XLSX.gettable? I’ve read the documentation here XLSX.jl Tutorial, but it’s not clear whether it’s possible to do this. For example,
df = DataFrame(XLSX.gettable(sheet,"A:B")...)
selects the data in columns “A” and “B” of a worksheet called sheet. But what if I want columns A and C, for example? I tried
df = DataFrame(XLSX.gettable(sheet,["A","C"])...)
and similar variations of this, but it throws the following error: MethodError: no method matching gettable(::XLSX.Worksheet, ::Array{String,1}).
Is there a way to make this work with gettable, or is there a similar function which can accomplish this?
I don't think this is possible with the current version of XLSX.jl:
If you look at the definition of gettable here you'll see that it calls
eachtablerow(sheet, cols;...)
which is defined here as accepting Union{ColumnRange, AbstractString} as input for the cols argument. The cols argument itself is converted to a ColumnRange object in the eachtablerow function, which is defined here as:
struct ColumnRange
    start::Int # column number
    stop::Int # column number
    function ColumnRange(a::Int, b::Int)
        @assert a <= b "Invalid ColumnRange. Start column must be located before end column."
        return new(a, b)
    end
end
So it looks to me like only ranges of consecutive columns are supported.
To get around this you should be able to just broadcast the gettable function over your column ranges and then concatenate the resulting DataFrames:
df = reduce(hcat, DataFrame.(XLSX.gettable.(sheet, ["A:B", "D:E"])))
I found that to get @Nils Gudat's answer to work you need to add the ... operator, which gives
reduce(hcat, [DataFrame(XLSX.gettable(sheet, x)...) for x in ["A:B", "D:E"]])