How come apply on multiple columns in dataframe does not work? - pandas

I am trying to remove the '$' sign and convert the values to floats for multiple columns in a dataframe.
I have a dataframe that looks something like this:
  policy_status sum_assured premium  riders premium_plus
0             A   1252000 $  1500 $     1.0       1100 $
1             A   1072000 $  2200 $     2.0       1600 $
2             A   1274000 $  1700 $     2.0       1300 $
3             A   1720000 $  2900 $     1.0       1400 $
4             A   1360000 $  1700 $     3.0       1400 $
I have this function:
def transform_amount(x):
    x = x.replace('$', '')
    x = float(x)
    return x
when I do this:
policy[['sum_assured','premium','premium_plus']]=policy[['sum_assured','premium','premium_plus']].apply(transform_amount)
the following error occurred:
TypeError: ("cannot convert the series to <class 'float'>", 'occurred at index sum_assured')
Does anyone know why?

DataFrame.apply passes each selected column to transform_amount as a whole Series, and float() cannot convert a Series, hence the error. If you need the function applied elementwise, use DataFrame.applymap:
cols = ['sum_assured','premium','premium_plus']
policy[cols] = policy[cols].applymap(transform_amount)
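If you are on a newer pandas (2.1 or later), note that DataFrame.applymap has since been deprecated in favour of the equivalent elementwise DataFrame.map, so the same call would be:
policy[cols] = policy[cols].map(transform_amount)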
Better is to use DataFrame.replace with regex=True, escaping the $ because it is a special regex character, and then convert the columns to floats:
cols = ['sum_assured','premium','premium_plus']
policy[cols] = policy[cols].replace(r'\$', '', regex=True).astype(float)
print(policy)
  policy_status  sum_assured  premium  riders  premium_plus
0             A    1252000.0   1500.0     1.0        1100.0
1             A    1072000.0   2200.0     2.0        1600.0
2             A    1274000.0   1700.0     2.0        1300.0
3             A    1720000.0   2900.0     1.0        1400.0
4             A    1360000.0   1700.0     3.0        1400.0

Related

Pythonic style of writing a "for-loop" with "if" clause

I use Java and I'm new to Python.
I have the following code snippet:
count_of_yes = 0
for str_idx in str_indexes:  # e.g. ["abc", "bbb", "cb", "aaa"]
    if "a" in str_idx:
        count_of_yes += one_dict["data_frame_of_interest"].loc[str_idx, 'yes_column']
The one_dict looks like:
# categorical, can only be 1 or 0 in either column
one_dict --> data_frame_of_interest --> ______|__no_column__|__yes_column__
                                        "abc" |     1.0     |     0.0
                                        "cb"  |     1.0     |     0.0
                                        "aaab"|     0.0     |     1.0
                                        "bb"  |     0.0     |     1.0
                                        ...
         --> other_dfs_dont_need --> ...
         ...
I'm trying to get count_of_yes; is there a more Pythonic way to refactor the above for-loop and calculate the sum of count_of_yes?
Thanks!
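For what it's worth, a minimal comprehension-based sketch (assuming str_indexes and one_dict are defined as in the question, and that every key containing "a" exists in the frame's index) could look like:
df = one_dict["data_frame_of_interest"]
count_of_yes = sum(df.loc[str_idx, 'yes_column']
                   for str_idx in str_indexes
                   if "a" in str_idx)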

Excel xlsx file modified into a dataframe is not recognized by an R package that uses dataframe

I uploaded an Excel xlsx file, then created a dataframe by converting numeric variables into categories. When I run an R package that uses the dataframe, the output shows the following error:
> library(DiallelAnalysisR)
> Griffing(Yield, Rep, Cross1, Cross2, GriffingData41, 4, 1)
Error in `$<-.data.frame`(`*tmp*`, "Trt", value = character(0)) :
replacement has 0 rows, data has 20
When I call the str() function, it shows the numeric columns converted into categories, as below.
> str(GriffingData41)
'data.frame': 20 obs. of 4 variables:
$ CROSS1: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 3 3 4 ...
$ CROSS2: Factor w/ 4 levels "2","3","4","5": 1 2 3 4 2 3 4 3 4 4 ...
$ REP : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 2 ...
$ YIELD : num 11.9 14.5 9 13.5 20.5 9.8 16.5 22.1 18.4 19.4 ...
Is this a problem in my dataframe creation?
I would appreciate any help with this error. By the way, I am running this in RStudio.
Thank you.
Note: This is not really a solution to my problem, but I managed to move forward by saving my Excel data in CSV format, changing the data type of the specific columns to character, and importing into RStudio. From there, creating the dataframe and running the R package went smoothly. Still, I am curious why it did not work on the xlsx file.

pandas - how to convert all columns from object to float type

I am trying to convert all columns with a '$' amount from object to float type.
With the code below I couldn't remove the $ sign.
input:
df[:] = df[df.columns.map(lambda x: x.lstrip('$'))]
You can use extract:
df=pd.DataFrame({'A':['$10.00','$10.00','$10.00']})
df.apply(lambda x : x.str.extract(r'(\d+)', expand=False).astype(float))
Out[333]:
A
0 10.0
1 10.0
2 10.0
Update
df.iloc[:,9:32] = df.iloc[:,9:32].apply(lambda x : x.str.extract(r'(\d+)', expand=False).astype(float))
Maybe you can also try using applymap:
df[:] = df.astype(str).applymap(lambda x: x.lstrip('$')).astype(float)
If df is:
    0  1  2
0  $1  7  5
1  $2  7  9
2  $3  7  9
Then, it will result in:
     0    1    2
0  1.0  7.0  5.0
1  2.0  7.0  9.0
2  3.0  7.0  9.0
You can also use regex-based replacement to replace all occurrences of $ with an empty string:
df = df.replace({r'\$': ''}, regex=True)
UPDATE: As per @Wen's suggestion, the solution will be:
df.iloc[:,9:32] = df.iloc[:,9:32].replace({r'\$': ''}, regex=True).astype(float)
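For example, a minimal end-to-end sketch of that replace-and-cast approach on a small, hypothetical frame (the column names and values here are just for illustration):
import pandas as pd

df = pd.DataFrame({'A': ['$10.00', '$10.00', '$10.00'], 'B': ['$7', '$9', '$5']})
# strip the dollar sign from every cell, then cast the columns to float
df = df.replace({r'\$': ''}, regex=True).astype(float)
print(df)
#       A    B
# 0  10.0  7.0
# 1  10.0  9.0
# 2  10.0  5.0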

awk: divide odd columns by following even column

I want to divide all the odd columns in a file by the next even column, e.g. column1/column2, column3/column4, ..., columnN/columnN+1.
test1.txt
1 4 1 2 1 3
1 2 4 2 3 9
desired output
0.25 0.5 0.333
0.5 2 0.333
I tried this:
awk 'BEGIN{OFS="\t"} { for (i=2; i<NF+2; i+=2) printf $(i-1)/i OFS; printf "\n"}'
but it doesn't work.
I would like to add that my actual files have a very large and variable (but always even) number of columns and I would like something that would work on all of them.
awk '{for(i=1;i<NF;i+=2)printf "%f%s",$i/$(i+1),OFS;print "";}' input.txt
Output:
0.250000 0.500000 0.333333
0.500000 2.000000 0.333333
You can adjust the printf format to your needs; for example, "%.3f" prints three decimal places.

obtain averages of field 2 after grouping by field 1 with awk

I have a file with two fields containing numbers that I have sorted numerically based on field 1. The numbers in field 1 range from 1 to 200000 and the numbers in field 2 between 0 and 1. I want to obtain averages for both field 1 and field 2 in batches (based on rows).
Here is example input and output when specifying batches of 4 rows:
1 0.12
1 0.34
2 0.45
2 0.40
50 0.60
301 0.12
899 0.13
1003 0.14
1300 0.56
1699 0.43
2100 0.25
2500 0.56
The output would be:
1.5 0.327
563.25 0.247
1899.75 0.45
Here you go:
awk -v n=4 '{s1 += $1; s2 += $2; if (++i % n == 0) { print s1/n, s2/n; s1=s2=0; } }'
Explanation:
Initialize n=4, the size of the batches
Collect the sums: sum of 1st column in s1, the 2nd in s2
Increment counter i by 1 (default initial value is 0, no need to set it)
If i is divisible by n with no remainder, then we print the averages, and reset the sum variables