Julia: Convert DataFrame with multiple string columns to a float array

I have a DataFrame with many String columns that should be Float64 instead. I would like to convert all of these columns at once and turn the DataFrame into a float array. How can I do this? Importantly, there are some float columns too.
df = DataFrame(a=["1", "2", "3"], b=["1.1", "2.2", "3.3"], c=[0.1, 0.2, 0.3])
# Verbose option
df.a = parse.(Float64, df.a)
df.b = parse.(Float64, df.b)
matrix = Matrix{Float64}(df)
# Is it possible to do this all at once, especially when there are float columns too?
# Here parse.(Float64, df.c) would throw an error

One way of doing this is by looping over the String columns:
for c ∈ names(df, String)
    df[!, c] = parse.(Float64, df[!, c])
end
Note that you don't need Matrix{Float64} if you've already turned everything into floats; plain Matrix(df) will do.

I had the same question, landed on this page, and found that the above code did not work for me. A slight change that made it work is:
for c ∈ names(df, Any)
    df[!, c] = Float64.(df[!, c])
end
Note that in names(df, Any) the Any argument can be replaced with String or any other data type.

Related

Rounding numpy binary file data to two decimal places [duplicate]

I have a numpy array, something like below:
data = np.array([ 1.60130719e-01, 9.93827160e-01, 3.63108206e-04])
and I want to round each element to two decimal places.
How can I do so?
Numpy provides two identical methods to do this. Either use
np.round(data, 2)
or
np.around(data, 2)
as they are equivalent.
See the documentation for more information.
Examples:
>>> import numpy as np
>>> a = np.array([0.015, 0.235, 0.112])
>>> np.round(a, 2)
array([0.02, 0.24, 0.11])
>>> np.around(a, 2)
array([0.02, 0.24, 0.11])
>>> np.round(a, 1)
array([0. , 0.2, 0.1])
If you want the output to be
array([1.6e-01, 9.9e-01, 3.6e-04])
then the problem is not really a missing feature of NumPy, but rather that this sort of rounding (to a fixed number of significant digits) is not a standard thing to do. You can write your own rounding function that achieves this like so:
def my_round(value, N):
    exponent = np.ceil(np.log10(value))
    return 10**exponent * np.round(value * 10**(-exponent), N)
For a general solution handling 0 and negative values as well, you can do something like this:
def my_round(value, N):
    value = np.asarray(value).copy()
    zero_mask = (value == 0)
    value[zero_mask] = 1.0
    sign_mask = (value < 0)
    value[sign_mask] *= -1
    exponent = np.ceil(np.log10(value))
    result = 10**exponent * np.round(value * 10**(-exponent), N)
    result[sign_mask] *= -1
    result[zero_mask] = 0.0
    return result
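For example, applied to the array from the question it keeps two significant digits per element (output shown as NumPy typically prints it):
>>> data = np.array([1.60130719e-01, 9.93827160e-01, 3.63108206e-04])
>>> my_round(data, 2)
array([1.6e-01, 9.9e-01, 3.6e-04])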
It is worth noting that the accepted answer will round small floats down to zero as demonstrated below:
>>> import numpy as np
>>> arr = np.asarray([2.92290007e+00, -1.57376965e-03, 4.82011728e-08, 1.92896977e-12])
>>> print(arr)
[ 2.92290007e+00 -1.57376965e-03 4.82011728e-08 1.92896977e-12]
>>> np.round(arr, 2)
array([ 2.92, -0. , 0. , 0. ])
You can use set_printoptions and a custom formatter to fix this and get a more numpy-esque printout with fewer decimal places:
>>> np.set_printoptions(formatter={'float': "{0:0.2e}".format})
>>> print(arr)
[2.92e+00 -1.57e-03 4.82e-08 1.93e-12]
This way, you get the full versatility of format and maintain the precision of numpy's datatypes.
Also note that this only affects printing, not the actual precision of the stored values used for computation.
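To make the distinction concrete (continuing the session above), the stored values are untouched, whereas the array returned by np.round really does lose the small values:
>>> arr[2] == 4.82011728e-08
True
>>> np.round(arr, 2)[2] == 0.0
True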

Convert column names to string in pandas dataframe

Can somebody explain how this works?
df.columns = list(map(str, df.columns))
Your code is not the best way to convert column names to strings; use instead:
df.columns = df.columns.astype(str)
Your code:
df.columns = list(map(str, df.columns))
is equivalent to:
df.columns = [str(col) for col in df.columns]
map: for each item in df.columns (an iterable), apply the function str to it. But map returns an iterator, so you need to call list explicitly to materialize the resulting list.
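A small illustrative example (the integer column labels below are made up):
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=[0, 1])

print(df.columns.astype(str))      # Index(['0', '1'], dtype='object')
print(map(str, df.columns))        # <map object at 0x...> -- an iterator, not a list
print(list(map(str, df.columns)))  # ['0', '1']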

Drop rows that don't have a float value in a column

I have this df:
My task is to find results with these conditions:
[(df.neighbourhood_group == 'Manhattan') & (df.room_type == 'Entire home/apt') & (df.price.between(150.0, 175.0))]
But this is not working. The error message says:
TypeError: '>=' not supported between instances of 'str' and 'float'
Because in the price column I have the value Private room written somewhere.
How can I write a piece of code that tells to keep only float values and drop all the others?
NOTE
These are not working:
df = df[df['price'].apply(lambda x: type(x) in [float])]
clean['price']=df['price'].str.replace('Private room', '0.0')
clean.price = clean.price.astype(float)
df.select_dtypes(exclude=['str'])
This is the CSV data.
One way to achieve it:
df['price'] = df.apply(lambda r: r['price'] if type(r['price']) == float else np.nan, axis=1)
df.dropna(inplace=True)
In this way you will replace the price of any non-float row with np.nan, and then remove such rows.
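For reference, here is a minimal self-contained sketch of that approach; the tiny frame below is made up for illustration (the real data comes from the linked CSV):
import numpy as np
import pandas as pd

# Made-up miniature version of the listings data: one stray string in `price`.
df = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn'],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room'],
    'price': [160.0, 'Private room', 90.0],
})

# Replace non-float prices with NaN, drop those rows, then apply the original filter.
df['price'] = df.apply(lambda r: r['price'] if type(r['price']) == float else np.nan, axis=1)
df = df.dropna(subset=['price'])
result = df[(df.neighbourhood_group == 'Manhattan')
            & (df.room_type == 'Entire home/apt')
            & (df.price.between(150.0, 175.0))]
An alternative worth knowing (not part of the answer above) is pd.to_numeric(df['price'], errors='coerce'), which converts the column and turns non-numeric strings into NaN in one step.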

Initialize a column with missing values and copy+transform another column of a dataframe into the initialized column

I have a messy column in a csv file (column A of the dataframe).
using CSV, DataFrames
df = DataFrame(A = ["1", "3", "-", "4", missing, "9"], B = ["M", "F", "R", "G", "Z", "D"])
What I want to do is:
Transform the integers from string to numeric (e.g. Float64)
Transform the string "-" into missing
The strategy would be to first define a new column vector filled with missing
df[:C] = fill(missing, size(df)[1])
and then perform the 2 transformations with for loops
for i in 1:size(df)[1]
    if df[:A][i] == "-"
        continue
    else
        df[:C][i] = parse(Float64, df[:A][i])
    end
end
However, when looking at df[:C] I have a column filled only with missing.
What am I doing wrong?
There are several issues with your code, but first let me show how I would write this transformation:
df.C = passmissing(parse).(Float64, replace(df.A, "-"=>missing))
It is not the most efficient way to do it but is simple to reason about.
An implementation using a loop could look like:
df.C = similar(df.A, Union{Float64, Missing});
for (i, a) in enumerate(df.A)
    if !ismissing(a) && a != "-"
        df.C[i] = parse(Float64, a)
    else
        df.C[i] = missing
    end
end
Note that similar will by default fill df.C with missing, so the else branch could be dropped; however, this behavior is not documented, so it is safer to write it out.
You could also use a comprehension:
df.C = [ismissing(a) || a == "-" ? missing : parse(Float64, a) for a in df.A]
Now, to fix your code you could write:
# note a different initialization
# in your code df.C allowed only values of Missing type and disallowed values of Float64 type
df.C = Vector{Union{Float64, Missing}}(missing, size(df, 1))
for i in 1:size(df)[1]
    # note that we need to handle missing values and "-" separately
    if ismissing(df.A[i]) || df.A[i] == "-"
        continue
    else
        df.C[i] = parse(Float64, df.A[i])
    end
end
Finally, note that it is preferred to write df.C rather than df[:C] to access a column in a data frame (currently both are equivalent, but this might change in the future).

Managing MultiIndex Dataframe objects

So, yet another problem using grouped DataFrames that I am getting so confused over...
I have defined an aggregation dictionary as:
aggregations_level_1 = {
    'A': {
        'mean': 'mean',
    },
    'B': {
        'mean': 'mean',
    },
}
And now I have two grouped DataFrames that I have aggregated using the above, then joined:
grouped_top = df1.groupby(['group_lvl']).agg(aggregations_level_1)
grouped_bottom = df2.groupby(['group_lvl']).agg(aggregations_level_1)
Joining these:
df3 = grouped_top.join(grouped_bottom, how='left', lsuffix='_top_10',
                       rsuffix='_low_10')
A_top_10 A_low_10 B_top_10 B_low_10
mean mean mean mean
group_lvl
a 3.711413 14.515901 3.711413 14.515901
b 4.024877 14.442106 3.694689 14.209040
c 3.694689 14.209040 4.024877 14.442106
Now, if I call index and columns I have:
print df3.index
>> Index([u'a', u'b', u'c'], dtype='object', name=u'group_lvl')
print df3.columns
>> MultiIndex(levels=[[u'A_top_10', u'A_low_10', u'B_top_10', u'B_low_10'], [u'mean']],
labels=[[0, 1, 2, 3], [0, 0, 0, 0]])
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
How do I slice and call this? Say I would like to have only A_top_10, A_low_10 for all a,b,c?
Only A_top_10, B_top_10 for a and c?
I am pretty confused so any overall help would be great!
You need slicers, but first sort the columns with sort_index, otherwise you get an error:
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (1), lexsort depth (0)'
df = df.sort_index(axis=1)
idx = pd.IndexSlice
df1 = df.loc[:, idx[['A_low_10', 'A_top_10'], :]]
print (df1)
A_low_10 A_top_10
mean mean
group_lvl
a 14.515901 3.711413
b 14.442106 4.024877
c 14.209040 3.694689
And:
idx = pd.IndexSlice
df2 = df.loc[['a','c'], idx[['A_top_10', 'B_top_10'], :]]
print (df2)
A_top_10 B_top_10
mean mean
group_lvl
a 3.711413 3.711413
c 3.694689 4.024877
EDIT:
So, it looks as though I have a regular DataFrame-object with index a,b,c but each column is a MultiIndex-object. Is this a correct interpretation?
I think that's very close; it is more accurate to say that the columns are a MultiIndex.
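If the two-level columns get unwieldy, a common follow-up (not asked above, shown only as a sketch on the joined df3) is to pull out a single column by its full tuple, or to flatten the column index into plain strings:
# Select one column by its (level-0, level-1) tuple.
a_top_mean = df3[('A_top_10', 'mean')]

# Or flatten the MultiIndex columns into single strings such as 'A_top_10_mean'.
df3.columns = ['_'.join(col) for col in df3.columns]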