Convert dataframe column from Object to numeric - pandas

Hello, I have a conversion question. I'm using some code to conditionally add a value to a new column in my dataframe (df). The new column ('new_col') is created with dtype object. How do I convert 'new_col' to float for the aggregation in the code that follows? I'm new to Python and have tried several functions and methods. Any help would be greatly appreciated.
conds = [(df['sc1']=='UP_MJB'),(df['sc1']=='UP_MSCI')]
actions = [df['st1'],df['st2']]
df['new_col'] = np.select(conds,actions,default=df['sc1'])

I tried astype(float) and got a ValueError. After talking to a teammate, I tried pd.to_numeric(np.select(conds, actions, default=df['sc1'])). That worked.
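As a minimal, self-contained sketch of the whole pipeline (the frame below uses invented sample values, since the real data isn't shown), passing errors='coerce' to pd.to_numeric turns any non-numeric default values into NaN instead of raising a ValueError:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
df = pd.DataFrame({
    'sc1': ['UP_MJB', 'UP_MSCI', 'OTHER'],
    'st1': [1.5, 2.5, 3.5],
    'st2': [10.0, 20.0, 30.0],
})

conds = [df['sc1'] == 'UP_MJB', df['sc1'] == 'UP_MSCI']
actions = [df['st1'], df['st2']]

# np.select returns an object array here, because the default
# column holds strings; pd.to_numeric converts it to float
df['new_col'] = pd.to_numeric(np.select(conds, actions, default=df['sc1']),
                              errors='coerce')

print(df['new_col'].dtype)  # float64
```

Rows that fall through to a string default become NaN, which aggregations like mean() will then skip by default.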

Related

How to get the mode of a column from a dataframe using spark scala?

I'm trying to get the mode of a column in a dataframe using Spark Scala, but my code is not working.
For example
val type_mode = dfairports.groupBy("type").count().orderBy("count").first()
print("Mode", type_mode.get(0))
You're almost there! You're probably getting the least common value now, since orderBy sorts in ascending order by default, so taking the first element returns the lowest count.
Try:
import org.apache.spark.sql.functions.desc
val type_mode = dfairports.groupBy("type").count().orderBy(desc("count")).first()
print("Mode", type_mode.get(0))
Hope this helps :)

Pandas dataframe being treated as a series object after using groupby

I am conducting an analysis of a dataset. To find my results, I use this line of code:
new_df = df_ncis.groupby(['state', 'year'])['totals'].mean()
The object returned by this statement is a Series, when it should be a dataframe. I don't understand why this happened, or how to solve this issue. Also, one of the columns of the new object is missing its name. Here is the github link for the project: https://github.com/louishrm/gundataUS.
Any help would be great.
You are selecting a single column with ['totals'], which returns a Series.
Try this instead:
new_df = df_ncis[['state', 'year', 'totals']].groupby(['state', 'year']).mean()
which will give you a dataframe with your 3 columns.
or if you want it as a dataframe of one column (Note the double brackets)
new_df = df_ncis.groupby(['state', 'year'])[['totals']].mean()
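To see the difference concretely, here is a small self-contained sketch (the data is invented for illustration). Note also that the "missing column name" in the question comes from state and year becoming a MultiIndex; reset_index() turns them back into ordinary columns:

```python
import pandas as pd

df_ncis = pd.DataFrame({
    'state': ['CA', 'CA', 'NY'],
    'year': [2019, 2019, 2019],
    'totals': [10, 20, 30],
})

# Single brackets -> Series; double brackets -> DataFrame
as_series = df_ncis.groupby(['state', 'year'])['totals'].mean()
as_frame = df_ncis.groupby(['state', 'year'])[['totals']].mean()

print(type(as_series))  # <class 'pandas.core.series.Series'>
print(type(as_frame))   # <class 'pandas.core.frame.DataFrame'>

# Flatten the (state, year) MultiIndex back into columns
flat = as_frame.reset_index()
```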

Parse Datetime in Pandas Dataframe

I have a checkout column of type 'object' in my Dataframe, in the '2017-08-04T23:31:19.000+02:00' format.
But I want it in the format shown in the image (separate date and time columns).
Can anyone help me please?
Thank you :)
You should be able to convert the object column to a date time column, then use the built in date and time functions.
# create an intermediate column that we won't store on the DataFrame
checkout_as_datetime = pd.to_datetime(df['checkout'])
# Add the desired columns to the dataframe
df['checkout_date'] = checkout_as_datetime.dt.date
df['checkout_time'] = checkout_as_datetime.dt.time
Though, if your goal isn't to write these specific new columns out somewhere, but to use them for other calculations, it may be simpler to just overwrite your original column and use the datetime accessors from there.
df['checkout'] = pd.to_datetime(df['checkout'])
df['checkout'].dt.date # to access the date
I haven't tested this, but assuming the columns have already been parsed with pd.to_datetime, something along these lines should produce the formatted strings (note the time format is '%H:%M:%S'; '%H:%m:%s' would give the month in place of minutes):
df['CheckOut_date'] = df["CheckOut_date"].dt.strftime('%Y-%m-%d')
df['CheckOut_time'] = df["CheckOut_time"].dt.strftime('%H:%M:%S')

What's the cleanest way for assigning a new pandas dataframe column to a single value?

Working with a dataframe df, I wanted to create a new column A and assign it to a single value (a string in my case):
df['A'] = value
This gave a warning and suggested using loc.
However, the solution below still gave the same warning:
df.loc[:,'A'] = value
Doing some research I found the solution below which does not generate a warning:
df=df.assign(A =value)
Is it the general accepted way of creating a new column and assigning it to a value? Are there other possibilities using loc?
pandas version '0.20.1'
EDIT: this is the warning message obtained for the 2 first methods
"A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead"
As explained by @EdChum and @ScottBoston:
Since df was derived using a boolean mask on some original dataframe,
df = df_original[boolean_mask]
to avoid the warning with the first two methods, use df = df_original[boolean_mask].copy() instead.
df.assign does not need this because it returns a new copy of the dataframe rather than writing through a view of the original.
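A small sketch of the scenario and of both fixes (the frame and mask are invented for illustration):

```python
import pandas as pd

df_original = pd.DataFrame({'x': [1, 2, 3]})
boolean_mask = df_original['x'] > 1

# Fix 1: .copy() makes df an independent frame, so plain
# column assignment no longer touches a view of df_original
df = df_original[boolean_mask].copy()
df['A'] = 'value'

# Fix 2: .assign returns a brand-new DataFrame, so no
# explicit copy is needed beforehand
df2 = df_original[boolean_mask].assign(A='value')

print(df['A'].tolist())   # ['value', 'value']
print(df2['A'].tolist())  # ['value', 'value']
```

In both cases df_original is left untouched, which is what silences the SettingWithCopyWarning.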

python read data like -.345D-4

I am wondering how to read values like -.345D+1 with numpy/scipy?
The values are floats in Fortran-style 'D' scientific notation, with the leading zero omitted.
I have tried the numpy.loadtxt and got errors like
ValueError: invalid literal for float(): -.345D+01
Many thanks :)
You could write a converter and use the converters keyword. If cols are the indices of the columns where you expect this format:
converters = dict.fromkeys(cols, lambda x: float(x.replace("D", "E")))
np.loadtxt(yourfile, converters=converters)