How can I get the mean value of a str-type column in a pandas DataFrame?

I have a DataFrame in pandas:
I want to get the mean value of "stop_duration" for each "violation_raw".
How can I do it if the "stop_duration" column is of object type?
df = pd.read_csv('police.csv', parse_dates=['stop_date'])
df[['stop_date', 'violation_raw', 'stop_duration']]
My table:
[screenshot of the table]

Use the to_datetime function to convert the object column to datetime, specifying a format that matches your data.
import pandas as pd
df["column"] = pd.to_datetime(df["column"], format="%M-%S Min")

Related

How to change a string to NaN when applying astype?

I have a column in a DataFrame that has integers like: [1, 2, 3, 4, 5, 6, ...]
My problem: one of the values in this column is a string, like this: [1, 2, 3, 2, 3, 'hello from France', 1, 2, 3].
The dtype of this column is object.
I want to cast it to float with column.astype(float), but I get an error because of that string.
The column has over 10,000 records and only this one record contains a string. How can I cast to float and change this string to NaN, for example?
You can use pd.to_numeric with errors='coerce', which turns unparseable values into NaN:
import pandas as pd
df = pd.DataFrame({
    'all_nums': range(5),
    'mixed': [1, 2, 'woo', 4, 5],
})
df['mixed'] = pd.to_numeric(df['mixed'], errors='coerce')
df.head()
Before and after: [screenshots — the 'woo' entry becomes NaN and the column dtype changes to float64]
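If you want to verify how many values were coerced, a small follow-up check on the same frame:
# Count entries that failed to parse and became NaN.
print(df['mixed'].isna().sum())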

Converting DataFrame columns to category type in pyspark

I have a DataFrame df and I want to convert some of its columns to category type. Using pandas I can do it like this:
for col in categorical_collist:
    df[col] = df[col].astype('category')
I want to do the same column conversion in pyspark. How can I do it?
I have tried the code below in pyspark, but it is not giving the output I expect:
from pyspark.sql.types import StringType

for col in categorical_collist:
    df = df.withColumn(col, df[col].cast(StringType()))
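Spark has no direct equivalent of pandas' category dtype; the closest standard tool is StringIndexer from pyspark.ml.feature, which encodes a string column as numeric category indices. A sketch under that assumption (the '_idx' output suffix is arbitrary):
from pyspark.ml.feature import StringIndexer

# Encode each categorical column as numeric category indices,
# roughly analogous to pandas' category codes.
for col in categorical_collist:
    indexer = StringIndexer(inputCol=col, outputCol=col + '_idx')
    df = indexer.fit(df).transform(df)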

Converting a DataFrame into SQL

I am using the following code to write my pandas DataFrame to SQL, but I get the error below even though my dtype is float64 for this particular column.
I have tried converting the dtype to str, but this did not work.
import sqlite3
import pandas as pd

# create db file
conn = sqlite3.connect('example.db')

# write my df data to sql
df.to_sql('users', con=conn, if_exists='replace')
InterfaceError: Error binding parameter 1214 - probably unsupported type.
However, when I check parameter 1214 (i.e. column 1214 in my df), that column has a float64 dtype. I don't understand how to solve this problem.
Double-check your data types, as SQLite supports only a limited set of types: https://www.sqlite.org/datatype3.html. My guess would be to force a float type for that column on write (so try passing dtype).
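A sketch of that suggestion, forcing SQLite's REAL type on write; 'col_1214' is a hypothetical name for the offending column:
import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')

# Force SQLite's REAL type for the problem column when writing.
# 'col_1214' stands in for whatever column 1214 is actually named.
df.to_sql('users', con=conn, if_exists='replace',
          dtype={'col_1214': 'REAL'})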

How do you append a column and drop a column with pandas DataFrames? I can't figure out why it won't print the DataFrame afterwards

The DataFrame I am working with has a datetime column that I converted to a date object. I tried to append the date object as the last column of the DataFrame and to drop the original datetime column.
Neither the append nor the drop works as expected, and nothing seems to change afterwards. It should print the entire DataFrame (shortened here; it is long).
My code:
import pandas as pd
import numpy as np
df7=pd.read_csv('kc_house_data.csv')
print(df7)
mydates = pd.to_datetime(df7['date']).dt.date
print(mydates)
df7.append(mydates)
df7.drop(['date'], axis=1)
print(df7)
Why drop and append at all? You can simply overwrite the column:
df7['date'] = pd.to_datetime(df7['date']).dt.date
import pandas as pd
import numpy as np
# read csv, convert column type
df7=pd.read_csv('kc_house_data.csv')
df7['date'] = pd.to_datetime(df7['date']).dt.date
print(df7)
Drop a column using df7.drop('date', axis=1, inplace=True).
Append a column using df7['date'] = mydates.
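Note that DataFrame.append adds rows, not columns, and both append and drop return new objects rather than modifying df7 in place, which is why the original code appeared to do nothing. Equivalently to inplace=True, you can assign the results back (a minimal sketch, using mydates from the question's code):
# drop() returns a new DataFrame; assign it back.
df7 = df7.drop('date', axis=1)
# Add the converted dates as a new column.
df7['date'] = mydates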

pandas HDFStore select rows with non-null values in the data column

In a pandas DataFrame/Series there's an .isnull() method. Is there something similar for the where= filter of HDFStore's select method?
WORKAROUND SOLUTION:
The /meta node of a data column inside the HDF5 file, which holds that column's unique values, can be used as a hack solution:
import pandas as pd

store = pd.HDFStore('store.h5')
print(store.groups())

# The meta node lists the unique non-null values of the data column.
non_null = list(store.select("/df/meta/my_data_column/meta"))
# where= strings can reference local variables such as non_null.
df = store.select('df', where='my_data_column == non_null')
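If the meta node is not available for your column, a documented alternative is to read just that column with select_column, find the coordinates of the non-null rows, and select only those rows (a sketch, assuming my_data_column was stored as a data column):
# Read only the column, locate non-null rows, then select them.
col = store.select_column('df', 'my_data_column')
coords = col.dropna().index
df = store.select('df', where=coords)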