Dealing with greater than and less than values in numeric data when reading csv in pandas

My csv file contains numeric data, but some values have greater-than or less-than symbols, e.g. ">244". I want my data type to be a float. When reading the file into pandas:
df = pd.read_csv('file.csv')
I get a warning:
Columns (2) have mixed types. Specify dtype option on import or set low_memory=False.
I have checked this question: Pandas read_csv: low_memory and dtype options, and tried specifying the data type of the relevant column with:
df = pd.read_csv('file.csv',dtype={'column':'float'})
However, this gives an error:
ValueError: could not convert string to float: '>244'
I have also tried
df = pd.read_csv('file.csv',dtype={'column':'float'}, error_bad_lines=False)
However, this does not solve my problem; I get the same error as above.
My problem appears to be that my data has a mixture of string and floats. Can I ignore any rows containing strings in particular columns when reading in the data?

You can use:
df = pd.read_csv('file.csv', dtype={'column':'str'})
Then convert to numeric, coercing anything unparseable (like '>244') to NaN:
df['column'] = pd.to_numeric(df['column'], errors='coerce')
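Since the coerced values become NaN, you can then drop those rows entirely if you want to ignore them:
df = df.dropna(subset=['column'])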

I found a workaround, which was to read in my data as-is:
df = pd.read_csv('file.csv')
Then remove any rows whose values start with '<' or '>':
df = df.loc[df['column'].str[:1] != '<']
df = df.loc[df['column'].str[:1] != '>']
Then convert to numeric with pd.to_numeric:
df['column'] = pd.to_numeric(df['column'])
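For what it's worth, the two filters can be combined into a single pass with isin (a sketch, using the same placeholder column name):
df = df.loc[~df['column'].str[:1].isin(['<', '>'])]
df['column'] = pd.to_numeric(df['column'])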

Related

Aggregating multiple data types in pandas groupby

I have a data frame in which rows are mostly translations of other rows, e.g. an English row and an Arabic row. They share an identifier (location_shelfLocator), and I'm trying to merge the rows together based on that identifier match. In some columns the Arabic row doesn't contain a translation but repeats the same English value (e.g. in the language column both records might have ['ger'], which becomes ['ger', 'ger']), so I would like to get rid of these duplicate values. This is my code:
df_merged = df_filled.groupby("location_shelfLocator").agg(
    lambda x: np.unique(x.tolist())
)
It works when the values being aggregated are the same type (e.g. when they are both strings or when they are both arrays). When one is a string and the other is an array, it doesn't work. I get this warning:
FutureWarning: ['subject_name_namePart'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
df_merged = df_filled.groupby("location_shelfLocator").agg(lambda x: np.unique(x.tolist()))
and the offending column is removed from the final data frame. Any idea how I can combine these values and remove duplicates when they are both lists, both strings, or one of each?
Here is some sample data:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
81055/vdc_100000000094.0x000093,ara,"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', 'الكواكب']",المُلكية العامة,كلاوديوس بطلميوس (بطليمو)
81055/vdc_100000000094.0x000093,ara,"['Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']",Public Domain,"['Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
And expected output:
location_shelfLocator,language_languageTerm,subject_topic,accessCondition,subject_name_namePart
"[‘81055/vdc_100000000094.0x000093’] ",[‘ara’],"['فلك، العرب', 'فلك، اليونان', 'فلك، العصور الوسطى', ‘الكواكب’, 'Astronomy, Arab', 'Astronomy, Greek', 'Astronomy, Medieval', 'Constellations']","[‘المُلكية العامة’, ‘Public Domain’]","[‘كلاوديوس بطلميوس (بطليمو)’,’Claudius Ptolemaeus (Ptolemy)', ""'Abd al-Raḥmān ibn 'Umar Ṣūfī""]"
If you cannot control the input values, you need to fix them somehow.
Something like this. Here, I am converting the plain-string values in subject_name_namePart to lists of strings:
from ast import literal_eval
mask = df.subject_name_namePart.str[0] != '['
df.loc[mask, 'subject_name_namePart'] = "['" + df.loc[mask, 'subject_name_namePart'] + "']"
df['subject_name_namePart'] = df.subject_name_namePart.transform(literal_eval)
Then you can explode and aggregate:
df = df.explode('subject_name_namePart')
df = df.groupby('location_shelfLocator').agg(lambda x: x.unique().tolist())
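If you'd rather handle every column uniformly, whatever mix of lists and strings it holds, a small helper inside the aggregation avoids the explode step entirely (a sketch; it assumes the bracketed strings were already parsed with literal_eval as above):
def merge_unique(series):
    # Flatten lists and scalars into one list, keeping first occurrences only.
    merged = []
    for value in series:
        items = value if isinstance(value, list) else [value]
        for item in items:
            if item not in merged:
                merged.append(item)
    return merged

df_merged = df.groupby('location_shelfLocator').agg(merge_unique)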

How can I get the mean value of a str-type column in a DataFrame in Pandas

I have a DataFrame from pandas:
I want to get the mean value of "stop_duration" for each "violation_raw".
How can I do this when the "stop_duration" column is of object type?
df = pd.read_csv('police.csv', parse_dates=['stop_date'])
df[['stop_date', 'violation_raw','stop_duration']]
My table: [screenshot omitted; the columns shown are stop_date, violation_raw, and stop_duration]
Use the to_datetime function to convert the column from object to datetime, specifying a format that matches your data:
import pandas as pd
df["column"] = pd.to_datetime(df["column"], format="%M-%S Min")

Why pandas does not want to subset given columns in a list

I'm trying to remove certain values with the code below, but pandas won't let me; instead it outputs:
ValueError: Unable to coerce to Series, length must be 10: given 2
Here is my code:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv")
print(df.shape)
columns_df = ['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
'num-of-cylinders', 'horsepower', 'average-mileage', 'price']
prohibited_symbols = ['?','Nan''n.a']
df = df[df[columns_df] != prohibited_symbols]
print(df)
Try:
import re
pattern = '|'.join(map(re.escape, prohibited_symbols))
df = df[~df[columns_df].apply(lambda col: col.astype(str).str.contains(pattern)).any(axis=1)]
The regex '|' alternation matches any of your prohibited symbols (re.escape keeps '?' from being read as a regex quantifier), str.contains flags matches column by column, and any(axis=1) drops every row where at least one of the listed columns matched.
Because what you are trying does not do what you imagine it should.
df = df[df[columns_df] != prohibited_symbols]
Comparing a DataFrame to a list does not iterate over the prohibited symbols. pandas tries to broadcast the list across your columns, which is exactly the ValueError you got: ten columns versus a two-element list (note that ['?','Nan''n.a'] has only two elements, because the missing comma silently concatenates 'Nan' and 'n.a' into 'Nann.a'). Even if the lengths matched, != would only perform an element-wise inequality check against positionally aligned values, and it would not delete anything from your cells.
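You can reproduce the broadcast failure in miniature (made-up three-column frame):
import pandas as pd

small = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
small != ['?', 'Nann.a']  # ValueError: Unable to coerce to Series, length must be 3: given 2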
You'll have to use a for loop and clean every column, like this:
import re
pattern = '|'.join(map(re.escape, prohibited_symbols))
for column in columns_df:
    df[column] = df[column].astype(str).str.replace(pattern, '', regex=True)
You can also specify the values you consider null with the na_values argument when reading the data, and then use dropna:
Example:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv", na_values=['?','Nan''n.a'])
df = df.dropna()

Setting the data model in pandas

So I'm used to database ETL. In SQL I create the table and set the char lengths, data types, etc. As I understand it, pandas uses the max length of whatever is put into the dataframe. That's fine if you're staying in Python, but I need to specify these things explicitly.
Here's some base code to work from, pointers welcome:
df = pd.DataFrame()
df['ID'] = some data probably i + 1
df['text'] = some text length set to max 255
Here is an informative article on pandas datatypes:
https://pbpython.com/pandas_dtypes.html
If you want to see the datatypes of your dataframe, you can do:
df.info()
You can set datatypes explicitly with .astype(), choosing whether conversion errors are raised or ignored:
df['ID'] = df['ID'].astype(int, errors='raise')
df['ID'] = df['ID'].astype(int, errors='ignore')
For strings you can set the datatype as follows:
df['text'] = df['text'].astype('string')
Or if you're using an older version of pandas < 1.0 then do:
df['text'] = df['text'].astype(str)
If you want to set a max length for your string, you could do:
df['text'] = df['text'].str.slice(0, 255)
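Putting it together, a minimal sketch of enforcing an explicit "schema" (the column names follow the question; the sample data is made up):
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'text': ['a' * 300, 'b', 'c']})
df = df.astype({'ID': int, 'text': 'string'})
df['text'] = df['text'].str.slice(0, 255)  # emulate VARCHAR(255)
print(df.dtypes)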

Converting DataFrame into sql

I am using the following code to convert my pandas DataFrame to SQL, but I get the error below even though the dtype of this particular column is float64.
I have tried converting the dtype to str, but this did not work.
import sqlite3
import pandas as pd
# create db file
db = sqlite3.connect('example.db')
# convert my df data to sql
df.to_sql('users', con=db, if_exists='replace')
InterfaceError: Error binding parameter 1214 - probably unsupported type.
However, when I check parameter 1214, i.e. column 1214 in my df, that column has a float64 dtype, so I don't understand how to solve this problem.
Double-check your data types, as SQLite supports only a limited set of data types --> https://www.sqlite.org/datatype3.html. My guess would be to force a float type (so try passing dtype='float' to to_sql).
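A minimal sketch of that suggestion (the table name comes from the question; the column name and sample data are hypothetical):
import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')
df = pd.DataFrame({'my_column': [1.5, 2.5]})
# Map the offending column to SQLite's REAL type explicitly; a scalar
# such as dtype='float' would apply that type to every column instead.
df.to_sql('users', con=conn, if_exists='replace', dtype={'my_column': 'REAL'})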