How can I replace values in cells in a dataframe? [duplicate] - pandas

This question already has answers here:
replace() method not working on Pandas DataFrame
(9 answers)
Closed 3 months ago.
I have cells with the value '...' and I want to replace them with NaN. My dataframe is called energy.
import pandas as pd
import numpy as np
file_name_energy = "data/Energy Indicators.xls"
energy = pd.read_excel(file_name_energy)
energy.replace('...', np.nan)
I tried to use replace(), but it doesn't work and doesn't output any error.
energy.head(10)

You need to reassign your dataframe or use inplace=True:
energy = energy.replace('...', np.nan)  # or energy.replace('...', np.nan, inplace=True)
But since you're reading an Excel file, why not use the na_values parameter of pandas.read_excel?
Try this:
energy = pd.read_excel(file_name_energy, na_values=["..."])
From the documentation:
na_values : scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
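For completeness, a minimal self-contained sketch of the replace-and-reassign fix (the toy data and column names are invented for illustration):
import pandas as pd
import numpy as np

# toy frame standing in for the Excel data
energy = pd.DataFrame({'Country': ['A', 'B'], 'Energy Supply': ['...', '42']})

# replace() returns a new frame, so reassign the result (or pass inplace=True)
energy = energy.replace('...', np.nan)
print(energy)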

Related

Pandas series pad function not working with apply in pandas

I am trying to pad the columns of my pandas dataframe with different characters. I tried using the apply function to pad with '0' via zfill, and it works.
print(df["Date"].apply(lambda x: x.zfill(10)))
But when I try to use the pad function via apply on my dataframe, I get an error:
AttributeError: 'str' object has no attribute 'pad'
The code I am trying is:
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0")))
Both zfill and pad are part of pandas.Series.str. I am confused why pad does not work while zfill does. How can I achieve this functionality?
Full code:
import pandas as pd
from io import StringIO
StringData = StringIO(
"""Date,Time
パンダ,パンダ
パンダサンDA12-3,パンダーサンDA12-3
パンダサンDA12-3,パンダサンDA12-3
"""
)
df = pd.read_csv(StringData, sep=",")
print(df["Date"].apply(lambda x: x.zfill(10))) -- works
print(df["Date"].apply(lambda x: x.pad(10, side="left", fillchar="0"))) -- doesn't work
I am using pandas 1.5.1.
You should not use apply here. Inside apply, each value is a plain Python str, so you only get pure Python string methods: str has zfill but no pad, which is exactly why one works and the other doesn't. Use the Series .str accessor instead:
print(df["Date"].str.zfill(10))
print(df["Date"].str.pad(10, side="left", fillchar="0"))
output:
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
0 0000000パンダ
1 パンダサンDA12-3
2 パンダサンDA12-3
Name: Date, dtype: object
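To see for yourself why apply fails here, you can inspect the type the lambda receives (this reuses the df built above):
# each element handed to apply is a built-in str, not a Series
x = df["Date"].iloc[0]
print(type(x))              # <class 'str'>
print(hasattr(x, "zfill"))  # True: str defines zfill
print(hasattr(x, "pad"))    # False: pad only exists on Series.str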
Multiple columns:
Now, you need to use apply, but this is DataFrame.apply, not Series.apply:
df[['col1', 'col2', 'col3']].apply(lambda s: s.str.pad(10, side="left", fillchar="0"))
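A runnable sketch of the multi-column case (the frame and the names col1..col3 are invented for illustration; note the result must be assigned back):
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'bb'], 'col2': ['c', 'dd'], 'col3': ['e', 'ff']})

# DataFrame.apply hands each column to the lambda as a Series,
# so the .str accessor is available inside it
cols = ['col1', 'col2', 'col3']
df[cols] = df[cols].apply(lambda s: s.str.pad(10, side="left", fillchar="0"))
print(df)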

Why pandas does not want to subset given columns in a list

I'm trying to remove certain values with the code below, but pandas won't let me; instead it outputs
ValueError: Unable to coerce to Series, length must be 10: given 2
Here is my code:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv")
print(df.shape)
columns_df = ['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
'num-of-cylinders', 'horsepower', 'average-mileage', 'price']
prohibited_symbols = ['?', 'Nan', 'n.a']
df = df[df[columns_df] != prohibited_symbols]
print(df)
Try:
import re
df = df[~df[columns_df].apply(lambda s: s.astype(str).str.contains('|'.join(map(re.escape, prohibited_symbols)))).any(axis=1)]
The regex alternation '|' matches any of your prohibited symbols, which drops every record containing one. Two details matter here: .str is a Series accessor, not a DataFrame one, so the per-column apply is needed, and re.escape keeps '?' from being treated as a regex metacharacter.
Because what you are trying does not do what you imagine it should.
df = df[df[columns_df] != prohibited_symbols]
The line above compares the whole DataFrame against the list: pandas tries to align the list with the 10 selected columns, and since the list has a different length it raises the ValueError you saw. != performs only a simple element-wise inequality check; you can't test membership in a list of prohibited symbols that way, and even if it ran, that syntax would not delete the values from your cells.
You'll have to use a for loop and clean every column like this (again, re.escape keeps '?' from being read as a regex metacharacter):
import re
for column in columns_df:
    df[column] = df[column].str.replace('|'.join(map(re.escape, prohibited_symbols)), '', regex=True)
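Another element-wise alternative is DataFrame.isin, which does the membership test that != can't; a small sketch with an invented two-column frame:
import pandas as pd

df = pd.DataFrame({'company': ['audi', '?', 'bmw'], 'price': ['13950', 'n.a', '16430']})

mask = df.isin(['?', 'Nan', 'n.a'])   # True wherever a cell equals a prohibited symbol
print(df[~mask.any(axis=1)])          # keep only rows with no prohibited symbols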
You can also specify the values you consider null with the na_values argument when reading the data, and then use dropna from pandas.
Example:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv", na_values=['?','Nan''n.a'])
df = df.dropna()

Python Pandas : Read dates from excel in different formats [duplicate]

I have one field in a pandas DataFrame that was imported in string format.
It should be a datetime variable. How do I convert it to a datetime column, and then filter based on date?
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
You can use the .apply() method to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as datetime. With infer_datetime_format=True it will try to detect the format automatically and convert the column to datetime (note: this parameter is deprecated as of pandas 2.0, where strict format inference is the default behavior).
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however it results in a Python warning of
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to some chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
errors='coerce' is useful
If some rows are not in the correct format or are not datetimes at all, the errors= parameter is very useful: you can convert the valid rows and handle the rows that contained invalid values later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
Setting the correct format= is much faster than letting pandas find out¹
Long story short, passing the correct format= from the beginning, as in chrisb's post, is much faster than letting pandas figure out the format, especially if the format contains a time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking a couple of minutes vs. a few seconds). All valid format options can be found at https://strftime.org/.
¹ Code used to produce the timeit test plot:
import pandas as pd
import perfplot
from random import choices
from datetime import datetime

# component ranges for building random m/d/Y H:M:S.f strings
mdYHMSf = range(1, 13), range(1, 29), range(2000, 2024), range(24), *[range(60)]*2, range(1000)

perfplot.show(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m, d, Y, H, M, S, f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)
Just like converting an object dtype to float or int, you can use astype() (note this only works when the strings are in a format pandas can parse on its own, which is not the case for '05SEP2014:00:00:00.000'):
raw_data['Mycol'] = raw_data['Mycol'].astype('datetime64[ns]')
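None of the answers above covers the second half of the question, filtering based on date. A minimal sketch, reusing the format from chrisb's answer (the second row is invented for illustration):
import pandas as pd

df = pd.DataFrame({'Mycol': ['05SEP2014:00:00:00.000', '17JAN2015:12:30:00.000']})
df['Mycol'] = pd.to_datetime(df['Mycol'], format='%d%b%Y:%H:%M:%S.%f')

# once the column is datetime64, comparisons against date strings just work
print(df[df['Mycol'] >= '2015-01-01'])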

Pandas replace with blanks [duplicate]

I am trying to create a dataframe in pandas using a CSV that is semicolon-delimited, and uses commas for the thousands separator on numeric data. Is there a way to read this in so that the type of the column is float and not string?
Pass param thousands=',' to read_csv to read those values as thousands:
In [27]:
import pandas as pd
import io
t="""id;value
0;123,123
1;221,323,330
2;32,001"""
pd.read_csv(io.StringIO(t), thousands=r',', sep=';')
Out[27]:
id value
0 0 123123
1 1 221323330
2 2 32001
The answer to this question should be short:
df = pd.read_csv('filename.csv', thousands=',')
Take a look at the read_csv documentation: there is a keyword argument thousands that you can pass ',' into. Likewise, if you had European data using '.' as the thousands separator, you could do the same, typically together with decimal=','.
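A small sketch of that European case (the sample data is invented; thousands= and decimal= are both documented read_csv parameters):
import io
import pandas as pd

# European-style numbers: '.' for thousands, ',' for decimals
t = """id;value
0;1.234.567,89
1;32,5"""
df = pd.read_csv(io.StringIO(t), sep=';', thousands='.', decimal=',')
print(df['value'])  # 1234567.89 and 32.5, dtype: float64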

How to select multi range of rows in pandas dataframe [duplicate]

This question already has answers here:
Python pandas slice dataframe by multiple index ranges
(3 answers)
Slice multiple column ranges with Pandas
(1 answer)
Closed 5 years ago.
Given an example pandas dataframe with an index from 0 to 30, I would like to select the rows within several index ranges: [0:5], [10:15] and [20:25].
How can I do that?
Say you have a random pandas DataFrame with 30 rows and 4 columns as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,30,size=(30, 4)), columns=list('ABCD'))
You can then use np.r_ to index into ranges of rows [0:5], [10:15] and [20:25] as follows:
df.loc[np.r_[0:5, 10:15, 20:25], :]
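One caveat: np.r_[0:5, 10:15, 20:25] produces integer positions, and .loc matches them against index labels, so this works here because the example frame has the default RangeIndex. With any other index, the positional equivalent is:
# positional selection, independent of the index labels
df.iloc[np.r_[0:5, 10:15, 20:25]]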