replace any strings with nan in a pandas dataframe - pandas

I'm new to pandas and the DataFrame concept. Because of the format of my data (Excel sheets where the first row is the name of my data and the second row is the unit), it's a little tricky to handle in a DataFrame.
The task is to calculate new data from existing columns, e.g. df['c'] = df['a']**2 + df['b']
I get: TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
This did work, but is a pain to my hands and eyes:
df['c'] = df['a']
df['c'] = df['a'].tail(len(df['a'])-1)**2 + df['b'].tail(len(df['b'])-1)
df.loc[0,'c'] = 'unit for c'
Is there any way to do this more quickly or with less typing?
Thanks already
schamonn

Let's look at the error mentioned first in this post:
TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
What this error is saying is that you are trying to raise a string to a power. We can replicate it with the following example:
df = pd.DataFrame({'a':['1','2','3'],'b':[4,5,6]})
df['a']**2
Output last line of stack trace:
TypeError: unsupported operand type(s) for ** or pow(): 'str' and 'int'
A simple fix, if all the values in your a column are numeric strings, is pd.to_numeric:
pd.to_numeric(df['a'])**2
Output:
0 1
1 4
2 9
Name: a, dtype: int64
Got non-numeric strings in column a as well?
Pass errors='coerce' to pd.to_numeric:
df = pd.DataFrame({'a':['a','1','2','3'],'b':[4,5,6,7]})
Use:
pd.to_numeric(df['a'], errors='coerce')**2
Output:
0 NaN
1 1.0
2 4.0
3 9.0
Name: a, dtype: float64
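The same coercion can be applied to every column at once with DataFrame.apply, which is convenient when several columns contain string numbers. A minimal sketch, reusing the made-up frame from above:

```python
import pandas as pd

# Hypothetical data: column 'a' holds string numbers plus one non-numeric entry
df = pd.DataFrame({'a': ['a', '1', '2', '3'], 'b': [4, 5, 6, 7]})

# Coerce every column; non-numeric entries become NaN, numeric columns pass through
clean = df.apply(pd.to_numeric, errors='coerce')
squared = clean['a']**2   # NaN for the bad row, 1.0/4.0/9.0 for the rest
```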

This is how I read in the data:
Data = pd.read_excel(fileName, sheet_name = 'Messung')
In [154]: Data
Out[154]:
   T1   T2 Messung                Datum
0  °C   °C       -                    -
1  12  100       1  2018-12-06 00:00:00
2  15  200       2  2018-12-06 00:00:00
3  20  120       3  2018-12-06 00:00:00
4  10  160       4  2018-12-06 00:00:00
5  12  160       5  2018-12-06 00:00:00
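Given a layout like the one above, one sketch (with invented column values standing in for the Excel sheet) is to split off the unit row once and convert the rest, instead of tail-slicing every column:

```python
import pandas as pd

# Hypothetical stand-in for the Excel sheet: the first data row holds the units
data = pd.DataFrame({'T1': ['°C', '12', '15'], 'T2': ['°C', '100', '200']})

units = data.iloc[0]                      # keep the unit row separately
df = data.iloc[1:].apply(pd.to_numeric)   # everything below it becomes numeric
df['T3'] = df['T1']**2 + df['T2']         # arithmetic now works directly
```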

Related

Pandas astype not converting column to int even when using errors='ignore'

I have the following DF
ID
0 1.0
1 555555.0
2 NaN
3 200.0
When I try to convert the ID column to Int64 I get the following error:
Cannot convert non-finite values (NA or inf) to integer
I've used the following code to solve this problem:
df["ID"] = df["ID"].astype('int64', errors='ignore')
However, when I use the above code my ID column remains float64.
Any tip to solve this problem?
Use pandas' nullable Int64 dtype (pd.Int64Dtype) instead of np.int64:
df['ID'] = df['ID'].astype(pd.Int64Dtype())
Output:
>>> df
ID
0 1
1 555555
2 <NA>
3 200
>>> df['ID'].dtype
Int64Dtype()
>>> df['ID'] + 10
0 11
1 555565
2 <NA>
3 210
Name: ID, dtype: Int64
>>> print(df.to_csv(index=False))
ID
1
555555
""
200
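As a self-contained sketch of the same idea (sample values invented), the string alias 'Int64' does the same conversion and keeps the missing value as <NA>:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1.0, 555555.0, None, 200.0]})
df['ID'] = df['ID'].astype('Int64')   # string alias for pd.Int64Dtype()
```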

dividing column that contains blanks pandas

I have:
df=pd.DataFrame({'a':[10,10,10],'b':[10,'',0]})
df
    a   b
0  10  10
1  10
2  10   0
I want:
df['c']=df['a']/df['b']
but get error:
TypeError: unsupported operand type(s) for /: 'int' and 'str'
but I need the blank to stay blank in the result, i.e. I want
df['c'] = [1, '', inf]
suggestions?
Replace the blanks with NaN, then fill them back:
import numpy as np
(df['a']/df['b'].replace('', np.nan)).fillna('')
Out[162]:
0 1
1
2 inf
dtype: object
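An equivalent sketch using pd.to_numeric instead of replace (same made-up frame as in the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 10, 10], 'b': [10, '', 0]})

# Coerce the blanks to NaN, divide, then put the blanks back
df['c'] = (df['a'] / pd.to_numeric(df['b'], errors='coerce')).fillna('')
```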

Convert hh:mm:ss to minutes return but return TypeError: 'float' object is not subscriptable

Original data in a dataframe look like below and I want to convert it to minutes:
0 03:30:00
1 NaN
2 00:25:00
I learned a very good approach from this post:
Convert hh:mm:ss to minutes using python pandas
Running df2['FS_Runtime'].str.split(':') splits the data as below:
0 [03, 30, 00]
1 NaN
2 [00, 25, 00]
I then added .apply as in the example from that post:
df2['FS_Runtime'].str.split(':').apply(lambda x: int(x[0])*60)
but I got the following error:
TypeError: 'float' object is not subscriptable
The issue is caused by the NaN in the column. You can try this:
df1['FS_Runtime'] = pd.to_datetime(df1['FS_Runtime'], format = '%H:%M:%S')
df1['FS_Runtime'].dt.hour * 60 + df1['FS_Runtime'].dt.minute
0 210.0
1 NaN
2 25.0
Your column is already in the proper format for pd.to_timedelta, so convert it, get the number of seconds, and divide by 60:
import pandas as pd
import numpy as np
pd.to_timedelta(df['FS_Runtime']).dt.total_seconds()/60
# Alternatively
pd.to_timedelta(df['FS_Runtime'])/np.timedelta64(1, 'm')
#0 210.0
#1 NaN
#2 25.0
#Name: FS_Runtime, dtype: float64
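Put together with the sample column (reconstructed here from the question), the timedelta route looks like:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'FS_Runtime': ['03:30:00', np.nan, '00:25:00']})

# NaN becomes NaT in the timedelta, then NaN again in the float result
minutes = pd.to_timedelta(df['FS_Runtime']).dt.total_seconds() / 60
```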

How to index into a data frame using another data frame's indices?

I have a dataframe, num_buys_per_day
Date count
0 2011-01-13 1
1 2011-02-02 1
2 2011-03-03 2
3 2011-06-03 1
4 2011-08-01 1
I have another data frame commissions_buy which I'll give a small subset of:
num_orders
2011-01-10 0
2011-01-11 0
2011-01-12 0
2011-01-13 0
2011-01-14 0
2011-01-18 0
I want to apply the following command
commissions_buy.loc[num_buys_per_day.index, :] = num_buys_per_day.values * commission
where commission is a scalar.
Note that all indices in num_buys_per_day exist in commissions_buy.
I get the following error:
TypeError: unsupported operand type(s) for *: 'Timestamp' and 'float'
How should I do the correct command?
You need to first make the Date column the index:
num_buys_per_day.set_index('Date', inplace=True)
commissions_buy.loc[num_buys_per_day.index, 'num_orders'] = num_buys_per_day['count'].values * commission
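A runnable sketch with a small invented subset of both frames (commission value made up):

```python
import pandas as pd

num_buys_per_day = pd.DataFrame({'Date': pd.to_datetime(['2011-01-13', '2011-02-02']),
                                 'count': [1, 1]})
commissions_buy = pd.DataFrame({'num_orders': 0.0},
                               index=pd.to_datetime(['2011-01-12', '2011-01-13',
                                                     '2011-02-02']))
commission = 5.0

# Make Date the index so .loc aligns on timestamps, not on the integer range index
num_buys_per_day = num_buys_per_day.set_index('Date')
commissions_buy.loc[num_buys_per_day.index, 'num_orders'] = (
    num_buys_per_day['count'].values * commission)
```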

date can not be serialized

I am getting an error while trying to save a dataframe to a Parquet file:
from fastparquet import write
write('profile_dtl.parq', df)
The error is related to the date column and the message looks like this:
ValueError: Can't infer object conversion type: 0 1990-01-01
1 1954-01-01
2 1981-11-15
3 1993-01-21
4 1948-01-01
5 1977-01-01
6 1968-04-28
7 1969-01-01
8 1989-01-01
9 1985-01-01
Name: dob, dtype: object
I have checked that the column is "object", just like other columns that serialize without any problem. If I remove the "dob" column from the dataframe, this line works. It also works when the values contain date+time.
Are plain dates not accepted by fastparquet?
Try changing dob to datetime64 dtype:
import pandas as pd
dob = pd.Series(['1954-01-01', '1981-11-15', '1993-01-21', '1948-01-01',
'1977-01-01', '1968-04-28', '1969-01-01', '1989-01-01',
'1985-01-01'], name='dob')
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
Name: dob, dtype: object
Note the dtype that results:
pd.to_datetime(dob)
Out:
0 1954-01-01
1 1981-11-15
2 1993-01-21
3 1948-01-01
4 1977-01-01
5 1968-04-28
6 1969-01-01
7 1989-01-01
8 1985-01-01
dtype: datetime64[ns]
Using this Series as an index in a DataFrame:
baz = list(range(9))
foo = pd.DataFrame(baz, index=pd.to_datetime(dob), columns=['dob'])
You should be able to save your Parquet file now.
from fastparquet import write
write('foo.parquet', foo)
$ls -l foo.parquet
-rw-r--r-- 1 moi admin 854 Oct 13 16:44 foo.parquet
Your dob Series has object dtype, and you left the object_encoding='infer' argument to fastparquet.write unchanged. From the docs:
"The special value 'infer' will cause the type to be guessed from the first ten non-null values."
Fastparquet does not try to infer a date value from the sample; it only guesses among bytes|utf8|json|bson|bool|int|float.
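So the practical fix is to convert the column in place before writing (a sketch with invented values; the fastparquet write call is not repeated here):

```python
import pandas as pd

df = pd.DataFrame({'dob': ['1990-01-01', '1954-01-01', '1981-11-15']})

# datetime64[ns] has an unambiguous Parquet mapping, so no 'infer' guessing is needed
df['dob'] = pd.to_datetime(df['dob'])
```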