Create a Series from a dictionary using pandas

# Create a Series from a dictionary using pandas
import pandas as pd

data_dict = {'Ahmed': 90, 'Ali': 85, 'Omar': 80}
series = pd.Series(data_dict, index=['Ahmed', 'Ali', 'Omar'])
print("Series :", series)
series2 = pd.Series(data_dict, index=['Ahmed', 'Ali', 'Omar', 'Karthi'])
print("Series 2 :", series2)
I tried this code while practising pandas and received the output below:
Series :
Ahmed 90
Ali 85
Omar 80
dtype: int64
Series 2 :
Ahmed 90.0
Ali 85.0
Omar 80.0
Karthi NaN
dtype: float64
Question: Why did the data type change from int to float in Series 2?
I just wanted to see what the output would be if I added an extra label to the index that does not belong to the dictionary. I got NaN as expected, but the dtype changed from int to float.

When providing a dictionary to pandas.Series, the keys are used as index, and the values as data.
In fact you only need:
series = pd.Series(data_dict)
print(series)
Ahmed 90
Ali 85
Omar 80
dtype: int64
If you use a list as the source of the data, then the index is useful:
series = pd.Series([90, 85, 80], index=['Ahmed','Ali','Omar'])
print(series)
Ahmed 90
Ali 85
Omar 80
dtype: int64
When you provide both, this acts as a reindex:
series = pd.Series(data_dict, index=['Ahmed','Ali','Omar','Karthi'])
# equivalent to
series = pd.Series(data_dict).reindex(['Ahmed','Ali','Omar','Karthi'])
print(series)
Ahmed 90.0
Ali 85.0
Omar 80.0
Karthi NaN
dtype: float64
In this case, missing indices are filled with NaN by default, which forces the float64 dtype.
You can prevent the change by using the Int64 dtype that supports an integer NA:
series = pd.Series(data_dict, index=['Ahmed','Ali','Omar','Karthi'], dtype='Int64')
print(series)
Output:
Ahmed 90
Ali 85
Omar 80
Karthi <NA>
dtype: Int64
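If a NaN placeholder is not needed at all, another option (a sketch, not from the original answer) is to reindex with an explicit integer fill_value, so no NaN is introduced and the series keeps its integer dtype:

```python
import pandas as pd

data_dict = {'Ahmed': 90, 'Ali': 85, 'Omar': 80}

# Reindexing with an integer fill_value introduces no NaN,
# so the series keeps its int64 dtype.
series = pd.Series(data_dict).reindex(['Ahmed', 'Ali', 'Omar', 'Karthi'], fill_value=0)
print(series)
print(series.dtype)  # int64
```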

NaN is a special floating-point value (IEEE 754). There is no value for Karthi in series2, so it is automatically filled in with NaN. Try converting one of the integers to np.nan and you will see the same behavior: a series that contains a floating-point value is automatically cast to a floating-point dtype.
import pandas as pd
import numpy as np
data_dict = {'Ahmed': 90, 'Ali': 85, 'Omar': np.nan}
series = pd.Series(data_dict, index=['Ahmed', 'Ali', 'Omar'])
print(series)
Output:
Ahmed 90.0
Ali 85.0
Omar NaN
dtype: float64
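The same upcasting rule can be seen without a dictionary at all, as a quick sketch:

```python
import pandas as pd
import numpy as np

# An all-integer series keeps the int64 dtype...
s_int = pd.Series([90, 85, 80])
print(s_int.dtype)   # int64

# ...but a single NaN (a float value) upcasts the whole series to float64.
s_nan = pd.Series([90, 85, np.nan])
print(s_nan.dtype)   # float64
```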

Related

extracting hour and minutes from a cell in pandas column

Example
How can I split or extract 04:38 from 04:38:00 AM in a pandas dataframe column?
>>> df.timestamp
3 2020-01-17 07:02:20.540540416
2 2020-01-24 01:10:37.837837824
7 2020-03-14 21:58:55.135135232
Name: timestamp, dtype: datetime64[ns]
>>> df.timestamp.dt.strftime('%H:%M')
3 07:02
2 01:10
7 21:58
Name: timestamp, dtype: object
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html?highlight=dt%20strftime#pandas.Series.dt.strftime
If the column holds plain strings such as '04:38:00 AM', str.slice keeps just the first five characters:
df["hm"] = df["time"].str.slice(stop=5)
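A minimal sketch combining both answers, assuming a hypothetical frame with a datetime column `timestamp` and a string column `time`:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-01-17 07:02:20', '2020-01-24 01:10:37']),
    'time': ['04:38:00 AM', '11:05:00 PM'],
})

# Datetime column: format through the .dt accessor.
df['hm_dt'] = df['timestamp'].dt.strftime('%H:%M')

# String column: keep the first five characters ('HH:MM').
df['hm_str'] = df['time'].str.slice(stop=5)

print(df[['hm_dt', 'hm_str']])
```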

How to create the column in pandas based on values of another column

I created a new column by adding values from one column to the index of the column from which the new column is derived. The code works fine when I run it on a sample column, but when I pass my existing dataframe it throws the error "can only perform ops with scalar values". From what I found, the code expects a dict, which is why it throws the error.
I tried converting the dataframe to a dictionary or to a list, but no luck.
df = pd.DataFrame({'Name': ['Sam', 'Andrea', 'Alex', 'Robin', 'Kia', 'Sia'],
                   'Age': [14, 25, 55, 8, 21, 43],
                   'd_id_max': [2, 1, 1, 2, 0, 0]})
df['Expected_new_col'] = df.loc[df.index + df['d_id_max'].to_list, 'Age'].to_numpy()
print(df)
error: can only perform ops with scalar values.
This is the dataframe I want to implement this code:
Weight Name Age 1 2 abs_max d_id_max
0 45 Sam 14 11.0 41.0 41.0 2
1 88 Andrea 25 30.0 -17.0 30.0 1
2 56 Alex 55 -47.0 -34.0 47.0 1
3 15 Robin 8 13.0 35.0 35.0 2
4 71 Kia 21 22.0 24.0 24.0 2
5 44 Sia 43 2.0 22.0 22.0 2
6 54 Ryan 45 20.0 0.0 20.0 1
Writing your new column like this will not return an error:
df.loc[df.index + df['d_id_max'], 'Age'].to_numpy()
EDIT:
You should first format d_id_max as int (or float):
df['d_id_max'] = df['d_id_max'].astype(int)
The solution was very simple: I was getting the error because the dtype of the column d_id_max was object, when it should be integer (or float). Changing the dtype fixed it.
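A runnable sketch of the fix, assuming the sample frame from the question with d_id_max stored as strings (object dtype), as in the error case:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Sam', 'Andrea', 'Alex', 'Robin', 'Kia', 'Sia'],
    'Age': [14, 25, 55, 8, 21, 43],
    'd_id_max': ['2', '1', '1', '2', '0', '0'],  # object dtype, as in the error case
})

# Cast to int first; adding an object column to the index raises the error.
df['d_id_max'] = df['d_id_max'].astype(int)

# Offset each row's index by d_id_max and look up Age at that position.
df['Expected_new_col'] = df.loc[df.index + df['d_id_max'], 'Age'].to_numpy()
print(df)
```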

Pandas - DataFrame aggregate behaving oddly

Related to Dataframe aggregate method passing list problem and Pandas fails to aggregate with a list of aggregation functions
Consider this dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]
According to the documentation for aggregate you should be able to specify which columns to aggregate using a dict like this:
df.agg({'a' : 'mean'})
Which returns
a 13.5
But if you try to aggregate with a user-defined function like this one
def nok_mean(x):
    return np.mean(x)
df.agg({'a' : nok_mean})
It returns the mean for each row rather than the column
a
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Why does the user-defined function not return the same as aggregating with np.mean or 'mean'?
This is using pandas version 0.23.4, numpy version 1.15.4, python version 3.7.1
The issue has to do with applying np.mean to a series. Let's look at a few examples:
def nok_mean(x):
    return x.mean()
df.agg({'a': nok_mean})
a 13.5
dtype: float64
this works as expected because you are using pandas version of mean, which can be applied to a series or a dataframe:
df['a'].agg(nok_mean)
df.apply(nok_mean)
Let's see what happens when np.mean is applied to a series:
def nok_mean1(x):
    return np.mean(x)
df['a'].agg(nok_mean1)
df.agg({'a':nok_mean1})
df['a'].apply(nok_mean1)
df['a'].apply(np.mean)
all return
0 0.0
1 3.0
2 6.0
3 9.0
4 12.0
5 15.0
6 18.0
7 21.0
8 24.0
9 27.0
Name: a, dtype: float64
when you apply np.mean to a dataframe it works as expected:
df.agg(nok_mean1)
df.apply(nok_mean1)
a 13.5
b -8.0
dtype: float64
in order to get np.mean to work as expected with a function pass an ndarray for x:
def nok_mean2(x):
    return np.mean(x.values)
df.agg({'a':nok_mean2})
a 13.5
dtype: float64
I am guessing all of this has to do with apply, which is why df['a'].apply(nok_mean2) returns an attribute error.
I am guessing here in the source code
When you define your nok_mean function, your function definition is basically saying that you want np.mean for each row
It finds the mean for each row and returns you the result.
For example, if your dataframe looked like this:
a b
0 [0, 0] 1
1 [3, 4] -1
2 [6, 8] -3
3 [9, 12] -5
4 [12, 16] -7
5 [15, 20] -9
6 [18, 24] -11
7 [21, 28] -13
8 [24, 32] -15
9 [27, 36] -17
Then df.agg({'a': nok_mean}) would return this:
a
0 0.0
1 3.5
2 7.0
3 10.5
4 14.0
5 17.5
6 21.0
7 24.5
8 28.0
9 31.5
This is related to how calculations are made on pandas side.
When you pass a dict of functions, the input is treated as a DataFrame instead of a flattened array. After that all calculations are made over the index axis by default. That's why you're getting the means by row.
If you go to the docs page you'll see:
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d,
axis=0).
I think the only way to emulate numpy's behavior and pass a dict of functions to agg at the same time is df.agg(nok_mean)['a'].
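A self-contained sketch of the .values workaround (the row-wise behaviour shown above was observed on pandas 0.23.4 and is version-dependent; the workaround below returns the column mean either way):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(index=range(10))
df['a'] = [3 * x for x in range(10)]
df['b'] = [1 - 2 * x for x in range(10)]

def nok_mean2(x):
    # Passing the underlying ndarray avoids the per-element dispatch.
    return np.mean(x.values)

result = df.agg({'a': nok_mean2})
print(result)
```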

Convert hh:mm:ss to minutes but get TypeError: 'float' object is not subscriptable

Original data in a dataframe look like below and I want to convert it to minutes:
0 03:30:00
1 NaN
2 00:25:00
I learned a very good approach from this post:
Convert hh:mm:ss to minutes using python pandas
Running df2['FS_Runtime'].str.split(':') splits the data as below:
0 [03, 30, 00]
1 NaN
2 [00, 25, 00]
I then added .apply as in the example from the post:
df2['FS_Runtime'].str.split(':').apply(lambda x: int(x[0])*60)
but I got the following error:
TypeError: 'float' object is not subscriptable
The issue is caused by the NaN in the dataframe. You can try this instead:
df1['FS_Runtime'] = pd.to_datetime(df1['FS_Runtime'], format = '%H:%M:%S')
df1['FS_Runtime'].dt.hour * 60 + df1['FS_Runtime'].dt.minute
0 210.0
1 NaN
2 25.0
Your column is already in the proper format for pd.to_timedelta; convert it, then get the number of seconds and divide by 60:
import pandas as pd
import numpy as np
pd.to_timedelta(df['FS_Runtime']).dt.total_seconds()/60
# Alternatively
pd.to_timedelta(df['FS_Runtime'])/np.timedelta64(1, 'm')
#0 210.0
#1 NaN
#2 25.0
#Name: FS_Runtime, dtype: float64
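A self-contained sketch of the timedelta approach, assuming the three-row column from the question:

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({'FS_Runtime': ['03:30:00', np.nan, '00:25:00']})

# to_timedelta maps the NaN row to NaT, and total_seconds() turns NaT back
# into NaN, so no special-casing of missing values is needed.
df2['minutes'] = pd.to_timedelta(df2['FS_Runtime']).dt.total_seconds() / 60
print(df2)
```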

error using astype when NaN exists in a dataframe

df
A B
0 a=10 b=20.10
1 a=20 NaN
2 NaN b=30.10
3 a=40 b=40.10
I tried :
df['A'] = df['A'].str.extract(r'(\d+)').astype(int)
df['B'] = df['B'].str.extract(r'(\d+)').astype(float)
But I get the following error:
ValueError: cannot convert float NaN to integer
And:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
How do I fix this ?
If some values in a column are missing (NaN) and the column is converted to numeric, the dtype is always float. You cannot convert the values to int, only to float, because the type of NaN is itself float.
print (type(np.nan))
<class 'float'>
See the docs on how values are converted if at least one NaN is present:
integer → cast to float64
If need int values you need replace NaN to some int, e.g. 0 by fillna and then it works perfectly:
df['A'] = df['A'].str.extract(r'(\d+)', expand=False)
df['B'] = df['B'].str.extract(r'(\d+)', expand=False)
print (df)
A B
0 10 20
1 20 NaN
2 NaN 30
3 40 40
df1 = df.fillna(0).astype(int)
print (df1)
A B
0 10 20
1 20 0
2 0 30
3 40 40
print (df1.dtypes)
A int32
B int32
dtype: object
Since pandas >= 0.24 there is a built-in nullable pandas integer dtype.
It allows integer NAs, so you don't need to fill them.
Notice the capital 'I' in 'Int64' in the code below.
This is the pandas integer dtype, not the numpy one.
You need to use: .astype('Int64')
So, do this:
df['A'] = df['A'].str.extract(r'(\d+)', expand=False).astype('float').astype('Int64')
df['B'] = df['B'].str.extract(r'(\d+)', expand=False).astype('float').astype('Int64')
More info on pandas integer na values:
https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
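Putting this together as a runnable sketch, using a frame like the one in the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['a=10', 'a=20', np.nan, 'a=40'],
                   'B': ['b=20.10', np.nan, 'b=30.10', 'b=40.10']})

# Extract the digits, go through float to accommodate NaN,
# then cast to the nullable Int64 dtype (note the capital 'I').
df['A'] = df['A'].str.extract(r'(\d+)', expand=False).astype('float').astype('Int64')
df['B'] = df['B'].str.extract(r'(\d+)', expand=False).astype('float').astype('Int64')
print(df.dtypes)
```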