How to show multiple timeseries plots using seaborn - pandas

I'm trying to generate 4 plots from a DataFrame using Seaborn
Date A B C D
2019-04-05 330.665 161.975 168.69 0
2019-04-06 322.782 150.243 172.539 0
2019-04-07 322.782 150.243 172.539 0
2019-04-08 295.918 127.801 168.117 0
2019-04-09 282.674 126.894 155.78 0
2019-04-10 293.818 133.413 160.405 0
I have cast the dates using pd.to_datetime and the numbers using pd.to_numeric. Here is the df.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 460 to 465
Data columns (total 5 columns):
Date 6 non-null datetime64[ns]
A 6 non-null float64
B 6 non-null float64
C 6 non-null float64
D 6 non-null float64
dtypes: datetime64[ns](1), float64(4)
memory usage: 288.0 bytes
I can do a wide column plot by just calling .plot() on df.
However:
1. The legend of the plot covers the plot itself.
2. I would instead like to have 4 separate plots in 1 diagram, and have tried using lmplot to achieve this.
3. I would like to add labels to the plot like so:
Plot with image
I first melted the data:
df=pd.melt(df,id_vars='Date', var_name='Var', value_name='Unit')
And then tried lmplot
sns.lmplot(x = df['Date'], y='Unit', col='Var', data=df)
However, I get the traceback:
TypeError: Invalid comparison between dtype=datetime64[ns] and str
I have also tried df.set_index('Date') and replotting with x=df.index, and that gave me the same error.
The data can be plotted using Google Sheets but I am trying to automate a workflow where the chart can be generated and sent via Slack to selected recipients.
I hope I have expressed myself clearly enough as I am rather new to Python and Seaborn and hope to get some help from the experts here.

Regarding the legend you can just use .legend(loc="upper left", bbox_to_anchor=(1,1)) as in this example
%matplotlib inline
import pandas as pd
import numpy as np
data = np.random.rand(10,4)
df = pd.DataFrame(data, columns=["A", "B", "C", "D"])
df.plot()\
.legend(loc="upper left", bbox_to_anchor=(1,1));
For the second point, if I understand correctly, you can start from:
df.plot(subplots=True, layout=(2,2));
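The two suggestions above can be combined into a rough sketch for the question's data. It assumes the wide (un-melted) frame with Date moved into the index, a headless Agg backend for the automated job, and an in-memory PNG as the thing to hand to whatever upload step is used; none of those choices come from the original answer.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend, assumed suitable for an automated job
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2019-04-05", "2019-04-06", "2019-04-07",
             "2019-04-08", "2019-04-09", "2019-04-10"]
        ),
        "A": [330.665, 322.782, 322.782, 295.918, 282.674, 293.818],
        "B": [161.975, 150.243, 150.243, 127.801, 126.894, 133.413],
        "C": [168.69, 172.539, 172.539, 168.117, 155.78, 160.405],
        "D": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    }
)

# One panel per column on a 2x2 grid, with Date on the x-axis
axes = df.set_index("Date").plot(
    subplots=True, layout=(2, 2), legend=False, figsize=(10, 6)
)

# Label each panel with its column name instead of using a legend
for ax, name in zip(axes.ravel(), ["A", "B", "C", "D"]):
    ax.set_title(name)
    ax.set_ylabel("Unit")

# Render to an in-memory PNG (hypothetical hand-off point for an upload step)
fig = axes[0, 0].get_figure()
buffer = io.BytesIO()
fig.savefig(buffer, format="png", bbox_inches="tight")
png_bytes = buffer.getvalue()
```

Because each panel carries its own title, no legend is drawn and nothing can cover the lines.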

Related

Change NaN to None in Pandas dataframe

I am trying to replace NaN with None in a pandas dataframe. It worked when I used df.where(df.notnull(),None).
Here is the thread for this method.
Use None instead of np.nan for null values in pandas DataFrame
When I tried the same method on another dataframe, it failed.
The new dataframe looks like this:
A NaN B C D E, and the printout of the dataframe is:
Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6
0 A NaN B C D E
Even when I run the code that worked before against the new dataframe, it fails.
I am wondering whether this is because the cells in the Excel file have to be of a certain format.
Any suggestions?
This always works for me
df = df.replace({np.nan:None})
You can check the related question; credit goes to the answer there.
The problem was that I did not assign the result back to the dataframe.
The form I used that caused the problem was
df.where(df.notnull(), None)
If I wrote the code like this, there is no problem
df = df.where(df.notnull(), None)
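The difference between the two forms can be seen in a small sketch. The sample frame is made up, and the astype(object) cast is an addition that is not in the original answer: it is assumed to be needed on newer pandas versions, where a float column may otherwise keep NaN.

```python
import numpy as np
import pandas as pd

# Made-up frame with one float column and one text column
df = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", np.nan]})

# Calling the method without assigning returns a new frame and leaves df untouched
df.astype(object).where(df.notna(), None)
still_has_nan = bool(df.isna().any().any())

# Assigning the result keeps the converted frame
converted = df.astype(object).where(df.notna(), None)
```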
To do it just over one column
df.col_name.replace({np.nan: None}, inplace=True)
This is not as easy as it looks.
1. NaN is the value set for any cell that is empty when we read a file using pandas.read_csv()
2. None is the value set for any cell that is NULL when we read from a database using pandas.read_sql()
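import pandas as pd
import numpy as np
df = pd.read_csv('file.csv')
# replace NaN with None first (the affected columns become object dtype)
df = df.replace({np.nan: None})
# then choose the required datatype for the column
df['prog'] = df['prog'].astype(str)
print(df)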
If there is a datatype compatibility issue, it is because replacing np.nan turns the column into object dtype.
So in this case, first replace np.nan with None and then choose the required datatype for the column.
file.csv
column names : batch,prog,name
'prog' column is empty

Series.replace cannot use dict-like to_replace and non-None value [duplicate]

I've got a pandas DataFrame filled mostly with real numbers, but there is a few nan values in it as well.
How can I replace the nans with averages of columns where they are?
This question is very similar to this one: numpy array: replace nan values with average of columns but, unfortunately, the solution given there doesn't work for a pandas DataFrame.
You can simply use DataFrame.fillna to fill the nan's directly:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
In [28]: df.mean()
Out[28]:
A -0.151121
B -0.231291
C -0.530307
dtype: float64
In [29]: df.fillna(df.mean())
Out[29]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325 1.533582
4 -0.151121 -0.231291 0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858 1.033826 -0.530307
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
The docstring of fillna says that value should be a scalar or a dict, however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
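As a compact illustration of the point above, here is a sketch on a made-up two-column frame showing that passing the Series of means and passing its dict form fill the same cells.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap per column
df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 6.0]})

column_means = df.mean()  # Series indexed by column name

# Each NaN is filled with the mean of its own column
filled_from_series = df.fillna(column_means)
filled_from_dict = df.fillna(column_means.to_dict())
```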
Try:
sub2['income'].fillna((sub2['income'].mean()), inplace=True)
In [16]: df = DataFrame(np.random.randn(10,3))
In [17]: df.iloc[3:5,0] = np.nan
In [18]: df.iloc[4:6,1] = np.nan
In [19]: df.iloc[5:8,2] = np.nan
In [20]: df
Out[20]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 NaN -0.985188 -0.324136
4 NaN NaN 0.238512
5 0.769657 NaN NaN
6 0.141951 0.326064 NaN
7 -1.694475 -0.523440 NaN
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
In [22]: df.mean()
Out[22]:
0 -0.251534
1 -0.040622
2 -0.841219
dtype: float64
Apply fillna per column, using the mean of that column:
In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]:
0 1 2
0 1.148272 0.227366 -2.368136
1 -0.820823 1.071471 -0.784713
2 0.157913 0.602857 0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622 0.238512
5 0.769657 -0.040622 -0.841219
6 0.141951 0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8 0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
Although the code below does the job, its performance takes a big hit once the DataFrame has 100k records or more:
df.fillna(df.mean())
In my experience, one should replace NaN values (be it with the mean or the median) only where it is required, rather than applying fillna() all over the DataFrame.
I had a DataFrame with 20 variables, and only 4 of them required NaN treatment (replacement). I tried the above code (Code 1), along with a slightly modified version of it (Code 2), which I ran selectively, i.e. only on the variables that had a NaN value.
#------------------------------------------------
#----(Code 1) Treatment on overall DataFrame-----
df.fillna(df.mean())
#------------------------------------------------
#----(Code 2) Selective Treatment----------------
for i in df.columns[df.isnull().any(axis=0)]:  #---Applying only on variables with NaN values
    df[i] = df[i].fillna(df[i].mean())
#---df.isnull().any(axis=0) gives a True/False flag (Boolean Series),
#---which, when used to index df.columns, identifies the variables with NaN values
Below is the performance I observed as I kept increasing the number of records in the DataFrame:
DataFrame with ~100k records
Code 1: 22.06 Seconds
Code 2: 0.03 Seconds
DataFrame with ~200k records
Code 1: 180.06 Seconds
Code 2: 0.06 Seconds
DataFrame with ~1.6 Million records
Code 1: code kept running endlessly
Code 2: 0.40 Seconds
DataFrame with ~13 Million records
Code 1: --did not even try, after seeing performance on 1.6 Mn records--
Code 2: 3.20 Seconds
Apologies for a long answer ! Hope this helps !
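The selective treatment can also be written without an explicit loop. The sketch below uses a made-up random frame and only checks that the selective fill gives the same values as filling the whole frame; it makes no claim about timings, which will depend on the pandas version and the data.

```python
import numpy as np
import pandas as pd

# Made-up frame: five numeric columns, NaN introduced in only two of them
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))
df.loc[::10, "b"] = np.nan
df.loc[::7, "d"] = np.nan

# Columns that actually contain at least one NaN
nan_cols = df.columns[df.isna().any()]

# Fill only those columns, each with its own mean
selective = df.copy()
selective[nan_cols] = selective[nan_cols].fillna(selective[nan_cols].mean())

# Reference result: fill across the whole frame
overall = df.fillna(df.mean())
```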
If you want to impute missing values with mean and you want to go column by column, then this will only impute with the mean of that column. This might be a little more readable.
sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
import numpy as np
import pandas as pd
# To read data from csv file
Dataset = pd.read_csv('Data.csv')
X = Dataset.iloc[:, :-1].values
# To calculate mean use imputer class
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
To fill all the null values with the column means directly, use df.fillna(df.mean()).
To fill the null values of a single column with the mean of that column, use the following. Here Item_Weight is the column name, and the column is reassigned with its null values filled by its own mean:
df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))
To fill null values with a string instead, use the following. Here Outlet_Size is the column name:
df.Outlet_Size = df.Outlet_Size.fillna('Missing')
Pandas: How to replace NaN (nan) values with the average (mean), median or other statistics of one column
Say your DataFrame is df and you have one column called nr_items. This is: df['nr_items']
If you want to replace the NaN values of your column df['nr_items'] with the mean of the column:
Use method .fillna():
mean_value=df['nr_items'].mean()
df['nr_item_ave']=df['nr_items'].fillna(mean_value)
I have created a new df column called nr_item_ave to store the new column with the NaN values replaced by the mean value of the column.
You should be careful when using the mean. If you have outliers, it is usually better to use the median.
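A small made-up example shows why. With a single extreme value in the column, the mean is pulled far away from the typical values while the median is not:

```python
import numpy as np
import pandas as pd

# Made-up column with one outlier and one missing value
nr_items = pd.Series([1.0, 2.0, 3.0, np.nan, 1000.0])

mean_value = nr_items.mean()      # 251.5, dominated by the outlier
median_value = nr_items.median()  # 2.5, close to the typical values

filled_with_median = nr_items.fillna(median_value)
```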
Another option besides those above is:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
It's less elegant than previous responses for mean, but it could be shorter if you desire to replace nulls by some other column function.
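Using the SimpleImputer class from the sklearn library:
import numpy as np
from sklearn.impute import SimpleImputer
# x is assumed to be a 2-D NumPy array; SimpleImputer works column-wise and takes no axis argument
missingvalues = SimpleImputer(missing_values=np.nan, strategy='mean')
missingvalues = missingvalues.fit(x[:,1:3])
x[:,1:3] = missingvalues.transform(x[:,1:3])
Note: the older Imputer class took missing_values='NaN' and an axis argument; in recent versions SimpleImputer takes missing_values=np.nan and has no axis parameter.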
I use this method to fill missing values by average of a column.
fill_mean = lambda col : col.fillna(col.mean())
df = df.apply(fill_mean, axis = 0)
You can also use value_counts to get the most frequent values. This would work on different datatypes.
df = df.apply(lambda x:x.fillna(x.value_counts().index[0]))
Here is the value_counts api reference.
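A minimal sketch of the most-frequent-value fill on a made-up frame with one text column and one numeric column. Two caveats that are not in the original answer: if several values tie for most frequent, which one is picked is not something to rely on, and a column that is entirely NaN would raise an IndexError because value_counts() is empty.

```python
import numpy as np
import pandas as pd

# Made-up frame with mixed datatypes and one gap per column
df = pd.DataFrame({
    "color": ["red", "red", np.nan, "blue"],
    "count": [1.0, 1.0, np.nan, 2.0],
})

# value_counts() ignores NaN and sorts by frequency, so index[0] is the most frequent value
filled = df.apply(lambda col: col.fillna(col.value_counts().index[0]))
```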

Python scatter plot vs line plot and column values

Wondering if anyone could clarify this for me.
Basically, I have a dataframe that looks like this:
Data_Value
Month_Day
01-01 1.1
01-02 3.9
01-03 3.9
01-04 4.4
I can generate a line plot based on this dataframe using this code:
ax.plot(df.values)
I have had some problems generating a scatter plot from the same data frame and I am wondering if it's possible given that there is a "-" in the index column of the dataframe. However, I am also thinking that since it's possible to generate a line plot it should also be possible to do a scatter plot?
Any insights would be most welcome.
When I try this code:
df = df.reset_index()
df['Month_Day'] = pd.to_datetime(df['Month_Day'], format='%m-%d')
df.plot(type='scatter',x='Month_Day',y='Data_Value')
I get this error msg:
AttributeError: Unknown property type
My Pandas version: 0.19.2
Not sure if I understood your issue completely, but if it's just to create scatter plots, you can try resetting the index to convert 'Month_Day' to a regular column and also convert it to datetime. I tried the following:
df.reset_index(inplace=True)
df['Month_Day'] = pd.to_datetime(df['Month_Day'], format='%m-%d')
# you can replace the year with any value, using 2020 as an example
df['Month_Day'] = [val.replace(year=2020) for val in df['Month_Day']]
print(df)
Output:
Month_Day Data_Value
0 2020-01-01 1.1
1 2020-01-02 3.9
2 2020-01-03 3.9
3 2020-01-04 4.4
Then generate a scatter plot:
import matplotlib.pyplot as plt
# generate the plot
plt.scatter(df['Month_Day'], df['Data_Value'])
plt.show()
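On the question of whether the "-" in the index is the obstacle: as far as I know it is not, because matplotlib can treat string labels as categories on the x-axis. The sketch below, which rebuilds the frame from the question and assumes a headless Agg backend, scatters the values straight against the string index without any datetime conversion. The points are then spaced evenly by position rather than by date.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption for running without a display
import matplotlib.pyplot as plt
import pandas as pd

# Frame rebuilt from the question, with the string index kept as-is
df = pd.DataFrame(
    {"Data_Value": [1.1, 3.9, 3.9, 4.4]},
    index=pd.Index(["01-01", "01-02", "01-03", "01-04"], name="Month_Day"),
)

fig, ax = plt.subplots()
# String labels are handled as categories, one tick per label
points = ax.scatter(df.index.tolist(), df["Data_Value"])
ax.set_xlabel("Month_Day")
ax.set_ylabel("Data_Value")
```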
You can do it, but I believe you have to have 'Month_Day' in the columns so you reset the index.
df = df.reset_index()
df.plot(kind='scatter',x='Month_Day',y='Data_Value')

Pandas groupby in combination with sklearn preprocessing continued

Continue from this post:
Pandas groupby in combination with sklearn preprocessing
I need to preprocess data by scaling it within groups defined by two columns, but I get an error with the second method:
import pandas as pd
import numpy as np
from sklearn.preprocessing import robust_scale,minmax_scale
df = pd.DataFrame(dict(id=list('AAAAABBBBB'),
                       loc=(10,20,10,20,10,20,10,20,10,20),
                       value=(0,10,10,20,100,100,200,30,40,100)))
df['new'] = df.groupby(['id','loc']).value.transform(lambda x:minmax_scale(x.astype(float) ))
df['new'] = df.groupby(['id','loc']).value.transform(lambda x:robust_scale(x ))
The second one gives me an error like this:
ValueError: Expected 2D array, got 1D array instead: array=[ 0. 10.
100.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a
single sample.
If I use reshape, I get an error like this:
Exception: Data must be 1-dimensional
If I print out the grouped data, g['value'] is a pandas Series.
for n, g in df.groupby(['id','loc']):
    print(type(g['value']))
Do you know what might cause it?
Thanks.
Based on the error message, you should add reshape and then concatenate the result back into one dimension:
df.groupby(['id','loc']).value.transform(lambda x:np.concatenate(robust_scale(x.values.reshape(-1,1))))
Out[606]:
0 -0.2
1 -1.0
2 0.0
3 1.0
4 1.8
5 0.0
6 1.0
7 -2.0
8 -1.0
9 0.0
Name: value, dtype: float64
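To make it clearer what the numbers above represent, here is a pandas-only sketch that reproduces them without sklearn. It assumes robust_scale is used with its default settings, i.e. centering on the group median and dividing by the interquartile range; robust_scale_1d is a hypothetical helper name, and unlike sklearn it does not guard against a zero interquartile range.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": list("AAAAABBBBB"),
    "loc": (10, 20, 10, 20, 10, 20, 10, 20, 10, 20),
    "value": (0, 10, 10, 20, 100, 100, 200, 30, 40, 100),
})

def robust_scale_1d(x):
    """Center a 1-D Series on its median and divide by its interquartile range."""
    x = x.astype(float)
    iqr = x.quantile(0.75) - x.quantile(0.25)  # no guard for iqr == 0 in this sketch
    return (x - x.median()) / iqr

# transform passes each group's values as a Series and expects a same-length result
df["new"] = df.groupby(["id", "loc"])["value"].transform(robust_scale_1d)
```

Because the helper takes and returns a one-dimensional Series, neither reshape nor concatenate is needed here.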

Pandas not detecting the datatype of a Series properly

I'm running into something a bit frustrating with pandas Series. I have a DataFrame with several columns, with numeric and non-numeric data. For some reason, however, pandas thinks some of the numeric columns are non-numeric, and ignores them when I try to run aggregating functions like .describe(). This is a problem, since pandas raises errors when I try to run analyses on these columns.
I've copied some commands from the terminal as an example. When I slice the 'ND_Offset' column (the problematic column in question), pandas tags it with the dtype of object. Yet, when I call .describe(), pandas tags it with the dtype float64 (which is what it should be). The 'Dwell' column, on the other hand, works exactly as it should, with pandas giving float64 both times.
Does anyone know why I'm getting this behavior?
In [83]: subject.phrases['ND_Offset'][:3]
Out[83]:
SubmitTime
2014-06-02 22:44:44 0.3607049
2014-06-02 22:44:44 0.2145484
2014-06-02 22:44:44 0.4031347
Name: ND_Offset, dtype: object
In [84]: subject.phrases['ND_Offset'].describe()
Out[84]:
count 1255.000000
unique 432.000000
top 0.242308
freq 21.000000
dtype: float64
In [85]: subject.phrases['Dwell'][:3]
Out[85]:
SubmitTime
2014-06-02 22:44:44 111
2014-06-02 22:44:44 81
2014-06-02 22:44:44 101
Name: Dwell, dtype: float64
In [86]: subject.phrases['Dwell'].describe()
Out[86]:
count 1255.000000
mean 99.013546
std 30.109327
min 21.000000
25% 81.000000
50% 94.000000
75% 111.000000
max 291.000000
dtype: float64
And when I use the .groupby function to group the data by another attribute (when these Series are a part of a DataFrame), I get the DataError: No numeric types to aggregate error when I try to call .agg(np.mean) on the group. When I try to call .agg(np.sum) on the same data, on the other hand, things work fine.
It's a bit bizarre -- can anyone explain what's going on?
Thank you!
It might be because the ND_Offset column (what I call A below) contains a non-numeric value such as an empty string. For example,
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0.36, ''], 'B': [111, 81]})
print(df['A'].describe())
# count 2.00
# unique 2.00
# top 0.36
# freq 1.00
# dtype: float64
try:
    print(df.groupby(['B']).agg(np.mean))
except Exception as err:
    print(err)
    # No numeric types to aggregate
print(df.groupby(['B']).agg(np.sum))
# A
# B
# 81
# 111 0.36
Aggregation using np.sum works because
In [103]: np.sum(pd.Series(['']))
Out[103]: ''
whereas np.mean(pd.Series([''])) raises
TypeError: Could not convert to numeric
To debug the problem, you could try to find the non-numeric value(s) using:
for val in df['A']:
    if not isinstance(val, float):
        print('Error: val = {!r}'.format(val))
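Once the offending values are known, one possible follow-up is to convert the column with pd.to_numeric. This is a sketch on a made-up frame, not something from the original answer; it assumes that turning unparseable entries into NaN is acceptable for the analysis.

```python
import pandas as pd

# Made-up frame: column A holds a float, an empty string and a numeric string
df = pd.DataFrame({"A": [0.36, "", "0.21"], "B": [111, 81, 81]})

# errors="coerce" turns anything that cannot be parsed into NaN
as_numbers = pd.to_numeric(df["A"], errors="coerce")

# Entries that were present but could not be parsed as numbers
bad_values = df.loc[as_numbers.isna() & df["A"].notna(), "A"]

# With a float64 column, mean aggregation works again
df["A"] = as_numbers
group_means = df.groupby("B")["A"].mean()
```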