How to box plot a pandas Timestamp series? (Errors with Timestamp type) - pandas

I'm using:
Pandas version 0.23.0
Python version 3.6.5
Seaborn version 0.81.1
I'd like a Box Plot of a column of Timestamp data. My dataframe is not a time series, the index is just an integer but I have created a column of Timestamp data using:
# create a new column of time stamps corresponding to EVENT_DTM
data['EVENT_DTM_TS'] =pd.to_datetime(data.EVENT_DTM, errors='coerce')
I filter out all NaT values resulting from coerce.
dt_filtered_time = data[~data.EVENT_DTM_TS.isnull()]
At this point my data looks good and I can confirm the type of the EVENT_DTM_TS column is Timestamp with no invalid values.
Finally to generate the single variable box plot I invoke:
ax = sns.boxplot(x=dt_filtered_time.EVENT_DTM_TS)
and get the error:
TypeError: ufunc add cannot use operands with types dtype('M8[ns]') and dtype( 'M8[ns]')
I've Googled and found:
https://github.com/pandas-dev/pandas/issues/13844
https://github.com/matplotlib/matplotlib/issues/9610
which seemingly indicate issues with data type representations.
I've also seen references to issues with pandas version 0.21.0.
Does anyone have an easy fix, or do I need to use a different data type to draw the box plot? I'd like a single picture of the distribution of the timestamp data.

This is the code I ended up with:
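import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

def convert_to_date_string(x, pos):
    # x is a tick value in seconds since the epoch; pos is the tick position
    return time.strftime('%Y-%m', time.localtime(x))

plt.figure(figsize=(15, 4))
sns.set(style='whitegrid')
# datetime64[ns] -> integer nanoseconds since the epoch -> float seconds
temp = dt_filtered_time.EVENT_DTM_TS.astype(np.int64) / 1E9
ax = sns.boxplot(x=temp)
ax.xaxis.set_major_formatter(FuncFormatter(convert_to_date_string))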
Here is the result:
Credit goes to ImportanceOfBeingErnest whose comment pointed me towards this solution.
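The idea behind the workaround above is that the plotting code needs plain numbers, so the timestamps are turned into seconds since the epoch and only converted back to dates for display. A minimal sketch of that round trip, using made-up sample values in place of the real EVENT_DTM column and subtracting the epoch rather than calling astype (an alternative that does not depend on the column's time resolution):

```python
import pandas as pd

# Hypothetical sample data standing in for the real EVENT_DTM column.
data = pd.DataFrame(
    {"EVENT_DTM": ["2018-01-01", "2018-01-03", "not a date", "2018-01-05"]}
)
data["EVENT_DTM_TS"] = pd.to_datetime(data["EVENT_DTM"], errors="coerce")
dt_filtered_time = data[data["EVENT_DTM_TS"].notna()]

# Float seconds since the Unix epoch, which numeric code can add and average.
epoch = pd.Timestamp("1970-01-01")
seconds = (dt_filtered_time["EVENT_DTM_TS"] - epoch) / pd.Timedelta(seconds=1)

# The quartiles a box plot would draw, converted back to Timestamps for reading.
quartiles = pd.to_datetime(seconds.quantile([0.25, 0.5, 0.75]), unit="s")
print(quartiles)
```

The same float series can be passed to sns.boxplot, with a tick formatter doing the conversion back to date strings.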

Related

Cannot plot a histogram from a Pandas dataframe

I've used pandas.read_csv to generate a 1000-row dataframe with 32 columns. I'm looking to plot a histogram or bar chart (depending on data type) of each column. For columns of type 'int64', I've tried doing matplotlib.pyplot.hist(df['column']) and df.hist(column='column'), as well as calling matplotlib.pyplot.hist on df['column'].values and df['column'].to_numpy(). Weirdly, they all take a really long time (>30s), and when I've allowed them to complete, I get unit-height bars in multiple colors, as if there's some sort of implicit grouping and they're all being separated into different groups. Any ideas about what I can do to get a normal histogram? Unfortunately I closed the charts so I can't show you an example right now.
Edit - this seems to be a much bigger problem with Int columns, and casting them to float fixes the problem.
Follow these two steps:
import pyplot from the Matplotlib library
call its hist function on the column you want to plot
import matplotlib.pyplot as plt
plt.hist(df['column'], color='blue', edgecolor='black', bins=45)
Here's the source.
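The edit to the question reports that casting the integer columns to float made the problem go away. A minimal sketch of that cast combined with the plt.hist call above, assuming a made-up integer column and the non-interactive Agg backend (chosen here only so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not part of the original answer
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical int64 column standing in for one of the 32 CSV columns.
df = pd.DataFrame({"column": np.arange(1000) % 50})

# Cast to float before plotting, as the question's edit suggests.
values = df["column"].astype(float)
counts, edges, patches = plt.hist(values, bins=45, color="blue", edgecolor="black")
plt.savefig("hist.png")
```

Passing a single one-dimensional Series yields one set of bars; `counts` holds the bar heights and `edges` the bin boundaries.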

Convert type object column to float

I have a table with a column named "price". This column is of type object, so it contains numbers as strings as well as NaN and ? characters. I want to find the mean of this column, but first I have to remove the NaN and ? values and convert the column to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
edit: in pandas version 1.3 and earlier, you need subset=[col] wrapped in a list/array. In version 1.4 and greater you can pass a single column as a string.
You've got a few problems:
df.dropna() arguments require the axis and then the subset. The axis is rows/columns, and then subset is which of those to look at. So you want this to be (I think) df.dropna(axis='rows',subset='price')
Using inplace=True makes the whole thing return None, and so you have set df = None. You don't want to do that. If you are using inplace=True, then you don't assign something to that, the whole line would just be df.dropna(...,inplace=True).
Don't use inplace=True, just do the assignment. That is, you should use df=df.dropna(axis='rows',subset='price')
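The points above fix the dropna call, but the "?" placeholders still need handling before the mean can be computed. One possible sketch, assuming a few made-up rows in place of Automobile_data.csv and using pd.to_numeric with errors='coerce' (a technique not mentioned in the answer) to turn anything non-numeric into NaN first:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the "price" column of Automobile_data.csv.
df = pd.DataFrame({"price": ["13495", "?", np.nan, "16500"]})

# Non-numeric strings such as "?" become NaN instead of raising an error.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop the rows with a missing price; subset is wrapped in a list so this
# also works on older pandas versions.
df = df.dropna(subset=["price"])

mean_price = df["price"].mean()
print(mean_price)
```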

Trying to Drop values by column (I convert these values to nan but could be anything) not working

I'm trying to drop NAs by column in Dask, given a certain threshold, but I receive an error. This should be working; please advise.
Reproducible example:
import pandas as pd
import dask.dataframe as dd
data = [['tom', 10], ['nick', 15], ['juli', 5]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
import numpy as np
df = df.replace(5, np.nan)
ddf = dd.from_pandas(df, npartitions = 2)
ddf.dropna(axis='columns')
Passing axis is not supported for dask dataframes as of now. You can also print the docstring of the function via ddf.dropna? and it will tell you the same:
Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.
This docstring was copied from pandas.core.frame.DataFrame.dropna.
Some inconsistencies with the Dask version may exist.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
Determine if rows or columns which contain missing values are
removed.
* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing value.
.. versionchanged:: 1.0.0
Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False (Not supported in Dask)
If True, do operation inplace and return None.
Returns
-------
DataFrame or None
DataFrame with NA entries dropped from it or None if ``inplace=True``.
It's worth noting that Dask's documentation is copied from pandas in many cases like this. Wherever it is, the docstring states so explicitly:
This docstring was copied from pandas.core.frame.DataFrame.drop. Some
inconsistencies with the Dask version may exist.
Therefore it's always best to check the docstring of dask's pandas-derived functions instead of relying on the pandas documentation.
The reason this isn't supported in dask is that it requires computing the entire dataframe in order for dask to know the shape of the result. This is very different from the row-wise case, where the number of columns and partitions won't change, so the operation can be scheduled without doing any work.
Dask does not allow some parts of the pandas API which seem like normal pandas operations which might be ported to dask, but in reality can't be scheduled without triggering compute on the current frame. You're running into this issue by design, because while .dropna(axis=0) would work just fine as a scheduled operation, .dropna(axis=1) would have a very different implication.
You can do this manually with the following:
ddf[ddf.columns[~ddf.isna().any(axis=0)]]
but the filtering operation ddf.columns[~ddf.isna().any(axis=0)] will trigger a compute on the whole dataframe. It probably makes sense to persist prior to running this if you can fit the dataframe in your cluster's memory.

Writing data frame with object dtype to HDF5 only works after converting to string

I have a big data dataframe and I want to write it to disk for quick retrieval. I believe to_hdf(...) infers the data type of the columns and sometimes gets it wrong. I wonder what the correct way is to cope with this.
import pandas as pd
import numpy as np
length = 10
df = pd.DataFrame({"a": np.random.randint(1e7, 1e8, length),})
# df.loc[1, "a"] = "abc"
# df["a"] = df["a"].astype(str)
print(df.dtypes)
df.to_hdf("df.hdf5", key="data", format="table")
Uncommenting various lines leads me to the following.
Just filling the column with numbers will lead to a data type int32 and stores without problem
Setting one element to abc changes the data to object, but it seems that to_hdf internally infers another data type and throws an error: TypeError: object of type 'int' has no len()
Explicitly converting the column to str leads to success, and to_hdf stores the data.
Now I am wondering what is happening in the second case, and is there a way to prevent this? The only way I found was to go through all columns, check if they are dtype('O') and explicitly convert them to str.
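The column-by-column workaround described above could be sketched as follows. The helper name stringify_object_columns and the sample frame are hypothetical, and the to_hdf call itself is left out so the snippet does not depend on PyTables being installed:

```python
import pandas as pd

def stringify_object_columns(df):
    """Return a copy in which every object-dtype column is converted to str.

    Hypothetical helper illustrating the workaround from the question.
    """
    out = df.copy()
    for col in out.columns[out.dtypes == object]:
        out[col] = out[col].astype(str)
    return out

# A mixed int/str column has dtype object, like the "abc" case above.
df = pd.DataFrame({"a": [12345678, "abc", 87654321], "b": [1.0, 2.0, 3.0]})
clean = stringify_object_columns(df)
print(clean["a"].tolist())
```

After this step every value in the former object columns is a string, which is the state in which the question reports to_hdf succeeding.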
Instead of using hdf5, I have found a generic pickling library which seems to be perfect for the job: joblib
Storing and loading data is straightforward:
import joblib
joblib.dump(df, "file.jl")
df2 = joblib.load("file.jl")

Trouble with plotly charts

I am giving myself an intro to plotting data and have come across some trouble. I am working on a line chart that I plan on making animated as soon as I figure out this problem.
I want a graph that looks like this:
However, this code I have now:
x = df_pre_2003['year']
y = df_pre_2003['nAllNeonic']
trace = go.Scatter(
    x=x,
    y=y
)
data = [trace]
ply.plot(data, filename='test.html')
is giving me this:
So I added y=df_pre_2003['nAllNeonic'].sum(), but now it raises a ValueError:
Invalid value of type 'builtins.float' received for the 'y' property of scatter
Received value: 1133180.4000000006
The 'y' property is an array that may be specified as a tuple,
list, numpy array, or pandas Series
I tried passing one of those types and it still did not work. The data type of year is int64 and that of nAllNeonic is float64.
It looks like you have to sort the values first based on the date. Now it's connecting a value in the year 1997 with a value in 1994.
df_pre_2003 = df_pre_2003.sort_values(by=['year'])
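Note that sort_values returns a new dataframe rather than sorting in place. If the intent behind the .sum() attempt was one total per year (an assumption, since the target chart is not shown), a sketch with made-up numbers might group by year before building the trace:

```python
import pandas as pd

# Hypothetical rows: several measurements per year, in arbitrary order.
df_pre_2003 = pd.DataFrame(
    {
        "year": [1997, 1994, 1997, 1995, 1994],
        "nAllNeonic": [3.0, 1.0, 4.0, 2.0, 0.5],
    }
)

# One summed value per year, ordered by year so the line runs left to right.
yearly = (
    df_pre_2003.groupby("year", as_index=False)["nAllNeonic"]
    .sum()
    .sort_values("year")
)

# Array-like values, which is what the 'x' and 'y' properties expect.
x = yearly["year"].tolist()
y = yearly["nAllNeonic"].tolist()
print(x, y)
```

These x and y lists can then be passed to go.Scatter in place of the raw columns.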
This does not answer the question, but I'm sharing my similar case for anyone researching this in the future:
In my case the error appeared when I tried to pass Django model objects to a plotly scatter chart, and the message was as follows:
The 'x' property is an array that may be specified as a tuple, list, numpy array, or pandas Series
The solution in my case was to export the Django model data into a pandas data frame, then use the data frame columns instead of the model field names.