NaN column is plotted as an all-zero column in pandas

I have run into a problem plotting a sliced DataFrame whose columns are entirely filled with NaNs.
How come:
import numpy as np
import pandas as pd

pd.DataFrame(
    dict(
        A=pd.Series([np.nan] * 32),
        B=pd.Series(range(-1, 32))
    )
).plot()
differs from:
# Ugly fix
pd.DataFrame(
    dict(
        A=pd.Series([0] + [np.nan] * 32),
        B=pd.Series(range(-1, 32))
    )
).plot()
by plotting a 0-line, as if the column were filled with zeros.
Shouldn't the first snippet work just like:
import pylab

pylab.plot(
    range(0, 33), range(-1, 32),
    range(0, 32), [np.nan] * 32
)
Plotting just a Series filled with NaNs also works fine:
pd.Series([np.nan] * 32).plot()
What am I missing? Is there a right way to plot a column of all NaNs, or is this a bug?

This looks like a bug in pandas. Looking at the source code, in pandas.tools.plotting, lines 554–556:
empty = df[col].count() == 0
# is this right?
values = df[col].values if not empty else np.zeros(len(df))
If the column contains only NaNs, then empty is True and values is set to np.zeros(len(df)), which produces the zero line.
Note: I did not add the "is this right?" comment; it's in the source code! (pandas v0.8.1)
I've raised a bug about it: https://github.com/pydata/pandas/issues/1696
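Until that is fixed, a possible workaround (a sketch of mine, not from the bug report) is to plot the columns with matplotlib directly: matplotlib skips NaN points, so the all-NaN column simply draws no line instead of a zero line.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=pd.Series([np.nan] * 32), B=pd.Series(range(-1, 32))))

for col in df.columns:
    # NaN points are skipped, so column A produces no visible line
    plt.plot(df.index, df[col], label=col)
plt.legend()
plt.show()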

Related

How to transform a numerical variable into a categorical (binary) one, and why do NaN values reappear after filling in with zeros?

I need to create a categorical (binary) variable from an existing numerical variable that has missing values. After filling in with zeros, the NaN values reappear. This is causing issues, as later I would like to remove missing values for my other variables (which I did not show in the dataframe); I do not want to remove any observations because of the Write_Off variable.
import pandas as pd
# create the rows
df = [['F', 267], ['M', 230], ['F', ], ['M', ]]
# create the pandas DataFrame
df = pd.DataFrame(df, columns=['Gender', 'Write_Off'])
# print the dataframe
print(df)
# fill in missing values
df['Write_Off'].fillna(0)
print(df)
# check for missing values. The NaN values are back for the Write_Off column!
df.isnull().sum()
# Create a dummy (integer) variable called y from the Write_Off column.
# Any value greater than 0 takes the value 1, meaning there is a
# write-off amount; zero means there is no write-off amount.
df['y'] = (df.Write_Off > 0.).astype('int')
# print the dataframe
print(df)
# Transform the y variable to categorical (binary) data. Is this the
# correct way to do it? y will be the dependent variable in a logistic
# regression.
df['y'] = pd.Categorical(df.y)
# check the data types
df.dtypes
# print the dataframe. The Write_Off column still shows NaN values.
print(df)
Please help me correct the code. Thanks.
Don't worry, you almost had it.
fillna returns a new Series by default, and you never assigned the result back to the column, so the original NaNs were left in place.
Try this way:
df['Write_Off'] = df['Write_Off'].fillna(0)
And everything will work.
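For completeness, a minimal end-to-end version of the corrected script (same data as the question):

import pandas as pd

data = [['F', 267], ['M', 230], ['F'], ['M']]
df = pd.DataFrame(data, columns=['Gender', 'Write_Off'])
# assign the filled column back; fillna returns a new Series by default
df['Write_Off'] = df['Write_Off'].fillna(0)
print(df.isnull().sum())  # Write_Off now reports 0 missing values
# 1 if there is a write-off amount, 0 otherwise
df['y'] = (df['Write_Off'] > 0).astype(int)
df['y'] = pd.Categorical(df['y'])
print(df.dtypes)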

Pandas: splitting a column with a newline separator

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this:
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True), and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several documents which may have different column names. I can determine such a column name in my dataframe using:
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew to the split function, it throws an AttributeError:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: 'DataFrame' object has no attribute 'str'
You can take advantage of the pandas split function.
import pandas as pd
# recreate the column from the question
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# first: make sure the column is str
# second: split the column on the separator \n
# third: pass expand=True so the split yields two new columns
test = df['A\nB'].astype('str').str.split('\n', expand=True)
# rename the new columns
test.columns = ['A', 'B']
I hope this is helpful.
I reproduced the error on my side. The issue is that df[colNew] is still a DataFrame, because colNew is an Index of column labels rather than a single label.
But .str.split() only works on a Series. So, taking your code as an example, I would convert the DataFrame to a Series using iloc[:, 0],
then split the column headers on another line:
df2 = df[colNew].iloc[:, 0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
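Putting both answers together for an arbitrary header, a sketch (it assumes exactly one column label contains a newline):

import pandas as pd

df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4']})
# find the merged column by the newline in its label
col = df.columns[df.columns.str.contains('\n')][0]
# split the values, then reuse the pieces of the label as the new names
split_df = df[col].str.split('\n', expand=True)
split_df.columns = col.split('\n')
print(split_df)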

Pandas dataframe mixed dtypes when reading csv

I am reading in a large dataframe that is throwing a DtypeWarning (I understand this warning), but I am struggling to prevent it. (I don't want to set low_memory to False, as I would like to specify the correct dtypes.)
For every column, the majority of rows are float values and the last 3 rows are strings (metadata, basically: information about each column). I understand that I can set the dtype per column when reading in the CSV; however, I do not know how to make rows 1:n float32, for example, and the last 3 rows strings. I would like to avoid reading in two separate CSVs. The resulting dtype of all columns after reading in the dataframe is 'object'. Below is a reproducible example. The dtype warning is not thrown when reading it in, I am guessing because of the size of the dataframe; however, the result is exactly the same as the problem I am facing. I would like to make the first 3 rows float32 and the last 3 rows strings, so that they are the correct dtype. Thank you!
reproducible example:
import pandas as pd

df = pd.DataFrame(
    [[0.1, 0.2, 0.3], [0.1, 0.2, 0.3], [0.1, 0.2, 0.3],
     ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3'], ['info1', 'info2', 'info3']],
    index=['index1', 'index2', 'index3', 'info1', 'info2', 'info3'],
    columns=['column1', 'column2', 'column3'])
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv', index_col=0)
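One way to get the intended dtypes after the read (a sketch; data and meta are hypothetical names, and it assumes the metadata is always the last 3 rows): slice the frame into its two blocks and convert each separately.

# split the numeric block from the trailing metadata rows,
# then give each block the dtype it should have
data = df1.iloc[:-3].astype('float32')
meta = df1.iloc[-3:].astype(str)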

Using .loc to populate an empty dataframe: 'Passing list-likes to .loc or [] with any missing labels is no longer supported'

[Screenshots omitted: raw_count_df is the empty DataFrame; htp_raw holds the values I want to enter into the corresponding columns of raw_count_df.]
How could I rewrite this code?
raw_count_df is the empty DataFrame, with the column headers htf_one, htf_two, htf_three and htf_average (the columns I am populating).
htp_raw is a dataframe containing the values I want to enter into the empty dataframe.
Using .loc, this code would identify the column htf_one and then use the index of the empty dataframe to place each value in the correct spot. I only want values from htp_raw which match the index of the empty dataframe.
This code worked recently...
raw_count_df['htp_one'] = htp_raw.loc[raw_count_df.index, 'htf_one']
raw_count_df['htp_two'] = htp_raw.loc[raw_count_df.index, 'htf_two']
raw_count_df['htp_three'] = htp_raw.loc[raw_count_df.index, 'htf_three']
raw_count_df['htp_average'] = htp_raw.loc[raw_count_df.index, 'average']
Now I am getting this error:
Passing list-likes to .loc or [] with any missing labels is no longer supported
I am not sure how to rewrite this code using .reindex etc. to populate the dataframe in the same way.
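One way to rewrite it (a sketch using the question's variable names): reindex htp_raw to the empty dataframe's index first, which keeps unmatched labels as NaN instead of raising.

# reindex aligns htp_raw with raw_count_df's index; labels missing
# from htp_raw become NaN rows rather than raising a KeyError
aligned = htp_raw.reindex(raw_count_df.index)
raw_count_df['htp_one'] = aligned['htf_one']
raw_count_df['htp_two'] = aligned['htf_two']
raw_count_df['htp_three'] = aligned['htf_three']
raw_count_df['htp_average'] = aligned['average']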

Infer Series Labels and Data from pandas dataframe column for plotting

Consider a simple 2x2 dataset with Series labels prepended as the first column ("Repo"):
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
Here are the DataFrame columns:
p(df.columns)
[u'Repo', u'AllTests', u'Restricted']
So the first column holds the string labels, and the second and third columns are data values. We want one series per row, corresponding to the Galactian and Forecast-MLlib repos.
This would seem to be a common task with a straightforward way to simply plot the DataFrame. However, the following related question does not provide any simple way: it essentially throws away the DataFrame's structural knowledge and plots manually:
Set matplotlib plot axis to be the dataframe column name
So is there a more natural way to plot these Series, one that does not involve deconstructing the already-useful DataFrame but instead infers the first column as labels and the remaining ones as series data points?
Update: here is a self-contained snippet (npa and ps are aliases for np.array and a print helper):
runtimes = npa([1860., 410., 140., 47.])
runtimes.shape = (2, 2)
labels = npa(['Galactian', 'Forecast-MLlib'])
labels.shape = (2, 1)
rtlabels = np.concatenate((labels, runtimes), axis=1)
rtlabels.shape = (2, 3)
colnames = ['Repo', 'AllTests', 'Restricted']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.show()
And here is the output:
Repo AllTests Restricted
0 Galactian 1860.0 410.0
1 Forecast-MLlib 140.0 47.0
With piRSquared's help it now looks like this (plot omitted): the data is showing, but the series and labels are swapped. I will look further to try to line them up properly.
Another update
Flipping the columns/labels makes the series come out as desired. The change was to:
labels = npa(['AllTests', 'Restricted'])
..
colnames = ['Repo', 'Galactian', 'Forecast-MLlib']
So the updated code is:
runtimes = npa([1860., 410., 140., 47.])
runtimes.shape = (2, 2)
labels = npa(['AllTests', 'Restricted'])
labels.shape = (2, 1)
rtlabels = np.concatenate((labels, runtimes), axis=1)
rtlabels.shape = (2, 3)
colnames = ['Repo', 'Galactian', 'Forecast-MLlib']
df = pd.DataFrame(rtlabels, columns=colnames)
ps(df)
df.set_index('Repo').astype(float).plot()
plt.title("Restricting Long-Running Tests\nin Galactus and Forecast-ML")
plt.show()
p('df columns', df.columns)
ps(df)
Pandas assumes your label information is in the index and columns. Set the index first:
df.set_index('Repo').astype(float).plot()
Or, to get one series per row, transpose after setting the index:
df.set_index('Repo').T.astype(float).plot()
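For comparison, a more direct construction of the same table in plain pandas (a sketch; it avoids the numpy reshaping and the object-dtype detour entirely):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Repo': ['Galactian', 'Forecast-MLlib'],
    'AllTests': [1860.0, 140.0],
    'Restricted': [410.0, 47.0],
})
# set the labels as the index, then transpose so each repo
# becomes its own series, one line per row of the original table
df.set_index('Repo').T.plot()
plt.show()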