How do I append a column from a numpy array to a pandas DataFrame?

I have a numpy array of 100 predicted values called first_100. If I convert these to a dataframe they are indexed 0, 1, 2, etc. However, the predictions correspond to rows whose indices are in random order: 66, 201, 32, etc. I want to be able to put the actual values and the predictions in the same dataframe, but I'm really struggling.
The real values are in a dataframe called first_100_train.
I've tried the following:
pd.concat([first_100, first_100_train], axis=1)
This doesn't work; the result is indexed from 0 and the two sets of rows don't line up, so there are lots of NaNs...
first_100_train['Prediction'] = first_100[0]
This is almost what I want, but again because the indexes are different the data doesn't match up. I'd really appreciate any suggestions.
EDIT: After managing to join the dataframes, I'd like to be able to drop the final column of the result. Looking at first_100.head() and first_100_train.head() (screenshots not shown), the problem is that index 2 of first_100 actually corresponds to index 480 of first_100_train.

Create default index values with DataFrame.reset_index and drop=True so the rows align correctly:
pd.concat([first_100.reset_index(drop=True),
           first_100_train.reset_index(drop=True)], axis=1)
Or, if the first DataFrame already has a default RangeIndex, the solution simplifies to:
pd.concat([first_100,
           first_100_train.reset_index(drop=True)], axis=1)
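For illustration, a minimal self-contained sketch of the alignment problem and the fix (the values and the 66/201/32 index labels are made up to mimic the question):

import numpy as np
import pandas as pd

# Predictions in positional order (default 0,1,2 index), as in the question.
first_100 = pd.DataFrame({'Prediction': [0.1, 0.9, 0.4]})
# Actual values carrying a shuffled, non-default index.
first_100_train = pd.DataFrame({'Actual': [10, 20, 30]}, index=[66, 201, 32])

# Naive concat aligns on index labels (0,1,2 vs 66,201,32) -> all NaNs.
print(pd.concat([first_100, first_100_train], axis=1))

# Resetting both indexes makes concat align row by row instead.
print(pd.concat([first_100.reset_index(drop=True),
                 first_100_train.reset_index(drop=True)], axis=1))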

Related

Problems plotting price data against my datetime index

Here is my code:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
data = df.filter(['avgLowPrice'])
plt.plot(data['avgLowPrice'])
plt.show()
The resulting graph comes out as a garbled tangle of lines (screenshot not shown). I have no idea why it's doing this...
I suppose that your DataFrame is not sorted by its index, i.e.
consecutive rows have "intermixed" (instead of ordered) index values, so the plotted line keeps jumping back and forth in time.
Sort your DataFrame, possibly in place:
df.sort_index(inplace=True)
and then generate your plot.
Another (unrelated) hint to make your code more concise: to read the input file, convert the datetime column to datetime, and set it as the index, all in one go, run:
df = pd.read_csv('data.csv', parse_dates=['datetime'], index_col='datetime')
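Putting both hints together, a minimal sketch of the whole flow (file and column names taken from the question):

import pandas as pd
import matplotlib.pyplot as plt

# Parse the datetime column and set it as the index in one call.
df = pd.read_csv('data.csv', parse_dates=['datetime'], index_col='datetime')

# Sort by the datetime index so the line moves monotonically in time.
df.sort_index(inplace=True)

plt.plot(df['avgLowPrice'])
plt.show()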

How to select a value in a dataframe with MultiIndex?

I use the pandas library to analyze data coming from an Excel file.
I used pivot_table to get a pivot table with the information I'm interested in, and I end up with a MultiIndex DataFrame.
For "OPE-2016-0001", I would like to obtain the figures for 2017, for example. I've tried lots of things and nothing works. What is the correct method to use? Thank you.
import pandas as pd
import numpy as np
from math import *
import tkinter as tk

pd.set_option('display.expand_frame_repr', False)
df = pd.read_csv('datas.csv')

def tcd_op_dataExcercice():
    global df
    new_df = df.assign(Occurence=1)
    tcd = new_df.pivot_table(index=['Numéro opération',
                                    'Libellé opération'],
                             columns=['Exercice'],
                             values=['Occurence'],
                             aggfunc=[np.sum],
                             margins=True,
                             fill_value=0,
                             margins_name='Total')
    print(tcd)
    print(tcd.xs('ALSTOM 8', level='Libellé opération', drop_level=False))

tcd_op_dataExcercice()
I get the following table (image not reproduced here).
How do I get the value framed in red?
You can use .loc to select rows by a DataFrame's index labels. If the index is a MultiIndex, it will index into the first level of the MultiIndex ('Numéro opération' in your case), though you can pass a tuple to index into both levels (e.g. if you specifically wanted ("OPE-2016-0001", "ALSTOM 8")).
It's worth noting that the columns of your pivoted data are also a MultiIndex, because you specified aggfunc, values and columns as lists rather than as individual values (i.e. without the []). Pandas creates a MultiIndex because of these lists, even though each list has only one element.
So you'll also need to pass a tuple to index into the columns to get the value for 2017:
tcd.loc["OPE-2016-0001", ('sum', 'Occurence', 2017)]
If you had instead specified aggfunc, values and columns as individual values rather than lists, the columns would just be the years and you could select the value by:
tcd.loc["OPE-2016-0001", 2017]
Or if you specifically wanted the value for ALSTOM 8:
tcd.loc[("OPE-2016-0001", "ALSTOM 8"), 2017]
An alternative to indexing into a MultiIndex is to just .reset_index() after pivoting, in which case the levels of the MultiIndex become regular columns in the data, and you can then select rows based on the values of those columns. E.g. (assuming you specified aggfunc etc. as individual values; note the backticks, which query needs for column names containing spaces):
tcd = tcd.reset_index()
tcd.query("`Numéro opération` == 'OPE-2016-0001'")[2017]
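To make this concrete, here is a small self-contained sketch with invented numbers that mirrors the structure above (column and index names follow the question):

import pandas as pd

df = pd.DataFrame({
    'Numéro opération': ['OPE-2016-0001', 'OPE-2016-0001', 'OPE-2016-0002'],
    'Libellé opération': ['ALSTOM 8', 'ALSTOM 8', 'OTHER'],
    'Exercice': [2017, 2017, 2018],
    'Occurence': [1, 1, 1],
})

# aggfunc/values/columns as plain values -> the columns are simply the years.
tcd = df.pivot_table(index=['Numéro opération', 'Libellé opération'],
                     columns='Exercice',
                     values='Occurence',
                     aggfunc='sum',
                     fill_value=0)

# A tuple selects both row levels; 2017 selects the column.
print(tcd.loc[('OPE-2016-0001', 'ALSTOM 8'), 2017])  # -> 2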

Pandas dataframe mixed dtypes when reading csv

I am reading in a large dataframe that throws a DtypeWarning about mixed-type columns (I understand this warning), but I am struggling to prevent it. I don't want to set low_memory=False, since I would like to specify the correct dtypes instead.
For every column, the majority of rows are float values and the last 3 rows are strings (metadata, basically: information about each column). I understand that I can set the dtype per column when reading in the CSV, but I do not know how to make rows 1:n float32, for example, and the last 3 rows strings. I would like to avoid reading in two separate CSVs. After reading in the dataframe, the resulting dtype of all columns is object. Below is a reproducible example. The DtypeWarning is not thrown here, I'm guessing because of the small size of the dataframe, but the result is exactly the same as the problem I am facing: I would like to make the first 3 rows float32 and the last 3 rows strings so that they have the correct dtypes. Thank you!
reproducible example:
import pandas as pd

df = pd.DataFrame([[0.1, 0.2, 0.3],
                   [0.1, 0.2, 0.3],
                   [0.1, 0.2, 0.3],
                   ['info1', 'info2', 'info3'],
                   ['info1', 'info2', 'info3'],
                   ['info1', 'info2', 'info3']],
                  index=['index1', 'index2', 'index3', 'info1', 'info2', 'info3'],
                  columns=['column1', 'column2', 'column3'])
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv', index_col=0)
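One possible approach (a sketch, assuming the metadata rows are always the last three): read everything in one go, then split the frame and cast each part:

import pandas as pd

raw = pd.read_csv('test.csv', index_col=0)   # all columns come back as object

values = raw.iloc[:-3].astype('float32')     # numeric block as float32
meta = raw.iloc[-3:].astype(str)             # trailing metadata rows as strings

print(values.dtypes)   # float32 for every column
print(meta.dtypes)     # object (strings)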

Resampling/interpolating/extrapolating columns of a pandas dataframe

I am interested in knowing how to interpolate/resample/extrapolate the columns of a pandas dataframe, for both purely numerical and datetime-type indices. I'd like to do this with either straightforward linear interpolation or spline interpolation.
Consider first a simple pandas data frame that has a numerical index (signifying time) and a couple of columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(0,20,2))
print(df)
          0         1
0   0.937961  0.943746
2   1.687854  0.866076
4   0.410656 -0.025926
6  -2.042386  0.956386
8   1.153727 -0.505902
10 -1.546215  0.081702
12  0.922419  0.614947
14  0.865873 -0.014047
16  0.225841 -0.831088
18 -0.048279  0.314828
I would like to resample the columns of this dataframe over some denser grid of time indices which possibly extend beyond the last time index (thus requiring extrapolation).
Denote the denser grid of indices as, for example:
t = np.arange(0,40,.6)
The interpolate method for a pandas dataframe seems to interpolate only NaNs, and thus requires those new indices (which may or may not coincide with the original indices) to already be part of the dataframe. I guess I could append a dataframe of NaNs at the new indices to the original dataframe (excluding any indices appearing in both), call interpolate, and then remove the original time indices. Or I could do everything in scipy and create a new dataframe at the desired time indices.
Is there a more direct way to do this?
In addition, I'd like to know how to do this same thing when the indices are, in fact, datetimes. That is, when, for example:
df.index = np.array('2015-07-04 02:12:40', dtype=np.datetime64) + np.arange(0,20,2)
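For reference, both routes sketched above can be written compactly. A sketch (df and t as defined in the question; note that pandas' own interpolate only holds the last value constant beyond the data, so genuine linear extrapolation needs scipy):

import numpy as np
import pandas as pd
from scipy import interpolate

df = pd.DataFrame(np.random.randn(10, 2), index=np.arange(0, 20, 2))
t = np.arange(0, 40, .6)

# Route 1: union the new grid into the index, interpolate on index values,
# keep only the new grid. Points past index 18 are forward-filled, not
# linearly extrapolated.
dense = df.reindex(df.index.union(t)).interpolate(method='index').loc[t]

# Route 2: scipy's interp1d supports true linear extrapolation.
f = interpolate.interp1d(df.index, df.values, axis=0, fill_value='extrapolate')
dense2 = pd.DataFrame(f(t), index=t, columns=df.columns)

For a datetime index, route 1 works unchanged with method='index'; for route 2, convert the index to integers first (e.g. df.index.astype('int64')).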

Does a DataFrame with a single row have all the attributes of a DataFrame?

I am slicing a DataFrame out of a large DataFrame, and the daughter df has only one row. Does a daughter df with a single row have the same attributes as the parent df?
import numpy as np
import pandas as pd
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['col1', 'col2'])
df1 = df.iloc[1]
>>> type(df1)
pandas.core.series.Series
>>> df1.columns
AttributeError: 'Series' object has no attribute 'columns'
Is there a way I can use all the attributes of a pd.DataFrame on a pd.Series?
Possibly what you are looking for is a dataframe with one row:
>>> pd.DataFrame(df1).T # T -> transpose
col1 col2
2013-01-02 -0.428913 1.265936
What happens when you do df.iloc[1] is that pandas converts the row to a Series, which is one-dimensional, and the columns become its index. You can still do df1['col1'], but you can't do df1.columns, because a Series is basically a single column and the old columns are now the new index.
As a result, you can retrieve the former columns like this:
>>> df1.index.tolist()
['col1', 'col2']
This used to confuse me quite a bit. I also expected df.iloc[1] to be a dataframe with one row, but it has always been the default behavior of pandas to automatically convert any one-dimensional dataframe slice (whether row or column) to a Series. That's pretty natural for a row, less so for a column (since the columns become the index), but it really isn't a problem once you understand what is happening.
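A related note (standard pandas behavior, not part of the answer above): selecting with a list of positions keeps the result two-dimensional, so the Series conversion never happens:
>>> df.iloc[[1]]          # list selector -> one-row DataFrame
                col1      col2
2013-01-02 -0.428913  1.265936
>>> df.iloc[[1]].columns
Index(['col1', 'col2'], dtype='object')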