Resampling/interpolating/extrapolating columns of a pandas dataframe - pandas

I am interested in knowing how to interpolate/resample/extrapolate columns of a pandas dataframe for both purely numerical and datetime indices. I'd like to do this with either straightforward linear interpolation or spline interpolation.
Consider first a simple pandas data frame that has a numerical index (signifying time) and a couple of columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(0,20,2))
print(df)
0 1
0 0.937961 0.943746
2 1.687854 0.866076
4 0.410656 -0.025926
6 -2.042386 0.956386
8 1.153727 -0.505902
10 -1.546215 0.081702
12 0.922419 0.614947
14 0.865873 -0.014047
16 0.225841 -0.831088
18 -0.048279 0.314828
I would like to resample the columns of this dataframe over some denser grid of time indices which possibly extend beyond the last time index (thus requiring extrapolation).
Denote the denser grid of indices as, for example:
t = np.arange(0,40,.6)
The interpolate method of a pandas dataframe seems to fill in only NaNs, and thus requires the new indices (which may or may not coincide with the original indices) to already be part of the dataframe. I guess I could append a dataframe of NaNs at the new indices to the original dataframe (excluding any indices appearing in both the old and new dataframes), call interpolate, and then remove the original time indices. Or I could do everything in scipy and create a new dataframe at the desired time indices.
Is there a more direct way to do this?
In addition, I'd like to know how to do this same thing when the indices are, in fact, datetimes. That is, when, for example:
df.index = np.array('2015-07-04 02:12:40', dtype=np.datetime64) + np.arange(0,20,2)
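One possible sketch covering both the numeric and the datetime case, assuming linear interpolation through `np.interp` is acceptable. The helper names `to_numeric_axis` and `resample_linear` are hypothetical and not part of pandas. Two caveats: `np.interp` assumes the original index is sorted in increasing order, and outside the original range it holds the first and last values constant instead of truly extrapolating.

```python
import numpy as np
import pandas as pd


def to_numeric_axis(index):
    """Return index values as floats; datetimes become seconds since the epoch."""
    index = pd.Index(index)
    if isinstance(index, pd.DatetimeIndex):
        return ((index - pd.Timestamp(0)) / pd.Timedelta(seconds=1)).to_numpy(dtype=float)
    return index.to_numpy(dtype=float)


def resample_linear(df, new_index):
    """Linearly interpolate every column of df onto new_index (hypothetical helper)."""
    x_old = to_numeric_axis(df.index)
    x_new = to_numeric_axis(new_index)
    # np.interp clamps to the end values beyond the original range.
    data = {col: np.interp(x_new, x_old, df[col].to_numpy()) for col in df.columns}
    return pd.DataFrame(data, index=new_index)


# Numeric index, as in the question.
df = pd.DataFrame(np.random.randn(10, 2), index=np.arange(0, 20, 2))
t = np.arange(0, 40, 0.6)
dense = resample_linear(df, t)

# Datetime index: the same grid expressed as seconds after a start time.
start = pd.Timestamp("2015-07-04 02:12:40")
df_dt = df.set_axis(start + pd.to_timedelta(np.arange(0, 20, 2), unit="s"), axis=0)
t_dt = start + pd.to_timedelta(t, unit="s")
dense_dt = resample_linear(df_dt, t_dt)
```

For spline interpolation or real extrapolation, the same structure should work with a scipy interpolator swapped in for `np.interp`, though that variant is not shown here.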

Related

How to select a value in a dataframe with MultiIndex?

I use the pandas library to analyze data coming from an Excel file.
I used pivot_table to get a pivot table with the information I'm interested in. I end up with a MultiIndex table.
For "OPE-2016-0001", I would like to obtain the figures for 2017, for example. I've tried lots of things and nothing works. What is the correct method to use? Thank you.
import pandas as pd
import numpy as np
from math import *
import tkinter as tk
pd.set_option('display.expand_frame_repr', False)
df = pd.read_csv('datas.csv')
def tcd_op_dataExcercice():
    global df
    new_df = df.assign(Occurence=1)
    tcd = new_df.pivot_table(index=['Numéro opération',
                                    'Libellé opération'],
                             columns=['Exercice'],
                             values=['Occurence'],
                             aggfunc=[np.sum],
                             margins=True,
                             fill_value=0,
                             margins_name='Total')
    print(tcd)
    print(tcd.xs('ALSTOM 8', level='Libellé opération', drop_level=False))

tcd_op_dataExcercice()
I get the following table (image).
How do I get the value framed in red?
You can use .loc to select rows by the labels of a DataFrame's Index. If the Index is a MultiIndex, a single label indexes into its first level (Numéro opération in your case). You can also pass a tuple to index into both levels, e.g. if you specifically wanted ("OPE-2016-0001", "ALSTOM 8").
It's worth noting that the columns of your pivoted data are also a MultiIndex, because you specified aggfunc, values and columns as lists rather than as individual values (i.e. without the []). Pandas creates a MultiIndex because of these lists, even though each list has only one element.
So you'll also need to pass a tuple to index into the columns to get the value for 2017:
tcd.loc["OPE-2016-0001", ('sum', 'Occurence', 2017)]
If you had instead just specified the aggfunc etc as individual strings, the columns would just be the years and you could select the values by:
tcd.loc["OPE-2016-0001", 2017]
Or if you specifically wanted the value for ALSTOM 8:
tcd.loc[("OPE-2016-0001", "ALSTOM 8"), 2017]
An alternative to indexing into a MultiIndex would also be to just .reset_index() after pivoting -- in which case the levels of the MultiIndex will just become columns in the data. And you can then select rows based on the values of those columns. E.g (assuming you specified aggfunc etc as strings):
tcd = tcd.reset_index()
tcd.query("`Numéro opération` == 'OPE-2016-0001'")[2017]
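To make the difference between list and scalar arguments concrete, here is a small sketch. The three-row dataset is made up for illustration (the real data in `datas.csv` is not shown in the question), and margins are left out to keep it short.

```python
import pandas as pd

# Hypothetical stand-in for the question's data.
df = pd.DataFrame({
    "Numéro opération": ["OPE-2016-0001", "OPE-2016-0001", "OPE-2016-0002"],
    "Libellé opération": ["ALSTOM 8", "ALSTOM 8", "OTHER"],
    "Exercice": [2017, 2017, 2018],
    "Occurence": [1, 1, 1],
})
rows = ["Numéro opération", "Libellé opération"]

# List arguments: the columns become a three-level MultiIndex.
tcd_lists = df.pivot_table(index=rows, columns=["Exercice"],
                           values=["Occurence"], aggfunc=["sum"], fill_value=0)
value_lists = tcd_lists.loc[("OPE-2016-0001", "ALSTOM 8"),
                            ("sum", "Occurence", 2017)]

# Scalar arguments: the columns are just the years.
tcd_flat = df.pivot_table(index=rows, columns="Exercice",
                          values="Occurence", aggfunc="sum", fill_value=0)
value_flat = tcd_flat.loc[("OPE-2016-0001", "ALSTOM 8"), 2017]

# After reset_index the former index levels are ordinary columns,
# so a boolean mask can be used instead of .loc on a MultiIndex.
flat = tcd_flat.reset_index()
selected = flat.loc[flat["Numéro opération"] == "OPE-2016-0001", 2017]
```

Both `value_lists` and `value_flat` refer to the same cell; only the shape of the column index differs.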

Python CountVectorizer for Pandas DataFrame

I have got a pandas dataframe which looks like the following:
df.head()
categorized.Hashtags
0 icietmaintenant supyoga standuppaddleportugal ...
1 instapaysage bretagne labellebretagne bretagne...
2 bretagne lescrepescestlavie quimper bzh labret...
3 bretagne mer paysdiroise magnifique phare plou...
4 bateaux baiededouarnenez voiliers vieuxgreemen..
Now, instead of using the pandas get_dummies() function, I would like to use CountVectorizer to create the same output, because get_dummies takes too much time.
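from sklearn.feature_extraction.text import CountVectorizer
df_x = df["categorized.Hashtags"]
vect = CountVectorizer(min_df=0., max_df=1.0)
X = vect.fit_transform(df_x)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())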
When I now output the resulting data frame "count_vect_df", it contains a lot of columns that are empty or contain only zero values. How can I avoid this?
Cheers,
Andi
From scikit-learn CountVectorizer docs:
Convert a collection of text documents to a matrix of token counts
This implementation produces a sparse representation of the counts
using scipy.sparse.csr_matrix.
CountVectorizer returns a sparse matrix that consists mostly of zeros; each non-zero value is the number of times that specific term appears in the particular document.

How do I append a column from a numpy array to a pd dataframe?

I have a numpy array of 100 predicted values called first_100. If I convert these to a dataframe they are indexed as 0,1,2 etc. However, the predictions are for values that are in random indexed order, 66,201,32 etc. I want to be able to put the actual values and the predictions in the same dataframe, but I'm really struggling.
The real values are in a dataframe called first_100_train.
I've tried the following:
pd.concat([first_100, first_100_train], axis=1)
This doesn't work and for some reason returns the entire dataframe and indexed from 0 so there are lots of NaNs...
first_100_train['Prediction'] = first_100[0]
This is almost what I want, but again because the indexes are different the data doesn't match up. I'd really appreciate any suggestions.
EDIT: After managing to join the dataframes I now have this:
I'd like to be able to drop the final column...
Here is first_100.head()
and first_100_train.head()
The problem is that index 2 from first_100 actually corresponds to index 480 of first_100_train
Set a default index on both objects with DataFrame.reset_index and drop=True so that the rows align by position:
pd.concat([first_100.reset_index(drop=True),
           first_100_train.reset_index(drop=True)], axis=1)
Or, if the first DataFrame already has a default RangeIndex, the solution simplifies to:
pd.concat([first_100,
           first_100_train.reset_index(drop=True)], axis=1)
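If the original index labels (66, 201, 32, ...) should be kept, an alternative sketch is to assign the raw NumPy array as a new column. An array has no index, so pandas assigns it by position. The data and the column name `actual` below are made up, and the sketch assumes the predictions are in the same row order as `first_100_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three actual values with a non-default index.
first_100_train = pd.DataFrame({"actual": [1.0, 2.0, 3.0]}, index=[66, 201, 32])
first_100 = np.array([1.1, 1.9, 3.2])  # predictions, in the same row order

# Assigning a NumPy array is positional, so no index alignment takes place.
result = first_100_train.copy()
result["Prediction"] = first_100

# The reset_index route from the answer, with the predictions wrapped in a DataFrame.
preds_df = pd.DataFrame(first_100, columns=["Prediction"])
side_by_side = pd.concat([preds_df, first_100_train.reset_index(drop=True)], axis=1)
```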

What is non-concatenation axis in Pandas?

I am appending a dataframe df1 to a dataframe df2 with both having the same columns but not necessarily in the same order.
df = df1.append(df2)
I am seeing this warning as a result of the above operation.
"FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default."
I understand that the resulting dataframe has its columns in alphabetical order after this operation, but I am trying to understand the definition of the non-concatenation axis mentioned in the warning. What is it, and where is it significant in pandas?
It is the other axis—the axis along which you do not concatenate. If you're concatenating along the index (axis=0), then the non-concatenation axis would be 1 (i.e., the columns), and vice versa.
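A minimal illustration with two made-up one-row frames whose columns are in different orders. The rows are stacked along axis 0, so the columns are the non-concatenation axis, and the `sort` parameter of `pd.concat` controls whether that axis is sorted.

```python
import pandas as pd

df1 = pd.DataFrame({"b": [1], "a": [2]})
df2 = pd.DataFrame({"a": [3], "b": [4]})

# Concatenate along the index (axis=0); the columns are the other axis.
unsorted_cols = pd.concat([df1, df2], axis=0, sort=False)
sorted_cols = pd.concat([df1, df2], axis=0, sort=True)  # columns sorted alphabetically
```

Values are still matched by column label in both results; only the column order differs.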

Does a DataFrame with a single row have all the attributes of a DataFrame?

I am slicing a DataFrame out of a larger DataFrame, and the daughter df has only one row. Does a daughter df with a single row have the same attributes as the parent df?
import numpy as np
import pandas as pd
dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.random.randn(6,2),index=dates,columns=['col1','col2'])
df1=df.iloc[1]
type(df1)
>> pandas.core.series.Series
df1.columns
>>'Series' object has no attribute 'columns'
Is there a way I can use all the attributes of pd.DataFrame on a pd.Series?
Possibly what you are looking for is a dataframe with one row:
>>> pd.DataFrame(df1).T # T -> transpose
col1 col2
2013-01-02 -0.428913 1.265936
What happens when you do df.iloc[1] is that pandas converts the row to a Series, which is one-dimensional, and the columns become the index. You can still do df1['col1'], but you can't do df1.columns, because a Series is basically a single column, so the old column labels are now its index.
As a result, you can retrieve the former column names like this:
>>> df1.index.tolist()
['col1', 'col2']
This used to confuse me quite a bit. I also expected df.iloc[1] to be a dataframe with one row, but it has always been the default behavior of pandas to automatically convert any one dimensional dataframe slice (whether row or column) to a series. It's pretty natural for a row, but less so for a column (since the columns become the index), but really is not a problem once you understand what is happening.
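As a short sketch of one way around this: indexing with a list of positions keeps the result two-dimensional, so the slice stays a DataFrame. The variable names `row_series` and `row_frame` are just for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=["col1", "col2"])

row_series = df.iloc[1]    # scalar position: a Series
row_frame = df.iloc[[1]]   # list of positions: a one-row DataFrame

# Converting an existing Series back into a one-row DataFrame.
rebuilt = row_series.to_frame().T
```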