I have a dataframe with missing rows that I resample and interpolate. Is there a way to grab the index of the rows that are added to the dataframe when I resample it?
This is how I create/resample/interpolate the dataframe:
import numpy as np
import pandas as pd
from datetime import datetime
# Create df and drop a few rows
rng = pd.date_range('2000-01-01', periods=365, freq='D')
df = pd.DataFrame({'Val': np.random.randn(len(rng))}, index=rng)
df = df.drop([datetime(2000, 1, 5), datetime(2000, 1, 24)])
df = df.resample('D').interpolate(method='linear')
You can get the additional index elements by taking the difference between the new index and the old one:
In [16]: df_new = df.resample('D').interpolate(method='linear')
In [17]: df_new.index.difference(df.index)
Out[17]: DatetimeIndex(['2000-01-05', '2000-01-24'], dtype='datetime64[ns]', freq=None)
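Note that the snippet in the question overwrites df with the resampled frame, so keep a reference to the original index first. A minimal end-to-end sketch:
import numpy as np
import pandas as pd
from datetime import datetime

rng = pd.date_range('2000-01-01', periods=365, freq='D')
df = pd.DataFrame({'Val': np.random.randn(len(rng))}, index=rng)
df = df.drop([datetime(2000, 1, 5), datetime(2000, 1, 24)])

orig_index = df.index                    # remember the index before resampling
df = df.resample('D').interpolate(method='linear')
added = df.index.difference(orig_index)  # rows created by the resample
print(added)  # DatetimeIndex(['2000-01-05', '2000-01-24'], ...)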
I would like to perform the following operation using a list comprehension:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('tips')
df.head()
for i in df.columns:
    print(df.loc[:, i].is_unique)
Using [x.is_unique for x in df.loc[:, i] for i in df.columns] does not work.
Use Series.is_unique with a single for clause:
out = [df[i].is_unique for i in df.columns]
Alternative solution (I prefer the first, because iterating over df.columns makes the intent clearer):
out = [df[i].is_unique for i in df]
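If you also need to know which column each flag belongs to, a dict comprehension is a natural variant (same idea, keyed by column name):
out = {i: df[i].is_unique for i in df.columns}
# maps each column name to whether its values are unique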
I have a simple Pandas data frame with two columns, 'Angle' and 'rff'. I want to get an interpolated 'rff' value based on entering an Angle that falls between two Angle values (i.e. between two index values) in the data frame. For example, I'd like to enter 3.4 for the Angle and then get an interpolated 'rff'. What would be the best way to accomplish that?
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
s = s.set_index('Angle') #Set 'Angle' as index
print(s)
result = s.at[3.0, "rff"]
print(result)
You may use numpy:
import numpy as np
np.interp(3.4, s.index, s.rff)
# 59.6
You could use numpy for this:
import numpy as np
import pandas as pd
data = [[1.0, 45.0], [2, 56], [3, 58], [4, 62], [5, 70]]  # Sample data
s = pd.DataFrame(data, columns=['Angle', 'rff'])
print(s)
print(np.interp(3.4, s.Angle, s.rff))
# 59.6
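One caveat with np.interp, whichever variant you use: it assumes the x-coordinates are monotonically increasing and silently returns wrong values otherwise, so sort by Angle first if the data is not already ordered:
s_sorted = s.sort_values('Angle')  # np.interp requires increasing x values
print(np.interp(3.4, s_sorted.Angle, s_sorted.rff))
# 59.6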
How could I convert a dataframe like this:
import pandas as pd
A = [0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
dfA = pd.DataFrame(A)
to a new dataframe like this:
# Expected output:
B = [[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4],[0,1,2,3,4]]
dfB = pd.DataFrame(B)
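One possible approach (a sketch, assuming the values always appear in equally sized consecutive runs, as in the example) is to reshape the underlying numpy array and transpose it:
import numpy as np
import pandas as pd

A = [0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
dfA = pd.DataFrame(A)

n_values = dfA[0].nunique()                    # 5 distinct values
arr = dfA[0].to_numpy().reshape(n_values, -1)  # one row per distinct value
dfB = pd.DataFrame(arr.T)                      # each row becomes [0, 1, 2, 3, 4]
print(dfB)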
I have multiple categorical columns with millions of distinct values in them. So, I am using dask and pd.get_dummies to convert these categorical columns into bit vectors, like this:
import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing
train_set = pd.read_csv('train_set.csv')
def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)
ddata = (
    dd.from_pandas(train_set, npartitions=2 * multiprocessing.cpu_count())
    .map_partitions(lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1))
    .compute(scheduler='processes')
)
But, I get this error:
ValueError: Metadata inference failed in `lambda`.
You have supplied a custom function and Dask is unable to determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")
What am I doing wrong here? Thanks.
EDIT:
A small example to reproduce the error; hopefully it helps to understand the problem.
def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
(
    dd.from_pandas(df, npartitions=2 * multiprocessing.cpu_count())
    .map_partitions(lambda df: df.apply(lambda row: convert_into_one_hot(row.col1, row.col2), axis=1))
    .compute(scheduler='processes')
)
I think you could run into problems if you try to use get_dummies within partitions. There is a dask version of get_dummies, and it should work as follows.
import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
Pandas
pd.get_dummies(df, columns=["col1", "col2"], sparse=True)
Dask
ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())
# you need to convert the column dtypes to category first
dummies_cols = ["col1", "col2"]
ddf[dummies_cols] = ddf[dummies_cols].categorize()
dd.get_dummies(ddf, columns=["col1", "col2"], sparse=True)
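The result is still a lazy dask DataFrame; call .compute() when you want to materialize it, e.g.:
dummies = dd.get_dummies(ddf, columns=dummies_cols, sparse=True)
print(dummies.compute())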
I am trying to use dask dataframe map_partitions to apply a function that accesses the value in the dataframe index, row-wise, and creates a new column.
Below is the code I tried.
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
df
ddf = dd.from_pandas(df, npartitions=2)
res = ddf.map_partitions(lambda df: df.assign(index_copy=str(df.index)), meta={'index_copy': 'U'})
res.compute()
I am expecting df.index to be the value of each row's index, not the entire partition index, which it seems to refer to. From the docs, this works well for columns but not for the index.
What you want to do is this:
df.index = ['row'+str(x) for x in df.index]
To do that, first create your pandas dataframe and then run this code; after that you will have the expected result. Let me know if this works for you.
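For completeness, if the goal is a per-row copy of the index created inside map_partitions, a sketch that should work is to replace str(df.index) (which stringifies the whole partition index at once) with df.index.astype(str) (which converts each value individually):
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(index=["row0", "row1", "row2", "row3", "row4"])
ddf = dd.from_pandas(df, npartitions=2)

res = ddf.map_partitions(
    lambda part: part.assign(index_copy=part.index.astype(str)),  # per-row conversion
    meta={'index_copy': 'U'},
)
print(res.compute())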