Create new pandas dataframe with fixed distance using interpolate - pandas

I have a dataframe of the following form.
df = {'X': [0, 3, 6, 7, 8, 11],
      'Y1': [8, 5, 4, 3, 2, 1.5],
      'Y2': [1, 2, 4, 5, 5, 5]}
I would like to create a new dataframe by interpolation, where 'X' advances in fixed steps [0, 2, 4, 6, 8, 10].
To find the new 'Y' values I need to find the function f(X) = Y1 and then evaluate it at each step in X. But since I have many Y columns, I think there must be a more clever way to do this.

The solution I found was the following:
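import numpy as np
import pandas as pd

# b is the existing dataframe, assumed to have the columns 'X', 'StepNo' and 'PointNo'
step_size = 0.25
no_steps = int(np.floor(b['X'].max() / step_size))
# Build one row per grid point (DataFrame.append was removed in pandas 2.0)
steps = pd.DataFrame({'X': step_size * np.arange(no_steps + 1),
                      'StepNo': 10,
                      'PointNo': 23 + np.arange(no_steps + 1)})
b = pd.concat([b, steps], ignore_index=True)
b = b.sort_values(['X'])
b = b.set_index(['X'])
# method='index' interpolates linearly, using the index (X) values as positions
c = b.interpolate('index')
c = c.reset_index()
c = c.sort_values(['PointNo'])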
So first I define the step size, then I calculate the number of steps. Next I append the steps to the dataframe, sort it, and set 'X' as the index so that interpolate can use the index values.
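For the small example from the question, a shorter route might be to interpolate each Y column directly onto the new grid. This is only a sketch, assuming plain linear interpolation is what is wanted; `new_x` and `out` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [0, 3, 6, 7, 8, 11],
                   'Y1': [8, 5, 4, 3, 2, 1.5],
                   'Y2': [1, 2, 4, 5, 5, 5]})

# Fixed grid for X: 0, 2, 4, 6, 8, 10
new_x = np.arange(0, 12, 2)

# np.interp does piecewise-linear interpolation for one column at a time;
# it requires the known X values to be increasing, which holds here
out = pd.DataFrame({'X': new_x})
for col in df.columns.drop('X'):
    out[col] = np.interp(new_x, df['X'], df[col])

print(out)
```

The loop covers every column except 'X', so it should scale to many Y columns without writing out one interpolation per column.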

Related

Given two arrays, `a` and `b`, how to find efficiently all combinations of elements in `b` that have equal value in `a`?

Given two arrays, a and b, how to find efficiently all combinations of elements in b that have equal value in a?
here is an example:
Given
a = [0, 0, 0, 1, 1, 2, 2, 2, 2]
b = [1, 2, 4, 5, 9, 3, 7, 22, 10]
how would you calculate
c = [[1, 2],
[1, 4],
[2, 4],
[5, 9],
[3, 7],
[3, 22],
[3, 10],
[7, 22],
[7, 10],
[22, 10]]
?
a can be assumed to be sorted.
I can do this with loops, a la:
import torch
a = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 2])
b = torch.tensor([1, 2, 4, 5, 9, 3, 7, 22, 10])
jumps = torch.cat((torch.tensor([0]),
                   torch.where(a.diff() > 0)[0] + 1,
                   torch.tensor([len(a)])))
cs = []
for i in range(len(jumps) - 1):
    cs.append(torch.combinations(b[jumps[i]:jumps[i + 1]]))
c = torch.cat(cs)
Is there an efficient way to avoid the loop? The solution should work on both CPU and CUDA.
Also, the solution should have runtime O(m * m), where m is the largest number of equal elements in a, and not O(n * n), where n is the length of a.
I prefer solutions for PyTorch, but I am curious about a solution for NumPy as well.
I think the overhead of using torch is only justified for bigger datasets, as there is basically no computational difficulty in the function. In my opinion you can achieve the same results with:
from collections import Counter
def find_combinations1(a, b):
    count_a = Counter(a)
    combinations = []
    for x in set(b):
        if count_a[x] == b.count(x):
            combinations.append(x)
    return combinations
or even a simpler:
def find_combinations2(a, b):
    return list(set(a) & set(b))
With PyTorch I assume the simplest approach is:
import torch
def find_combinations3(a, b):
    a = torch.tensor(a)
    b = torch.tensor(b)
    eq = torch.eq(a, b.view(-1, 1))
    indices = torch.nonzero(eq)
    return indices[:, 1]
This option has a time complexity of O(n*m), where n is the size of a and m is the size of b. The input tensors take O(n+m) memory, but the broadcast comparison also materializes an m-by-n boolean tensor, so peak memory is O(n*m) as well.
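The functions above return values or indices shared between a and b. For the within-group pairs c asked about in the question, a loop-free NumPy version might look like the sketch below. It assumes a is sorted and non-empty; `group_pairs` is a name made up here. It builds the index pairs once for the largest group and masks them per group, so time and memory are roughly O(g * m * m) for g groups, rather than O(n * n).

```python
import numpy as np

def group_pairs(a, b):
    """All pairs of b-values whose a-values are equal (a sorted, non-empty)."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Start offset and length of each run of equal values in a
    _, starts, counts = np.unique(a, return_index=True, return_counts=True)
    # Offset pairs (i < j) for the largest group, in row-major order
    i, j = np.triu_indices(counts.max(), k=1)
    # A pair belongs to a group only if its larger offset lies inside the group
    valid = j[None, :] < counts[:, None]            # shape (groups, pairs)
    left = (starts[:, None] + i[None, :])[valid]
    right = (starts[:, None] + j[None, :])[valid]
    return np.stack([b[left], b[right]], axis=1)

a = [0, 0, 0, 1, 1, 2, 2, 2, 2]
b = [1, 2, 4, 5, 9, 3, 7, 22, 10]
c = group_pairs(a, b)
print(c)
```

Boolean masking flattens in row-major order, so the pairs come out group by group in the same order as `torch.combinations` produces in the loop version.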

Creating multiple columns in pandas with lambda function

I'm trying to create a set of new columns with growth rates within my df in a more efficient way than computing them manually one by one.
My df has +100 variables, but for simplicity, assume the following:
consumption = [5, 10, 15, 20, 25, 30, 35, 40]
wage = [10, 20, 30, 40, 50, 60, 70, 80]
period = [1, 2, 3, 4, 5, 6, 7, 8]
id = [1, 1, 1, 1, 1, 1, 1, 1]
tup = list(zip(id, period, wage, consumption))
df = pd.DataFrame(tup,
                  columns=['id', 'period', 'wage', 'consumption'])
With two variables I could simply do this:
df['wage_chg']= df.sort_values(by=['id', 'period']).groupby(['id'])['wage'].apply(lambda x: (x/x.shift(4)-1)).fillna(0)
df['consumption_chg']= df.sort_values(by=['id', 'period']).groupby(['id'])['consumption'].apply(lambda x: (x/x.shift(4)-1)).fillna(0)
But maybe by using a for loop or something I could iterate over my column names creating new growth rate columns with the name columnname_chg as in the example above.
Any ideas?
Thanks
You can try a DataFrame operation rather than a Series operation in groupby.apply:
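cols = ['wage', 'consumption']
# group_keys=False keeps the original row index so that join can align the result
out = df.join(df.sort_values(by=['id', 'period'])
                .groupby(['id'], group_keys=False)[cols]
                .apply(lambda g: (g/g.shift(4)-1)).fillna(0)
                .add_suffix('_chg'))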
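The same growth rates can also be computed without apply, by shifting within each group and dividing whole columns at once. The following is a sketch on the example data from the question; the names `lagged` and `growth` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 1, 1],
    'period': [1, 2, 3, 4, 5, 6, 7, 8],
    'wage': [10, 20, 30, 40, 50, 60, 70, 80],
    'consumption': [5, 10, 15, 20, 25, 30, 35, 40],
})

cols = ['wage', 'consumption']
df = df.sort_values(by=['id', 'period'])

# Value of each column four periods earlier, within the same id
lagged = df.groupby('id')[cols].shift(4)

# Growth rate relative to the lagged value; the first four periods have no lag
growth = (df[cols] / lagged - 1).fillna(0).add_suffix('_chg')

out = df.join(growth)
print(out)
```

Extending `cols` to more variables adds one `_chg` column per entry, with no change to the rest of the code.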

Tensorflow filter operation on dataset with several columns

I want to create a subset of my data by applying tf.data.Dataset filter operation. I have this data:
data = tf.convert_to_tensor([[1, 2, 1, 1, 5, 5, 9, 12], [1, 2, 3, 8, 4, 5, 9, 12]])
dataset = tf.data.Dataset.from_tensor_slices(data)
I want to retrieve the subset of 'dataset' that corresponds to all elements whose first column is equal to 1. So the result should be:
[[1, 1, 1], [1, 3, 8]] # dtype : dataset
I tried this:
subset = dataset.filter(lambda x: tf.equal(x[0], 1))
But I don't get the correct result, since it just gives me back x[0].
Can someone help me?
I finally resolved it:
a = tf.convert_to_tensor([1, 2, 1, 1, 5, 5, 9, 12])
b = tf.convert_to_tensor([1, 2, 3, 8, 4, 5, 9, 12])
data_set = tf.data.Dataset.from_tensor_slices((a, b))
subset = data_set.filter(lambda x, y: tf.equal(x, 1))
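To check what the filter keeps, the resulting dataset can be iterated. This is a minimal sketch, assuming TensorFlow 2 with eager execution; `pairs` is a name introduced here.

```python
import tensorflow as tf

a = tf.convert_to_tensor([1, 2, 1, 1, 5, 5, 9, 12])
b = tf.convert_to_tensor([1, 2, 3, 8, 4, 5, 9, 12])

# Each dataset element is one (a[i], b[i]) pair
data_set = tf.data.Dataset.from_tensor_slices((a, b))

# Keep only the pairs whose first component equals 1
subset = data_set.filter(lambda x, y: tf.equal(x, 1))

# Materialize the filtered pairs as plain Python integers
pairs = [(int(x), int(y)) for x, y in subset.as_numpy_iterator()]
print(pairs)
```

Slicing the two rows as a tuple is what makes the filter work per column pair. Slicing the 2-D tensor directly yields whole rows as elements instead.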

How to delete rows from column which have matching values in the list Pandas

I am finding outliers in a column and storing them in a list. Now I want to delete from the column all the values that are present in my list.
How can I achieve this?
This is my function for finding outliers
outlier=[]
def detect_outliers(data):
    threshold=3
    m = np.mean(data)
    st = np.std(data)
    for i in data:
        #calculating z-score value
        z_score=(i-m)/st
        #if the absolute z-score is greater than the threshold, the value is an outlier
        if np.abs(z_score)>threshold:
            outlier.append(i)
    return outlier
This is my column in data frame
df_train_11.AMT_INCOME_TOTAL
import numpy as np, pandas as pd
df = pd.DataFrame(np.random.rand(10,5))
outlier_list=[]
def detect_outliers(data):
    threshold=0.5
    for i in data:
        #calculating the z-score of every value in column i
        z_score=(data.loc[:,i]- np.mean(data.loc[:,i])) /np.std(data.loc[:,i])
        outliers = np.abs(z_score)>threshold
        #storing the row labels of the outliers in column i
        outlier_list.append(data.index[outliers].tolist())
    return outlier_list
outlier_list = detect_outliers(df)
[[1, 2, 4, 5, 6, 7, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 4, 8],
[0, 1, 3, 4, 6, 8],
[0, 1, 3, 5, 6, 8, 9]]
This way, you get the outliers of each column. outlier_list[0] gives you [1, 2, 4, 5, 6, 7, 9], which means that rows 1, 2, etc. are outliers for column 0.
EDIT
Shorter answer:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
This will filter the DataFrame, keeping only the rows where ONE column (e.g. 'B') is within three standard deviations of its mean.
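To remove the rows whose values appear in the outlier list, as asked in the question, one option is a boolean mask built with isin. The sketch below uses made-up income values and a variant of detect_outliers that returns the list instead of appending to a global. The population standard deviation (ddof=0) is an assumption chosen here to mirror np.std.

```python
import pandas as pd

# Hypothetical data: twenty ordinary incomes and one extreme value
df_train = pd.DataFrame(
    {'AMT_INCOME_TOTAL': [100.0] * 10 + [110.0] * 10 + [10000.0]})

def detect_outliers(data, threshold=3):
    """Return the values whose absolute z-score exceeds the threshold."""
    z_score = (data - data.mean()) / data.std(ddof=0)
    return data[z_score.abs() > threshold].tolist()

outlier = detect_outliers(df_train['AMT_INCOME_TOTAL'])

# isin marks the rows whose value is in the list; ~ inverts the mask
cleaned = df_train[~df_train['AMT_INCOME_TOTAL'].isin(outlier)]
print(outlier, len(cleaned))
```

Note that isin drops every row with a matching value, so a value that occurs more than once is removed everywhere it appears.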

Creating dictionary from a numpy array "ValueError: too many values to unpack"

I am trying to create a dictionary from a relatively large numpy array. I tried using the dictionary constructor like so:
elements =dict((k,v) for (a[:,0] , a[:,-1]) in myarray)
I am assuming I am doing this incorrectly since I get the error: "ValueError: too many values to unpack"
The numPy array looks like this:
[ 2.01206281e+13 -8.42110000e+04 -8.42110000e+04 ..., 0.00000000e+00
3.30000000e+02 -3.90343147e-03]
I want the first column 2.01206281e+13 to be the key and the last column -3.90343147e-03 to be the value for each row in the array
Am I on the right track/is there a better way to go about doing this?
Thanks
Edit: let me be more clear I want the first column to be the key and the last column to be the value. I want to do this for every row in the numpy array
This is kind of a hard question to answer without knowing what exactly myarray is, but this might help you get started.
>>> import numpy as np
>>> a = np.random.randint(0, 10, size=(3, 2))
>>> a
array([[1, 6],
[9, 3],
[2, 8]])
>>> dict(a)
{1: 6, 2: 8, 9: 3}
or
>>> a = np.random.randint(0, 10, size=(3, 5))
>>> a
array([[9, 7, 4, 4, 6],
[8, 9, 1, 6, 5],
[7, 5, 3, 4, 7]])
>>> dict(a[:, [0, -1]])
{7: 7, 8: 5, 9: 6}
elements = dict( zip( * [ iter( myarray ) ] * 2 ) )
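What happens here is that we create an iterator over myarray, put it in a list, and double that list. The same iterator is now bound to the first and second position of the list, which we pass as arguments to zip. zip therefore pulls consecutive items as pairs, and dict builds the dictionary from them. Because it pairs consecutive items, this fits a flat sequence of alternating keys and values rather than a two-dimensional array.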
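For the mapping asked about in the question (first column as key, last column as value, for every row), zipping the two column slices is probably the most direct route. The array below is a hypothetical stand-in for myarray. The second half shows the iterator idiom on a flat list for comparison.

```python
import numpy as np

# Hypothetical stand-in for myarray: first column = keys, last column = values
myarray = np.array([[2.0, -1.0, 0.5],
                    [3.0, 4.0, 1.5],
                    [5.0, 0.0, 2.5]])

# Pair first and last column row by row; tolist() yields plain Python floats
elements = dict(zip(myarray[:, 0].tolist(), myarray[:, -1].tolist()))
print(elements)

# The iterator idiom pairs consecutive items of a flat sequence:
# key, value, key, value, ...
flat = ['a', 1, 'b', 2]
pairs = dict(zip(*[iter(flat)] * 2))
print(pairs)
```

If keys repeat in the first column, later rows overwrite earlier ones, as with any dict construction.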