Downsample a Tensor - tensorflow

Let's assume we have a 3D Tensor of shape a = [batch_size, length, 1] and we want to discard every 5th sample along the length axis. The new indices for every batch element could be calculated as indices = tf.where(tf.range(a.shape[1]) % 5 != 0).
Could you please help me with an operation that produces the shorter Tensor b of shape [batch_size, length2, 1], where length2 = 4/5 * length? I assume this is attainable with tf.gather_nd, but I am having trouble providing the indices in the right format. Simply tiling the indices Tensor batch_size times and passing the resulting 2D tensor, together with the 3D tensor, to tf.gather_nd does not work.
Thank you.

You can simply do the following:
import tensorflow as tf
# Example data
a = tf.reshape(tf.range(60), [5, 12, 1])
print(a.numpy()[:, :, 0])
# [[ 0 1 2 3 4 5 6 7 8 9 10 11]
# [12 13 14 15 16 17 18 19 20 21 22 23]
# [24 25 26 27 28 29 30 31 32 33 34 35]
# [36 37 38 39 40 41 42 43 44 45 46 47]
# [48 49 50 51 52 53 54 55 56 57 58 59]]
# Mask out every fifth item (indices 0, 5, 10, ...)
mask = tf.not_equal(tf.range(tf.shape(a)[1]) % 5, 0)
b = tf.boolean_mask(a, mask, axis=1)
# Show result
print(b.numpy()[:, :, 0])
# [[ 1 2 3 4 6 7 8 9 11]
# [13 14 15 16 18 19 20 21 23]
# [25 26 27 28 30 31 32 33 35]
# [37 38 39 40 42 43 44 45 47]
# [49 50 51 52 54 55 56 57 59]]
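If you want to sanity-check the masking logic outside TensorFlow, the same boolean mask applied along axis 1 works in plain NumPy:

```python
import numpy as np

# Same data and mask as the TensorFlow example above
a = np.arange(60).reshape(5, 12, 1)
mask = np.arange(a.shape[1]) % 5 != 0   # False at indices 0, 5, 10
b = a[:, mask, :]                       # boolean-mask along the length axis
print(b.shape)  # (5, 9, 1)
```

With length = 12, three indices (0, 5, 10) are dropped, so length2 is 9 rather than exactly 4/5 * length; the 4/5 ratio holds only when length is a multiple of 5.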

Related

How to append pyspark dataframes inside a for loop?

For example, I have a pyspark dataframe:
df=
x_data y_data
2.5 1.5
3.5 8.5
4.5 89.5
5.5 20.5
Let's say I have some calculation to be done on each column of df, which I do inside a for loop. After that, my final output should look like this:
df_output=
cal_1 cal_2 Cal_3 Cal_4 Datatype
23 24 34 36 x_data
12 13 18 90 x_data
23 54 74 96 x_data
41 13 38 50 x_data
53 74 44 6 y_data
72 23 28 50 y_data
43 24 44 66 y_data
41 23 58 30 y_data
How do I append these results calculated on each column into the same pyspark output data frame inside the for loop?
You can use functools.reduce to union the list of dataframes created in each iteration.
Something like this:
import functools
from pyspark.sql import DataFrame

output_dfs = []
for c in df.columns:
    # do some calculation
    df_output = _  # calculation result
    output_dfs.append(df_output)

df_output = functools.reduce(DataFrame.union, output_dfs)
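As a minimal sketch of the functools.reduce pattern itself (using plain lists instead of Spark DataFrames, so no Spark session is needed):

```python
import functools

# Each "result" stands in for one per-column DataFrame produced in the loop
results = [[1, 2], [3, 4], [5, 6]]

# reduce applies the pairwise combiner left to right, like DataFrame.union
combined = functools.reduce(lambda acc, nxt: acc + nxt, results)
print(combined)  # [1, 2, 3, 4, 5, 6]
```

Note that DataFrame.union matches columns by position, not by name; if your per-column results can have columns in different orders, use DataFrame.unionByName instead.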

Length of passed values is 1, index implies 10

Why does this error occur, and what does it mean? It says "Length of passed values is 1, index implies 10". I have tried running the code many times and get the same error:
ser = pd.Series(np.random.randint(1, 50, 10))
result = np.argwhere(ser % 3==0)
print(result)
argwhere() operates on a NumPy array, not a pandas Series. See below:
a = np.random.randint(1, 50, 12)
a = pd.Series(a)
print(a)
np.argwhere(a.values%3==0)
output
0 28
1 46
2 4
3 40
4 19
5 26
6 6
7 24
8 26
9 30
10 33
11 27
dtype: int64
Out[250]:
array([[ 6],
[ 7],
[ 9],
[10],
[11]])
Please read the documentation for numpy.random.randint: the parameters are (low, high, size).
In your case, you are passing (1, 50, 10), so 10 random numbers are generated between 1 and 50.
If you want the multiples of 3, you need ser[ser % 3 == 0], not np.argwhere.
See similar issue raised earlier and answered on Stack Overflow
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randint(1, 50, 10))
print (ser)
result = ser[ser % 3==0]
print(result)
Output of this will be:
Original Series.
0 17
1 34
2 29
3 15
4 24
5 20
6 21
7 48
8 6
9 42
dtype: int64
Multiples of 3 will be:
3 15
4 24
6 21
7 48
8 6
9 42
dtype: int64
Use Index.tolist:
In [1374]: ser
Out[1374]:
0 44
1 5
2 35
3 10
4 16
5 20
6 25
7 9
8 44
9 16
dtype: int64
In [1372]: l = ser[ser % 3 == 0].index.tolist()
In [1373]: l
Out[1373]: [7]
where l will be a list of indexes of elements which are a multiple of 3.
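To see both approaches side by side on small fixed data (hypothetical values, so no randomness is involved):

```python
import numpy as np
import pandas as pd

ser = pd.Series([3, 5, 9, 10, 12])

# np.argwhere works on the underlying ndarray and returns positions
positions = np.argwhere(ser.values % 3 == 0).ravel()
print(positions)           # [0 2 4]

# boolean indexing on the Series itself keeps both index labels and values
multiples = ser[ser % 3 == 0]
print(multiples.tolist())  # [3, 9, 12]
```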

MultiIndex isn't kept when pd.concating multiple subtotal rows

I lose my MultiIndex when I try to pd.concat a second subtotal row. I'm able to add the first subtotal, but not the second, which is a sum over level B0.
This is how my current df is:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
First Total 6 4 10 8
C1 D0 9 8 11 10
D1 13 12 15 14
First Total 22 20 26 24
C2 D0 17 16 19 18
After trying to add the second subtotal I get this:
lvl0 a b
lvl1 bar foo bah foo
(A0, B0, C2, First Total) 38 36 42 40
(A0, B0, C3, D0) 25 24 27 26
(A0, B0, C3, D1) 29 28 31 30
(A0, B0, C3, First Total) 54 52 58 56
(A0, B0, Second Total) 120 112 136 128
(A0, B1, C0, D0) 33 32 35 34
(A0, B1, C0, D1) 37 36 39 38
(A0, B1, C0, First Total) 70 68 74 72
(A0, B1, C1, D0) 41 40 43 42
You should be able to copy and paste the code below to test:
import pandas as pd
import numpy as np

# creating the MultiIndex
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                      mklbl('B', 2),
                                      mklbl('C', 4),
                                      mklbl('D', 2)])
micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                       ('b', 'foo'), ('b', 'bah')],
                                      names=['lvl0', 'lvl1'])
dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                      .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)

# My code STARTS HERE
# creating the first subtotal
print(dfmi.index)
df1 = dfmi.groupby(level=[0, 1, 2]).sum()
df2 = dfmi.groupby(level=[0, 1]).sum()
df1 = df1.set_index(np.array(['First Total'] * len(df1)), append=True)
dfmi = pd.concat([dfmi, df1]).sort_index(level=[0, 1])
print(dfmi)

# this is where the MultiIndex is lost
df2 = df2.set_index(np.array(['Second Total'] * len(df2)), append=True)
dfmi = pd.concat([dfmi, df2]).sort_index(level=[1])
print(dfmi)
How I would want it to look:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
First Total 6 4 10 8
C1 D0 9 8 11 10
D1 13 12 15 14
First Total 22 20 26 24
C2 D0 17 16 19 18
D1 21 20 23 22
First Total 38 36 42 40
C3 D0 25 24 27 26
D1 29 28 31 30
First Total 54 52 58 56
Second Total 120 112 136 128
B1 C0 D0 33 32 35 34
D1 37 36 39 38
First Total 70 68 74 72
C1 D0 41 40 43 42
D1 45 44 47 46
First Total 86 84 90 88
C2 D0 49 48 51 50
D1 53 52 55 54
First Total 102 100 106 104
C3 D0 57 56 59 58
D1 61 60 63 62
First Total 118 116 122 120
Second Total 376 368 392 384
The first total is the sum over level 2; the second total is the sum over level 1.
dfmi has a 4-level MultiIndex:
In [208]: dfmi.index.nlevels
Out[208]: 4
df2 has a 3-level MultiIndex. Instead, if you use
df2 = df2.set_index([np.array(['Second Total'] * len(df2)), [''] * len(df2)], append=True)
then df2 ends up with a 4-level MultiIndex. When dfmi and df2 have the same number of levels,
then pd.concat([dfmi, df2]) produces the desired result.
One problem you may face when sorting by index labels is that it relies on the strings 'First' and 'Second'
appearing last in alphabetical order. An alternative to sorting by index is to assign a numeric order column
and sort by that instead:
dfmi['order'] = range(len(dfmi))
df1['order'] = dfmi.groupby(level=[0,1,2])['order'].last() + 0.1
df2['order'] = dfmi.groupby(level=[0,1])['order'].last() + 0.2
...
dfmi = pd.concat([dfmi, df1, df2])
dfmi = dfmi.sort_values(by='order')
Incorporating Scott Boston's improvement, the code would then look like this:
import pandas as pd
import numpy as np

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                      mklbl('B', 2),
                                      mklbl('C', 4),
                                      mklbl('Z', 2)])
micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                       ('b', 'foo'), ('b', 'bah')],
                                      names=['lvl0', 'lvl1'])
dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                      .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)

df1 = dfmi.groupby(level=[0, 1, 2]).sum()
df2 = dfmi.groupby(level=[0, 1]).sum()

dfmi['order'] = range(len(dfmi))
df1['order'] = dfmi.groupby(level=[0, 1, 2])['order'].last() + 0.1
df2['order'] = dfmi.groupby(level=[0, 1])['order'].last() + 0.2

df1 = df1.assign(lev4='First').set_index('lev4', append=True)
df2 = df2.assign(lev3='Second', lev4='').set_index(['lev3', 'lev4'], append=True)

dfmi = pd.concat([dfmi, df1, df2])
dfmi = dfmi.sort_values(by='order')
dfmi = dfmi.drop(['order'], axis=1)
print(dfmi)
which yields
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 Z0 1 0 3 2
Z1 5 4 7 6
First 6 4 10 8
C1 Z0 9 8 11 10
Z1 13 12 15 14
First 22 20 26 24
C2 Z0 17 16 19 18
Z1 21 20 23 22
First 38 36 42 40
C3 Z0 25 24 27 26
Z1 29 28 31 30
First 54 52 58 56
Second 120 112 136 128
...
@unutbu points out the nature of the problem: df2 has three levels of a MultiIndex and you need a fourth level.
I would use assign and set_index to create that fourth level:
df2 = df2.assign(lev3='Second Total', lev4='').set_index(['lev3','lev4'], append=True)
This avoids calculating the length of the dataframe.
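The level-padding idea behind both answers can be shown on a tiny frame (hypothetical data): the subtotal frame must be padded to the same number of index levels before the concat, or pandas falls back to a flat index of tuples.

```python
import pandas as pd

df = pd.DataFrame({'v': [1, 2, 3, 4]},
                  index=pd.MultiIndex.from_product([['A0', 'A1'], ['B0', 'B1']]))

sub = df.groupby(level=0).sum()                 # 1-level index: A0, A1
sub = sub.assign(lev2='Total').set_index('lev2', append=True)  # pad to 2 levels

out = pd.concat([df, sub]).sort_index(level=0)
print(out.index.nlevels)  # 2 -- the MultiIndex survives the concat
```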

pandas: conditionally select a row cell for each column based on a mask

I want to be able to extract values from a pandas dataframe using a mask. However, after searching around, I cannot find a solution to my problem.
df = pd.DataFrame(np.random.randint(0,2, size=(2,10)))
mask = np.random.randint(0,2, size=(1,10))
I basically want the mask to serve as an index lookup for each column.
So if the mask was [0,1] for columns [a,b], I want to return:
df.iloc[0,a], df.iloc[1,b]
but in a pythonic way.
I have tried e.g.:
df.apply(lambda x: df.iloc[mask[x], x] for x in range(len(mask)))
which gives a TypeError that I don't understand.
A for loop can work but is slow.
With NumPy, that's covered by advanced indexing and should be pretty efficient:
df.values[mask, np.arange(mask.size)]
Sample run -
In [59]: df = pd.DataFrame(np.random.randint(11,99, size=(5,10)))
In [60]: mask = np.random.randint(0,5, size=(1,10))
In [61]: df
Out[61]:
0 1 2 3 4 5 6 7 8 9
0 17 87 73 98 32 37 61 58 35 87
1 52 64 17 79 20 19 89 88 19 24
2 50 33 41 75 19 77 15 59 84 86
3 69 13 88 78 46 76 33 79 27 22
4 80 64 17 95 49 16 87 82 60 19
In [62]: mask
Out[62]: array([[2, 3, 0, 4, 2, 2, 4, 0, 0, 0]])
In [63]: df.values[mask, np.arange(mask.size)]
Out[63]: array([[50, 13, 73, 95, 19, 77, 87, 58, 35, 87]])
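On a small deterministic frame (hypothetical values, no randomness), the advanced-indexing step looks like this: the mask supplies the row position for each column, paired with the column positions 0..n-1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[10, 11, 12],
                   [20, 21, 22]])
mask = np.array([0, 1, 0])   # which row to take from each column

# row index per column, paired element-wise with column positions
picked = df.values[mask, np.arange(mask.size)]
print(picked)  # [10 21 12]
```

This picks df.iloc[0, 0], df.iloc[1, 1], and df.iloc[0, 2] in a single vectorized operation.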

how to find mid values of plot coordinates and segment the area into smaller rectangles?

I have the data set 'datatoread' and the following code, which populates the dataframe 'MyData'.
datatoread <- "X Y
29 21
18 23
28 24
16 26
3 27
18 29
2 33
3 37
26 39
2 42
25 47
9 54
13 57
17 58
29 60
5 63
23 66
4 69
3 72
17 73
7 73
12 72
8 69
20 66
12 63
8 60
28 58
3 57
18 54
11 47
21 42
8 39
1 37
16 29
3 27
17 22
3 19
6 17
19 14
18 10"
MyData <- read.table(textConnection(datatoread), header = TRUE)
closeAllConnections()
MyData
What I want to do is:
Plot the data, find the mid value on the X-axis, and draw a vertical line from that X value up to the corresponding Y coordinate.
Segment the left half up to the mid-X into equal distances (as shown in the picture below) and tabulate the segments so that the result looks like the following.
Result (please note these coordinates are only indicative; actual values might differ):
Seg X1 Y1 X2 Y2
seg1 18 23 29 21
seg2 29 21 28 24
. . . . .
For the first part of it, I tried this in SAS:
data Trapezoidal;
    set x_y end=last;
    retain integral;
    lag_x = lag(x); lag_y = lag(y);
    if _N_ eq 1 then integral = 0;
    else integral = integral + (x - lag_x) * (y + lag_y) / 2;
run;
What would the equivalent code be in R or SQL?
Alternatively, assume this data:
x <- seq(-12,12,by=1)
y <- dnorm(x,mean=2.5,sd=5)
plot(x,y, type = "l")
z <- cbind(x,y)
plot(z, type = "l")
You can work on 'z' dataframe as well.
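For reference (not R or SQL, but it pins down what the SAS step computes): the retain/lag logic above is a running trapezoidal integral, which can be sketched in NumPy on hypothetical data where the exact answer is known:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x  # straight line, so the exact integral of y dx is x**2

# per-interval trapezoid areas: (x - lag_x) * (y + lag_y) / 2
steps = np.diff(x) * (y[1:] + y[:-1]) / 2

# running sum, starting from 0 like the SAS retain/lag accumulation
integral = np.concatenate(([0.0], np.cumsum(steps)))
print(integral)  # [0. 1. 4. 9.]
```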