Partitioning datasets and getting the dynamic averages of rows with the same ID (objects, in this case) - sql

I have a large dataset with thousands of rows (though fewer columns). I have ordered the rows so that rows belonging to the same 'object' are grouped together, just like the dataset in Table1 below:
#Table1 :
import pandas as pd

data = [['ALFA', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['ALFA', 401740.00, 0.43, 0.26, 0.23, 0.16, 0.09],
['ALFA', 892350.00, 0.58, 0.24, 0.05, 0.07, 0.4],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04],
['Charlie', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Charlie', 401740.00, 0.43, 0.26, 0.14, 0.37, 0.06],
['Charlie', 511830.00, 0.52, 0.16, 0.13, 0.22, 0.01],
['Delta', 590030.00, 0.75, 0.2, 0.34, 0.3, 0],
['Delta', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Delta', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
['Echo', 892350.00, 0.58, 0.24, 0.23, 0.16, 0.09],
['Echo', 590030.00, 0.75, 0.2, 0.05, 0.07, 0.4],
['Echo', 590030.00, 0.75, 0.2, 0.08, 0.26, 0],
['Echo', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Foxtrot', 401740.00, 0.43, 0.26, 0.27, 0.2, 0.01],
['Foxtrot', 511830.00, 0.52, 0.16, 0.29, 0.11, 0.04],
['Golf', 590030.00, 0.75, 0.2, 0.27, 0.2, 0.01],
['Golf', 590030.00, 0.75, 0.2, 0.14, 0.37, 0.06],
['Golf', 351740.00, 0.31, 0.22, 0.13, 0.22, 0.01],
['Hotel', 892350.00, 0.58, 0.24, 0.34, 0.3, 0],
['Hotel', 590030.00, 0.75, 0.2, 0, 0.28, 0],
['Hotel', 590030.00, 0.75, 0.2, 0.29, 0.11, 0.04]]
df = pd.DataFrame(data, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df
However, I would like to write a query that goes through the dataset, partitions it by these objects, and returns only the averages of all the columns for each object in a separate table, much like Table2 below:
#Table2:
data2 = [['ALFA', 548610.00, 0.44, 0.24, 0.24, 0.14, 0.18],
['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0],
['Charlie', 545615.00, 0.66, 0.20, 0.21, 0.25, 0.03],
['Delta', 510600.00, 0.60, 0.21, 0.26, 0.26, 0.02],
['Echo', 665610.00, 0.71, 0.21, 0.13, 0.22, 0.14],
['Foxtrot', 456785.00, 0.48, 0.21, 0.28, 0.16, 0.03],
['Golf', 510600.00, 0.60, 0.21, 0.18, 0.26, 0.03],
['Hotel', 690803.33, 0.69, 0.21, 0.21, 0.23, 0.01]]
df2 = pd.DataFrame(data2, columns= ['Objects', 'Column1', 'Column2', 'Column3', 'Column4', 'Column5', 'Column6'])
df2
Please note that the number of rows per object varies across the dataset, so the query should count the rows for each object and use that count to compute the average of all the columns for each object, then present these values in a new table (much like what a partition window function does).
For instance, note that the value 548610.00 for ALFA in Table2 (Column1) is simply the sum of ALFA's Column1 values in Table1 (351740.00 + 401740.00 + 892350.00) divided by the count of ALFA rows, which is 3.

I believe a simple AVG() with GROUP BY should answer your question:
SELECT Objects,
       AVG(Column1) AS Column1,
       AVG(Column2) AS Column2,
       AVG(Column3) AS Column3,
       AVG(Column4) AS Column4,
       AVG(Column5) AS Column5,
       AVG(Column6) AS Column6
FROM tableA
GROUP BY Objects;
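Since the sample data is already in a pandas DataFrame, the same aggregation can also be done without SQL; a minimal sketch (shown here on a small subset of the question's data):

```python
import pandas as pd

data = [['ALFA', 351740.00, 0.31, 0.22, 0.44, 0.19, 0.05],
        ['ALFA', 401740.00, 0.43, 0.26, 0.23, 0.16, 0.09],
        ['ALFA', 892350.00, 0.58, 0.24, 0.05, 0.07, 0.4],
        ['Bravo', 511830.00, 0.52, 0.16, 0.08, 0.26, 0]]
df = pd.DataFrame(data, columns=['Objects', 'Column1', 'Column2',
                                 'Column3', 'Column4', 'Column5', 'Column6'])

# GROUP BY Objects + AVG(...) is a groupby-mean in pandas
df2 = df.groupby('Objects', as_index=False).mean()
```

Here `df2` has one row per object with the per-column averages, matching the SQL result.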

Related

create a lineplot from a few variables

I have a dataframe with 3 variables, each representing a different time point for the same outcome (e.g. weight):
df = pd.DataFrame({"Time_1": [-4.5, -0.8, -3.0, 0.2, -2.5], \
"Time_2": [-3, -0.2, -2.5, 0.3, 1], "Time_3": [-2, 0, -1, 0.5, 1]})
I want to plot a trajectory for this variable, identical to this graph:
where I have a first point of (0, 0) for the baseline and three additional points on the X axis with the corresponding values.
You could just use df.shift().fillna(0).cumsum().plot(marker='D') to get a plot of the 3 variables together. shift and fillna are used so that the first point can be 0 for all the variables.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Time_1": [-4.5, -0.8, -3.0, 0.2, -2.5], \
"Time_2": [-3, -0.2, -2.5, 0.3, 1], "Time_3": [-2, 0, -1, 0.5, 1]})
df.shift().fillna(0).cumsum().plot(marker='D')
plt.show()

How to shuffle groups of values but not within groups in numpy?

I need to shuffle rows, knowing that the first value in each row is a day number. Rows with the same day number should be kept together. Groups may contain 1, 2, 3 or 4 rows, and each row has the same number of values. The examples below should make this clearer.
I have this:
a = np.array([
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[1, 0.12, 0.13, 0.14],
[2, 0.22, 0.23, 0.24],
[2, 0.22, 0.23, 0.24],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34]
])
I need to have this:
a = np.array([
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[3, 0.32, 0.33, 0.34],
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[0, 0.02, 0.03, 0.04],
[2, 0.22, 0.23, 0.24],
[2, 0.22, 0.23, 0.24],
[1, 0.12, 0.13, 0.14]
])
This approach, using Python's random.sample to permute a list of arrays, is not fast but is easy to follow. It only works if the groups are already sorted into contiguous blocks beforehand.
import random
import numpy as np

random.seed(25)  # used for reproducibility only
groups = a[:, 0].astype('int')
# indices where the group id changes, i.e. the block boundaries
idx = (groups[1:] != groups[:-1]).nonzero()[0] + 1
np.vstack(random.sample(np.split(a, idx), len(idx) + 1))
Output
array([[3. , 0.32, 0.33, 0.34],
[3. , 0.32, 0.33, 0.34],
[3. , 0.32, 0.33, 0.34],
[3. , 0.32, 0.33, 0.34],
[0. , 0.02, 0.03, 0.04],
[0. , 0.02, 0.03, 0.04],
[0. , 0.02, 0.03, 0.04],
[2. , 0.22, 0.23, 0.24],
[2. , 0.22, 0.23, 0.24],
[1. , 0.12, 0.13, 0.14]])
Assuming groups are not split in the input array (i.e. each group forms one contiguous block), you can apply the following strategy:
import numpy as np

# Find the groups and the number of items in each group
unique, srcCounts = np.unique(a[:, 0], return_counts=True)
shuffledGroupPos = np.random.permutation(np.arange(len(unique)))
# Compute the source start/end group indices
srcEnd = np.cumsum(srcCounts)
srcStart = srcEnd - srcCounts
# Find the destination start/end group indices
dstCounts = srcCounts[shuffledGroupPos]
dstEnd = np.cumsum(dstCounts)
dstStart = dstEnd - dstCounts
# Remap the source start/end group indices to the destination order
srcStart = srcStart[shuffledGroupPos]
srcEnd = srcEnd[shuffledGroupPos]
# Output array
result = np.empty_like(a)
# Loop over the groups. While this loop could be avoided,
# the code is much simpler with it.
for i in range(unique.size):
    result[dstStart[i]:dstEnd[i]] = a[srcStart[i]:srcEnd[i]]
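For comparison, here is a fully vectorized sketch of the same idea (my own variant, not taken from either answer above): assign each group a random rank, then stable-sort the rows by that rank. It does not even require the groups to be contiguous in the input, since the sort gathers them together.

```python
import numpy as np

a = np.array([
    [0, 0.02, 0.03, 0.04],
    [0, 0.02, 0.03, 0.04],
    [1, 0.12, 0.13, 0.14],
    [2, 0.22, 0.23, 0.24],
    [2, 0.22, 0.23, 0.24],
])

rng = np.random.default_rng(0)
uniq = np.unique(a[:, 0])                    # sorted group labels
rank = rng.permutation(len(uniq))            # one random rank per group
keys = rank[np.searchsorted(uniq, a[:, 0])]  # rank of each row's group
# a stable sort keeps the original row order within each group
shuffled = a[np.argsort(keys, kind='stable')]
```

The stable sort is what guarantees that rows inside a group keep their relative order.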

Create range from 0 to 1 with step 0.05 in Numpy

I want to create a list from 0 to 1 with step 0.05; the result should look like this: [0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]
I tried the following code, but the output does not seem correct. Could anyone help? Thanks.
print(np.arange(0, 1, 0.05).tolist())
Output:
[0.0, 0.05, 0.1, 0.15000000000000002, 0.2, 0.25, 0.30000000000000004, 0.35000000000000003, 0.4, 0.45, 0.5, 0.55, 0.6000000000000001, 0.65, 0.7000000000000001, 0.75, 0.8, 0.8500000000000001, 0.9, 0.9500000000000001]
You want np.linspace()
np.linspace(0, 1, 21)
Out[]:
array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
It's not necessary to use .tolist(). Try this:
a = np.arange(0, 1, 0.05)
print (a)
Output:
[0. 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65
0.7 0.75 0.8 0.85 0.9 0.95]
This works (though note that, as with np.arange, the endpoint 1 is still excluded):
print(np.arange(0, 1, 0.05).round(2).tolist())
Output:
[0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95]
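To include the endpoint while still thinking in terms of a step size, the number of points for np.linspace can be derived from the step; a small sketch combining the ideas above:

```python
import numpy as np

start, stop, step = 0, 1, 0.05
num = int(round((stop - start) / step)) + 1  # 21 points, endpoint included
vals = np.round(np.linspace(start, stop, num), 2).tolist()
```

The round(2) cleans up the floating-point noise, and linspace guarantees the endpoint is part of the list.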

tensorflow: how does one get the output the same size as the input tensor after segment sum?

I'm using TensorFlow's tf.unsorted_segment_sum function and it works.
For example:
tf.unsorted_segment_sum(tf.constant([0.2, 0.1, 0.5, 0.7, 0.8]),
tf.constant([0, 0, 1, 2, 2]), 3)
Gives the right result:
array([ 0.3, 0.5 , 1.5 ], dtype=float32)
I want to get:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)
I've solved it:
import tensorflow as tf

data = tf.constant([0.2, 0.1, 0.5, 0.7, 0.8])
gr_idx = tf.constant([0, 0, 1, 2, 2])
y, idx, count = tf.unique_with_counts(gr_idx)
group_sum = tf.segment_sum(data, gr_idx)
answer = tf.gather(group_sum, idx)
answer:
array([0.3, 0.3, 0.5, 1.5, 1.5], dtype=float32)
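The same reduce-then-gather pattern can be sketched in plain NumPy, for comparison (using np.bincount for the segment sums; this assumes non-negative integer segment ids, as in the example):

```python
import numpy as np

data = np.array([0.2, 0.1, 0.5, 0.7, 0.8])
seg = np.array([0, 0, 1, 2, 2])

sums = np.bincount(seg, weights=data)  # per-segment sums
expanded = sums[seg]                   # gather back to the input's shape
```

Indexing the per-segment sums with the segment ids plays the role of tf.gather here.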

Add an extra column to an ndarray in Python

I have an ndarray as follows:
feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
I have a position ndarray as follows:
position = [10, 20, 30]
Now I want to add the position values at the beginning of the feature_matrix, as follows:
[[10, 0.1, 0.3], [20, 0.7, 0.8], [30, 0.8, 0.8]]
I tried the answers in this question: How to add an extra column to a numpy array
E.g.,
feature_matrix = np.concatenate((feature_matrix, position), axis=1)
However, I get an error saying:
ValueError: all the input arrays must have same number of dimensions
Please help me to resolve this problem.
This solved my problem. I used np.column_stack.
import numpy as np

feature_matrix = [[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]]
position = [10, 20, 30]
feature_matrix = np.column_stack((position, feature_matrix))
The problem is the shape of the position array: it is 1-D, while feature_matrix is 2-D, so np.concatenate cannot align them along axis=1.
>>> feature_matrix
array([[ 0.1, 0.3],
[ 0.7, 0.8],
[ 0.8, 0.8]])
>>> position
array([10, 20, 30])
>>> position.reshape((3,1))
array([[10],
[20],
[30]])
The solution is (with np.concatenate):
>>> np.concatenate((position.reshape((3,1)), feature_matrix), axis=1)
array([[ 10. , 0.1, 0.3],
[ 20. , 0.7, 0.8],
[ 30. , 0.8, 0.8]])
But np.column_stack is clearly great in your case!
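As one further alternative (my addition, not from the answers above), np.insert can place a 1-D array directly as a new first column, since it broadcasts the values along the insertion axis:

```python
import numpy as np

feature_matrix = np.array([[0.1, 0.3], [0.7, 0.8], [0.8, 0.8]])
position = np.array([10, 20, 30])

# insert `position` before column 0; the 1-D values become a column
result = np.insert(feature_matrix, 0, position, axis=1)
```

Like np.column_stack, this avoids the explicit reshape that np.concatenate requires.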