Get all first elements of list/array in BigQuery - sql

I have a large number of .csv files with the following cell values:
"[[0.0, 4.0], .... , [240.0, 0.0], [248.0, 0.0]]"
The string contains a nested list produced by a histogram reducer with 32 bins for 8-bit data; each inner pair holds the lower bin value and the count.
For instance, the first element contains the lower bin value of the 1st bin (0.0) and the count (4.0). The last element contains the lower bin value of the 32nd bin (248.0) and count (0.0).
Since the lower bin values do not change and are known [0,8,16 ... 248], I would like to extract only the counts i.e.
[4, .... , 0 ]
In Python, this would be straightforward; however, the amount of data is quite large: I have 3,422,250 of these histograms. I therefore considered using Google BigQuery to get the job done.
When I load the CSV data into BigQuery, the histograms are stored as type STRING.
How can I convert nested lists (arrays) that are stored as strings in a CSV file into the ARRAY data type in BigQuery? The documentation says that nested arrays are not yet supported. Are there workarounds?
Guidance on how to get the first element of multiple arrays is very welcome too!
P.S. I already tried to solve the problem upstream, to no avail.

I am not sure this is exactly what you are asking, but I hope the example below (for BigQuery Standard SQL) helps:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id,'[[0.0, 4.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 1.0], [72.0, 1.0], [80.0, 4.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 0.0], [120.0, 0.0], [128.0, 0.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]' histogram UNION ALL
SELECT 2, '[[0.0, 0.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 0.0], [72.0, 0.0], [80.0, 0.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 1.0], [120.0, 0.0], [128.0, 1.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]'
)
SELECT id,
  SPLIT(bin)[OFFSET(0)] value,
  SPLIT(bin)[OFFSET(1)] frequency
FROM `project.dataset.table`, UNNEST(SPLIT(REGEXP_REPLACE(histogram, r'\[\[|]]|\s', ''), '],[')) bin
Note: this assumes that, when the CSV data is loaded into BigQuery, the histograms are stored as type STRING in the form
"[[0.0, 4.0], .... , [240.0, 0.0], [248.0, 0.0]]"
Or, if you want to keep the rows intact and transform the histogram string into an array, you can try the following:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id,'[[0.0, 4.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 1.0], [72.0, 1.0], [80.0, 4.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 0.0], [120.0, 0.0], [128.0, 0.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]' histogram UNION ALL
SELECT 2, '[[0.0, 0.0], [8.0, 0.0], [16.0, 0.0], [24.0, 0.0], [32.0, 0.0], [40.0, 0.0], [48.0, 0.0], [56.0, 0.0], [64.0, 0.0], [72.0, 0.0], [80.0, 0.0], [88.0, 0.0], [96.0, 0.0], [104.0, 0.0], [112.0, 1.0], [120.0, 0.0], [128.0, 1.0], [136.0, 0.0], [144.0, 0.0], [152.0, 0.0], [160.0, 0.0], [168.0, 0.0], [176.0, 0.0], [184.0, 0.0], [192.0, 0.0], [200.0, 0.0], [208.0, 0.0], [216.0, 0.0], [224.0, 0.0], [232.0, 0.0], [240.0, 0.0], [248.0, 0.0]]'
)
SELECT id,
  ARRAY(
    SELECT AS STRUCT
      SPLIT(bin)[OFFSET(0)] value,
      SPLIT(bin)[OFFSET(1)] frequency
    FROM UNNEST(SPLIT(REGEXP_REPLACE(histogram, r'\[\[|]]|\s', ''), '],[')) bin
  ) histogram_as_array
FROM `project.dataset.table`
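For comparison, the Python route mentioned in the question can be sketched as follows. This is only a minimal sketch, assuming each cell is a valid JSON list of [lower_bin, count] pairs as in the samples above; `extract_counts` is a hypothetical helper name and the sample cell is a shortened, made-up value.

```python
import json


def extract_counts(histogram):
    """Return the counts (second item of each [lower_bin, count] pair)."""
    pairs = json.loads(histogram)  # the cell text parses as a JSON list of pairs
    return [count for _lower_bin, count in pairs]


# Shortened, made-up cell in the same format as the question's data.
cell = "[[0.0, 4.0], [8.0, 0.0], [16.0, 1.0]]"
counts = extract_counts(cell)
print(counts)
```

The same idea (drop the bin value, keep the count) is what `SPLIT(bin)[OFFSET(1)]` does per pair in the SQL above.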

Related

How to completely remove left and bottom white margins of matplotlib draw?

import numpy as np
from matplotlib import pyplot as plt
data = np.array([[0.8, 2.4, 2.5, 3.9, 0.0, 4.0, 0.0],
[2.4, 0.0, 4.0, 1.0, 2.7, 0.0, 0.0],
[1.1, 2.4, 0.8, 4.3, 1.9, 4.4, 0.0],
[0.6, 0.0, 0.3, 0.0, 3.1, 0.0, 0.0],
[0.7, 1.7, 0.6, 2.6, 2.2, 6.2, 0.0],
[1.3, 1.2, 0.0, 0.0, 0.0, 3.2, 5.1],
[0.1, 2.0, 0.0, 1.4, 0.0, 1.9, 6.3]])
plt.figure(figsize=(6, 4))
im = plt.imshow(data, cmap="YlGn")
linewidth = 2
for axis in ['top', 'bottom', 'left', 'right']:
    plt.gca().spines[axis].set_linewidth(linewidth)
plt.gca().set_xticks(np.arange(data.shape[1] + 1) - .5, minor=True)
plt.gca().set_yticks(np.arange(data.shape[0] + 1) - .5, minor=True)
plt.gca().grid(which="minor", color="black", linewidth=linewidth)
plt.gca().tick_params(which="minor", bottom=False, left=False)
plt.tight_layout()
plt.gca().set_xticks(ticks=[])
plt.gca().set_yticks(ticks=[])
plt.savefig("test.pdf",
            bbox_inches="tight",
            transparent=True,
            pad_inches=1.0 / 72.0 * linewidth / 2.0)
This code outputs the following PDF, but you can see that there are white borders on the left and bottom, so the PDF is not centered when inserted into LaTeX. How can I solve this problem?
plt result:
import numpy as np
from matplotlib import pyplot as plt
data = np.array([[0.8, 2.4, 2.5, 3.9, 0.0, 4.0, 0.0],
[2.4, 0.0, 4.0, 1.0, 2.7, 0.0, 0.0],
[1.1, 2.4, 0.8, 4.3, 1.9, 4.4, 0.0],
[0.6, 0.0, 0.3, 0.0, 3.1, 0.0, 0.0],
[0.7, 1.7, 0.6, 2.6, 2.2, 6.2, 0.0],
[1.3, 1.2, 0.0, 0.0, 0.0, 3.2, 5.1],
[0.1, 2.0, 0.0, 1.4, 0.0, 1.9, 6.3]])
plt.figure(figsize=(6, 4))
im = plt.imshow(data, cmap="YlGn")
linewidth = 2
for axis in ['top', 'bottom', 'left', 'right']:
    plt.gca().spines[axis].set_linewidth(linewidth)
plt.gca().set_xticks(np.arange(data.shape[1] + 1) - .5, minor=True)
plt.gca().set_yticks(np.arange(data.shape[0] + 1) - .5, minor=True)
plt.gca().grid(which="minor", color="black", linewidth=linewidth)
plt.gca().tick_params(which="minor", bottom=False, left=False)
plt.tight_layout()
plt.gca().set_xticks(ticks=[])
plt.gca().set_yticks(ticks=[])
# Hide the major ticks and their labels so they do not reserve margin space.
plt.gca().tick_params(axis="both",
                      which="major",
                      left=False,
                      bottom=False,
                      labelleft=False,
                      labelbottom=False)
plt.savefig("test.pdf",
            bbox_inches="tight",
            transparent=True,
            pad_inches=1.0 / 72.0 * linewidth / 2.0)
It was an issue with the ticks: hiding the major ticks and their labels with tick_params removes the margins. Solved now.

How to train LSTM model with variable-length sequence input

I'm trying to train an LSTM model in Keras using data with a variable number of time steps. For example, the data looks like:
<tf.RaggedTensor [[[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
[[1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]], ...,
[[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
[[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
[[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]]>
and its corresponding label:
<tf.RaggedTensor [[6, 6], [7, 7], [8], ..., [6], [11, 11, 11, 11, 11], [24, 24, 24, 24, 24]]>
Each input has 13 features, so at each time step the model receives a 1 x 13 vector. Is it possible to do this? I don't mind doing it in PyTorch either.
I tried to align them without a reshape layer.
However, my input for each time step in the LSTM layer is a vector of dimension 13, and each sample contains a variable number of these vectors, which means the number of time steps is not constant across samples. Can you show me a code example of how to train such a model? – TurquoiseJ
First of all, the idea behind window length and time steps is that every window takes the same amount of input; a longer input simply yields more windows and time steps.
We assume the input used to extract features can be divided into multiple windows that travel along the time axis.
[Codes]:
import tensorflow as tf
batched_features = tf.constant( [ [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], ], shape=( 2, 1, 13 ) )
batched_labels = tf.constant( [[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]], shape=( 2, 13 ) )
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
dataset = dataset.batch(10)
batched_features = dataset
[Sample]:
Model: "sequential"
_________________________________________________________________
 Layer (type)                     Output Shape           Param #
=================================================================
 bidirectional (Bidirectional)    (None, 1, 64)          11776
 bidirectional_1 (Bidirectional)  (None, 64)             24832
 dense (Dense)                    (None, 13)             845
=================================================================
Total params: 37,453
Trainable params: 37,453
Non-trainable params: 0
_________________________________________________________________
<BatchDataset element_spec=(TensorSpec(shape=(None, 1, 13), dtype=tf.int32, name=None), TensorSpec(shape=(None, 13), dtype=tf.int32, name=None))>
Epoch 1/100
2022-03-28 05:19:04.116345: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8100
1/1 [==============================] - 8s 8s/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 0.0000e+00 - val_accuracy: 1.0000
Epoch 2/100
1/1 [==============================] - 0s 38ms/step - loss: 0.0000e+00 - accuracy: 1.0000 - val_loss: 0.0000e+00 - val_accuracy: 1.0000
Assume each window consumes 13 input values:
batched_features = tf.constant( [ [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], ], shape=( 2, 1, 13 ) )
batched_labels = tf.constant( [[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]], shape=( 2, 13 ) )
Adding more windows is easy:
batched_features = tf.constant( [ [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], ], shape=( 3, 1, 13 ) )
batched_labels = tf.constant( [[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]], shape=( 3, 13 ) )
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
dataset = dataset.batch(10)
batched_features = dataset
Depending on the purpose, reversed windows can also be used to obtain certain results (e.g. amplitude and frequency).
The results will look like this for each window:
[ Output ]: 2 and 3 windows
# Sequence types with timestep #1:
# <BatchDataset element_spec=(TensorSpec(shape=(None, 1, 13), dtype=tf.int32, name=None), TensorSpec(shape=(None, 13), dtype=tf.int32, name=None))>
# Sequence types with timestep #2:
# <BatchDataset element_spec=(TensorSpec(shape=(None, 1, 13), dtype=tf.int32, name=None), TensorSpec(shape=(None, 13), dtype=tf.int32, name=None))>
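Since the question asks specifically about samples with different numbers of time steps, a common alternative to fixed windows is to pad each batch to a common length and keep a mask of the real steps. The sketch below shows only that preprocessing step in plain Python; it is not the answer's method, `pad_batch` is a hypothetical helper, and the toy batch uses 2 features per step instead of 13. In Keras, such a padded batch is typically combined with a mask (for example via `tf.keras.layers.Masking`). Note that the question's data contains all-zero vectors as real time steps, so an explicit mask or a pad value that never occurs in the data is probably safer than masking on 0.0.

```python
def pad_batch(sequences, n_features, pad_value=0.0):
    """Pad variable-length sequences to the longest one in the batch.

    Returns (padded, mask). padded is a nested list shaped
    [batch, max_len, n_features]; mask[i][t] is True for real time steps
    and False for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pad_steps = [[pad_value] * n_features for _ in range(n_pad)]
        padded.append([list(step) for step in seq] + pad_steps)
        mask.append([True] * len(seq) + [False] * n_pad)
    return padded, mask


# Toy batch: 2 features per step (the question has 13), lengths 2, 1 and 3.
batch = [
    [[0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
]
padded, mask = pad_batch(batch, n_features=2)
print(mask)
```

Padding per batch rather than over the whole dataset keeps the padded length small when sequence lengths vary a lot.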

How to change dtypes of numpy array for tensorflow

I am creating a neural network in tensorflow and I have created the placeholders like this:
input_tensor = tf.placeholder(tf.float32, shape = (None,n_input), name = "input_tensor")
output_tensor = tf.placeholder(tf.float32, shape = (None,n_classes), name = "output_tensor")
During the training process, I was getting the following error:
Traceback (most recent call last):
File "try.py", line 150, in <module>
sess.run(optimizer, feed_dict={X: x_train[i: i + 1], Y: y_train[i: i + 1]})
TypeError: unhashable type: 'numpy.ndarray'
I identified that this is because the data types of my x_train and y_train differ from the data types of the placeholders.
My x_train looks somewhat like this:
array([[array([[ 1., 0., 0.],
[ 0., 1., 0.]])],
[array([[ 0., 1., 0.],
[ 1., 0., 0.]])],
[array([[ 0., 0., 1.],
[ 0., 1., 0.]])]], dtype=object)
It was initially a dataframe like this:
0 [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
1 [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
2 [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
I did x_train = train_x.values to get the numpy array.
And y_train looks like this:
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
x_train has dtype object and y_train has dtype float64.
What I want to know is how I can change the data types of my training data so that it works with the tensorflow placeholders. Or please point out if I am missing something.
It is a little hard to guess what shape you want your data to have, but I am guessing you are looking for one of the two combinations below. I will also try to simulate your data in a Pandas dataframe.
import numpy as np
import pandas as pd
df = pd.DataFrame([[[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]],
[[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]],
[[[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]]], columns = ['Mydata'])
print(df)
x = df.Mydata.values
print(x.shape)
print(x)
print(x.dtype)
Output:
Mydata
0 [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
1 [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
2 [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
(3,)
[list([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
list([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
list([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])]
object
Combination 1
y = [item for sub_list in x for item in sub_list]
y = np.array(y, dtype = np.float32)
print(y.dtype, y.shape)
print(y)
Output:
float32 (6, 3)
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 1. 0.]]
Combination 2
y = [sub_list for sub_list in x]
y = np.array(y, dtype = np.float32)
print(y.dtype, y.shape)
print(y)
Output:
float32 (3, 2, 3)
[[[ 1. 0. 0.]
[ 0. 1. 0.]]
[[ 0. 1. 0.]
[ 1. 0. 0.]]
[[ 0. 0. 1.]
[ 0. 1. 0.]]]
Your x_train is a nested object containing arrays, so you have to unpack it and reshape it. Here's a general purpose hack:
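def unpack(a, aggregate=None):
    # Recursively flatten nested lists/arrays into one flat array of floats.
    if aggregate is None:  # a fresh list per top-level call, not a shared default
        aggregate = []
    for x in a:
        if isinstance(x, float):  # also matches np.float64
            aggregate.append(x)
        else:
            unpack(x, aggregate=aggregate)
    return np.array(aggregate)
# train_x is the original dataframe from the question
x_train = unpack(train_x.values).reshape(train_x.shape[0], -1)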
Once you've got a dense array (y_train is already dense), you can use a function like the following:
def cast(placeholder, array):
    dtype = placeholder.dtype.as_numpy_dtype
    return array.astype(dtype)
x_train, y_train = cast(X,x_train), cast(Y,y_train)
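One detail worth care in a recursive flattener like `unpack` is the accumulator argument: a list used as a default value is created once and shared by every call that relies on the default. The sketch below illustrates the effect in plain Python; `flatten_shared` and `flatten_fresh` are hypothetical names used only for this illustration.

```python
def flatten_shared(a, aggregate=[]):
    """Flatten nested lists; the default list is shared across calls."""
    for x in a:
        if isinstance(x, float):
            aggregate.append(x)
        else:
            flatten_shared(x, aggregate)
    return aggregate


def flatten_fresh(a, aggregate=None):
    """Same flattening, but every top-level call starts with a new list."""
    if aggregate is None:
        aggregate = []
    for x in a:
        if isinstance(x, float):
            aggregate.append(x)
        else:
            flatten_fresh(x, aggregate)
    return aggregate


nested = [[1.0, 0.0], [0.0, 1.0]]
first = list(flatten_shared(nested))   # copy of the result of the first call
second = list(flatten_shared(nested))  # values from the first call leak in
fresh = flatten_fresh(nested)
print(len(first), len(second), len(fresh))
```

With the shared default, a second call on the same data returns twice as many values, which would make a later reshape fail or silently mix samples.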

How to create a new tensor in this situation (derive b from a)?

I have a tensor 'a', and I want to modify one element of it.
a = tf.convert_to_tensor([[1.0, 1.0, 1.0],
[1.0, 2.0, 1.0],
[1.0, 1.0, 1.0]], dtype=tf.float32)
I can get the index of that element:
index = tf.where(a==2)
How to derive 'b' from 'a'?
b = tf.convert_to_tensor([[1.0, 1.0, 1.0],
[1.0, 0.0, 1.0],
[1.0, 1.0, 1.0]], dtype=tf.float32)
I know from this post that I can't modify a tensor in place.
I solved it by using tf.sparse_to_dense():
import tensorflow as tf
a = tf.convert_to_tensor([[1.0, 1.0, 1.0],
[1.0, 2.0, 1.0],
[1.0, 1.0, 1.0]], dtype=tf.float32)
index = tf.where(a > 1)
zero = tf.sparse_to_dense(index, tf.shape(a, out_type=tf.int64), 0., 1.)
update = tf.sparse_to_dense(index, tf.shape(a, out_type=tf.int64), 0., 0.)
b = a * zero + update
with tf.Session() as sess:
    print(sess.run(b))
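To make the mask arithmetic in `b = a * zero + update` concrete, here is the same computation written with plain Python lists instead of tensors. It is only an illustration of the idea, not TensorFlow code; `new_value` is a name introduced here for the replacement value.

```python
a = [[1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0],
     [1.0, 1.0, 1.0]]

# "zero" mask: 0.0 at the positions to replace, 1.0 everywhere else.
zero = [[0.0 if v > 1 else 1.0 for v in row] for row in a]

# "update": the replacement value at the selected positions, 0.0 elsewhere.
new_value = 0.0
update = [[new_value if v > 1 else 0.0 for v in row] for row in a]

# Multiplying by the mask clears the selected entries; adding update fills them.
b = [
    [x * z + u for x, z, u in zip(row_a, row_z, row_u)]
    for row_a, row_z, row_u in zip(a, zero, update)
]
print(b)
```

Because the replacement value here is 0.0, `update` is all zeros; writing a different value would only require changing `new_value`.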

Matplotlib: empty confusion matrix

I need to plot a confusion matrix with this script, but running it produces an empty plot. It seems I am close to a solution. Any hints?
from numpy import *
import matplotlib.pyplot as plt
from pylab import *
conf_arr = [[50.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [3.0, 26.0, 0.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0], [4.0, 1.0, 0.0, 5.0, 0.0, 0.0, 0.0], [3.0, 0.0, 1.0, 0.0, 6.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 47.0, 0.0], [2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0]]
norm_conf = []
for i in conf_arr:
    a = 0
    tmp_arr = []
    a = sum(i, 0)  # row total
    for j in i:
        tmp_arr.append(float(j)/float(a))
    norm_conf.append(tmp_arr)
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111)
res = ax.imshow(array(norm_conf), cmap=cm.jet, interpolation='nearest')
cb = fig.colorbar(res)
savefig("confmat.png", format="png")
Thanks, I have the plot now. However, the ticks on the x-axis are very small (the graph is about 3 cm x 10 cm). How can I enlarge it to get a better-proportioned graph, say a 10 cm x 10 cm plot? Could the reason be that I am displaying the graph as a subplot? I was not able to find suitable documentation on how to adjust that.
You don't need to clear a current figure (plt.clf()) before adding a new one.
#plt.clf() # <<<<< here
fig = plt.figure()
ax = fig.add_subplot(111)
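The row normalisation performed by the loop in the question can also be sketched as a small plain-Python function. `normalize_rows` is a hypothetical helper name, and leaving all-zero rows as zeros is an added assumption (the original loop would divide by zero for such a row).

```python
def normalize_rows(matrix):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Assumption: an all-zero row stays all zeros instead of dividing by 0.
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


# Small made-up confusion matrix; the last row has no samples.
conf = [[3.0, 1.0], [0.0, 4.0], [0.0, 0.0]]
norm = normalize_rows(conf)
print(norm)
```

The resulting nested list can be passed to `imshow` in the same way as `norm_conf` in the question.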