Text manipulation using awk

Input file (example_file.txt):
chr20:1000026:T:C, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180
Desired output (using awk):
chr20:1000026:T:C, C, T, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, G, A, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180
I can get the desired output with:
awk '{print $1}' example_file.txt > file1.tmp
awk -F: '{print $4",", $3","}' example_file.txt > file2.tmp
awk '{print $2, $3, $4, $5, $6, $7, $8, $9, $10, $11}' example_file.txt > file3.tmp
paste file1.tmp file2.tmp file3.tmp > output.file
output.file:
chr20:1000026:T:C, C, T, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, G, A, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180
but this method is fragmented and tedious, and the actual input files have far more than 11 columns.

Splitting $1 and prepending the parts to $2:
$ awk '
BEGIN {
    FS=OFS=", "            # proper field delimiters
}
{
    n=split($1,a,/:/)      # get the parts of the first field
    for(i=3;i<=n;i++)      # from the 3rd part on
        $2=a[i] OFS $2     # prepend to the 2nd field
}1' file                   # 1 is always true, so every record is printed
Output:
chr20:1000026:T:C, C, T, 0.997, 0, 0.998, 0, 0.013, 0.980, 0.989, 1.000, 0, 0.995
chr20:10000775:A:G, G, A, 1.000, 0, 0.938, 0, 0, 0.982, 0, 0, 1.985, 1.180

Related

How do I sum values of a column cumulatively with awk?

I have a sample.csv and want to sum it cumulatively by column, as below:
Input csv:              Output csv:
01/01/2020, 0, 0, 2, 1  01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 1, 3  18/04/2022, 7, 5, 3, 4
01/05/2022, 8,21, 9, 4  01/05/2022,15,26,12, 8
I've tried
awk '{ for (i=1; i<=NF; ++i) {sum[i]+=$i; $i=sum[i] }; print $0}' sample.csv
But it returns this instead:
Input csv:              Output csv:
01/01/2020, 0, 0, 2, 1  01/01/2020, 0, 0, 2, 1, 0, 0, 0, 0, 0
18/04/2022, 7, 5, 1, 3  18/04/2022, 7, 5, 1, 3, 0, 0, 0, 0, 0
01/05/2022, 8,21, 9, 4  01/05/2022, 8,21, 9, 4, 0, 0, 0, 0, 0
I'm at a loss as to how to resolve this.
Note: I am writing this in a bash script, not at the terminal, and I'm not allowed to use any tools other than awk for this.
I can't duplicate your output. Other than whitespace mangling, this seems to do what you want:
awk '{ for (i=2; i<=NF; i+=1) {
           sum[i]+=$i; $(i)=sum[i];
       }; print $0 }' FS=, OFS=, sample.csv
To get the whitespace you want, you could do:
awk '{
    for (i=2; i<=NF; i+=1) {
        sum[i]+=$i; $(i)=sum[i];
    }
    printf "%s,%2d,%2d,%2d,%2d\n", $1, $2, $3, $4, $5
}' FS=, sample.csv
If you don't know the number of columns, you could write that final printf in a loop (a sketch follows the version list below).
Tested and confirmed working on
gawk 5.1.1,
mawk 1.3.4,
mawk 1.9.9.6, and
macOS nawk.
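A sketch of that printf loop, keeping the same two-character column width (this variant is mine, not part of the tested answer above):

awk '{
    printf "%s", $1              # the date column passes through unchanged
    for (i=2; i<=NF; i+=1) {
        sum[i] += $i             # running total per column
        printf ",%2d", sum[i]
    }
    printf "\n"
}' FS=, sample.csv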
——————————————————————————————————————————————
# gawk profile, created Thu May 19 15:59:38 2022
function fmt(_) {
    return +_<=_^(_<_) \
        ? "" : $_ = sprintf("%5.f",___[_]+=$_)
}
BEGIN { split(sprintf("%0*.f",((__=++_)+(++_*_)\
            )^++_,!_),___,"")
        OFS = ", "
        FS = "[,][ ]*"
} { _ = NF
    while (_ != __) { fmt(_--) } } _
——————————————————————————————————————————————
01/01/2020, 0, 0, 2, 1
18/04/2022, 7, 5, 3, 4
01/05/2022, 15, 26, 12, 8

How to one hot encode numpy array of arrays using keras to_categorical?

I have target data y_tr of shape (92107, 49), where each sample (row) is an array of length 49. I would like to loop over y_tr and one-hot encode each array. I have been using keras to_categorical, but I am facing an IndexError.
Array in y_tr:
array([ 379, 379, 379, 252, 4166, 391, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0], dtype=int32)
The number of classes to encode each integer index is 10,000 (I do not wish to encode zero values):
onehots = []
for seq in y_tr[:, 1:]:
    try:
        onehots.append(to_categorical(seq - 1, num_classes=tar_vocab_size, dtype=np.int32))
    except IndexError as e:
        raise e
76 n = y.shape[0]
77 categorical = np.zeros((n, num_classes), dtype=dtype)
---> 78 categorical[np.arange(n), y] = 1
79 output_shape = input_shape + (num_classes,)
80 categorical = np.reshape(categorical, output_shape)
IndexError: index 28350 is out of bounds for axis 1 with size 10000
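For context, to_categorical builds an (n, num_classes) array and then indexes its columns with the labels, so any label greater than or equal to num_classes raises exactly this error. A minimal sketch of the failure and one way around it (the label values here are illustrative, not taken from y_tr):

import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([379, 28351, 4166], dtype=np.int32)  # illustrative token ids
shifted = labels - 1                                   # the question shifts ids down by one
# to_categorical(shifted, num_classes=10000)           # IndexError: 28350 >= 10000
num_classes = int(shifted.max()) + 1                   # num_classes must exceed the largest index
onehot = to_categorical(shifted, num_classes=num_classes, dtype=np.int32)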

Can't use deployed TF BERT model to get GCloud online predictions from SavedModel: "Bad Request" error

I trained a BERT model based on this notebook.
I exported it as a TF SavedModel this way:
def serving_input_fn():
    receiver_tensors = {
        "input_ids": tf.placeholder(dtype=tf.int32, shape=[1, MAX_SEQ_LENGTH])
    }
    features = {
        "input_ids": receiver_tensors['input_ids'],
        "input_mask": 1 - tf.cast(tf.equal(receiver_tensors['input_ids'], 0), dtype=tf.int32),
        "segment_ids": tf.zeros(dtype=tf.int32, shape=[1, MAX_SEQ_LENGTH]),
        "label_ids": tf.placeholder(tf.int32, [None], name='label_ids')
    }
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

estimator._export_to_tpu = False
estimator.export_saved_model("export", serving_input_fn)
Then if I try to use the saved model locally it works:
from tensorflow.contrib import predictor

predict_fn = predictor.from_saved_model("export/1575241274/")
print(predict_fn({
    "input_ids": [[101, 10468, 99304, 11496, 171, 112, 10176, 22873, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
}))
# {'probabilities': array([[-0.01023898, -4.5866656 ]], dtype=float32), 'labels': 0}
Then I uploaded the SavedModel to a bucket and created a model and a model version on gcloud this way:
gcloud alpha ai-platform versions create v1gpu --model [...] --origin=[...] --python-version=3.5 --runtime-version=1.14 --accelerator=^:^count=1:type=nvidia-tesla-k80 --machine-type n1-highcpu-4
No issue there, the model is deployed and displayed as working in the console.
But if I try to get predictions, as such:
import googleapiclient.discovery

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/[project_name]/models/[model_name]/versions/v1gpu'
response = service.projects().predict(
    name=name,
    body={'instances': [{
        "input_ids": [[101, 10468, 99304, 11496, 171, 112, 10176, 22873, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
    }]}
).execute()
print(response["predictions"])
All I get is the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://ml.googleapis.com/v1/projects/[project_name]/models/[model_name]/versions/v1gpu:predict?alt=json returned "Bad Request">
I get the same error if I test the model from the gcloud console using the "Test your model with sample input data" feature.
Edit:
The saved_model has a tagset "serve" and signature_def "serving_default".
Output of "saved_model_cli show --dir 1575241274/ --tag_set serve --signature_def serving_default":
The given SavedModel SignatureDef contains the following input(s):
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (1, 128)
      name: Placeholder:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['labels'] tensor_info:
      dtype: DT_INT32
      shape: ()
      name: loss/Squeeze:0
  outputs['probabilities'] tensor_info:
      dtype: DT_FLOAT
      shape: (1, 2)
      name: loss/LogSoftmax:0
Method name is: tensorflow/serving/predict
The body of the request sent to the API has the form:
{"instances": [<instance 1>, <instance 2>, ...]}
As specified in the documentation, we need something like this:
{
  "instances": [
    <object>
    ...
  ]
}
In this case you have:
{
  "instances": [
    {
      "input_ids":
        [ <object> ]
    }
    ...
  ]
}
You need to replace the input_ids object with the bare list, so that each instance is the list itself:
{
  "instances": [
    [101, 10468, 99304, 11496, 171, 112, 10176, 22873, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  ]
}
Note: if you can show the saved_model_cli output, that would be great.
The gcloud local predict command is also a good option for testing, for example as sketched below.
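A hypothetical invocation (the model directory and file name here are placeholders), where instances.json holds one JSON instance per line:

gcloud ai-platform local predict \
  --model-dir=export/1575241274 \
  --json-instances=instances.json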
It depends on the signature of the model. In my case, I have the following signature (keeping just the input part):
The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 128)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 128)
      name: serving_default_input_ids:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 128)
      name: serving_default_token_type_ids:0
and I need to pass data in the following format (in this case 2 examples):
{'instances':
  [
    {'input_ids': [101, 143, 18267, 15470, 90395, ...],
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, .....],
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, .....]
    },
    {'input_ids': [101, 17664, 143, 30728, .........],
     'attention_mask': [1, 1, 1, 1, 1, 1, 1, .......],
     'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, ....]
    }
  ]
}
I am using this with a Keras model and TensorFlow 2.2.0.
I guess in your case you need (for 2 examples):
{'instances':
  [
    {'input_ids': [101, 143, 18267, 15470, 90395, ...]},
    {'input_ids': [101, 17664, 143, 30728, .........]}
  ]
}
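Putting that together with the client code from the question, the request would look something like this (a sketch; whether the bare-list or the keyed form is correct depends on the signature reported by saved_model_cli):

import googleapiclient.discovery

service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/[project_name]/models/[model_name]/versions/v1gpu'
# one instance: a dict keyed by input name, holding a flat list of 128 token ids
tokens = [101, 10468, 99304, 11496, 171, 112, 10176, 22873, 119, 102] + [0] * 118
response = service.projects().predict(
    name=name,
    body={'instances': [{'input_ids': tokens}]}
).execute()
print(response["predictions"])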

JAGS: prior distribution inside a distribution

This is our first model:
# Data:
x1 = as.factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1))
x2 = as.factor(c(0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1))
x3 = as.factor(c(0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1))
x4 = as.factor(c(0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1))
n = rep(1055, 16)
y = c(239, 31, 15, 11, 7, 5, 18, 100, 262, 32, 38, 32, 8, 7, 16, 234)
# Model:
mymodel = function(){
    for (i in 1:16){
        y[i] ~ dbin(theta[i], n[i])
        eta[i] <- gamma*x1[i] + beta1*x2[i] + beta2*x3[i] + beta3*x4[i]
        theta[i] <- 1/(1+exp(-eta[i]))
    }
    # Priors
    gamma ~ dnorm(0, 0.00001)
    beta1 ~ dnorm(0, 0.00001)
    beta2 ~ dnorm(0, 0.00001)
    beta3 ~ dnorm(0, 0.00001)
}
Now we are asked to add alpha as a Normal with known mean and unknown variance, where the variance itself is given a uniform prior (specified in an image in the original question).
I do not know how to add alpha to the model, and then specify the new parameter in the priors...
You would just add alpha to your linear predictor and give it a prior like any of the other parameters. However, JAGS parameterizes the Normal distribution with precision rather than variance (precision is just the inverse of the variance). Also, you can use logit() on the left-hand side instead of applying the inverse logit yourself. The model would look something like this:
mymodel = function(){
    for (i in 1:16){
        y[i] ~ dbin(eta[i], n[i])
        logit(eta[i]) <- alpha + gamma*x1[i] + beta1*x2[i] + beta2*x3[i] + beta3*x4[i]
    }
    # Priors
    alpha ~ dnorm(0, tau_alpha)
    tau_alpha <- 1 / var_alpha
    var_alpha ~ dunif(0, 10)
    gamma ~ dnorm(0, 0.00001)
    beta1 ~ dnorm(0, 0.00001)
    beta2 ~ dnorm(0, 0.00001)
    beta3 ~ dnorm(0, 0.00001)
}
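A minimal sketch of fitting this, assuming the R2jags package (the data list and saved-parameter names are my choices, not part of the answer). Note that JAGS multiplies by the covariates, so the 0/1 factors are converted to numeric first:

library(R2jags)

# JAGS needs numeric covariates, so convert the 0/1 factors
jags_data <- list(y = y, n = n,
                  x1 = as.numeric(as.character(x1)),
                  x2 = as.numeric(as.character(x2)),
                  x3 = as.numeric(as.character(x3)),
                  x4 = as.numeric(as.character(x4)))

fit <- jags(data = jags_data,
            parameters.to.save = c("alpha", "var_alpha", "gamma",
                                   "beta1", "beta2", "beta3"),
            model.file = mymodel,
            n.chains = 3, n.iter = 10000)
print(fit)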

Using awk to perform an operation on the second value, only on lines with a specific string

OK, I have this, which works. Yay!
awk '/string/ { print $0 }' file1.csv | awk '$2=$2+(int(rand()*10))' > file2.csv
But I want the context of my file also printed into file2.csv; this program only prints the lines that contain my string, which is good, it's a start.
I can't simply apply the operation to the $2 value on every line, because the $2 values on lines that don't contain my string must not be changed.
So, I want the original (file1.csv) contents intact, with the only difference being an adjusted value at $2 on lines matching my string.
Can someone help me? Thank you.
And here are 4 lines from the original csv:
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30720, String, 0, 76, 100
2, 32620, String, 0, 76, 0
Expected output is the same aside from small variations to $2:
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30725, String, 0, 76, 100
2, 32621, String, 0, 76, 0
$ awk 'BEGIN{FS=OFS=", "} /String/{$2+=int(rand()*10)}1' file
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30722, String, 0, 76, 100
2, 32622, String, 0, 76, 0
You probably need to initialize the random seed in the BEGIN block as well; otherwise you'll always get the same rand() sequence.
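For example, a seeded sketch of the one-liner above (mine, for illustration):

awk 'BEGIN{srand(); FS=OFS=", "} /String/{$2+=int(rand()*10)}1' file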
Something like this, maybe:
$ awk -F", " -v OFS=", " '$3 ~ /String/ { $2=$2+(int(rand()*10))} { print $0 }' file1.csv
2, 0, Control, 0, 7, 1000
2, 0, Control, 0, 10, 540
2, 30722, String, 0, 76, 100
2, 32622, String, 0, 76, 0