How to add a new field to a json-string column in Hive

Say I have a row that looks like this:
data (string)              | key
======================================
'{ "val1": 3, "val2": 4 }' | 1
How can I add a new field to the json string in the data column? For brevity, say I would like to add a constant value to it.
'{ "val1": 3, "val2": 4, "new_field": "x" }'
I think I can do this with string functions like concat, length and substr with something like
concat( substr(data, 0, length(data) - 1), ', "new_field": "x"', '}' )
I am wondering, is there a more json-native way of doing this?
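For reference, here is what that string-function approach could look like as a full query. This is only a sketch: my_table is a hypothetical table name, and it assumes the JSON string always ends with '}' and has no trailing whitespace (Hive's substr is 1-based, so length(data) - 1 trims just the closing brace):
SELECT concat(substr(data, 1, length(data) - 1), ', "new_field": "x"}') AS data
FROM my_table;
As far as the built-ins go, get_json_object and json_tuple only extract values from a JSON string, so adding a field still comes down to string manipulation like this or to a custom/third-party JSON UDF.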

Related

normalize column data with average value of that column with awk

I have 3 columns in a data file that look like the sample below, and the file continues up to 250 rows:
0.9967 0.7765 0.5798
0.9955 0.7742 0.5767
0.9942 0.7769 0.5734
I want to normalise each column based on the average value of that column.
I am using the code below (e.g. for column 1) but it does not print my desired output.
The results should be very close to 1
awk 'NR==FNR{sum+= $1; next}{avg=(NR/sum)}FNR>1{print($1/avg)}' f.dat f.dat
Expected output for the first column:
1.003
1.001
0.9988
You need separate placeholders for storing the sum and the count of values. I recommend using an array for each, indexed by column.
awk '
NR==FNR {
    for (col=1; col<=NF; col++) {
        avg[col] += $col
        len[col] += 1
    }
    next
}
{
    for (col=1; col<=NF; col++) {
        colAvg = avg[col]/len[col]
        printf "%.3f%s", $col/colAvg, (col<NF ? FS : ORS)
    }
}
' file file
The snippet above already updates the entire table with the new normalized values, so the FNR>1 guard from your attempt is not needed. If you want to increase the precision of the output, change %.3f to however many digits you prefer.
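For example, for five decimal places the printf line would become:
printf "%.5f%s", $col/colAvg, (col<NF ? FS : ORS)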

Awk get unique elements from array

file.txt:
INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
PLCH2:A1007int&PLCH1:D987int&PLCH2:P977L
I am attempting to create a hyperlink by transforming the content of a file. The hyperlink will have the following style:
somelink&gene=<gene>[&gene=<gene>]&mutation=<gene:key>[&mutation=<gene:key>]
where INTS11:P446P corresponds to gene:key for example
The problem is that I am looping over each row to create an array that contains the genes as values, and thus multiple duplicated entries can be found for the same gene.
My attempt is the following
Split on & and store in a
For each element in a, split on : and add a[i] to array b
The problem is that I don't know how to get unique values from my array. I found this question but it talks about files and not arrays like in my case.
The code:
awk '@include "join"
{
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    delete genes
    delete keys
}' file.txt
will output:
somelink&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&gene=INTS11&mutation=INTS11:P446P&mutation=INTS11:P449P&mutation=INTS11:P518P&mutation=INTS11:P547P&mutation=INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&gene=PLCH2&mutation=PLCH2:A1007int&mutation=PLCH1:D987int &mutation=PLCH2:P977L
I wish to obtain something like this (notice how many &gene= entries there are):
somelink&gene=INTS11&mutation=INTS11:P446P&INTS11:P449P&INTS11:P518P&INTS11:P547P&INTS11:P553P
somelink&gene=PLCH2&gene=PLCH1&mutation=PLCH2:A1007int&mutation=PLCH1:D987int&mutation=PLCH2:P977L
EDIT:
my problem was partly solved thanks to Pierre François's answer, which was the SUBSEP trick. My other issue is that I want to get only unique elements from my arrays genes and keys.
Thank you.
Supposing you want to remove the spaces between the fields concatenated with the join function of awk, the 4th argument you have to provide to the join function is the magic number SUBSEP and not an empty string "" as you did. Try:
awk '@include "join"
{
    split($0,a,"&")
    for ( i = 1; i <= length(a); i++ ) {
        split(a[i], b, ":");
        genes[i] = "&gene="b[1];
        keys[i] = "&mutation="b[1]":"b[2]
    }
    print "somelink"join(genes, 1, length(genes),SUBSEP)join(keys, 1, length(keys),SUBSEP)
    delete genes
    delete keys
}' file.txt
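To also address the remaining issue of duplicate &gene= entries, one option is the usual awk idiom of a "seen" array that records which genes have already been emitted. This is only a sketch building on the answer above, assuming gawk with its bundled join library on the AWKPATH:
awk '@include "join"
{
    delete seen; delete genes; delete keys
    g = 0
    n = split($0, a, "&")
    for (i = 1; i <= n; i++) {
        split(a[i], b, ":")
        if (!(b[1] in seen)) {      # first time this gene appears on the line
            seen[b[1]] = 1
            genes[++g] = "&gene=" b[1]
        }
        keys[i] = "&mutation=" b[1] ":" b[2]
    }
    print "somelink" join(genes, 1, g, SUBSEP) join(keys, 1, n, SUBSEP)
}' file.txt
For the first input line this prints a single &gene=INTS11 while keeping every &mutation= entry.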

R inner join different data types

I was wondering if there was a way or maybe another package that uses SQL queries to manipulate dataframes so that I don't necessarily have to convert numerical variables to strings/characters.
library(dplyr)
input_key <- c(9061,8680,1546,5376,9550,9909,3853,3732,9209)
output_data <- data.frame(input_key)
answer_product <- c("Water", "Bread", "Soda", "Chips", "Chicken", "Cheese", "Chocolate", "Donuts", "Juice")
answer_data <- data.frame(cbind(input_key, answer_product), stringsAsFactors = FALSE)
left_join(output_data,answer_data, by = "input_key")
The left_join function from dplyr also works with a numerical value as the key.
I think your problem comes from the cbind function, because its output is a matrix, which can only store one kind of data type. In your case, the numeric values are cast to character.
Unlike a matrix, a data.frame can store different types of data, like a list.
From your code, the key column is converted to character:
> str(answer_data)
'data.frame': 9 obs. of 2 variables:
$ input_key : chr "9061" "8680" "1546" "5376" ...
$ answer_product: chr "Water" "Bread" "Soda" "Chips" ...
If instead you construct the data.frame with:
answer_data_2 <- data.frame(
  input_key = input_key,
  answer_product = answer_product,
  stringsAsFactors = FALSE
)
the key column stays numeric:
> str(answer_data_2)
'data.frame': 9 obs. of 2 variables:
$ input_key : num 9061 8680 1546 5376 9550 ...
$ answer_product: chr "Water" "Bread" "Soda" "Chips" ...
and
left_join(output_data, answer_data_2, by = "input_key")
works with the numerical keys.
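Since the question also asks about packages that manipulate data frames with SQL queries, the sqldf package is one option. A sketch, assuming sqldf is installed and using the answer_data_2 frame built above:
library(sqldf)
sqldf("SELECT o.input_key, a.answer_product
       FROM output_data o
       LEFT JOIN answer_data_2 a ON o.input_key = a.input_key")
Because the join happens in SQL on the data frames as they are, the numeric input_key columns match directly and nothing has to be converted to character first.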

jq test if any of several substrings is in a string

Let's say I have a list of items like this:
[
  "abcdef",
  "defghi",
  "euskdh"
]
I want to write a filter that returns all of the items that contain an "a", "d", or "h". This is the best I could come up with:
. as $val | select(any(["a", "d", "h"]; inside($val)))
Is there any way to do it without using a variable?
Assuming your jq has regex support:
map(select(test("a|d|h")))
Or if you want a stream of values:
.[] | select(test("a|d|h"))
If your jq does not have regex support, then if it has any/2, the following will produce a stream of values:
.[] | select( any( index( "a", "d", "h"); . != null ) )
All else failing, the following will do the job but is inefficient:
.[] | select( [index("a", "d", "h")] | any )
Here is a solution using index:
.[]
| if index("a") or index("d") or index("h") then . else empty end
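For instance, with the three-item array above saved as items.json (a hypothetical file name), the regex-based filter can be run like this:
jq 'map(select(test("a|d|h")))' items.json
Here it returns all three strings, since each of them happens to contain a "d".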

Delete every other line starting with a 1

I can find sed solutions to delete all lines in a text file starting with a '1', as well as solutions to delete every other line in the text file, but I want to combine the two: of all the lines starting with '1', delete every other one of them, and keep the other lines that do not start with a 1.
So if I have a text file:
1, 1
1, 2
2, 3
3, 4
4, 5
2, 6
1, 7
3, 8
1, 9
4, 10
I want the output to be:
1, 1
2, 3
3, 4
4, 5
2, 6
1, 7
3, 8
4, 10
You could do this in awk:
awk -F, '!($1 == 1 && n++ % 2)' file
-F, means use comma as the field separator, so the two numbers on each line will be the variables $1 and $2.
awk will print the line if the last thing it evaluates is true. The ! negates the contents of the parentheses, so in order to print, the contents must be false.
If the first field isn't 1, short-circuiting takes place, as (false && anything) will always be false. This means that the second half after the && will not be evaluated.
If $1 == 1, then the second half is evaluated. As n is being used for the first time in a numeric context, it will assume the value 0. The modulo operation n % 2 will return 0 (false) for even numbers and 1 (true) for odd numbers. Using the increment n++ means that the result will alternate between true and false.
You may prefer the reverse logic, which would be:
awk -F, '$1 != 1 || ++n % 2' file
The || is also short-circuiting, so if the first value isn't 1 then the line gets printed. Otherwise, the second half is evaluated. This time, the increment goes before the n so that the first value of n is 1, making the expression evaluate to true.
Either way, the output is:
1, 1
2, 3
3, 4
4, 5
2, 6
1, 7
3, 8
4, 10
This might work for you (GNU sed):
sed '/^1/{x;/./{z;x;d};x;h}' file
Use the hold space to toggle the deletion of lines beginning with 1.
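Spelled out as a commented script it reads as follows (GNU sed, since it relies on the z command; the file name toggle.sed is just an example, run with sed -f toggle.sed file):
# the hold space acts as a toggle flag: non-empty means the previous
# line starting with 1 was kept, so the next one must be deleted
/^1/ {
  # swap the flag into the pattern space
  x
  # flag is set: this is the second line of the pair
  /./ {
    # empty the pattern space (clears the flag) ...
    z
    # ... swap back, leaving the hold space as the cleared flag ...
    x
    # ... and delete the current line
    d
  }
  # flag was empty: swap the current line back ...
  x
  # ... and raise the flag by copying the line into the hold space
  h
}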
An alternative:
sed '/^1/{:a;n;//d;ba}' file
Here you go:
awk '$1=="1," && !(f=f?0:1) {next} 1' file
1, 1
2, 3
3, 4
4, 5
2, 6
1, 7
3, 8
4, 10
$1=="1," Test if first field is 1
f=f?0:1 Flips f between 0 and 1 every time $1=="1," is true
!(...) True if f is 0
Here's an awk-based solution without requiring any modulo math whatsoever:
[ngm]awk 'FS~NF||_*=--_' FS='^1'
1, 1
2, 3
3, 4
4, 5
2, 6
1, 7
3, 8
4, 10
This leverages the interesting property that x *= --x generates an alternating sequence of 1s and 0s, flipping between the two on every evaluation.
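A quick way to watch that sequence (behaviour as observed with gawk; the expression both reads and modifies _ in a single statement, which is exactly the trick the one-liner relies on):
awk 'BEGIN { for (i = 1; i <= 6; i++) printf "%d ", (_ *= --_); print "" }'
1 0 1 0 1 0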