I'm trying to clean up the syntax of a pseudo-JSON file. The file is too large to open in a text editor (20 GB), so I have to do all of this via the command line (running Arch Linux). The one thing I cannot figure out is how to replace newline characters with sed (GNU sed v4.8).
Specifically I have data of the form:
{
"id" : 1,
"value" : 2
}
{
"id" : 2,
"value" : 4
}
And I need to put a comma after each closing curly bracket (but not the last one). So I want the output to look like:
{
"id" : 1,
"value" : 2
},
{
"id" : 2,
"value" : 4
}
Ideally I'd just do this in sed, but from what I've read, sed strips the newline off each line before processing it, so it's not clear how to replace newline characters.
I'd like to run something like sed 's/}\n{/},\n{/g' test.json, but this doesn't work (nor does using \\n in place of \n).
I've also tried awk, but ran into a similar issue of not being able to match the combination of a hard return and the brackets. And I can get tr to replace the hard returns, but not the combination of characters.
Any thoughts on how to solve this?
Yeah, by default sed works line by line. You cannot match across multiple lines unless you use features that bring multiple lines into the pattern space. Here's one way to do it, provided the input strictly follows the sample shown:
sed '/}$/{N; s/}\n{/},\n{/}' ip.txt
/}$/ matches a } at the end of a line
{} groups commands to be executed for a particular address
N appends the next line to the pattern space
s/}\n{/},\n{/ performs the required substitution
Use the -i option for in-place editing
This solution can fail for sequences like the one shown below, but I assume two lines ending with } will not occur in a row:
}
}
{
abc
}
Use sed '/}$/{N; s/}\n{/},\n{/; P; D}' if the above sequence can occur.
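For example, with GNU sed (which prints the pattern space when N finds no more input), running that variant on the tricky sequence above (saved here as tricky.txt, a placeholder name) gives:
$ sed '/}$/{N; s/}\n{/},\n{/; P; D}' tricky.txt
}
},
{
abc
}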
With your shown samples, please try the following awk program. It sets RS to the null string, so (with no blank lines in the input) the whole file is read as a single record, and then gsub (global substitution) replaces every }\n{ with },\n{:
awk -v RS= '{gsub(/}\n{/,"},\n{")} 1' Input_file
You can use GNU sed's -z option to do that:
$ sed -z 's/}\n{/},\n{/g' file
{
"id" : 1,
"value" : 2
},
{
"id" : 2,
"value" : 4
}
but then it's non-portable, has to read the whole file into memory at once, and is hard to adapt if the file format isn't exactly as you expect (e.g. additional spaces, comment lines, etc.) or you need to make any additional adjustments.
I'd just use awk, e.g. using any awk in any shell on every Unix box:
awk 'NR>1{print prev (prev=="}" ? "," : "")} {prev=$0} END{print prev}' file
{
"id" : 1,
"value" : 2
},
{
"id" : 2,
"value" : 4
}
That is portable across all Unix boxes, reads just one line at a time so it uses almost no memory, and is trivial to adapt for any differences in your input or any additional changes you want made to the output.
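For reference, here is the same one-liner spread over multiple lines with comments; the logic is identical:
awk '
    NR > 1 {
        # print the previous line, adding a comma only if it was a lone "}"
        print prev (prev == "}" ? "," : "")
    }
    { prev = $0 }        # remember the current line for the next cycle
    END { print prev }   # print the last line, never followed by a comma
' file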
I would use GNU AWK in the following way. Let file.txt content be
{
"id" : 1,
"value" : 2
}
{
"id" : 2,
"value" : 4
}
then
awk 'BEGIN{RS="}\n{"}{printf "%s%s",sep,$0;sep="},\n{"}' file.txt
output
{
"id" : 1,
"value" : 2
},
{
"id" : 2,
"value" : 4
}
Explanation: I used RS (the record separator) to split on }\n{; I then avoid ORS, as that would result in a trailing ORS, and instead use the trick described here.
(tested in GNU Awk 5.0.1)
When the last } is on the last line, you can tell sed to skip the replacement on the last line:
sed '$ !s/}/},/' test.json
I am currently trying to insert all our DT files v2 into BQ.
I already did it with the click file and didn't spot any trouble.
But it's not the same story with the activity and impression files.
I wrote a quick script to help me build the schema for the insertion:
import csv, json
import glob

data = []
for i in glob.glob('*.csv'):                # one CSV per DT file
    print i
    b = i.split("_")
    print b[2]                              # file type, e.g. "activity" or "impression"
    with open(i, 'rb') as f:
        reader = csv.reader(f)
        row1 = next(reader)                 # header row
        # sanitise header names so they are valid BigQuery column names
        title = [w.replace(' ', '_').replace('/', '_').replace(':', '_').replace('(', '_').replace(')', '').replace("-", "_") for w in row1]
        print title
        for a in title:
            j = {"name": "{0}".format(a), "type": "string", "mode": "nullable"}
            print j
            if j not in data:               # avoid duplicate fields
                data.append(j)
    with open('schema_' + b[2] + '.json', 'w') as outfile:
        json.dump(data, outfile)
After that, I use this small bash script to insert all our data from our GCS bucket:
#!/bin/bash

prep_files() {
  date=$(echo "$f" | cut -d'_' -f4 | cut -c1-8)
  echo "$n"
  table_name=$(echo "$f" | cut -d'_' -f1-3)
  bq --nosync load --field_delimiter=',' DCM_V2."$table_name""_""$date" "$var" ./schema/v2/schema_"$n".json
}

num=1
for var in $(gsutil ls gs://import-log/01_v2/*.csv.gz)
do
  if test $num -lt 10000
  then
    echo "$var"
    f=$(echo "$var" | cut -d'/' -f5)
    n=$(echo "$f" | cut -d'_' -f3)
    echo "$n"
    prep_files
    num=$(($num+1))
  else
    echo -e "Wait the next day"
    echo "$num"
    sleep $(( $(date -d 'tomorrow 0100' +%s) - $(date +%s) ))
    num=0
  fi
done
echo 'Import done'
echo 'Import done'
But I get this kind of error:
Errors:
Too many errors encountered. (error code: invalid)
/gzip/subrange//bigstore/import-log/01_v2/dcm_accountXXX_impression_2016101220_20161013_073847_299112066.csv.gz: CSV table references column position 101, but line starting at position:0 contains only 101 columns. (error code: invalid)
So I checked the number of columns against my schema with:
$ awk -F',' '{print NF}'
But I have the right number of columns...
So I thought that was because we have commas in some values (some publishers are using a .NET framework that allows commas in URLs). But these values are enclosed in double quotes.
So I made a test with a small file:
id,url
1,http://www.google.com
2,"http://www.google.com/test1,test2,test3"
And this load works...
If someone has a clue to help me, that would be really great. :)
EDIT: I did another test by making the load with an already decompressed file.
Too many errors encountered. (error code: invalid)
file-00000000: CSV table references column position 104, but line starting at position:2006877004 contains only 104 columns. (error code: invalid)
I used this command to find the line: $ tail -c 2006877004 dcm_accountXXXX_activity_20161012_20161013_040343_299059260.csv | head -n 1
I get:
3079,10435077,311776195,75045433,1,2626849,139520233,IT,,28,0.0,22,,4003208,,dc_pre=CLHihcPW1M8CFTEC0woddTEPSQ;~oref=http://imasdk.googleapis.com/js/core/bridge3.146.2_en.html,1979747802,1476255005253094,2,,4233079,CONVERSION,POSTVIEW,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
After that: $ head -n1 dcm_account8897_activity_20161012_20161013_040343_299059260.csv | awk -F',' '{print NF}'
Response: 102
So one row has 104 columns and another only 102...
Anyone else having trouble with the DT files v2?
I had a similar issue and found the problem was due to a few records being split into 2 lines by carriage returns. Removing the \r characters solved the problem.
The line affected is usually not the line reflected in the error log.
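For example, a minimal way to strip the carriage returns before loading (the file names below are just placeholders):
# Delete every carriage return so each record stays on a single line.
tr -d '\r' < input.csv > clean.csv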
I would open the CSV file in Google Sheets and compare the columns with the schema you generated.
Most probably you will find a bug in the schema.
I am working on CentOS 6. I want to perform mass insertion into a Redis cache from a MySQL table that has more than 10 million records. I have already read about the Redis protocol and its format, and I am able to copy the table data into Redis protocol format in a text file. But when I try to execute the pipe command, I get an error:
Error reading from the server: Connection reset by peer
or
ERR Protocol error: expected '$', got '3'
I am using the following command:
cat insert.txt | redis-cli --pipe
insert.txt contains data in this format:
*3
$3
SET
$2
K1
$2
V1
If possible, please tell me the format for multiple commands in this text file; in the above example, the file only contains one SET command. I would be thankful if you could give me an example of a text file that has at least two commands in it.
I have also tried the file data in this format:
"*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n"
It gives the following error:
ERR unknown command '*3 $3 SET $3 key $5 value '
Right, so it's really important that you send the data with CRLF line separators -- not the string \r\n but actually those characters. This means no regular Unix line endings, which are just LF (\n).
How you insert those depends on the editor you use, but the easiest way to work with what you have would be to keep each element on its own line:
*3
$3
SET
$2
K1
$2
V2
And use sed:
cat insert.txt | sed 's/$/\r/g' | redis-cli --pipe
Should do the trick.
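If you want to double-check what line endings a file actually contains, od -c prints \r and \n explicitly, so CRLF endings show up as \r \n pairs:
$ head -c 64 insert.txt | od -c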
Then, in order to send many commands in one go (pipelining), you just append more commands to the file, like so:
*3
$3
SET
$2
K1
$2
V2
*3
$3
SET
$2
K2
$2
V3
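Finally, since the data is coming from a MySQL table, you don't need to build the file by hand. A minimal sketch, assuming the table is exported as a two-column key,value CSV with plain ASCII fields and no embedded commas (export.csv is a placeholder name):
awk -F',' '{
    # one SET command per row in Redis protocol form;
    # length() equals the byte count only for plain ASCII fields
    printf "*3\r\n$3\r\nSET\r\n$%d\r\n%s\r\n$%d\r\n%s\r\n", length($1), $1, length($2), $2
}' export.csv | redis-cli --pipe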
awk 'BEGIN{FS=OFS=","}{print $2,$3,$5;}' <file>
Using this command, is it possible to have it go through multiple files, i.e. with file* at the end, and if not, how can I do this?
I have
file.01
file.02
through file.20
All of the files can be replaced directly. An output file is not needed, although I still need the split files to exist in their current chunks of 250 MB.
Yes - awk takes any number of files as arguments, and processes them in sequence. See man awk:
SYNOPSIS
awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]
and, in fact, you can access the current filename with the variable FILENAME.
You can do this with:
awk 'BEGIN{FS=OFS=","}{print $2,$3,$5;}' file*
for all files whose names start with file.
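Since you mention the files can be replaced directly, here is a minimal sketch using GNU awk's inplace extension (assuming gawk 4.1 or later is installed; keep a backup if the data matters):
# Rewrites each matching file in place, keeping only columns 2, 3 and 5.
gawk -i inplace 'BEGIN{FS=OFS=","}{print $2,$3,$5}' file.*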
This is my hadoop job:
hadoop streaming \
-D mapred.map.tasks=1\
-D mapred.reduce.tasks=1\
-mapper "awk '{if(\$0<3)print}'" \ # doesn't work
-reducer "cat" \
-input "/user/***/input/" \
-output "/user/***/out/"
This job always fails, with an error saying:
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `export TMPDIR='..../work/tmp'; /bin/awk { if ($0 < 3) print } '
But if I change the -mapper into this:
-mapper "awk '{print}'"
it works without any error. What's the problem with the if(..)?
UPDATE:
Thanks #paxdiablo for your detailed answer.
What I really want to do is filter out rows whose 1st column is greater than x before piping the input data to my custom binary. So the -mapper actually looks like this:
-mapper "awk -v x=$x{if($0<x)print} | ./bin"
Is there any other way to achieve that?
The problem's not with the if per se, it's to do with the fact that the quotes have been stripped from your awk command.
You'll realise this when you look at the error output:
sh: -c: line 0: `export TMPDIR='..../work/tmp'; /bin/awk { if ($0 < 3) print } '
and when you try to execute that quote-stripped command directly:
pax> echo hello | awk {if($0<3)print}
bash: syntax error near unexpected token `('
pax> echo hello | awk {print}
hello
The reason the {print} one works is because it doesn't contain the shell-special ( character.
One thing you might want to try is to escape the special characters to ensure the shell doesn't try to interpret them:
{if\(\$0\<3\)print}
It may take some effort to get the correctly escaped string but you can look at the error output to see what is generated. I've had to escape the () since they're shell sub-shell creation commands, the $ to prevent variable expansion, and the < to prevent input redirection.
Also keep in mind that there may be other ways to filter depending on your needs, ways that can avoid shell-special characters. If you specify what your needs are, we can possibly help further.
For example, you could create a shell script (e.g., pax.sh) to do the actual awk work for you:
#!/bin/bash
awk -v x="$1" '{ if ($1 < x) print }'
then use that shell script in the mapper without any special shell characters:
hadoop streaming \
-D mapred.map.tasks=1 -D mapred.reduce.tasks=1 \
-mapper "pax.sh 3" -reducer "cat" \
-input "/user/***/input/" -output "/user/***/out/"