How to remove double quotes from a CSV column value when copying into a table in PostgreSQL - sql

I used \copy parts from home/sherin_ag/parts.csv with (format csv, header false) on bash. The problem is that one of the rows in the CSV contains a double quote inside a column value.
example: "35387","20190912","X99","1/4" KEYWAY","KEYWAY","1","FORD"
The problem is "1/4" KEYWAY": it contains a double quote after 1/4,
so I got an error like:
ERROR: unterminated CSV quoted field

This CSV is invalid, since a record contains an unescaped " — the same character used as the quote character. As #404 said, the generated CSV file is the problem.
That being said, you can correct the file beforehand using sed from your console, e.g.:
cat vasher_dummy.txt | sed -r 's/\" /\\"" /g' | psql db -c "COPY t FROM STDIN CSV"
This command will search for every occurrence of a double quote followed by a space in your file and replace it with an escaped quote (a backslash plus a doubled quote):
SELECT * FROM t;
a | b | c | d | e | f | g
-------+----------+-----+--------------+--------+---+------
35387 | 20190912 | X99 | 1/4\" KEYWAY | KEYWAY | 1 | FORD
(1 row)
If this isn't the only problem in your file, you'll have to adjust the sed expression to make further corrections. Check the sed documentation; it's really powerful.
Update based on the comments: the OP needs a counter for each line and needs to filter out a few lines.
nl -ba -nln -s, < vasher_dummy.txt | sed -r 's/\" /\ /g' | grep 'SVCPTS' | psql db -c "COPY t FROM STDIN CSV"
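Alternatively, since standard CSV escapes an embedded quote by doubling it, you can turn the stray quote into "" and let COPY ... CSV handle it natively. A sketch against the sample line from the question (assuming, as above, that the stray quote is always followed by a space):

```shell
# Double any quote that is followed by a space; the [^,] guard leaves
# field-opening quotes (which are preceded by a comma) untouched.
printf '%s\n' '"35387","20190912","X99","1/4" KEYWAY","KEYWAY","1","FORD"' |
  sed 's/\([^,]\)" /\1"" /g'
# the field becomes "1/4"" KEYWAY", which COPY ... CSV reads as: 1/4" KEYWAY
```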


Is there a way to tell sed to output only captured groups?
For example, given the input:
This is a sample 123 text and some 987 numbers
And pattern:
/([\d]+)/
Could I get only 123 and 987 output in the way formatted by back references?
The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
This says:
don't default to printing each line (-n)
exclude zero or more non-digits
include one or more digits
exclude one or more non-digits
include one or more digits
exclude zero or more non-digits
print the substitution (p) (on one line)
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'
will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:
echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'
outputs "a bar a".
If you have GNU grep:
echo "$string" | grep -Po '\d+'
It may also work in BSD, including OS X:
echo "$string" | grep -Eo '\d+'
These commands will match any number of digit sequences. The output will be on multiple lines.
or variations such as:
echo "$string" | grep -Po '(?<=\D )(\d+)'
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
Sed supports up to nine remembered patterns, but you need to use escaped parentheses to capture portions of the regular expression.
See the sed documentation for examples and more detail.
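Those escaped parentheses and numbered back references can be sketched in one line (reordering a made-up date string):

```shell
# BRE capture groups use \( \); back references \1..\9 can be reused in any order:
echo "2023-09-12" | sed 's/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)/\3\/\2\/\1/'
# prints 12/09/2023
```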
you can use grep
grep -Eow "[0-9]+" file
run(s) of digits
This answer works with any count of digit groups. Example:
$ echo 'Num123that456are7899900contained0018166intext' \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Expanded answer.
Is there any way to tell sed to output only captured groups?
Yes: replace all the text with the capture group:
$ echo 'Number 123 inside text' \
| sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123
s/[^0-9]* # several non-digits
\([0-9]\{1,\}\) # followed by one or more digits
[^0-9]* # and followed by more non-digits.
/\1/ # gets replaced only by the digits.
Or with extended syntax (fewer backslashes, and it allows the use of +):
$ echo 'Number 123 in text' \
| sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123
To avoid printing the original text when there is no number, use:
$ echo 'Number xxx in text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
(-n) Do not print the input by default.
(/p) print only if a replacement was done.
And to match several numbers (and also print them):
$ echo 'N 123 in 456 text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456
That works for any count of digit runs:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Which is very similar to the grep command:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166
About \d
and pattern: /([\d]+)/
Sed does not recognize the \d (shortcut) syntax. The ASCII range [0-9] used above is not exactly equivalent; the alternative is to use a character class: [[:digit:]].
The selected answer uses such "character classes" to build a solution:
$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
That solution only works for (exactly) two runs of digits.
Of course, as the answer is being executed inside the shell, we can define a couple of variables to make the answer shorter:
$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"
But, as has been already explained, using a s/…/…/gp command is better:
$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987
That will cover both repeated runs of digits and writing a short(er) command.
Give up and use Perl
Since sed does not cut it, let's throw in the towel and use Perl; at least it is LSB, while the grep GNU extensions are not :-)
Print the entire matching part, no matching groups or lookbehind needed:
cat <<EOS | perl -lane 'print m/\d+/g'
a1 b2
a34 b56
EOS
Output:
12
3456
Single match per line, often structured data fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
a1 b2
a34 b56
EOS
Output:
1
34
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
a1 b2
a34 b56
EOS
Multiple fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
a1 c0 b2 c0
a34 c0 b56 c0
EOS
Output:
1 2
34 56
Multiple matches per line, often unstructured data:
cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
34 78
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
3478
I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.
If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:
> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers
These examples are with tcsh (yes, I know it's the wrong shell) under CYGWIN. (Edit: for bash, remove set and the spaces around =.)
Try
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
I got this under cygwin:
$ (echo "asdf"; \
echo "1234"; \
echo "asdf1234adsf1234asdf"; \
echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
1234
1234 1234
1 2 3 4 5 6 7 8 9
$
You need to include the whole line to print the group, which you're doing in the second command, but you don't need to group the first wildcard. This will work as well:
echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'
It's not what the OP asked for (capturing groups) but you can extract the numbers using:
S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'
Gives the following:
123
987
I want to give a simpler example on "output only captured groups with sed"
I have /home/me/myfile-99 and wish to output the serial number of the file: 99
My first try, which didn't work was:
echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99
To make this work, we need to capture the unwanted portion in a capture group as well:
echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99
*) Note that sed doesn't have \d
You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this
rg '(\d+)' -or '$1'
where ripgrep uses -o or --only-matching and -r or --replace to output only the first capture group via $1 (quoted to avoid interpretation as a shell variable); it is printed twice here because there are two matches.

How do I delete a line that contains a specific string at a specific location using sed or awk?

I want to find and delete all lines that have a specific string, with a specific length, at a specific location.
One line in my data set looks something like this:
STRING  1234567 1234567 7654321 6543217 5432176
Notes:
Entries have field widths of 8
Identification numbers can be repeated in the same line
Identification numbers can be repeated on a different line, but at a different location - these lines should not be deleted
In this example, I want to find lines containing "1234567" located at column 17 and spanning to column 24 (i.e. the third field) and delete them. How can I do this with sed or awk?
I have used the following, but it deletes lines that I want to keep:
sed -i '/1234567/d' ./file_name.dat
Cheers!
You may use
sed -i '/^.\{16\}1234567/d' ./file_name.dat
Details
^ - start of a line
.\{16\} - any 16 chars (columns 1-16, so 1234567 must start at column 17)
1234567 - the substring.
See the online sed demo:
s="STRING  1234567 1234567 7654321 6543217 5432176
STRING  1234567 5534567 7654321 6543217 5432176"
sed '/^.\{16\}1234567/d' <<< "$s"
# => STRING  1234567 5534567 7654321 6543217 5432176
With awk, print every line except those where the substring matches:
$ awk 'substr($0,17,7)=="1234567"{next}1' file > output_file
or perhaps inverse logic is easier
$ awk 'substr($0,17,7)!="1234567"' file > output_file
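A quick check of the inverse-logic version against the sample lines (fields padded to width 8, as described in the question):

```shell
# Lines whose characters 17-23 equal "1234567" are dropped; everything else passes:
printf '%s\n' \
  'STRING  1234567 1234567 7654321 6543217 5432176' \
  'STRING  1234567 5534567 7654321 6543217 5432176' |
  awk 'substr($0,17,7)!="1234567"'
# prints only the second line
```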

Awk: group each 12 lines into an insert query

I have the following script:
curl -s 'https://someonepage=5m' | jq '.[]|.[0],.[1],.[2],.[3],.[4],.[5],.[6],.[7],.[8],.[9],.[10],.[11],.[12]' | perl -p -e 's/\"//g' |awk '/^[0-9]/{print; if (++onr%12 == 0) print ""; }'
This is part of result:
1517773500000
0.10250100
0.10275700
0.10243500
0.10256600
257.26700000
1517773799999
26.38912220
1229
104.32200000
10.70579910
0
1517773800000
0.10256600
0.10268000
0.10231600
0.10243400
310.64600000
1517774099999
31.83806883
1452
129.70500000
13.29758266
0
1517774100000
0.10243400
0.10257500
0.10211800
0.10230000
359.06300000
1517774399999
36.73708621
1296
154.78500000
15.84041910
0
I want to insert this data into a MySQL database, with each group formatted on one line like this:
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,1517774399999,36.73708621,1296,154.78500000,15.84041910,0)
I need to merge every 12 lines into one; can anyone help me get this result?
Here's an all-jq solution:
.[] | .[0:12] | @tsv | gsub("\t";",") | "(\(.))"
In the sample, all the subarrays have length 12, so you might be able to drop the .[0:12] part of the pipeline. If using jq 1.5 or later, you could use join(",") instead of the @tsv|gsub portion of the pipeline. You might, for example, want to consider:
.[] | join(",") | "(\(.))"   # jq 1.5 or later
Invocation: use the -r command-line option
Sample output:
(1517627400000,0.10452300,0.10499000,0.10418200,0.10449400,819.50400000,1517627699999,85.57150693,2340,452.63400000,47.27213035,0)
(1517627700000,0.10435700,0.10449200,0.10366000,0.10370000,717.37000000,1517627999999,74.60582079,1996,321.25500000,33.42273846,0)
(1517628000000,0.10376600,0.10390000,0.10366000,0.10370400,519.59400000,1517628299999,53.88836170,1258,239.89300000,24.88613854,0)
$ awk 'BEGIN {RS=""; OFS=","} {$1=$1; $0="("$0")"}1' file
(1517773500000,0.10250100,0.10275700,0.10243500,0.10256600,257.26700000,1517773799999,26.38912220,1229,104.32200000,10.70579910,0)
(1517773800000,0.10256600,0.10268000,0.10231600,0.10243400,310.64600000,1517774099999,31.83806883,1452,129.70500000,13.29758266,0)
(1517774100000,0.10243400,0.10257500,0.10211800,0.10230000,359.06300000,1517774399999,36.73708621,1296,154.78500000,15.84041910,0)
RS="":
Treat groups of lines separated by one or more blank lines as a single record
OFS=","
Set the output separator to be a ","
$1=$1
Reconstitute the line, replacing the input separators with the output separator
$0="("$0")"
Surround the record with parens
1
Print the record

base64 decoding from file column

I have a file where every line has 6 columns separated by ",". The last column is gzipped and encoded in base64. The output file should contain column 3 and column 6 (decoded/unzipped).
I tried to do this by
awk -F',' '{"echo "$6" | base64 -di | gunzip" | getline x;print $3,x }' OFS=',' inputfile.csv >outptfile_decoded.csv
The result for the first lines is OK, but after some lines the decoded output is the same as the previous line's. It seems that the decoding & unzipping hangs, but I didn't get an error message.
A single decode/unzip works fine, i.e.
echo "H4sIAAAAAAAAA7NJTkuxs0lMLrEztNEHUTZAgcy8tHw7m7zSXLuS1BwrbRNjMzMTc3MDAzMDG32QqE1uSWVBqh2QB2HYlCYX2xnb6IMoG324ASCWHQAaafi1YQAAAA==" | base64 -di | gunzip
What can be the reason for this effect? (There are no error messages.)
Is there another way that works reliably?
Without a test case it's difficult to recommend anything. Here is a working script with input data.
create a test data file
$ while read f; do echo $f,$(echo $f | gzip -f | base64); done < <(seq 5) | tee file.g
1,H4sIAJhBuVkAAzPkAgBT/FFnAgAAAA==
2,H4sIAJhBuVkAAzPiAgCQr3xMAgAAAA==
3,H4sIAJhBuVkAAzPmAgDRnmdVAgAAAA==
4,H4sIAJhBuVkAAzPhAgAWCCYaAgAAAA==
5,H4sIAJhBuVkAAzPlAgBXOT0DAgAAAA==
and decode
$ awk 'BEGIN {FS=OFS=","}
{cmd="echo "$2" | base64 -di | gunzip"; cmd | getline v; close(cmd); print $1,v}' file.g
1,1
2,2
3,3
4,4
5,5
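The likely cause of the repeated output in the question: without close(), awk keeps the pipe from a previous row open, so getline can return nothing and v keeps its old value. A sketch with two generated rows (assumes GNU coreutils base64, for the -w0 and -di flags):

```shell
# Build two rows whose second column is gzipped+base64 text, then decode per row.
printf 'a,%s\nb,%s\n' \
  "$(echo hello | gzip | base64 -w0)" \
  "$(echo world | gzip | base64 -w0)" |
  awk 'BEGIN {FS=OFS=","}
       {cmd = "echo " $2 " | base64 -di | gunzip"
        cmd | getline v
        close(cmd)   # crucial: each row must get a fresh pipe
        print $1, v}'
# prints a,hello and b,world
```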

awk - match strings containing characters in both upper & lower case and blank spaces

I have an input file in .csv format which contains entries of tax invoices separated by comma.
for example:
Header--TIN | NAME | INV NO | DATE | NET | TAX | OTHERS | TOTAL
Record1-29001234768 | A S Spares | AB012 | 23/07/2016 | 5600 | 200 | 10 | 5810
Record2-29450956221 | HONDA Spare Parts | HOSS0987 |29/09/2016 | 70000 | 2200 | 0 | 72200
My aim is to process these records using 'AWK'.
My requirements-
1) I need to check the 'NAME' field for special characters and digits (i.e it should only be an alphabetical string) and the length of string (including Spaces) in the 'NAME' field should not exceed 30.
If the above conditions are not satisfied, I should report an error to the user by printing only the offending records.
2) I need to check the 'INV NO' field for special characters including blank spaces (INV NO is an alphanumeric field). I also need to check the length of the contents of this field and it should not exceed 15.
Can anyone please give me the regular expressions to match these requirements, and also show how to implement them?
Something like:
awk -f check.awk input.csv
where check.awk is:
BEGIN {
FS="," # the input field separator
}
# skip the header (NR>1), check regex for field 2, check length of field 2
NR>1 && ($2 ~ /[^a-zA-Z ]/ || length($2)>30) {print "error w NAME "$1}
# skip the header (NR>1), check regex for field 3, check length of field 3
NR>1 && ($3 ~ /[^0-9a-zA-Z]/ || length($3)>15) {print "error with INV NO "$1}
If you use gawk you can set the IGNORECASE variable to make the regexes case-insensitive.
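For a quick inline sanity check of the two rules, here they are run against a couple of made-up records (note the parentheses, which make NR>1 apply to both the regex and the length test):

```shell
# Row 2 is clean; row 3 violates both NAME (contains @) and INV NO (contains a space):
printf '%s\n' \
  'TIN,NAME,INV NO,DATE' \
  '29001234768,A S Spares,AB012,23/07/2016' \
  '29450956221,Bad N@me,HOSS 0987,29/09/2016' |
  awk -F',' 'NR>1 && ($2 ~ /[^a-zA-Z ]/ || length($2)>30) {print "error w NAME " $1}
             NR>1 && ($3 ~ /[^0-9a-zA-Z]/ || length($3)>15) {print "error with INV NO " $1}'
# prints: error w NAME 29450956221
#         error with INV NO 29450956221
```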
If your system has a modern grep (i.e. one that supports the -P option), then I think it would be easier to solve this using grep, e.g. like this:
grep -viP '^[^|]* \| [a-z0-9 ]{0,30} \| [a-z0-9]{0,15} \|' file.txt
The above command should print all lines that do not satisfy your requirements.
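For instance, against two hypothetical pipe-separated records (this assumes GNU grep, since -P is a GNU extension):

```shell
# -v inverts the match, so only the offending line (N@me fails [a-z0-9 ]) is printed:
printf '%s\n' \
  '29001234768 | A S Spares | AB012 | 23/07/2016' \
  '29450956221 | Bad N@me | HOSS0987 | 29/09/2016' |
  grep -viP '^[^|]* \| [a-z0-9 ]{0,30} \| [a-z0-9]{0,15} \|'
# prints the second line only
```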