I have a string and want to extract every occurrence of a match up to the first comma that follows it. Example:
[{"value":1,"btata":"15","Id":"17","","url":"","time":"222"{"value":1,"secId":"16","Id":"19","time":"20218 22status":""}
I want to get Id:17 Id:19
I have been able to get the Id keys using sed -e 's/Id/_/g' -e 's/[^_]//g' -e 's/_/Id /g', but I couldn't match up to the comma.
You can do it with sed, but it takes two steps. First remove all '"' characters and split the input on ',' by replacing each comma with '\n' (GNU sed understands \n in the replacement; other seds may not). Then a second sed prints only the lines beginning with Id, e.g.
sed 's/"//g;s/,/\n/g' | sed -n /^Id/p
Example Use/Output
$ echo '[{value:1,btata:15,Id:17,,url:,time:222{value:1,secId:16,Id:19,time:20218 22status:}' |
sed 's/"//g;s/,/\n/g' | sed -n /^Id/p
Id:17
Id:19
(note: this all comes with the caveat that you should not process json with shell commands. Using a json validating tool like jq is recommended -- though this doesn't appear to be valid json either)
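An alternative sketch, assuming your grep supports -o (GNU and BSD grep do, but POSIX does not require it). Because the pattern includes the opening quote of the key, secId is not matched. The function name extract_ids is only illustrative:

```shell
# Print every "Id":"..." pair as Id:value, one per line.
# Assumes grep supports -o (a GNU/BSD extension, not POSIX).
extract_ids() {
    grep -o '"Id":"[^"]*"' | tr -d '"'
}

# Small demo on a shortened version of the sample string.
printf '%s\n' '[{"value":1,"secId":"16","Id":"19","time":"222"}' | extract_ids
# prints: Id:19
```

The same caveat applies: this is pattern matching on text, not JSON parsing.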
I have a variable and that variable only needs a '\' in front of it.
I would say that the sed command is the ideal tool for it?
I tried using single quotes, double quotes, multiple variables, combination of variables, ...
I don't get an error returned, but the end result is not what I need it to be.
FOLDER=$(echo `cat file.XML | grep "Value" | cut -d \" -f2`)
echo $FOLDER
sed -i "s#"$FOLDER"#"\\$FOLDER"#g" ./file.XML
echo $FOLDER
After execution, I get
$ ./script.sh
b4c17422-1365-4fbe-bccd-04e0d7dbb295
b4c17422-1365-4fbe-bccd-04e0d7dbb295
Eventually I need to have a result like
$ ./script.sh
b4c17422-1365-4fbe-bccd-04e0d7dbb295
\b4c17422-1365-4fbe-bccd-04e0d7dbb295
Fixed thanks to the input of Cyrus and Ed Morton.
FOLDER=$(echo `cat file.XML | grep "Value" | cut -d \" -f2`)
NEW_FOLDER="\\$FOLDER"
sed -i "s#$FOLDER#\\$NEW_FOLDER#g" ./file.XML
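To see why this works: sed has to receive two backslashes in the replacement in order to emit one, and the shell's double quotes already halve the backslashes before sed sees them. Here is a minimal sketch that filters stdin instead of editing a file in place. The name prefix_backslash is made up for the example, and it assumes the value contains no '#' and no regex metacharacters (true for a UUID like the one above):

```shell
# Prefix every occurrence of the given value with one backslash.
# Assumes the value contains no '#' and no regex metacharacters.
prefix_backslash() {
    # Four backslashes inside double quotes reach sed as two,
    # which sed's replacement turns into one literal backslash.
    sed "s#$1#\\\\$1#g"
}

printf '%s\n' '<Value="b4c17422-1365">' | prefix_backslash 'b4c17422-1365'
# prints: <Value="\b4c17422-1365">
```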
I want to extract all rows from here while ignoring the column headers as well as all page headers, i.e. Supported Devices.
pdftotext -layout DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
| sed '$d' \
| sed -r 's/ +/,/g; s/ //g' \
> output.csv
The resulting file should be in CSV spreadsheet format (comma separated value fields).
In other words, I want to improve the above command so that the output doesn't break at all. Any ideas?
I'll offer you another solution as well.
While in this case the pdftotext method works with reasonable effort, there may be cases where not each page has the same column widths (as your rather benign PDF shows).
Here the not-so-well-known, but pretty cool Free and OpenSource Software Tabula-Extractor is the best choice.
I myself am using the direct GitHub checkout:
$ cd $HOME ; mkdir svn-stuff ; cd svn-stuff
$ git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
I wrote myself a pretty simple wrapper script like this:
$ cat ~/bin/tabulaextr
#!/bin/bash
cd ${HOME}/svn-stuff/git.tabula-extractor/bin
./tabula "$@"
Since ~/bin/ is in my $PATH, I just run
$ tabulaextr --pages all \
$(pwd)/DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
| tee my.csv
to extract all the tables from all pages and convert them to a single CSV file.
The first ten (out of a total of 8727) lines of the CSV look like this:
$ head DAC06E7D1302B790429AF6E84696FCFAB20B.csv
Retail Branding,Marketing Name,Device,Model
"","",AD681H,Smartfren Andromax AD681H
"","",FJL21,FJL21
"","",Luno,Luno
"","",T31,Panasonic T31
"","",hws7721g,MediaPad 7 Youth 2
3Q,OC1020A,OC1020A,OC1020A
7Eleven,IN265,IN265,IN265
A.O.I. ELECTRONICS FACTORY,A.O.I.,TR10CS1_11,TR10CS1
AG Mobile,Status,Status,Status
which in the original PDF look like this:
It even got these lines on the last page, 293, right:
nabi,"nabi Big Tab HD\xe2\x84\xa2 20""",DMTAB-NV20A,DMTAB-NV20A
nabi,"nabi Big Tab HD\xe2\x84\xa2 24""",DMTAB-NV24A,DMTAB-NV24A
which look on the PDF page like this:
TabulaPDF and Tabula-Extractor are really, really cool for jobs like this!
Update
Here is an ASCiinema screencast (which you also can download and re-play locally in your Linux/MacOSX/Unix terminal with the help of the asciinema command line tool), starring tabula-extractor:
As Martin R commented, tabula-java is the successor of tabula-extractor and is actively maintained. Version 1.0.0 was released on July 21st, 2017.
Download the jar file and run it with a recent Java:
java -jar ./tabula-1.0.0-jar-with-dependencies.jar \
--pages=all \
./DAC06E7D1302B790429AF6E84696FCFAB20B.pdf \
> support_devices.csv
What you want is rather easy, but you're having a different problem also (I'm not sure you are aware of it...).
First, you should add -nopgbrk ("No pagebreaks, please!") to your command, so that the pesky ^L characters which otherwise appear in the output need not be filtered out later.
Adding a grep -vE '(Supported Devices|^$)' will then filter out the lines you do not want, including empty lines. The command below also matches Marketing Name in order to drop the column-header lines:
pdftotext -layout -nopgbrk \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - \
| grep -vE '(Supported Devices|^$|Marketing Name)' \
| gsed '$d' \
| gsed -r 's# +#,#g' \
| gsed 's# ##g' \
> output2.csv
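Note that ^$ only matches lines that are completely empty. If lines consisting solely of spaces should go as well, a slightly wider pattern can be used. This is just a sketch; the function name filter_rows is not part of any tool:

```shell
# Drop page headers, column headers, and blank or whitespace-only lines.
filter_rows() {
    grep -vE 'Supported Devices|Marketing Name|^[[:space:]]*$'
}

# Usage (hypothetical): pdftotext -layout -nopgbrk file.pdf - | filter_rows
```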
However, your other problem is this:
Some of the table fields are empty.
Empty fields appear with the -layout option as a series of space characters, sometimes even two in the same row.
However, the text columns are not spaced identically from page to page.
Therefore you will not know from line to line how many spaces you need to regard as an "empty CSV field" (where you'd need an extra , separator).
As a consequence, your current code will show only one, two or three (instead of four) fields for some lines, and these fields end up in the wrong columns!
There is a workaround for this:
Add the -x ... -y ... -W ... -H ... parameters to pdftotext to crop the PDF column-wise.
Then append the columns with a combination of utilities like paste and column.
The following command extracts the first columns:
pdftotext -layout -x 38 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 1st-columns.txt
These are for second, third and fourth columns:
pdftotext -layout -x 214 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 2nd-columns.txt
pdftotext -layout -x 390 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 3rd-columns.txt
pdftotext -layout -x 567 -y 77 -W 176 -H 500 \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - > 4th-columns.txt
BTW, I cheated a bit: to get a clue about what values to use for -x, -y, -W and -H, I first ran this command to find the exact coordinates of the column header words:
pdftotext -f 1 -l 1 -layout -bbox \
DAC06E7D1302B790429AF6E84696FCFAB20B.pdf - | head -n 10
It's always good if you know how to read and make use of pdftotext -h. :-)
Anyway, how to append the four text files as columns side by side, with the proper CSV separator in between, you should find out yourself. Or ask a new question :-)
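One possible starting point is paste with a comma delimiter. The sketch below assumes that all column files have the same number of lines (so that row N of each file belongs to the same table row) and that no field contains a comma; neither is guaranteed for real pdftotext output. join_columns is a hypothetical helper name:

```shell
# Glue column files side by side with commas, trimming the padding
# that -layout leaves around each field.
# Assumes equal line counts and no commas inside the fields.
join_columns() {
    paste -d ',' "$@" |
        sed 's/[[:space:]]*,[[:space:]]*/,/g; s/^[[:space:]]*//; s/[[:space:]]*$//'
}

# Usage (hypothetical):
# join_columns 1st-columns.txt 2nd-columns.txt 3rd-columns.txt 4th-columns.txt > output.csv
```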
This can be done easily with an IntelliGet (http://akribiatech.com/intelliget) script as below
userVariables = brand, name, device, model;
{ start = Not(Or(Or(IsSubstring("Supported Devices",Line(0)),
IsSubstring("Retail Branding",Line(0))),
IsEqual(Length(Trim(Line(0))),0)));
brand = Trim(Substring(Line(0),10,44));
name = Trim(Substring(Line(0),45,79));
device = Trim(Substring(Line(0),80,114));
model = Trim(Substring(Line(0),115,200));
output = Concat(brand, ",", name, ",", device, ",", model);
}
If you want to extract tabular data from PDFs whose creation you control (e.g. timesheets or contracts your employees have to sign), the following solution will be cleaner:
Create a PDF form with field IDs.
Let people fill and save the PDF forms.
Use Apache PDFBox, an open source tool that can extract form data from a PDF. It includes a command-line example tool, PrintFields, that you would call as follows to print the desired field information:
org.apache.pdfbox.examples.interactive.form.PrintFields file.pdf
For other options, see this question.
As an alternative to the above workflow, maybe you could also use a digital signature web service that allows PDF form filling and export of the data to tables. Such as SignRequest, which allows to create templates and later export the data of signed documents. (Not affiliated, just found this myself.)
I need to copy data from one database into my own database. Because I want to run it as a daily cronjob, I prefer to have it in bash. I also need to store the values in variables so I can run various checks/validations on them. This is what I have so far:
echo "SELECT * FROM table WHERE value='ABC' AND value2 IS NULL ORDER BY time" | mysql -u user -h ip db -p | sed 's/\t/,/g' | awk -F, '{print $3,$4,$5,$7 }' > Output
cat Output | while read line
do
Value1=$(awk '{print "",$1}')
Value2=$(awk '{print "",$2}')
Value3=$(awk '{print "",$3}')
Value4=$(awk '{print "",$4}')
echo "INSERT INTO db (value1,value2,value3,value4,value5) VALUES($Value1,$Value2,'$Value3',$Value4,'n')" | mysql -u rb db -p
done
I get the data I need from the database and store it in a new file, separated by spaces. Then I read the file line by line and store the values in variables, and finally I run an insert query with the right variables.
I think something goes wrong while storing the values, but I can't really figure out what.
None of the awk calls gets its input from $line. Each one reads the loop's standard input instead, so the first consumes the rest of Output and the others get nothing. You can fix this as:
Value1=$(echo "$line" | awk '{print $1}')
Value2=$(echo "$line" | awk '{print $2}')
Value3=$(echo "$line" | awk '{print $3}')
Value4=$(echo "$line" | awk '{print $4}')
There's no reason to call awk four times in a loop; that could be very slow. If you don't need the temporary file "Output" for another reason, you don't need it at all: just pipe the output into the while loop. You may not need sed to change tabs into commas either (you could use tr for that, by the way), since awk splits fields on tabs and spaces by default. That would only be a problem if a field itself contained spaces, which your data does not seem to.
echo "SELECT * FROM table WHERE value='ABC' AND value2 IS NULL ORDER BY time" |
mysql -u user -h ip db -p |
sed 's/\t/,/g' | # can this be eliminated?
awk -F, '{print $3,$4,$5,$7 }' | # if you eliminate the previous line then omit the -F,
while read line
do
tmparray=($line)
Value1=${tmparray[0]}
Value2=${tmparray[1]}
Value3=${tmparray[2]}
Value4=${tmparray[3]}
echo "INSERT INTO predb (value1,value2,value3,value4,value5) VALUES($Value1,$Value2,'$Value3',$Value4,'n')" | mysql -u rb db -p
done
That uses a temporary array to split the values out of the line. This is another way to do that:
set -- $line
Value1=$1
Value2=$2
Value3=$3
Value4=$4
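A third way is to let read do the splitting by giving it one variable name per field. The sketch below only prints the statements rather than sending them to mysql, so it can be tried safely. The function name build_inserts is made up, and the values are assumed to need no SQL escaping:

```shell
# Read whitespace-separated rows from stdin and print one INSERT per row.
# Assumes four fields per row and values that need no SQL escaping.
# Any extra fields end up in value4, because read assigns the remainder
# of the line to the last variable.
build_inserts() {
    while read -r value1 value2 value3 value4; do
        [ -n "$value1" ] || continue   # skip blank lines
        printf "INSERT INTO db (value1,value2,value3,value4,value5) VALUES(%s,%s,'%s',%s,'n');\n" \
            "$value1" "$value2" "$value3" "$value4"
    done
}

printf '1 2 abc 4\n' | build_inserts
# prints: INSERT INTO db (value1,value2,value3,value4,value5) VALUES(1,2,'abc',4,'n');
```

The printed statements could then be piped into a single mysql call instead of starting one client per row.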
I am trying to write a file with format - "id file_absolute_path" which basically lists down all the files recursively in a folder and give an identifier to each file listed like 1,2,3,4.
I can get the absolute path of the files recursively using the following command:
ls -d -1 $PWD/**/*/*
However, I am unable to give an identifier from the output of the ls command. I am sure this can be done using awk, but can't seem to solve it.
Pipe the output through cat -n.
Assuming x is your command:
x | awk '{print NR, $0}'
will number the output lines
Two possible commands:
ls -d -1 $PWD/**/*/* | cat -n
ls -d -1 $PWD/**/*/* | nl
nl numbers the lines of its input.
I hope this clarifies too.
There is a tool named nl for that.
ls -la | nl
If you do ls -i, you'll get the inode number which is great as an id.
The only potential issue with using inodes is if your folder spans multiple file systems, as an inode is only guaranteed to be unique within a file system.
ls -d -1 $PWD/**// | awk ' {x = x + 1} {print x " " $0} '
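If the goal is specifically "every regular file under a folder, recursively", find may be a better source of paths than ls with globs, since it does not depend on the shell's globstar setting. A minimal sketch, assuming no file name contains a newline; number_files is an illustrative name:

```shell
# List all regular files below a directory, one per line, prefixed with
# a running number. Paths are absolute if the argument is absolute.
# Assumes file names contain no newline characters.
number_files() {
    find "$1" -type f | LC_ALL=C sort | awk '{print NR, $0}'
}

# Usage: number_files "$PWD" > filelist.txt
```

The sort step is only there to make the numbering reproducible between runs.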