I have a few columns with html strings in a postgres 9.5 database. I want to count the words without html tags and their values to get the length of the plain text for each row.
Is there a stored procedure or another way to do this?
Edit:
existing sample text in one field:
<p>Lorem Ipsum: </p><p><br/></p><p align="center"><img src="d9b4c473-08ac-4cd8-883d-86ac30ee9044.png" width="287" height="192"/></p><p><br/></p><p>Lorem ipsum dolor sit amet, <span style="font-weight:bold;color:#86b920">consetetur</span> sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut l. </p><p><br/></p><p><br/></p><p><br/></p>
expected output for this text:
Lorem Ipsum: Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut l.
Ideally there would be an additional column with the word count of this text.
You can do it with regexp_split_to_table: with the right regular expression you can split all the words out of the HTML and return them as a table.
Let’s say my 24-word crypto backup phrase is somewhere on my PC and I don’t know where. The word list has a total of 2048 words or so.
How can I use grep to print all files containing at least 2 words from a given string? I found how to search with
grep 'extra|value', but that is only for 2 words and they both must be found. How can I use grep, or whatever command, to find any file containing at least 2 words from a given string of 2048 words? Thanks!
grep 'extra|value'
You cannot use a single grep run to find two different words potentially on different lines. But you can first list all the files containing one word and then search only those for the second one:
find / -type f -exec grep -l 'extra' {} + | xargs grep -l 'value'
2 words and they both must be found
I would harness GNU grep for this task following way
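grep --perl-regexp --recursive --null-data --files-with-matches '(?s)(extra.*value)|(value.*extra)' .
Explanation: I start the search from the current directory (.) and traverse all subdirectories (--recursive), looking for files which have (extra followed by zero or more of any character followed by value) OR (value followed by zero or more of any character followed by extra). I use --perl-regexp with the (?s) modifier, so that . also matches a newline, combined with --null-data, so that a file without NUL bytes is read as a single record; together they allow the words to be on different lines. Inside a bracket expression such as [.\n] the dot would only match a literal dot, so it cannot stand for "any character" there. --files-with-matches prints only the names of the matching files. Consult the grep man page if you need further explanation of the options used.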
Use find + awk
find / -type f -exec awk 'FNR==1{a=b=0} /extra/{a=1} /value/{b=1} a&&b{print FILENAME; nextfile}' {} +
That requires an awk that has nextfile which most do these days. If yours doesn't then pipe the output to sort -u or uniq to ensure unique file names.
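The question asks for files containing at least two words out of a longer list, and the same awk idea can be stretched to cover that. The following is only a rough sketch: the function name files_with_two_words and the word-list file (one word per line) are assumptions made for illustration, and words are matched as plain substrings rather than whole words.

```shell
# Hypothetical helper: print every file under directory $2 that contains
# at least two distinct words from the word list $1 (one word per line).
files_with_two_words() {
    wordlist=$1
    dir=$2
    find "$dir" -type f -exec awk '
        # First file on the command line: load the word list.
        NR == FNR { if ($0 != "") words[$0]; next }
        # First line of each later file: reset the per-file state.
        FNR == 1 { split("", seen); count = 0; found = 0 }
        # File already reported: skip its remaining lines.
        found { next }
        {
            for (w in words)
                if (!(w in seen) && index($0, w)) {
                    seen[w]
                    if (++count >= 2) { print FILENAME; found = 1; break }
                }
        }
    ' "$wordlist" {} +
}
```

It could be called as files_with_two_words words.txt /some/dir. Keep the word list outside the searched directory, otherwise the list itself is reported as a match.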
From man grep (GNU grep and BSD grep)
-E, --extended-regexp
Interpret PATTERNS as extended regular expressions (EREs,
see below).
...
grep understands three different versions of regular expression
syntax: “basic” (BRE), “extended” (ERE) ...
That includes the use of logical "or" | in the search pattern.
-n: print line numbers (somewhat guarantees : as record sep)
-o: only print matches (more than one hit on same line)
-H: print matching files names
The awk script prints the files that contain more than one distinct match.
% str="labore|dolor"
% grep -EnoH "${str}" {file,file2} |
awk -F ':' 'NF>1{x = $1} {mat[x,$NF]++}
END{for(i in mat){split(i, arr, SUBSEP); a[arr[1]]++};
for(i in a){if(a[i] > 1){print i}}}'
file
Include -w to match whole words only.
Data
% cat file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
labore labore labore culpa qui officia deserunt mollit anim id est laborum.
% cat file2
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor
Commands tried:
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=pdflatex 1.txt -o 1.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=lualatex 1.txt -o 2.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=xelatex 1.txt -o 3.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=latexmk 1.txt -o 4.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=tectonic 1.txt -o 5.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=wkhtmltopdf 1.txt -o 6.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=weasyprint 1.txt -o 7.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=prince 1.txt -o 8.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=context 1.txt -o 9.pdf
pandoc -V 'fontfamily:Courier' --variable mainfont="Courier" --pdf-engine=pdfroff 1.txt -o 10.pdf
Contents of 1.txt:
--------------------------------------------------------------------------------
Left                                                                       Right
--------------------------------------------------------------------------------
Lorem ipsum whatever. Lorem ipsum whatever. Lorem ipsum whatever. Lorem ipsum  1
whatever. Lorem ipsum whatever. Lorem ipsum whatever. Lorem ipsum whatever.    2
Lorem ipsum whatever. Lorem ipsum whatever. Lorem ipsum whatever. Lorem ipsum  3
whatever. Lorem ipsum whatever. Lorem ipsum whatever. Lorem ipsum whatever.    4
Lorem ipsum whatever. Lorem ipsum whatever. Lorem ipsum whatever.              5
--------------------------------------------------------------------------------
Results:
Out of all those allegedly supported "engines", only the first and third produce any PDF at all (the others just dump a bunch of nonsensical errors). And those two that do produce PDFs, produce horribly butchered ones:
"pdflatex" (the first command) entirely ignores the specified font, so it's completely useless.
"xelatex" (the third command) seems to be mostly using the right font, but seemingly deletes all the spaces between "Left" and "Right", morphs the "-"s into straight lines (that's not how that font looks...) and messes up the lines completely so that the numbers on the last columns are not aligned to the right, and has crammed the entire contents into the middle of the page instead of, as expected, near the top-left corner:
screenshot of the xelatex-produced PDF
I have spent enormous amounts of time hunting for options and trying a million variations of the above commands, but it seems like this tool is fundamentally broken. I have no idea how others (apparently) use these tools, but they just don't work. It's impossible to convert a text file to PDF...
Pandoc is not broken; it is doing just what its documentation says it will do. Pandoc treats your input file as Markdown with pandoc extensions (since you didn't specify a format). What you have here is a one-column simple table (since there is no break in the line of ----s to indicate a column break).
If what you want is a rendering of this content as verbatim text in a PDF, you could use e.g. enscript 1.txt --output=- | ps2pdf - > 1.pdf. If you want to do it using pandoc, then the easiest way is to put the content inside backtick fences so that it is treated as a Markdown verbatim block. One way to do this would be to modify your file, but you could also do it by creating a file ticks.txt containing just
```
and then run
pandoc ticks.txt 1.txt ticks.txt -o 1.pdf
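If you would rather not keep a separate ticks.txt around, the same wrapping can be done on the fly. This is a small sketch, not part of pandoc itself: the function name fence_verbatim is made up for illustration, and it assumes the input file ends with a newline so that the closing fence starts on its own line.

```shell
# Hypothetical helper: print the file $1 wrapped in a backtick fence.
# \140 is the octal escape for a backtick, used here so the fence
# characters do not have to be typed literally.
fence_verbatim() {
    printf '\140\140\140\n'
    cat "$1"
    printf '\140\140\140\n'
}
```

Since pandoc reads standard input when no input file is given, this can be used as fence_verbatim 1.txt | pandoc -o 1.pdf.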
I have some pages; they all contain <article> at a certain point and </article> at a later point.
How can I use sed to delete every line before and every line after these tags?
I tried:
sed '/<article>/,/</article>/ !d'
but it didn't work.
Your command is just missing a \ to escape the / character:
sed '/<article>/,/<\/article>/ !d'
Another way to accomplish the same:
sed '\#<article>#,\#</article># !d'
From man sed:
Addresses:
\cregexpc
Match lines matching the regular expression regexp.
The c may be any character.
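As a quick check, both forms can be run against a small sample page. The sample content and the variable names below are made up for illustration.

```shell
# Hypothetical sample page, written to a temporary file.
page=$(mktemp)
printf '%s\n' '<html>' '<article>' 'kept line' '</article>' '</html>' > "$page"

# Escaped-slash form and custom-delimiter form: both delete every line
# outside the range from <article> through </article>.
with_escape=$(sed '/<article>/,/<\/article>/!d' "$page")
with_hash=$(sed '\#<article>#,\#</article>#!d' "$page")

printf '%s\n' "$with_escape"
# <article>
# kept line
# </article>

rm -f "$page"
```

Both variables end up with the same three lines.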
So I have a gigantic file (file1) in which I need to delete or comment out specific lines. The file could look something like this:
Lorem ipsum **abc** dolor sit amet,
consectetur adipiscing elit.
Cras finibus **123** laoreet dignissim.
Curabitur dignissim auctor tortor a cursus.
Nullam sapien ante, tempor eu rutrum
...
For this I have file2, which contains the strings I need to use to locate lines in file1.
file2 could look like this:
abc
123
xyz
098
...
Now, when a string from file2 is found, the line in file1 where it is found, plus the line directly beneath it, should be commented out or deleted.
So if 123 is found in the above example, these two lines (marked with -->) should be deleted:
Lorem ipsum abc dolor sit amet,
consectetur adipiscing elit.
--> Cras finibus 123 laoreet dignissim.
--> Curabitur dignissim auctor tortor a cursus.
Nullam sapien ante, tempor eu rutrum
...
I hope this makes sense.
I was fiddling around with sed and awk, but never got it to work.
Something like this would work:
awk 'NR==FNR{a[$0]; next}p{p=0;next}{for(i in a)if(p = $0 ~ i)next}1' file2 file1
Populate the array a with the lines in file2. The first block only applies to file2 because the total record number NR is equal to the record number of the current file FNR. next skips the rest of the blocks.
For each line of file1, loop through the keys in array a. If the current line matches the key, skip the line in the output. Also assign p the true value. For lines where p is true, set it back to false but skip the line in the output. The 1 at the end is always true, so any line that has made it this far is printed, as the default action is to print the line.
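To see it in action, here is the command run against a small sample modelled on the question. The temporary directory and the variable names are illustrative assumptions, and file2 only contains 123 and xyz here.

```shell
# Build small sample files in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' \
    'Lorem ipsum abc dolor sit amet,' \
    'consectetur adipiscing elit.' \
    'Cras finibus 123 laoreet dignissim.' \
    'Curabitur dignissim auctor tortor a cursus.' \
    'Nullam sapien ante, tempor eu rutrum' > "$dir/file1"
printf '%s\n' '123' 'xyz' > "$dir/file2"

# Drop every line of file1 that matches a pattern from file2,
# together with the line directly after it.
result=$(awk 'NR==FNR{a[$0]; next}p{p=0;next}{for(i in a)if(p = $0 ~ i)next}1' \
    "$dir/file2" "$dir/file1")

printf '%s\n' "$result"
# Lorem ipsum abc dolor sit amet,
# consectetur adipiscing elit.
# Nullam sapien ante, tempor eu rutrum

rm -rf "$dir"
```

Note that each line of file2 is used as a regular expression, so a line such as ... would match almost every line of file1.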
This might work for you (GNU sed):
sed 's|.*|/&/{N;d}|' file2 | sed -f - file1 >file3
Create a sed script from file2 and run it against file1 saving the results in file3.
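It can help to look at the generated script before applying it. The sketch below only shows that intermediate step; the sample patterns and variable names are assumptions for illustration.

```shell
# Hypothetical pattern file with two search strings.
patterns=$(mktemp)
printf '%s\n' 'abc' '123' > "$patterns"

# Turn each line into a sed command: on a match, append the next line
# to the pattern space (N) and delete both (d).
generated=$(sed 's|.*|/&/{N;d}|' "$patterns")

printf '%s\n' "$generated"
# /abc/{N;d}
# /123/{N;d}

rm -f "$patterns"
```

Because each string is pasted between slashes, strings containing / or regular expression metacharacters would need escaping first.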
I need to edit a good number of files, by inserting a line or multiple lines either right below a unique pattern or above it. Please advise on how to do that using sed, awk, perl (or anything else) in a shell. Thanks! Example:
some text
lorem ipsum dolor sit amet
more text
I want to insert consectetur adipiscing elit after lorem ipsum dolor sit amet, so the output file will look like:
some text
lorem ipsum dolor sit amet
consectetur adipiscing elit
more text
To append after the pattern (-i edits the file in place; line1 and line2 are the lines you want to append or prepend):
sed -i '/pattern/a \
line1 \
line2' inputfile
Output:
#cat inputfile
pattern
line1
line2
To prepend the lines before:
sed -i '/pattern/i \
line1 \
line2' inputfile
Output:
#cat inputfile
line1
line2
pattern
The following adds one line after SearchPattern.
sed -i '/SearchPattern/aNew Text' SomeFile.txt
It inserts New Text one line below each line that contains SearchPattern.
To add two lines, you can use a \ and enter a newline while typing New Text.
POSIX sed requires a \ and a newline after the a sed function. [1]
Specifying the text to append without the newline is a GNU sed extension (as documented in the sed info page), so its usage is not as portable.
[1] https://unix.stackexchange.com/questions/52131/sed-on-osx-insert-at-a-certain-line/
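For comparison, here is the portable form applied to the example from the question. It is a sketch that writes to a temporary sample file (an assumption for illustration) and leaves out -i, so the result goes to standard output.

```shell
# Sample input taken from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'some text' 'lorem ipsum dolor sit amet' 'more text' > "$sample"

# POSIX form of the append command: a, a backslash, a newline, then the text.
appended=$(sed '/^lorem ipsum dolor sit amet$/a\
consectetur adipiscing elit' "$sample")

printf '%s\n' "$appended"
# some text
# lorem ipsum dolor sit amet
# consectetur adipiscing elit
# more text

rm -f "$sample"
```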
Insert a new verse after the given verse in your stanza:
sed -i '/^lorem ipsum dolor sit amet$/ s:$:\nconsectetur adipiscing elit:' FILE
It is more portable to use ed, since some systems don't support \n in sed:
printf "/^lorem ipsum dolor sit amet/a\nconsectetur adipiscing elit\n.\nw\nq\n" |\
/bin/ed "$filename"