SQL Server: searching strings for equivalent phrasing such as inch, inches, '' and "

As per the title, I am looking for a way to search data on an equivalence basis.
I.e., if a user searches for a value of 20", it should also search for 20 inch, 20 inches, etc.
I've looked at using full-text search with a thesaurus, but I would have to build my own equivalence library.
Are there any other alternatives I should be looking at? Or are there common symbol/word equivalence libraries already written?
EDIT:
I don't mean the LIKE keyword and wildcards.
If my data is
A pipe that is 20" wide
A pipe that is 20'' wide (note: this is two single quotes)
A pipe that is 20 cm wide
A pipe that is 20 inch wide
A pipe that is 20 inches wide
I would like to search for '20 inch' and get back
A pipe that is 20" wide
A pipe that is 20'' wide
A pipe that is 20 inch wide
A pipe that is 20 inches wide

Just answering this in case anyone else comes across it, as I finally figured it out.
I ended up using an FTS thesaurus to assign equivalence to inch, inches and ". This worked wonderfully for inch and inches, but returned no results when I searched for 6"
It eventually turned out that the underlying issue was that characters such as " are treated as word breakers by full-text search.
I found that custom dictionary items seem to override the language's word breakers. I created a file called Custom0009.lex containing a few lines: " plus a few other characters/terms containing word breakers that I wanted included. I placed it in C:\Program Files\Microsoft SQL Server\{instance name}\MSSQL\Binn, restarted fdhost and rebuilt the index. After that, the following searches worked:
select * from tbldescriptions where FREETEXT(MainDesc,'"')
or
select * from tbldescriptions where contains(MainDesc,'FORMSOF(Thesaurus,"""")')
Notice the doubled " in the CONTAINS query: because the search term is already enclosed in ", the character has to be escaped to be seen.
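For anyone who cannot change the full-text configuration, a different approach (not the thesaurus method described above) is to normalise the unit spellings on both the stored text and the search term before comparing them. Below is a minimal Python sketch of that idea. The function names and the list of spellings are my own assumptions, and in practice the normalised text would probably live in an extra column.

```python
import re

# Assumed spellings of "inch"; they are only rewritten directly after a digit.
# "inches" is listed before "inch" so the longer spelling wins.
INCH_AFTER_NUMBER = re.compile(
    r"""(\d)\s*(?:inches|inch|''|")(?![A-Za-z])""", re.IGNORECASE
)

def normalize_units(text):
    """Rewrite 20", 20'', 20 inch and 20 inches as '20 inch'."""
    return INCH_AFTER_NUMBER.sub(r"\1 inch", text)

def search(rows, query):
    """Return the rows whose normalised text contains the normalised query."""
    needle = normalize_units(query).lower()
    return [row for row in rows if needle in normalize_units(row).lower()]

rows = [
    'A pipe that is 20" wide',
    "A pipe that is 20'' wide",
    "A pipe that is 20 cm wide",
    "A pipe that is 20 inch wide",
    "A pipe that is 20 inches wide",
]
for row in search(rows, "20 inch"):
    print(row)
```

With the sample data from the question this prints every row except the "20 cm" one, and searching for 20" gives the same result.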

Related

Trouble with tabulizer library in R recognizing non-alphanumeric (symbol) characters in a table in a PDF

I am using the tabulizer library in R to capture data from a table inside a PDF on a public website
(https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf).
The example table I am interested in is on page 23 of the PDF (p. 2-21; the document has a couple of blank pages at the beginning). The table has a non-standard format and contains various symbols (non-alphanumeric characters) in its cells.
I want to extract most, if not all, of the tables from this document.
I want to end up with a table in which the symbols are replaced with codes (e.g., black circles with 999, white circles with 777, plus signs with -99, etc.).
Tabulizer does a good job for the most part, converting the dark circles into consistent alphanumeric codes and keeping the plus signs, but it runs into problems with the white circles in the REC1 column, which is odd since it does seem to recognize exotic characters in other columns.
Could anyone please help fix this? I also tried selecting the table area, but the output was worse. Below is the R code I am using.
I know I can complete this process by hand for all the tables in the document using the built-in PDF select and export tools, but I would like to automate the process.
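library("tabulizer")
# Source PDF (public URL)
f2 <- "https://www.waterboards.ca.gov/sandiego/water_issues/programs/basin_plan/docs/update082812/Chpt_2_2012.pdf"
# Extract the tables on page 23 using the lattice method
tab <- extract_tables(f2, pages = 23, method = 'lattice')
head(tab[[1]])
# extract_tables() returns a list, so convert only the first table
df <- as.data.frame(tab[[1]])
write.csv(df, file = "test.csv")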

Identify paragraphs of PDF files using iTextSharp

Because of some semantic analysis work, I need to identify paragraphs in PDF files with iTextSharp. I know that the origin of iTextSharp's coordinate system is the bottom-left corner of the page. I have found three features that define paragraph boundaries:
if the horizontal position of the first word in a line is less than that of the general lines;
if the leading between two consecutive lines is larger than the general leading;
if a line ends with "." and the horizontal position of its last word is less than that of the other lines.
However, I am stuck on the second one. How can I know the general leading between two lines in a paragraph? I mean there are different gaps between two consecutive lines, because some letters like 'f','g' need more space than the others like 'a','n' and so on.
Thanks for your help!
I'm assuming that you are parsing your PDF files using the parser functionality available in iTextSharp. See for instance Extract font height and rotation from PDF files with iText/iTextSharp to see how others have done this before you. A more elaborate article can be found here: Using Open Source PDF Technology to Solve the Unstructured Data Problem in Healthcare
Your question is: how can I calculate the leading? That is: how do I know the distance between the base lines of two consecutive lines?
When you parse a PDF using iTextSharp, you see each line as a series of TextRenderInfo objects. These objects allow you to get the baseline of the text:
LineSegment baseline = renderInfo.GetBaseline();
Vector startpoint = baseline.GetStartPoint();
This Vector consists of different elements: Getting Coordinates of string using ITextExtractionStrategy and LocationTextExtractionStrategy in Itextsharp
You need startpoint[Vector.I2]. See also: How to detect newline from PDF using iTextSharp
The difference between that value for two consecutive lines gives you the value of the leading in its modern meaning. In the old times of printing, every character was a block of a fixed size. Printers (the people, not the machines) put a strip of lead between the rows of blocks to create some extra space between the lines. In modern computing, the word was preserved, but its meaning changed. There are no "blocks" anymore, but you could work with the font size. The font size is an average size of the glyphs in a font. Some glyphs will take more space in height, some will take less, but by taking both the leading (distance between baselines) and the font size (average height of each glyph) into account, you can get a fair idea of the "space between the lines".
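Because the baseline does not move for letters with descenders such as 'g', the differences between baseline y-values are usually stable inside a paragraph. A rough sketch of how the "general" leading could be estimated and used to flag paragraph breaks is shown below in Python rather than C#. It assumes you have already collected one baseline y-value per line; the function name and the 1.25 tolerance are my own assumptions, not anything provided by iTextSharp.

```python
from statistics import median

def find_paragraph_breaks(baselines_y, tolerance=1.25):
    """Estimate the typical leading and flag unusually large gaps.

    baselines_y: baseline y-coordinates of consecutive lines, top to bottom.
    PDF y grows upward, so the values decrease down the page.
    Returns (typical_leading, indexes of lines that start a new paragraph).
    """
    if len(baselines_y) < 2:
        return None, []
    # Distance between the baselines of each pair of consecutive lines
    gaps = [upper - lower for upper, lower in zip(baselines_y, baselines_y[1:])]
    # The median is not thrown off by the few larger paragraph gaps
    typical = median(gaps)
    breaks = [i + 1 for i, gap in enumerate(gaps) if gap > typical * tolerance]
    return typical, breaks

leading, breaks = find_paragraph_breaks([700, 686, 672, 650, 636, 622])
print(leading, breaks)
```

In this made-up example the lines are 14 units apart except for one 22-unit gap, so the fourth line (index 3) is reported as the start of a new paragraph.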

Making the PDF format readable and diff-able

I am wondering if anyone has thought of a way to display the PDF document format in a more human-readable form?
Right now, comparing PDF files, or seeing exactly what has changed between two versions, is very difficult. Many changes aren't visible to the naked eye since they are not part of the graphical representation (such as "created when" and similar).
So if a PDF is the result of an integration test, it is difficult to find the problem without a hex editor. It is also difficult to disregard "created when" in the comparison.
I am not talking about any interpretation or display, just converting the basic object types to some meta-language. For simplicity's sake, let's say XML, with nodes named as they are in the PDF specification.
There are PDF parsers available for most programming languages. Still, I, at least, can't find anyone who has gone the distance to convert it to something readable.
Or have I missed it?
Edit:
To clarify(example from specification):
BI % Begin inline image object
/W 17 % Width in samples
/H 17 % Height in samples
/CS /RGB % Color space
/BPC 8 % Bits per component
/F [ /A85 /LZW ] % Filters
Would become:
<BI>
<W>17</W>
<H>17</H>
<CS><RGB/></CS>
<BPC>8</BPC>
<F>
<item>A85</item>
<item>LZW</item>
</F>
</BI>
..and so on.
Binary data could either be extracted to a file or just show a hash or size.
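To make the idea concrete, here is a small Python sketch of the conversion step only. It assumes some PDF parser has already turned the objects into plain dicts, lists and numbers. The Name class is a hypothetical marker for PDF name objects such as /RGB, and the output shape simply follows the example above (nested arrays are not handled).

```python
import xml.etree.ElementTree as ET

class Name(str):
    """Hypothetical marker type for a PDF name object such as /RGB."""

def to_xml(tag, value):
    """Convert an already-parsed PDF value into an XML element."""
    node = ET.Element(tag)
    if isinstance(value, dict):
        # Dictionary: one child element per key
        for key, item in value.items():
            node.append(to_xml(key, item))
    elif isinstance(value, list):
        # Array: one <item> per entry, as in the example above
        for item in value:
            ET.SubElement(node, "item").text = str(item)
    elif isinstance(value, Name):
        # A name used as a value becomes an empty child element
        ET.SubElement(node, str(value))
    else:
        node.text = str(value)
    return node

inline_image = {
    "W": 17,
    "H": 17,
    "CS": Name("RGB"),
    "BPC": 8,
    "F": [Name("A85"), Name("LZW")],
}
root = to_xml("BI", inline_image)
print(ET.tostring(root, encoding="unicode"))
```

Once everything is in this form, an ordinary text diff can be used, and volatile entries such as creation dates could be filtered out before comparing.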

Postgresql database search with regex

I'm using PostgreSQL database with VB.NET and ODBC (Windows).
I'm searching sentences for whole words by combining SELECT with a regular expression, like this:
"SELECT dtbl_id, name
FROM mytable
WHERE name ~* '" + "( |^)" + TextBox1.Text + "([^A-Za-z]|$)'"
This searches well in some cases but because of syntax errors in text (or other reasons) it sometimes fails. For example, if I have the sentence
BILLY IDOL: WHITE WEDDING
the word "white" will be found. But if I have
CLASH-WHITE RIOT
then "white" will not be found, because there is no space before the start of the word "white".
The simplest solution would be to temporarily change or replace characters in the sentences :,.\/-= etc to spaces.
Is this possible to do in single SELECT line to be suitable for use with .NET/ODBC? Maybe inside the same regular expression?
If it is, how?
Try this:
SELECT 'CLASH-WHITE RIOT' ~ '[[:<:]]WHITE[[:>:]]';
[[:<:]] and [[:>:]] match the beginning and end of a word, respectively.
More info can be found at: http://www.postgresql.org/docs/9.1/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP
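Since the search word comes from a text box, it is probably worth escaping it before wrapping it in the word-boundary markers, and passing the finished pattern as a query parameter instead of concatenating it into the SQL. A minimal sketch of building such a pattern follows, in Python rather than VB.NET. The helper name is made up, and the ? placeholder assumes an ODBC-style parameter marker.

```python
import re

def whole_word_pattern(word):
    """Build a PostgreSQL regex that matches `word` as a whole word.

    Every non-word character is backslash-escaped so that user input
    such as "a.b" is matched literally and not treated as regex syntax.
    """
    escaped = re.sub(r"(\W)", r"\\\1", word)
    return "[[:<:]]" + escaped + "[[:>:]]"

# Assumed query text; the pattern is sent separately as a parameter
QUERY = "SELECT dtbl_id, name FROM mytable WHERE name ~* ?"

print(whole_word_pattern("white"))
print(whole_word_pattern("a.b"))
```

The ~* operator keeps the match case-insensitive, as in the original query.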

Preserve "long" spaces in PDFBox text extraction

I am using PDFBox to extract text from PDF.
The PDF has a tabular structure, which is quite simple, and the columns are very widely spaced from each other.
This works really well, except that all kinds of horizontal space gets converted into a single space character, so that I cannot tell columns apart anymore (space within words in a column looks just like space between columns).
I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
The pdftotext C library/tool has a '-layout' switch that tries to preserve the layout. Basically, if I can emulate that with PDFBox, that would be perfect.
There does not seem to be a setting for this, but I was able to modify the source of the PDFTextStripper tool to output a column separator (|) when a "long" space was encountered. In the code where it builds the output line, it is possible to look at the x positions of the current and previous letter, and if the gap between them is large enough, do something special. PDFTextStripper has lots of protected methods, but it turned out not to be all that extensible. I ended up having to copy the whole class to change a private method.
Looking at the code in there, I call myself lucky that with the particular PDF, this simple approach was successful. A more general solution seems very tricky.
PDF text extraction is difficult.
If the text was output as one big string separated by spaces, such as:
PDFTextOut(" Column 1 Column 2 Column 3");
and you are using a fixed-width font such as Courier, then you could theoretically calculate the number of spaces between items of text, because each character is the same width. If the font is proportional, such as Arial, then the calculation is harder.
In reality, most PDFs are generated by individually placing each piece of text directly into its position. Therefore, there is technically no space character or any other character between columns. The text is just placed at an absolute position on the page.
PDFMoveTo(100,100);
PDFTextOut("Column 1");
PDFMoveTo(250,100);
PDFTextOut("Column 2");
In order to perform data extraction on PDF documents you have to do a little bit more work to find and match column data by using pixel locations as you have mentioned and by making some assumptions and having a little bit of luck.
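The position-matching idea can be sketched independently of any PDF library. The Python example below assumes you already have, for one line, the start and end x-coordinate of each word; the function name, the tuple layout and the threshold of 20 units are all assumptions for illustration, not part of PDFBox.

```python
def join_with_columns(chunks, gap_threshold=20.0, separator=" | "):
    """Join the words of one line, marking wide horizontal gaps.

    chunks: (x_start, x_end, text) tuples for a single line, in any order.
    A gap wider than gap_threshold is treated as a column boundary.
    """
    parts = []
    previous_end = None
    for x_start, x_end, text in sorted(chunks):
        if previous_end is not None:
            gap = x_start - previous_end
            # Wide gap: column separator; narrow gap: ordinary space
            parts.append(separator if gap > gap_threshold else " ")
        parts.append(text)
        previous_end = x_end
    return "".join(parts)

line = [
    (100, 130, "Column"), (133, 138, "1"),
    (250, 280, "Column"), (283, 288, "2"),
]
print(join_with_columns(line))
```

For the sample line the small gaps inside each column become single spaces and the large gap becomes the separator. A suitable threshold depends on the document, so it would need tuning per PDF.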