Full Text Search for extracting a snippet of the text (returning the matched text and its surroundings) - sql

I'm using a SQL Server FileTable and, for instance, I have a saved text file named "SOS.txt" which contains the following text:
For god's sake, save us right now please. We can't survive.
Now or never!
Now I want to find all files that contain the word "save", so I execute the following query:
SELECT * FROM FileTableExample
WHERE CONTAINS(file_stream, 'save')
and here's the result:
stream file => 0x616C692053617665207573207269676874206E6F772E0D0A4E6F77206F72206E6576657221
As you can see I got a hit: the third column of the result identifies the file named SOS.txt. I have the stream_id and file_stream, but what I'm trying to find is a way to show the matched text together with its surroundings in a human-readable format.
Something like this:
Name         | Excerpt
-------------+----------------------
SOS.txt      | ..sake, save us..
Is there any way?
Update:
After searching the net I found this article, which is useful, but it doesn't mention full-text search on the FileTable structure.
Based on that article, I converted the file stream to a string:
SELECT CONVERT(varchar(MAX), file_stream) AS Excerpt, *
from FileTableExample
where contains(file_stream, 'save')
This works if the file is plain text like SOS.txt, but if it's a .docx or .pptx file you won't get a useful conversion.

Use this: CAST(file_stream AS varchar(max))
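
Building on that, here is a rough sketch (plain-text files only; the 20-character window on each side is arbitrary) that carves an excerpt out of the converted stream:

-- Sketch: convert the stream to text, find the first occurrence of the search term,
-- and return a window of characters around it.
SELECT
    name,
    SUBSTRING(
        CONVERT(varchar(MAX), file_stream),
        CASE WHEN CHARINDEX('save', CONVERT(varchar(MAX), file_stream)) > 20
             THEN CHARINDEX('save', CONVERT(varchar(MAX), file_stream)) - 20
             ELSE 1 END,
        LEN('save') + 40
    ) AS Excerpt
FROM FileTableExample
WHERE CONTAINS(file_stream, 'save');

For .docx or .pptx files this still won't help, since the stored bytes are not readable text; there you would need the document's IFilter (or an external tool) to extract plain text first.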

Related

How can I use Full Text Search for a Persian text file in SQL Server

I have a table that contains a varbinary column into which I insert my text file.
I added a catalog and a full-text index, and I can search for English words successfully, like below:
select * from [FILESTREAM_Documents]
where Contains([DocumentFS], 'Hello')
And I get the result correctly, as below:
I used this doc to get here.
Now my question is: how can I search for a Persian word in the FILESTREAM data using Full Text Search, like the image below:
select * from [FILESTREAM_Documents]
where Contains([DocumentFS], 'سلام')
I have searched a lot and tested the N prefix for the word, but it still does not work.
How can I do this?
Thanks for Helping!
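
One thing worth checking (a sketch; YourDatabase is a placeholder for the actual database name) is whether the Persian terms ever made it into the full-text index. If they don't show up here, the problem lies with the word breaker/iFilter used at indexing time rather than with the query itself:

-- List the terms the full-text indexer actually stored for each document
SELECT document_id, display_term
FROM sys.dm_fts_index_keywords_by_document(
         DB_ID('YourDatabase'),
         OBJECT_ID('FILESTREAM_Documents'));

If the Persian terms do appear there, then querying with a Unicode literal, CONTAINS([DocumentFS], N'سلام'), should find them.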

U-SQL: extracting files' complete contents (extracting full source code from HTML files)

I've got a bunch of HTML files in my Data Lake Store and would like to get their full source code into a table (just one column with the code from all the files, the output format is not relevant to me, but probably tsv). I can't find a way to use the standard Extractors or anything on the web that works for me. Do I have to write a custom Extractor for that?
I've tried the Extractors.Tsv() and Extractors.Text() with a whole bunch of delimiters. I first tried:
@data =
EXTRACT source string
FROM "<MY DIRECTORY IN ADL>"
USING Extractors.Text(delimiter:'');
This didn't work out, as it doesn't seem to like having no delimiter, but it also didn't work when I tried delimiters that don't appear in the HTML files.
Has anyone got an idea how to get this done? It seems to me that I am just stupid, so I hope someone here is a little smarter.
Even better than just the source code would be if I had the source code + filename in two columns, but I wanna start small.
Thank you!
// Read every .html file under the path; {FileName} in the path pattern is a virtual
// column that captures each file's name, the backtick delimiter is simply a character
// unlikely to occur in the files, and silent: true drops rows that don't fit the schema
// instead of failing the job.
@files =
    EXTRACT FileName string,
            Text string
    FROM @"/somepath/{FileName}.html"
    USING Extractors.Text(silent: true, delimiter: '`');

OUTPUT @files
TO "/somepath/Test.txt"
USING Outputters.Tsv(outputHeader: false, quoting: false);

Analyze PDF files to detect malicious ones

I wrote code in Python that detects malicious PDFs.
For every file I analyze, I calculate its hash value and save it in a hash database, besides saving the output in a text file.
If I want to scan another file, I calculate its hash value and then look it up in the hash database; if it is found, I print the output from the text file that already exists.
But if the hash value does not exist, it is saved and the output is saved in a text file.
I need help with how to link the hash value to the text file that contains the output.
As Kyle mentioned, you can use a hash table. A hash table is similar to a dictionary. In Python, I believe they're actually called dictionaries. For more on that, look here: http://www.tutorialspoint.com/python/python_dictionary.htm
As far as your question is concerned, you have a variety of options. You will have to save your 'database' at some point and you could save it in many different formats. You could save it as a JSON file (a very popular style). It could be an XML file (very popular as well). You could even save it as a CSV (not nearly as popular, but it gets the job done). For the sake of this, let's say you save this 'database' in a text file which looks like this:
5a4730fc12097346fdf5589166c373d3{C:\PdfsOutput\FileName.txt}662ad9b45e0f30333a433566cee8988d{C:\PdfsOutput\SomeOtherFile.txt}
Essentially you're formatting it as HashValue{PathToFileOnDisk}... You could then parse this with a regex that looks like [0-9a-f]{32}\{[^\}]+ and scan your database on startup using it: load up all matches, iterate over them, split each match at '{', and put ValueSplit[0] into a dictionary as the key, with the path to that text file as the value.
So, after you do the regex search, get your matches and are iterating them, within the iteration loop say something like:
ValueSplit = RegexMatch.split('{')
HashAndFileDict[ValueSplit[0]] = ValueSplit[1]
This code assumes the regex match in the loop is a string simply called 'RegexMatch'. It also assumes that your dictionary you're storing hash values and paths in is called 'HashAndFileDict'
Later in your code you can check a PDF hash value in question by saying:
if PDFHashValue not in HashAndFileDict:
    TextFilePath = savePDFOutputText(ThePDFFile)
    HashAndFileDict[PDFHashValue] = TextFilePath
else:
    print("File already processed. Text is at: " + HashAndFileDict[PDFHashValue])
If I may, it might be wise to use 2 hashing algorithms and combine their hexadecimal digests into 1 string in order to prevent a collision when processing many PDF files.
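
If you'd rather use the JSON option mentioned above instead of a custom text format, a minimal sketch (the file name hash_db.json and the helper names are placeholders) could persist the same dictionary like this:

import json
import os

DB_PATH = "hash_db.json"  # placeholder path for the hash "database"

def load_hash_db(path=DB_PATH):
    # Load the hash -> output-file mapping, or start empty if the file doesn't exist yet.
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return {}

def save_hash_db(db, path=DB_PATH):
    # Write the whole mapping back to disk after adding a new entry.
    with open(path, "w") as f:
        json.dump(db, f, indent=2)

# Same check as above, but persisted across runs:
HashAndFileDict = load_hash_db()
if PDFHashValue not in HashAndFileDict:
    HashAndFileDict[PDFHashValue] = savePDFOutputText(ThePDFFile)
    save_hash_db(HashAndFileDict)
else:
    print("File already processed. Text is at: " + HashAndFileDict[PDFHashValue])

This saves you the custom format and the regex parsing entirely.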

How to get data from a .rtf file or Excel file into a database (SQLite) in the iPhone SDK?

I have lots of data in a .rtf file (usernames and passwords). How can I fetch that data into a table? I'm using sqlite3.
I created a "userDatabase.sql" in which I created a table "usersList" with the fields "username" and "password". I want to get the data from the "list.rtf" file into my table "usersList". Please help me.
Thanks in advance.
Praveena.
I would write a little parser. Re-save the .rtf as a txt file and assume it looks like this:
user1:pass1
user2:pass2
user5:pass5
Now do this (in your code); a rough sketch follows the list:
open the .txt file (NSString -stringWithContentsOfFile:usedEncoding:error:)
read line by line
for each line, fetch user and password (NSArray -componentsSeparatedByString)
store user/password into your DB
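
A minimal Objective-C sketch of those steps, assuming the txt lines look like "user:pass", that sqlite3.h is imported, and that db is an already opened sqlite3* handle with the usersList table in place:

// Read the whole file, split it into lines, split each line at ':' and insert.
NSStringEncoding enc;
NSError *error = nil;
NSString *contents = [NSString stringWithContentsOfFile:@"list.txt"
                                           usedEncoding:&enc
                                                  error:&error];
for (NSString *line in [contents componentsSeparatedByString:@"\n"]) {
    NSArray *parts = [line componentsSeparatedByString:@":"];
    if ([parts count] < 2) continue;  // skip blank or malformed lines
    const char *sql = "INSERT INTO usersList (username, password) VALUES (?, ?)";
    sqlite3_stmt *stmt = NULL;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) == SQLITE_OK) {
        sqlite3_bind_text(stmt, 1, [[parts objectAtIndex:0] UTF8String], -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, [[parts objectAtIndex:1] UTF8String], -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }
}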
Best,
Christian
Edit: for parsing Excel sheets I recommend exporting as a CSV file and then doing the same.
Parsing RTF files is mostly trivial. They're actually text, not binary (unlike .doc, .pdf, etc.).
Last I used it, I remember the file format wasn't too difficult either.
Example:
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Calibri;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 Username Password\par
Username2 Password2\par
UsernameN PasswordN\par
}
Do a regular expression match to get the last { ... } part. Be sure to match { and not \{.
Next, parse the text as you want, but keep in mind that:
everything starting with a \ is an escape or control sequence; I would write a little function to unescape the text
the special identifier \par is for a new line
there are other special identifiers, such as \b which toggles bolding text
the color change identifier, \cfN changes the text color according to the color table defined in the file header. You would want to ignore this identifier since we're talking about plain text.

How to force scheme.ini to be used for MS Text Driver?

I am creating this huge CSV import that uses the MS Text Driver to read the CSV file.
And I am using ColdFusion to create the scheme.ini in each folder where the file has been uploaded.
Here is a sample one I am using:
[some_filename.csv]
Format=CSVDelimited
ColNameHeader=True
MaxScanRows=0
Col1=user_id Text width 80
Col2=first_name Text width 20
Col3=last_name Text width 30
Col4=rights Text width 10
Col5=assign_training Text width 1
CharacterSet=ANSI
Then in my ColdFusion code, I am doing 2 cfdump's:
<cfdump var="#GetMetaData( csvfile )#" />
<cfdump var="#csvfile#">
The metadata shows that the query has not picked up the correct data types for reading the CSV file.
And the dump of the query that reads the file shows that it is missing values, because with Excel we cannot force people to use double quotes. And when fields have mixed data types, our process breaks.
How can I either change the data types inside the query (i.e., make it use scheme.ini) or update the metadata to the correct data types?
I am using a view on information_schema in sql server 2005 to get the correct data types, column names, and max lengths...
Unless I have some kind of syntax error, I can't see why it's not grabbing the data as the correct data type.
Any suggestions?
Funnily enough, I had the filename spelled wrong: instead of using schema.ini I had it as scheme.ini.
I hate when you make lil mistakes like this...
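
For anyone else hitting this, a minimal ColdFusion sketch that writes the correctly named schema.ini into the upload folder (uploadDir and iniContent are placeholders for your own variables):

<!--- The content is whatever you already generate; the key point is the file name --->
<cffile action="write" file="#uploadDir#schema.ini" output="#iniContent#">

With that in place, the Text Driver picks up the column definitions from schema.ini in the same directory as the CSV.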
Thank You