U-SQL extracting files complete contents (extracting full source code from html files) - azure-data-lake

I've got a bunch of HTML files in my Data Lake Store and would like to get their full source code into a table (just one column with the code from all the files, the output format is not relevant to me, but probably tsv). I can't find a way to use the standard Extractors or anything on the web that works for me. Do I have to write a custom Extractor for that?
I've tried the Extractors.Tsv() and Extractors.Text() with a whole bunch of delimiters. I first tried:
@data =
EXTRACT source string
FROM "<MY DIRECTORY IN ADL>"
USING Extractors.Text(delimiter:'');
This didn't work, as it does not seem to accept an empty delimiter, but it also failed when I tried delimiters that don't occur in the HTML files.
Has anyone got an idea how to get this done? It seems to me that I am just stupid, so I hope someone here is a little smarter.
Even better than just the source code would be if I had the source code + filename in two columns, but I wanna start small.
Thank you!

@files =
EXTRACT FileName string,
Text string
FROM @"/somepath/{FileName}.html"
USING Extractors.Text(silent: true, delimiter: '`');
OUTPUT @files
TO "/somepath/Test.txt"
USING Outputters.Tsv(outputHeader: false, quoting: false);

Related

Multi-line text in a .env file

In vue, is there a way to have a value span multiple lines in an .env file. Ex:
Instead of:
someValue=[{"someValue":"Here is a really really long piece which should be split into multiple lines"}]
I want to do something like:
someValue=`[{"someValue":"Here is a really
really long piece which
should be split into multiple lines"}]`
Doing the latter gives me a JSON parsing error if I try to do JSON.parse(someValue) in my code
I don't know if this will work, but I can't format a comment appropriately enough to get the point across so see if this will work:
someValue=[{"someValue":"Here is a really\
really long piece which\
should be split into multiple lines"}]
Where "\" should escape the newline similar to how you can write long bash commands while escaping the newline. I'm not certain the .env interpreter will support it though.
EDIT
Looks like this won't work. This syntax was actually proposed, but I don't think it was incorporated. See motdotla/dotenv#333 (which is what Vue uses to parse .env).
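Like @zero298 said, this isn't possible. You could instead delimit the entry with a character that wouldn't normally show up in the text (^ is a good candidate), then restore the line breaks within the application using string.replace(/\^/g, '\n'); note that string.replace('^', '\n') with a plain string pattern only replaces the first occurrence.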

Full Text Search for extracting a snippet of the text (returning intended text and its surroundings)

I'm using a SQL Server FileTable and, for instance, I have a saved text file named "SOS.txt" which contains the following text:
For god's sake, save us right now please. We can't survive.
Now or never!
Now I want to find all files that contain the word save, so I execute the following query:
SELECT * FROM FileTableExample
WHERE CONTAINS(file_stream, 'save')
and here's the result:
stream file => 0x616C692053617665207573207269676874206E6F772E0D0A4E6F77206F72206E6576657221
As you can see, I got the correct result; the third column of the result indicates the file under name SOS.txt. I have the stream_id and stream_file, but what I'm trying to find is a way to show the matched text together with its surroundings in a human-readable format.
Something like this:
Name    | Excerpt
--------+------------------
SOS.txt | ..sake, save us..
Is there any way?
Update:
After searching the net I found this article, which is useful, but it doesn't cover full-text search on a FileTable.
Based on this article, I converted file stream to string:
SELECT CONVERT(varchar(MAX), file_stream) AS Excerpt, *
from FileTableExample
where contains(file_stream, 'save')
It works if the file is plain text like SOS.txt, but if it's a .docx or .pptx file, the conversion does not give a useful result.
Use this: CAST(file_stream AS varchar(max))

How to write results into an NSArray and save it as a CSV file using Objective-C

I'm trying to store my results in an NSArray and save it as a CSV file using Objective-C, but I can't seem to find any relevant solution. Please find the sample code below:
int a=5,b=10;
int c=b-a;
double d=4.5,e=3.0;
double h=d-e;
NSLog(@"host_port:%f", c);
NSLog(@"host_size:%d", h;
I would like to store my values c and h in an array and write that to a CSV file. Any advice on this would be helpful.
Thanks in advance.
When you ask a question on SO you need to show effort: code you've tried, details of what you've read. If you don't, you'll get downvotes and close votes (you have one of each as I write this). The code you have included has nothing to do with CSV or arrays, and is not even valid code (the formats are wrong).
That said, let's see if we can give you something to get you going.
A CSV file is just plain text. You don't need any packages to write one; standard I/O routines will do the job. You also do not need to store all the values in an array and then output the array, or build up a string version of the whole CSV file and output that. You can output items as they are generated if you wish, and it may be more efficient to do so. In your code fragment you only have two values (maybe you intend this to be the core of a loop), and given those we assume you want the following CSV file:
host_port,host_size
5,1.5
Your values have basic types, int and double; they are not Objective-C object types. Given this you can use the standard C I/O operations to produce your file.
First you may need to obtain the destination file name from the user. Assuming this is a GUI app, look up NSOpenPanel for this. That will give you an NSURL from which you can obtain the file path as an NSString, and you can convert that into a C string using NSString methods.
Now you can enter the C I/O world. To find the documentation on the following functions, open the Terminal and use the man command, e.g. man fopen.
To create the file and open it for writing, use fopen(), passing it the C string pathname you obtained above.
To write the headers and each row of data use fprintf(). This takes a format string just like NSLog(), but you must remember to explicitly include the line breaks by using \n in the format.
When you've finished close the file with fclose().
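The fopen() / fprintf() / fclose() steps above can be sketched in plain C as follows. This is only a minimal sketch, not the answerer's own code: the function name write_csv and its parameters are hypothetical, and it writes just the header and the single row from the example.

```c
#include <stdio.h>

/* Hypothetical helper (name and signature are illustrative only):
   writes the header line and one data row to the file at `path`.
   Returns 0 on success, -1 on any I/O failure. */
int write_csv(const char *path, int host_port, double host_size)
{
    FILE *fp = fopen(path, "w");            /* create/truncate for writing */
    if (fp == NULL)
        return -1;

    /* fprintf does not add line breaks, so \n is written explicitly. */
    int ok = fprintf(fp, "host_port,host_size\n") >= 0
          && fprintf(fp, "%d,%g\n", host_port, host_size) >= 0;

    if (fclose(fp) != 0)                    /* closing flushes buffered data */
        ok = 0;

    return ok ? 0 : -1;
}
```

From Objective-C, a call might look like write_csv([path fileSystemRepresentation], c, h), where path is assumed to be an NSString holding the chosen file path. In a loop you would open the file once, call fprintf() per row, and close it at the end.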
Now go read the documentation and write your CSV file!
HTH

Lucene: extracting the sentence in which a word match occurs

I'm a newbie to Lucene. In the course of understanding it, I could successfully index the files in a directory and I did a basic lucene search to get the list of files in which a particular word is present.
Now I'm trying to extract the sentence from a file in which the search word is present.
I've searched a lot but couldn't figure it out.
Regards.
Thank you all for your response.
I was trying to extract an index of the sentences in the directory of files, not the "relevant/best text fragment".
Here is how I solved the problem:
Using "two-level indexing" --> first index the files in a directory & then index the sentences in each file. This made my job much easier & faster.
Anyways, thanks again for the help :)
You're looking for the method
org.apache.lucene.search.highlight.Highlighter.getBestFragment
This method takes as input the token stream generated by analyzing the original text, and returns the most relevant text fragment. Please remember to trim the fragments if they are too big.

Cannot upload CSV that starts with an integer

I'm stuck with what seems like a weird BigQuery bug: I cannot upload a CSV file that starts (first line, first column) with an integer.
Here's my schema: COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my CSV file content:
100,4,XXX
100,4,XXX
If I put the STRING column as first column, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is ok too.
But with the CSV and schema above, BigQuery always complains: Line:1 / Field:1, Value cannot be converted to expected type.
Does anyone know what the problem is?
Thank you in advance,
David
I could not reproduce this problem--I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file is corrupted somehow? If there are extra bytes at the beginning of the file, those would be ignored in a header row but might result in this error if the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?
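As a rough illustration of the "examine the binary data" suggestion above, the following C sketch checks whether a file begins with a UTF-8 byte order mark (bytes EF BB BF). That a BOM is the culprit here is only an assumption; it is one common kind of invisible leading bytes, and the helper name starts_with_utf8_bom is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper (illustrative name): returns 1 if the file starts
   with the UTF-8 byte order mark EF BB BF, 0 if it does not, and -1 if
   the file cannot be opened. */
int starts_with_utf8_bom(const char *path)
{
    unsigned char head[3];
    FILE *fp = fopen(path, "rb");           /* binary mode: read raw bytes */
    if (fp == NULL)
        return -1;

    size_t n = fread(head, 1, sizeof head, fp);
    fclose(fp);

    return n == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
}
```

If this returns 1 for the CSV, re-saving the file without a BOM would be a reasonable next thing to try. Other stray bytes would need a more general hex dump of the first few bytes.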