How to handle text file with multiple spaces as delimiter - azure-data-lake

I have a source data set which consists of text files where the columns are separated by one or more spaces, depending on the width of the column value. The data is right-aligned, i.e. the spaces are added before the actual data.
Can I use one of the built-in extractors or do I have to implement a custom extractor?

@wBob's solution works if your row fits into a string (128 kB). Otherwise, write your own custom extractor that does fixed-width extraction. Depending on what information you have about the format, you can write it by using input.Split() to split the input into rows and then splitting the rows based on your whitespace rules, as shown below (a full example of the extractor pattern is here), or you could write one similar to the one described in this blog post.
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
    // Split the input stream into rows on the row delimiter.
    foreach (Stream current in input.Split(this._row_delim))
    {
        using (StreamReader streamReader = new StreamReader(current, this._encoding))
        {
            // Split the row on the column delimiter and drop the empty entries
            // produced by runs of spaces.
            string[] array = streamReader.ReadToEnd()
                .Split(new string[] { this._col_delim }, StringSplitOptions.None)
                .Where(x => !String.IsNullOrWhiteSpace(x))
                .ToArray();
            for (int i = 0; i < array.Length; i++)
            {
                // Now write your code to convert array[i] into the extract schema
            }
        }
        yield return outputrow.AsReadOnly();
    }
}
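For the conversion step inside the loop, a minimal sketch might look like the following. It assumes every column in the EXTRACT schema is a string; any other type would need a conversion based on outputrow.Schema[i].Type:
// Minimal sketch of the conversion step, assuming all output columns are strings.
if (i < outputrow.Schema.Count)
{
    outputrow.Set<string>(i, array[i].Trim());
}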

You could create a custom extractor or, more simply, import the data as one row, then split and clean it using the C# methods available to you within U-SQL, like Split and IsNullOrWhiteSpace, something like this:
My right-aligned sample data:
// Import each row as one column to be split later; NB use a delimiter that will NOT be in the import file
@input =
    EXTRACT rawString string
    FROM "/input/input.txt"
    USING Extractors.Text(delimiter : '|');

// Add a row number to each line and remove the whitespace elements
@working =
    SELECT ROW_NUMBER() OVER() AS rn,
           new SqlArray<string>(rawString.Split(' ').Where(x => !String.IsNullOrWhiteSpace(x))) AS columns
    FROM @input;

// Prepare the output, referencing each column's position in the array
@output =
    SELECT rn,
           columns[0] AS id,
           columns[1] AS firstName,
           columns[2] AS lastName
    FROM @working;

OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
My results:
HTH

Related

Split by delimiter which is contained in a record

I have a column which I am splitting in Snowflake.
The format is as follows:
Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying
I have been using split_to_table(A, ',') inside my query, but as you can probably tell this incorrectly also splits the "Scooter > Sprinting, Jogging and Walking" record.
Perhaps the delimiter should only apply if there is no space on either side of it? I cannot see a different condition that could work.
I have been researching online but haven't found a suitable workaround yet. Has anyone encountered a similar problem in the past?
Thanks
This is a custom rule for splitting to a table, so we can use a UDTF to apply the custom rule:
create or replace function split_to_table2(STR string, DELIM string, ROW_MUST_CONTAIN string)
returns table (VALUE string)
language javascript
strict immutable
as
$$
{
    initialize: function (argumentInfo, context) {
    },
    processRow: function (row, rowWriter, context) {
        var buffer = "";
        var i;
        const s = row.STR.split(row.DELIM);
        for (i = 0; i < s.length - 1; i++) {
            buffer += s[i];
            if (s[i + 1].includes(row.ROW_MUST_CONTAIN)) {
                // The next piece starts a new row, so flush what we have collected.
                rowWriter.writeRow({VALUE: buffer});
                buffer = "";
            } else {
                // The next piece belongs to the current row; put the delimiter back.
                buffer += row.DELIM;
            }
        }
        rowWriter.writeRow({VALUE: s[i]});
    },
}
$$;
select VALUE
from table(split_to_table2('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ',', '>'));
Output:
VALUE
Car > Bike
Bike > Scooter
Scooter > Sprinting, Jogging and Walking
Walking > Flying
This UDTF takes one more parameter than the two in the built-in table function split_to_table. The third parameter, ROW_MUST_CONTAIN, is the string a row must contain. It splits the string on DELIM, but if a piece does not contain the ROW_MUST_CONTAIN string, it concatenates the pieces to form a complete string for a row. In this case we just specify ',' for the delimiter and '>' for ROW_MUST_CONTAIN.
We can get a little clever with regexp_replace by replacing the actual delimiters with something else before the table split. I am using double pipes '||' but you can change that to something else. The '\|\|\\1' trick is called back-referencing; it allows us to include the captured group (\\1) as part of the replacement (\|\|).
set str='car>bike,bike>car,truck, and jeep,horse>cat,truck>car,truck, and jeep';
select $str, *
from table(split_to_table(regexp_replace($str,',([^>,]+>)','\|\|\\1'),'||'))
Yes, you are right. The only pattern I can see is the one with the whitespace after the comma.
It's a small workaround, but we can make use of this pattern. In the code below I replace those commas that have whitespace after them, then apply the split_to_table function, and finally convert the earlier replacement back.
It's not super pretty and would break if your string contains "my_replacement" or introduces some other new pattern, but it's working for me:
select replace(t.value, 'my_replacement', ', ')
from table(
split_to_table(replace('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ', ', 'my_replacement'),',')) t

select latest date from string array

Below is the array of strings, i.e. the names of the files and folders that I will get in the array. Now from this array I need to select the newest macro file, i.e. among all the strings in the array which end with xlsm, I will select the one which contains the string 20200817_W.xlsm, which is the latest file kept.
Edit:
For a minimal reproduction of the problem, we are talking about a string array like the one below:
{IOH Bot Files, Archive, IOH_AllPlants_BI_20200817_W.xlsm, IOH_AllPlants_BI_20200817_W.xlsm, ... }
From this array I need to choose IOH_AllPlants_BI_20200817_W.xlsm, because this string has a date component in it and it is the latest in the available list of strings.
You don't even need LINQ for this one, because of the regularity of the data:
Array.Sort(arr)
Dim latestOne = arr(arr.Length - 1)
Perhaps we should ensure only xlsm files of the right name are considered:
Dim arr2 = Array.FindAll(arr, Function(x) x.StartsWith("IOH_AllPlants_BI_") AndAlso x.EndsWith("xlsm"))
Array.Sort(arr2)
Dim latestOne = arr2(arr2.Length - 1)
We could use LINQ, and (keeping our "only matching names" logic) rather than using an expensive sort, just ask for the Max:
Dim onlyIOHXLSMFiles = arr.Where(Function(x) x.StartsWith("IOH_AllPlants_BI_") AndAlso x.EndsWith("xlsm"))
Dim latestOne = onlyIOHXLSMFiles.Max()
We don't need to parse this date because it's yyyyMMdd; it sorts just fine as a string. Because it's just a simple string value, it is fine to use Max directly, which is more efficient than the typical "OrderBy/First" approach.
If the list were not of a simple type, but instead was e.g. Person, and you wanted the most recently born Person (rather than just their birthdate, which is what Max would give you), you could:
Dim lastOne = personArr.OrderBy(Function(p) p.Birthdate).Last()
I use OrderBy/Last rather than OrderByDescending/First because it's fewer characters to type for the same effect.
All these code samples (with the exception of the last one) make use of an array arr created like:
Dim arr = {"IOH Bot Files", "Archive", "IOH_AllPlants_BI_20200817_W.xlsm", "IOH_AllPlants_BI_20200817_W.xlsm", ... }
See the sample logic below, which returns what you need; you can change it according to your needs. Read the file names into an array and apply the logic below.
using System;
using System.Linq;

namespace SampleConsoleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            string[] data = new string[] {
                "IOH_AllPlants_BI_20200810_W.xlsm",
                "IOH_AllPlants_BI_20200803_W.xlsm",
                "IOH_AllPlants_BI_20200727_W.xlsm",
                "IOH_AllPlants_BI_20200720_W.xlsm",
                "IOH_AllPlants_BI_20200817_W.xlsm",
                "IOH_AllPlants_BI_20200713_W.xlsm",
                "IOH_AllPlants_BI_20200706_W.xlsm" };

            // Split each name on '_', take the date token (index 3) and pick the latest.
            var result = data.Select(s => s.Split('_')).Select(x => x[3]).OrderByDescending(x => x).First();
            // result returns 20200817
        }
    }
}
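If you need the full file name rather than just the date part, a variation of the query above would order by the date token but keep the whole string. This is just a sketch, assuming the IOH_AllPlants_BI_yyyyMMdd_W.xlsm naming convention shown above:
// Sketch: keep the whole file name, ordering by the yyyyMMdd token (index 3 after splitting on '_').
var latestFile = data
    .Where(s => s.EndsWith(".xlsm", StringComparison.OrdinalIgnoreCase))
    .OrderByDescending(s => s.Split('_')[3])
    .First();
// latestFile == "IOH_AllPlants_BI_20200817_W.xlsm"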

How can I load in a pipe (|) delimited text file that has columns that sometimes contain line breaks?

I have built an SSIS package that loads several delimited text files into a SQL database. One of the files often contains line breaks in it, which breaks the standard data flow task of setting a flat file source and mapping to an ADO.NET destination, since it thinks it is on a new row when it reaches a line break. The vendor sending over the files does not want to edit the files before sending them and can't do XML at this time. Is there any way to fix this?
I was thinking of writing a small VB.NET program that would correct the files so they would work in the SSIS package, but I'm not sure how to write that logic. The file has 5 columns: the first 2 are big integers and always contain some long integer ID, then there is a small text column that just contains one short word, then a date, and then a long comments field that is causing the problem. The comments field is sometimes blank (which is OK); the problem is the rows that have line breaks. I never know how many line breaks are in the comments: some have none, some can have several, even multiple line breaks in a row, so I was wondering if this is even possible.
5787626|6547599|Approved|1/10/2017|Applicant request for fee waiver approved
5443221|7742812|Active|11/5/2013|
3430962|7643957|Re-Scheduled|5/25/2016|REVISED TERMS AND CONDITIONS REJECTED
Applicant has 30 DAYS To submit paperwork for extension.
34433624|7673715|Denied|1/24/2017|
34113575|7653748|Active|1/8/2014|New terms have been granted.
Sample File Format.
As long as there is logic that you can program/predict, it will be possible.
I would do it using a Script Component as a source, which means you don't need to rewrite the file before processing it. It also provides a lot of flexibility, e.g., you can store values in variables while iterating over multiple lines in the file, etc.
I posted another answer recently that gives a lot of detail on how to go about this: SSIS import a Flat File to SQL with the first row as header and last row as a total.
An example of holding the values in variables until the row is ready to be written:
For this example I am writing three columns, ID1, ID2 and Comments. The file looks like this:
1|2|Comment1
Comment2
4|5|Comment3
Comment4
Comment5
6|7|Comment6
The Script Component contains the following method.
public override void CreateNewOutputRows()
{
    System.IO.StreamReader reader = null;
    try
    {
        bool readFirstLine = false;
        int id1 = 0;
        int id2 = 0;
        string comments = null;

        // This refers to a package variable that contains the file path.
        reader = new System.IO.StreamReader(Variables.FilePath);

        while (!reader.EndOfStream)
        {
            string line = reader.ReadLine();
            if (line.Contains("|"))
            {
                // A delimited line starts a new record, so emit the one we have been buffering.
                if (readFirstLine)
                {
                    Output0Buffer.AddRow();
                    Output0Buffer.ID1 = id1;
                    Output0Buffer.ID2 = id2;
                    Output0Buffer.Comments = comments;
                }
                else
                {
                    readFirstLine = true;
                }

                string[] fields = line.Split('|');
                id1 = Convert.ToInt32(fields[0]);
                id2 = Convert.ToInt32(fields[1]);
                comments = fields[2];
            }
            else
            {
                // A line without a pipe is a continuation of the comments column.
                comments += " " + line;
            }

            // Emit the final buffered record once the end of the file is reached.
            if (reader.EndOfStream)
            {
                Output0Buffer.AddRow();
                Output0Buffer.ID1 = id1;
                Output0Buffer.ID2 = id2;
                Output0Buffer.Comments = comments;
            }
        }
    }
    finally
    {
        // Close the reader whether or not an exception occurred.
        if (reader != null)
        {
            reader.Close();
            reader.Dispose();
        }
    }
}
The result set is:
ID1 ID2 Comments
=== === ========
1 2 Comment1 Comment2
4 5 Comment3 Comment4 Comment5
6 7 Comment6
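If you would still prefer your original idea of fixing the file before the SSIS package runs, a small console pre-processor could fold the continuation lines back into their records. This is only a sketch (written in C# rather than VB.NET); it assumes a real record always starts with two pipe-delimited numeric IDs, as in your sample, and the file paths are hypothetical command-line arguments:
using System.IO;
using System.Text.RegularExpressions;

class FlattenLineBreaks
{
    static void Main(string[] args)
    {
        // Hypothetical usage: FlattenLineBreaks.exe <input file> <output file>
        string inputPath = args[0];
        string outputPath = args[1];

        // Heuristic: a real record starts with "<digits>|<digits>|".
        var recordStart = new Regex(@"^\d+\|\d+\|");

        string buffer = null;
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (recordStart.IsMatch(line))
                {
                    // A new record begins, so flush the previous one first.
                    if (buffer != null) writer.WriteLine(buffer);
                    buffer = line;
                }
                else if (buffer != null)
                {
                    // Continuation of the comments column: fold it into the current record.
                    buffer += " " + line;
                }
            }
            if (buffer != null) writer.WriteLine(buffer);
        }
    }
}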

How to get all stored values not terms for a field in Lucene.Net?

I saw an example of extracting all available terms for a field here
The reason it doesn't fit my purposes is that terms and stored values are different; e.g. the stored value "black cat" will be represented as two terms, "black" and "cat". In my code I need to extract whole stored values, in this case "black cat".
Yes, you could do that. I'm not a C# programmer, but hopefully you will understand the Java code.
IndexReader reader = DirectoryReader.open(dir);
final int len = reader.maxDoc();
for (int i = 0; i < len; ++i) {
    Document document = reader.document(i);
    List<IndexableField> fields = document.getFields();
    for (IndexableField field : fields) {
        if (field.fieldType().stored()) {
            System.out.println(field.stringValue());
        }
    }
}
So, basically, I'm traversing all docs, getting all fields, and if they are stored, getting the data. You could filter by the names of the fields that you need.
Full test could be found here - https://raw.githubusercontent.com/MysterionRise/information-retrieval-adventure/master/src/main/java/org/mystic/GetAllStoredFieldValues.java (also with the proof, that it works correctly)
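A rough C# translation of the same idea, assuming the Lucene.Net 3.0.3 API (IndexReader.Open, GetFields(), IsStored, StringValue; the names differ in the 4.8 betas), might look like this:
// Sketch: print every stored field value in the index, assuming the Lucene.Net 3.0.3 API.
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class StoredValueDumper
{
    public static void DumpStoredValues(Directory dir)
    {
        using (IndexReader reader = IndexReader.Open(dir, true))
        {
            for (int i = 0; i < reader.MaxDoc; i++)
            {
                if (reader.IsDeleted(i))        // skip deleted documents
                    continue;

                Document document = reader.Document(i);
                foreach (IFieldable field in document.GetFields())
                {
                    if (field.IsStored)         // only stored fields keep the original value
                    {
                        System.Console.WriteLine(field.StringValue);
                    }
                }
            }
        }
    }
}
As in the Java version, you could filter on field.Name if you only need specific fields.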

How can I generate schema from text file? (Hadoop-Pig)

Somehow I got filename.log, which looks, for example, like this (tab separated):
Name:Peter Age:18
Name:Tom Age:25
Name:Jason Age:35
Because the value of the key column may differ, I cannot define a schema when I load the text like
a = load 'filename.log' as (Name:chararray,Age:int);
Neither do I want to call the columns by position like
b = foreach a generate $0,$1;
What I want to do is, from only that filename.log, make it possible to call each value by key, for example
a = load 'filename.log' using PigStorage('\t');
b = group a by Name;
c = foreach b generate group, COUNT(b);
dump c;
For that purpose, I wrote a Java UDF which separates key:value and gets the value for every field in the tuple, as below:
public class SPLITALLGETCOL2 extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) {
        TupleFactory mTupleFactory = TupleFactory.getInstance();
        ArrayList<String> mProtoTuple = new ArrayList<String>();
        Tuple output;
        // Strip the surrounding parentheses from the tuple's string form, e.g. "(Name:Peter,Age:18)".
        String target = input.toString().substring(1, input.toString().length() - 1);
        String[] tokenized = target.split(",");
        try {
            for (int i = 0; i < tokenized.length; i++) {
                // Keep only the value part of each key:value token.
                mProtoTuple.add(tokenized[i].split(":")[1]);
            }
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        } catch (Exception e) {
            output = mTupleFactory.newTupleNoCopy(mProtoTuple);
            return output;
        }
    }
}
How should I alter this method to get what I want? Or how should I write another UDF to get there?
Whatever you do, don't use a tuple to store the output. Tuples are intended to store a fixed number of fields, where you know what every field contains. Since you don't know that the keys will be in Name,Age form (or even exist, or that there won't be more) you should use a bag. Bags are unordered sets of tuples. They can have any number of tuples in them as long as the tuples have the same schema. These are all valid bags for the schema B: {T:(key:chararray, value:chararray)}:
{(Name,Foo),(Age,Bar)}
{(Age,25),(Name,Jim)}
{(Name,Bob)}
{(Age,30),(Name,Roger),(Hair Color,Brown)}
{(Hair Color,),(Name,Victor)} -- Note the Null value for Hair Color
However, it sounds like you really want a map:
myudf.py
@outputSchema('M:map[]')
def mapize(the_input):
    out = {}
    for kv in the_input.split(' '):
        k, v = kv.split(':')
        out[k] = v
    return out
myscript.pig
register '../myudf.py' using jython as myudf ;
A = LOAD 'filename.log' AS (total:chararray) ;
B = FOREACH A GENERATE myudf.mapize(total) ;
-- Sample usage, grouping by the name key.
C = GROUP B BY M#'Name' ;
Using the # operator you can pull a value out of the map by the key you give. You can read more about maps here.