I have a column which I am splitting in Snowflake.
The format is as follows:
I have been using split_to_table(A, ',') inside of my query but as you can probably tell this uncorrectly also splits the Scooter > Sprinting, Jogging and Walking record.
Perhaps having the delimiter only work if there is no spaced on either side of it? As I cannot see a different condition that could work.
I have been researching online but haven't found a suitable work around yet, is there anyone that encountered a similar problem in the past?
Thanks
This is a custom rule for the split to table, so we can use a UDTF to apply a custom rule:
create or replace function split_to_table2(STR string, DELIM string, ROW_MUST_CONTAIN string)
returns table (VALUE string)
language javascript
strict immutable
as
$$
{
initialize: function (argumentInfo, context) {
},
processRow: function (row, rowWriter, context) {
var buffer = "";
var i;
const s = row.STR.split(row.DELIM);
for(i=0; i<s.length-1; i++) {
buffer += s[i];
if(s[i+1].includes(row.ROW_MUST_CONTAIN)) {
rowWriter.writeRow({VALUE: buffer});
buffer = "";
} else {
buffer += row.DELIM
}
}
rowWriter.writeRow({VALUE: s[i]})
},
}
$$;
select VALUE from
table(split_to_table2('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ',', '>'))
;
Output:
VALUE
Car > Bike
Bike > Scooter
Scooter > Sprinting, Jogging and Walking
Walking > Flying
This UDTF adds one more parameter than the two in the build in table function split_to_table. The third parameter, ROW_MUST_CONTAIN is the string a row must contain. It splits the string on DELIM, but if it does not have the ROW_MUST_CONTAIN string, it concatenates the strings to form a complete string for a row. In this case we just specify , for the delimiter and > for ROW_MUST_CONTAIN.
We can get a little clever with regexp_replace by replacing the actual delimiters with something else before the table split. I am using double pipes '||' but you can change that to something else. The '\|\|\\1' trick is called back-referencing that allows us to include the captured group (\\1) as part of replacement (\|\|)
set str='car>bike,bike>car,truck, and jeep,horse>cat,truck>car,truck, and jeep';
select $str, *
from table(split_to_table(regexp_replace($str,',([^>,]+>)','\|\|\\1'),'||'))
Yes, you are right. The only pattern, which I can see, is the one with the whitespace after the comma.
It's a small workaround but we can make use of this pattern. In below code I am replacing such commas, where we do have whitespaces afterwards. Then I am applying split to table function and I am converting the previous replacement back.
It's not super pretty and would crash if your string contains "my_replacement" or any other new pattern, but its working for me:
select replace(t.value, 'my_replacement', ', ')
from table(
split_to_table(replace('Car > Bike,Bike > Scooter,Scooter > Sprinting, Jogging and Walking,Walking > Flying', ', ', 'my_replacement'),',')) t
Related
Is there an easier approach to convert an Intellij IDEA environment variable into a list of Tuples?
My environment variable for Intellij is
GROCERY_LIST=[("egg", "dairy"),("chicken", "meat"),("apple", "fruit")]
The environment variable gets accessed into Kotlin file as String.
val g_list = System.getenv("GROCERY_LIST")
Ideally I'd like to iterate over g_list, first element being ("egg", "dairy") and so on.
And then ("egg", "dairy") is a tuple/pair
I have tried to split g_list by comma that's NOT inside quotes i.e
val splitted_list = g_list.split(",(?=(?:[^\\\"]*\\\"[^\\\"]*\\\")*[^\\\"]*\$)".toRegex()).toTypedArray()
this gives me first element as [("egg", second element as "dairy")] and so on.
Also created a data class and tried to map the string into data class using jacksonObjectMapper following this link:
val mapper = jacksonObjectMapper()
val g_list = System.getenv("GROCERY_LIST")
val myList: List<Shopping> = mapper.readValue(g_list)
data class Shopping(val a: String, val b: String)
You can create a regular expression to match all strings in your environmental variable.
Regex::findAll()
Then loop through the strings while creating a list of Shopping objects.
// Raw data set.
val groceryList: String = "[(\"egg\", \"dairy\"),(\"chicken\", \"meat\"),(\"apple\", \"fruit\")]"
// Build regular expression.
val regex = Regex("\"([\\s\\S]+?)\"")
val matchResult = regex.findAll(groceryList)
val iterator = matchResult.iterator()
// Create a List of `Shopping` objects.
var first: String = "";
var second: String = "";
val shoppingList = mutableListOf<Shopping>()
var i = 0;
while (iterator.hasNext()) {
val value = iterator.next().value;
if (i % 2 == 0) {
first = value;
} else {
second = value;
shoppingList.add(Shopping(first, second))
first = ""
second = ""
}
i++
}
// Print Shopping List.
for (s in shoppingList) {
println(s)
}
// Output.
/*
Shopping(a="egg", b="dairy")
Shopping(a="chicken", b="meat")
Shopping(a="apple", b="fruit")
*/
data class Shopping(val a: String, val b: String)
Never a good idea to use regex to match parenthesis.
I would suggest a step-by-step approach:
You could first match the name and the value by
(\w+)=(.*)
There you get the name in group 1 and the value in group 2 without caring about any subsequent = characters that might appear in the value.
If you then want to split the value, I would get rid of start and end parenthesis first by matching by
(?<=\[\().*(?=\)\])
(or simply cut off the first and last two characters of the string, if it is always given it starts with [( and ends in )])
Then get the single list entries from splitting by
\),\(
(take care that the split operation also takes a regex, so you have to escape it)
And for each list entry you could split that simply by
,\s*
or, if you want the quote character to be removed, use a match with
\"(.*)\",\s*\"(.*)\"
where group 1 contains the key (left of equals sign) and group 2 the value (right of equals sign)
I rather have this ugly way of building a string from a list as:
val input = listOf("[A,B]", "[C,D]")
val builder = StringBuilder()
builder.append("Serialized('IDs((")
for (pt in input) {
builder.append(pt[0] + " " + pt[1])
builder.append(", ")
}
builder.append("))')")
The problem is that it adds a comma after the last element and if I want to avoid that I need to add another if check in the loop for the last element.
I wonder if there is a more concise way of doing this in kotlin?
EDIT
End result should be something like:
Serialized('IDs((A B,C D))')
In Kotlin you can use joinToString for this kind of use case (it deals with inserting the separator only between elements).
It is very versatile because it allows to specify a transform function for each element (in addition to the more classic separator, prefix, postfix). This makes it equivalent to mapping all elements to strings and then joining them together, but in one single call.
If input really is a List<List<String>> like you mention in the title and you assume in your loop, you can use:
input.joinToString(
prefix = "Serialized('IDs((",
postfix = "))')",
separator = ", ",
) { (x, y) -> "$x $y" }
Note that the syntax with (x, y) is a destructuring syntax that automatically gets the first and second element of the lists inside your list (parentheses are important).
If your input is in fact a List<String> as in listOf("[A,B]", "[C,D]") that you wrote at the top of your code, you can instead use:
input.joinToString(
prefix = "Serialized('IDs((",
postfix = "))')",
separator = ", ",
) { it.removeSurrounding("[", "]").replace(",", " ") }
val input = listOf("[A,B]", "[C,D]")
val result =
"Serialized('IDs((" +
input.joinToString(",") { it.removeSurrounding("[", "]").replace(",", " ") } +
"))')"
println(result) // Output: Serialized('IDs((A B,C D))')
Kotlin provides an extension function [joinToString][1] (in Iterable) for this type of purpose.
input.joinToString(",", "Serialized('IDs((", "))')")
This will correctly add the separator.
I want to keep white space when I call text attribute of token, is there any way to do it?
Here is the situation:
We have the following code
IF L > 40 THEN;
ELSE
IF A = 20 THEN
PUT "HELLO";
In this case, I want to transform it into:
if (!(L>40){
if (A=20)
put "hello";
}
The rule in Antlr is that:
stmt_if_block: IF expr
THEN x=stmt
(ELSE y=stmt)?
{
if ($x.text.equalsIgnoreCase(";"))
{
WriteLn("if(!(" + $expr.text +")){");
WriteLn($stmt.text);
Writeln("}");
}
}
But the result looks like:
if(!(L>40))
{
ifA=20put"hello";
}
The reason is that the white space in $stmt was removed. I was wondering if there is anyway to keep these white space
Thank you so much
Update: If I add
SPACE: [ ] -> channel(HIDDEN);
The space will be preserved, and the result would look like below, many spaces between tokens:
IF SUBSTR(WNAME3,M-1,1) = ')' THEN M = L; ELSE M = L - 1;
This is the C# extension method I use for exactly this purpose:
public static string GetFullText(this ParserRuleContext context)
{
if (context.Start == null || context.Stop == null || context.Start.StartIndex < 0 || context.Stop.StopIndex < 0)
return context.GetText(); // Fallback
return context.Start.InputStream.GetText(Interval.Of(context.Start.StartIndex, context.Stop.StopIndex));
}
Since you're using java, you'll have to translate it, but it should be straightforward - the API is the same.
Explanation: Get the first token, get the last token, and get the text from the input stream between the first char of the first token and the last char of the last token.
#Lucas solution, but in java in case you have troubles in translating:
private String getFullText(ParserRuleContext context) {
if (context.start == null || context.stop == null || context.start.getStartIndex() < 0 || context.stop.getStopIndex() < 0)
return context.getText();
return context.start.getInputStream().getText(Interval.of(context.start.getStartIndex(), context.stop.getStopIndex()));
}
Looks like InputStream is not always updated after removeLastChild/addChild operations. This solution helped me for one grammar, but it doesn't work for another.
Works for this grammar.
Doesn't work for modern groovy grammar (for some reason inputStream.getText contains old text).
I am trying to implement function name replacement like this:
enterPostfixExpression(ctx: PostfixExpressionContext) {
// Get identifierContext from ctx
...
const token = CommonTokenFactory.DEFAULT.createSimple(GroovyParser.Identifier, 'someNewFnName');
const node = new TerminalNode(token);
identifierContext.removeLastChild();
identifierContext.addChild(node);
UPD: I used visitor pattern for the first implementation
I have a source data set which consists of text files where the columns are separated by one or more spaces, depending on the width of the column value. The data is right adjusted, i.e. the spaces are added before the actual data.
Can I use one of the built-in extractors or do I have to implement a custom extractor?
#wBob's solution works if your row fits into a string (128kB). Otherwise, write your custom extractor that does fixed with extraction. Depending on what information you have on the format, you can write it by using input.Split() to split into rows and then split the rows based on your whitespace rules as shown below (full example for Extractor pattern is here) or you could write one similar to the one described in this blog post.
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
{
foreach (Stream current in input.Split(this._row_delim))
{
using (StreamReader streamReader = new StreamReader(current, this._encoding))
{
int num = 0;
string[] array = streamReader.ReadToEnd().Split(new string[]{this._col_delim}, StringSplitOptions.None).Where(x => !String.IsNullOrWhiteSpace(x)));
for (int i = 0; i < array.Length; i++)
{
// Now write your code to convert array[i] into the extract schema
}
}
yield return outputrow.AsReadOnly();
}
}
}
You could create a custom extractor or more simply, import the data as one row then split and clean and it using c# methods available to you within U-SQL like Split and IsNullOrWhiteSpace, something like this:
My right-aligned sample data
// Import the row as one column to be split later; NB use a delimiter that will NOT be in the import file
#input =
EXTRACT rawString string
FROM "/input/input.txt"
USING Extractors.Text(delimiter : '|');
// Add a row number to the line and remove white space elements
#working =
SELECT ROW_NUMBER() OVER() AS rn, new SqlArray<string>(rawString.Split(' ').Where(x => !String.IsNullOrWhiteSpace(x))) AS columns
FROM #input;
// Prepare the output, referencing the column's position in the array
#output =
SELECT rn,
columns[0] AS id,
columns[1] AS firstName,
columns[2] AS lastName
FROM #working;
OUTPUT #output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
My results:
HTH
Is there a built-in function to extract all characters in a string up until the first occurrence of a space?
Say the string is:
Methicillin-resistant staphylococcus aureus
I want to be able to get the substring:
Methicillin-resistant
You can do it in two functions:
newstring = mystring.Substring(0, mystring.IndexOf(" "))
Although that will fail if there's no space in mystring.
So you could pull out mystring.IndexOf(" ") into a variable and check whether it's -1 (no space found) before you try to use it in Substring.
The first solution you can use is a simple IndexOf
string GetFirstWord(string source)
{
int index = source.IndexOf(" ");
if (index == -1) return source;
else return source.Substring(0, index);
}
The second solution can be used if you want to keep all words into a string array.
string[] GetWords(string source)
{
return source.Split(' ');
}
if you only want the first word, you can use it like this :
string word = GetWords("Methicillin-resistant staphylococcus aureus")[0];
And a VB.NET solution. No, it can't be done with one built-in method; you need two:
Left(myString, InStr(myString, " ") - 1)
And like the other solutions you need to check InStr doesn't return 0 if myString may not contain a space.