Finding whether a line in a CLOB starts with a specified string, then extracting a value from it - SQL

I have a CLOB column that I want to search for a line that starts with '305', then extract something from that line. Some of my rows will have multiple lines that start with '305', or have '305' somewhere else in the cell, so I only want to find the first line that starts with '305'. The entire cell content is split into lines like this:
301|10500000908|
302|20171021|20171104|
303|00001|8306.7|
302|20171008|20171020|
303|00001|13174.5|
302|20170704|20171007|
303|00001|2508.7|
302|20170419|20170703|
303|00001|6962.9|
302|20170330|20170418|
303|00001|7628.2|
302|20170305|20170329|
--- my instr(dbms_lob.substr(flow_data, 4000, 1 ),'305|', 1, 1) keeps finding this line
303|00001|8489.1|
302|20170120|20170304|
303|00001|1997.9|
302|20161021|20170119|
303|00001|12359.8|
302|20160722|20161020|
303|00001|7354.0|
302|20160516|20160721|
303|00001|26.4|
304|20171105|
305|00001|5936.1|
--- i want to find this line and then extract the '5936.1' from it
304|20171021|
305|00001|5710.4|
304|20171008|
305|00001|5163.1|
304|20170704|
304|20170419|
305|00001|7390.8|
304|20170330|
305|00001|7363.2|
304|20170305|
305|00001|7181.4|
304|20170120|
305|00001|9200.2|
304|20161021|
305|00001|4791.3|
305|00001|2877.5|
304|20160516|
305|00001|4116.9|
306|0393|20160511|
307|SUPP|20160511|
310|A|20160511|
311|E|20160516|
When I use instr(dbms_lob.substr(flow_data, 4000, 1 ),'305|', 1, 1) it keeps finding the wrong line. By the way, there are no gaps between the lines; I inserted them to keep the text separated.
Thanks all
Mac

If I follow you correctly, you can use regexp_substr():
select regexp_substr(flow_data, '^305\|[^|]*\|([^|]*)', 1, 1, 'm', 1) as val
from t
Argument breakdown:
flow_data: the value to search (CLOBs are allowed)
'^305\|[^|]*\|([^|]*)': the regex. We search for 305 at the beginning of a line and capture the third pipe-delimited value on that line
1: start the search at the beginning of the source string
1: return the first match
'm': multiline mode, so ^ matches at the beginning of each line
1: return the first captured group of the match
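For readers who want to try the pattern outside the database, here is a minimal Python sketch of the same idea. It is not part of the original answer: the shortened `flow_data` string is an assumption built from a few of the question's sample lines, and Python's `re.MULTILINE` stands in for the 'm' match parameter. It also shows why a plain substring search lands on the date line first.

```python
import re

# A shortened copy of the sample cell from the question (lines joined by "\n").
flow_data = "\n".join([
    "301|10500000908|",
    "302|20170305|20170329|",
    "303|00001|8489.1|",
    "304|20171105|",
    "305|00001|5936.1|",
    "304|20171021|",
    "305|00001|5710.4|",
])

# A plain substring search also hits "305|" inside the date 20170305,
# which is the same mistake the instr() call makes.
naive_pos = flow_data.find("305|")

# re.MULTILINE makes ^ match at the start of every line, like the 'm' flag.
pattern = re.compile(r"^305\|[^|]*\|([^|]*)", re.MULTILINE)
match = pattern.search(flow_data)
val = match.group(1) if match else None

print(val)  # 5936.1
```

Anchoring the pattern with ^ is what keeps the date line from matching; the capture group then picks out the third field of the first real 305 line.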

Related

Remove the last punctuation in list of numbers in Python

I have a variable containing numbers and letters, and I want code that removes the apostrophes between the items, keeping only the first and last apostrophe of the variable. The desired output is shown below.
numbers = 'V7780T103', '494368103', '003654100', '26210C104'
output should be
numbers = 'V7780T103, 494368103, 003654100, 26210C104'
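One possible approach, offered as a sketch rather than an accepted answer: the first assignment in the question creates a tuple of four strings, so joining the items with ", " produces the single string shown as the desired output. The name `joined` is only illustrative.

```python
# The assignment in the question creates a tuple of four strings.
numbers = 'V7780T103', '494368103', '003654100', '26210C104'

# str.join concatenates the items with ", " between them, giving one string.
joined = ", ".join(numbers)

print(joined)  # V7780T103, 494368103, 003654100, 26210C104
```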

Perl6 split function adding extra elements to array

my @r = split("", "hi");
say @r.elems;
--> output: 4
split is adding two extra elements to the array, one at the beginning and another at the end.
I have to do shift and pop after every split to correct for this.
Is there a better way to split a string?
If you're splitting on the empty string, you will get an empty element at the start and the end of the returned list as there is also an empty string before and after the string.
What you want is .comb without parameters, written out completely functionally:
"hi".comb.elems.say; # 2
See https://docs.raku.org/routine/comb#(Str)_routine_comb for more info.
The reason for this is that when you use an empty Str “” as the delimiter, it is the same as if you had used the regex /<|wb>/, which matches next to characters. So it also matches before the first character and after the last character. Perl 5 removes these “extra” strings for you in this case (and in this case only), which is likely where the confusion lies.
What Perl 6 does instead is allow you to explicitly :skip-empty values
'hi'.split('') :skip-empty
'hi'.split('', :skip-empty)
split("", "hi") :skip-empty
split("", "hi", :skip-empty)
Or to specify what you actually want
'hi'.comb( /./ )
'hi'.comb( 1 )
'hi'.comb
comb( /./, 'hi' )
comb( 1, 'hi' )
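The same effect can be observed in other languages. As a rough analogy only (this is Python, not Perl 6, and assumes Python 3.7 or later, where `re.split` accepts patterns that match the empty string), splitting on an empty pattern also yields empty strings at both ends, filtering them afterwards plays the role of :skip-empty, and asking for the characters directly plays the role of .comb:

```python
import re

# Splitting on an empty pattern (Python 3.7+) also matches before the first
# and after the last character, so empty strings appear at both ends.
parts = re.split("", "hi")

# Dropping the empty pieces afterwards mirrors :skip-empty.
non_empty = [p for p in parts if p]

# Asking for the characters directly mirrors .comb.
chars = list("hi")

print(len(parts), len(non_empty), len(chars))  # 4 2 2
```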

Find Each Occurrence of X and Insert a Carriage Return

A colleague has some data he is putting into a flat file (.txt) and needs to insert a carriage return before EACH occurrence of 'POL01', 'SUB01','VEH01','MCO01'.
I did use:
For Each line1 As String In System.IO.File.ReadAllLines(BodyFileLoc)
    If line1.Contains("POL01") Or line1.Contains("SUB01") Or line1.Contains("VEH01") Or line1.Contains("MCO01") Then
        Writer.WriteLine(Environment.NewLine & line1)
    Else
        Writer.WriteLine(line1)
    End If
Next
But unfortunately it turns out that the file is not formatted in 'lines' by SSIS but as one whole string.
How can I insert a carriage return before every occurrence of the above?
Test Text
POL01CALT302276F 332 NBPM 00101 20151113201511130001201611132359 2015111300010020151113000100SUB01CALT302276F 332 NBPMP01 Akl Abi-Khalil 19670131 M U33 Stoford Close SW19 6TJ 2015111300010020151113000100VEH01CALT302276F 332 NBPM001LV56 LEJ N 2006VAUXHALL CA 2015111300010020151113000100MCO01CALT302276F 332 NBPM0101 0 2015111300010020151113000100POL01CALT742569N
You can use regular expressions for this, specifically by using Regex.Replace to find and replace each occurrence of the strings you're looking for with a newline followed by the matching text:
' Requires: Imports System.Text.RegularExpressions
Dim str As String = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
Dim output As String = Regex.Replace(str, "((?:POL|SUB|VEH|MCO)01)", Environment.NewLine + "$1")
'output contains:
'xxx
'POL01xxx
'SUB01xxx
'VEH01xxx
'MCO01xxx
There may be a better way to construct this regular expression, but this is a simple alternation on the different letters, followed by 01. This matched text is represented by the $1 in the replacement string.
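To experiment with the same pattern outside .NET, here is a small Python sketch of the replacement. It is only an illustration: the helper name `break_before_records` and its `newline` parameter are hypothetical, and "\n" is used as the default line break where the VB code uses Environment.NewLine.

```python
import re

def break_before_records(text: str, newline: str = "\n") -> str:
    """Insert a line break before each POL01, SUB01, VEH01 or MCO01 marker."""
    # The capture group holds the matched marker, which is re-emitted after the break.
    return re.sub(r"((?:POL|SUB|VEH|MCO)01)", lambda m: newline + m.group(1), text)

sample = "xxxPOL01xxxSUB01xxxVEH01xxxMCO01xxx"
output = break_before_records(sample)
print(output)
```

One thing to watch for: if the text itself begins with one of the markers, as the test text in the question does, the result starts with an empty first line, which may need trimming.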
If you're new to regular expressions, there are a number of tools that help you understand them. For example, regex101.com will show you an explanation of the one I have used here.

AWK: Ignore lines grouped by an unique value conditioned on occurrences of a specific field value

Please help revise the title and the post if needed, thanks.
In short, I would like to group lines by the unique value in the first field and count the occurrences of a specific value in another field within each group. If the count does not meet a self-defined threshold, the lines in that group should be ignored.
Specifically, with input
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
222,1,M,1
222,1,M,0
333,1,P,0
333,1,P,1
444,1,M,1
444,1,M,1
444,0,M,0
555,1,P,1
666,1,P,0
the desired output should be
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
meaning that because the first-field values 222 and 444 do not have at least one P in the third field (one is the threshold here, but it could be any desired value), the lines corresponding to 222 and 444 are ignored.
Furthermore, this should be done without editing the original file, and it has to be combined with the solved issue Split CSV to Multiple Files Containing a Set Number of Unique Field Values. As a result, a few lines will not appear in the resulting split files.
I believe this one-liner does what you want:
$ awk -F, '{a[$1,++c[$1]]=$0}$3=="P"{p[$1]}END{for(i in c)if(i in p)for(j=1;j<=c[i];++j)print a[i,j]}' file
111,1,P,1
111,1,P,1
111,1,P,0
111,1,M,1
333,1,P,0
333,1,P,1
555,1,P,1
666,1,P,0
Array a keeps track of all the lines in the file, keyed by the first field and a per-group counter c, which we use later. If the third field is equal to P, set a key in the p array.
After processing the entire file, loop through all the values of the first field. If a key has been set in p for the value, then print the lines from a. Note that awk does not specify the order in which for(i in c) visits the keys, so the groups are not guaranteed to come out in input order.
You mention a threshold number of entries in your question. If by that, you mean that there must be N occurrences of "P" in order for the lines to be printed, you could change {p[$1]} to {++p[$1]}, then change if(i in p) to if(p[i]>=N) in the END block.
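The same grouping logic, including the threshold variant, can be sketched in Python. This is an illustration rather than a translation of the one-liner: the function name `filter_groups` and its parameters are hypothetical, it expects the lines as a list, and unlike the awk version it keeps the lines in their original input order.

```python
from collections import Counter

def filter_groups(lines, value="P", threshold=1):
    """Keep lines whose first-field group has at least `threshold`
    rows with `value` in the third field."""
    counts = Counter()
    # First pass: count matching rows per group key (first field).
    for line in lines:
        fields = line.split(",")
        if fields[2] == value:
            counts[fields[0]] += 1
    # Second pass: keep only lines from groups that reach the threshold.
    return [line for line in lines if counts[line.split(",")[0]] >= threshold]

rows = [
    "111,1,P,1", "111,1,P,1", "111,1,P,0", "111,1,M,1",
    "222,1,M,1", "222,1,M,0",
    "333,1,P,0", "333,1,P,1",
    "444,1,M,1", "444,1,M,1", "444,0,M,0",
    "555,1,P,1", "666,1,P,0",
]
kept = filter_groups(rows)
print("\n".join(kept))
```

Calling `filter_groups(rows, threshold=2)` would keep only the 111 and 333 groups, since they are the only ones with two or more P rows.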

How to load 2D array from a text(csv) file into Octave?

Consider the following text(csv) file:
1, Some text
2, More text
3, Text with comma, more text
How to load the data into a 2D array in Octave? The number can go into the first column, and all text to the right of the first comma (including other commas) goes into the second text column.
If necessary, I can replace the first comma with a different delimiter character.
AFAIK you cannot put strings of different sizes into an array. You need to create a so-called cell array.
A possible way to read the data from your question stored in a file Test.txt into a cell array is
t1 = textread("Test.txt", "%s", "delimiter", "\n");
for i = 1:length(t1)
  j = findstr(t1{i}, ",")(1);
  T{i,1} = t1{i}(1:j - 1);
  T{i,2} = strtrim(t1{i}(j + 1:end));
end
Now
T{3,1} gives you 3 and
T{3,2} gives you Text with comma, more text.
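The core of this approach is splitting each line at the first comma only. As a language-neutral illustration of that step, here is a short Python sketch; the function name `parse_rows` is hypothetical, and unlike the Octave loop above it converts the first column to an integer instead of keeping it as text.

```python
def parse_rows(lines):
    """Split each line at the first comma into (number, text)."""
    rows = []
    for line in lines:
        # maxsplit=1 leaves any later commas inside the text column.
        number, text = line.split(",", 1)
        rows.append((int(number), text.strip()))
    return rows

rows = parse_rows([
    "1, Some text",
    "2, More text",
    "3, Text with comma, more text",
])
print(rows[2])  # (3, 'Text with comma, more text')
```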
After many long hours of searching and debugging, here's how I got it to work on Octave 3.2.4, using | as the delimiter (instead of a comma).
The data file now looks like:
1|Some text
2|More text
3|Text with comma, more text
Here's how to call it: data = load_data('data/data_file.csv', NUMBER_OF_LINES);
Limitation: You need to know how many lines you want to get. If you want to get all, then you will need to write a function to count the number of lines in the file in order to initialize the cell_array. It's all very clunky and primitive. So much for "high level languages like Octave".
Note: After the unpleasant exercise of getting this to work, it seems that Octave is not very useful unless you enjoy wasting your time writing code to do the simplest things. Better choices seem to be R, Python, or C#/Java with a Machine Learning or Matrix library.
function all_messages = load_data(filename, NUMBER_OF_LINES)
  fid = fopen(filename, "r");
  all_messages = cell (NUMBER_OF_LINES, 2 );
  counter = 1;
  line = fgetl(fid);
  % fgetl returns -1 (not a string) at end of file; ischar also handles empty lines,
  % which would end a "line != -1" loop early
  while ischar(line)
    separator_index = index(line, '|');
    all_messages {counter, 1} = substr(line, 1, separator_index - 1); % Up to the separator
    all_messages {counter, 2} = substr(line, separator_index + 1, length(line) - separator_index); % After the separator
    counter++;
    line = fgetl(fid);
  endwhile
  fprintf("Processed %i lines.\n", counter -1);
  fclose(fid);
end