How to split a CSV file into groups using Pentaho? - pentaho

I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.
Eg
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W
I need to split and group this as such:
(each block starts when the first column is equal to '1')
Block a)
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
Block b)
1|N|O|P
4|Q|R|S|T
5|U|V|W
Eg
a |1|A|B|C
a |2|D|E|F
a |8|G|H|I|J|K
a |4|L|M
b |1|N|O|P
b |4|Q|R|S|T
b |5|U|V|W
How can this be achieved using Penatho? Thanks.
I found a similar question but answers don't really help my case
Pentaho Kettle split CSV into multiple records

I think I got the answer.
I created the transformation in this zip that can transform your "csv" file in rows almost like you described but I don't know what you intend to do next, so maybe you can give us more details. =)
I'll explain what I did:
1) First, we grab the row full text with a Text input step
When you look at configurations of Text Input step, you'll see I used a ';' has separator, when your input file uses '|' so I'm not spliting columns with the '|' but loading the whole line in one column. Grabbing the row's full text, nothing else.
2) Next we apply a regex eval to separate the ID from the rest of our string.
^(\d+)\|(.*)
Which means: in the beginning of the text I expect one or more digits followed by a pipe and anything after that. Capture the digits in the beginning of the string in one column and everything after the pipe to another column.
That gives you this output: (blue is the first capture group, red is the second)
3) Now what you need is to add a 'sequence' that only goes up if there is a row_id = 1. Which I did in the Mod JS Value with the following code:
var sequence
//if it's the first row, set sequence to 1
if(sequence == null){
sequence = 1;
}else{
//if it's not the first row, check if the row_id is equal to 1 (string)
if(row_id == '1'){
// increment the sequence
sequence++;
}else{
//nothing
}
}
And that will give you this output that seem to be what you expected: (green, the group/sequence done)
Hope it helps =)

Related

Getting specific rows in a Powershell variable/array

I hope I'm able to ask my question as simple as possible. I am very new to working with PowerShell.
Now to my question:
I use Invoke-Sqlcmd to run a query, which puts Data in a variable, let's say $Data.
In this case I query for triggers in an SQL Database.
Then I kind of split the array to get more specific information:
$Data2 = $Data | Where {$_.table -like 'dbo.sportswear'}
$Data3 = $Data2 | Where {$_.event -match "Delete"}
So in the end I have a variable with these Indexes(?), I'm not sure if they are called indexes.
table
trigger_name
activation
event
type
status
definition
Now all I want is to check something in the definition.
So I create a $Data4 = $Data3.definition, so far so good.
But now I have a big text and I want only the content of 2-3 specific rows.
When I used like $Data4[1] or $Data4[1..100], I realized that PowerShell sees every char as a line/row.
But when I just write $Data4 it shows me the content nice formatted with paragraphs, new lines and so on.
Has anyone an idea how I can get specific rows or lines of my variable?
Thank you all :)
It appears $Data4 is a formatted string. Since it is a single string, any indexed element lookups return single characters (of type System.Char). If you want indexes to return longer substrings, you will need to split your string into multiple strings somehow or come up with a more sophisticated search mechanism.
If we assume the rows you are after are actual lines separated by line feed and/or carriage return, you can just split on those newline characters and use indexes to access your lines:
# Array indexing starts at 0 for line 1. So [1] is line 2.
# Outputs lines 2,3,4
($Data4 -split '\r?\n')[1..3]
# Outputs lines 2,7,20
($Data4 -split '\r?\n')[1,6,19]
-split uses regex to match characters and perform a string split on all matches. It results in an array of substrings. \r matches a carriage return. \n matches a line feed. ? matches 0 or one character, which is needed in case there are no carriage returns preceding your line feeds.

Is there a possibility to delete a string that starts with a ">" and ends with a newline in postgresql via sql update statement?

Lets say I have a database with about 50k entries in a column called content.
This column contains strings which causes problems to my further work.
Now here is the thing I need to do it for all the rows inside of that table.
Any Ideas?
Here an example:
'user wrote:
-----------------------------------------------------
> Some text
> that vary too much and I dont need it actually
> here is end of the text
The text I actually need.'
I would like to remove all of the unnecessary part so the only thing that is left is in this case :
'The text I actually need.'
This should delete all lines that start with a >:
regexp_replace(textcol, E'^>.*\n', '', 'gn');
The g flag is needed to delete all such lines, and the n flag makes the ^ match the position right after each line break.
I use an “extended” string literal (the leading E) so that I can write a newline as \n.

Limitting character input to specific characters

I'm making a fully working add and subtract program as a nice little easy project. One thing I would love to know is if there is a way to restrict input to certain characters (such as 1 and 0 for the binary inputs and A and B for the add or subtract inputs). I could always replace all characters that aren't these with empty strings to get rid of them, but doing something like this is quite tedious.
Here is some simple code to filter out the specified characters from a user's input:
local filter = "10abAB"
local input = io.read()
input = input:gsub("[^" .. filter .. "]", "")
The filter variable is just set to whatever characters you want to be allowed in the user's input. As an example, if you want to allow c, add c: local filter = "10abcABC".
Although I assume that you get input from io.read(), it is possible that you get it from somewhere else, so you can just replace io.read() with whatever you need there.
The third line of code in my example is what actually filters out the text. It uses string:gsub to do this, meaning that it could also be written like this:
input = string.gsub(input, "[^" .. filter .. "]", "").
The benefit of writing it like this is that it's clear that input is meant to be a string.
The gsub pattern is [^10abAB], which means that any characters that aren't part of that pattern will be filtered out, due to the ^ before them and the replacement pattern, which is the empty string that is the last argument in the method call.
Bonus super-short one-liner that you probably shouldn't use:
local input = io.read():gsub("[^10abAB]", "")

Read text file line by line but only specific columns

How do we read a specific file line by line while skipping some columns in it?
For example, I have a text file which has data, sorted out in 5 columns, but I need to read only two columns out of it, they can be first two or any other random combination (I mean, need a solution which would work with any combination of columns like first and third only).
Code something like this
open(1, file=data_file)
read (1,*) ! to skip first line, with metadata
lmax = 0
do while (.true.)
! read column 1 and 3 here, either write
! that to an array or just loop through each row
end do
99 continue
close (1)
Any explanation or example would help a lot.
High Performance Mark's answer gives the essentials of simple selective column reading: one still reads the column but transfers it to a then-ignored variable.
To extend that answer, then, consider that we want to read the second and fourth columns of a five-column line:
read(*,*) junk, x, junk, y
The first value is transferred into junk, then the second into x, then the third (replacing the one just acquired a moment ago) into junk and finally the fourth into y. The fifth is ignored because we've run out of input items and the transfer statement terminates (and the next read in a loop will go to the next record).
Of course, this is fine when we know it's those columns we want. Let's generalize to when we don't know in advance:
integer col1, col2 ! The columns we require, defined somehow (assume col1<col2)
<type>, dimension(nrows) :: x, y, junk(3) ! For the number of rows
integer i
do i=1,nrows
read(*,*) junk(:col1-1), x(i), junk(:col2-col1-1), y(i)
end do
Here, we transfer a number of values (which may be zero) up to just before the first column of interest, then the value of interest. After that, more to-be-ignored values (possibly zero), then the final value of interest. The rest of the row is skipped.
This is still very basic and avoids many potential complications in requirements. To some extent, it's such a basic approach one may as well just consider:
do i=1,nrows
read(*,*) allofthem(:5)
x(i) = allofthem(col1)
y(i) = allofthem(col2)
end do
(where that variable is a row-by-row temporary) but variety and options are good.
This is very easy. You simply read 5 variables from each line and ignore the ones you have no further use for. Something like
do i = 1, 100
read(*,*) a(i), b, c(i), d, e
end do
This will overwrite the values in b, d, and e at every iteration.
Incidentally, your line
99 continue
is redundant; it's not used as the closing line for the do loop and you're not branching to it from anywhere else. If you are branching to it from unseen code you could just attach the label 99 to the next line and delete the continue statement. Generally, continue is redundant in modern Fortran; specifically it seems redundant in your code.

How does associative arrays work in awk?

I wanted to remove duplicate lines from a file based on a column. A quick search let me this page which had the following solution:
awk '!x[$1]++' filename
It works, but I am not sure how it works. I know it uses associate arrays in awk but I am not able to infer anything beyond it.
Update:
Thanks everyone for the explanation. With my new knowledge, I have wrote a blog post with further explanation of how it works.
That awk script !x[$1]++ fills an array named x. Suppose the first word ($1 refers to the first word in a line of text) in a line of text is line1. It effectively results in this operation on the array:
x["line1"]++
The "index" (the key) of the array is the text encountered in the file (line1 in this example), and the value associated with that key is an integer that is incremented by 1.
When a unique line of text is encountered, the current value of the array is zero, which is then post-incremented to 1. The not operator ! evaluates to non-zero (true) for each new unique line of text and so prints it. The next time the same value is encountered, the value in the array is non-zero and so the not operation results in zero (false), so the line is not printed.
A less "clever" way of writing the same thing (but possibly more clear and less fun) would be this:
{
if (x[$1] == 0 )
print
x[$1]++
}