Kettle - Text File Unstructured - pentaho

I have a text file whose records do not fit on a single line; the only certainty is that the first line of each record starts with zero (0). Below is a sample:
header : TEXT
header : TEXT
header : TEXT
line 1 : 0TEXT Name Other Field
line 2 : TEXT Other Field Phone
line 3 : 0TEXT Name Other Field
line 4 : TEXT Other Field Phone
line 5 : 0TEXT textexttexttext
line 6 : 0TEXT Name Other Field
line 7 : TEXT Other Field Phone
line 8 : 0TEXT Name Other Field
line 9 : TEXT Other Field Phone
What I want to do is extract the NAME and PHONE fields through a regex evaluation and store these values:
Name, Phone
Name, Phone
The regex part is OK, I have already done it.
What I need to know is how to get the values from two different lines and put them into the same record.
I found this forum thread http://forums.pentaho.com/showthread.php?53288-Reading-multi-line-records-from-text-file-newbie and tried to apply the JavaScript suggested there, but it didn't work for me; maybe I did something wrong.
Edit: I had indeed made a simple mistake and have since fixed it.
The JavaScript:
var x;
var charInitial = line.toString().charAt(0);
if (charInitial == '0') {
    // a line starting with '0' begins a new record
    x = line.toString();
}
else {
    // any other line is a continuation of the current record
    x += line.toString();
}
With this script I get the rows separated; what I want is to concatenate them and then apply the regex. I can concatenate all the rows that belong to a group, and with a regex I can drop the ones that are unnecessary.
Thanks

Given that you have those records in multiple rows, you have the following options:
1) Group by: as long as you can identify the rows that belong together via some set of keys, you can use a Group by step and create two new fields, Name and Phone, obtained with "Concatenate fields separated by" (not the "concatenate fields separated by ," option, mind that). If the values are either what you want to keep or null, the concatenation works;
2) De-normalize: the same principle applies, you need a set of keys to identify records that belong together, but you will need both Name and Phone to be in the same field (e.g. Value) and you need another field holding the key (either Name or Phone).
3) Perhaps the best one: Analytic Query. Use "Lag N rows forward and get field" with N=1 and you get the phone number of the next row. After this step you have rows with a not-null name and the next row's phone number, and rows with a null name and a null phone number. Filter the rows you want afterwards and you're done.
This is just a generic idea. You have to sort out the details.
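To make option 3 concrete outside of PDI, here is a minimal Python sketch of the same idea, lag one row forward and then filter; the sample lines and the two regexes are made-up placeholders standing in for the real file layout and the regexes already written:

import re

# Made-up sample in the shape of the file above: a '0'-prefixed line carries
# the name, the following continuation line carries the phone.
lines = [
    "0TEXT John Smith Other",
    "TEXT Other 555-1234",
    "0TEXT Jane Doe Other",
    "TEXT Other 555-5678",
]

# Hypothetical regexes; substitute the ones already written for the real data.
name_re = re.compile(r"^0TEXT (\w+ \w+)")
phone_re = re.compile(r"(\d{3}-\d{4})")

records = []
for current, nxt in zip(lines, lines[1:] + [""]):
    name = name_re.search(current)
    phone = phone_re.search(nxt)     # "lag one row forward": look at the next line
    if name and phone:               # keep only name rows followed by a phone row
        records.append((name.group(1), phone.group(1)))

print(records)  # [('John Smith', '555-1234'), ('Jane Doe', '555-5678')]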

Related

How do you split a full name into first and last name in pandas when the first name consists of 2 or more words? Is there an easy way to do it?

I concatenated the two columns firstname and lastname into one column. Now, how can I split it back into two columns when the firstname consists of 2, 3, or more words before the lastname? Is there an easy way to do it? I tried using the str.split() method, but it says "columns should be the same length."
We can use str.extract here:
df[["firstname", "lastname"]] = df["fullname"].str.extract(r'^(\w+(?: \w+)*) (\w+)$')
The regex pattern used above assigns as many name words from the start as possible, leaving only the final component for the last name. Here is a formal explanation of the regex:
^ from the start of the name
( open first capture group \1
\w+ match the first word
(?: \w+)* then match space and another word, together zero or more times
) close first capture group
(space) match a single literal space
(\w+) match and capture the last word as the last name in \2
$ end of the name
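For completeness, a small self-contained example of that extract in action; the names below are made up purely for illustration:

import pandas as pd

# Illustrative data; only a "fullname" column is assumed to exist.
df = pd.DataFrame({"fullname": ["Jean Claude Van Damme", "Ada Lovelace"]})

# The greedy first group keeps every word except the last for "firstname".
df[["firstname", "lastname"]] = df["fullname"].str.extract(r'^(\w+(?: \w+)*) (\w+)$')

print(df[["firstname", "lastname"]].values.tolist())
# [['Jean Claude Van', 'Damme'], ['Ada', 'Lovelace']]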

PostgreSQL - find matching line in char/string column?

How can I find a matching line in a char/string type column?
For example, let's say I have a column called text and some row has the content:
12345\nabcdf\nXKJKJ
(where \n are real new lines)
Now I want to find the related row if any of its lines matches. For example, if I have the value 12345, then it should find a match. But if I have the value 123, it should not.
I tried using like, but it matches in both cases: when I have a fully matching value (like 12345) and a partially matching value (like 123).
For example, something like this, but with a boundary so that the whole line is checked:
SELECT id
FROM my_table
WHERE text like [SOME_VALUE]
Update
Maybe its not yet clear what Im asking. But basically I want something equivalent what you can do with regular expression,
like this: https://regexr.com/5akj1
Here regular expression /^123$/m would not match my string, it would only match if it would have been with pattern /^12345$/m (when I use pattern, value is dynamic, so pattern would change depending what value I got).
You may use regexp_replace and then check that the replaced string is not equal to the original column value:
select count(*)
from dummy
where regexp_replace(mytext, '(?m)^1234$', '') <> mytext;
You have a demo here.
Bear in mind that I have used the (?m) modifier, which makes ^ and $ match the beginning and end of each line instead of the beginning and end of the whole string.
You should be able to use ~ for matching:
where mytext ~ '(\n|^)1234(\n|$)'
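The line-boundary idea is engine-independent; as a quick illustration with Python's re module rather than PostgreSQL (the helper function below is just for the demo):

import re

mytext = "12345\nabcdf\nXKJKJ"

def line_matches(value, text):
    # (?m) makes ^ and $ anchor at line breaks, so the value must fill a whole line.
    return re.search(r"(?m)^" + re.escape(value) + r"$", text) is not None

print(line_matches("12345", mytext))  # True  - 12345 is a complete line
print(line_matches("123", mytext))    # False - 123 is only part of a line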

Consider a query to find details of research fields where the first two parts of the ID are D and 2 and the last part is one character (digit)

The ID of research fields have three parts, each part separated by a period.
Consider a query to find the details of research fields where the first two parts of the ID are D and 2, and the last part is a single character (digit).
IDs like D.2.1 and D.2.3 are in the query result whereas IDs like D.2.12 or D.2.15 are not.
The SQL query given below does not return the correct result. Explain the reason why it does not return the correct result and give the correct SQL query.
select *
from field
where ID like 'B.1._';
I have no idea why it doesn't work.
Can anyone help with this? Many thanks.
D.2.1 and D.2.3 are in the query result whereas IDs like D.2.12 or D.2.15 are not.
An underscore matches any single character in a LIKE filter, so B.1._ looks for the start of the string, followed by a B character, then a . character, then a 1 character, then another . character, then any single character, and then the end of the string.
You could use:
SELECT *
FROM field
WHERE ID like 'B.1._%';
The % will match any number of characters (including zero) up to the end of the string, and the preceding underscore enforces that there is at least one character after the final period.
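To see the difference between the two patterns outside of any particular DBMS, here is a small sketch using Python's built-in sqlite3 module (the table and data are made up for the demo; SQLite's LIKE uses the same _ and % wildcards):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE field (ID TEXT)")
con.executemany("INSERT INTO field VALUES (?)",
                [("B.1.1",), ("B.1.3",), ("B.1.12",), ("D.2.1",)])

# '_' matches exactly one character, '%' matches zero or more characters.
for pattern in ("B.1._", "B.1._%"):
    rows = con.execute("SELECT ID FROM field WHERE ID LIKE ?", (pattern,)).fetchall()
    print(pattern, [r[0] for r in rows])

# B.1._  ['B.1.1', 'B.1.3']
# B.1._% ['B.1.1', 'B.1.3', 'B.1.12']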

Notepad++ rows to columns, in groups

I have found a ton of ways to transpose rows to columns in Notepad++ and vice versa. However, where I'm struggling is that I have one column with several rows. I can't 'just' transpose these, as the data winds up in the wrong order.
Example:
RANK
COMPANY
GROWTH
REVENUE
INDUSTRY
1
Skillz
50,058.92%
$54.2m
Software
2
EnviroSolar Power
36,065.06%
$37.4m
Energy
When I transpose this, I wind up with:
RANKCOMPANYGROWTHREVENUEINDUSTRY 1Skillz50,058.92%$54.2mSoftware2EnviroSolar Power36,065.06%$37.4mEnergy
I need everything to remain in groups so I wind up with the following, noting that I also need a delimiter added:
RANK|COMPANY|GROWTH|REVENUE|INDUSTRY
1|Skillz|50,058.92%|$54.2m|Software
2|EnviroSolar Power|36,065.06|$37.4m|Energy
As you can see with the company EnviroSolar Power, there is a space between "EnviroSolar" and "Power", and everything I've tried winds up removing the spaces that should remain intact when transposing.
I appreciate ANY help you can offer! Thank you in advance!
Assuming that your rows always start with integers (except for the header row, of course), and furthermore that only the first column contains integers, you can do that with two search-and-replace operations (Ctrl+H).
Be sure to select 'Regular expression' search mode.
First replace all newlines with pipes. This will put everything on one line for now.
Find what: \n
Replace with: |
Next, find all purely numeric fields and make each of them start a new line to reach the desired result.
Find what: \|([0-9]+)\|
Replace with: \n$1|
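If you would rather script it than click through Notepad++, the same two substitutions can be sketched in Python (the data is the example from the question):

import re

lines = ["RANK", "COMPANY", "GROWTH", "REVENUE", "INDUSTRY",
         "1", "Skillz", "50,058.92%", "$54.2m", "Software",
         "2", "EnviroSolar Power", "36,065.06%", "$37.4m", "Energy"]

# Step 1: every newline becomes a pipe (here: join the lines with '|').
joined = "|".join(lines)

# Step 2: a purely numeric field marks the start of a new record.
result = re.sub(r"\|([0-9]+)\|", r"\n\1|", joined)
print(result)
# RANK|COMPANY|GROWTH|REVENUE|INDUSTRY
# 1|Skillz|50,058.92%|$54.2m|Software
# 2|EnviroSolar Power|36,065.06%|$37.4m|Energy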
If you know the number of columns (here it is 5), you can do it in two steps:
First:
Ctrl+H
Find what: (?:[^\r\n]+\R){5}
Replace with: $0\n
Replace all
Explanation:
(?: : start non capture group
[^\r\n]+ : 1 or more any character but line break
\R : any kind of line break
){5} : the group must occur 5 times; here you can use the column count of your choice
This will add a linebreak after 5 columns.
Check regular expression
Second:
Ctrl+H
Find what: (\R)(?!\R)|(\R\R)
Replace with: (?1|:\n)
Replace all
Explanation:
(\R) : any kind of line break, in group 1
(?!\R) : negative lookahead, make sure we have not another linebreak after
| : OR
(\R\R) : 2 line break, in group 2
Replacement:
(?1 : conditional replacement, does group 1 exist?
| : yes ==> a pipe
:\n : no ==> a linebreak
) : end of the condition
This replaces a single linebreak with a pipe and two consecutive linebreaks with a single one.
Result for given example:
RANK|COMPANY|GROWTH|REVENUE|INDUSTRY
1|Skillz|50,058.92%|$54.2m|Software
2|EnviroSolar Power|36,065.06%|$37.4m|Energy
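The same grouping idea can also be scripted directly when the column count is known, without regular expressions; a short Python sketch using the example data again:

lines = ["RANK", "COMPANY", "GROWTH", "REVENUE", "INDUSTRY",
         "1", "Skillz", "50,058.92%", "$54.2m", "Software",
         "2", "EnviroSolar Power", "36,065.06%", "$37.4m", "Energy"]

columns = 5  # number of fields per record
rows = ["|".join(lines[i:i + columns]) for i in range(0, len(lines), columns)]
print("\n".join(rows))
# Produces the same three pipe-delimited lines as above.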

Get a Count of a Field Including Similar Entries MS Access

Hey all, I'm trying to parse out any duplicates in an Access database. I want the database to be usable by people who are Access-illiterate, and therefore I am trying to set up queries that can be run without any understanding of the program.
My database is set up so that there are occasionally special characters attached to the entries in the Name field. I am interested in checking for duplicate entries based on the fields field1 and name. How can I include the counts for entries with special characters together with their non-special-character counterparts? Is this possible in a single step, or do I need to add a step where I clean the data first?
Currently my code (shown below) only returns counts for entries not including special characters.
SELECT
table.[field1],
table.[Name],
Count(table.[Name]) AS [CountOfName]
FROM
table
GROUP BY
table.[field1],
table.[Name]
HAVING
(((table.[Name]) Like "*") AND ((Count(table.[Name]))>1));
I have tried adding a leading space to the Like statement (Like " *"), but that returns zero results.
P.S. I have also tried the Replace function to replace the special characters, but that did not work.
Sample data:
field1   name
1234567  brian
1234567  brian
4567890  ted
4567890  ted‡
Results:
field1   name   countofname
1234567  brian  2
GROUP BY works by placing rows into groups where the values are the same. So when you run your query on your data and it groups by field1 and name, you are saying "put these records into groups where they share a common field1 and name value". If you want 4567890, ted and 4567890, ted‡ to show up in the same group, and thus have a count of 2, both field1 and name have to be the same.
If you only have one or two possible special characters on the end of the names, you could potentially use Replace() or Mid() to remove all the special characters from the end of the names, but remember you must also GROUP BY the new expression you create; you can't GROUP BY the original name field or you won't get your desired count. You could also create another column that contains a sanitized name, one without any special character on the end.
I don't have Access installed, but something like this should do it:
SELECT
table.[field1],
Replace(table.[Name], "‡", "") AS Name,
Count(table.[Name]) AS [CountOfName]
FROM
table
GROUP BY
table.[field1],
Replace(table.[Name], "‡", "")
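The underlying idea, normalize the name first and then group and count, can be illustrated outside Access with a short Python sketch (sample data taken from the question; the clean() helper is just for the demo):

from collections import Counter

rows = [("1234567", "brian"), ("1234567", "brian"),
        ("4567890", "ted"), ("4567890", "ted‡")]

def clean(name):
    # Keep only letters so "ted‡" and "ted" fall into the same group.
    return "".join(ch for ch in name if ch.isalpha())

counts = Counter((field1, clean(name)) for field1, name in rows)
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)
# {('1234567', 'brian'): 2, ('4567890', 'ted'): 2}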