Load data from text with pandas with a two-character separator - pandas

I'm trying to load data with pandas from a txt table.
The column separator was defined as "|#" as you can see in the example:
LINEA DE NEGOCIO|#NOMBRE CLIENTE|#NUMERO CLIENTE|#NUMERO DE CONTRATO|#TIPO DE SEGURO
The system does not allow me to use "|#" as the separator.
Could you help me load this file?
Thanks in advance.
Here is my code:
df = pd.read_table('D:/Art_492/Encabezado.txt', sep='|#', index_col=0).astype(str)

The | represents the OR operator in a regular expression, so you need to escape it with \. Updating your separator to \|# and setting engine='python' (the default C engine does not support regex separators) gives the desired result.
pd.read_table('D:/Art_492/Encabezado.txt', sep=r'\|#', engine='python', index_col=0).astype(str)
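For reference, a minimal self-contained sketch of the same call, using the header line from the question plus one made-up data row (the row values are placeholders, not real data):

import io
import pandas as pd

# Sample text: the header from the question plus one placeholder row
raw = (
    "LINEA DE NEGOCIO|#NOMBRE CLIENTE|#NUMERO CLIENTE|#NUMERO DE CONTRATO|#TIPO DE SEGURO\n"
    "VIDA|#CLIENTE EJEMPLO|#123|#456|#COLECTIVO\n"
)

# A multi-character separator is treated as a regular expression, so the
# pipe must be escaped; the python engine is required for regex separators.
df = pd.read_table(io.StringIO(raw), sep=r'\|#', engine='python', index_col=0).astype(str)
print(df)

The raw string r'\|#' keeps the escape intact without having to double the backslash.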

Related

How to find part of a string in an HL7-like free-text parameter, e.g. between the 2nd and 3rd | character?

In a program I created some HL7-like strings.
They look like this:
ID=1610968|EAD=02962|CNR=0|ACT=10968|ACTNAME=bijkomend honorarium voor toezicht COVID-19-patient|TIME=2/02/2023 16:21:00|EENHEID=30016|AFDCODE=KANE|AANTAL=1|URG=0|INF=0|TOPO=0|ARTS=avdbro9|SUP=avdbro9
I am looking for a SQL query with which I can split this string into separate parts on the | character.
For instance, I would like to find a way to isolate the part |ACT=10968|, which will always be between the third and fourth |.
How can I do this?
Thanks
If you are using SQL Server, try the STRING_SPLIT function as documented here, and then select the element you require from the output of this function.
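Purely as a sketch of the splitting idea (Python rather than SQL), this shows which element you are after: the part between the third and fourth | is the fourth element of the split, i.e. zero-based index 3.

# Illustration only: split on '|' and take the element between the 3rd and 4th pipe
msg = "ID=1610968|EAD=02962|CNR=0|ACT=10968|ACTNAME=bijkomend honorarium voor toezicht COVID-19-patient"
parts = msg.split("|")
print(parts[3])  # ACT=10968

In STRING_SPLIT terms that is the fourth element of its output; note that the function does not guarantee output order unless you use its optional ordinal column (available in newer SQL Server versions).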

Create a conditional column using a specific string as a delimiter in Pentaho

I'm trying to create a conditional column in Pentaho, splitting by the delimiter "NF" as shown in the image below...
I've tried a lot of things, like Filter Rows, Split Fields, etc., but since a specific string is being used as the delimiter I think there must be a better way to do this. Can someone help, please?
I've tried Filter Rows, Split Fields, and a function in the Formula step.
You don't state the output you are trying to get from the column with the NF delimiter, so let's say you are trying to get two new columns:
IMPOSTOS | BEFORE_NF | AFTER_NF
PIS APURACAO S/NF 0001 TAG COMERCIO | PIS APURACAO S/ | 0001 TAG COMERCIO
COFINS APURACAO S/NF 0002 TAG COMERCIO | COFINS APURACAO S/ | 0002 TAG COMERCIO
To get this outcome you can use the Regex Evaluation step, which uses this regex to separate your column:
(.*)(NF\s)(.*)
This separates your text into 3 groups: the text before "NF ", the text "NF " itself, and the text after "NF ".
The Regex Evaluation step also has the ability to create another column with a flag indicating whether the regex was successful (the formula matched the text or not).
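If you want to sanity-check that regex outside Pentaho first, the same three capture groups can be inspected with a short Python sketch (the sample value is taken from the table above):

import re

# Group 1 = text before "NF ", group 2 = the literal "NF ", group 3 = text after "NF "
pattern = re.compile(r"(.*)(NF\s)(.*)")

m = pattern.match("PIS APURACAO S/NF 0001 TAG COMERCIO")
if m:
    before_nf, nf, after_nf = m.groups()
    print(before_nf)  # PIS APURACAO S/
    print(after_nf)   # 0001 TAG COMERCIO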

Search for multiple question marks in pandas

I want to search for repeated characters in my dataset with pandas. For example, when I search for multiple exclamation points I use this script, which works:
df_double=df[df["text"].str.contains("!!")==True]
df_double
But when I want to change this script to search for multiple question marks, I get an error:
df_double=df[df["text"].str.contains("??")==True]
df_double
What is wrong with this script?
Use \ to escape the ?, because it is a special regex character, together with {2} to specify 2 repetitions:
df1 = df[df["text"].str.contains(r"\?{2}", na=False)]
Or:
df1 = df[df["text"].str.contains(r"\?\?", na=False)]
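If you only need a literal substring match and no regex features at all, you can also switch regex matching off, which avoids the escaping entirely; a small sketch with a made-up frame:

import pandas as pd

# Tiny sample data, for illustration only
df = pd.DataFrame({"text": ["what??", "really?", "wow!!", None]})

# regex=False makes str.contains treat the pattern as a plain substring
df1 = df[df["text"].str.contains("??", regex=False, na=False)]
print(df1)  # keeps only the row containing "??"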

How to escape delimiter found in value - pig script?

In a Pig script, I would like to find a way to escape the delimiter character in my data so that it doesn't get interpreted as extra columns. For example, if I'm using colon as a delimiter, and I have a column with value "foo:bar", I want that string interpreted as a single column without having the loader pick up the colon in the middle.
You can try http://pig.apache.org/docs/r0.12.0/func.html#regex-extract-all
A = LOAD 'somefile' AS (s:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(s, '(.*) : (.*)'));
The regex might have to be adapted.
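For a value like "foo:bar" (no spaces around the colon) the adapted pattern can be checked quickly outside Pig; a Python sketch of the same capture-group idea:

import re

# The Pig example's '(.*) : (.*)' expects spaces around the colon;
# for "foo:bar" the adapted pattern is '(.*):(.*)'
m = re.match(r"(.*):(.*)", "foo:bar")
print(m.groups())  # ('foo', 'bar')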
It seems Pig takes the input as a plain string; it is not smart enough to identify what is data and what is not.
PigStorage works on a simple string tokenizer. So if you want to do something like
a = LOAD '/abc/def/file.txt' USING PigStorage(':');
it doesn't seem to solve your problem. But if we write our own PigStorage() method, we could possibly come up with a solution.
I will try posting the code to resolve this.
You can use STRSPLIT(string, regex, limit) to split the column based on the delimiter.

Unable to Remove Special Characters In Pig

I have a text file that I want to load into my Pig engine.
The text file has names in separate rows, but the data has errors in it: special characters. Something like this:
Ja##$s000on
J##a%^ke
T!!ina
Mel#ani
I want to remove the special characters from all the names using a regex in Pig, and finally have the output as:
Jason
Jake
Tina
Melani
Can someone please tell me the regex that will do this job in Pig?
Also, please write the command that will do it, as I am unable to use the REGEX_EXTRACT and REGEX_EXTRACT_ALL functions.
Also, can someone explain the significance of the number 1 that we pass to this function as an argument after defining the regex?
Any help would be highly appreciated.
You can use REPLACE with RegEx to solve this problem.
input.txt
Ja##$s000on
J##a%^ke T!!ina Mel#ani
PigScript:
A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'([^a-zA-Z\\s]+)','');
dump B;
Output:
(Jason)
(Jake Tina Melani)
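If you want to check the character class outside Pig first, here is the same idea as a Python sketch (the Pig script doubles the backslash in \\s because of string-literal escaping, while a Python raw string only needs \s):

import re

names = ["Ja##$s000on", "J##a%^ke", "T!!ina", "Mel#ani"]

# Remove everything that is not a letter or whitespace, same as the REPLACE call above
cleaned = [re.sub(r"[^a-zA-Z\s]+", "", n) for n in names]
print(cleaned)  # ['Jason', 'Jake', 'Tina', 'Melani']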
There is no way to escape these characters when they are part of the values in a tuple, bag, or map, but there is no problem whatsoever in loading these characters when they are part of a string. Just specify that field as type chararray.
Please have a look here.