Remove just strings from the entries in my first column of data frame - dataframe

I have strings and numbers in my first column of a data frame:
rn
AT457
X5377
X3477
I want to remove just the strings and keep the numbers from each entry in the column called rn.
Any help is appreciated.

Use a regular expression to do this.
For example, with R :
## Sample data :
df=data.frame(rn=c("AT457","X5377","X3477"))
## Replace the letters with *nothing* ('\D' is used to identify non-digit characters)
df$rn_strip=gsub('\\D',"",df$rn)
## Output :
rn rn_strip
1 AT457 457
2 X5377 5377
3 X3477 3477

Related

return rows dosn't have specefic number of length in pandas

am clean my dataset and cleaned it but am stuck in some rows don't have the specific length must have in the column
The column (order_id) must have 16 character the column type is object, so i'dont know how i can extract all rows don't have the exact character must be in column and how to remove those rows
Thank You .
for more information
image of column
in excel i can just filter the column and show only value that has 16 character
i want to do that in pandas i want just to return rows that contain 16 character and drop all row greater or lower than 16 character .
I suppose you want to keep all rows which match this pattern [0-9A-F]{16}:
df = df[df['order_id'].str.contains(r'^[0-9A-F]{16}$')]

Select rows which contain numeric substrings in Pandas

I need to delete rows from a dataframe in which a particular column contains string which contains numeric substrings. See the shaded column of my dataframe.
rows with values like 0E as prefix or 21 (any two digit number) as suffix or 24A (any two digit number with a letter) as suffix should be deleted.
Any suggestions?
Thanks in advance.
You can use boolean indexing with a str.contains() regex:
^0E - starts with 0E
\d{2}$ - ends with 2 digits
\d{2}[A-Z]$ - ends with 2 digits and 1 capital letter
col = ... # target column
mask = df[col].str.contains(r'^0E|\d{2}$|\d{2}[A-Z]$')
df = df.loc[~mask]
#tdy gave a good answer, but only one place need to be modified if I understand it correctly.
For value ends with two digits or two digits and a capital character, the regex should be:
.*\d{2}[A-Z]?$

How to extract just numeric value with REGEXP_EXTRACT in BigQuery?

I am trying to extract just the numbers from a particular column in BigQuery.
The fields concerned have this format: value = "Livraison_21J|Relais_19J" or "RELAIS_15 DAY"
I am trying to extract the number of days for each value preceeded by the keyword "Relais".
The days range from 1 to 100.
I used this to do so:
SELECT CAST(REGEXP_EXTRACT(delivery, r"RELAIS_([0-9]+J)") as string) as relayDay
FROM TABLE
I want to be able to extract just the number of days regardless of the the string that comes after the numbers, be it "J" or "DAY".
Sample data :
RETRAIT_2H|LIVRAISON_5J|RELAIS_5J | 5J
LIVRAISON_21J|RELAIS_19J | 19J
LIVRAISON_21J|RELAIS_19J | 19J
RETRAIT_2H|LIVRAISON_3J|RELAIS_3J | 3J
You may use
REGEXP_EXTRACT(delivery, r"(?:.*\D)?(\d+)\s*(?:J|DAY)")
See the regex demo
Details
(?:.*\D)? - an optional non-capturing group that matches 0+ chars other than line break chsrs as many as possible and then a non-digit char (this pattern is required to advance the index to the location right before the last sequence of digits, not the last digit)
(\d+) - Group 1 (just what the REGEXP_EXTRACT returns): one or more digits
\s* - 0+ whitespaces
(?:J|DAY) - J or DAY substrings.

Need a way to split string pandas to colums with numbers

hi i have string in one column :
s='123. 125. 200.'
i want to split it to 3 columns(or as many numbers i have ends with .)
To separate columns and that it will be number not string !, in every column .
From what I understand, you can use:
s='123. 125. 200.'
pd.Series(s).str.rstrip('.').str.split('.',expand=True).apply(pd.to_numeric,errors='coerce')
0 1 2
0 123 125 200

Split column in hive

I am new to Hive and Hadoop framework. I am trying to write a hive query to split the column delimited by a pipe '|' character. Then I want to group up the 2 adjacent values and separate them into separate rows.
Example, I have a table
id mapper
1 a|0.1|b|0.2
2 c|0.2|d|0.3|e|0.6
3 f|0.6
I am able to split the column by using split(mapper, "\\|") which gives me the array
id mapper
1 [a,0.1,b,0.2]
2 [c,0.2,d,0.3,e,0.6]
3 [f,0.6]
Now I tried to to use the lateral view to split the mapper array into separate rows, but it will separate all the values, where as I want to separate by group.
Expected:
id mapper
1 [a,0.1]
1 [b,0.2]
2 [c,0.2]
2 [d,0.3]
2 [e,0.6]
3 [f,0.6]
Actual
id mapper
1 a
1 0.1
1 b
1 0.2
etc .......
How can I achieve this?
I would suggest you to split your pairs split(mapper, '(?<=\\d)\\|(?=\\w)'), e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |.
Update:
If you have digits as well and your float numbers have only one digit after decimal marker then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)').
Update 2:
OK, the best way is to split on the second | as follows
split(mapper, '(?<!\\G[^\\|]+)\\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]