Need a way to split string pandas to colums with numbers - pandas

hi i have string in one column :
s='123. 125. 200.'
i want to split it to 3 columns(or as many numbers i have ends with .)
To separate columns and that it will be number not string !, in every column .

From what I understand, you can use:
s='123. 125. 200.'
pd.Series(s).str.rstrip('.').str.split('.',expand=True).apply(pd.to_numeric,errors='coerce')
0 1 2
0 123 125 200

Related

Split a vector into parts separated by zeros and cumulatively sum the elements in each part

I want to split a vector into several parts separated by the numeric value 0. For each part, cumulatively calculate the sum of the elements encountered so far. Negative numbers do not participate in the calculation.
For example, with the input [0,1,1,1,0,1,-1,1,1], I expect the result to be [0,1,2,3,0,1,1,2,3].
How to implement this in DolphinDB?
Use the DolphinDB built-in function cumPositiveStreak(X). Treat the negative elements in X as NULL values.
Script:
a = 1 2 -1 0 1 2 3
cumPositiveStreak(iif(a<0,NULL,a))
Execution result:
1 3 3 0 1 3 6

Remove just strings from the entries in my first column of data frame

I have strings and numbers in my first column of a data frame:
rn
AT457
X5377
X3477
I want to remove just the strings and keep the numbers from each entry in the column called rn.
Any help is appreciated.
Use a regular expression to do this.
For example, with R :
## Sample data :
df=data.frame(rn=c("AT457","X5377","X3477"))
## Replace the letters with *nothing* ('\D' is used to identify non-digit characters)
df$rn_strip=gsub('\\D',"",df$rn)
## Output :
rn rn_strip
1 AT457 457
2 X5377 5377
3 X3477 3477

Select rows where column value is a combination of numbers and letters

Having a dataset like this:
word
0 TBH46T
1 BBBB
2 5AAH
3 CAAH
4 AAB1
5 5556
Which would be the most efficient way to select the rows where column word is a combination of numbers and letters?
The output would be like this:
word
0 TBH46T
2 5AAH
4 AAB1
A possible solution would be to create a new column using apply and regex in which store if column word has the desired structure. But I'm curious about if this could be achieved in a more straightforward way.
Use Series.str.contains for chain mask for match numeric and for match non numeric with & for bitwise AND:
df = df[df['word'].str.contains('\d') & df['word'].str.contains('\D')]
print (df)
word
0 TBH46T
2 5AAH
4 AAB1

Split column in hive

I am new to Hive and Hadoop framework. I am trying to write a hive query to split the column delimited by a pipe '|' character. Then I want to group up the 2 adjacent values and separate them into separate rows.
Example, I have a table
id mapper
1 a|0.1|b|0.2
2 c|0.2|d|0.3|e|0.6
3 f|0.6
I am able to split the column by using split(mapper, "\\|") which gives me the array
id mapper
1 [a,0.1,b,0.2]
2 [c,0.2,d,0.3,e,0.6]
3 [f,0.6]
Now I tried to to use the lateral view to split the mapper array into separate rows, but it will separate all the values, where as I want to separate by group.
Expected:
id mapper
1 [a,0.1]
1 [b,0.2]
2 [c,0.2]
2 [d,0.3]
2 [e,0.6]
3 [f,0.6]
Actual
id mapper
1 a
1 0.1
1 b
1 0.2
etc .......
How can I achieve this?
I would suggest you to split your pairs split(mapper, '(?<=\\d)\\|(?=\\w)'), e.g.
split('c|0.2|d|0.3|e|0.6', '(?<=\\d)\\|(?=\\w)')
results in
["c|0.2","d|0.3","e|0.6"]
then explode the resulting array and split by |.
Update:
If you have digits as well and your float numbers have only one digit after decimal marker then the regex should be extended to split(mapper, '(?<=\\.\\d)\\|(?=\\w|\\d)').
Update 2:
OK, the best way is to split on the second | as follows
split(mapper, '(?<!\\G[^\\|]+)\\|')
e.g.
split('6193439|0.0444035224643987|6186654|0.0444035224643987', '(?<!\\G[^\\|]+)\\|')
results in
["6193439|0.0444035224643987","6186654|0.0444035224643987"]

change value (string manipulation) in Pandas DataFrame

I am reading a CSV file to Pandas DataFrame but need to be cleaned up before can be used. I need to do two things:
use regex to filter values
apply string functions such as trim, left, right, ...
For instance, DataFrame may looks like:
0 city_some_string_45
1 city_Other_string_56
2 city_another_string_77
so I need to filter (using regex) for all rows that its value start with "city" and get last two character.
the end result should looks like:
0 45
1 56
2 77
In another word, logic I want to apply is: read value of cell and if starts with city (filtering with regex ie: ^city) and replace the value of cell with its two last character of the cell (eg using right string function)
For a dataframe like this:
No city
0 0 city_some_string_45
1 1 city_Other_string_56
2 2 city_another_string_77
Filter the dataframe to keep the rows with city column starting with city
df = df[df.city.str.startswith('city')]
You can use str.extract to extract only the number
df['city'] = df.city.str.extract('(\d+)').astype(int)
The resulting df
No city
0 0 45
1 1 56
2 2 77