Need to extract all the content inside the brackets in a pandas DataFrame

I need to extract only the content inside the brackets in a pandas DataFrame. I tried using str.extract(), but it is not working. I need help with the extraction.
DATA (stored in a DataFrame; this is sample data from one row):
By:Chen TX (Chen Tianxu)[ 1 ] ; Tribbitt MA (Tribbitt Mark A.)[ 2 ] ; Yang Y (Yang Yi)[ 3 ] ; Li XM (Li Xiaomei)[ 4 ]

You can use a regular expression. First, build the DataFrame:
import pandas as pd
import re
dataset = pd.DataFrame([{'DATA': 'By:Chen TX (Chen Tianxu)[ 1 ] ; Tribbitt MA (Tribbitt Mark A.)[ 2 ] ; Yang Y (Yang Yi)[ 3 ] ; Li XM (Li Xiaomei)[ 4 ]'}])
print(dataset)
The DataFrame is:
DATA
0 By:Chen TX (Chen Tianxu)[ 1 ] ; Tribbitt MA (Tribbitt Mark A.)[ 2 ] ; Yang Y (Yang Yi)[ 3 ] ; Li XM (Li Xiaomei)[ 4 ]
Then apply re.findall through a lambda function to extract the names and save them in a new column called names:
# regular expression from: https://stackoverflow.com/a/31343831/5916727
# raw string so that the backslashes reach the regex engine unchanged
dataset['names'] = dataset['DATA'].apply(lambda x: re.findall(r'\((.*?)\)', x))
print(dataset['names'])
The names column then contains:
0 [Chen Tianxu, Tribbitt Mark A., Yang Yi, Li Xiaomei]
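To see what the pattern does outside of pandas, here is a minimal standard-library sketch run on the sample row. The second pattern, for the numbers in the square brackets, is an assumption about what else might be wanted; the helper names are made up for illustration. In pandas, the same pattern can also be passed to Series.str.findall.

```python
import re

# Sample row from the question.
ROW = ("By:Chen TX (Chen Tianxu)[ 1 ] ; Tribbitt MA (Tribbitt Mark A.)[ 2 ] ; "
       "Yang Y (Yang Yi)[ 3 ] ; Li XM (Li Xiaomei)[ 4 ]")

# Non-greedy match: everything between "(" and the nearest ")".
NAME_IN_PARENS = re.compile(r"\((.*?)\)")
# Assumption: the square brackets hold an integer padded with spaces.
NUMBER_IN_SQUARE = re.compile(r"\[\s*(\d+)\s*\]")


def extract_names(text):
    """Return the text found inside each pair of parentheses."""
    return NAME_IN_PARENS.findall(text)


def extract_numbers(text):
    """Return the integers found inside each pair of square brackets."""
    return [int(n) for n in NUMBER_IN_SQUARE.findall(text)]


names = extract_names(ROW)
numbers = extract_numbers(ROW)
print(names)
print(numbers)
```

The `?` after `.*` matters: without it the match would run from the first "(" to the last ")" in the row and return one long string.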

Related

Map two data frames by common string elements from List column

I want to map two data frames when string elements from two columns match. The common column holds lists of strings. I also tried the map function after converting one frame to a dictionary, but it didn't work.
df
Text
[Temp,Temp2]
[Temp4,Temp7,Temp2]
ClusterDf
Label Member
[Cluster1] [Temp,Temp8]
[Cluster2] [Temp4,Temp7]
I want output like
df
Text Label
[Temp,Temp2] [Cluster1]
[Temp4,Temp7,Temp2] [Cluster2]
Create a dictionary from ClusterDf that maps each member to its label, then add the new column with map. Using next with iter returns the first match, or 'no match' if there is none:
d = {v: a[0] for a, b in zip(ClusterDf['Label'], ClusterDf['Member']) for v in b}
print (d)
{'Temp': 'Cluster1', 'Temp8': 'Cluster1', 'Temp4': 'Cluster2', 'Temp7': 'Cluster2'}
df['Label'] = df['Text'].map(lambda x: next(iter(d[y] for y in x if y in d), 'no match'))
print (df)
Text Label
0 [Temp, Temp2] Cluster1
1 [Temp4, Temp7, Temp2] Cluster2
If you need a list:
df['Label'] = df['Text'].map(lambda x: [next(iter(d[y] for y in x if y in d), 'no match')])
print (df)
Text Label
0 [Temp, Temp2] [Cluster1]
1 [Temp4, Temp7, Temp2] [Cluster2]
If you want all matches, when any exist:
df['Label'] = df['Text'].map(lambda x: [d[y] for y in x if y in d])
print (df)
Text Label
0 [Temp, Temp2] [Cluster1]
1 [Temp4, Temp7, Temp2] [Cluster2, Cluster2]
Thanks @jezrael, the third solution worked perfectly for me. Thanks a lot.
You made my day.
df['Label'] = df['Text'].map(lambda x: [d[y] for y in x if y in d])
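The third variant repeats a label once per matching member (note the [Cluster2, Cluster2] above). If that is not wanted, the lookup can collect each label only once. A minimal sketch on plain lists, assuming the labels are plain strings rather than one-element lists; build_lookup and unique_labels are hypothetical helper names:

```python
def build_lookup(labels, members):
    """Map every member string to the label of its cluster."""
    lookup = {}
    for label, group in zip(labels, members):
        for member in group:
            # setdefault keeps the first label if a member appears in two
            # clusters (the dict comprehension above keeps the last one).
            lookup.setdefault(member, label)
    return lookup


def unique_labels(items, lookup):
    """Return the labels matched by items, without repeats, in first-seen order."""
    seen = []
    for item in items:
        label = lookup.get(item)
        if label is not None and label not in seen:
            seen.append(label)
    return seen


labels = ["Cluster1", "Cluster2"]
members = [["Temp", "Temp8"], ["Temp4", "Temp7"]]
texts = [["Temp", "Temp2"], ["Temp4", "Temp7", "Temp2"]]

lookup = build_lookup(labels, members)
result = [unique_labels(text, lookup) for text in texts]
print(result)
```

With a DataFrame, the same function could be passed to map, for example df['Text'].map(lambda x: unique_labels(x, d)).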

How do I insert rows with row index == value of 'index' and moving the other rows down?

I have a list of dictionaries and want to insert each one as a row in a DataFrame, at the index given by its 'index' key. The rows that previously occupied those indices should move one step down so they don't get overwritten.
Example:
List:
rows=
[{'Abbreviation': u'3-HYDROXY-3-METHYL-GLUTARYL-COA_m',
'Charge': -5.0,
'Charged Formula': u'C27H39N7O20P3S1',
'Neutral Formula': u'C27H44N7O20P3S1',
'index': 101},
{'Abbreviation': u'5-METHYL-THF_c',
'Charge': -2.0,
'Charged Formula': u'C20H23N7O6',
'Neutral Formula': u'C20H25N7O6',
'index': 204}]
DataFrame: df
Before:
index Abbreviation
101 foo
204 bar
After:
index Abbreviation | etc..
101 3-HYDROXY-3-METHYL-GLUTARYL-COA_m .
102 foo
204 5-METHYL-THF_c
205 bar
Any help is appreciated. Thank you very much!
Regular list insertion should work here. Posting some sample code below:
d1 = {'index': 101}
d2 = {'index': 102}
d4 = {'index': 104}
l = [d1, d4]
# assuming elements of l have the key 'index' and
# are sorted in ascending order of 'index'
# inserting d2 in l
for i, v in enumerate(l):
    if v['index'] > d2['index']:
        break
else:
    # no element has a larger index, so insert at the end
    i = len(l)
l.insert(i, d2)
List contents:
Before inserting d2
[{'index': 101}, {'index': 104}]
After inserting d2
[{'index': 101}, {'index': 102}, {'index': 104}]
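The list insertion above keeps the order but does not renumber anything. A sketch of the shifting behaviour from the question, under the assumption that "moving down" means: when a target index is occupied, the old row moves to the next index, cascading while that one is occupied too. It works on a plain dict keyed by index; insert_shifting is a hypothetical name, and the rows are shortened to one field.

```python
def insert_shifting(table, new_rows):
    """Insert new_rows into table (index -> row), pushing occupants down by one."""
    for row in sorted(new_rows, key=lambda r: r["index"]):
        idx = row["index"]
        incoming = {k: v for k, v in row.items() if k != "index"}
        # Swap the incoming row with the occupant and carry the occupant
        # to the next index until a free slot is found.
        while idx in table:
            table[idx], incoming = incoming, table[idx]
            idx += 1
        table[idx] = incoming
    return dict(sorted(table.items()))


table = {101: {"Abbreviation": "foo"}, 204: {"Abbreviation": "bar"}}
rows = [
    {"Abbreviation": "3-HYDROXY-3-METHYL-GLUTARYL-COA_m", "index": 101},
    {"Abbreviation": "5-METHYL-THF_c", "index": 204},
]
shifted = insert_shifting(table, rows)
for index, row in shifted.items():
    print(index, row["Abbreviation"])
```

The resulting dict could then be turned back into a DataFrame, for example with pd.DataFrame.from_dict(shifted, orient='index').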

Elm: Split list into multiple lists

I'd like to be able to split a list up into multiple lists.
I'm assuming this would need to be stored in a tuple - although not completely sure.
Say I have this group of 8 people
users =
["Steve", "Sally", "Barry", "Emma", "John", "Gustav", "Ankaran", "Gilly"]
I would like to split them up into groups of a specific size,
for example groups of 2, 3 or 4 people.
-- desired result
( ["Steve", "Sally", "Barry"]
, ["Emma", "John", "Gustav"]
, ["Ankaran", "Gilly"]
)
Part 2 of this question: how would you then iterate over and render the results from a tuple of varying length?
I was playing around with this example, using tuple-map
but it seems to only expect a tuple with 2 values.
import Html exposing (..)
import List
data = (
["Steve", "Sally", "Barry"]
, ["Emma", "John", "Gustav"]
, ["Ankaran", "Gilly"]
)
renderLI value =
    li [] [ text value ]
renderUL list =
    ul [] (List.map renderLI list)
main =
    div [] (map renderUL data)
-- The following is taken from zarvunk/tuple-map for example's sake
{-| Map over the tuple with two functions, one for each
element.
-}
mapEach : (a -> a') -> (b -> b') -> (a, b) -> (a', b')
mapEach f g (a, b) = (f a, g b)
{-| Apply the given function to both elements of the tuple.
-}
mapBoth : (a -> a') -> (a, a) -> (a', a')
mapBoth f = mapEach f f
{-| Synonym for `mapBoth`.
-}
map : (a -> a') -> (a, a) -> (a', a')
map = mapBoth
I'd like to be able to split a list up into multiple lists. I'm assuming this would need to be stored in a tuple - although not completely sure.
Tuples are fixed in the number of things they can carry. You can't have a function that accepts any size tuple.
It sounds like you'd like something more flexible, like a list of lists. You could define a split function like this:
import List exposing (..)
split : Int -> List a -> List (List a)
split i list =
    case take i list of
        [] -> []
        listHead -> listHead :: split i (drop i list)
Now you've got a function that can split up any size list into a list containing lists of the requested size.
split 2 users == [["Steve","Sally"],["Barry","Emma"],["John","Gustav"],["Ankaran","Gilly"]]
split 3 users == [["Steve","Sally","Barry"],["Emma","John","Gustav"],["Ankaran","Gilly"]]
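For comparison, the same chunking idea as a minimal Python sketch (not Elm; the function name simply mirrors the one above). Returning an empty list for a non-positive size is a choice made here to match what the Elm version does when take returns [].

```python
def split(size, items):
    """Split items into consecutive chunks of at most `size` elements."""
    if size <= 0:
        # Mirrors the Elm version, where `take` yields [] and recursion stops.
        return []
    return [items[start:start + size] for start in range(0, len(items), size)]


users = ["Steve", "Sally", "Barry", "Emma", "John", "Gustav", "Ankaran", "Gilly"]
print(split(3, users))
```

The last chunk is shorter when the list length is not a multiple of the size, exactly as in the split 3 users result above.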
Your Html rendering now becomes simpler, since you only have to deal with lists of lists:
import Html exposing (..)
import List exposing (..)
split : Int -> List a -> List (List a)
split i list =
    case take i list of
        [] -> []
        listHead -> listHead :: split i (drop i list)
users =
    ["Steve", "Sally", "Barry", "Emma", "John", "Gustav", "Ankaran", "Gilly"]
renderLI value =
    li [] [ text value ]
renderUL list =
    ul [] (List.map renderLI list)
main =
    div [] (map renderUL <| split 3 users)
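The point of part 2 is that a list of lists can be rendered with two nested maps, whatever the number of groups. A rough Python sketch of that structure, producing an HTML string instead of Elm's virtual DOM nodes; the function names are illustrative only.

```python
from html import escape


def render_li(value):
    """Render one name as a list item, escaping HTML special characters."""
    return "<li>" + escape(value) + "</li>"


def render_ul(items):
    """Render one group as an unordered list."""
    return "<ul>" + "".join(render_li(value) for value in items) + "</ul>"


def render_groups(groups):
    """Render any number of groups inside a single div."""
    return "<div>" + "".join(render_ul(group) for group in groups) + "</div>"


groups = [["Steve", "Sally", "Barry"], ["Emma", "John", "Gustav"], ["Ankaran", "Gilly"]]
page = render_groups(groups)
print(page)
```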
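Updated answer for Elm 0.19, using List.Extra from the elm-community/list-extra package. Note that groupsOf drops a final group that is not full, which is why 10 is missing from the result (greedyGroupsOf keeps it):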
import List.Extra as E
E.groupsOf 3 (List.range 1 10)
--> [ [ 1, 2, 3 ], [ 4, 5, 6 ], [ 7, 8, 9 ] ]

How do I refer to variable in func argument when same is used in foreach

How can I refer to the date argument of f inside the foreach loop if date is also used as one of the loop's block variables? Do I have to rename my date variable?
f: func[data [block!] date [date!]][
foreach [date o h l c v] data [
]
]
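Simple: compose is your best friend. It evaluates the parenthesized (date) before foreach rebinds the word, so the argument's value is inserted into the loop body: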
f: func[data [block!] date [date!]][
foreach [date str] data compose [
print (date)
print date
]
]
>> f [2010-09-01 "first of sept" 2010-10-01 "first of october"] now
7-Sep-2010/21:19:05-4:00
1-Sep-2010
7-Sep-2010/21:19:05-4:00
1-Oct-2010
You need to either change the parameter name from date or assign it to a local variable.
You can access the date argument inside the foreach loop by binding the 'date word from the function specification to the data argument:
>> f: func[data [block!] date [date!]][
[ foreach [date o h l c v] data [
[ print last reduce bind find first :f 'date 'data
[ print date
[ ]
[ ]
>> f [1-1-10 1 2 3 4 5 2-1-10 1 2 3 4 5] 8-9-10
8-Sep-2010
1-Jan-2010
8-Sep-2010
2-Jan-2010
It makes the code very difficult to read though. I think it would be better to assign the date argument to a local variable inside the function as Graham suggested.
>> f: func [data [block!] date [date!] /local the-date ][
[ the-date: :date
[ foreach [date o h l c v] data [
[ print the-date
[ print date
[ ]
[ ]
>> f [1-1-10 1 2 3 4 5 2-1-10 1 2 3 4 5] 8-9-10
8-Sep-2010
1-Jan-2010
8-Sep-2010
2-Jan-2010
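The local-variable approach is not specific to Rebol. As an analogy only, here is the same pattern in Python, where a for loop that reuses a parameter name also hides the argument; the_date and the sample rows are made up for illustration.

```python
from datetime import date as Date


def f(data, date):
    """Pair the original `date` argument with each date found in `data`."""
    the_date = date  # keep the argument before the loop reuses the name
    pairs = []
    for date, label in data:  # `date` now refers to the row value
        pairs.append((the_date, date))
    return pairs


rows = [(Date(2010, 1, 1), "first"), (Date(2010, 1, 2), "second")]
result = f(rows, Date(2010, 9, 8))
for argument, row_date in result:
    print(argument, row_date)
```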

I need to generate a 50-million-row CSV file with random data: how can I optimize this program?

The program below generates random data according to some specs (the example here is for 2 columns).
It works for a few hundred thousand lines on my PC (this probably depends on RAM). I need to scale to tens of millions of rows.
How can I optimize the program to write directly to disk? Secondarily, how can I "cache" the execution of the parse rule, since the same pattern is repeated 50 million times?
Note: to use the program below, call generate-blocks and then save-blocks; the output will be db.txt.
Rebol[]
specs: [
[3 digits 4 digits 4 letters]
[2 letters 2 digits]
]
;====================================================================================================================
digits: charset "0123456789"
letters: charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
separator: charset ";"
block-letters: [A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]
blocks: copy []
generate-row: func[][
Foreach spec specs [
rule: [
any [
[
set times integer! [['digits (
repeat n times [
block: rejoin [block random 9]
]
)
|
'letters (repeat n times [
block: rejoin [ block to-string pick block-letters random 24]
]
)
]
|
[
'letters (repeat n times [block: rejoin [ block to-string pick block-letters random 24]
]
)
|
'digits (repeat n times [block: rejoin [block random 9]]
)
]
]
|
{"} any separator {"}
]
]
to end
]
block: copy ""
parse spec rule
append blocks block
]
]
generate-blocks: func[m][
repeat num m [
generate-row
]
]
quote: func[string][
rejoin [{"} string {"}]
]
save-blocks: func[file][
if exists? to-rebol-file file [
answer: ask rejoin ["delete " file "? (Y/N): "]
if (answer = "Y") [
delete %db.txt
]
]
foreach [field1 field2] blocks [
write/lines/append %db.txt rejoin [quote field1 ";" quote field2]
]
]
Use open with the /direct and /lines refinements to write directly to the file without buffering the content:
file: open/direct/lines/write %myfile.txt
loop 1000 [
t: random "abcdefghi"
append file t
]
close file
This will write 1000 random lines without buffering.
You can also prepare a block of lines (let's say 10000 rows) and then write it to the file in one go; this will be faster than writing line by line.
file: open/direct/lines/write %myfile.txt
loop 100 [
b: copy []
loop 1000 [append b random "abcdef"]
append file b
]
close file
This will be much faster: 100000 rows take less than a second.
Hope this helps.
Note that you can change the numbers 100 and 1000 according to your needs and the memory of your PC. Using b: make block! 1000 instead of b: copy [] will also be faster, because the block is preallocated.
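The batching idea carries over to other languages. A minimal Python sketch of the same approach, under stated assumptions: the two column specs are rewritten by hand as (count, alphabet) pairs instead of being parsed, digits cover 0-9 and letters A-Z, and write_rows, make_row and make_field are hypothetical names. Rows are generated in batches and each batch is written with one call, so memory use stays bounded by the batch size.

```python
import random
import string

# Hand-translated from the specs in the question:
# [3 digits 4 digits 4 letters] and [2 letters 2 digits].
SPECS = [
    [(3, string.digits), (4, string.digits), (4, string.ascii_uppercase)],
    [(2, string.ascii_uppercase), (2, string.digits)],
]


def make_field(spec, rng):
    """Build one field by concatenating random characters for each spec part."""
    return "".join("".join(rng.choices(alphabet, k=count)) for count, alphabet in spec)


def make_row(rng):
    """Build one output line: quoted fields separated by semicolons."""
    return ";".join('"' + make_field(spec, rng) + '"' for spec in SPECS)


def write_rows(out, total, batch_size=10000, seed=None):
    """Write `total` random rows to the open text file `out`, one batch per write."""
    rng = random.Random(seed)
    remaining = total
    while remaining > 0:
        count = min(batch_size, remaining)
        out.write("\n".join(make_row(rng) for _ in range(count)) + "\n")
        remaining -= count


# Usage (writes db.txt in the current directory):
#     with open("db.txt", "w") as out:
#         write_rows(out, 50_000_000)
```

Because the specs are fixed data here, nothing is re-parsed per row, which also addresses the "cache the rule" part of the question in this sketch.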