Pandas series extract with regular expression - pandas

I need to extract part of each value in a pandas column that holds values like these:
8-9 yrs
7-12 yrs
4-6 yrs
The column should end up holding 9, 12 and 6, i.e. the number after the hyphen.

Given a DataFrame df with a column a, you can use findall from the re library with a regex. Note that findall returns a list for each row (for example ['9']) rather than a scalar:
import re
df.a.apply(lambda x: re.findall(r'-(\d+)', x))

Use str.extract with a regex to get the number after the hyphen, or use split with indexing; finally, cast to integer if necessary:
df['B1'] = df.A.str.extract(r'-(\d+)', expand=True)
df['B2'] = df.A.str.split(n=1).str[0].str.split('-').str[1].astype(int)
df['B3'] = df.A.str.split(r'-|\s+').str[1].astype(int)
print(df)
          A  B1  B2  B3
0   8-9 yrs   9   9   9
1  7-12 yrs  12  12  12
2   4-6 yrs   6   6   6
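For reference, here is a minimal self-contained sketch of the str.extract approach using the three sample values from the question. The column name upper is only an illustrative choice, and expand=False is used so that the result is a Series that can be cast directly.

```python
import pandas as pd

# Sample values taken from the question.
df = pd.DataFrame({"A": ["8-9 yrs", "7-12 yrs", "4-6 yrs"]})

# Capture the digits that follow the hyphen; expand=False returns a Series.
# "upper" is a hypothetical column name used only for this illustration.
df["upper"] = df["A"].str.extract(r"-(\d+)", expand=False).astype(int)

print(df)
```

If some rows might not contain a hyphen, extract yields NaN for them, so the astype(int) cast would need to be dropped or replaced with a nullable dtype.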

Related

Create Repeating N Rows at Interval N Pandas DF [duplicate]

I have a df1 with shape (15, 1), but I need to create a new df2 of shape (270, 1) in which each of the 15 rows of df1 is repeated 18 times (18 * 15 = 270). df1 looks like this:
Sites
0 TULE
1 DRY LAKE I
2 PENASCAL I
3 EL CABO
4 BARTON CHAPEL
5 RUGBY
6 BARTON I
7 BLUE CREEK
8 NEW HARVEST
9 COLORADO GREEN
10 CAYUGA RIDGE
11 BUFFALO RIDGE I
12 DESERT WIND
13 BIG HORN I
14 GROTON
My df2 should look like this in abbreviated form below. Thank you.
I FINALLY found the answer: convert the DataFrame column to a Series, call my_series.repeat(N), and then convert the Series back to a DataFrame.
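A minimal sketch of that approach, assuming each site should appear 18 times in a row. Only the first three site names from the question are used to keep the example short, and the variable n is an illustrative name.

```python
import pandas as pd

# First three sites from the question; the real df1 has 15 rows.
df1 = pd.DataFrame({"Sites": ["TULE", "DRY LAKE I", "PENASCAL I"]})

n = 18  # number of consecutive copies of each row

# Series.repeat repeats each element n times in place; reset_index gives a
# clean 0..N-1 index, and to_frame converts the Series back to a DataFrame.
df2 = df1["Sites"].repeat(n).reset_index(drop=True).to_frame()

print(df2.shape)
```

With the full 15-row df1 the same code would give a frame of shape (270, 1).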

Remove a string from certain column values and then operate them Pandas

I have a dataframe with a column named months (as below), but some of its values were entered as "x years". I want to remove the word "years" and multiply those values by 12 so that the whole column is consistent.
index months
1 5
2 7
3 3 years
3 9
4 10 years
I tried with
if df['months'].str.contains("years")==True:
df['df'].str.rstrip('years').astype(float) * 12
But it's not working
You can build a multiplier array that is 12 for the rows containing "years" and 1 otherwise, then multiply the cleaned values by it:
import numpy as np
multiplier = np.where(df['months'].str.contains('years'), 12, 1)
df['months'] = df['months'].str.replace('years', '').astype(int) * multiplier
You get
index months
0 1 5
1 2 7
2 3 36
3 3 9
4 4 120
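A self-contained version of the multiplier idea, as a sketch. It assumes the months column holds strings throughout (as it would if read from text), and the name is_years is illustrative.

```python
import numpy as np
import pandas as pd

# Values from the question, all stored as strings (an assumption).
df = pd.DataFrame({"months": ["5", "7", "3 years", "9", "10 years"]})

# Boolean mask of the rows expressed in years.
is_years = df["months"].str.contains("years")

# 12 where the value is in years, 1 where it is already in months.
multiplier = np.where(is_years, 12, 1)

# Drop the literal word, trim the leftover space, convert, then scale.
df["months"] = (
    df["months"].str.replace("years", "", regex=False).str.strip().astype(int)
    * multiplier
)

print(df)
```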
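Select the matching rows with a boolean mask and then use replace() on just those rows. The other rows are left as they were, so cast the column afterwards if you need a numeric dtype:
mask = df['months'].str.contains("years")
df.loc[mask, 'months'] = df.loc[mask, 'months'].str.replace("years", "").astype(float) * 12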

Pandas: replacing part of a string from elements in different columns

I have a dataframe where numbers contained in some cells (in several columns) look like this: '$$10'
I want to replace/remove the '$$'. So far I have tried this, but it does not work:
replace_char={'$$':''}
df.replace(replace_char, inplace=True)
An example close to the approach you are taking would be:
df[col_name].str.replace(r'\$\$', '', regex=True)
Notice that this has to be done on a Series, so you have to select the column you want to apply the replacement to.
amt
0 $$12
1 $$34
df['amt'] = df['amt'].str.replace(r'\$\$', '', regex=True)
df
gives:
amt
0 12
1 34
Or you could apply it to the whole DataFrame with:
df.replace({r'\$\$': ''}, regex=True)
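Your code is (almost) right.
Without regex=True, replace only matches whole cell values, so this would work if a cell were exactly AA:
replace_char={'AA':''}
df.replace(replace_char, inplace=True)
The problem is that you want to replace part of a string, which needs regex=True, and $ is a regex metacharacter that has to be escaped:
df['your_column'].replace({r'\$': ''}, regex=True)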
example:
df = pd.DataFrame({"A":[1,2,3,4,5,'$$6'],"B":[9,9,'$$70',9,9, np.nan]})
A B
0 1 9
1 2 9
2 3 $$70
3 4 9
4 5 9
5 $$6 NaN
Then do:
df['A'].replace({r'\$': ''}, regex=True)
Desired result for column A:
0 1
1 2
2 3
3 4
4 5
5 6
From this point you can repeat the same call for any other column.
You just need to specify the regex argument and escape the dollar signs, since an unescaped $ matches the end of the string. Like:
replace_char = {r'\$\$': ''}
df.replace(replace_char, inplace=True, regex=True)
df.replace applies the replacement to every entry in the DataFrame.
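To make the whole-frame variant concrete, here is a small sketch on a shortened version of the example frame above. The name cleaned and the final pd.to_numeric step are additions for illustration, not part of the original answers.

```python
import numpy as np
import pandas as pd

# Shortened version of the example frame: mixed numbers and '$$'-prefixed strings.
df = pd.DataFrame({"A": [1, 2, "$$6"], "B": [9, "$$70", np.nan]})

# regex=True makes replace work on substrings; the dollar signs are escaped
# because a bare $ is the end-of-string anchor in a regex.
cleaned = df.replace({r"\$\$": ""}, regex=True)

# The stripped cells are still strings, so convert if numbers are needed.
a_numeric = pd.to_numeric(cleaned["A"])

print(cleaned)
print(a_numeric)
```

Non-string cells are left untouched by the regex replacement, which is why the conversion step is done separately.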

Generate list of values summing to 1 - within groupby?

In the spirit of the question "Generating a list of random numbers, summing to 1" from several years ago: is there a way to apply the array returned by np.random.dirichlet to each group of a DataFrame groupby?
For example, I can loop through the unique values of the letter column and apply one at a time:
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 3], ['a', 2], ['a', 6],
                   ['b', 7], ['b', 5], ['b', 4]], columns=['letter', 'value'])
df['grp_sum'] = df.groupby('letter')['value'].transform('sum')
df['prop_of_total'] = np.random.dirichlet(np.ones(len(df)), size=1).tolist()[0]
for letter in df['letter'].unique():
    sz = len(df[df['letter'] == letter])
    df.loc[df['letter'] == letter, 'prop_of_grp'] = np.random.dirichlet(np.ones(sz), size=1).tolist()[0]
print(df)
results in:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.015493 0.293481
1 a 3 12 0.114027 0.043973
2 a 2 12 0.309150 0.160818
3 a 6 12 0.033999 0.501729
4 b 7 16 0.365276 0.617484
5 b 5 16 0.144502 0.318075
6 b 4 16 0.017552 0.064442
But there has to be a better way than iterating over the unique values and filtering the dataframe for each one. This example is small, but I will potentially have tens of thousands of groups of varying sizes (roughly 50-100 rows each), and each needs a different random distribution.
I have also considered creating a temporary dataframe for each grouping, appending to a second dataframe and finally merging the results, though that seems more convoluted than this. I have not found a solution where I can apply an array of groupby size to the groupby but I think something along those lines would do.
Thoughts? Suggestions? Solutions?
IIUC, do a transform(), which calls the function once per group and expects an array of the same length as the group:
def dirichlet(x, size=1):
    return np.array(np.random.dirichlet(np.ones(len(x)), size=size)[0])
df['prop_of_grp'] = df.groupby('letter')['value'].transform(dirichlet)
Output:
letter value grp_sum prop_of_total prop_of_grp
0 a 1 12 0.102780 0.127119
1 a 3 12 0.079201 0.219648
2 a 2 12 0.341158 0.020776
3 a 6 12 0.096956 0.632456
4 b 7 16 0.193970 0.269094
5 b 5 16 0.012905 0.516035
6 b 4 16 0.173031 0.214871
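The following sketch reruns the transform approach end to end and checks that the proportions in each group sum to 1. It uses NumPy's Generator API with an arbitrary seed instead of np.random.dirichlet, and the names rng, dirichlet_weights and group_totals are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

df = pd.DataFrame({"letter": list("aaaabbb"),
                   "value": [1, 3, 2, 6, 7, 5, 4]})

def dirichlet_weights(group):
    # One Dirichlet draw with as many components as the group has rows;
    # the components of a single draw sum to 1.
    return rng.dirichlet(np.ones(len(group)))

# transform calls the function once per group and aligns the result
# back to the original row order.
df["prop_of_grp"] = df.groupby("letter")["value"].transform(dirichlet_weights)

# Sanity check: each group's proportions should add up to 1.
group_totals = df.groupby("letter")["prop_of_grp"].sum()
print(group_totals)
```

Because the function is invoked separately for every group, each group receives its own independent draw, which is what the question asks for.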

how to calculate percentage changes across 2 columns in a dataframe using pct_change in Python

I have a dataframe and want to use the pct_change method to calculate the % change between only two selected columns, B and C, and put the output into a new column. The code below doesn't seem to work. Can anyone help me?
df2 = pd.DataFrame(np.random.randint(0,50,size=(100, 4)), columns=list('ABCD'))
df2['new'] = df2.pct_change(axis=1)['B']['C']
Try:
df2['new'] = df2[['B','C']].pct_change(axis=1)['C']
With axis=1, pct_change returns the percentage change of every column relative to the column before it, so you can select the column you need (here C, which is compared with B) and assign it to a new column.
df2['new'] = df2.pct_change(axis=1)['C']
A B C D new
0 29 4 29 5 6.250000
1 14 35 2 40 -0.942857
2 5 18 31 10 0.722222
3 17 10 42 41 3.200000
4 24 48 47 35 -0.020833
IIUC, you can just do the following:
df2['new'] = (df2['C']-df2['B'])/df2['B']
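As a quick check that the pct_change route and the plain arithmetic agree, here is a sketch using the first two rows of the output shown above. The names via_pct_change and via_arithmetic are illustrative.

```python
import pandas as pd

# First two rows of the example output above.
df2 = pd.DataFrame({"A": [29, 14], "B": [4, 35], "C": [29, 2], "D": [5, 40]})

# Restrict to B and C, then take the change of C relative to B along the row.
via_pct_change = df2[["B", "C"]].pct_change(axis=1)["C"]

# The same quantity written out by hand.
via_arithmetic = (df2["C"] - df2["B"]) / df2["B"]

df2["new"] = via_pct_change
print(df2)
```

The arithmetic form avoids computing changes for columns that are not needed, which may matter on wide frames.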