I have a dataset where one of the columns holds a total sq. ft. value.
1151
1025
2100 - 2850
1075
1760
I would like to split entries like 2100 - 2850 wherever the value contains '-' and take their average (mean) as the new value. I am trying to achieve this using the apply method, but I run into an error when the statement containing contains executes. Please suggest how to handle this situation.
def convert_totSqft(s):
    if s.str.contains('-', regex=False) == True:
        << some statements >>
    else:
        << some statements >>

X['new_col'] = X['total_sqft'].apply(convert_totSqft)
Error message:
File "<ipython-input-6-af39b196879b>", line 2, in convert_totSqft
if s.str.contains('-', regex=False) == True:
AttributeError: 'str' object has no attribute 'str'
IIUC:

df.col.str.split('-', expand=True).apply(pd.to_numeric).mean(axis=1)
Out[630]:
0 1151.0
1 1025.0
2 2475.0
3 1075.0
4 1760.0
dtype: float64
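For future readers, the one-liner above can be run end to end; a minimal sketch, assuming the column is named total_sqft as in the question:

```python
import pandas as pd

X = pd.DataFrame({'total_sqft': ['1151', '1025', '2100 - 2850', '1075', '1760']})

# split on '-', coerce each piece to a number, and average across the row;
# plain values pass through unchanged, since the mean of one number is itself
X['new_col'] = (X['total_sqft']
                .str.split('-', expand=True)
                .apply(pd.to_numeric)
                .mean(axis=1))
```

to_numeric tolerates the stray whitespace around the split pieces, and mean(axis=1) skips the NaN that expand=True produces for single-valued rows.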
IIUC, you can split on '-' in every row anyway and just transform using np.mean, since the mean of a single number is the number itself:

import numpy as np

df.col.str.split('-').transform(lambda s: np.mean([int(x.strip()) for x in s]))
0 1151.0
1 1025.0
2 2475.0
3 1075.0
4 1760.0
Alternatively, you can sum and divide by len (same thing):
df.col.str.split('-').transform(lambda s: sum([int(x.strip()) for x in s])/len(s))
If you necessarily want the results back as int, just wrap it with int():
df.col.str.split('-').transform(lambda s: int(np.mean([int(x.strip()) for x in s])))
0 1151
1 1025
2 2475
3 1075
4 1760
Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename the 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks beforehand.
-Tom
I managed to change the name of the column, but I could not do both at the same time.
Use pop to return and delete the column in one step, and rdiv to perform the conversion (litre per 100 km ≈ 235.15 / mpg):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
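As an aside on where the constant comes from: it is not stated in either answer above, but it follows from the standard unit definitions (1 mile = 1.609344 km, 1 US gallon = 3.785411784 L):

```python
# litres per 100 km = (litres per gallon) / (km per mile) * 100 / mpg,
# so the conversion constant is:
KM_PER_MILE = 1.609344
L_PER_US_GALLON = 3.785411784

cc = L_PER_US_GALLON / KM_PER_MILE * 100  # ~235.2146, matching the second answer
```

The first answer's 235.15 is a slightly rounded version of the same constant.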
Here's my data:
   id  medianHouseValue  housingMedianAge  totalBedrooms  totalRooms  \
0  23           113.903              31.0          543.0      2438.0
1  24            99.701              56.0          337.0      1692.0
2  26           107.500              41.0          123.0       535.0
3  27            93.803              53.0          244.0      1132.0
4  28           105.504              52.0          423.0      1899.0

   households  population  medianIncome
0       481.0      1016.0        1.7250
1       328.0       856.0        2.1806
2       121.0       317.0        2.4038
3       241.0       607.0        2.4597
4       400.0      1104.0        1.8080
Here's what I'm trying to do:
change any values in the medianIncome column that are 0.4999 or lower to 0.4999, and change any values that are 15.0001 or higher to 15.0001.
I've tried this:
housing.loc[housing['medianIncome'] > 15.0001, 'medianIncome'] = 15.0001
housing.loc[housing['medianIncome'] < 0.4999, 'medianIncome'] = 0.4999
And get this error:
AttributeError: 'list' object has no attribute 'loc'
So then I tried this:
housing['medianIncome'] = np.where(housing['medianIncome'] >= 15.0001, housing['medianIncome'])
housing['medianIncome'] = np.where(housing['medianIncome'] <= 0.4999, housing['medianIncome'])
And get this error:
TypeError: list indices must be integers or slices, not str
I've looked up both errors but can't seem to find a solution that works. There are a lot more rows; it's just not letting me copy/paste them all here, and I can't recall how to upload the data set.
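Both tracebacks suggest that housing is a plain Python list rather than a DataFrame (worth checking how it was loaded). Once it is a DataFrame, the capping described above can be done in one step with Series.clip; a sketch with made-up values:

```python
import pandas as pd

# hypothetical values; the real frame comes from the housing dataset
housing = pd.DataFrame({'medianIncome': [0.3000, 1.7250, 2.1806, 16.2000, 2.4038]})

# clip() caps everything below/above the given bounds in a single call
housing['medianIncome'] = housing['medianIncome'].clip(lower=0.4999, upper=15.0001)
```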
My code uses a column called booking status that is 1 for yes and 0 for no (information will be pulled from several other columns depending on the booking status). There are many more no rows than yes rows, so I would like to take a sample containing all the yes rows and the same number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to take the sample this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, meaning you'll receive an error if you specify a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in c1 are more likely to be sampled; in particular, rows whose weight is zero will never be picked. We can fix this error using either of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
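Applied to the booking case, taking every yes row plus an equally sized random sample of the no rows (what the question describes) can be sketched like this, assuming a 0/1 bookingstatus column (the data below is made up):

```python
import pandas as pd

# hypothetical data with 2 yes rows and 8 no rows
df = pd.DataFrame({'bookingstatus': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]})

# size of the minority class (the 1s)
n_yes = (df['bookingstatus'] == 1).sum()

# all the yes rows, plus a same-sized random sample of the no rows
balanced = pd.concat([
    df[df['bookingstatus'] == 1],
    df[df['bookingstatus'] == 0].sample(n=n_yes, random_state=1),
])
```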
I was able to do this in the end; here is how I did it:

bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')

# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()

# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]

# Undersample the majority class and recombine
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone
I would like to compute the spatial standard deviation of a variable (let's say temperature). In other words, does GrADS have an "astd" command (analogous to aave) that I could use?
There is no such command in GrADS. But you can compute the standard deviation in two ways:
[1] Compute it manually. For example:

* compute the mean
x1 = ave(ts1.1, t=1, t=120)
* compute the sample standard deviation
s1 = sqrt(ave(pow(ts1.1 - x1, 2), t=1, t=120) * (n1/(n1-1)))

n1 here is the number of samples (120 in this example).
[2] You can use the built-in 'stat' output in GrADS, via 'set stat on' or 'set gxout stat'.
These commands will give you statistics such as the following:
Data Type = grid
Dimensions = 0 1
I Dimension = 1 to 73 Linear 0 5
J Dimension = 1 to 46 Linear -90 4
Sizes = 73 46 3358
Undef value = -2.56e+33
Undef count = 1763 Valid count = 1595
Min, Max = 243.008 302.818
Cmin, cmax, cint = 245 300 5
Stats[sum,sumsqr,root(sumsqr),n]: 452778 1.29046e+08 11359.8 1595
Stats[(sum,sumsqr,root(sumsqr))/n]: 283.874 80906.7 284.441
Stats[(sum,sumsqr,root(sumsqr))/(n-1)]: 284.052 80957.4 284.53
Stats[(sigma,var)(n)]: 17.9565 322.437
Stats[(sigma,var)(n-1)]: 17.9622 322.64
Contouring: 245 to 300 interval 5
Sigma here is the standard deviation and Var is the variance.
Is this what you are looking for?
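For readers outside GrADS, the manual formula in [1] is just the sample standard deviation; the same computation in Python/NumPy (with hypothetical temperature values) makes the n/(n-1) correction explicit:

```python
import numpy as np

ts = np.array([243.0, 250.5, 261.2, 255.8, 248.9])  # hypothetical temperature samples
n = ts.size

# mean of squared deviations, then the n/(n-1) correction,
# mirroring sqrt(ave(pow(ts1.1-x1,2)) * (n1/(n1-1))) from the GrADS script
mean = ts.mean()
s = np.sqrt(((ts - mean) ** 2).mean() * (n / (n - 1)))
```

This is exactly what NumPy's std with ddof=1 computes.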
From my "Id" column I want to remove the leading one and zeros. That is:
1000003 becomes 3
1000005 becomes 5
1000011 becomes 11 and so on
Ignore -1, 10 and 1000000; they will be handled as special cases. But for the remaining rows, I want to remove the leading "1" followed by zeros.
Well, you can use the modulus to get the tail of the numbers (it is the remainder). Just exclude the rows with ids of [-1, 10, 1000000] and take each remaining id modulo 1000000:

print(df)
        Id
0       -1
1       10
2  1000000
3  1000003
4  1000005
5  1000007
6  1000009
7  1000011

keep = df.Id.isin([-1, 10, 1000000])
df.loc[~keep, 'Id'] = df.Id[~keep] % 1000000
print(df)
        Id
0       -1
1       10
2  1000000
3        3
4        5
5        7
6        9
7       11
Edit: here is a fully vectorized string-slice version as an alternative (like Alex's method, but it takes advantage of pandas' vectorized string methods):

keep = df.Id.isin([-1, 10, 1000000])
df.loc[~keep, 'Id'] = df.Id[~keep].astype(str).str[1:].astype(int)
print(df)
        Id
0       -1
1       10
2  1000000
3        3
4        5
5        7
6        9
7       11
Here is another way you could do it:

def f(x):
    """Convert the value to a string and drop the first character
    (the leading '1'); for example, 1000005 becomes '000005'. If the
    column has a float dtype, str(x) yields '1000005.0' and the slice
    ends in '.0', so float() parses it first and int() then gives 5.
    """
    return int(float(str(x)[1:]))

# apply the function "f" to each row, passing in the 'Id' column
df.apply(lambda row: f(row['Id']), axis=1)
I get that this question is satisfactorily answered, but for future visitors: what I like about Alex's answer is that it does not depend on there being exactly four zeros. The accepted answer will fail if you sometimes have 10005, sometimes 1000005, and so on.
However, to add something more to the way we think about it: if you know the prefix is always going to be 1000000, you can do

# back up all values
foo = df.id
# now some will be negative or zero
df.id = df.id - 1000000
# put back the rows that came out negative or zero (here, the first three)
df.id[df.id <= 0] = foo[df.id <= 0]

It gives you the same result as Karl's answer, but I typically prefer this kind of method for its readability.