How to create a new column in a Pandas DataFrame using pandas.cut method?

I have a column with house prices that looks like this:
0 0.0
1 1480000.0
2 1035000.0
3 0.0
4 1465000.0
5 850000.0
6 1600000.0
7 0.0
8 0.0
9 0.0
Name: Price, dtype: float64
and I want to create a new column, data['PriceRange'], that assigns each price to a range. This is what my code looks like:
data = pd.read_csv("Melbourne_housing_FULL.csv")
data.fillna(0, inplace=True)
for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))
data['PriceRange'] = pd.cut(data.Price, bins=bins, labels=labels, right=True)
And I get this Error message:
TypeError: len() of unsized object
I've been trying different approaches and seem to be stuck here. I'd really appreciate some help.
Thanks,
Hugo

The problem is that you overwrite bins and labels on every iteration of the loop, so each ends up holding only the last value:
for i in range(0, 12000000, 50000):
    bins = np.array(i)
    labels = np.array(str(i))
print (bins)
11950000
print (labels)
11950000
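As a small illustration of where the TypeError likely comes from: np.array(i) on a single integer builds a zero-dimensional array, which has no length. The variable names below are only for this sketch.

```python
import numpy as np

# What the loop leaves behind: a 0-d array wrapping one number.
last_bin = np.array(11950000)
print(last_bin.ndim)   # number of dimensions of the array

try:
    len(last_bin)
    message = None
except TypeError as exc:
    # A 0-d array is "unsized", so len() raises.
    message = str(exc)

print(message)

# A 1-d array of bin edges, by contrast, has a length.
edges = np.arange(0, 200000, 50000)
print(len(edges))
```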
No loop is needed: replace range with its NumPy alternative, np.arange, to build all the bin edges at once, and build the labels from consecutive pairs of edges. Finally, pass include_lowest=True to cut so that the first bin edge (0) is included in the first group.
bins = np.arange(0, 12000000, 50000)
labels = ['{} - {}'.format(i + 1, j) for i, j in zip(bins[:-1], bins[1:])]
#correct first value
labels[0] = '0 - 50000'
print (labels[:10])
['0 - 50000', '50001 - 100000', '100001 - 150000', '150001 - 200000',
'200001 - 250000', '250001 - 300000', '300001 - 350000', '350001 - 400000',
'400001 - 450000', '450001 - 500000']
data['PriceRange'] = pd.cut(data.Price,
                            bins=bins,
                            labels=labels,
                            right=True,
                            include_lowest=True)
print (data)
Price PriceRange
0 0.0 0 - 50000
1 1480000.0 1450001 - 1500000
2 1035000.0 1000001 - 1050000
3 0.0 0 - 50000
4 1465000.0 1450001 - 1500000
5 850000.0 800001 - 850000
6 1600000.0 1550001 - 1600000
7 0.0 0 - 50000
8 0.0 0 - 50000
9 0.0 0 - 50000
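For anyone who wants to try this without the CSV file, here is a minimal self-contained sketch of the same approach. The small Price frame is made up from the sample values in the question, and the step variable is just a convenience name.

```python
import numpy as np
import pandas as pd

# Stand-in for the real CSV: a few prices from the question.
data = pd.DataFrame({"Price": [0.0, 1480000.0, 1035000.0, 850000.0, 1600000.0]})

step = 50000
bins = np.arange(0, 12000000, step)          # bin edges: 0, 50000, ..., 11950000

# One label per interval, built from consecutive pairs of edges.
labels = [f"{lo + 1} - {hi}" for lo, hi in zip(bins[:-1], bins[1:])]
labels[0] = f"0 - {step}"                    # first interval also contains 0

data["PriceRange"] = pd.cut(
    data["Price"],
    bins=bins,
    labels=labels,
    right=True,            # intervals are closed on the right
    include_lowest=True,   # so that 0 falls into the first interval
)
print(data)
```

Note that pd.cut returns NaN for values outside the outermost bin edges, so with these edges any price above 11950000 would be left unlabelled.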

Related

average on dataframe segments

In the following picture, I have a DataFrame whose value returns to zero after each cycle of operation (the cycles have random length). I want to calculate the average (or perform other operations) for each segment between the zeros: for example, the average of [0.762, 0.766] alone, then of [0.66, 1.37, 2.11, 2.29] alone, and so on to the end of the DataFrame.
So I worked with this data :
random_value
0 0
1 0
2 1
3 2
4 3
5 0
6 4
7 4
8 0
9 1
There is probably a much better solution, but here is what I came up with:
def avg_function(df):
    avg_list = []
    value_list = list(df["random_value"])
    temp_list = []
    for i in range(len(value_list)):
        if value_list[i] == 0:
            if temp_list:
                avg_list.append(sum(temp_list) / len(temp_list))
                temp_list = []
        else:
            temp_list.append(value_list[i])
    if temp_list:  # for the last values
        avg_list.append(sum(temp_list) / len(temp_list))
    return avg_list
test_list = avg_function(df=df)
test_list
[Out] : [2.0, 4.0, 1.0]
Edit: as requested in the comments, here is a way to add the means to the DataFrame. I don't know if there is a way to do that with pandas directly (and there might be!), but I came up with this:
def add_mean(df, mean_list):
    temp_mean_list = []
    list_index = 0  # will be the index for the value of mean_list
    df["random_value_shifted"] = df["random_value"].shift(1).fillna(0)
    random_value = list(df["random_value"])
    random_value_shifted = list(df["random_value_shifted"])
    for i in range(df.shape[0]):
        if random_value[i] == 0 and random_value_shifted[i] == 0:
            temp_mean_list.append(0)
        elif random_value[i] == 0 and random_value_shifted[i] != 0:
            # a zero right after a non-zero run: move on to the next mean
            temp_mean_list.append(0)
            list_index += 1
        else:
            temp_mean_list.append(mean_list[list_index])
    df = df.drop(["random_value_shifted"], axis=1)
    df["mean"] = temp_mean_list
    return df
# test_list is the list of means returned by avg_function above
df = add_mean(df=df, mean_list=test_list)
Which gave me :
df
[Out] :
random_value mean
0 0 0
1 0 0
2 1 2
3 2 2
4 3 2
5 0 0
6 4 4
7 4 4
8 0 0
9 1 1
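On the question of whether pandas can do this by itself: one possible sketch, not the answer's method, is to number the segments with a cumulative sum over the zero rows and then group the non-zero rows by that number. The names is_zero, segment_id and segment_means are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"random_value": [0, 0, 1, 2, 3, 0, 4, 4, 0, 1]})

is_zero = df["random_value"].eq(0)

# Every zero bumps the counter, so all non-zero rows between two zeros
# share the same segment id.
segment_id = is_zero.cumsum()

nonzero = df.loc[~is_zero, "random_value"]
groups = nonzero.groupby(segment_id[~is_zero])

# One mean per segment (the equivalent of avg_function's list).
segment_means = groups.mean()
print(segment_means.tolist())

# Broadcast each segment's mean back to its rows; zero rows get 0.
df["mean"] = groups.transform("mean")
df["mean"] = df["mean"].fillna(0)
print(df)
```

The same groups object can be reused for other per-segment operations, for example groups.max() or groups.sum().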

python pandas divide dataframe in method chain

I want to divide a dataframe by a number:
df = df/10
Is there a way to do this in a method chain?
# idea:
df = df.filter(['a','b']).query("a>100").assign(**divide by 10)
We can use DataFrame.div here:
df = df[['a','b']].query("a>100").div(10)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
Alternatively, use DataFrame.pipe with a lambda function to apply a function to the whole DataFrame at once:
df = pd.DataFrame({
'a':[400,500,40,50,5,700],
'b':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'd':[5,3,6,9,2,4]
})
df = df.filter(['a','b']).query("a>100").pipe(lambda x: x / 10)
print (df)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
With apply, by contrast, the function is called on each column separately:
df = df.filter(['a','b']).query("a>100").apply(lambda x: x / 10)
You can see the difference by printing what each lambda receives:
df1 = df.filter(['a','b']).query("a>100").pipe(lambda x: print (x))
a b
0 400 7
1 500 8
5 700 3
df2 = df.filter(['a','b']).query("a>100").apply(lambda x: print (x))
0 400
1 500
5 700
Name: a, dtype: int64
0 7
1 8
5 3
Name: b, dtype: int64
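For a plain division the three spellings should give the same result. The following sketch checks that on the sample data, keeping the original frame untouched so each chain starts from the same input. The names filtered, by_div, by_pipe and by_apply are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [400, 500, 40, 50, 5, 700],
    "b": [7, 8, 9, 4, 2, 3],
    "c": [1, 3, 5, 7, 1, 0],
})

filtered = df[["a", "b"]].query("a > 100")

by_div = filtered.div(10)                            # dedicated method
by_pipe = filtered.pipe(lambda frame: frame / 10)    # lambda gets the whole DataFrame
by_apply = filtered.apply(lambda col: col / 10)      # lambda gets one column (Series) at a time

print(by_div)
print(by_div.equals(by_pipe), by_div.equals(by_apply))
```

pipe is the more general tool when the step needs to see the whole frame; div is the shortest when the step is just a division.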

'float' object has no attribute 'split'

I have a pandas data-frame with a column with float numbers. I tried to split each item in a column by dot '.'. Then I want to add first items to second items. I don't know why this sample code is not working.
data=
0 28.47000
1 28.45000
2 28.16000
3 28.29000
4 28.38000
5 28.49000
6 28.21000
7 29.03000
8 29.11000
9 28.11000
new_array = []
df = list(data)
for i in np.arange(len(data)):
    df1 = df[i].split('.')
    df2 = df1[0]+df[1]/60
    new_array=np.append(new_array,df2)
Use numpy.modf with DataFrame constructor:
arr = np.modf(data.values)
df = pd.DataFrame({'a':data, 'b':arr[1] + arr[0] / 60})
print (df)
a b
0 28.47 28.007833
1 28.45 28.007500
2 28.16 28.002667
3 28.29 28.004833
4 28.38 28.006333
5 28.49 28.008167
6 28.21 28.003500
7 29.03 29.000500
8 29.11 29.001833
9 28.11 28.001833
Detail:
arr = np.modf(data.values)
print(arr)
(array([ 0.47, 0.45, 0.16, 0.29, 0.38, 0.49, 0.21, 0.03, 0.11, 0.11]),
array([ 28., 28., 28., 28., 28., 28., 28., 29., 29., 28.]))
print(arr[0] / 60)
[ 0.00783333 0.0075 0.00266667 0.00483333 0.00633333 0.00816667
0.0035 0.0005 0.00183333 0.00183333]
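EDIT: To scale the fractional part by 5/3 (that is, 100/60, which treats the two decimal digits as minutes) instead of dividing it by 60: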
df = pd.DataFrame({'a':data, 'b':arr[1] + arr[0]*5/3 })
print (df)
a b
0 28.47 28.783333
1 28.45 28.750000
2 28.16 28.266667
3 28.29 28.483333
4 28.38 28.633333
5 28.49 28.816667
6 28.21 28.350000
7 29.03 29.050000
8 29.11 29.183333
9 28.11 28.183333
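A short sketch of the arithmetic behind the 5/3 factor, under the assumption (not stated in the question) that the two digits after the decimal point are minutes, so 28.47 means 28 and 47 minutes. Rounding to whole minutes first is an extra step added here to absorb floating-point noise; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([28.47, 28.45, 29.03])   # sample values from the question

# np.modf returns (fractional part, integer part), in that order.
frac, whole = np.modf(data.to_numpy())

# Assumption: .47 encodes 47 minutes. Rounding turns 46.99999... into 47.
minutes = np.round(frac * 100)

# minutes / 60 is the same scaling as frac * 100 / 60, i.e. frac * 5 / 3.
decimal_value = whole + minutes / 60

result = pd.DataFrame({"a": data, "b": decimal_value})
print(result)
```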
Your data types are floats, not strings, and so cannot be .split() (this is a string method). Instead you can use math.modf to 'split' a float into its fractional and integer parts.
https://docs.python.org/3.6/library/math.html
import math

def process(x: float, divisor: int = 60) -> float:
    """
    Convert a float to its constituent parts. Divide the fractional part by the divisor,
    and then recombine, creating a 'scaled fractional' part.
    """
    b, a = math.modf(x)  # b: fractional part, a: integer part
    c = a + b / divisor
    return c

df['data'].apply(process)
Out[17]:
0 28.007833
1 28.007500
2 28.002667
3 28.004833
4 28.006333
5 28.008167
6 28.003500
7 29.000500
8 29.001833
9 28.001833
Name: data=, dtype: float64
Your other option is to convert them to strings, split, convert to ints and floats again, do some maths and then combine the floats. I'd rather keep the object as it is personally.

How do I aggregate sub-dataframes in pandas?

Suppose I have a two-level MultiIndexed DataFrame:
In [1]: index = pd.MultiIndex.from_tuples([(i, j) for i in range(3)
   ...:                                    for j in range(1 + i)], names=list('ij'))
   ...: df = pd.DataFrame(0.1 * np.arange(2 * len(index)).reshape(-1, 2),
   ...:                   columns=list('xy'), index=index)
   ...: df
Out[1]:
x y
i j
0 0 0.0 0.1
1 0 0.2 0.3
1 0.4 0.5
2 0 0.6 0.7
1 0.8 0.9
2 1.0 1.1
And I want to run a custom function on every sub-dataframe:
In [2]: def my_aggr_func(subdf):
   ...:     return subdf['x'].mean() / subdf['y'].mean()
   ...:
   ...: level0 = df.index.levels[0].values
   ...: pd.DataFrame({'mean_ratio': [my_aggr_func(df.loc[i]) for i in level0]},
   ...:              index=pd.Index(level0, name=index.names[0]))
Out[2]:
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
Is there an elegant way to do it with df.groupby('i').agg(__something__) or something similar?
You need GroupBy.apply, which passes each group to the function as a DataFrame:
df1 = df.groupby('i').apply(my_aggr_func).to_frame('mean_ratio')
print (df1)
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
You don't need the custom function. You can calculate the 'within group means' with agg then perform an eval to get the ratio you want.
df.groupby('i').agg('mean').eval('x / y')
i
0 0.000000
1 0.750000
2 0.888889
dtype: float64
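To check that the two answers agree, here is a self-contained sketch that rebuilds the sample frame and computes the ratio both ways. It groups by the index level name and divides the group means directly rather than going through eval; via_apply and via_means are names chosen for this example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(i, j) for i in range(3) for j in range(1 + i)], names=["i", "j"]
)
df = pd.DataFrame(
    0.1 * np.arange(2 * len(index)).reshape(-1, 2),
    columns=["x", "y"],
    index=index,
)

def my_aggr_func(subdf):
    # Receives one sub-DataFrame per value of level "i".
    return subdf["x"].mean() / subdf["y"].mean()

# Approach 1: custom function applied to each group.
via_apply = df.groupby(level="i").apply(my_aggr_func)

# Approach 2: per-group means first, then an ordinary column division.
means = df.groupby(level="i").mean()
via_means = means["x"] / means["y"]

print(via_apply.to_frame("mean_ratio"))
print(np.allclose(via_apply, via_means))
```

The second form avoids calling a Python function once per group, which tends to matter only when there are many groups.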

Using pandas to plot - array error

I have a file that looks like this:
> loc.38167 h3k4me1 1.8299 1.5343 0.0 0.0 1.8299 1.5343 0.0 ....
> loc.08652 h3k4me3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ....
I want to plot 500 random 'loc.' points on a graph. Each loc. has 100 values. I use the following python script:
file = open('h3k4me3.tab.data')
data = {}
for line in file:
    cols = line.strip().split('\t')
    vals = map(float,cols[2:])
    data[cols[0]] = vals
file.close
randomA = data.keys()[:500]
window = int(math.ceil(5000.0 / 100))
xticks = range(-2500,2500,window)
sns.tsplot([data[k] for k in randomA],time=xticks)
However, I get
ValueError: arrays must all be same length