Replace values inside list by generic numbers to group and reference for statistical computing - pandas

I usually use "${:,.2f}".format(price) to format prices with commas and two decimals, but what I'm looking for here is different: I want to round the values so they group together, and then reference the groups by mode.
Let's say I have this list:
0 34,123.45
1 34,456.78
2 34,567.89
3 33,222.22
4 30,123.45
And the replace function should turn the list into:
0 34,500.00
1 34,500.00
2 34,500.00
3 33,200.00
4 30,100.00
This way, when I use stats.mode(prices_rounded), the result will be:
Mode Value = 34500.00
Mode Count = 3
Is there an existing conversion function that does the job? I've searched for days without luck...
EDIT - WORKING CODE:
import numpy as np
from scipy import stats

#start from the original array
df3 = df_array
#convert to float
df4 = df3.astype(float)
#round each value to the nearest hundred
df6 = np.round(df4 / 100) * 100
#get mode stats
df7 = stats.mode(df6)
#get mode value (older SciPy returns arrays; on newer SciPy use df7.mode instead)
df8 = df7[0][0]
#convert to integer
df9 = int(df8)
print(df9)
This is exactly what I wanted, thanks!

You can use:
>>> sr
0 34123.45 # <- why 34500.00?
1 34456.78
2 34567.89 # <- why 34500.00?
3 33222.22
4 30123.45
dtype: float64
>>> np.round(sr / 100) * 100
0 34100.0
1 34500.0
2 34600.0
3 33200.0
4 30100.0
dtype: float64
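(The # <- why 34500.00? comments point out that 34,123.45 and 34,567.89 round to 34,100 and 34,600, not 34,500, so the expected output in the question cannot come from plain rounding.)
Note that np.round(sr / 100) * 100 is equivalent to sr.round(-2): a negative decimals argument rounds to the left of the decimal point. To then pull out the mode and its count without scipy, a minimal pandas-only sketch (assuming sr is the Series shown above):
rounded = sr.round(-2)              #round to the nearest hundred
counts = rounded.value_counts()     #frequencies, most frequent value first
print('Mode Value =', counts.index[0])
print('Mode Count =', counts.iloc[0])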


ValueError: could not convert string to float: '1.598.248'

How do I convert a DataFrame column of numbers in the millions to a double or float? The column looks like this:
0 1.598.248
1 1.323.373
2 1.628.781
3 1.551.707
4 1.790.930
5 1.877.985
6 1.484.103
and the desired output is:
0 1598248.0
1 1323373.0
2 1628781.0
3 1551707.0
4 1790930.0
5 1877985.0
6 1484103.0
You will need to remove the full stops. You can use the pandas replace method, escaping the dot so it is matched literally, and then convert to float:
df['col'] = df['col'].replace(r'\.', '', regex=True).astype('float')
Example
>>> df = pd.DataFrame({'A': ['1.1.1', '2.1.2', '3.1.3', '4.1.4']})
>>> df
A
0 1.1.1
1 2.1.2
2 3.1.3
3 4.1.4
>>> df['A'] = df['A'].replace(r'\.', '', regex=True).astype('float')
>>> df['A']
0 111.0
1 212.0
2 313.0
3 414.0
Name: A, dtype: float64
>>> df['A'].dtype
float64
I'm assuming that, because the values contain two full stops, the data is stored as strings. However, this should work even if the column also contains some integers or floats.
my_col_name
0 1.598.248
1 1.323.373
2 1.628.781
3 1.551.707
4 1.790.930
5 1.877.985
6 1.484.103
With the df above, you can try the code below, which works in 3 steps: (1) change the column type to string, (2) replace the dot characters, (3) change the column type to float.
col = 'my_col_name'
df[col] = df[col].astype('str')
df[col] = df[col].str.replace('.','')
df[col] = df[col].astype('float')
print(df)
Please note the above will produce a warning: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
Since the dot must be treated as a literal character rather than as a regex wildcard, pass regex=False explicitly; also, I've combined it into 1 line:
df[col] = df[col].astype('str').str.replace('.', '', regex=False).astype('float')
print(df)
Output
my_col_name
0 1598248.0
1 1323373.0
2 1628781.0
3 1551707.0
4 1790930.0
5 1877985.0
6 1484103.0
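If the dots are thousands separators and the data comes from a file, another option is to strip them at parse time; a sketch assuming a CSV source (the file name here is hypothetical):
import pandas as pd

#thousands='.' tells the parser to treat dots as grouping separators
df = pd.read_csv('data.csv', thousands='.')
print(df['my_col_name'].dtype)  #already numeric, no string replace needed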

Classifying pandas columns according to range limits

I have a dataframe with several numeric columns whose values range either from 1 to 5 or from 1 to 10.
I want to create two lists of these columns' names, this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
from the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['IP', 'size']
I want to use a function that takes a dataframe and performs the above classification, only on columns with numbers within those ranges.
I know that if the column's max() is 5 then it's 1-to-5; same idea when max() is 10.
What I already did:
def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df:
        if df[col].max() == 5:
            list_1to5.append(col)
        else:
            list_1to10.append(col)
    return list_1to5, list_1to10
I tried the above, but it returns the following error message:
'>=' not supported between instances of 'float' and 'str'
The type of the columns is 'object', which may be the reason. If so, how can I fix the function without having to cast these columns to float? There are several, sometimes hundreds, of these columns, and if I run df['column'].max() I do get 10 or 5.
What's the best way to write this function?
Use:
string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""
temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]

def test(df):
    list_1to5 = []
    list_1to10 = []
    for col in df.columns:
        if df[col].dtype != 'O':  #only consider numeric columns
            if df[col].max() == 5:
                list_1to5.append(col)
            else:
                list_1to10.append(col)
    return list_1to5, list_1to10

df = pd.DataFrame(data, columns=cols)
df[cols[1:]] = df[cols[1:]].astype(float)  #'alpha' stays object, the knobs become float
print(test(df))
Output:
(['track', 'batch'], ['IP', 'size'])
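If the columns really are stored as strings (dtype object), a sketch that converts each column on the fly with pd.to_numeric instead of requiring a float DataFrame (the name classify is just for illustration):
import pandas as pd

def classify(df):
    names_1to5, names_1to10 = [], []
    for col in df.columns:
        vals = pd.to_numeric(df[col], errors='coerce')  #non-numbers become NaN
        if vals.isna().all():
            continue  #skip genuinely non-numeric columns such as 'type'
        (names_1to5 if vals.max() == 5 else names_1to10).append(col)
    return names_1to5, names_1to10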

How to use the diff() function in pandas but enter the difference values in a new column?

I have a dataframe df:
df
x-value
frame
1 15
2 20
3 19
How can I get:
df
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Not to say there is anything wrong with what @Wen posted as a comment, but I want to post a more complete answer.
The Problem
There are 3 things going on that need to be addressed:
Calculating the values that are the differences from one row to the next.
Handling the fact that the "difference" will have one fewer value than the original length of the dataframe, so we have to fill in a value for the missing bit.
Assigning the result to a new column.
Option #1
The most natural way to do the diff would be to use pd.Series.diff (as @Wen suggested). But in order to produce the stated results, which are integers, I recommend using the pd.Series.fillna parameter downcast='infer'. Finally, I don't like editing the dataframe unless there is a need for it, so I use pd.DataFrame.assign to produce a new dataframe that is a copy of the old one with the new column attached.
df.assign(**{'delta-x': df['x-value'].diff().fillna(0, downcast='infer')})
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Option #2
Similar to option #1, but I'll use numpy.diff to preserve the int type, in addition to picking up some performance.
df.assign(**{'delta-x': np.append(0, np.diff(df['x-value'].values))})
x-value delta-x
frame
1 15 0
2 20 5
3 19 -1
Testing
from timeit import timeit

pir1 = lambda d: d.assign(**{'delta-x': d['x-value'].diff().fillna(0, downcast='infer')})
pir2 = lambda d: d.assign(**{'delta-x': np.append(0, np.diff(d['x-value'].values))})

res = pd.DataFrame(
    index=[10, 300, 1000, 3000, 10000, 30000],
    columns=['pir1', 'pir2'], dtype=float)

for i in res.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=1000)
res.plot(loglog=True)
res.div(res.min(1), 0)
pir1 pir2
10 2.069498 1.0
300 2.123017 1.0
1000 2.397373 1.0
3000 2.804214 1.0
10000 4.559525 1.0
30000 7.058344 1.0
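On recent pandas versions the downcast argument of fillna is deprecated, so here is a sketch of an equivalent that keeps the integer dtype explicitly (same result as pir1 above):
df.assign(**{'delta-x': df['x-value'].diff().fillna(0).astype(int)})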

Division between two numbers in a Dataframe

I am trying to calculate the percent change between 2 numbers in one column when a signal from another column is triggered.
The trigger can be found with np.where(), but what I am having trouble with is the percent change. .pct_change does not work because .pct_change(-5) gives 16.03/20.35, and I want it the other way around: 20.35/16.03. See the table below. I have tried returning the array from the index in np.where and feeding it to .iloc on the 'Close' column, but it says I can't use that array to get an .iloc position. Can anyone help me solve this problem? Thank you.
IdxNum | Close | Signal (1s)
==============================
0 21.45 0
1 21.41 0
2 21.52 0
3 21.71 0
4 20.8 0
5 20.35 0
6 20.44 0
7 16.99 0
8 17.02 0
9 16.69 0
10 16.03 1 << 26.9% = 20.35/16.03 - 1 (df.Close[5]/df.Close[10] - 1)
11 15.67 0
12 15.6 0
You can try this code block:
#Create DataFrame
df = pd.DataFrame({'IdxNum': range(13),
                   'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44, 16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
                   'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1  #.ix is gone from modern pandas; use .loc

#Create a function that calculates the required diff
def cal_diff(row):
    if row['Signal'] == 1:
        signal_index = int(row['IdxNum'])
        row['diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
    return row

#Create a column and apply that difference
df['diff'] = 0
df = df.apply(cal_diff, axis=1)
In case you don't have the IdxNum column, you can use the index to calculate the difference:
#Create DataFrame
df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44, 16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

#Calculate the required difference
df['diff'] = 0
signal_index = df[df['Signal'] == 1].index[0]
df.loc[signal_index, 'diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
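A vectorized alternative, sketched under the assumption that the comparison is always against the close 5 rows earlier: shift(5) lines the earlier close up with the signal row, and np.where applies the ratio only where Signal is 1.
import numpy as np

#row 10 gets 20.35/16.03 - 1 = 26.9%; every other row gets 0
df['diff'] = np.where(df['Signal'] == 1,
                      df['Close'].shift(5) / df['Close'] - 1,
                      0)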

How to turn Pandas' DataFrame.groupby() result into MultiIndex

Suppose I have a set of measurements that were obtained by varying two parameters, knob_1 and knob_2 (in practice there are a lot more):
data = np.empty((6, 3), dtype=float)
data[:,0] = [3,4,5,3,4,5]
data[:,1] = [1,1,1,2,2,2]
data[:,2] = np.random.random(6)
df = pd.DataFrame(data, columns=['knob_1', 'knob_2', 'signal'])
i.e., df is
knob_1 knob_2 signal
0 3 1 0.076571
1 4 1 0.488965
2 5 1 0.506059
3 3 2 0.415414
4 4 2 0.771212
5 5 2 0.502188
Now, considering each parameter on its own, I want to find the minimum value that was measured for each setting of this parameter (ignoring the settings of all other parameters). The pedestrian way of doing this is:
new_index = []
new_data = []
for param in df.columns:
    if param == 'signal':
        continue
    group = df.groupby(param)['signal'].min()
    for (k, v) in group.items():
        new_index.append((param, k))
        new_data.append(v)
new_index = pd.MultiIndex.from_tuples(new_index,
                                      names=('parameter', 'value'))
df2 = pd.Series(index=new_index, data=new_data)
resulting df2 being:
parameter value
knob_1 3 0.076571
       4 0.488965
       5 0.502188
knob_2 1 0.076571
       2 0.415414
dtype: float64
Is there a better way to do this, in particular to get rid of the inner loop?
It seems to me that the result of the df.groupby operation already has everything I need - if only there was a way to somehow create a MultiIndex from it without going through the list of tuples.
Use the keys argument of pd.concat():
pd.concat([df.groupby('knob_1')['signal'].min(),
           df.groupby('knob_2')['signal'].min()],
          keys=['knob_1', 'knob_2'],
          names=['parameter', 'value'])
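To avoid listing each knob by hand, a sketch that treats every column except 'signal' as a parameter:
params = [c for c in df.columns if c != 'signal']
df2 = pd.concat([df.groupby(p)['signal'].min() for p in params],
                keys=params,
                names=['parameter', 'value'])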