Division between two numbers in a Dataframe - pandas

I am trying to calculate a percent change between 2 numbers in one column when a signal from another column is triggered.
The trigger can be found with np.where() but what I am having trouble with is the percent change. .pct_change does not work because if you .pct_change(-5) you get 16.03/20.35 and I want the number the opposite way 20.35/16.03. See table below. I have tried returning the array from the index in the np.where and adding it to an .iloc from the 'Close' column but it says I can't use that array to get an .iloc position. Can anyone help me solve this problem. Thank you.
IdxNum | Close | Signal (1s)
==============================
0 21.45 0
1 21.41 0
2 21.52 0
3 21.71 0
4 20.8 0
5 20.35 0
6 20.44 0
7 16.99 0
8 17.02 0
9 16.69 0
10 16.03 1<< 26.9% <<< 20.35/16.03-1 (df.Close[5]/df.Close[10]-1)
11 15.67 0
12 15.6 0

You can try this code block:
#Create DataFrame
df = pd.DataFrame({'IdxNum':range(13),
'Close':[21.45,21.41,21.52,21.71,20.8,20.35,20.44,16.99,17.02,16.69,16.03,15.67,15.6],
'Signal':[0] * 13})
df.ix[10,'Signal']=1
#Create a function that calculates the reqd diff
def cal_diff(row):
if(row['Signal']==1):
signal_index = int(row['IdxNum'])
row['diff'] = df.Close[signal_index-5]/df.Close[signal_index]-1
return row
#Create a column and apply that difference
df['diff'] = 0
df = df.apply(lambda x:cal_diff(x),axis=1)
In case you don't have IdxNum column, you can use the index to calculate difference
#Create DataFrame
df = pd.DataFrame({
'Close':[21.45,21.41,21.52,21.71,20.8,20.35,20.44,16.99,17.02,16.69,16.03,15.67,15.6],
'Signal':[0] * 13})
df.ix[10,'Signal']=1
#Calculate the reqd difference
df['diff'] = 0
signal_index = df[df['Signal']==1].index[0]
df.ix[signal_index,'diff'] = df.Close[signal_index-5]/df.Close[signal_index]-1

Related

Classifying pandas columns according to range limits

I have a dataframe with several numeric columns and their range goes either from 1 to 5 or 1 to 10
I want to create two lists of these columns names this way:
names_1to5 = list of all columns in df with numbers ranging from 1 to 5
names_1to10 = list of all columns in df with numbers from 1 to 10
Example:
IP track batch size type
1 2 3 5 A
9 1 2 8 B
10 5 5 10 C
from the dataframe above:
names_1to5 = ['track', 'batch']
names_1to10 = ['ip', 'size']
I want to use a function that gets a dataframe and perform the above transformation only on columns with numbers within those ranges.
I know that if the column 'max()' is 5 than it's 1to5 same idea when max() is 10
What I already did:
def test(df):
list_1to5 = []
list_1to10 = []
for col in df:
if df[col].max() == 5:
list_1to5.append(col)
else:
list_1to10.append(col)
return list_1to5, list_1to10
I tried the above but it's returning the following error msg:
'>=' not supported between instances of 'float' and 'str'
The type of the columns is 'object' maybe this is the reason. If this is the reason, how can I fix the function without the need to cast these columns to float as there are several, sometimes hundreds of these columns and if I run:
df['column'].max() I get 10 or 5
What's the best way to create this this function?
Use:
string = """alpha IP track batch size
A 1 2 3 5
B 9 1 2 8
C 10 5 5 10"""
temp = [x.split() for x in string.split('\n')]
cols = temp[0]
data = temp[1:]
def test(df):
list_1to5 = []
list_1to10 = []
for col in df.columns:
if df[col].dtype!='O':
if df[col].max() == 5:
list_1to5.append(col)
else:
list_1to10.append(col)
return list_1to5, list_1to10
df = pd.DataFrame(data, columns = cols, dtype=float)
Output:
(['track', 'batch'], ['IP', 'size'])

Replace values inside list by generic numbers to group and reference for statistical computing

I usually use "${:,.2f}". format(prices) to round numbers before commas, but what I'm looking for is different, I want to change values numbers to group them and reference them by mode:
Let say I have this list:
0 34,123.45
1 34,456.78
2 34,567.89
3 33,222.22
4 30,123.45
And the replace function will turn the list to:
0 34,500.00
1 34,500.00
2 34,500.00
3 33,200.00
4 30,100.00
Like this when I use stats.mode(prices_rounded) it will show as a result:
Mode Value = 34500.00
Mode Count = 3
Is there a conversion function already available that does the job? I did search for days without luck...
EDIT - WORKING CODE:
#create list
df3 = df_array
print('########## df3: ',df3)
#convert to float
df4 = df3.astype(float)
print('########## df4: ',df4)
#convert list to string
#df5 = ''.join(map(str, df4))
#print('########## df5: ',df5)
#round values
df6 = np.round(df4 /100) * 100
print('######df6',df6)
#get mode stats
df7 = stats.mode(df6)
print('######df7',df7)
#get mode value
df8 = df7[0][0]
print('######df8',df8)
#convert to integer
df9 = int(df8)
print('######df9',df9)
This is exactly what I wanted, thanks!
You can use:
>>> sr
0 34123.45 # <- why 34500.00?
1 34456.78
2 34567.89 # <- why 34500.00?
3 33222.22
4 30123.45
dtype: float64
>>> np.round(sr / 100) * 100
0 34100.0
1 34500.0
2 34600.0
3 33200.0
4 30100.0
dtype: float64

Comparing string values from sequential rows in pandas series

I am trying to count common string values in sequential rows of a panda series using a user defined function and to write an output into a new column. I figured out individual steps, but when I put them together, I get a wrong result. Could you please tell me the best way to do this? I am a very beginner Pythonista!
My pandas df is:
df = pd.DataFrame({"Code": ['d7e', '8e0d', 'ft1', '176', 'trk', 'tr71']})
My string comparison loop is:
x='d7e'
y='8e0d'
s=0
for i in y:
b=str(i)
if b not in x:
s+=0
else:
s+=1
print(s)
the right result for these particular strings is 2
Note, when I do def func(x,y): something happens to s counter and it doesn't produce the right result. I think I need to reset it to 0 every time the loop runs.
Then, I use df.shift to specify the position of y and x in a series:
x = df["Code"]
y = df["Code"].shift(periods=-1, axis=0)
And finally, I use df.apply() method to run the function:
df["R1SB"] = df.apply(func, axis=0)
and I get None values in my new column "R1SB"
My correct output would be:
"Code" "R1SB"
0 d7e None
1 8e0d 2
2 ft1 0
3 176 1
4 trk 0
5 tr71 2
Thank you for your help!
TRY:
df['R1SB'] = df.assign(temp=df.Code.shift(1)).apply(
lambda x: np.NAN
if pd.isna(x['temp'])
else sum(i in str(x['temp']) for i in str(x['Code'])),
1,
)
OUTPUT:
Code R1SB
0 d7e NaN
1 8e0d 2.0
2 ft1 0.0
3 176 1.0
4 trk 0.0
5 tr71 2.0

filter on pandas array

I'm doing this kind of code to find if a value belongs to the array a inside a dataframe:
Solution 1
df = pd.DataFrame([{'a':[1,2,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
a b
0 1 4
Problem
This can go worst if there are repetitions:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df.explode('a')
df[df['a'] == 1]
will give the output:
a b
0 1 4
0 1 4
Solution 2
Another solution could go like:
df = pd.DataFrame([{'a':[1,2,1,3], 'b':4},{'a':[5,6], 'b':7},])
df = df[df['a'].map(lambda row: 1 in row)]
Problem
That Lambda can't go fast if the Dataframe is Big.
Question
As a first goal, I want all the lines where the value 1 belongs to a:
without using Python, since it is slow
with high performance
avoiding memory issues
...
So I'm trying to understand what may I do with the arrays inside Pandas. Is there some documentation on how to use this type efficiently?
IIUC, you are trying to do:
df[df['a'].eq(1).groupby(level=0).transform('any')
Output:
a b
0 1 4
0 2 4
0 3 4
Nothing is wrong. This is normal behavior of pandas.explode().
To check whether a value belongs to values in a you may use this:
if x in df.a.explode()
where x is what you test for.
I think you can convert arrays to scalars with DataFrame constructor and then test value with DataFrame.eq and DataFrame.any:
df = df[pd.DataFrame(df['a'].tolist()).eq(1).any(axis=1)]
print (df)
a b
0 [1, 2, 1, 3] 4
Details:
print (pd.DataFrame(df['a'].tolist()))
0 1 2 3
0 1 2 1.0 3.0
1 5 6 NaN NaN
print (pd.DataFrame(df['a'].tolist()).eq(1))
0 1 2 3
0 True False True False
1 False False False False
So I'm trying to understand what may I do with the arrays inside Pandas. Is there some documentation on how to use this type efficiently?
I think working with lists in pandas is not good idea.

How to efficiently remove duplicate rows from a DataFrame

I'm dealing with a very large Data Frame and I'm using pandas to do the analysis.
The data frame is structured as follows
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
Source Target Weight
0 0 25846 1
1 0 1916 1
2 25846 0 1
3 0 4748 1
4 0 16856 1
The issue is that I want to remove all the "duplicates". In the sense that if I already have a row that contains a Source and a Target I do not want this information to be repeated on another row.
For instance, rows number 0 and 2 are "duplicate" in this sense and only one of them should be retained.
A simple way to get rid of all the "duplicates" is
for index, row in df.iterrows():
df = df[~((df.Source==row.Target)&(df.Target==row.Source))]
However, this approach is horribly slow since my data frame has about 3 million rows. Do you think there's a better way of doing this?
Create two temp columns to save minimum(df.Source, df.Target) and maximum(df.Source, df.Target), and then check duplicated rows by duplicated() method:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 5, (20, 2)), columns=["Source", "Target"])
df["T1"] = np.minimum(df.Source, df.Target)
df["T2"] = np.maximum(df.Source, df.Target)
df[~df[["T1", "T2"]].duplicated()]
No need (as usual) to use a loop with a dataframe. Use the Series.isin method:
So start with this:
df = pandas.DataFrame({
'src': [0, 0, 25, 0, 0],
'tgt': [25, 12, 0, 85, 363]
})
print(df)
src tgt
0 0 25
1 0 12
2 25 0
3 0 85
4 0 363
Then select all of the where src is not in tgt:
df[~(df['src'].isin(df['tgt']) & df['tgt'].isin(df['src']))]
src tgt
1 0 12
3 0 85
4 0 363
Your Source and Targets appear to be mutually exclusive (i.e. you can have one, but not both). Why not add them together (e.g. 25846 + 0) to get the unique identifier. You can then delete the unneeded Target column (reducing memory), and then drop duplicates. In the event your weights are not the same, it will take the first one by default.
df.Source += df.Target
df.drop('Target', axis=1, inplace=True)
df.drop_duplicates(inplace=True)
>>> df
Source Weight
0 25846 1
1 1916 1
3 4748 1
4 16856 1