Pandas columns by given value in last row - pandas

Below is my dataframe "df", made of 34 columns (pairs of stocks) and 530 rows (their respective cumulative returns). 'Date' is the index.
Now, my target is to consider the last row (Date = 3 February 2021). I want to plot ONLY those columns (stock pairs) that have a positive return on the last Date.
I started with:
n = list()
for i in range(len(df.columns)):
    if df.iloc[-1, i] > 0:
        n.append(i)
Output: [3, 11, 12, 22, 23, 25, 27, 28, 30]
Now, the final step is to create a subset dataframe of 'df' containing only the columns whose numbers are in this list. This is where I have problems. Have you any idea? Thanks

Does this solve your problem?
n = []
for i, col in enumerate(df.columns):
    if df.iloc[-1, i] > 0:
        n.append(col)
df[n]
Here you are ;)
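As a side note, the whole selection can also be written as a single boolean-mask lookup (a sketch of the same idea, not part of the code above):
df.loc[:, df.iloc[-1] > 0]
df.iloc[-1] is the last row as a Series, and passing the comparison result to .loc keeps exactly the columns that are positive there.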

sample dataframe df1:
              a    b    c
date
2017-04-01  0.5 -0.7 -0.6
2017-04-02  1.0  1.0  1.3
df1.loc[df1.index.astype(str) == '2017-04-02', df1.ge(1.2).any()]
              c
date
2017-04-02  1.3
The logic will be the same for your case also.

If I understand correctly, you want columns with IDs [3, 11, 12, 22, 23, 25, 27, 28, 30], am I right?
You should use DataFrame.iloc:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
df_subset = df.iloc[:, column_ids].copy()
The ":" on the left side of df.iloc means "all rows". I suggest using copy method in case you want to perform additional operations on df_subset without the risk to affect the original df, or raising Warnings.
If instead of a list of column IDs, you have a list of column names, you should just replace .iloc with .loc.
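For example, with names instead of positions (the names here are made up for illustration):
column_names = ['pair_A', 'pair_B']  # hypothetical column names
df_subset = df.loc[:, column_names].copy()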

Related

Generate & List all Lucky numbers from 1 to n using Numpy

A lucky number is found by listing all numbers up to n.
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32
And then remove every second number so we get: 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Now the next number after 1 here is 3 so now remove every third number:
1,3,7,9,13,15,19,21,25,27,31
Now the next number after 3 is 7, so now remove every seventh number:
1,3,7,9,13,15,21,25,27,31
And the next number after 7 in our list is 9 so now remove every ninth number.
etc
The remaining numbers are lucky numbers: 1,3,7,9,13,15,21,25,31
Hello, I am a relatively new Python programmer who is trying to figure this out.
I did not even come close to solving this, and I want them up to 100 billion, so any advice on the best way to go about this is welcome. Here is my best try to get this done in NumPy:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
              21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
b = a[::2]  # using skip of 2 on our array
c = b[::3]  # using skip of 3 on our array
d = c[::7]  # using skip of 7 on our array
e = d[::9]  # using skip of 9 on our array
print(e)
It returns only 1, so this requires more advanced programming to find the lucky numbers.
I also need some clever programming to automatically find the next skip, since I can't input millions of skips like I have done here with the skips of 2, 3, 7 & 9.
IIUC, one way using a while loop with a checker (needs Python 3.8+ for the walrus operator):
def find_lucky(n):
    arr = list(range(1, n + 1))
    done = set()  # skips already applied
    ind = 1
    while len(arr) >= (i := arr[ind]):
        if i in done:
            ind += 1  # move on to the next skip
        else:
            del arr[i-1::i]  # remove every i-th remaining number
            done.add(i)
    return arr
Output:
find_lucky(32)
# [1, 3, 7, 9, 13, 15, 21, 25, 31]
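Since the question asks for NumPy, here is a boolean-mask sketch of the same sieve (find_lucky_np is just an illustrative name; this still materializes the whole range, so numbers up to 100 billion would need a fundamentally different, chunked approach):
import numpy as np

def find_lucky_np(n):
    arr = np.arange(1, n + 1, 2)  # first pass: drop every second number
    ind = 1
    while ind < len(arr) and arr[ind] <= len(arr):
        step = arr[ind]
        keep = np.ones(len(arr), dtype=bool)
        keep[step - 1::step] = False  # drop every step-th surviving number
        arr = arr[keep]
        ind += 1
    return arr

find_lucky_np(32)
# array([ 1,  3,  7,  9, 13, 15, 21, 25, 31])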

How to get values from my dataframe with two values?

I'm trying to get the values after the commas in a dataframe that has two values in each cell.
My dataframe looks like this.
Thanks!
P.S. I started programming today
>>> df = pd.DataFrame({'c1': [[10, 11], [12, 13]], 'c2': [[100, 110], [120, 130]]})
>>> df
         c1          c2
0  [10, 11]  [100, 110]
1  [12, 13]  [120, 130]
>>> for index, row in df.iterrows():
...     print(row['c1'][1])
...
11
13
I suppose this is what you want
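If a loop is not needed, the .str accessor also indexes into list-valued cells, so the same values come out vectorized:
df['c1'].str[1]
# 0    11
# 1    13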

Percentile of every column and row in a dataframe

I have a csv that looks like the image below. I want to calculate the percentiles (10th, 50th, 90th) of each row starting from B2 to X2 and add the final percentile in a new column. Essentially, I want to find the 10th percentile of the average(std, cv, sp_tim.....) value over the entire period of record available.
I have created the following code line to read it in python as a dataframe format so far.
da = pd.read_csv('Project/11433300_annual_flow_matrix.csv', index_col=0, parse_dates=True)
If I have understood your question correctly, then the code below might be helpful for you.
I have used some dummy data and given it a similar kind of treatment to the one you are looking for:
import numpy as np
import pandas as pd

aq = [1, 2, 2, 3, 3, 4, 4, 5, 7, 8, 10, 11]
aw = [91, 25, 13, 53, 95, 94, 75, 35, 57, 88, 111, 12]
df = pd.DataFrame({'aq': aq, 'aw': aw})
n = df.shape[0]  # number of rows
p = 0.1          # for the 10th percentile
position = int(np.ceil(n * p))
df.iloc[position]  # row at the 10th-percentile position
Kindly have a look and let me know if this works for you.
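For what it's worth, pandas also has a built-in quantile method that computes this directly; a sketch for the row-wise case from the question (the p10/p50/p90 column names are made up):
quantiles = da.quantile([0.1, 0.5, 0.9], axis=1).T
quantiles.columns = ['p10', 'p50', 'p90']
da = da.join(quantiles)
With axis=1 the percentile is taken across each row's columns, which matches taking the 10th, 50th and 90th percentile of every row and adding them as new columns.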

How to ensure get label for zero counts in python pandas pd.cut

I am analyzing a DataFrame and getting timing counts which I want to put into specific buckets (0-10 seconds, 10-30 seconds, etc).
Here is a simplified example:
import pandas as pd
filter_values = [0, 10, 20, 30]  # bucket edges for pd.cut
# Sample times
df1 = pd.DataFrame([1, 3, 8, 20], columns=['filtercol'])
# Use cut to get counts for each bucket
out = pd.cut(df1.filtercol, bins=filter_values)
counts = pd.value_counts(out)
print(counts)
The above prints:
(0, 10] 3
(10, 20] 1
dtype: int64
You will notice it does not show any values for (20, 30]. This is a problem because I want to put this into my output as zero. I can handle it using the following code:
bucket1 = bucket2 = bucket3 = 0
if '(0, 10]' in counts:
    bucket1 = counts['(0, 10]']
if '(10, 20]' in counts:
    bucket2 = counts['(10, 20]']
if '(20, 30]' in counts:
    bucket3 = counts['(20, 30]']
print(bucket1, bucket2, bucket3)
But I want a simpler, cleaner approach where I can use:
print(counts['(0, 10]'], counts['(10, 20]'], counts['(20, 30]'])
Ideally where the print is based on the values in filter_values so they are only in one place in the code. Yes I know I can change the print to use filter_values[0]...
Lastly when using cut is there a way to specify infinity so the last bucket is all values greater than say 60?
Cheers,
Stephen
You can reindex by the categorical's levels:
In [11]: pd.value_counts(out).reindex(out.levels, fill_value=0)
Out[11]:
(0, 10] 3
(10, 20] 1
(20, 30] 0
dtype: int64
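Two side notes, assuming newer library versions than the original answer used: in recent pandas the levels attribute is gone, so the reindex reads
pd.value_counts(out).reindex(out.cat.categories, fill_value=0)
And for the infinity question at the end: pd.cut accepts numpy.inf as a bin edge, so bins = [0, 10, 20, 30, np.inf] makes the last bucket catch everything above 30.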

Numpy rebinning a 2D array

I am looking for a fast way to do numerical binning of a 2D numpy array. By binning I mean calculating submatrix averages or cumulative values. For example, x = numpy.arange(16).reshape(4, 4) would be split into 4 submatrices of 2x2 each, giving numpy.array([[2.5, 4.5], [10.5, 12.5]]), where 2.5 = numpy.average([0, 1, 4, 5]), etc.
How can such an operation be performed efficiently? I don't really have any idea how to do this...
Many thanks...
You can use a higher dimensional view of your array and take the average along the extra dimensions:
In [12]: a = np.arange(36).reshape(6, 6)
In [13]: a
Out[13]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])
In [14]: a_view = a.reshape(3, 2, 3, 2)
In [15]: a_view.mean(axis=3).mean(axis=1)
Out[15]:
array([[  3.5,   5.5,   7.5],
       [ 15.5,  17.5,  19.5],
       [ 27.5,  29.5,  31.5]])
In general, if you want bins of shape (a, b) for an array of (rows, cols), your reshaping of it should be .reshape(rows // a, a, cols // b, b). Note also that the order of the .mean is important, e.g. a_view.mean(axis=1).mean(axis=3) will raise an error, because a_view.mean(axis=1) only has three dimensions, although a_view.mean(axis=1).mean(axis=2) will work fine, but it makes it harder to understand what is going on.
As is, the above code only works if you can fit an integer number of bins inside your array, i.e. if a divides rows and b divides cols. There are ways to deal with other cases, but you will have to define the behavior you want then.
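On NumPy 1.7 or newer you can also pass both axes to a single mean call, which sidesteps the ordering issue entirely:
a_view.mean(axis=(1, 3))  # same result as a_view.mean(axis=3).mean(axis=1)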
See the SciPy Cookbook on rebinning, which provides this snippet (shown here in Python 3 form, with the numpy import made explicit):
import numpy as np

def rebin(a, *args):
    '''rebin ndarray data into a smaller ndarray of the same rank whose dimensions
    are factors of the original dimensions. e.g. an array with 6 columns and 4 rows
    can be reduced to have 6, 3, 2 or 1 columns and 4, 2 or 1 rows.
    example usages:
    >>> a = np.random.rand(6, 4); b = rebin(a, 3, 2)
    >>> a = np.random.rand(6); b = rebin(a, 2)
    '''
    shape = a.shape
    lenShape = len(shape)
    factor = np.asarray(shape) // np.asarray(args)
    # Build the reshape/sum/divide expression as a string, then evaluate it
    evList = ['a.reshape('] + \
             ['args[%d],factor[%d],' % (i, i) for i in range(lenShape)] + \
             [')'] + ['.sum(%d)' % (i + 1) for i in range(lenShape)] + \
             ['/factor[%d]' % i for i in range(lenShape)]
    print(''.join(evList))
    return eval(''.join(evList))
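Run on the 4x4 example from the question, it prints the generated expression and returns the 2x2 block averages:
x = np.arange(16).reshape(4, 4)
rebin(x, 2, 2)
# prints a.reshape(args[0],factor[0],args[1],factor[1],).sum(1).sum(2)/factor[0]/factor[1]
# array([[ 2.5,  4.5],
#        [10.5, 12.5]])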
I assume that you only want to know how to generally build a function that performs well and does something with arrays, just like numpy.reshape in your example. So if performance really matters and you're already using numpy, you can write your own C code for that, like numpy does. For example, the implementation of arange is completely in C. Almost everything in numpy that matters in terms of performance is implemented in C.
However, before doing so you should try to implement the code in Python and see if the performance is good enough. Try to make the Python code as efficient as possible. If it still doesn't suit your performance needs, go the C way.
You may read about that in the docs.