I have a dataframe containing the data, and another dataframe containing a single row of indices.
import pandas as pd

data = {'col_1': [4, 5, 6, 7], 'col_2': [3, 4, 9, 8], 'col_3': [5, 5, 6, 9], 'col_4': [8, 7, 6, 5]}
df = pd.DataFrame(data)
ind = {'ind_1': [2], 'ind_2': [1], 'ind_3': [3], 'ind_4': [2]}
ind = pd.DataFrame(ind)
Both have the same number of columns. I want to extract the values of df corresponding to the indices stored in ind, so that I end up with a single row.
For this data it should be: [6, 4, 9, 6]. I tried df.loc[ind.loc[0]] but that of course gives me four different rows, not one.
The other idea I have is to zip columns and rows and iterate over them. But I feel there should be a simpler way.
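For reference, that zip idea might look something like this (a sketch; it assumes the columns of df and ind line up positionally):
# pair each column of df with the corresponding index from ind and pick one value per column
result = [df[col].iloc[i] for col, i in zip(df.columns, ind.iloc[0])]
print(result)  # [6, 4, 9, 6]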
You can go to the NumPy domain and index there:
In [14]: df.to_numpy()[ind, np.arange(len(df.columns))]
Out[14]: array([[6, 4, 9, 6]], dtype=int64)
This pairs up 2, 1, 3, 2 from ind with 0, 1, 2, 3 (i.e., 0 to number of columns - 1), so we get the values at [2, 0], [1, 1] and so on.
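If a labeled result is preferred, the same indexing can be wrapped back into a Series (a sketch; it assumes numpy is imported as np):
import numpy as np

vals = df.to_numpy()[ind.to_numpy().ravel(), np.arange(len(df.columns))]
result = pd.Series(vals, index=df.columns)
# col_1    6
# col_2    4
# col_3    9
# col_4    6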
There's also df.lookup but it's being deprecated, so...
In [19]: df.lookup(ind.iloc[0], df.columns)
~\Anaconda3\Scripts\ipython:1: FutureWarning: The 'lookup' method is deprecated and will be removed in a future version. You can use DataFrame.melt and DataFrame.loc as a substitute.
Out[19]: array([6, 4, 9, 6], dtype=int64)
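As a non-deprecated stand-in for that lookup call, one option is a sketch built on Index.get_indexer plus the same NumPy indexing (it also works when the row labels are not plain positions):
rows = df.index.get_indexer(ind.iloc[0])     # positions of the requested row labels
cols = df.columns.get_indexer(df.columns)    # 0 .. number of columns - 1
df.to_numpy()[rows, cols]                    # array([6, 4, 9, 6])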
Below is my dataframe "df", made of 34 columns (pairs of stocks) and 530 rows (their respective cumulative returns). 'Date' is the index.
Now, my target is to consider the last row (Date = 3 February 2021). I want to plot ONLY those columns (stock pairs) that have a positive return on the last Date.
I started with:
n = list()
for i in range(len(df.columns)):
    if df.iloc[-1, i] > 0:
        n.append(i)
Output: [3, 11, 12, 22, 23, 25, 27, 28, 30]
Now, the final step is to create a subset dataframe of 'df' containing only the columns at the positions in this list. This is where I have problems. Any ideas? Thanks
Does this solve your problem?
n = []
for i, col in enumerate(df.columns):
    if df.iloc[-1, i] > 0:
        n.append(col)
df[n]
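Equivalently, a loop-free sketch of the same selection uses a boolean mask built from the last row:
# keep only the columns whose value in the last row is positive
df_pos = df.loc[:, df.iloc[-1] > 0]
df_pos.plot()  # requires matplotlib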
Here you are ;)
sample df1:
              a    b    c
date
2017-04-01  0.5 -0.7 -0.6
2017-04-02  1.0  1.0  1.3
df1.loc[df1.index.astype(str) == '2017-04-02', df1.ge(1.2).any()]
              c
date
2017-04-02  1.3
The logic will be the same for your case as well.
If I understand correctly, you want columns with IDs [3, 11, 12, 22, 23, 25, 27, 28, 30], am I right?
You should use DataFrame.iloc:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
df_subset = df.iloc[:, column_ids].copy()
The ":" on the left side of df.iloc means "all rows". I suggest using copy method in case you want to perform additional operations on df_subset without the risk to affect the original df, or raising Warnings.
If instead of a list of column IDs, you have a list of column names, you should just replace .iloc with .loc.
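For completeness, a sketch of the equivalent label-based selection, deriving the labels from the same positional IDs:
column_ids = [3, 11, 12, 22, 23, 25, 27, 28, 30]
column_names = df.columns[column_ids]        # the labels at those positions
df_subset = df.loc[:, column_names].copy()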
I have a dataframe that has dates as most of the columns with the following structure:
df1 = pd.DataFrame({'State':['NY', 'CA'], '3/1/20' :[5, 10], '3/2/20': [11, 13], '3/3/20': [4, 12]})
and I want to 'pivot' the dataframe so it is now in this format:
df2 = pd.DataFrame({'Date':['3/1/20','3/1/20','3/2/20','3/2/20','3/3/20','3/3/20'], 'State':['NY', 'CA', 'NY', 'CA', 'NY', 'CA'], 'Values':[5,10,11,13,4,12]})
Does anyone have any suggestions on how to do this? Thanks!
Use pd.melt
df2 = pd.melt(df1, id_vars=['State']).rename(columns={'variable': 'Date', 'value': 'Values'})
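On the sample above this should produce something like the following (the row order follows melt's column-by-column stacking; reorder with df2[['Date', 'State', 'Values']] if needed):
print(df2)
#   State    Date  Values
# 0    NY  3/1/20       5
# 1    CA  3/1/20      10
# 2    NY  3/2/20      11
# 3    CA  3/2/20      13
# 4    NY  3/3/20       4
# 5    CA  3/3/20      12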
I am trying to trim the size of the data array in one of the columns by using a condition array from another column. For example, I have my data like the df below:
df= pd.DataFrame({'nHit':[4,3,5],'hit_id':[[10,20,30,50],[20,40,50],[30,50,60,70,80]],'hit_val':[[1,2,3,4],[5,6,7],[8,9,10,11,12]]},index=[0,1,2])
I want to know if there is a way to filter the values in the hit_val column based on a condition on the hit_id array (such as only keeping the values at the same positions where hit_id is 30 or 50).
The output I expect to get is something like the df below:
df= pd.DataFrame({'nHit':[2,1,2],'hit_id':[[30,50],[50],[30,50]],'hit_val':[[3,4],[7],[8,9,10]]},index=[0,1,2])
My thought is to create a condition array from the hit_id column using df.apply() and then use it to filter hit_val. Does anyone know how to implement this?
From what I understand, starting from the original df, you can explode both columns, then filter on the condition and groupby with agg as list:
l = [30,50]
m = pd.concat([df[i].explode() for i in ['hit_id','hit_val']],axis=1)
out = m[m['hit_id'].isin(l)].groupby(level=0).agg(list)
out.insert(0,'nHit',out['hit_id'].str.len())
print(out)
nHit hit_id hit_val
0 2 [30, 50] [3, 4]
1 1 [50] [7]
2 2 [30, 50] [8, 9]
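Since the question mentions df.apply, here is a row-wise sketch of the same filtering (it rebuilds each list, keeping only the positions where hit_id is in the keep set):
keep = {30, 50}

def filter_row(row):
    # keep the (id, value) pairs whose id is in the keep set
    pairs = [(i, v) for i, v in zip(row['hit_id'], row['hit_val']) if i in keep]
    ids = [i for i, _ in pairs]
    vals = [v for _, v in pairs]
    return pd.Series({'nHit': len(ids), 'hit_id': ids, 'hit_val': vals})

out = df.apply(filter_row, axis=1)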
Using a copy-and-paste of the two expressions (thanks), here are their displays, which should help us visualize the desired action:
In [247]: df
Out[247]:
nHit hit_id hit_val
0 4 [10, 20, 30, 50] [1, 2, 3, 4]
1 3 [20, 40, 50] [5, 6, 7]
2 5 [30, 50, 60, 70, 80] [8, 9, 10, 11, 12]
In [249]: df1
Out[249]:
nHit hit_id hit_val
0 2 [30, 50] [3, 4]
1 1 [50] [7]
2 2 [30, 50] [8, 9, 10]
I'd like to count the unique groups from the result of a Pandas group-by operation. For instance, here is an example data frame.
In [98]: df = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [10,10,11,10,10,15]})
In [99]: df.groupby('A').groups
Out[99]: {1: [0, 3], 2: [1, 4], 3: [2, 5]}
The conceptual groups are {1: [10, 10], 2: [10, 10], 3: [11, 15]}, where the index locations in the groups above are substituted with the values from column B. The first problem I've run into is how to convert those positions (e.g. [0, 3]) into values from the B column.
Given the ability to convert the groups into the value groups from column B, I could compute the unique groups by hand, but a secondary question here is whether Pandas has a built-in routine for this; I haven't seen one.
Edit: updated with target output.
This is the output I would be looking for in the simplest case:
{1: [10, 10], 2: [10, 10], 3: [11, 15]}
And counting the unique groups would produce something equivalent to:
{[10, 10]: 2, [11, 15]: 1}
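As a rough sketch of the first step (converting those positions to column B values), something like this seems to work on the frame above:
# map each group's row labels to the corresponding values of column B
value_groups = {k: df.loc[idx, 'B'].tolist() for k, idx in df.groupby('A').groups.items()}
print(value_groups)  # {1: [10, 10], 2: [10, 10], 3: [11, 15]}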
How about:
>>> df = pd.DataFrame({'A': [1,2,3,1,2,3], 'B': [10,10,11,10,10,15]})
>>> df.groupby("A")["B"].apply(tuple).value_counts()
(10, 10) 2
(11, 15) 1
dtype: int64
or maybe
>>> df.groupby("A")["B"].apply(lambda x: tuple(sorted(x))).value_counts()
(10, 10) 2
(11, 15) 1
dtype: int64
if you don't care about the order within the group.
You can trivially call .to_dict() if you'd like, e.g.
>>> df.groupby("A")["B"].apply(tuple).value_counts().to_dict()
{(11, 15): 1, (10, 10): 2}
maybe:
>>> df.groupby('A')['B'].aggregate(lambda ts: list(ts.values)).to_dict()
{1: [10, 10], 2: [10, 10], 3: [11, 15]}
For counting the groups you need to convert to tuples, because lists are not hashable:
>>> ts = df.groupby('A')['B'].aggregate(lambda ts: tuple(ts.values))
>>> ts.value_counts().to_dict()
{(11, 15): 1, (10, 10): 2}