Delete rows from a dataframe based on column values

I'm new to Python, so I apologize in advance for syntax mistakes or inaccuracies.
I'm working with this dataframe:
import numpy as np
import pandas as pd

data = [[1, 5, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 2, 4], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3], [6, 6, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 6, 6]]
data = np.transpose(data)
dfe_filtered_bp = pd.DataFrame(data, columns=['nt', 'bcol', 'IDb', 'IDp'])
print('sample dfe: ')
print(dfe_filtered_bp)
I'm trying to delete some lines from the dataframe. I tried with this for loop:
timestep_size = 1
rows_to_drop = []
for index_i, row_i in dfe_filtered_bp.iterrows():
    for index_j, row_j in dfe_filtered_bp.iterrows():
        # note: '&' binds tighter than '==', so chaining comparisons with '&'
        # does not do what it looks like; use 'and' between scalar comparisons
        if (row_i['nt'] == row_j['nt'] + timestep_size
                and row_i['IDb'] == row_j['IDb']
                and row_i['IDp'] == row_j['IDp']):
            rows_to_drop.append(index_j)  # store the index label, not the row itself
Basically, I want to store all the rows that must be deleted.
A row (row_j) must be deleted if:
another row (row_i) has an nt value equal to its nt value plus one timestep (row_i['nt'] == row_j['nt'] + timestep_size), AND
that row has the same IDb and IDp column values (row_i['IDb'] == row_j['IDb'] and row_i['IDp'] == row_j['IDp']).
This is the expected output:
#    nt  bcol  IDb  IDp
#   1.0   1.0  1.0  6.0
#   5.0   1.0  1.0  6.0
#   1.0   1.0  2.0  4.0
#   1.0   1.0  2.0  5.0
#   2.0   1.0  3.0  6.0
#   4.0   1.0  3.0  6.0
I have some problems with the code syntax. In addition, I do not know how to delete all the rows collected in "rows_to_drop" from "dfe_filtered_bp".
Can anyone help me? Thank you in advance.
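For the deletion step, DataFrame.drop takes index labels; here is a minimal sketch, assuming rows_to_drop holds the index labels collected by the loop above:
rows_to_drop = list(set(rows_to_drop))                    # dedupe labels appended more than once
dfe_filtered_bp = dfe_filtered_bp.drop(index=rows_to_drop)  # drop by index label
dfe_filtered_bp = dfe_filtered_bp.reset_index(drop=True)    # optional: renumber the rows
print(dfe_filtered_bp)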

Related

Pandas sort and bin data within dataframe to make pivot table

I have a dataframe with (random) observations for height, period and zones as such:
Height = [1, 4, 3, 3, 3, 2, 4, 2, 3, 3, 3, 1, 4, 3, 3, 4, 1, 4, 2, 2]
Period = [5, 4, 2, 4, 2, 2, 3, 3, 5, 2, 4, 5, 4, 2, 4, 4, 3, 5, 4, 3]
Zone = [1, 1, 3, 1, 4, 1, 1, 1, 1, 4, 1, 3, 2, 1, 4, 2, 4, 4, 2, 4]
Direction = [292.5, 22.5, 202.5, 337.5, 292.5, 337.5, 337.5, 337.5, 22.5, 292.5, 22.5, 157.5,
             112.5, 337.5, 292.5, 112.5, 247.5, 247.5, 112.5, 292.5]
df = pd.DataFrame({'Height': Height, 'Period': Period, 'Zone': Zone, 'Direction': Direction})
I want to make a table with the zones on the index, the unique periods on the columns, and, for each index-column combination, the maximum of the height. Any idea how to do this?
You can try
out = (df.pivot_table(index='Zone', columns='Period', values='Height', aggfunc='max')
         .rename(index=lambda x: f'Zone={x}', columns=lambda x: f'Period={x}'))
print(out)
Period  Period=2  Period=3  Period=4  Period=5
Zone
Zone=1       3.0       4.0       4.0       3.0
Zone=2       NaN       NaN       4.0       NaN
Zone=3       3.0       NaN       NaN       1.0
Zone=4       3.0       2.0       3.0       4.0
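For what it's worth, pivot_table with aggfunc='max' is equivalent to a groupby/unstack round trip; a sketch on the same df, with out2 as a hypothetical name:
out2 = (df.groupby(['Zone', 'Period'])['Height'].max()  # max Height per (Zone, Period) pair
          .unstack()                                     # move Period out to the columns
          .rename(index=lambda x: f'Zone={x}', columns=lambda x: f'Period={x}'))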

Pandas rolling mean only for non-NaNs

If I have a DataFrame:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
                   'A1': [1, 1, 2, 2, 2],
                   'A2': [1, 2, 3, 3, 3]})
I want to group by columns "A1" and "A2" and then apply a rolling mean on "B" with window 3. If fewer values are available, that is fine; the mean should still be computed. But I do not want a value where there is no original entry.
Result should be:
pd.DataFrame({'B': [0, 1, 2, np.nan, 3]})
Applying df.rolling(3, min_periods=1).mean() yields:
pd.DataFrame({'B': [0, 1, 2, 2, 3]})
Any ideas?
The reason is that the rolling mean with window=3 and min_periods=1 outputs scalars, not NaNs; a possible solution is to set NaN manually after rolling:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
'A': [1, 1, 2, 2, 2]})
df['C'] = df['B'].rolling(3, min_periods=1).mean().mask(df['B'].isna())
df['D'] = df.groupby('A')['B'].rolling(3, min_periods=1).mean().droplevel(0).mask(df['B'].isna())
print (df)
     B  A    C    D
0  0.0  1  0.0  0.0
1  1.0  1  0.5  0.5
2  2.0  2  1.0  2.0
3  NaN  2  NaN  NaN
4  4.0  2  3.0  3.0
EDIT: For multiple grouping columns, remove the group levels with Series.droplevel:
df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4],
'A1': [1, 1, 2, 2, 2],
'A2': [1, 2, 3, 3, 3]})
df['D'] = df.groupby(['A1','A2'])['B'].rolling(3, min_periods=1).mean().droplevel(['A1','A2']).mask(df['B'].isna())
print (df)
     B  A1  A2    D
0  0.0   1   1  0.0
1  1.0   1   2  1.0
2  2.0   2   3  2.0
3  NaN   2   3  NaN
4  4.0   2   3  3.0
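Equivalently, Series.where with the inverted condition can replace mask; a sketch on the same frame, with D2 as a hypothetical column name:
df['D2'] = (df.groupby(['A1', 'A2'])['B'].rolling(3, min_periods=1).mean()
              .droplevel(['A1', 'A2'])      # drop the group levels so the index aligns with df
              .where(df['B'].notna()))      # keep the mean only where B had an original value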

Apply kmeans on each group in pandas DataFrame and save the clusters in a new column in the same DataFrame

I have a dataframe containing some embeddings in column D. I would like to first group the data by column A and then apply kmeans to each group. Each group might contain NaN values, so in the apply function I take the number of clusters to be the number of non-NaN values in column D divided by 2 (n_clusters = int(not_na_mask.sum()/2)).
In the apply function I return df['cluster'].values.tolist(). I printed these values and they are correct for each group, but after running the whole script df_test['clusters'] contains only NaN in every row.
Sample DataFrame:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df_test = pd.DataFrame({'A': ['aa', 'bb', 'aa', 'bb', 'aa', 'bb', 'aa', 'cc', 'aa', 'aa', 'bb', 'bb', 'bb', 'cc', 'bb', 'aa', 'cc', 'aa'],
                        'B': [1, 2, np.nan, 4, 6, np.nan, 7, 8, np.nan, 1, 4, 3, 4, 7, 5, 7, 9, np.nan],
                        'D': [[2, 0, 1, 5, 4, 0], np.nan, [4, 7, 0, 1, 0, 2], [1., 1, 1, 2, 0, 5], np.nan, [1, 6, 3, 2, 1, 9],
                              [4, 2, 1, 0, 0, 0], [3, 5, 6, 8, 8, 0], np.nan, np.nan, [2, 5, 1, 7, 4, 0], [4, 2, 0, 4, 0, 0],
                              [1., 0, 1, 8, 0, 9], [1, 0, 7, 2, 1, 0], np.nan, [1, 1, 5, 0, 8, 0], [4, 1, 6, 1, 1, 0], np.nan]})
df_test:
A B D
0 aa 1.0 [2, 0, 1, 5, 4, 0]
1 bb 2.0 NaN
2 aa NaN [4, 7, 0, 1, 0, 2]
3 bb 4.0 [1.0, 1, 1, 2, 0, 5]
4 aa 6.0 NaN
5 bb NaN [1, 6, 3, 2, 1, 9]
6 aa 7.0 [4, 2, 1, 0, 0, 0]
7 cc 8.0 [3, 5, 6, 8, 8, 0]
8 aa NaN NaN
9 aa 1.0 NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0]
11 bb 3.0 [4, 2, 0, 4, 0, 0]
12 bb 4.0 [1.0, 0, 1, 8, 0, 9]
13 cc 7.0 [1, 0, 7, 2, 1, 0]
14 bb 5.0 NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0]
16 cc 9.0 [4, 1, 6, 1, 1, 0]
17 aa NaN NaN
My approach for calculating kmeans:
def apply_kmeans_on_each_category(df):
    not_na_mask = df['D'].notna()
    embedding = df[not_na_mask]['D']
    n_clusters = int(not_na_mask.sum()/2)
    if n_clusters > 1:
        df['cluster'] = np.nan
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        df.loc[not_na_mask, 'cluster'] = kmeans.labels_
        return df['cluster'].values.tolist()
    else:
        return [np.nan] * len(df)
df_test['clusters'] = df_test.groupby('A').apply(apply_kmeans_on_each_category)
result:
df_test['clusters']:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
Name: clusters, dtype: object
Made some slight changes. The meat of the change is to use transform instead of apply: apply here returns one list per group, giving a Series indexed by the group keys ('aa', 'bb', 'cc'), so assigning it back to df_test['clusters'] cannot align with the row index and every row becomes NaN, while transform returns a result aligned with the original index. Also, there is no need to pass the entire group DataFrame; you can pass column D directly, as that is the only column you are using:
def apply_kmeans_on_each_category(df):
    # df is the group's 'D' column here (a Series), not the whole group frame
    not_na_mask = df.notna()
    embedding = df.loc[not_na_mask]
    n_clusters = int(not_na_mask.sum()/2)
    op = pd.Series([np.nan] * len(df), index=df.index)
    if n_clusters > 1:
        kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embedding.tolist())
        op.loc[not_na_mask] = kmeans.labels_.tolist()
    return op

df_test['clusters'] = df_test.groupby('A')['D'].transform(apply_kmeans_on_each_category)
Output
A B D clusters
0 aa 1.0 [2, 0, 1, 5, 4, 0] 0.0
1 bb 2.0 NaN NaN
2 aa NaN [4, 7, 0, 1, 0, 2] 1.0
3 bb 4.0 [1.0, 1, 1, 2, 0, 5] 0.0
4 aa 6.0 NaN NaN
5 bb NaN [1, 6, 3, 2, 1, 9] 0.0
6 aa 7.0 [4, 2, 1, 0, 0, 0] 1.0
7 cc 8.0 [3, 5, 6, 8, 8, 0] NaN
8 aa NaN NaN NaN
9 aa 1.0 NaN NaN
10 bb 4.0 [2, 5, 1, 7, 4, 0] 1.0
11 bb 3.0 [4, 2, 0, 4, 0, 0] 1.0
12 bb 4.0 [1.0, 0, 1, 8, 0, 9] 0.0
13 cc 7.0 [1, 0, 7, 2, 1, 0] NaN
14 bb 5.0 NaN NaN
15 aa 7.0 [1, 1, 5, 0, 8, 0] 0.0
16 cc 9.0 [4, 1, 6, 1, 1, 0] NaN
17 aa NaN NaN NaN
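A quick way to see the alignment problem the original apply version ran into: a function returning a plain list makes groupby.apply produce a Series indexed by the group keys rather than by the row labels, so the column assignment has nothing to align with. A minimal check, assuming df_test as defined above:
keyed = df_test.groupby('A')['D'].apply(lambda s: [0] * len(s))  # any per-group list will do
print(keyed.index)  # Index(['aa', 'bb', 'cc'], dtype='object', name='A') - group keys, not 0..17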

Some array indexing in numpy

lookup = np.array([60, 40, 50, 60, 90])
The values in the following arrays are indices into lookup.
a = np.array([1, 2, 0, 4, 3, 2, 4, 2, 0])
b = np.array([0, 1, 2, 3, 3, 4, 1, 2, 1])
c = np.array([4, 2, 1, 4, 4, 0, 4, 4, 2])
For the first column of each array:
array   1st element   lookup value
a       1        -->  40
b       0        -->  60
c       4        -->  90
The maximum lookup value is 90, so the first element of the result is 4 (the value taken from c). Proceeding this way for every column,
expected result = array([4, 2, 0, 4, 4, 4, 4, 4, 0])
How to get it?
I tried:
d = np.vstack([a, b, c])
print(d)
res = lookup[d]
res = np.max(res, axis=0)
print(d[enumerate(lookup)])  # <- this line raises the error below
I got this error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Do you want this:
d = np.vstack([a,b,c])
# option 1
rows = lookup[d].argmax(0)
d[rows, np.arange(d.shape[1])]
# option 2
(lookup[:,None] == lookup[d].max(0)).argmax(0)
Output:
array([4, 2, 0, 4, 4, 4, 4, 4, 0])
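np.take_along_axis can do the same gather as option 1 without building the column-index array; a sketch with the same d and lookup:
rows = lookup[d].argmax(0)                             # winning row (a/b/c) per column
out = np.take_along_axis(d, rows[None, :], axis=0)[0]  # pick that row's entry in each column
print(out)  # array([4, 2, 0, 4, 4, 4, 4, 4, 0])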

How to express int array[6] in YAML

I have a type int arr[6], and the value is {1,2,3,4,5,6}. How should I express this data using YAML?
[1, 2, 3, 4, 5, 6]
or
- 1
- 2
- 3
- 4
- 5
- 6
You should simply use a list:
[1, 2, 3, 4, 5, 6]
It throws IllegalArgumentException in my Spring application when I use [1, 2, 3, 4, 5, 6] in application.yaml. However, it works when I use 1,2,3,4,5,6.
My annotation code is:
@Value("${app.groupIds}")
private int[] groupIds;
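For context (a note on known Spring behavior, not part of the original answer): Spring flattens a YAML list into indexed properties (app.groupIds[0], app.groupIds[1], ...), so the single placeholder ${app.groupIds} cannot be resolved and @Value fails, whereas a comma-separated scalar is one property that Spring's conversion service turns into an int[]. Binding the list form directly requires @ConfigurationProperties instead of @Value. The two shapes in application.yaml:
app:
  groupIds: 1,2,3,4,5,6          # one scalar string - works with @Value
app:
  groupIds: [1, 2, 3, 4, 5, 6]   # a list, flattened to app.groupIds[0..5] - not visible to @Value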