Removing duplicates based on matching column values with boolean indexing - pandas

After merging two DF's I have the following dataset:
DB_ID  x_val  y_val
x01    405    407
x01    405    405
x02    308    306
x02    308    308
x03    658    658
x03    658    660
x04    None   658
x04    None   660
x05    658    660
x06    660    660
The y table contains multiple values for the left-join key (not included in the table), resulting in multiple rows per unique DB_ID (a string variable, not in the df index).
The issue is that only one row is correct, the one where x_val and y_val match. I tried removing the duplicates with the following code:
df= df[~df['DB_ID'].duplicated() | combined['x_val'] != combined['y_val']]
This however doesn't work. I am looking for a solution to achieve the following result:
DB_ID  x_val  y_val
x01    405    405
x02    308    308
x03    658    658
x04    None   658
x05    658    660
x06    660    660

The idea is to compare both columns for inequality, then sort and remove duplicates by DB_ID:
df = (df.assign(new = df['x_val'].ne(df['y_val']))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))
print (df)
DB_ID x_val y_val
1 x01 405 405
3 x02 308 308
4 x03 658 658
6 x04 None 658
8 x05 658 660
9 x06 660 660
If NaNs or Nones need to compare as equal, use:
df = (df.assign(new = df['x_val'].fillna('same').ne(df['y_val'].fillna('same')))
        .sort_values(['DB_ID','new'])
        .drop_duplicates('DB_ID')
        .drop('new', axis=1))
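For reference, False sorts before True, so within each DB_ID group the matching row (where new is False) comes first and drop_duplicates keeps it. A minimal self-contained sketch rebuilding the sample data from the question:
import pandas as pd

# Rebuild the sample data from the question (object dtype keeps the Nones).
df = pd.DataFrame({
    'DB_ID': ['x01','x01','x02','x02','x03','x03','x04','x04','x05','x06'],
    'x_val': pd.Series([405, 405, 308, 308, 658, 658, None, None, 658, 660], dtype=object),
    'y_val': [407, 405, 306, 308, 658, 660, 658, 660, 660, 660],
})

# Flag mismatches, sort so matching rows (new == False) come first per
# DB_ID, then keep the first row of each group.
out = (df.assign(new=df['x_val'].ne(df['y_val']))
         .sort_values(['DB_ID', 'new'])
         .drop_duplicates('DB_ID')
         .drop('new', axis=1))
print(out)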

Maybe, you can simply use:
df = df[df['x_val'] == df['y_val']]
print(df)
# Output
DB_ID x_val y_val
1 x01 405 405
3 x02 308 308
4 x03 658 658
9 x06 660 660
I think you don't need drop_duplicates or duplicated here, but if you want to ensure that only one instance of each DB_ID remains, you can append .drop_duplicates('DB_ID'):
df = df[df['x_val'] == df['y_val']].drop_duplicates('DB_ID')
print(df)
# Output
DB_ID x_val y_val
1 x01 405 405
3 x02 308 308
4 x03 658 658
9 x06 660 660
Note that this keeps only rows where the values actually match, so the x04 and x05 rows from the expected output (which have no matching pair) are dropped.

Related

How to solve an error, "module 'numpy' has no attribute 'float'"?

Environment
WSL2
Docker
Virtualenv
Python 3.8.16
jupyterlab 3.5.2
numpy 1.24.1
prophet 1.1.1
fbprophet 0.7.1
Cython 0.29.33
ipython 8.8.0
pmdarima 2.0.2
plotly 5.11.0
pip 22.3.1
pystan 2.19.1.1
scikit-learn 1.2.0
konlpy 0.6.0 (just in case)
nodejs 0.1.1 (just in case)
pandas 1.5.2 (just in case)
Error
main error message
AttributeError: module 'numpy' has no attribute 'float'
entire error message
INFO:fbprophet:Disabling yearly seasonality. Run prophet with yearly_seasonality=True to override this.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[33], line 4
1 # Load the Prophet() model
2 # and train it with fit.
3 model_revenue = Prophet()
----> 4 model_revenue.fit(revenue_serial)
File /home/.venv/lib/python3.8/site-packages/fbprophet/forecaster.py:1115, in Prophet.fit(self, df, **kwargs)
1112 self.history = history
1113 self.set_auto_seasonalities()
1114 seasonal_features, prior_scales, component_cols, modes = (
-> 1115 self.make_all_seasonality_features(history))
1116 self.train_component_cols = component_cols
1117 self.component_modes = modes
File /home/.venv/lib/python3.8/site-packages/fbprophet/forecaster.py:765, in Prophet.make_all_seasonality_features(self, df)
763 # Seasonality features
764 for name, props in self.seasonalities.items():
--> 765 features = self.make_seasonality_features(
766 df['ds'],
767 props['period'],
768 props['fourier_order'],
769 name,
770 )
771 if props['condition_name'] is not None:
772 features[~df[props['condition_name']]] = 0
File /home/.venv/lib/python3.8/site-packages/fbprophet/forecaster.py:458, in Prophet.make_seasonality_features(cls, dates, period, series_order, prefix)
442 #classmethod
443 def make_seasonality_features(cls, dates, period, series_order, prefix):
444 """Data frame with seasonality features.
445
446 Parameters
(...)
456 pd.DataFrame with seasonality features.
457 """
--> 458 features = cls.fourier_series(dates, period, series_order)
459 columns = [
460 '{}_delim_{}'.format(prefix, i + 1)
461 for i in range(features.shape[1])
462 ]
463 return pd.DataFrame(features, columns=columns)
File /home/.venv/lib/python3.8/site-packages/fbprophet/forecaster.py:434, in Prophet.fourier_series(dates, period, series_order)
417 """Provides Fourier series components with the specified frequency
418 and order.
419
(...)
428 Matrix with seasonality features.
429 """
430 # convert to days since epoch
431 t = np.array(
432 (dates - datetime(1970, 1, 1))
433 .dt.total_seconds()
--> 434 .astype(np.float)
435 ) / (3600 * 24.)
436 return np.column_stack([
437 fun((2.0 * (i + 1) * np.pi * t / period))
438 for i in range(series_order)
439 for fun in (np.sin, np.cos)
440 ])
File /home/.venv/lib/python3.8/site-packages/numpy/__init__.py:284, in __getattr__(attr)
281 from .testing import Tester
282 return Tester
--> 284 raise AttributeError("module {!r} has no attribute "
285 "{!r}".format(__name__, attr))
AttributeError: module 'numpy' has no attribute 'float'
Example of dataset
ds y
0 2022-09-01 13:00:00 762
1 2022-09-01 15:00:00 746
2 2022-09-01 17:00:00 848
3 2022-09-01 19:00:00 866
4 2022-09-01 21:00:00 632
... ... ...
1881 2022-10-31 13:00:00 684
1882 2022-10-31 15:00:00 749
1883 2022-10-31 17:00:00 779
1884 2022-10-31 19:00:00 573
1885 2022-10-31 21:00:00 510
Type of variable
visitors_serial
ds datetime64[ns]
y int64
dtype: object
Short code
...
revenue_serial = pd.DataFrame(pd.to_datetime(df_active_time['START_DATE'], format="%Y%m%d %H:%M:%S"))
revenue_serial['객단가(원)']=df_active_time['객단가(원)']
revenue_serial = revenue_serial.reset_index(drop= True)
revenue_serial = revenue_serial.rename(columns={'START_DATE':'ds', '객단가(원)':'y'})
model_revenue = Prophet()
model_revenue.fit(revenue_serial)
I expected that upgrading the numpy module would solve it, but it didn't.
As you can see in your code, the error is that numpy actually has no attribute float. Your code has t = np.array((dates - datetime(1970, 1, 1)).dt.total_seconds().astype(np.float)); it should be:
t = np.array(
    (dates - datetime(1970, 1, 1))
    .dt.total_seconds()
    .astype(np.float32)
) / (3600 * 24.)
The alias numpy.float was deprecated in NumPy 1.20 and was removed in NumPy 1.24.
You can change it to numpy.float_, numpy.float64, or numpy.double. They all mean the same thing.
For your dependency prophet, the actual issue was already fixed in #1850 (March 2021), and it does appear to be fixed in v1.1.1 so it looks like you're not running the version you think you are.
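If upgrading the library isn't immediately possible, a common stopgap (shown here as a sketch, not an officially supported fix) is to restore the removed alias before calling fit, since np.float was just an alias for the builtin float:
import numpy as np

# Stopgap shim: np.float was an alias for the builtin float (float64).
# Restoring it lets old libraries that still reference np.float run,
# but upgrading the library is the proper fix.
if not hasattr(np, "float"):
    np.float = float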

Loss nan problem when using TFBertForSequenceClassification

I have a problem when training a model for multi-label text classification.
I'm working in Colab as follows:
import tensorflow as tf
from transformers import BertConfig, TFBertForSequenceClassification

def create_sentiment_bert():
    config = BertConfig.from_pretrained("monologg/kobert", num_labels=52)
    model = TFBertForSequenceClassification.from_pretrained("monologg/kobert", config=config, from_pt=True)
    opt = tf.keras.optimizers.Adam(learning_rate=4.0e-6)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")
    model.compile(optimizer=opt, loss=loss, metrics=[metric])
    return model

sentiment_model = create_sentiment_bert()
sentiment_model.fit(train_x, train_y, epochs=2, shuffle=True, batch_size=250, validation_data=(test_x, test_y))
The result is as follows:
Epoch 1/2
739/14065 [>.............................] - ETA: 35:31 - loss: nan - accuracy: 0.0000e+00
I have checked my data: no nan, null, or invalid values.
I tried different optimizers, numbers of epochs, and learning rates, but had the same problem.
The number of labels is 52 and the distribution is as follows:
[Label] [Count]
501 694624
601 651306
401 257665
210 250352
307 170665
301 153318
306 147948
201 141382
302 113917
402 102040
606 101434
506 73492
305 69876
604 62056
403 57956
104 56800
107 55503
607 40293
503 36272
505 34757
303 26884
308 24539
304 22135
205 20744
509 19465
206 16665
508 15334
208 13335
603 13240
504 12299
602 10684
202 10366
209 8267
106 6564
502 5880
211 5804
207 2794
507 1967
108 1860
204 1633
105 1545
109 682
605 426
102 276
101 274
405 268
212 204
213 153
103 103
203 90
404 65
608 37
I'm a beginner in this area. Please help me. Thanks in advance!
Why do you have from_logits=False? The classifier head returns logits, so unless you put a softmax activation within your model, you need to calculate the loss from the logits.
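A minimal sketch of that change (assuming the rest of create_sentiment_bert stays as posted):
# TFBertForSequenceClassification returns raw logits (no softmax head),
# so the loss must apply the softmax itself.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)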

How to plot using timestamp and coordinates?

I have logs of mouse movement, i.e. coordinates and timestamps. I want to plot the mouse movement using this log. How can I do this? I have no idea what API or tool can be used for this; I want to know how to start, if some way exists.
My log is as follows
Date hr:min:sec ms x y
13/6/2020 13:13:33 521 291 283
13/6/2020 13:13:33 638 273 234
13/6/2020 13:13:33 647 272 233
13/6/2020 13:13:33 657 271 231
13/6/2020 13:13:33 667 269 230
13/6/2020 13:13:33 677 268 229
13/6/2020 13:13:33 687 267 228
13/6/2020 13:13:33 697 264 226
You're looking for geom_path() from ggplot2. The geom will connect a line between all your observations based on the order they appear in the dataframe. So, here's some x,y data that's expanded a bit:
df <- data.frame(
  x=c(291,273,272,271,269,268,267,264,262,261,261,265,268,280,290),
  y=c(283,234,233,231,230,229,228,226,230,235,237,248,252,246,235)
)
And some code to make a simple plot using geom_path():
library(ggplot2)
p <- ggplot(df, aes(x=x, y=y)) + theme_classic() +
  geom_path(color='blue') + geom_point()
p
If you want, you can even save that as an animation based on your time points. See the code below using the gganimate package:
library(gganimate)
df$time <- 1:15
a <- p + transition_reveal(time)
animate(a, fps=20)

row data to column data

I am a newbie in python. I have data that looks like this:
ID Annotation X Y
ID_1 first 767 942
ID_1 last 768 943
ID_2 first 769 944
ID_2 last 770 945
I want to make new columns for the first X/Y and last X/Y. My expected result:
ID X_first Y_first X_last Y_last
ID_1 767 942 768 943
ID_2 769 944 770 945
Thank you for your help.
I am using unstack for this pivot problem:
s = df.set_index(['ID','Annotation']).unstack()
s.columns = s.columns.map('_'.join)  # flatten the MultiIndex columns
s.reset_index(inplace=True)
s
Out[353]:
ID X_first X_last Y_first Y_last
0 ID_1 767 768 942 943
1 ID_2 769 770 944 945
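If the exact column order from the expected result matters, the flattened columns can be reordered afterwards (a small follow-up, assuming the column names produced above):
# Reorder to match the expected X_first, Y_first, X_last, Y_last layout.
s = s[['ID', 'X_first', 'Y_first', 'X_last', 'Y_last']]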

How to sort the date index of a pandas dataframe so that newer dates are on one side of the X-axis when plotted on a graph

I have a pandas dataframe with dates as the index. When I plot the values, the index entries (the dates, i.e. the X-axis labels) do not show up in a proper sequence on the X-axis of the plotted graph. For example, instead of all the 2018 dates followed by the 2019 dates (e.g. 2018/02/15, 2018/03/10, 2018/10/12 ... 2019/01/07, 2019/01/10, 2019/03/16 ...), the dates show up on the X-axis in a mismatched order, for example 2019/01/07, 2019/01/10, 2018/02/15, 2018/03/10, 2019/03/16 ..., even though I have applied sorting to the index (i.e. the dates). How do I handle this issue? Thank you in advance.
I tried to sort the index, but this did not work:
DTT_data = miniBid_data.groupby(['Mini_Bid_Date_2'])['New_Cost_Per_Load','Volume'].aggregate([np.mean])
# sort the data
DTT_data.sort_index(inplace=True, ascending=True)
fig, ax = plt.subplots()
color1 = 'tab:red'
DTT_data.plot(kind='line', figsize=(12,8), legend=False, ax=ax, logy=True, marker='*')
ax.set_title('Trends of Selected Variables')
ax.set_ylabel('Log10 of Values', color=color1)
ax.legend(loc='upper left')
ax.set_xlabel('Event Dates')
ax.tick_params(axis='y', labelcolor=color1)
#ax.legend(loc='upper left')
ax1 = ax.twinx()
color2 = 'tab:blue'
DTT_data2 = miniBid_data.groupby(['Mini_Bid_Date_2'])['Carrier_Code'].count()
DTT_data2.plot(kind='bar', figsize=(12,8), legend=False, ax=ax1, color=color2)
DTT_data2.sort_index(inplace=True, ascending=False)
ax1.set_ylabel('Log10 of Values', color=color2)
ax1.set_yscale('log')
ax1.tick_params(axis='y', labelcolor=color2)
ax1.legend(loc='upper right')
fig.autofmt_xdate()
fig.tight_layout()
plt.show()
Sample Data:
a) DTT_data =
Mini_Bid_Date_2 New_Cost_Per_Load Volume
01/07/2019 1604.3570393487105 1.6431478968792401
02/25/2018 1816.1534797297306 2.831081081081081
10/22/2018 1865.5403827160494 2.074074074074074
10/29/2018 1945.3011032028478 1.9023576512455516
01/08/2019 1947.7562972972971 1.162162162162162
02/11/2019 2062.7133737931017 2.3424827586206916
11/05/2018 2095.531836956521 1.7753623188405796
12/08/2018 2155.48935907859 1.437825203252031
02/04/2019 2169.209245791246 2.2669696969696966
02/04/2018 2189.3693333333335 5.0
01/14/2019 2313.3854211711728 1.1587162162162181
01/20/2019 2380.9063928571427 1.0
01/21/2019 2631.0407864661634 1.3657894736842129
12/03/2018 2684.0808513089005 4.402827225130894
02/25/2019 2844.047048492792 1.89397116644823
11/12/2018 3011.510282722513 2.147905759162304
10/08/2018 3042.3035776536312 1.8130726256983247
11/19/2018 3063.736631460676 1.7407865168539327
02/18/2019 3148.531689480355 6.798162230671736
10/01/2018 3248.0486851851842 2.1951388888888905
01/19/2019 3291.1334154589376 1.4626086956521749
10/15/2018 11881.90527833753 1.779911838790932
01/28/2019 13786.149445804196 1.6329195804195813
03/04/2019 14313.741501103752 1.5459455481972018
12/10/2018 100686.89588865546 3.051260504201676
b) DTT_data2 =
Mini_Bid_Date_2 Carrier_Code
12/08/2018 1476
03/04/2019 1359
02/04/2019 1188
10/29/2018 1124
12/03/2018 955
10/08/2018 895
11/19/2018 890
10/15/2018 794
02/18/2019 789
02/25/2019 763
01/07/2019 737
02/11/2019 725
01/21/2019 665
10/01/2018 648
02/25/2018 592
01/28/2019 572
12/10/2018 476
01/14/2019 444
11/12/2018 382
10/22/2018 324
11/05/2018 276
01/19/2019 207
01/20/2019 56
01/08/2019 37
02/04/2018 30
My expectation is to have the dates (the index) show up in sequential order as labels on the X-axis, for example 2018/02/15, 2018/03/10, 2018/10/12 ... 2019/01/07, 2019/01/10, 2019/03/16 ...
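Judging from the sample data, Mini_Bid_Date_2 holds MM/DD/YYYY strings, which sort lexicographically rather than chronologically. A sketch of the usual fix (assuming that format) is to convert the index to real datetimes before sorting and plotting:
import pandas as pd

# String dates like '01/07/2019' sort by month first, so '01/07/2019'
# comes before '02/25/2018'. Converting the index to datetime64 makes
# sort_index produce true chronological order on the X-axis.
DTT_data.index = pd.to_datetime(DTT_data.index, format='%m/%d/%Y')
DTT_data.sort_index(inplace=True)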