This question already has answers here:
Pandas Merging 101
(8 answers)
pandas: fillna with data from another dataframe, based on the same ID
(2 answers)
Closed last year.
I have two datasets:
First Dataset:
Customer_Key Incentive_Amount
3434 32
5635 56
6565 NaN
3453 45
Second Dataset:
Customer_Key Incentive_Amount
3425 87
6565 22
1474 46
9842 29
The first dataset has many rows where the incentive_amount value is NaN but is present in the second dataset. For example, see Customer_Key = 6565: its incentive_amount is missing in dataset_1 but present in dataset_2. So, for all NaN values of incentive_amount in dataset_1, copy the incentive_amount value from dataset_2 based on the matching customer_key.
Pseudocode would be something like:
df_1['incentive_amount'] = np.where(df_1['incentive_amount'] == 'NaN',
                                    (df_1['incentive_amount'].fillna(df_2['incentive_amount'])
                                     if df_1['customer_key'] == df_2['customer_key']),
                                    df_1['incentive_amount'])
There are many ways to do this. Please do some reading on the following (one option is sketched after the list):
combine_first
update
merge
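For instance, here is a minimal sketch using index alignment and fillna, with the column names and values assumed from the sample data in the question:
import numpy as np
import pandas as pd

# Sample data from the question
df_1 = pd.DataFrame({'customer_key': [3434, 5635, 6565, 3453],
                     'incentive_amount': [32, 56, np.nan, 45]})
df_2 = pd.DataFrame({'customer_key': [3425, 6565, 1474, 9842],
                     'incentive_amount': [87, 22, 46, 29]})

# Align both frames on customer_key, then fill the gaps in df_1 from df_2
lookup = df_2.set_index('customer_key')['incentive_amount']
df_1['incentive_amount'] = (df_1.set_index('customer_key')['incentive_amount']
                                .fillna(lookup)
                                .to_numpy())
print(df_1)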
I have a dataframe of surgical activity with admission dates (ADMIDATE) and discharge dates (DISDATE). It is 600k rows by 78 columns but I have filtered it for a particular surgery. I want to calculate the length of stay and add it as a further column.
Usually I use
df["los"] = (df["DISDATE"] - df["ADMIDATE"]).dt.days
I recently had to clean the data and must have done it differently from before, because I am now getting a negative los, e.g.:
DISDATE       ADMIDATE      los
2019-12-24    2019-12-08    -43805
2019-05-15    2019-03-26    50
2019-10-11    2019-10-07    4
2019-06-20    2019-06-16    4
2019-04-11    2019-04-08    3
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 78 columns):
 5   ADMIDATE  5 non-null  datetime64[ns]
 28  DISDATE   5 non-null  datetime64[ns]
I am not sure how to ask the right questions about the problem, or why it only affects some rows. In cleaning the data, some of the DISDATE values had to be populated from another column (also a date column) because they were incomplete, and I wonder if it is these that are negative due to some retention of the original data somehow, even though printing the new DISDATE looks OK.
Your sample works well with the right output (16 days for the first row).
Can you try this and check whether the problem persists:
import io
import pandas as pd

# Round-trip the two date columns through CSV and re-parse them as datetimes
data = df[['DISDATE', 'ADMIDATE']].to_csv()
test = pd.read_csv(io.StringIO(data), index_col=0,
                   parse_dates=['DISDATE', 'ADMIDATE'])
print(test['DISDATE'].sub(test['ADMIDATE']).dt.days)
Output:
0 16
1 50
2 4
3 4
4 3
dtype: int64
Update
To debug your bad dates, try:
df.loc[pd.to_datetime(df['ADMIDATE'], errors='coerce').isna(), 'ADMIDATE']
You should see the rows whose values are not valid dates.
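If that shows badly typed or unparseable values, a hedged fix (assuming the column names from the question and that pandas is in scope) is to coerce both columns to datetime and recompute los:
# Coerce both columns; unparseable entries become NaT instead of corrupting the subtraction
df['ADMIDATE'] = pd.to_datetime(df['ADMIDATE'], errors='coerce')
df['DISDATE'] = pd.to_datetime(df['DISDATE'], errors='coerce')
df['los'] = (df['DISDATE'] - df['ADMIDATE']).dt.days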
Input DataFrame:
   0th col  1st_col  2nd_col
1       23       46        6
2       33       56        3
3      243        2       21
The output DataFrame should be like:
   0th col  1st_col  2nd_col
1        6       23       46
2        3       33       56
3        2       21      243
Each row has to be sorted in ascending or descending order, independent of the columns, meaning values can be interchanged between columns within the same row in order to sort that row, i.e. sorting the rows in this particular manner.
Please help, I am in the middle of something very important.
Convert the DataFrame to a NumPy array and sort it with np.sort along axis=1, then create a new DataFrame with the constructor:
df1 = pd.DataFrame(np.sort(df.to_numpy(), axis=1),
                   index=df.index,
                   columns=df.columns)
print(df1)
0th col 1st_col 2nd_col
1 6 23 46
2 3 33 56
3 2 21 243
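If a descending order per row is wanted instead (this goes slightly beyond the answer above; df, pd and np are assumed from the code above), one option is to reverse the sorted array along axis 1:
# Sort each row ascending, then flip it to get descending order per row
df1_desc = pd.DataFrame(np.sort(df.to_numpy(), axis=1)[:, ::-1],
                        index=df.index,
                        columns=df.columns)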
This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
In the following pandas dataframe, I want the data for ID 1 only and to drop the rest of the data. How can I achieve that?
ID name s1 s2 s3 s4
1 Joe rd fd NaN aa
1 Joe NaN hg kk NaN
2 Ann jg hg zt uz
2 Mya rd fd NaN aa
1 Uri gg r er rt
4 Ron hr t yt rt
All dataframes have a drop method which can do what you want.
For your specific case, if you really want to drop the data instead of filtering it into a different dataframe, the following snippet should work:
df.drop(df[df['ID'] != 1].index, inplace=True)
Use boolean indexing:
df.loc[df['ID'].eq(1)]
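As a quick check, here is a small sketch applying both approaches to the sample data from the question (values copied from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':   [1, 1, 2, 2, 1, 4],
                   'name': ['Joe', 'Joe', 'Ann', 'Mya', 'Uri', 'Ron'],
                   's1':   ['rd', np.nan, 'jg', 'rd', 'gg', 'hr'],
                   's2':   ['fd', 'hg', 'hg', 'fd', 'r', 't'],
                   's3':   [np.nan, 'kk', 'zt', np.nan, 'er', 'yt'],
                   's4':   ['aa', np.nan, 'uz', 'aa', 'rt', 'rt']})

# Keep only the rows where ID equals 1 (boolean indexing)
only_id_1 = df.loc[df['ID'].eq(1)]

# Or drop the other rows in place
df.drop(df[df['ID'] != 1].index, inplace=True)
print(only_id_1)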
There is some seemingly inconsistent behaviour observed when removing duplicates in pandas.
Problem set up: I have a dataframe with three columns and 3330 timeseries observations as shown below:
data.describe()
Mean Buy Sell
count 3330 3330 3330
Checking if the data contains any duplicates shows there are duplicate indices.
data.index.duplicated().any()
True
How many duplicates are in the data?
data.loc[data.index.duplicated()].count()
Mean 38
Buy 38
Sell 38
The duplicates can be visually inspected too:
data[data.index.duplicated()]
Dilemma: Clearly, there are duplicates in the data and it seems there are 38 of them per column. However, when I use the DataFrame's drop_duplicates(), it seems more data is dropped than expected.
data.drop_duplicates().count()
Mean 3241
Buy 3241
Sell 3241
dtype: int64
data.count() - data.drop_duplicates().count()
Mean 89
Buy 89
Sell 89
Any ideas on the cause of this disparity, or the detail I'm missing, would be appreciated. Note: it is possible to have similar data entries, but dates should not be duplicated, hence the reasonable way to clean the data is to remove the duplicate days.
If I understand you correctly, you want to keep only the first occurrence (row / record) where there are duplicates in your index?
This will accomplish that.
import pandas as pd
df = pd.DataFrame({'IDX':[1,2,2,2,3,4,5,5,6],
'Mean':[1,2,3,4,5,6,7,8,9]}).set_index('IDX')
df
Mean
IDX
1 1
2 2
2 3
2 4
3 5
4 6
5 7
5 8
6 9
duplicates = df.index.duplicated()
duplicates
array([False, False, True, True, False, False, False, True, False])
keep = duplicates == False
df.loc[keep,:]
Mean
IDX
1 1
2 2
3 5
4 6
5 7
6 9
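As a follow-up note on the disparity in the question: drop_duplicates() compares the row values and ignores the index, whereas index.duplicated() looks only at the index, so the two counts measure different things. If the goal is to drop rows with a duplicated index, the steps above can also be written as a one-liner:
# Keep the first row for each index value and drop the later duplicates
df_unique = df[~df.index.duplicated(keep='first')]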
I used the pandas groupby method to get the following dataframe. How do I select an entire column from this dataframe, say the column named EventID or Value?
df['Value'] gives the entire DataFrame back, not just the Value column.
Value
Realization Occurrence EventID
1 207 2023378 20
213 2012388 25
291 2012612 28
324 2036783 12
357 2255910 45
399 2166643 64
420 2022922 19
2 207 2010673 56
249 2018319 77
282 2166809 43
df['Value'] is just the Value column. The reason there is so much other data attached is that df['Value'] has a MultiIndex with three levels.
To drop the MultiIndex, you could use
df['Value'].reset_index(drop=True)
or, you could get a NumPy array of the underlying data using
df['Value'].values
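To illustrate, here is a minimal sketch with toy data (column names taken from the question, values invented) showing that selecting 'Value' from the grouped frame returns a Series carrying a three-level MultiIndex:
import pandas as pd

raw = pd.DataFrame({'Realization': [1, 1, 2],
                    'Occurrence':  [207, 213, 207],
                    'EventID':     [2023378, 2012388, 2010673],
                    'Value':       [20, 25, 56]})

# Grouping on three keys produces a frame with a three-level MultiIndex
grouped = raw.groupby(['Realization', 'Occurrence', 'EventID']).sum()

col = grouped['Value']              # a Series, indexed by the three group keys
print(type(col))                    # <class 'pandas.core.series.Series'>
print(col.reset_index(drop=True))   # plain RangeIndex, values only
print(col.values)                   # underlying NumPy array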