How to merge two pandas dataframes/series - pandas

I'm trying to merge multiple pandas data frames into one. I have one main frame with the locations of measurements. The other data frames each contain multiple measurements for one location, like below:
df 1: Location ID | X | Y | Z
1 | 1 | 2 | 3
2 | 3 | 2 | 1
n | ...
df 2: Location ID | Date | Measurement
1 | January 1 12:30 | 1
1 | January 16 12:30 | 4
1 | ...
df 3: Location ID | Date | Measurement
2 | January 1 12:30 | 3
2 | January 16 12:30 | 9
2 | ...
df n: Location ID | Date | Measurement
n | January 1 12:30 | 4
n | January 16 12:30 | 6
n | January 20 11:30 | 7
...
I'm trying to create a data frame like this:
df_final: Location ID | X | Y | Z | January 1 12:30 | January 16 12:30 | January 20 11:30 | ...
1 | 1 | 2 | 3 | 1 | 4 | NaN
2 | 3 | 2 | 1 | 3 | 9 | NaN
n | 2 | 5 | 7 | 4 | 6 | 7
The dates are already datetime objects and the Location ID is the index of both dataframes.
I tried the append, merge, and concat functions, both using two frames directly and after converting a frame to a list with List = frame['measurements'] before adding it.
The problem is that either the rows are added below the first data frame, while the measured values should go into new columns on the existing row (for the corresponding Location ID), or the dates end up as new rows while new columns are created for the Location IDs.
I'm sorry my question layout is not so nice, but I'm new to this forum.

Found it myself.
I used frame.pivot to reshape df 2 to df n and then used concat to add them to the locations df.
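A minimal sketch of that approach (not the exact code used; the names df_locations and measurement_frames are placeholders for df 1 and the list of df 2 .. df n):
import pandas as pd

# Pivot each measurement frame to one row per Location ID and one column per date.
# Location ID is assumed to be the index here, so reset it to a column first.
wide_frames = [
    frame.reset_index().pivot(index='Location ID', columns='Date', values='Measurement')
    for frame in measurement_frames
]

# Stack the pivoted frames, then attach them to the locations frame by the shared
# Location ID index; dates missing for a location end up as NaN, as in df_final.
df_final = pd.concat([df_locations, pd.concat(wide_frames)], axis=1)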

Related

Pandas: Drop duplicates that appear within a time interval

We have a dataframe containing 'ID' and 'DAY' columns, which show when a specific customer made a complaint. We need to drop duplicates from the 'ID' column, but only if the duplicates happened at most 30 days apart. Please see the example below:
Current Dataset:
ID DAY
0 1 22.03.2020
1 1 18.04.2020
2 2 10.05.2020
3 2 13.01.2020
4 3 30.03.2020
5 3 31.03.2020
6 3 24.02.2021
Goal:
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021
Any suggestions? I have tried groupby and then creating a loop to calculate the difference between each combination, but because the dataframe has millions of rows this would take forever...
You can compute the difference between successive dates per group and use it to form a mask to remove days that are less than 30 days apart:
# parse the day-first dates, then flag rows whose gap to the previous
# complaint of the same ID is under 30 days
df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
mask = (df
    .sort_values(by=['ID', 'DAY'])
    .groupby('ID')['DAY']
    .diff().lt('30d')
    .sort_index()
)
df[~mask]
NB: a potential drawback of this approach is that if the customer makes a new complaint within the 30 days, this restarts the 30-day window for the next complaint.
output:
ID DAY
0 1 2020-03-22
2 2 2020-10-05
3 2 2020-01-13
4 3 2020-03-30
6 3 2021-02-24
Thus another approach might be to resample the data per group to 30 days:
(df
 .groupby('ID')
 .resample('30d', on='DAY').first()
 .dropna()
 .convert_dtypes()
 .reset_index(drop=True)
)
output:
ID DAY
0 1 2020-03-22
1 2 2020-01-13
2 2 2020-10-05
3 3 2020-03-30
4 3 2021-02-24
You can try grouping by the ID column and diffing the DAY column within each group:
from datetime import timedelta

df['DAY'] = pd.to_datetime(df['DAY'], dayfirst=True)
m = timedelta(days=30)
out = df.groupby('ID').apply(lambda group: group[~group['DAY'].diff().abs().le(m)]).reset_index(drop=True)
print(out)
ID DAY
0 1 2020-03-22
1 2 2020-05-10
2 2 2020-01-13
3 3 2020-03-30
4 3 2021-02-24
To convert back to the original date format, you can use dt.strftime:
out['DAY'] = out['DAY'].dt.strftime('%d.%m.%Y')
print(out)
ID DAY
0 1 22.03.2020
1 2 10.05.2020
2 2 13.01.2020
3 3 30.03.2020
4 3 24.02.2021

How could I remove duplicates, if duplicates mean records less than 30 days apart?

(using SQL or pandas)
I want to delete records if the date difference between two records is less than 30 days,
but the first record of each ID must be retained.
#example
ROW ID DATE
1 A 2020-01-01 -- first
2 A 2020-01-03
3 A 2020-01-31
4 A 2020-02-05
5 A 2020-02-28
6 A 2020-03-09
7 B 2020-03-06 -- first
8 B 2020-05-07
9 B 2020-06-02
#expected results
ROW ID DATE
1 A 2020-01-01
4 A 2020-02-05
6 A 2020-03-09
7 B 2020-03-06
8 B 2020-05-07
ROW 2,3 are within 30 days from ROW 1
ROW 5 is within 30 days from ROW 4
ROW 9 is within 30 days from ROW 8
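For reference, a minimal construction of this example frame, to run the answers below against (a sketch, with the ROW numbers kept as a plain column):
import pandas as pd

df = pd.DataFrame({
    'ROW': range(1, 10),
    'ID': ['A'] * 6 + ['B'] * 3,
    'DATE': pd.to_datetime([
        '2020-01-01', '2020-01-03', '2020-01-31', '2020-02-05', '2020-02-28',
        '2020-03-09', '2020-03-06', '2020-05-07', '2020-06-02',
    ]),
})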
This task cannot be handled by vectorized methods alone.
The reason is that once a row is recognized as a duplicate, that row "does not count" when you check the following rows.
E.g. after the rows 2020-01-03 and 2020-01-31 have been deleted (as "too close" to the previous row), the 2020-02-05 row should be kept, because the distance to the previous remaining row (2020-01-01) is now big enough.
So I came up with a solution based on a "function with memory":
def isDupl(elem):
    if isDupl.prev is None:
        isDupl.prev = elem
        return False
    dDiff = (elem - isDupl.prev).days
    rv = dDiff <= 30
    if not rv:
        isDupl.prev = elem
    return rv
This function should be invoked for each DATE in the current group (with the same ID), but before that, isDupl.prev must be set to None.
So the function to apply to each group of rows is:
def isDuplGrp(grp):
    isDupl.prev = None
    return grp.DATE.apply(isDupl)
And to get the expected result, run:
df[~(df.groupby('ID').apply(isDuplGrp).reset_index(level=0, drop=True))]
(you may save it back to df).
The result is:
ROW ID DATE
0 1 A 2020-01-01
3 4 A 2020-02-05
5 6 A 2020-03-09
6 7 B 2020-03-06
7 8 B 2020-05-07
And finally, a remark about the other solution:
It contains rows:
3 4 A 2020-02-05
4 5 A 2020-02-28
which are only 23 days apart, so this solution is wrong.
The same pertains to rows:
5 A 2020-02-28
6 A 2020-03-09
which are also too close in time.
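A quick way to check a candidate result for this (a sketch, assuming the filtered frame is called df1 and DATE is already parsed): compute the gap between consecutive kept dates per ID and look for anything at 30 days or less.
# sanity check: consecutive kept dates per ID should be more than 30 days apart
gaps = df1.sort_values(['ID', 'DATE']).groupby('ID')['DATE'].diff()
print(gaps[gaps <= pd.Timedelta(days=30)])   # anything printed here breaks the rule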
You can try this:
Convert DATE to datetime64
Get the first date from each group: df.groupby('ID')['DATE'].transform('first')
Add a filter to keep only dates at least 30 days after that first date
Append the first date of each group back to the dataframe
Code:
df['DATE'] = pd.to_datetime(df['DATE'])
df1 = df[(df['DATE'] - df.groupby('ID')['DATE'].transform('first')) >= pd.Timedelta(30, unit='D')]
# DataFrame.append was removed in recent pandas; pd.concat does the same job here
df1 = pd.concat([df1, df.groupby('ID', as_index=False).agg('first')]).sort_values(by=['ID', 'DATE'])
print(df1)
ROW ID DATE
0 1 A 2020-01-01
2 3 A 2020-01-31
3 4 A 2020-02-05
4 5 A 2020-02-28
5 6 A 2020-03-09
1 7 B 2020-03-06
7 8 B 2020-05-07
8 9 B 2020-06-02

Python/Pandas: Transformation of column within a list of columns

I'd like to select a subset of columns from a DataFrame while applying a transformation to some of those columns at the same time. Is it possible to transform a column when that column is selected as one in a list of columns?
For example, I have a column StartDate that is of type np.datetime64 that I'd like to extract the month from.
When dealing with that Series on its own, I'd do something like
print(df['StartDate'].transform(lambda x: x.month))
to see the transformed data. Can I accomplish the same thing when the above expression is part of a list of columns? Something like:
print(df[['ColumnA', 'ColumnB', 'StartDate'.transform(lambda x: x.month)]])
Of course the above gives the error
AttributeError: 'str' object has no attribute 'month'
So, if my data looks like:
Metadata | Metadata | 2020-01-01
Metadata | Metadata | 2020-02-06
Metadata | Metadata | 2020-02-25
I'd like to see:
Metadata | Metadata | 1
Metadata | Metadata | 2
Metadata | Metadata | 2
Without appending a new separate "Month" column to the DataFrame. Is this possible?
If you have some data like below:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.randint(10, size=366),
                   'col2': np.random.randint(10, size=366),
                   'StartDate': pd.date_range('2018', '2019')})
which looks like
col1 col2 StartDate
0 0 2 2018-01-01
1 8 0 2018-01-02
2 0 5 2018-01-03
3 3 4 2018-01-04
4 8 6 2018-01-05
... ... ... ...
361 8 8 2018-12-28
362 9 9 2018-12-29
363 4 1 2018-12-30
364 2 4 2018-12-31
365 0 9 2019-01-01
You could redefine the column, or you could use assign to create a temporary view, like:
df.assign(StartDate = df['StartDate'].dt.month)
which outputs:
col1 col2 StartDate
0 0 2 1
1 8 0 1
2 0 5 1
3 3 4 1
4 8 6 1
... ... ... ...
361 8 8 12
362 9 9 12
363 4 1 12
364 2 4 12
365 0 9 1
This also doesn't change the original dataframe. If you want to create a permanent version, then just reassign.
df = df.assign(StartDate = df['StartDate'].dt.month)
You could also take this further, such as.
df.assign(StartDate = df['StartDate'].dt.month, col1 = df['col1'] + 100)[['col1', 'StartDate']]
You can apply whatever transform you need and then access any columns you want after assigning these transforms.
col1 StartDate
0 105 1
1 109 1
2 108 1
3 101 1
4 108 1
... ... ...
361 104 12
362 102 12
363 109 12
364 102 12
365 100 1
I guess you could use the name attribute of the Series.
Something like:
dt_to_month = lambda x: [d.month for d in x] if x.name == 'StartDate' else x
df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month)
will do the trick.
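A quick check of that approach on data shaped like the question's example (a sketch; the column names and values are made up to match the sample above):
import pandas as pd

df = pd.DataFrame({
    'ColumnA': ['Metadata'] * 3,
    'ColumnB': ['Metadata'] * 3,
    'StartDate': pd.to_datetime(['2020-01-01', '2020-02-06', '2020-02-25']),
})

dt_to_month = lambda x: [d.month for d in x] if x.name == 'StartDate' else x
print(df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month))
This leaves ColumnA and ColumnB untouched and replaces StartDate with the month numbers 1, 2, 2, without adding a separate Month column.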

Selecting all the previous 6 months data records from occurrence of a particular value in a column in pandas

I want to select all the previous 6 months records for a customer whenever a particular transaction is done by the customer.
Data looks like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
1 01/01/2017 8 Y
2 10/01/2018 6 Moved
2 02/01/2018 12 Z
Here, I want to look for the Description "Moved" and then select all records from the previous 6 months for every Cust_ID.
Output should look like:
Cust_ID Transaction_Date Amount Description
1 08/01/2017 12 Moved
1 03/01/2017 15 X
2 10/01/2018 6 Moved
I want to do this in Python. Please help.
The idea is to create a Series of the Moved datetimes shifted back by a 6-month offset, and then filter by mapping these offsets onto the rows and keeping dates greater than them:
EDIT: Get all datetimes in the 6 months before each Moved value:
#convert to datetimes and sort so the dates are in order within each Cust_ID
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
df = df.sort_values(['Cust_ID','Transaction_Date'])
#group label: count Moved occurrences from the end, so the rows leading up to each Moved share its group
df['g'] = df['Description'].iloc[::-1].eq('Moved').cumsum()
#Moved date per (Cust_ID, g), shifted back by 6 months
s = (df[df['Description'].eq('Moved')]
       .set_index(['Cust_ID','g'])['Transaction_Date'] - pd.offsets.MonthOffset(6))
#keep rows whose date falls after that threshold
mask = df.join(s.rename('a'), on=['Cust_ID','g'])['a'] < df['Transaction_Date']
df1 = df[mask].drop('g', axis=1)
EDIT1: Use only the first (earliest) Moved datetime per group; any other Moved rows per group are removed:
print (df)
Cust_ID Transaction_Date Amount Description
0 1 10/01/2017 12 X
1 1 01/23/2017 15 Moved
2 1 03/01/2017 8 Y
3 1 08/08/2017 12 Moved
4 2 10/01/2018 6 Moved
5 2 02/01/2018 12 Z
#convert to datetimes
df['Transaction_Date'] = pd.to_datetime(df['Transaction_Date'])
#mask for filter Moved rows
mask = df['Description'].eq('Moved')
#filter and sorting this rows
df1 = df[mask].sort_values(['Cust_ID','Transaction_Date'])
print (df1)
Cust_ID Transaction_Date Amount Description
1 1 2017-01-23 15 Moved
3 1 2017-08-08 12 Moved
4 2 2018-10-01 6 Moved
#get duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date'] - pd.offsets.MonthOffset(6)
print (s)
Cust_ID
1 2016-07-23
2 2018-04-01
Name: Transaction_Date, dtype: datetime64[ns]
#create mask for filter out another Moved (get only first for each group)
m2 = ~mask.reindex(df.index, fill_value=False)
df1 = df[(df['Cust_ID'].map(s) < df['Transaction_Date']) & m2]
print (df1)
Cust_ID Transaction_Date Amount Description
0 1 2017-10-01 12 X
1 1 2017-01-23 15 Moved
2 1 2017-03-01 8 Y
4 2 2018-10-01 6 Moved
EDIT2: Keep the rows between the last Moved per Cust_ID and the next 6 months after it:
#get last duplicated filtered rows in df1
mask = df1.duplicated('Cust_ID', keep='last')
#create Series for map
s = df1[~mask].set_index('Cust_ID')['Transaction_Date']
print (s)
Cust_ID
1 2017-08-08
2 2018-10-01
Name: Transaction_Date, dtype: datetime64[ns]
m2 = ~mask.reindex(df.index, fill_value=False)
#filter by between Moved and next 6 months
df3 = df[df['Transaction_Date'].between(df['Cust_ID'].map(s), df['Cust_ID'].map(s + pd.offsets.MonthOffset(6))) & m2]
print (df3)
Cust_ID Transaction_Date Amount Description
3 1 2017-08-08 12 Moved
0 1 2017-10-01 12 X
4 2 2018-10-01 6 Moved
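For the simpler case in the original question, where each Cust_ID has a single Moved transaction, a more compact alternative (a sketch, not part of the answer above) is to merge the Moved date back onto the rows and keep a 6-month window ending at it:
#assumes Transaction_Date is already converted with pd.to_datetime
moved = (df.loc[df['Description'].eq('Moved'), ['Cust_ID', 'Transaction_Date']]
           .rename(columns={'Transaction_Date': 'Moved_Date'}))
out = df.merge(moved, on='Cust_ID')
#keep rows within the 6 months up to and including the Moved transaction
out = out[(out['Transaction_Date'] <= out['Moved_Date']) &
          (out['Transaction_Date'] > out['Moved_Date'] - pd.DateOffset(months=6))]
out = out.drop(columns='Moved_Date')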

How may I copy the same rows for each value in one column?

How may I copy rows for each value in one column
Let's say that we have a column like:
---------------------
record time id
---------------------
1 12:00 [1,2,3]
2 12:01 [4,5,6,7]
3 12:07 [8,9]
And I would like to get result like:
---------------------
record time id
---------------------
1 12:00 1
2 12:00 2
3 12:00 3
4 12:01 4
5 12:01 5
...
9 12:07 9
I need to do this in PostgreSQL or R.
One option is separate_rows, if the 'id' is a string:
library(tidyverse)
df1 %>%
  separate_rows(id) %>%
  filter(id != "") %>%
  mutate(record = row_number())
# record time id
#1 1 12:00 1
#2 2 12:00 2
#3 3 12:00 3
#4 4 12:01 4
#5 5 12:01 5
#6 6 12:01 6
#7 7 12:01 7
#8 8 12:07 8
#9 9 12:07 9
If the 'id' is a list:
df1 %>%
  unnest
data
df1 <- structure(list(record = 1:3, time = c("12:00", "12:01", "12:07"),
  id = c("[1,2,3]", "[4,5,6,7]", "[8,9]")), class = "data.frame",
  row.names = c(NA, -3L))
You could do it in PostgreSQL like so:
SELECT DISTINCT record, time, unnest(translate(id, '[]', '{}')::int[]) AS ids
FROM tbl
ORDER BY record, time, ids;
Basically, you create an array out of your text field, and then use unnest to get your desired result.
record | time | ids
--------+----------+-----
1 | 12:00:00 | 1
1 | 12:00:00 | 2
1 | 12:00:00 | 3
2 | 12:01:00 | 4
2 | 12:01:00 | 5
2 | 12:01:00 | 6
2 | 12:01:00 | 7
3 | 12:07:00 | 8
3 | 12:07:00 | 9
In Postgres, you would just use unnest():
select t.record, t.time, unnest(t.id) as id
from t;
This assumes that your column is actually stored as an array in Postgres. If it is a string, you can do something similar, but it requires more string manipulation.
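The question asks for PostgreSQL or R, but since this page collects pandas questions, the pandas equivalent for comparison is DataFrame.explode (a sketch, assuming id already holds Python lists rather than strings):
import pandas as pd

df1 = pd.DataFrame({'record': [1, 2, 3],
                    'time': ['12:00', '12:01', '12:07'],
                    'id': [[1, 2, 3], [4, 5, 6, 7], [8, 9]]})
out = df1.explode('id').reset_index(drop=True)
out['record'] = out.index + 1   # renumber records as in the desired output
print(out)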