Duplicated IDs pandas - pandas

I have the following dataframes (df1,df2):
df1:
ID Q1
111 2
111 3
112 1
df2:
ID Q2
111 1
111 5
112 7
Since the IDs are duplicated, I want to reinitialize them, using the following code:
df1.sort_values('ID',inplace=True)
df1['ID_new'] = range(len(df1))
df2.sort_values('ID',inplace=True)
df2['ID_new'] = range(len(df2))
in order to get something like this:
df1:
ID_new ID Q1
0 111 2
1 111 3
2 112 1
df2:
ID_new ID Q2
0 111 1
1 111 5
2 112 7
The question is: can we be sure that ID_new will be the same for df1 and df2?
For example, is it possible that ID_new = 1 corresponds to the first ID = 111 row in df1 but to the second ID = 111 row in df2?
If so, is there a more robust way to reinitialize the IDs?
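One possible approach, sketched under assumptions: as far as I know, sort_values defaults to kind='quicksort', which does not guarantee that rows with equal IDs keep their original relative order. Asking for a stable sort removes that ambiguity, so the n-th row of a given ID in df1 lines up with the n-th row of that ID in df2, provided both frames list the duplicates in the same original order. The helper name add_id_new and the occurrence column are made up for this illustration.

```python
import pandas as pd

# Sample frames in the same (unsorted) row order; values taken from the question.
df1 = pd.DataFrame({"ID": [111, 112, 111], "Q1": [2, 1, 3]})
df2 = pd.DataFrame({"ID": [111, 112, 111], "Q2": [1, 7, 5]})


def add_id_new(df):
    """Hypothetical helper: stable sort by ID, then number the rows."""
    # kind="stable" keeps rows with equal IDs in their original relative order.
    out = df.sort_values("ID", kind="stable").reset_index(drop=True)
    # Position of each row within its ID group (0 for the first 111, 1 for the second, ...).
    out["occurrence"] = out.groupby("ID").cumcount()
    out["ID_new"] = range(len(out))
    return out


df1_new = add_id_new(df1)
df2_new = add_id_new(df2)
print(df1_new)
print(df2_new)
```

The (ID, occurrence) pair can also serve as a merge key if the two frames ever need to be joined row by row.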

Related

Pandas split dataframe by sessions

I've got the next DataFrame:
id sec
1 45
2 1
3 176
1 19
1 876
3 123
I want to split it into groups by id and by session, or create a separate dataframe for each session. That is, I want the sessions of each id (a new session starts when more than 30 seconds have passed between user actions).
For example:
sessions for id 1: [45, 19], [876]
I tried groupby and cat, but I have no idea how to implement this.
To identify the session you can use:
df['session'] = (df.sort_values(by=['id', 'sec'])
                   .groupby('id')['sec']
                   # transform keeps the original index, so the result aligns with df
                   .transform(lambda s: s.diff().gt(30).cumsum().add(1))
                )
Output:
id sec session
0 1 45 1
1 2 1 1
2 3 176 2
3 1 19 1
4 1 876 2
5 3 123 1
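To go from the session column to the per-session lists or separate dataframes the question asks for, something along these lines might work. It is only a sketch built on the sample data above; the variable names session_lists and session_frames are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3, 1, 1, 3],
                   "sec": [45, 1, 176, 19, 876, 123]})

# Label sessions: a gap of more than 30 seconds within an id starts a new one.
ordered = df.sort_values(["id", "sec"])
df["session"] = (ordered.groupby("id")["sec"]
                        .transform(lambda s: s.diff().gt(30).cumsum().add(1)))

# One list of timestamps per (id, session), in ascending order of sec.
session_lists = (df.sort_values(["id", "sec"])
                   .groupby(["id", "session"])["sec"]
                   .agg(list))
print(session_lists)

# One small DataFrame per (id, session), keyed by the (id, session) tuple.
session_frames = {key: group for key, group in df.groupby(["id", "session"])}
print(session_frames[(1, 2)])
```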

Python/Pandas: Transformation of column within a list of columns

I'd like to select a subset of columns from a DataFrame while applying a transformation to some of those columns at the same time. Is it possible to transform a column when that column is selected as one in a list of columns?
For example, I have a column StartDate of type np.datetime64 that I'd like to extract the month from.
When dealing with that Series on its own, I'd do something like
print(df['StartDate'].transform(lambda x: x.month))
to see the transformed data. Can I accomplish the same thing when the above expression is part of a list of columns? Something like:
print(df[['ColumnA', 'ColumnB', 'StartDate'.transform(lambda x: x.month)]])
Of course the above gives the error
AttributeError: 'str' object has no attribute 'month'
So, if my data looks like:
Metadata | Metadata | 2020-01-01
Metadata | Metadata | 2020-02-06
Metadata | Metadata | 2020-02-25
I'd like to see:
Metadata | Metadata | 1
Metadata | Metadata | 2
Metadata | Metadata | 2
Without appending a new separate "Month" column to the DataFrame. Is this possible?
If you have some data like below
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'col1': np.random.randint(10, size=366),
    'col2': np.random.randint(10, size=366),
    'StartDate': pd.date_range('2018', '2019'),  # 366 daily timestamps
})
which looks like
col1 col2 StartDate
0 0 2 2018-01-01
1 8 0 2018-01-02
2 0 5 2018-01-03
3 3 4 2018-01-04
4 8 6 2018-01-05
... ... ... ...
361 8 8 2018-12-28
362 9 9 2018-12-29
363 4 1 2018-12-30
364 2 4 2018-12-31
365 0 9 2019-01-01
You could redefine the column, or you could use assign, which returns a new DataFrame with the transformed column:
df.assign(StartDate = df['StartDate'].dt.month)
which outputs:
col1 col2 StartDate
0 0 2 1
1 8 0 1
2 0 5 1
3 3 4 1
4 8 6 1
... ... ... ...
361 8 8 12
362 9 9 12
363 4 1 12
364 2 4 12
365 0 9 1
This doesn't change the original dataframe. If you want to keep the result, just reassign it:
df = df.assign(StartDate = df['StartDate'].dt.month)
You could also take this further, for example:
df.assign(StartDate = df['StartDate'].dt.month, col1 = df['col1'] + 100)[['col1', 'StartDate']]
You can apply whatever transforms you need and then select any columns you want after assigning them:
col1 StartDate
0 105 1
1 109 1
2 108 1
3 101 1
4 108 1
... ... ...
361 104 12
362 102 12
363 109 12
364 102 12
365 100 1
I guess you could use the name attribute of the Series, since apply passes each column to the function as a Series.
Something like:
dt_to_month = lambda x: x.dt.month if x.name == 'StartDate' else x
df[['ColumnA', 'ColumnB', 'StartDate']].apply(dt_to_month)
should do the trick.
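For the column names used in the question, the assign-then-select idea could look roughly like the sketch below. The sample values and the extra column Other are invented for the illustration; passing a lambda to assign is just one way to avoid repeating the dataframe name.

```python
import pandas as pd

# Made-up sample data shaped like the example in the question.
df = pd.DataFrame({
    "ColumnA": ["Metadata"] * 3,
    "ColumnB": ["Metadata"] * 3,
    "StartDate": pd.to_datetime(["2020-01-01", "2020-02-06", "2020-02-25"]),
    "Other": [1, 2, 3],
})

# assign returns a new DataFrame; the original StartDate column is untouched.
subset = df.assign(StartDate=lambda d: d["StartDate"].dt.month)[
    ["ColumnA", "ColumnB", "StartDate"]
]
print(subset)
```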

How to assign the multiple values of an output to new multiple columns of a dataframe?

I have the following function:
def sum(x):
    oneS = x.iloc[0:len(x)//10].agg('sum')
    twoS = x.iloc[len(x)//10:2*len(x)//10].agg('sum')
    threeS = x.iloc[2*len(x)//10:3*len(x)//10].agg('sum')
    fourS = x.iloc[3*len(x)//10:4*len(x)//10].agg('sum')
    fiveS = x.iloc[4*len(x)//10:5*len(x)//10].agg('sum')
    sixS = x.iloc[5*len(x)//10:6*len(x)//10].agg('sum')
    sevenS = x.iloc[6*len(x)//10:7*len(x)//10].agg('sum')
    eightS = x.iloc[7*len(x)//10:8*len(x)//10].agg('sum')
    nineS = x.iloc[8*len(x)//10:9*len(x)//10].agg('sum')
    # the last slice must run to len(x); ending at len(x)//10 would leave it empty
    tenS = x.iloc[9*len(x)//10:len(x)].agg('sum')
    return [oneS,twoS,threeS,fourS,fiveS,sixS,sevenS,eightS,nineS,tenS]
How can I assign the outputs of this function to columns of a dataframe that already exists?
The dataframe I am applying the function to is shown below:
Cycle Type Time
1 1 101
1 1 102
1 1 103
1 1 104
1 1 105
1 1 106
9 1 101
9 1 102
9 1 103
9 1 104
9 1 105
9 1 106
The dataframe I want to add the columns to looks something like the one below. The new columns OneS, TwoS, ThreeS, ... should be added as shown and filled with the results of the function.
Cycle Type OneS TwoS ThreeS
1 1
9 1
8 1
10 1
3 1
5 2
6 2
7 2
If I write a function for just one value and apply it like the following, it is possible:
grouped_data['fm']= data_train_bel1800.groupby(['Cycle', 'Type'])['Time'].apply( lambda x: fm(x))
But I want to do it all at once so that it is neat and clear.
You can use:
def f(x):
    out = []
    for i in range(10):
        out.append(x.iloc[i*len(x)//10:(i+1)*len(x)//10].agg('sum'))
    return pd.Series(out)
df1 = (data_train_bel1800.groupby(['Cycle', 'Type'])['Time']
       .apply(f)
       .unstack()
       .add_prefix('new_')
       .reset_index())
print(df1)
Cycle Type new_0 new_1 new_2 new_3 new_4 new_5 new_6 new_7 new_8 \
0 1 1 0 101 102 205 207 209 315 211 211
1 9 1 0 101 102 205 207 209 315 211 211
new_9
0 106
1 106
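To attach these sums to a dataframe that already exists, a merge on the group keys is one option. The sketch below rebuilds the sums with a plain loop over the groups instead of apply/unstack, purely to keep the illustration explicit; the sample data comes from the question, while decile_sums, sums, and the contents of grouped_data are assumptions made for the example.

```python
import pandas as pd

# Sample data from the question: two cycles, six Time values each.
data_train_bel1800 = pd.DataFrame({
    "Cycle": [1] * 6 + [9] * 6,
    "Type": [1] * 12,
    "Time": [101, 102, 103, 104, 105, 106] * 2,
})

# Hypothetical existing frame that should receive the new columns.
grouped_data = pd.DataFrame({"Cycle": [1, 9, 8], "Type": [1, 1, 1]})


def decile_sums(x):
    """Sum each tenth of x, using the same integer slice bounds as f above."""
    n = len(x)
    return {f"new_{i}": x.iloc[i * n // 10:(i + 1) * n // 10].sum()
            for i in range(10)}


# One row of sums per (Cycle, Type) group.
rows = []
for (cycle, type_), times in data_train_bel1800.groupby(["Cycle", "Type"])["Time"]:
    rows.append({"Cycle": cycle, "Type": type_, **decile_sums(times)})
sums = pd.DataFrame(rows)

# A left merge keeps every row of grouped_data; groups without data get NaN.
result = grouped_data.merge(sums, on=["Cycle", "Type"], how="left")
print(result)
```

With only six rows per group, several of the ten slices are empty and sum to 0, so real data with more rows per group will look less sparse.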

Generate multiple rows from row with bitmask

Say we have a table with 3 columns: key, value, and bitmask (a varchar of unknown maximum length):
abc | 23 | 101
xyz | 56 | 000101
Is it possible to write a query whose output has one row for every combination of key, value, and each 1 in the bitmask, with the index of that 1 as an integer column (it doesn't matter whether indexing starts from 0 or 1)? So for the example above:
abc | 23 | 1
abc | 23 | 3
xyz | 56 | 4
xyz | 56 | 6
Thanks for any ideas!
I think you might be better off choosing a maximum length for your varchar.
SELECT *
FROM table
INNER JOIN generate_series(1, 1000) s(n)
    ON s.n <= char_length(bitmask)
   AND substring(bitmask from s.n for 1) = '1'
We generate a list of numbers:
s.n
---
1
2
3
4
...
And join it to the table in a way that causes repeated table rows:
s.n bitmask
--- -------
1 000101
2 000101
3 000101
4 000101
5 000101
6 000101
1 101
2 101
3 101
Then use s.n to take a one-character substring of the bitmask, and check whether it equals '1':
s.n bitmask substr
--- ------- ------
1 000101 --substring('000101' from 1 for 1) = '1'? no
2 000101 --substring('000101' from 2 for 1) = '1'? no
3 000101 --substring('000101' from 3 for 1) = '1'? no
4 000101 --substring('000101' from 4 for 1) = '1'? yes...
5 000101
6 000101
1 101
2 101
3 101
So s.n gives us the number in the last column of your desired output, and the join condition keeps only the rows where the substring equals '1'.
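The same logic can be checked outside the database. The short Python sketch below is not part of the SQL answer; it only mirrors the join (pair every row with positions 1..length, keep the positions holding a '1') on the two sample rows, and the name expand_bitmask is made up.

```python
# Sample rows from the question: (key, value, bitmask).
rows = [("abc", 23, "101"), ("xyz", 56, "000101")]


def expand_bitmask(rows):
    """Return one (key, value, position) tuple per '1' in each bitmask."""
    out = []
    for key, value, bitmask in rows:
        # Positions are 1-based, matching generate_series(1, ...) in the query.
        for position, bit in enumerate(bitmask, start=1):
            if bit == "1":
                out.append((key, value, position))
    return out


expanded = expand_bitmask(rows)
for row in expanded:
    print(row)
```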

pandas pivoting start and end

I need help with pivoting my df to get the start and end day.
Id Day Value
111 6 a
111 5 a
111 4 a
111 2 a
111 1 a
222 3 a
222 2 a
222 1 a
333 1 a
The desired result would be:
Id StartDay EndDay
111 4 6
111 1 2 (since 111 skips day 3)
222 1 3
333 1 1
Thanks a bunch!
So, my first thought was just:
df.groupby('Id').Day.agg(['min','max'])
But then I noticed your stipulation "(since 111 skips day 3)", which means we have to make an identifier which tells us if the current row is in the same 'block' as the previous (same Id, contiguous Day). So, we sort:
df.sort_values(['Id','Day'], inplace=True)
Then define the block:
df['block'] = ((df.Day!=(df.shift(1).Day+1).fillna(0).astype(int))).astype(int).cumsum()
(adapted from the top answer to this question: Finding consecutive segments in a pandas data frame)
Then group by Id and block:
df.groupby(['Id','block']).Day.agg(['min','max'])
Giving:
Id block min max
111 1 1 2
111 2 4 6
222 3 1 3
333 4 1 1
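A self-contained variant that also produces the StartDay/EndDay column names from the question might look like this. It is a sketch rather than the answer's exact code: it computes the day-to-day difference within each Id (so a block can never span two Ids) and uses named aggregation for the output columns.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [111] * 5 + [222] * 3 + [333],
    "Day": [6, 5, 4, 2, 1, 3, 2, 1, 1],
    "Value": ["a"] * 9,
})

df = df.sort_values(["Id", "Day"])

# A new block starts wherever the gap to the previous day of the same Id is not 1.
# The first row of each Id has a NaN difference, which also counts as a new block.
new_block = df.groupby("Id")["Day"].diff().ne(1)
df["block"] = new_block.cumsum()

result = (df.groupby(["Id", "block"])["Day"]
            .agg(StartDay="min", EndDay="max")
            .reset_index()
            .drop(columns="block"))
print(result)
```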