Add 2nd column index to pandas dataframe using another dataframe column

I have these 2 dataframes
dfmonthtradedays
Out[57]:
Instrument AAPL.O AMZN.O FB.OQ GOOG.OQ GOOGL.OQ BHP.AX JPM.N MSFT.O \
Date
2016-04-30 21 21 21 21 21 21 21 21
2016-05-31 21 21 21 21 21 21 21 21
2016-06-30 22 22 22 22 22 22 22 22
and
rics
Out[60]:
        0   1
0  AAPL.O  US
1  MSFT.O  US
2  AMZN.O  US
3  BHP.AX  AU
I am trying to add a second column index (a MultiIndex level) to dfmonthtradedays using column 1 of rics, so that the AAPL.O column would also have 'US' as a column index, BHP.AX would have 'AU', and so on. I am new to Python and programming but have tried for some time to get this working without luck.
I have tried:
dfmonthtradedays.columns = pd.MultiIndex.from_arrays(dfmonthtradedays.columns, rics[1].tolist())
(The number of columns in dfmonthtradedays equals the number of rows in rics.)
Regards
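pd.MultiIndex.from_arrays expects a single list of array-likes, and the two levels have to line up positionally. A minimal sketch, assuming every ticker in dfmonthtradedays.columns appears in column 0 of rics (the 'Country' level name is just illustrative):
import pandas as pd

# Map each ticker to its country, then build both levels in column order.
country = dict(zip(rics[0], rics[1]))   # e.g. {'AAPL.O': 'US', 'BHP.AX': 'AU', ...}
dfmonthtradedays.columns = pd.MultiIndex.from_arrays(
    [dfmonthtradedays.columns,
     [country[c] for c in dfmonthtradedays.columns]],
    names=['Instrument', 'Country'])
Going through a dict keeps the country codes aligned with the existing column order even if rics is sorted differently.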


Print Pandas Unique Rows by Column Condition

I am trying to print the rows where a data condition is met in a pandas DF, based on the unique values in the DF. For example, I have data that looks like this:
DF:
site temp month day
A 15 7 18
A 11 6 12
A 22 9 3
B 9 4 23
B 3 2 11
B -1 5 18
I need the result to print the row where the max in the 'temp' column occurs for each site, such as this for the final result:
A 22
B 9
I have tried this but it is not working correctly:
for i in DF['site'].unique():
    print(DF.temp.max())
I get the same answer of:
22
22
but the answer should be:
site temp month day
A 22 9 3
B 9 4 23
thank you!
A possible solution:
df.groupby('site', as_index=False)['temp'].max()
Output:
site temp
0 A 22
1 B 9
In case you want to use a for loop:
for i in df['site'].unique():
    print(df.loc[df['site'].eq(i), 'temp'].max())
Output:
22
9
df.groupby('site').max()
output:
temp month day
site
A 22 9 18
B 9 5 23
Another option is sort_values + drop_duplicates:
df = df.sort_values('temp', ascending=False).drop_duplicates('site')
Out[190]:
site temp month day
2 A 22 9 3
3 B 9 4 23
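If you specifically need the full rows (month and day included), one more option is idxmax; a short sketch, assuming df holds the example data above:
# Index labels of the max-temp row per site, then pull those full rows.
df.loc[df.groupby('site')['temp'].idxmax()]
#   site  temp  month  day
# 2    A    22      9    3
# 3    B     9      4   23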

How to convert wide dataframe to long based on similar column

I have a pandas dataframe in wide format and I want to convert it to long format, so that the country and gender become row values while the occupations remain as columns. I am not sure how to use the pd.wide_to_long function here.
Below is the dataset for creating the dataframe:
Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34
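For reference, one way to build df from the data above (a sketch; the StringIO round-trip and the skipinitialspace flag are just one choice, and the column names keep their stray spaces, which the answers below deal with):
import io
import pandas as pd

raw = """Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34"""
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)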
Convert the Date column to the index. For all other columns, remove possible trailing spaces with str.strip, then replace the remaining spaces with : and split on one or more : into a MultiIndex. That makes it possible to reshape with DataFrame.stack, set the new index names with rename_axis, and turn them into columns with reset_index:
df1 = df.set_index('Date')
df1.columns = df1.columns.str.strip().str.replace(r'\s+', ':', regex=True).str.split(r'[:]+', expand=True)
df1 = df1.stack([0, 1]).rename_axis(['Date', 'Symbol', 'Gender']).reset_index()
print(df1)
Date Symbol Gender Atronaut engineer teacher
0 20220405 GB Male 34 23 12
1 20220405 GB female 34 22 11
2 20220405 IN Male 5 29 25
3 20220405 IN female 23 23 41
4 20220404 GB Male 32 23 12
5 20220404 GB female 34 23 10
6 20220404 IN Male 4 29 21
7 20220404 IN female 22 23 40
pivot_longer from pyjanitor offers an easy way to abstract the reshaping; in this case it can be solved with a regular expression:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
    index='Date',
    names_to=('symbol', 'gender', '.value'),
    names_pattern=r"(.+):\s*(.+)\s+(.+)",
    sort_by_appearance=True)
Date symbol gender teacher engineer Atronaut
0 20220405 IN Male 25 29 5
1 20220405 IN female 41 23 23
2 20220405 GB Male 12 23 34
3 20220405 GB female 11 22 34
4 20220404 IN Male 21 29 4
5 20220404 IN female 40 23 22
6 20220404 GB Male 12 23 32
7 20220404 GB female 10 23 34
The regular expression has capture groups; any group paired with .value stays as a header, while the rest become column values.
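For example, this is how the pattern carves up a single (stripped) header; the first two captures feed the symbol and gender columns, while the occupation paired with .value stays as a column header (illustrative only):
import re

print(re.match(r"(.+):\s*(.+)\s+(.+)", "IN: Male Atronaut").groups())
# ('IN', 'Male', 'Atronaut')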

Pandas: drop both rows if one column matches same and another don't

I want to drop both rows in a pandas data frame where the value in one column (recharge_number) is duplicated while the value in another column (account) differs between the duplicates. An illustrative example:
data = {'account': [43, 43, 43, 43, 45, 45],
        'recharge_number': [17777, 17777, 17999, 17888, 17222, 17999],
        'year': [2021, 2021, 2021, 2021, 2020, 2020],
        'month': [2, 3, 5, 6, 2, 9]}
df = pd.DataFrame(data)
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17999 2021 5
43 17888 2021 6
45 17222 2020 2
45 17999 2020 9
output:
account recharge_number year month
43 17777 2021 2
43 17777 2021 3
43 17888 2021 6
45 17222 2020 2
Another method is to drop rows instead of keeping them:
>>> df.drop(df[~df.duplicated(['id', 'number'], keep=False)
               & df.duplicated('number', keep=False)].index)
id number
0 5 10
1 5 10
3 6 20
5 7 40
The first condition protects all duplicated ('id', 'number') records. The second condition removes all records where 'number' is the same.
Basically, you want "the full row (or the two columns, in a larger dataframe) is duplicated" or "number is not duplicated".
You can use duplicated:
df[df[['id', 'number']].duplicated(keep=False) | ~df['number'].duplicated(keep=False)]
Output:
id number
0 5 10
1 5 10
3 6 20
5 7 40
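The same duplicated-based idea applied to the question's own columns (a sketch; df is built from the data dict given in the question):
import pandas as pd

df = pd.DataFrame(data)  # the dict from the question
keep = (df.duplicated(['account', 'recharge_number'], keep=False)
        | ~df['recharge_number'].duplicated(keep=False))
print(df[keep])
#    account  recharge_number  year  month
# 0       43            17777  2021      2
# 1       43            17777  2021      3
# 3       43            17888  2021      6
# 4       45            17222  2020      2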
Solution with pd.crosstab:
mask = pd.crosstab(df["account"], df["recharge_number"]).ne(0).sum().gt(1)
print(df[~df["recharge_number"].isin(mask[mask].index)])
The crosstab counts how often each recharge_number occurs per account; .ne(0).sum().gt(1) flags the numbers that appear under more than one account, and those rows are filtered out.
Prints:
account recharge_number year month
0 43 17777 2021 2
1 43 17777 2021 3
3 43 17888 2021 6
4 45 17222 2020 2

How to create a partially filled column in pandas

I have a df_trg with, say, 10 rows numbered 0-9.
From various sources I get values for an additional column foo, each source covering only a subset of the rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to get a data frame with a single new column foo where some rows may remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
    df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with the updating data), and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation by adding the foo column, initially filled with an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
    df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer than yours, and shorter.
Alternative: fill the foo column initially with NaN, but then the updated values will be converted to float (a side effect of using NaN).
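A sketch of that NaN-based alternative, reusing df and sources from above (plain .loc assignment is used here instead of update, which is just a stylistic choice):
import numpy as np

# Start foo as NaN, then fill in whatever each source provides; the filled
# values come back as floats because the column dtype is float.
df['foo'] = np.nan
for src in sources:
    df.loc[src.index, 'foo'] = src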

Cumulative value counts of categorical data, with group by

In my data frame, I have a text column group with the group name, and a column drop_week holding a categorical value in the range [1, 4]. I want to store, for each group, the cumulative count of the values 1 to 4 of drop_week. I'm doing this:
drop_data = all_data[['group', 'drop_week']].groupby('group')['drop_week'] \
    .value_counts().unstack().transpose().fillna(0).cumsum().transpose()
and it works. But since it took me 2 hours of googling to come up with this solution, I was wondering if there is a better way to do it.
You could use pd.crosstab to create the frequency table. Then use cumsum(axis=1) to compute the cumulative sum across each row:
pd.crosstab(index=all_data['group'], columns=all_data['drop_week']).cumsum(axis=1)
# drop_week 1 2 3 4
# group
# 0 12 17 21 27
# 1 7 13 18 25
# 2 9 14 22 26
# 3 5 11 16 22
which agrees with
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
             .value_counts().unstack().transpose().fillna(0).cumsum().transpose())
# drop_week 1 2 3 4
# group
# 0 12 17 21 27
# 1 7 13 18 25
# 2 9 14 22 26
# 3 5 11 16 22
The setup I used for this was:
import numpy as np
import pandas as pd
np.random.seed(2019)
N = 100
all_data = pd.DataFrame({'group': np.random.randint(4, size=N),
                         'drop_week': np.random.randint(1, 5, size=N)})
drop_data = (all_data[['group', 'drop_week']].groupby('group')['drop_week']
             .value_counts().unstack().transpose().fillna(0).cumsum().transpose())
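To double-check that the crosstab result and the original expression agree, something like this should work (check_dtype=False because fillna(0) leaves the original version as floats while crosstab produces ints):
pd.testing.assert_frame_equal(
    pd.crosstab(index=all_data['group'], columns=all_data['drop_week']).cumsum(axis=1),
    drop_data,
    check_dtype=False,  # int counts vs float counts
)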