How to make a normalized series from a pandas DataFrame?

I have the following code:
import pandas as pd

df = pd.DataFrame({
    'FR': [4.0405, 4.0963, 4.3149, 4.500],
    'GR': [1.7246, 1.7482, 1.8519, 4.100],
    'IT': [804.74, 810.01, 860.13, 872.01]},
    index=['1980-04-01', '1980-03-01', '1980-02-01', '1980-01-01'])

df = df.iloc[::-1]
df2 = df.pct_change()
df2 = df2.iloc[::-1]
df = df.iloc[::-1]

last = 100
serie = []
serie.append(last)
for i in list(df.index.values[::-1][1:]):
    last = last * (1 + df2['FR'][i])
    serie.append(last)
serie
I got what I expected:
[100, 95.88666666666667, 91.02888888888891, 89.7888888888889]
but I'm looking for a simpler way to do this.
Thanks.

Try with cumprod:
df.iloc[::-1].pct_change().add(1).fillna(1).cumprod()
Output:
                  FR        GR        IT
1980-01-01  1.000000  1.000000  1.000000
1980-02-01  0.958867  0.451683  0.986376
1980-03-01  0.910289  0.426390  0.928900
1980-04-01  0.897889  0.420634  0.922856
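If you want the series to start at 100, as in your loop, a minimal follow-up is to rescale the base-1 index by 100 (same cumprod expression, just multiplied):
# Rescale the cumulative index so it starts at 100, matching the loop above
out = df.iloc[::-1].pct_change().add(1).fillna(1).cumprod().mul(100)
out['FR']
1980-01-01    100.000000
1980-02-01     95.886667
1980-03-01     91.028889
1980-04-01     89.788889
Name: FR, dtype: float64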

Related

How to use method chaining in pandas to aggregate a DataFrame?

I want to aggregate a pandas DataFrame using method chaining. I don't know how to start with the DataFrame and just refer to it when aggregating (using method chaining). Consider the following example that illustrates my intention:
Having this data:
import pandas as pd
my_df = pd.DataFrame({
    'name': ['john', 'diana', 'rachel', 'chris'],
    'favorite_color': ['red', 'blue', 'green', 'red']
})
my_df
#>      name favorite_color
#> 0    john            red
#> 1   diana           blue
#> 2  rachel          green
#> 3   chris            red
and I want to end up with this summary table:
#>    total_people  total_ppl_who_like_red
#> 0             4                       2
Of course there are so many ways to do it. One way, for instance, would be to build a new DataFrame:
desired_output_via_building_new_df = pd.DataFrame({
    'total_people': [len(my_df)],
    'total_ppl_who_like_red': [my_df.favorite_color.eq('red').sum()]
})
desired_output_via_building_new_df
#>    total_people  total_ppl_who_like_red
#> 0             4                       2
However, I'm looking for a way to use "method chaining"; starting with my_df and working my way forward. Something along the lines of
# pseudo-code; not really working
my_df.agg({
    'total_people': lambda x: len(x),
    'total_ppl_who_like_red': lambda x: x.favorite_color.eq('red').sum()
})
I can only borrow inspiration from R/dplyr code:
library(dplyr, warn.conflicts = FALSE)
my_df <-
  data.frame(name = c("john", "diana", "rachel", "chris"),
             favorite_color = c("red", "blue", "green", "red")
  )
my_df |>
  summarise(total_people = n(),                                    ## in the context of `summarise()`,
            total_ppl_who_like_red = sum(favorite_color == "red")) ## both `n()` and `sum()` refer to `my_df` because we start with `my_df` and pipe it "forward" to `summarise()`
#>   total_people total_ppl_who_like_red
#> 1            4                      2
A solution for processing a single Series:
df = my_df.favorite_color.apply({
    'total_people': lambda x: x.count(),
    'total_ppl_who_like_red': lambda x: x.eq('red').sum()
}).to_frame(name=0).T

print(df)
   total_people  total_ppl_who_like_red
0             4                       2
A general solution for processing the whole DataFrame is DataFrame.pipe - pandas then passes the input DataFrame to the function, whereas apply or agg process each column separately:
df = (my_df.pipe(lambda x: pd.Series({'total_people': len(x),
                                      'total_ppl_who_like_red': x.favorite_color.eq('red').sum()}))
           .to_frame(name=0).T)

print(df)
   total_people  total_ppl_who_like_red
0             4                       2
# my_df2 is assumed here to be my_df with an additional numeric 'age' column
df = my_df2.pipe(lambda x: pd.Series({'total_people': len(x),
                                      'total_ppl_who_like_red': x.favorite_color.eq('red').sum(),
                                      'max_age': x.age.max()
                                      }).to_frame(name=0).T)

print(df)
   total_people  total_ppl_who_like_red  max_age
0             4                       2       41
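If you prefer to skip the Series/to_frame/transpose step, a minimal variant of the same pipe idea (not from the original answer) is to build the one-row summary frame directly inside the piped function:
# Sketch: the piped lambda returns the one-row summary DataFrame directly
out = my_df.pipe(lambda d: pd.DataFrame({
    'total_people': [len(d)],
    'total_ppl_who_like_red': [d.favorite_color.eq('red').sum()],
}))
print(out)
   total_people  total_ppl_who_like_red
0             4                       2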

Grouper by day and cumsum of speed

I have the following df:
I want to group this df on the first column (ID) and on the second column (key), and from there build a cumulative sum for each day. The cumsum should be over the last column (Speed).
I tried this with the following code:
df = pd.read_csv('df.csv')
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['ID', 'key'])
grouped = df.groupby(['ID', 'key'])

test = pd.DataFrame()
test2 = pd.DataFrame()
for name, group in grouped:
    test = group.groupby(pd.Grouper(key='Time', freq='1d'))['Speed'].cumsum()
    test = test.reset_index()
    test['ID'] = ''
    test['ID'] = name[0]
    test['key'] = ''
    test['key'] = name[1]
    test2 = test2.append(test)
But the result seems quite off: there are more than 5 rows, rather than one row per day with the cumsum for each ID and key.
Does anyone see the reason for my problem?
Thanks in advance.
Friendly reminder: it's useful to include a runnable example.
import pandas as pd
data = [{"cid":33613,"key":14855,"ts":1550577600000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550579340000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550584800000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550682000000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550685900000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550773380000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550858400000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550941200000,"value":25.0},
{"cid":33613,"key":14855,"ts":1550978400000,"value":50.0}]
df = pd.DataFrame(data)
df['ts'] = pd.to_datetime(df['ts'], unit='ms')
I believe what you need can be accomplished as follows:
df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum().cumsum()
Result:
cid    key    ts
33613  14855  2019-02-19    150.0
              2019-02-20    250.0
              2019-02-21    300.0
              2019-02-22    350.0
              2019-02-23    375.0
              2019-02-24    425.0
Name: value, dtype: float64
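Note that the cumsum in this one-liner runs over the whole resampled series; with the single (cid, key) pair in the sample data that makes no difference, but if the running total should restart for every ID/key pair, one possible adjustment (a sketch, not from the original answer) is:
# Sketch: restart the running total for every (cid, key) group
(df.set_index('ts')
   .groupby(['cid', 'key'])['value']
   .resample('D').sum()
   .groupby(level=['cid', 'key']).cumsum())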

Join 2 dataframes with special column matching

I want to join two dataframes and get the result shown below. I tried many ways, but they fail.
I want only the texts in df2['A'] that contain the text in df1['A']. What do I need to change in my code?
I want:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
import pandas as pd
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
    })
df2 = pd.DataFrame(
    {"A": ["A0_link0", "A1_link1", "A2_link2", "A3_link3", "A4_link4", "An_linkn"],
     "B": ["B0_link0", "B1_link1", "B2_link2", "B3_link3", "B4_link4", "Bn_linkn"]
    })
result = pd.concat([df1, df2], ignore_index=True, join= "inner", sort=False)
print(result)
Create an intermediate dataframe and map:
d = (df2.assign(key=df2['A'].str.extract(r'([^_]+)'))
        .set_index('key'))
df1['A'].map(d['A'])
Output:
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3
Name: A, dtype: object
Or use merge if you want several columns from df2: df1.merge(d, left_on='A', right_index=True).
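Expanded, that merge could look like this (a sketch; since df1 and d both have an A column, pandas should add its default _x/_y suffixes to the overlapping names):
# d keeps both A and B from df2, indexed by the part before the underscore
result = df1.merge(d, left_on='A', right_index=True)
print(result)
  A_x       A_y         B
0  A0  A0_link0  B0_link0
1  A1  A1_link1  B1_link1
2  A2  A2_link2  B2_link2
3  A3  A3_link3  B3_link3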
You can set the index to the An part and use pd.concat on columns:
result = (pd.concat([df1.set_index(df1['A']),
                     df2.set_index(df2['A'].str.split('_').str[0])],
                    axis=1, join="inner", sort=False)
            .reset_index(drop=True))
print(result)
    A         A         B
0  A0  A0_link0  B0_link0
1  A1  A1_link1  B1_link1
2  A2  A2_link2  B2_link2
3  A3  A3_link3  B3_link3
df2.A.loc[df2.A.str.split('_',expand=True).iloc[:,0].isin(df1.A)]
0 A0_link0
1 A1_link1
2 A2_link2
3 A3_link3

How do I drop columns in a pandas dataframe that exist in another dataframe?

How do I drop columns in raw_clin if the same columns already exist in raw_clinical_sample? Using isin raised a "cannot compute isin with a duplicate axis" error.
Explanation of the code:
I want to merge raw_clinical_patient and raw_clinical_sample dataframes. However, the SAMPLE_ID column in raw_clinical_sample should be relabeled as PATIENT_ID before the merge (because it was wrongly labelled). I want the new PATIENT_ID to be the index of raw_clin.
import pandas as pd
# Clinical patient info
raw_clinical_patient = pd.read_csv("./gbm_tcga/data_clinical_patient.txt", sep="\t", header=4)
raw_clinical_patient["PATIENT_ID"] = raw_clinical_patient["PATIENT_ID"].replace()
raw_clinical_patient.set_index("PATIENT_ID", inplace=True)
raw_clinical_patient.sort_index()
# Clinical sample info
raw_clinical_sample = pd.read_csv("./gbm_tcga/data_clinical_sample.txt", sep="\t", header=4)
raw_clinical_sample.set_index("PATIENT_ID", inplace=True)
raw_clinical_sample = raw_clinical_sample[raw_clinical_sample.index.isin(raw_clinical_patient.index)]
# Get the actual patient ID from the `raw_clinical_sample` dataframe
# Drop "PATIENT_ID" and rename "SAMPLE_ID" as "PATIENT_ID" and set as index
raw_clin = raw_clinical_patient.merge(raw_clinical_sample, on="PATIENT_ID", how="left").reset_index().drop(["PATIENT_ID"], axis=1)
raw_clin.rename(columns={'SAMPLE_ID':'PATIENT_ID'}, inplace=True)
raw_clin.set_index('PATIENT_ID', inplace=True)
Now, I want to drop from raw_clin all the columns that came from raw_clinical_sample, since the only columns needed from it were PATIENT_ID and SAMPLE_ID.
# Drop columns that exist in `raw_clinical_sample`
raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]
Traceback:
ValueError                                Traceback (most recent call last)
<ipython-input-60-45e2e83ddc00> in <module>()
     18
     19 # Drop columns that exist in `raw_clinical_sample`
---> 20 raw_clin = raw_clin[~raw_clin.isin(raw_clinical_sample)]

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in isin(self, values)
  10514         elif isinstance(values, DataFrame):
  10515             if not (values.columns.is_unique and values.index.is_unique):
> 10516                 raise ValueError("cannot compute isin with a duplicate axis.")
  10517             return self.eq(values.reindex_like(self))
  10518         else:

ValueError: cannot compute isin with a duplicate axis.
There are many ways to do this.
For example using isin:
new_df1 = df1.loc[:, ~df1.columns.isin(df2.columns)]
or with drop:
new_df1 = df1.drop(columns=df1.columns.intersection(df2.columns))
example input:
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
output:
pd.DataFrame(columns=['A', 'C', 'D'])
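Applied to the question's frames, that would be roughly (an untested sketch using the names from the question):
# Keep only the columns of raw_clin that do not also appear in raw_clinical_sample
raw_clin = raw_clin.loc[:, ~raw_clin.columns.isin(raw_clinical_sample.columns)]
# or equivalently
raw_clin = raw_clin.drop(columns=raw_clin.columns.intersection(raw_clinical_sample.columns))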
You can use set operations for your application like this:
df1 = pd.DataFrame()
df1['string'] = ['Hello', 'Hi', 'Hola']
df1['number'] = [1, 2, 3]
df2 = pd.DataFrame()
df2['string'] = ['Hello', 'Hola']
df2['number'] = [1, 5]
ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))
df_out = pd.DataFrame(list(ds1.difference(ds2)))
df_out.columns = df1.columns
print(df_out)
Output:
string number
0 Hola 3
1 Hi 2
Inspired by: https://stackoverflow.com/a/18184990/7509907
Edit:
Sorry I didn't notice you need to drop the columns. For that, you can use the following: (using mozway's dummy example)
df1 = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(columns=['B', 'E'])
ds1 = set(df1.columns)
ds2 = set(df2.columns)
cols = ds1.difference(ds2)
df = df1[cols]
print(df)
Output:
Empty DataFrame
Columns: [C, A, D]
Index: []

Concatenating 2 dataframes vertically with empty row in middle

I have a multindex dataframe df1 as:
node     A1     A2
bkt      B1     B2
Month
1      0.15  -0.83
2      0.06  -0.12
df1.columns
MultiIndex([('A1', 'B1'),
            ('A2', 'B2')],
           names=['node', 'bkt'])
and another similar multiindex dataframe df2 as:
node      A1     A2
bkt       B1     B2
Month
1      -0.02  -0.15
2       0      0
3      -0.01  -0.01
4      -0.06  -0.11
I want to concat them vertically so that resulting dataframe df3 looks as following:
df3 = pd.concat([df1, df2], axis=0)
While concatenating, I want to introduce 2 blank rows between dataframes df1 and df2. In addition, I want to introduce two strings, Basis Mean and Basis P25, in df3 as shown below.
print(df3)
Basis Mean
node     A1     A2
bkt      B1     B2
Month
1      0.15  -0.83
2      0.06  -0.12

Basis P25
node      A1     A2
bkt       B1     B2
Month
1      -0.02  -0.15
2       0      0
3      -0.01  -0.01
4      -0.06  -0.11
I don't know whether there is any way of doing the above.
I don't think what you are describing is really a concatenation.
The following could already do the trick:
print('Basis Mean')
print(df1.to_string())
print('\n')
print('Basis P25')
print(df2.to_string())
This isn't usually how DataFrames are used, but perhaps you wish to append rows of empty strings in between df1 and df2, along with rows containing your titles?
df1 = pd.concat([pd.DataFrame([["Basis","Mean",""]],columns=df1.columns), df1], axis=0)
df1 = df1.append(pd.Series("", index=df1.columns), ignore_index=True)
df1 = df1.append(pd.Series("", index=df1.columns), ignore_index=True)
df1 = df1.append(pd.Series(["Basis","P25",""], index=df1.columns),ignore_index=True)
df3 = pd.concat([df1, df2], axis=0)
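Note that DataFrame.append was removed in pandas 2.0, so on recent versions the spacer rows would need pd.concat instead; a rough sketch of the same idea (covering only the blank rows, not the title rows):
# Build a one-row frame of empty strings and concatenate it twice instead of calling append
blank = pd.DataFrame([[""] * len(df1.columns)], columns=df1.columns)
df1 = pd.concat([df1, blank, blank], ignore_index=True)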
The author clarified in the comments that they want to make it easy to print to an Excel file. That can be achieved using pd.ExcelWriter.
Below is an example of how to do it.
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

import pandas as pd


@dataclass
class SaveTask:
    df: pd.DataFrame
    header: Optional[str]
    extra_pd_settings: Optional[Dict[str, Any]] = None


def fill_xlsx(
    save_tasks: List[SaveTask],
    writer: pd.ExcelWriter,
    sheet_name: str = "Sheet1",
    n_rows_between_blocks: int = 2,
) -> None:
    current_row = 0
    for save_task in save_tasks:
        extra_pd_settings = save_task.extra_pd_settings or {}
        if "startrow" in extra_pd_settings:
            raise ValueError(
                "You should not use parameter 'startrow' in extra_pd_settings"
            )
        save_task.df.to_excel(
            writer,
            sheet_name=sheet_name,
            startrow=current_row + 1,
            **extra_pd_settings
        )
        worksheet = writer.sheets[sheet_name]
        worksheet.write(current_row, 0, save_task.header)
        has_header = extra_pd_settings.get("header", True)
        current_row += (
            1 + save_task.df.shape[0] + n_rows_between_blocks + int(has_header)
        )


if __name__ == "__main__":
    # INPUTS
    df1 = pd.DataFrame(
        {"hello": [1, 2, 3, 4], "world": [0.55, 1.12313, 23.12, 0.0]}
    )
    df2 = pd.DataFrame(
        {"foo": [3, 4]},
        index=pd.MultiIndex.from_tuples([("foo", "bar"), ("baz", "qux")]),
    )

    # Xlsx creation
    writer = pd.ExcelWriter("test.xlsx", engine="xlsxwriter")
    fill_xlsx(
        [
            SaveTask(
                df1,
                "Hello World Table",
                {"index": False, "float_format": "%.3f"},
            ),
            SaveTask(df2, "Foo Table with MultiIndex"),
        ],
        writer,
    )
    writer.save()
As an extra bonus, pd.ExcelWriter lets you save data to different sheets of the Excel file and choose their names right from the Python code.
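For instance, a minimal sketch of writing each frame to its own named sheet (the file and sheet names here are arbitrary examples):
# Each DataFrame goes to its own named sheet in the same workbook
with pd.ExcelWriter("report.xlsx", engine="xlsxwriter") as writer:
    df1.to_excel(writer, sheet_name="Hello World")
    df2.to_excel(writer, sheet_name="Foo")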