Replacing column names in a pandas dataframe based on a lookup - pandas

Hi, I have several dataframes whose column headings vary slightly. An example header of one dataframe is:
A1 B1 C1
In other dataframes the first row is called A2, A3, etc. A1, B1, C1 stand in for multi-character words/labels and are not the literal column names. I want to replace the column headings based on a mapping I have between A1, A2, Ax, B1, B2, Bx, C1, C2, Cx, etc. and A, B and C.
What is the best way of doing this?
Thanks in advance.

I think here it is possible to use string indexing via .str to replace each name by its first letter:
df.columns = df.columns.str[0]
Another possible solution is to create a dictionary for the replacement, e.g.:
d = {x:x[0] for x in df.columns}
df = df.rename(columns=d)
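Putting both options together in a minimal, self-contained sketch (the column names here are hypothetical stand-ins for the real headings):

```python
import pandas as pd

# Hypothetical frame whose headings carry varying suffixes
df = pd.DataFrame([[1, 2, 3]], columns=['A1', 'B2', 'C3'])

# Option 1: keep only the first character of each heading
df1 = df.copy()
df1.columns = df1.columns.str[0]

# Option 2: build an explicit mapping and rename
d = {x: x[0] for x in df.columns}
df2 = df.rename(columns=d)

print(list(df1.columns))  # ['A', 'B', 'C']
print(list(df2.columns))  # ['A', 'B', 'C']
```

If the real headings do not start with the target letter, replace the `x[0]` rule with your own lookup dictionary passed to rename.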

Related

Dataframe to multiIndex for sktime format

I have multivariate time series data in this format (a pd.DataFrame with its index on Time).
I am trying to use sktime, which requires the data to be in a multi-index format. On the above data I want to use a rolling window of 3, which requires this format: a pd.DataFrame with a MultiIndex on (instance, time).
I was wondering if it is possible to transform it to the new format.
Edit: here's a more straightforward and probably faster solution using row indexing.
import pandas as pd

df = pd.DataFrame({
    'time': range(5),
    'a': [f'a{i}' for i in range(5)],
    'b': [f'b{i}' for i in range(5)],
})
w = 3
w_starts = range(0, len(df) - (w - 1))  # start positions of each window
# iterate through the overlapping windows to create the 'instance' col and concat
roll_df = pd.concat(
    df[s:s + w].assign(instance=i) for (i, s) in enumerate(w_starts)
).set_index(['instance', 'time'])
print(roll_df)
Output
                a   b
instance time
0        0     a0  b0
         1     a1  b1
         2     a2  b2
1        1     a1  b1
         2     a2  b2
         3     a3  b3
2        2     a2  b2
         3     a3  b3
         4     a4  b4
Here's one way to achieve the desired result:
import numpy as np
import pandas as pd

# Create the instance column
instance = np.repeat(range(len(df) - 2), 3)
# Collect the Time values for each rolling window
time = np.concatenate([df.Time[i:i+3].values for i in range(len(df) - 2)])
# Collect the A values for each rolling window
a = np.concatenate([df.A[i:i+3].values for i in range(len(df) - 2)])
# Collect the B values for each rolling window
b = np.concatenate([df.B[i:i+3].values for i in range(len(df) - 2)])
# Create a new DataFrame with the desired format
new_df = pd.DataFrame({'Instance': instance, 'Time': time, 'A': a, 'B': b})
# Set the MultiIndex on the new DataFrame
new_df.set_index(['Instance', 'Time'], inplace=True)
new_df
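As a quick check, this NumPy approach can be run against a toy frame like the one in the first answer (column names Time, A, B assumed here); it yields 3 windows of 3 rows each:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': range(5),
    'A': [f'a{i}' for i in range(5)],
    'B': [f'b{i}' for i in range(5)],
})

# One instance id per window, each repeated for the 3 rows it covers
instance = np.repeat(range(len(df) - 2), 3)
# Stack the Time/A/B values of every length-3 window end to end
time = np.concatenate([df.Time[i:i + 3].values for i in range(len(df) - 2)])
a = np.concatenate([df.A[i:i + 3].values for i in range(len(df) - 2)])
b = np.concatenate([df.B[i:i + 3].values for i in range(len(df) - 2)])

new_df = pd.DataFrame({'Instance': instance, 'Time': time, 'A': a, 'B': b})
new_df = new_df.set_index(['Instance', 'Time'])
print(new_df.shape)  # (9, 2)
```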

How can I get value from one column in another based on a condition

I have two columns, A and B. I want to filter the dataframe and keep the rows of column A where it is 'A3', plus the rows of column A that have a date in column B.
So if a row has a date in column B, keep its data in column A.
The first part of the code works as follows; what can I do about the second condition?
Df = Data[Data["Reason"].isna(['A3'])] ....?
df = pd.DataFrame({'A': {0: 'a', 1: 'a3', 7: 'a3', 8: 'a3', 9: 'a9'}, 'B': {2: '2018-05-01', 3: '2018-05-02', 4: '2018-07-05', 5: '2018-11-02',6: '2019-01-02',1: '-13'}})
df
I think there was a bit of confusion without data.
I want to keep all the 'a3' rows in column A, irrespective of any condition in column B (see the code above).
The empty rows in column A have a date in column B, so I want to keep those empty ones in A based on the value of B and fill them with a value like 'B3'.
Then, when I analyze, A3 will mean accepted and B3 will mean rejected, so both a3 and b3 end up in one column.
If I understood correctly, you want all rows where column B isn't empty, plus those rows where column B is empty but column A contains "A3".
# First let's take all rows where column A is "A3" but column B is empty
a3_data = Data[Data["ColumnA"].eq("A3") & Data["Column B"].isnull()]
# Then we take all rows where column B is not empty
data_with_date = Data[Data["Column B"].notnull()]
# And in the end we combine these two dataframes (DataFrame.append was removed in pandas 2.0)
final_data = pd.concat([data_with_date, a3_data])
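A minimal runnable sketch of that combine-two-selections idea, using hypothetical column names A and B and pd.concat in place of the removed DataFrame.append:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a3', 'a3', 'a9', None, None],
    'B': [None, '2018-05-01', None, '2018-07-05', None],
})

# Rows where A is 'a3' and B is empty
a3_data = df[df['A'].eq('a3') & df['B'].isnull()]

# Rows where B holds a date (is not empty)
with_date = df[df['B'].notnull()]

# Combine the two selections
final_data = pd.concat([a3_data, with_date])
print(sorted(final_data.index))  # [0, 1, 3]
```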

How to select rows that have missing values in columns depending on conditions for dataframes?

I have a dataframe extracted from an Excel sheet.
I am looking for the NOT legit rows.
A legit row meets ANY of the following conditions:
exactly 1 column is filled in and the other columns are empty or null
exactly 2 columns are filled in and the other columns are empty or null
all 8 columns are filled in
So a NON-legit row is the opposite of the above, such as:
7 of the 8 columns are filled in but one is empty
6 of the 8 columns are filled in but two are empty
and so on...
The 8 columns I am interested in are: A, B, D, E, F, G, I, L.
I only want to return the rows that are NOT legit.
I know how to find rows that are empty in specific columns, but I am not sure how to find the non-legit rows based on the above conditions.
empty_A = sheet[sheet[sheet.columns[0]].isnull()]
empty_B = sheet[sheet[sheet.columns[1]].isnull()]
empty_D = sheet[sheet[sheet.columns[3]].isnull()]
empty_E = sheet[sheet[sheet.columns[4]].isnull()]
empty_F = sheet[sheet[sheet.columns[5]].isnull()]
empty_G = sheet[sheet[sheet.columns[6]].isnull()]
empty_I = sheet[sheet[sheet.columns[8]].isnull()]
empty_L = sheet[sheet[sheet.columns[11]].isnull()]
print(empty_G)
UPDATE:
I solved it using a list comprehension.
If you have already populated your dataframe, then you can do it like this:
import numpy as np
import pandas as pd
## Generate random data
raw_data = np.random.choice([None, 1], (50, 8))
raw_data = np.r_[raw_data, np.random.choice([None, 1, 2, 3], (50, 8))]
## Create a dataframe from the random data
df = pd.DataFrame(raw_data, columns="A,B,D,E,F,G,I,L".split(","))
notnull_counts = (~df.isnull()).sum(axis=1)
## filter rows with your condition
legit_rows = df[((notnull_counts==1) | (notnull_counts==2) | (notnull_counts==8))]
non_legit_rows = df[~((notnull_counts==1) | (notnull_counts==2) | (notnull_counts==8))]
display(legit_rows)
It seems like you want to count the number of null values in these 8 particular columns and select rows based on how many nulls are found. That phrasing suggests summing and selecting based on that sum. Most pandas operations default to performing columnwise operations, so you need to tell sum() to perform the sum for each row by using axis="columns", like so:
# This is a series indexed like df.
# It counts the number of null values in the given columns.
n_null = df[["A", "B", "D", "E", "F", "G", "I", "L"]].isnull().sum(axis="columns")
# This selects the rows where n_null has certain values.
# Legit rows have 0, 6 or 7 nulls (8, 2 or 1 filled), so non-legit rows
# have 1-5 nulls or all 8 empty.
df_notlegit = df.loc[n_null.isin([1, 2, 3, 4, 5, 8])]
# This is another way to do it.
df_nonlegit = df.loc[~n_null.isin([0, 6, 7])]
df.loc[~df.isna().sum(axis=1).isin([0, 6, 7])]
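A small check of the null-count rule on toy data (rows built so that the legit fill counts of 1, 2 and 8 appear alongside non-legit ones):

```python
import pandas as pd

cols = list('ABDEFGIL')  # the 8 columns of interest
df = pd.DataFrame([
    [1] * 8,               # all 8 filled -> legit
    [1] + [None] * 7,      # 1 filled     -> legit
    [1, 1] + [None] * 6,   # 2 filled     -> legit
    [1] * 7 + [None],      # 7 filled     -> NOT legit
    [None] * 8,            # 0 filled     -> NOT legit
], columns=cols)

n_null = df[cols].isnull().sum(axis='columns')
non_legit = df.loc[~(8 - n_null).isin([1, 2, 8])]
print(non_legit.index.tolist())  # [3, 4]
```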

pandas: appending a row to a dataframe with values derived using a user defined formula applied on selected columns

I have a dataframe:
df = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'))
I can use the following to compute traditional aggregates like mean(), sum(), etc.:
df.loc['calc'] = df[['A','D']].iloc[2:4].mean(axis=0)
Now I have two questions:
How can I apply a formula (like exp(mean()) or 2.5*mean()/sqrt(max())) to columns 'A' and 'D' for rows 2 to 4?
How can I append a row to the existing df where two values would be the mean() of A and D, and two values would be the result of a specific formula applied to C and B?
Q1:
You can use .apply() and lambda functions.
df.iloc[2:4,[0,3]].apply(lambda x: np.exp(np.mean(x)))
df.iloc[2:4,[0,3]].apply(lambda x: 2.5*np.mean(x)/np.sqrt(max(x)))
Q2:
You can use dictionaries, combine them, and add the result as a new row.
The first one is the mean; the second one is a custom function.
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum()*45))
Combine them and append the row (pd.concat, since DataFrame.append was removed in pandas 2.0):
ad.update(bc)
df = pd.concat([df, pd.DataFrame([ad])], ignore_index=True)
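Putting Q2 together on a reproducible frame (seeded random data; the *45 formula is just the example from above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'))

# Mean of A and D, custom formula for B and C
ad = dict(df[['A', 'D']].mean())
bc = dict(df[['B', 'C']].apply(lambda x: x.sum() * 45))

# Merge the two dicts and append them as one new row
ad.update(bc)
df = pd.concat([df, pd.DataFrame([ad])], ignore_index=True)
print(len(df))  # 6
```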

Assigning value in column a based on column b

I have three values in column F: A, B and C.
Now, I want to map A = Yes, B = No, C = Maybe.
If the value matches in column F, the corresponding result should be output in column G.
Can you please help?
Try this:
=IFERROR(CHOOSE(CODE(UPPER(F2))-64,"Yes","No","Maybe"),"Not Valid")
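For anyone doing the same lookup in pandas rather than Excel, an equivalent sketch (column names F and G taken from the question) could use Series.map with a fallback:

```python
import pandas as pd

df = pd.DataFrame({'F': ['A', 'b', 'C', 'x']})
mapping = {'A': 'Yes', 'B': 'No', 'C': 'Maybe'}

# Upper-case first so 'a' and 'A' map the same way,
# then fall back to 'Not Valid' for anything outside the mapping
df['G'] = df['F'].str.upper().map(mapping).fillna('Not Valid')
print(df['G'].tolist())  # ['Yes', 'No', 'Maybe', 'Not Valid']
```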