Subtract each value in a column from the entire column - pandas

I have the following df1:
            prueba
12-03-2018       7
08-03-2018       1
06-03-2018       9
05-03-2018       5
I would like to take each value in the column, beginning with the last (5), and subtract it from the entire column, then iterate upwards and do the same with the remaining values. Each subtraction should produce a column, and I would like to collect the results of all the subtractions into a single df.
The desired output would be something like this:
            05-03-2018  06-03-2018  08-03-2018  12-03-2018
12-03-2018           2          -2           6           0
08-03-2018          -4          -8           0         NaN
06-03-2018           4           0         NaN         NaN
05-03-2018           0         NaN         NaN         NaN
To obtain the desired output, I first took df1 and sorted it:
df2=df1.sort_index(ascending=True)
create an empty df:
main_df=pd.DataFrame()
and then iterated over the rows of df2, subtracting each value from the df1 column:
for index, row in df2.iterrows():
    datos = df1 - row['prueba']
    df = pd.DataFrame(data=datos, index=index)
    if main_df.empty:
        main_df = df
    else:
        main_df = main_df.join(df)
print(main_df)
However, the following error is raised:
TypeError: Index(...) must be called with a collection of some kind, '05-03-2018' was passed
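(A side note on the error itself: index is a single label here, and pd.DataFrame expects a collection for its index argument. Since datos already carries df1's index, a minimal sketch of a fix, assuming the column is still named 'prueba', is to rename the column instead of rebuilding the frame:
# each iteration then yields a one-column frame labelled by the current date
df = datos.rename(columns={'prueba': index})
)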

You can use np.triu with array subtraction:
import numpy as np
import pandas as pd

s = df.prueba.values.astype(float)
s = np.triu((s - s[:, None]).T)               # pairwise differences s[i] - s[j], upper triangle only
s[np.tril_indices(s.shape[0], -1)] = np.nan   # blank out the lower triangle
pd.DataFrame(s, columns=df.index, index=df.index).reindex(columns=df.index[::-1])
Out[482]:
            05-03-2018  06-03-2018  08-03-2018  12-03-2018
12-03-2018         2.0        -2.0         6.0         0.0
08-03-2018        -4.0        -8.0         0.0         NaN
06-03-2018         4.0         0.0         NaN         NaN
05-03-2018         0.0         NaN         NaN         NaN
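The key step is s - s[:, None], which broadcasts the 1-D array against its column-vector form to produce the full matrix of pairwise differences; the transpose then orients it so that row i holds s[i] - s[j], and np.triu plus np.tril_indices carve out the staircase of NaNs.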

Kind of messy, but it does the job:
count = 0
df_new = pd.DataFrame()
# walk the values from oldest to newest; v is the value subtracted this
# round and date labels the resulting column
for v, date in zip(df["prueba"][::-1], df.index[::-1]):
    new_val = (df["prueba"] - v).astype(float)   # float so NaN can be assigned
    if count > 0:
        new_val.iloc[-count:] = np.nan           # blank out rows below the diagonal
    df_new[date] = new_val
    count += 1
df_new
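The positional mask works because df is ordered newest-first: at each step, the last count rows are exactly the dates older than date, which is where the desired output has NaN.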

Related

In pandas, replace table column with Series while joining indexes

I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series will have different indexes and I need to add these varying indexes to the table as necessary, like doing a join/merge operation.
For example, this code generates a table and 5 series where each series only has a subset of the indexes.
import random

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
    series.append(
        pd.Series(
            np.random.randint(0, 3, 2) * 10,
            index=pd.Index(random.sample(range(3), 2))
        )
    )
series
Output:
[1 10
2 0
dtype: int32,
2 0
0 20
dtype: int32,
2 20
1 0
dtype: int32,
2 0
0 10
dtype: int32,
1 20
2 10
dtype: int32]
But when I try to replace columns of the table with the series, a simple assignment doesn't work
for i in range(5):
    col = cols[i]
    table[col] = series[i]
table
Output:
a b c d e f g
1 10 NaN 0 NaN 20 NaN NaN
2 0 0 20 0 10 NaN NaN
because column assignment aligns on the table's existing index and won't add any more index labels after the first series is assigned
Other things I've tried:
combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels), and I can't just drop the duplicates because other columns may already have values at those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.
Of course, I can always do something like this:
table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')
which gives the correct result:
a b c d e f g
0 NaN 20.0 NaN 10.0 NaN NaN NaN
1 10.0 NaN 0.0 NaN 20.0 NaN NaN
2 0.0 0.0 20.0 0.0 10.0 NaN NaN
But this is quite a roundabout way of doing it, and it still isn't robust to column name collisions, so you'd have to add a handful more lines to fiddle with column names and protect against that. That is verbose and ugly code for what is conceptually a very simple operation, so I believe there must be a better way of doing it.
pd.concat should work along the column axis:
out = pd.concat(series, axis=1)
print(out)
# Output
      0     1     2    3     4
0  10.0   0.0   0.0  NaN  10.0
1   NaN  10.0   NaN  0.0  20.0
2   0.0   NaN   0.0  0.0   NaN
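To label the result with the intended column names and bring back the unused trailing columns, one possible follow-up (assuming, as above, that the series map to the first len(series) entries of cols):
out = pd.concat(series, axis=1)     # outer-joins all the series indexes
out.columns = cols[:len(series)]    # label with the target column names
table = out.reindex(columns=cols)   # re-add the remaining all-NaN columns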
You could try constructing the dataframe using a dict comprehension like this:
series:
[0 10
1 0
dtype: int64,
0 0
1 0
dtype: int64,
2 20
0 0
dtype: int64,
0 20
2 0
dtype: int64,
0 0
1 0
dtype: int64]
code:
table = pd.DataFrame({
    col: series[i]
    for i, col in enumerate(cols)
    if i < len(series)
})
table
output:
a b c d e
0 10.0 0.0 0.0 20.0 0.0
1 0.0 0.0 NaN NaN 0.0
2 NaN NaN 20.0 0.0 NaN
If you really need the NaN columns at the end, you could do:
table = pd.DataFrame({
    col: series[i] if i < len(series) else np.nan
    for i, col in enumerate(cols)
})
Output:
a b c d e f g
0 10.0 0.0 0.0 20.0 0.0 NaN NaN
1 0.0 0.0 NaN NaN 0.0 NaN NaN
2 NaN NaN 20.0 0.0 NaN NaN NaN
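Both answers lean on the same behaviour: the DataFrame constructor (like pd.concat) outer-aligns the indexes of the input Series, which is exactly the join the question asks for. A minimal illustration:
a = pd.Series([1, 2], index=[0, 1])
b = pd.Series([3, 4], index=[1, 2])
print(pd.DataFrame({'a': a, 'b': b}))
#      a    b
# 0  1.0  NaN
# 1  2.0  3.0
# 2  NaN  4.0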

How to split dictionary column in dataframe and make a new columns for each key values

I have a dataframe with a column containing multiple key-value pairs, separated by ",".
id data
0 {'1':A, '2':B, '3':C}
1 {'1':A}
2 {'0':0}
How can I split up the key-value pairs of the 'data' column and make a new column for each key present in it, without removing the original 'data' column?
Desired output:
id  data                    1    2    3    0
0   {'1':A, '2':B, '3':C}   A    B    C    NaN
1   {'1':A}                 A    NaN  NaN  NaN
2   {'0':0}                 NaN  NaN  NaN  0
Thank you in advance :).
You'll need a regular expression to convert the data into a form that can be parsed as a Python literal. Then pd.json_normalize will do the job nicely:
import ast

# quote the bare values so each cell parses as a dict literal
df['data'] = df['data'].str.replace(r'(["\'])\s*:(.+?)\s*(,?\s*["\'}])', '\\1:\'\\2\'\\3', regex=True)
df['data'] = df['data'].apply(ast.literal_eval)
# expand the dicts into columns and keep the original 'data' column
df = pd.concat([df, pd.json_normalize(df['data'])], axis=1)
Output:
>>> df
data 1 2 3 0
0 {'1': 'A', '2': 'B', '3': 'C'} A B C NaN
1 {'1': 'A'} A NaN NaN NaN
2 {'0': '0'} NaN NaN NaN 0
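If the cells already hold real dict objects rather than strings, the regex and literal_eval steps can be skipped entirely; a sketch:
# expand each dict into columns, keeping the original 'data' column
df = pd.concat([df, pd.DataFrame(df['data'].tolist(), index=df.index)], axis=1)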

How to select the rows having same id and have all missing value in another column

I have the following dataframe:
ID col_1
1 NaN
2 NaN
3 4.0
2 NaN
2 NaN
3 NaN
3 3.0
1 NaN
I need the following output:
ID col_1
1 NaN
1 NaN
2 NaN
2 NaN
2 NaN
How can I do this in pandas?
You can create a boolean mask with isna, group this mask by ID, and transform with all; then you can filter the rows with the help of this mask:
mask = df['col_1'].isna().groupby(df['ID']).transform('all')
df[mask].sort_values('ID')
Alternatively you can use groupby + filter to filter out the groups which satisfy the condition where all values in col_1 are NaN but this method should be slower than the above:
df.groupby('ID').filter(lambda g: g['col_1'].isna().all()).sort_values('ID')
ID col_1
0 1 NaN
7 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
Let us try isin after a groupby with all:
s = df['col_1'].isna().groupby(df['ID']).all()
df = df.loc[df.ID.isin(s[s].index.tolist())]
df
Out[73]:
ID col_1
0 1 NaN
1 2 NaN
3 2 NaN
4 2 NaN
7 1 NaN
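Here s is a boolean Series indexed by ID, so s[s].index is exactly the set of IDs whose group is all NaN; the .tolist() is optional, since isin accepts an Index directly.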
I think we can simply take out the null values:
import pandas as pd
import numpy as np

df = pd.read_excel(r"D:\Stack_overflow\test12.xlsx")
df1 = (df[df['col_1'].isnull()]).sort_values(by=['ID'])

Find out null values between two columns in a DataFrame

I have to check whether there are any null values between two columns in a DataFrame. I have fetched the locations of the first non-null value and the last non-null value in the dataframe using:
x.first_valid_index()
x.last_valid_index()
Now I need to find out whether there are any null values between these two locations.
I think this is the same as converting all NaN values to a boolean mask and summing the True values:
x = pd.DataFrame({'a': [1, 2, np.nan, np.nan],
                  'b': [np.nan, 7, np.nan, np.nan],
                  'c': [4, 5, 6, np.nan]})
print (x)
     a    b    c
0  1.0  NaN  4.0
1  2.0  7.0  5.0
2  NaN  NaN  6.0
3  NaN  NaN  NaN
cols = ['a','b']
f = x[cols].first_valid_index()
l = x[cols].last_valid_index()
print (f)
0
print (l)
1
print (x.loc[f:l, cols].isnull().sum().sum())
1
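If you only need a yes/no answer rather than a count, the same mask works with any instead of the double sum (one possible variant):
print (x.loc[f:l, cols].isnull().any().any())
True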

How to work with 'NA' in pandas?

I am merging two data frames in pandas. When the joining fields contain 'NA', pandas automatically excludes those records. How can I keep the records that have the value 'NA'?
For me it works fine:
df1 = pd.DataFrame({'A': [np.nan, 2, 1],
                    'B': [5, 7, 8]})
print (df1)
A B
0 NaN 5
1 2.0 7
2 1.0 8
df2 = pd.DataFrame({'A': [np.nan, 2, 3],
                    'C': [4, 5, 6]})
print (df2)
A C
0 NaN 4
1 2.0 5
2 3.0 6
print (pd.merge(df1, df2, on=['A']))
A B C
0 NaN 5 4
1 2.0 7 5
print (pd.__version__)
0.19.2
EDIT:
It seems there is another problem: your NA values are converted to NaN when the file is read. With pandas.read_excel it is possible to define which values are converted to NaN via the parameters keep_default_na and na_values:
df = pd.read_excel('test.xlsx', keep_default_na=False, na_values=['NaN'])
print (df)
      a   b
0   NaN  NA
1  20.0  40
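Putting the two pieces together, a sketch of a merge that keeps 'NA' keys (the file names here are placeholders):
# keep_default_na=False stops 'NA' and friends being read as NaN
df1 = pd.read_excel('left.xlsx', keep_default_na=False)
df2 = pd.read_excel('right.xlsx', keep_default_na=False)
# 'NA' is now a plain string key, so it matches like any other value
merged = pd.merge(df1, df2, on=['A'])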