Overlap graph of missing values (NaN) with filled values - pandas

I have the pandas DataFrame below, which contains two columns. The first column holds the original values, including missing values (NaN), and the second column is the result of imputing those missing values. How can I plot both columns on the same graph, so that the original values and the filled values appear together like the graph below?
import pandas as pd
import numpy as np

Data = pd.DataFrame([[3.83092724, np.nan],
[ np.nan, 3.94103207],
[ np.nan, 3.86621724],
[3.48386179, np.nan],
[ np.nan, 3.7430167 ],
[3.2382959 , np.nan],
[3.9143139 , np.nan],
[4.46676265, np.nan],
[ np.nan, 3.9340262 ],
[3.650658 , np.nan],
[ np.nan, 3.10590516],
[4.19497691, np.nan],
[4.11873876, np.nan],
[4.15286075, np.nan],
[4.67441617, np.nan],
[4.50631534, np.nan],
[ np.nan, 4.01349688],
[ np.nan, 3.48459778],
[ np.nan, 3.83495488],
[ np.nan, 3.10590516],
[ np.nan, 4.09355884],
[4.8433281 , np.nan],
[ np.nan, 3.33450675],
[4.86672126, np.nan],
[ np.nan, 3.2382959 ],
[ np.nan, 3.48210011],
[ np.nan, 3.00958811],
[ np.nan, 3.05774663]], columns=['original', 'filled'])

You need markers; otherwise the chart makes no sense when an individual original value is surrounded by missing values.
We first plot the original values. Then, for the filled values, we copy the original value into any missing position directly adjacent to an existing filled value, so that the dashed line runs from that original point to the next/preceding filled value. Finally, we plot this amended filled column as a dashed line.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame([[3.83092724, np.nan],
[ np.nan, 3.94103207],
[ np.nan, 3.86621724],
[3.48386179, np.nan],
[ np.nan, 3.7430167 ],
[3.2382959 , np.nan],
[3.9143139 , np.nan],
[4.46676265, np.nan],
[ np.nan, 3.9340262 ],
[3.650658 , np.nan],
[ np.nan, 3.10590516],
[4.19497691, np.nan],
[4.11873876, np.nan],
[4.15286075, np.nan],
[4.67441617, np.nan],
[4.50631534, np.nan],
[ np.nan, 4.01349688],
[ np.nan, 3.48459778],
[ np.nan, 3.83495488],
[ np.nan, 3.10590516],
[ np.nan, 4.09355884],
[4.8433281 , np.nan],
[ np.nan, 3.33450675],
[4.86672126, np.nan],
[ np.nan, 3.2382959 ],
[ np.nan, 3.48210011],
[ np.nan, 3.00958811],
[ np.nan, 3.05774663]], columns=['original', 'filled'])
_, ax = plt.subplots()
df.original.plot(marker='o', ax=ax)

# flag NaN positions in 'filled' that sit directly next to a non-NaN filled value
m = (df.filled.isna() & df.filled.shift(1).notna()) | (df.filled.isna() & df.filled.shift(-1).notna())

# copy the original value into those positions so the dashed line connects to it,
# then plot the amended column in the same color as the original line
df.filled.fillna(df.loc[m, 'original']).plot(ls='--', ax=ax, color=ax.get_lines()[0].get_color())
plt.show()
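As a quick optional check, you can print which original points the mask copies into the dashed line:
print(df.loc[m, 'original'])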
The above is a clean solution for the general case. If the original values are drawn with a solid, opaque line and the filled values with a line width no greater than that of the original values, you can simply draw the completely filled 'filled' column first and then draw the original values on top of it:
_, ax = plt.subplots()
df.filled.fillna(df.original).plot(ax=ax, color='blue', ls='--')  # dashed line through every point
df.original.plot(marker='o', ax=ax, color='blue')                 # solid line drawn on top
plt.show()

Related

numpy - Subtract array between actual value and previous value (only not null)

I have the following situation: given an array, I want to take the absolute difference between each non-null value and the previous non-null value.
[np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7]
Expected output:
[nan, nan, nan, nan, nan, 5, nan, 2, 3, nan, nan, 1]
What is a good approach to get this result using numpy without for loops?
I have only solved it with a for loop:
x = [np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7]
idx = np.where(~np.isnan(x))[0]
output = np.full(len(x), np.nan)
for i, j in enumerate(idx):
    if i > 0:
        output[j] = abs(x[idx[i]] - x[idx[i - 1]])
You're most of the way there already; the only catch is that x must be a NumPy array for the fancy indexing to work:
x = np.asarray(x)
output[idx[1:]] = np.abs(np.diff(x[idx]))
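Putting it together, a minimal self-contained version of the vectorized approach (same sample data as above):
import numpy as np

x = np.array([np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7])
idx = np.where(~np.isnan(x))[0]            # positions of the non-null values
output = np.full(len(x), np.nan)           # start from an all-NaN result
output[idx[1:]] = np.abs(np.diff(x[idx]))  # |difference| between consecutive non-nulls
print(output)  # [nan nan nan nan nan  5. nan  2.  3. nan nan  1.]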

Descending sorting in numpy by several columns [duplicate]

This question already has answers here: Numpy sort ndarray on multiple columns (closed as a duplicate).
I have a NumPy array and need to sort it by two columns (first by column 0, then breaking ties by column 1), both in descending order. When I sort sequentially by column 1 and then column 0, the rows that are equal in the second sort end up in ascending order with respect to the first sort.
My array:
arr = np.array([
    [150, 8],
    [105, 20],
    [90, 100],
    [101, 12],
    [110, 80],
    [105, 100],
])
When I sort twice (by column 1 and column 0):
arr = arr[arr[:,1].argsort(kind='stable')[::-1]]
arr = arr[arr[:,0].argsort(kind='stable')[::-1]]
I have this result (where rows 2 and 3 are swapped):
array([[150, 8],
[110, 80],
[105, 20],
[105, 100],
[101, 12],
[ 90, 100]])
As far as I understand, this happens because the stable sort preserves the original order of equal values, but when we flip the indices to make the order descending, the relative order of those equal values is flipped too.
The results I'd like to have:
array([[150, 8],
[110, 80],
[105, 100],
[105, 20],
[101, 12],
[ 90, 100]])
Use numpy.lexsort to sort on multiple columns at the same time.
arr = np.array([
    [150, 8],
    [105, 20],
    [90, 100],
    [101, 12],
    [110, 80],
    [105, 100],
])
# lexsort treats the last key as the primary key; reverse the result for descending order
order = np.lexsort([arr[:, 1], arr[:, 0]])[::-1]
arr[order]
yields:
array([[150, 8],
[110, 80],
[105, 100],
[105, 20],
[101, 12],
[ 90, 100]])
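np.lexsort always sorts ascending and uses the last key in the sequence as the primary one, which is why reversing the result flips both columns to descending. If you ever need mixed directions instead (say, column 0 descending but ties broken by column 1 ascending), one option for numeric data is to negate the descending key rather than reversing:
# column 0 descending, ties broken by column 1 ascending (numeric keys only)
order = np.lexsort([arr[:, 1], -arr[:, 0]])
arr[order]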

Merge similar columns and add extracted values to dict

Given this input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': [6, np.nan, 16, np.nan], 'C2': [17, np.nan, 1, np.nan],
                   'D1': [8, np.nan, np.nan, 6], 'D2': [15, np.nan, np.nan, 12]}, index=[1, 1, 2, 2])
I'd like to combine columns beginning with the same letter (the Cs and Ds), as well as rows with the same index (1 and 2), and extract the non-null values into the simplest representation without duplicates, which I think is something like:
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
Using stack or groupby gets me part of the way there, but I feel like there is a more efficient way to do it.
You can rename the columns with a lambda that keeps only the first letter, stack the frame, aggregate the values into lists per (index, letter) group, and then build the nested dictionary with a dict comprehension:
s = df.rename(columns=lambda x: x[0]).stack().groupby(level=[0,1]).agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
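If it helps to see the intermediate step, s is a Series whose MultiIndex pairs the original index with the first letter of the column, and whose values are the aggregated lists, roughly:
print(s)
# 1  C     [6.0, 17.0]
#    D     [8.0, 15.0]
# 2  C     [16.0, 1.0]
#    D     [6.0, 12.0]
# dtype: object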

Pandas Convert Data Type of List inside a Dictionary

I have the following data structure:
import pandas as pd
names = {'A': [20, 5, 20],
'B': [18, 7, 13],
'C': [19, 6, 18]}
I was able to convert the data type for A, B, C from object to string as follows:
df = df.astype({'Team-A': 'string', 'Team-B': 'string', 'Team-C': 'string'}, errors='raise')
How can I convert the values in the lists to float64?
You can convert the dictionary to a DataFrame and then cast the whole frame to float.
import pandas as pd

names = {'A': [20, 5],
         'B': [18, 7],
         'C': [19, 6]}
df = pd.DataFrame(names)
df = df.astype('float64')  # astype returns a new DataFrame, so reassign
If you don't want to use a DataFrame, you can do it like this:
names = {k: [float(i) for i in v] for k, v in names.items()}
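Either way, a quick check confirms the conversion (here with the three-element lists from the original question):
import pandas as pd

names = {'A': [20, 5, 20], 'B': [18, 7, 13], 'C': [19, 6, 18]}
df = pd.DataFrame(names).astype('float64')
print(df.dtypes)  # A, B and C are all float64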

How to return a list into a dataframe based on matching index of other column

I have two data frames: one made up of a single column of NumPy arrays, and another with two columns. I am trying to match the elements of each array in the first dataframe (df) against the index of the second (df2) to pull out the corresponding o1 and o2 values. I would appreciate some input. Please note that the string 'A1' appears twice in column o1 of df2; as you can see in my desired output, the duplicates are removed from o1.
import numpy as np
import pandas as pd
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)  # ragged rows, so dtype=object is needed
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({'A': array_1,
                          'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1', 'A1', 'C1', 'F1']],
                          'o2': [[15, 18, 19], [19, 20, 8], [17, 18, 19, 8]]})
# note: in row 0, positions 0 and 2 of df2 both hold 'A1', so the output keeps only one 'A1'
I believe you can explode df, use the exploded values to look up rows in df2, and finally join the aggregated result back to df:
s = df['A'].explode()
df_output= df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
               A                o1               o2
0      [0, 2, 3]          [C1, A1]     [18, 19, 15]
1      [3, 4, 6]      [F1, D1, C1]      [8, 19, 20]
2   [1, 2, 3, 6]  [F1, B1, C1, A1]  [8, 17, 18, 19]
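Two details worth noting: df2.loc[s] works because the exploded values of A are labels in df2's default RangeIndex, and list(set(x)) removes the duplicated 'A1' but does not preserve order (which is why the lists above come out shuffled). If you want de-duplication that keeps first-seen order, one option is dict.fromkeys:
df_output = df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(dict.fromkeys(x))))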