Merge similar columns and add extracted values to dict - pandas

Given this input:
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': [6, np.nan, 16, np.nan], 'C2': [17, np.nan, 1, np.nan],
                   'D1': [8, np.nan, np.nan, 6], 'D2': [15, np.nan, np.nan, 12]}, index=[1, 1, 2, 2])
I'd like to combine columns beginning with the same letter (the Cs and Ds), as well as rows with the same index (1 and 2), and extract the non-null values into the simplest representation without duplicates, which I think is something like:
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
Using stack or groupby gets me part of the way there, but I feel like there is a more efficient way to do it.

You can rename the columns with a lambda that keeps only their first letter, stack the DataFrame, aggregate the values into lists per (index, letter) pair, and then build the nested dictionary in a dict comprehension:
s = df.rename(columns=lambda x: x[0]).stack().groupby(level=[0,1]).agg(list)
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
{1: {'C': [6.0, 17.0], 'D': [8.0, 15.0]}, 2: {'C': [16.0, 1.0], 'D': [6.0, 12.0]}}
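As a small variant (a sketch reusing the s computed above), you can avoid the per-level xs lookup by grouping the stacked Series on its outer index level:
# group on the outer index level, drop it, and turn each sub-Series into a dict
d = {k: v.droplevel(0).to_dict() for k, v in s.groupby(level=0)}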

Related

numpy - Subtract array between actual value and previous value (only not null)

I have the following situation:
Suppose I have an array, and I want the absolute difference between each non-null value and the previous non-null value.
[np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7]
Expected output:
[nan, nan, nan, nan, nan, 5, nan, 2, 3, nan, nan, 1]
What is a good approach to get this result using numpy without for loops?
I only managed to solve it with a for loop:
import numpy as np

x = [np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7]
idx = np.where(~np.isnan(x))[0]
output = np.full(len(x), np.nan)
for i, j in enumerate(idx):
    if i > 0:
        output[j] = abs(x[idx[i]] - x[idx[i - 1]])
You're most of the way there already; just make sure x is a NumPy array first (fancy indexing like x[idx] does not work on a plain list):
x = np.asarray(x)
output[idx[1:]] = np.abs(np.diff(x[idx]))
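Putting the pieces together as a self-contained sketch with the sample data from the question:
import numpy as np

x = np.array([np.nan, np.nan, 10, np.nan, np.nan, 5, np.nan, 3, 6, np.nan, np.nan, 7])
idx = np.where(~np.isnan(x))[0]            # positions of the non-null values
output = np.full(len(x), np.nan)           # start from an all-NaN result
output[idx[1:]] = np.abs(np.diff(x[idx]))  # |difference| between consecutive non-null values
print(output)  # [nan nan nan nan nan  5. nan  2.  3. nan nan  1.]
The first non-null position stays NaN because it has no previous non-null value, which matches the expected output.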

Merging many multiple dataframes within a list into one dataframe

I have several dataframes, all with the same columns, in one list, and I would like to combine them into a single dataframe.
For instance, I have these three dataframes:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]]),
                   columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
                   columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
within one list:
dfList = [df1,df2,df3]
I know I can use the following, which gives me exactly what I'm looking for:
df_merge = pd.concat([dfList[0],dfList[1],dfList[2]])
However, in my actual data I have hundreds of dataframes in the list, so I'm trying to find a way to loop through and concat:
dfList_all = pd.DataFrame()
for i in range(len(dfList)):
    dfList_all = pd.concat(dfList[i])
I tried the above, but it gives me the following error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any ideas would be wonderful. Thanks
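For what it's worth, the error comes from passing a single DataFrame (dfList[i]) to pd.concat, which expects an iterable of pandas objects. A sketch of the usual fix (reusing dfList from above) is to pass the whole list to a single concat call:
# one concat over the whole list (pd.concat takes the iterable itself)
df_merge = pd.concat(dfList)
# optionally: pd.concat(dfList, ignore_index=True) to renumber the rows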

Pandas Convert Data Type of List inside a Dictionary

I have the following data structure:
import pandas as pd
names = {'A': [20, 5, 20],
'B': [18, 7, 13],
'C': [19, 6, 18]}
I was able to convert the Data Type for A, B, C from an object to a string as follows:
df = df.astype({'Team-A': 'string', 'Team-B': 'string', 'Team-C': 'string'}, errors='raise')
How can I convert the data types in the list to float64?
You can convert the dictionary to a dataframe and then change the dataframe to float.
import pandas as pd
names = {'A': [20, 5],
         'B': [18, 7],
         'C': [19, 6]}
df = pd.DataFrame(names)
df = df.astype('float64')  # astype returns a new dataframe, so assign it back
If you don't want to use a dataframe, you can do it like this:
names = {k: [float(i) for i in v] for k, v in names.items()}
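And if you take the dataframe route but ultimately want the dictionary of float lists back, one possible follow-up (a sketch, assuming the names dict above) is:
names_float = pd.DataFrame(names).astype('float64').to_dict(orient='list')
# {'A': [20.0, 5.0], 'B': [18.0, 7.0], 'C': [19.0, 6.0]}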

How to return a list into a dataframe based on matching index of other column

I have two data frames: one made up of a column of numpy array lists, and the other with two columns. I am trying to use the elements in the first dataframe (df) to pull the two columns, o1 and o2, from df2, matching on index. I was wondering if I could get some input. Please note that the string 'A1' in column 'o1' appears twice in df2, and, as you can see in my desired output dataframe, the duplicates are removed in column o1.
import numpy as np
import pandas as pd
array_1 = np.array([[0, 2, 3], [3, 4, 6], [1, 2, 3, 6]], dtype=object)  # ragged rows, so dtype=object
#dataframe 1
df = pd.DataFrame({ 'A': array_1})
#dataframe 2
df2 = pd.DataFrame({ 'o1': ['A1', 'B1', 'A1', 'C1', 'D1', 'E1', 'F1'], 'o2': [15, 17, 18, 19, 20, 7, 8]})
#desired output
df_output = pd.DataFrame({'A': array_1, 'o1': [['A1', 'C1'], ['C1', 'D1', 'F1'], ['B1', 'A1', 'C1', 'F1']],
                          'o2': [[15, 18, 19], [19, 20, 8], [17, 18, 19, 8]]})
# note: index 0 of df holds [0, 2, 3], and labels 0 and 2 of df2 both map to 'A1';
# the output keeps only one 'A1' by dropping the duplicate.
I believe you can explode df, use the exploded values to extract the information from df2, and finally join the result back to df:
s = df['A'].explode()
df_output = df.join(df2.loc[s].groupby(s.index).agg(lambda x: list(set(x))))
Output:
A o1 o2
0 [0, 2, 3] [C1, A1] [18, 19, 15]
1 [3, 4, 6] [F1, D1, C1] [8, 19, 20]
2 [1, 2, 3, 6] [F1, B1, C1, A1] [8, 17, 18, 19]
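One caveat: set() does not preserve the row order of df2, which is why o1 comes out as [C1, A1] rather than the desired [A1, C1]. If order matters, a possible tweak (a sketch) is to deduplicate with dict.fromkeys, which keeps first-seen order:
# dict.fromkeys drops duplicates while preserving first-seen order
dedup = lambda vals: list(dict.fromkeys(vals))
df_output = df.join(df2.loc[s].groupby(s.index).agg(dedup))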

overlap graph of missing values (NaN values) with filled values

I have the below pandas DataFrame containing two columns. The first column holds the original values, including missing values (NaN), and the second column is the result of a missing-value imputation used to fill the NaNs of the first column. How can I plot these two columns in the same graph so that it shows the original values together with the filled values, like the graph below:
import numpy as np
import pandas as pd

Data = pd.DataFrame([[3.83092724, np.nan],
[ np.nan, 3.94103207],
[ np.nan, 3.86621724],
[3.48386179, np.nan],
[ np.nan, 3.7430167 ],
[3.2382959 , np.nan],
[3.9143139 , np.nan],
[4.46676265, np.nan],
[ np.nan, 3.9340262 ],
[3.650658 , np.nan],
[ np.nan, 3.10590516],
[4.19497691, np.nan],
[4.11873876, np.nan],
[4.15286075, np.nan],
[4.67441617, np.nan],
[4.50631534, np.nan],
[ np.nan, 4.01349688],
[ np.nan, 3.48459778],
[ np.nan, 3.83495488],
[ np.nan, 3.10590516],
[ np.nan, 4.09355884],
[4.8433281 , np.nan],
[ np.nan, 3.33450675],
[4.86672126, np.nan],
[ np.nan, 3.2382959 ],
[ np.nan, 3.48210011],
[ np.nan, 3.00958811],
[ np.nan, 3.05774663]], columns=['original', 'filled'])
You need markers; otherwise the chart makes no sense wherever an individual original value is surrounded by missing values.
We first plot the original values. Then, for the filled column, we replace any missing value directly adjacent to an existing filled value with the original value, so that the dashed line runs from that original value to the next/preceding filled value. Finally, we plot this amended filled column as a dashed line.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = Data.copy()  # same data as defined in the question, with columns 'original' and 'filled'
_, ax = plt.subplots()
df.original.plot(marker='o', ax=ax)
m = (df.filled.isna() & df.filled.shift(1).notna()) | (df.filled.isna() & df.filled.shift(-1).notna())
df.filled.fillna(df.loc[m, 'original']).plot(ls='--', ax=ax, color=ax.get_lines()[0].get_color())
The above is a clean solution for the general case. If the original values are drawn with a solid, opaque line and the filled values with a line width no greater than that of the original line, you can simply draw the completely filled 'filled' column first and then, on top of that line, the original values:
df.filled.fillna(df.original).plot(ax=ax, color='blue', ls='--')
df.original.plot(marker='o', ax=ax, color='blue')
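In either variant you may want to finish with a legend and show the figure (a small usage note; pandas sets the line labels from the column names):
ax.legend()
plt.show()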