Convert long dataframe to single-row dataframe, enumerating the column names - pandas

I have a df that looks like this:
import numpy as np
import pandas as pd

data = \
[{'len_overlap': 2, 'prox': 1.0, 'freq_sum_w': 0.03962264150943396},
 {'len_overlap': 22, 'prox': np.nan, 'freq_sum_w': 0.0311111962264150943396}]
df = pd.DataFrame(data)
   len_overlap  prox  freq_sum_w
0            2     1   0.0396226
1           22   nan   0.0311112
I want to make it a one-row data frame; so far I have this:
pd.DataFrame([np.ravel(df.values)], columns=sum([[f'{x}_{n}' for x in df.columns] for n in range(df.shape[0])], []))
   len_overlap_0  prox_0  freq_sum_w_0  len_overlap_1  prox_1  freq_sum_w_1
0              2       1     0.0396226             22     nan     0.0311112
This is what I want (the ints convert to floats, I don't know why, but that's not a problem), but I'm wondering if there is a nicer, more Pandas way of doing this.
Thanks

Try via unstack(), to_frame() and the transpose (T) attribute:
out=df.unstack().to_frame().T
Finally:
out.columns=out.columns.map(lambda x:'_'.join(map(str,x)))
Output of out:
   len_overlap_0  len_overlap_1  prox_0  prox_1  freq_sum_w_0  freq_sum_w_1
0            2.0           22.0     1.0     NaN      0.039623      0.031111
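For reference, a minimal end-to-end sketch combining the two steps above (assuming the df from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([{'len_overlap': 2, 'prox': 1.0, 'freq_sum_w': 0.0396226},
                   {'len_overlap': 22, 'prox': np.nan, 'freq_sum_w': 0.0311112}])

# one row with MultiIndex columns (column name, original row index)
out = df.unstack().to_frame().T
# flatten the MultiIndex into 'name_rowindex' labels
out.columns = out.columns.map(lambda x: '_'.join(map(str, x)))
print(out)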

One line but more complex:
>>> df.unstack() \
       .to_frame() \
       .set_index(pd.MultiIndex.from_product([df.columns, df.index.astype(str)])
                    .sortlevel(1)[0]
                    .to_flat_index()
                    .map('_'.join)) \
       .transpose()
freq_sum_w_0 len_overlap_0 prox_0 freq_sum_w_1 len_overlap_1 prox_1
0 2.0 22.0 1.0 NaN 0.039623 0.031111
IMHO, I think the "more Pandas way" is to use a MultiIndex:
>>> df.stack().to_frame().transpose()
             0                               1
   len_overlap  prox  freq_sum_w  len_overlap  freq_sum_w
0          2.0   1.0    0.039623         22.0    0.031111
or better (like pd.melt):
>>> df.stack()
0  len_overlap     2.000000
   prox            1.000000
   freq_sum_w      0.039623
1  len_overlap    22.000000
   freq_sum_w      0.031111

Try,
df_out = df.unstack()
df_out = df_out.sort_index(level=1)
df_out.index = [f'{i}_{j}' for i, j in df_out.index]
df_out.to_frame().T
Output:
   freq_sum_w_0  len_overlap_0  prox_0  freq_sum_w_1  len_overlap_1  prox_1
0      0.039623            2.0     1.0      0.031111           22.0     NaN

Related

Pandas / Numpy conditional calculation with NaN values

I'm dealing with incomplete data and would like to assign scoring to different rows.
For example:
Bluetooth and WLAN are non-numeric columns, but I would like to assign a value of 1 if data is available and 0 if there is no data (NaN).
Samsung's score would be 1 + 1 + 4 = 6
Nokia's score would be 0 + 0 + 5 = 5
         Bluetooth  WLAN   Rating  Score
Apple    Class-A    USB-A     NaN
Samsung  Class-B    USB-B       4
Nokia    NaN        NaN         5
I'm using Pandas at the moment but I'm not sure if Pandas alone is capable without Numpy.
Thanks a lot!
import pandas as pd
import numpy as np

data = {'Bluetooth': ['class-A', 'class-B', np.nan], 'WLAN': ['usb-A', 'usb-B', np.nan], 'Rating': [np.nan, 4, 5]}
df = pd.DataFrame(data)
# replace missing values with 0 (no data = 0 points)
df = df.replace(np.nan, 0)
# coerce the text values (class-A, usb-B, ...) to NaN, then count each of them as 1 (data available = 1 point)
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(1)
df['score'] = df.sum(axis=1)
print(df.head())
Output:
Bluetooth WLAN Rating score
0 1.0 1.0 0.0 2.0
1 1.0 1.0 4.0 6.0
2 0.0 0.0 5.0 5.0
Try this:
import pandas as pd
import numpy as np

# count the missing cells per row
df['Nan_count'] = df.isnull().sum(axis=1)
# start from 2, subtract 1 per missing cell, and add the Rating (NaN counted as 0)
df['score'] = -df['Nan_count'] + df['Rating'].replace(np.nan, 0) + 2
With this solution we don't need to change the NaN values in the dataframe itself, and the computational cost is also low.
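For comparison, a sketch that writes the scoring rule directly with notna() (my own variant, not from the answers above):
import numpy as np
import pandas as pd

data = {'Bluetooth': ['class-A', 'class-B', np.nan],
        'WLAN': ['usb-A', 'usb-B', np.nan],
        'Rating': [np.nan, 4, 5]}
df = pd.DataFrame(data)

# 1 point per non-null Bluetooth/WLAN cell, plus the Rating (NaN counted as 0)
df['score'] = df[['Bluetooth', 'WLAN']].notna().sum(axis=1) + df['Rating'].fillna(0)
print(df)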

Pandas: If condition on multiple columns having null values and fillna with 0

I have the dataframe below, and my requirement is: if both columns have np.nan then make no change; if only one of the columns is empty then fill that NaN with 0. I wrote this code, but why is it not working? Please suggest.
import pandas as pd
import numpy as np
data = {'Age': [np.nan, np.nan, 22, np.nan, 50, 99],
        'Salary': [217, np.nan, 262, 352, 570, np.nan]}
df = pd.DataFrame(data)
print(df)
cond1 = (df['Age'].isnull()) & (df['Salary'].isnull())
if cond1 is False:
    df['Age'] = df['Age'].fillna(0)
    df['Salary'] = df['Salary'].fillna(0)
print(df)
You can just do the assignment with update:
c = ['Age','Salary']
# fill NaN with 0 only on rows where Age and Salary are not both NaN, then write the result back in place
df.update(df.loc[~df[c].isna().all(1),c].fillna(0))
df
Out[341]:
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
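A sketch spelling out the same mask step by step (the intermediate names are mine, assuming the df from the question):
cols = ['Age', 'Salary']

# rows where Age and Salary are both NaN are left untouched
both_null = df[cols].isna().all(axis=1)

# fill NaN with 0 on the remaining rows, then write the values back in place
df.update(df.loc[~both_null, cols].fillna(0))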
# masks for "only Age is NaN" and "only Salary is NaN"
c1 = df['Age'].isna()
c2 = df['Salary'].isna()
# assign 0 through a 2-column boolean mask aligned with Age and Salary
df[np.c_[c1 & ~c2, ~c1 & c2]] = 0
df
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
This might be a bit less sophisticated than the other answers, but it worked for me:
I first save the row(s) where both values are NaN at the same time,
then fillna the original df as per normal,
then set np.nan back at the locations saved in the first step.
tmp = df.loc[(df['Age'].isna() & df['Salary'].isna())]
df.fillna(0, inplace=True)
df.loc[tmp.index] = np.nan
Get the rows that are all nulls and use where to exclude them during the fill:
bools = df.isna().all(axis = 1)
df.where(bools, df.fillna(0))
Age Salary
0 0.0 217.0
1 NaN NaN
2 22.0 262.0
3 0.0 352.0
4 50.0 570.0
5 99.0 0.0
Your if statement won't work because you need to check each row for True or False: cond1 is a Series, so the expression cond1 is False simply evaluates to False regardless of its contents, and the Series can hold a mix of True and False values.
An inefficient way would be to traverse the rows:
for row, index in zip(cond1, df.index):
    if not row:
        df.loc[index] = df.loc[index].fillna(0)
Apart from the inefficiency, you are keeping track of index positions yourself; the pandas options abstract that process away while being quite efficient, since the looping is done in C.
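A vectorized sketch of the same idea, using cond1 directly as a mask (my own addition, not from the answers above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [np.nan, np.nan, 22, np.nan, 50, 99],
                   'Salary': [217, np.nan, 262, 352, 570, np.nan]})

# True where both Age and Salary are NaN; those rows are left untouched
cond1 = df['Age'].isnull() & df['Salary'].isnull()

# fill NaN with 0 only on the remaining rows
df.loc[~cond1] = df.loc[~cond1].fillna(0)
print(df)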

Pandas: Round integers before joining dataframes

I have two data frames that both contain coordinates. One of them, df1, has coordinates at a better resolution (with decimals), and I would like to join it to df2, which has a coarser resolution:
import pandas as pd

df1 = pd.DataFrame({'x': [1.1, 2.2, 3.3],
                    'y': [2.3, 3.3, 4.1],
                    'val': [10, 11, 12]})
df2 = pd.DataFrame({'x': [1, 2, 3, 5.5],
                    'y': [2, 3, 4, 5.6]})

df1['x_org'] = df1['x']
df1['y_org'] = df1['y']
df1[['x','y']] = df1[['x','y']].round()
df1 = pd.merge(df1, df2, how='left', on=['x','y'])
df1.drop({'x','y'}, axis=1)
# rename...
The code above does exactly what I want, but it is a bit cumbersome. Is there an easier way to achieve this?
Use:
df1.merge(df2,
          how='left',
          left_on=[df1['x'].round(), df1['y'].round()],
          right_on=['x','y'],
          suffixes=('','_')).drop(['x_','y_'], axis=1)
It is also possible to remove the columns ending with _ dynamically:
df = df1.merge(df2,
               how='left',
               left_on=[df1['x'].round(), df1['y'].round()],
               right_on=['x','y'],
               suffixes=('','_')).filter(regex='.*[^_]$')
print (df)
     x    y  val
0  1.1  2.3   10
1  2.2  3.3   11
2  3.3  4.1   12
df = df1.merge(df2,
               how='left',
               left_on=[df1['x'].round(), df1['y'].round()],
               right_on=['x','y'],
               suffixes=('','_end')).filter(regex='.*(?<!_end)$')
print (df)
     x    y  val
0  1.1  2.3   10
1  2.2  3.3   11
2  3.3  4.1   12
Or:
df = (df1.set_index(['x','y'], drop=False).rename(lambda x: round(x))
         .merge(df2.set_index(['x','y']),
                left_index=True,
                right_index=True,
                how='left').reset_index(drop=True))
print (df)
     x    y  val
0  1.1  2.3   10
1  2.2  3.3   11
2  3.3  4.1   12
IIUC, you could pass the rounded values as joining keys:
pd.merge(df1.rename(columns={'x': 'x_org', 'y': 'y_org'}),
         df2,
         how='left',
         left_on=[df1['x'].round(), df1['y'].round()],
         right_on=['x', 'y'])#.drop({'x','y'}, axis=1) # if x/y are unwanted
output:
   x_org  y_org  val    x    y
0    1.1    2.3   10  1.0  2.0
1    2.2    3.3   11  2.0  3.0
2    3.3    4.1   12  3.0  4.0

Changing Julia dataframe column headers to lowercase?

I am looking for a solution to change a dataframe's column headers to lowercase.
Let's say, I have this dataframe:
df = DataFrame(TIME = ["2021-10-21","2021-10-22","2021-10-23"],
               MQ2 = [-1.1, -2, 1],
               MQ3 = [-1, -1, 3.1],
               MQ8 = [-1, -4.2, 2],
               )
>>> df
    TIME        MQ2      MQ3      MQ8
    String      Float64  Float64  Float64
1   2021-10-21  -1.1     -1.0     -1.0
2   2021-10-22  -2.0     -1.0     -4.2
3   2021-10-23   1.0      3.1      2.0
I want to change all of my columns' headers, such as MQ2 to mq2.
Maybe something like df.columns.str.lower() in Python.
That way, I end up with this dataframe:
    time        mq2      mq3      mq8
    String      Float64  Float64  Float64
1   2021-10-21  -1.1     -1.0     -1.0
2   2021-10-22  -2.0     -1.0     -4.2
3   2021-10-23   1.0      3.1      2.0
I would probably do the following:
julia> using DataFrames
julia> df = DataFrame(TIME = rand(5), MQ2 = rand(5), MQ3 = rand(5), MQ8 = rand(5));
julia> rename!(df, lowercase.(names(df)))
5×4 DataFrame
Row │ time mq2 mq3 mq8
│ Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────
1 │ 0.0796718 0.997022 0.0838867 0.63886
2 │ 0.923035 0.904928 0.993185 0.36081
3 │ 0.392671 0.0577061 0.518647 0.81432
4 │ 0.0377552 0.506528 0.190017 0.488105
5 │ 0.828534 0.731297 0.383561 0.604786
Here I'm using the DataFrames rename function in its mutating version (hence the bang in rename!), with a vector of new column names as the second argument. The new vector is created by getting the current names using names(df), and then broadcasting the lowercase function across each element in that vector.
Note that rename! also accepts pairs of old/new names if you only want to rename specific columns, e.g. rename!(df, "TIME" => "time")

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and at a loss for how to approach this problem: I have a dataframe where the information I need is mostly grouped in blocks of 2, 3 and 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows, filter for each ID, and unstack the rows into a new dataframe. I cannot get far with the unstack or groupby functions. Is there an easy function or combination of functions that can accomplish this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe usually has the information in 4 rows, but not always. The resulting dataframe should have only one row per ID occurrence, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value, GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power for handling the data. In addition, these open() calls have trouble with the non-ASCII characters, which I am unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want (referenced from this answer)?
Sample Data
With an unequal number of rows per ID
ID col1 col2
0 A 1.0 2.0
1 A 3.0 4.0
2 B 5.0 NaN
3 B 7.0 8.0
4 B 9.0 10.0
5 B NaN 12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID col1 col2
A 1 2
A 3 4
B 5 nan
B 7 8
B 9 10
B nan 12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
    col1_1  col2_1  col1_2  col2_2  col1_3  col2_3  col1_4  col2_4
ID
A      1.0     2.0     3.0     4.0     NaN     NaN     NaN     NaN
B      5.0     NaN     7.0     8.0     9.0    10.0     NaN    12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
      col1  col2
ID
A  0   1.0   2.0
   1   3.0   4.0
B  0   5.0   NaN
   1   7.0   8.0
   2   9.0  10.0
   3   NaN  12.0
where the multi-index is perfect for df.unstack().
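A sketch of the same approach applied to the question's semicolon-separated sample (the read_csv options and the generated column names are my assumptions):
import io
import pandas as pd

raw = '''2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
'''

# read without headers; column 0 holds the group ID (the GAT value)
df = pd.read_csv(io.StringIO(raw), sep=';', header=None)

# number the rows within each ID, pivot them into columns, and flatten the column labels
g = df.groupby(0).cumcount()
wide = df.set_index([0, g]).unstack().sort_index(level=1, axis=1)
wide.columns = [f'col{a}_{b + 1}' for a, b in wide.columns]
print(wide)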