Adding lists with different lengths to a new dataframe - pandas

I have two lists with different lengths, like a=[1,2,3] and b=[2,3]
I would like to generate a pd.DataFrame from them, padding the shorter list with NaN at the beginning, like this:
   a    b
1  1  nan
2  2    2
3  3    3
I would appreciate a clean way of doing this.

Use itertools.zip_longest together with reversed:
from itertools import zip_longest
import numpy as np
import pandas as pd

a = [1, 2, 3]
b = [2, 3]
L = [a, b]

iterables = (reversed(it) for it in L)
out = list(reversed(list(zip_longest(*iterables, fillvalue=np.nan))))
df = pd.DataFrame(out, columns=['a', 'b'])
print (df)
   a    b
0  1  NaN
1  2  2.0
2  3  3.0
Alternatively, if b is the one with fewer values, pad it as a plain list:
df = pd.DataFrame(list(zip(a, ([np.nan]*(len(a)-len(b)))+b)), columns=['a','b'])
print (df)
   a    b
0  1  NaN
1  2  2.0
2  3  3.0

Put the NaN at the front explicitly instead of relying on set or dict ordering (a set does not guarantee any order, and dict.fromkeys keeps insertion order, which would leave the NaN at the end):
b.insert(0, np.nan)  # prepend NaN so b lines up with the tail of a
df = pd.DataFrame(list(zip(a, b)), columns=['a', 'b'])
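If there can be more than two lists, one possible generalization (just a sketch; the names data and n are not from the question) is to build each column as a Series whose index is right-aligned to the longest list, so shorter lists pick up NaN at the top when the frame is assembled:
import pandas as pd

data = {'a': [1, 2, 3], 'b': [2, 3]}
n = max(len(v) for v in data.values())
# Right-align every list: its index starts at n - len(values),
# so the missing positions at the top become NaN in the DataFrame.
df = pd.DataFrame({name: pd.Series(vals, index=range(n - len(vals), n))
                   for name, vals in data.items()})
print(df)
#    a    b
# 0  1  NaN
# 1  2  2.0
# 2  3  3.0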

Append different size lists in dataframe in different columns

list1 = [1, 2]
list2 = [2, 3, 4]
main = pd.DataFrame(columns=['a', 'b'])
main = main.append(pd.DataFrame(list1, columns=['a']), ignore_index=True)
main = main.append(pd.DataFrame(list2, columns=['b']), ignore_index=True)
Output:
a b
1 NA
2 NA
NA 2
NA 3
NA 4
I want the output of both lists to appear in the same rows, each in its own column, rather than offset like this.
Your approach of appending the list values one after the other works if you use concat with axis=1 instead:
main = pd.DataFrame()
main = pd.concat([main, pd.DataFrame(list1, columns=['a'])], axis=1)
main = pd.concat([main, pd.DataFrame(list2, columns=['b'])], axis=1)
print (main)
     a  b
0  1.0  2
1  2.0  3
2  NaN  4
If possible, create the DataFrame directly from Series built from the lists:
main = pd.DataFrame({'a':pd.Series(list1), 'b':pd.Series(list2)})
print (main)
     a  b
0  1.0  2
1  2.0  3
2  NaN  4
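Note that DataFrame.append has been removed in pandas 2.0, so in current versions the Series-based construction above (or concat along axis=1) is the way to go. A minimal sketch of the concat variant, reusing list1 and list2 from the question:
import pandas as pd

list1 = [1, 2]
list2 = [2, 3, 4]
# Concatenating Series along axis=1 aligns them on the index,
# so the shorter list is padded with NaN automatically.
main = pd.concat([pd.Series(list1, name='a'), pd.Series(list2, name='b')], axis=1)
print(main)
#      a  b
# 0  1.0  2
# 1  2.0  3
# 2  NaN  4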

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouping column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as seen in the str.join documentation, joining over NaN results in an empty string:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried converting col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then the empty values literally hold the string 'nan', and so are considered non-empty.
Not only does str.join now include the 'nan' strings, but other calculations in the script that rely on those NaNs are ruined as well.
To address that, I thought about converting just the non-empty values as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']), df['col_str'],
                         df['col_str'].astype(str))
But now str.join returns empty values again :-(
So I tried fillna('') and even dropna(). Neither gave me the desired results.
You see the vicious cycle here, right?
astype(str) => 'nan' strings end up in the join and other calculations are ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a CSV. Code to test:
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Converting col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove the missing values, either with DataFrame.dropna or with Series.notna in boolean indexing:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                   'col_str': ['a', 'b', 'c', 'd', np.nan, np.nan, np.nan, np.nan, 'a', 's']})

df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                               .groupby('col_a')['col_str'].unique()
                               .str.join(', ')).rename('labels')))
print (df1)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
                               .groupby('col_a')['col_str']
                               .unique().str.join(', ')).rename('labels')))
print (df2)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s
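Another option, sketched on the same sample frame as above (df3 and labels are just names chosen here), is to drop the NaNs inside the aggregation itself, which avoids pre-filtering the frame:
# Join the unique non-NaN strings per group, then map the result back onto each row.
labels = (df.groupby('col_a')['col_str']
            .agg(lambda s: ', '.join(s.dropna().unique())))
df3 = df.join(df['col_a'].map(labels).rename('labels'))
print(df3)
# Same result as df1/df2 above.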

How to extract different groups of 4 rows from dataframe and unstack the columns

I am new to Python and lost as to how to approach this problem: I have a dataframe where the information I need is mostly grouped in blocks of 2, 3, or 4 rows. Each group has a different ID in one of the columns. I need to create another dataframe where each group of rows becomes a single row, with the information unstacked into more columns. Later I can drop unwanted/redundant columns.
I think I need to iterate through the dataframe rows, filter for each ID, and unstack the rows into a new dataframe. I haven't gotten far with the unstack or groupby functions. Is there an easy function or combination that can accomplish this task?
Here is a sample of the dataframe:
2_SH1_G8_D_total;Positions tolerance d [z] ;"";0.000; ;0.060;"";0.032;0.032;53%
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-58.000;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";-1324.500;"";"";"";---;"";""
12_SH1_G8_D_total;Positions tolerance d [z] ;"";391.000;"";"";"";390.990;"";""
13_SH1_G8_D_total;Flatness;"";0.000; ;0.020;"";0.004;0.004;20%
14_SH1_G8_D_total;Parallelism tolerance ;"";0.000; ;0.030;"";0.025;0.025;84%
15_SH1_B1_B;Positions tolerance d [x y] ;"";0.000; ;0.200;"";0.022;0.022;11%
15_SH1_B1_B;Positions tolerance d [x y] ;"";265.000;"";"";"";264.993;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";1502.800;"";"";"";1502.792;"";""
15_SH1_B1_B;Positions tolerance d [x y] ;"";-391.000;"";"";"";---;"";""
The original dataframe usually has the information in 4 rows, but not always. The resulting dataframe should have only one row per ID occurrence, with all the info in the columns.
So far, with help, I managed to run this code:
import csv

tmp = []
with open(path, newline='') as datafile:
    data = csv.reader(datafile, delimiter=';')
    for row in data:
        tmp.append(row)

# Create data table joining data with the same GAT value, GAT is the ID I need
Data = []
Data.append(tmp[0])
GAT = tmp[0][0]
j = 0
counter = 0
for i in range(0, len(tmp)):
    if tmp[i][0] == GAT:
        counter = counter + 1
        if counter == 2:
            temp = (tmp[i][5], tmp[i][7], tmp[i][8], tmp[i][9])
        else:
            temp = (tmp[i][3], tmp[i][7])
        Data[j].extend(temp)
    else:
        Data.append(tmp[i])
        GAT = tmp[i][0]
        j = j + 1

# for i in range(0, len(Data)):
#     print(Data[i])

with open('output.csv', 'w', newline='') as outputfile:
    writedata = csv.writer(outputfile, delimiter=';')
    for i in range(0, len(Data)):
        writedata.writerow(Data[i])
But this is not really using pandas, which would probably give me more power for handling the data. In addition, these open() calls have trouble with non-ASCII characters that I have been unable to solve.
Is there a more elegant way using pandas?
So basically you're doing a "partial transpose". Is this what you want?
Sample Data
With an unequal number of rows per ID:
  ID  col1  col2
0  A   1.0   2.0
1  A   3.0   4.0
2  B   5.0   NaN
3  B   7.0   8.0
4  B   9.0  10.0
5  B   NaN  12.0
Code
import pandas as pd
import io
# read df
df = pd.read_csv(io.StringIO("""
ID col1 col2
A 1 2
A 3 4
B 5 nan
B 7 8
B 9 10
B nan 12
"""), sep=r"\s{2,}", engine="python")
# solution
g = df.groupby('ID').cumcount()
df = df.set_index(['ID', g]).unstack().sort_index(level=1, axis=1)
df.columns = [f'{a}_{b+1}' for a, b in df.columns]
Result
print(df)
    col1_1  col2_1  col1_2  col2_2  col1_3  col2_3  col1_4  col2_4
ID
A      1.0     2.0     3.0     4.0     NaN     NaN     NaN     NaN
B      5.0     NaN     7.0     8.0     9.0    10.0     NaN    12.0
Explanation
After the .set_index(["ID", g]) step, the dataset becomes
      col1  col2
ID
A 0    1.0   2.0
  1    3.0   4.0
B 0    5.0   NaN
  1    7.0   8.0
  2    9.0  10.0
  3    NaN  12.0
where the multi-index is perfect for df.unstack().
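Applied to the semicolon-separated sample in the question, the same idea would look roughly like this ('sample.csv' is just a placeholder for wherever those rows are stored, and wide is a name chosen here):
import pandas as pd

# No header row in the sample; column 0 holds the group ID.
df = pd.read_csv('sample.csv', sep=';', header=None)

# Number the rows within each ID, move that counter into the index,
# then unstack it so every group becomes a single wide row.
g = df.groupby(0).cumcount()
wide = df.set_index([0, g]).unstack().sort_index(axis=1, level=1)
wide.columns = [f'col{a}_{b + 1}' for a, b in wide.columns]
print(wide)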

Pandas columns headers split

I have a dataframe with column headers made up of 3 tags separated by '__'.
E.g.
A__2__66 B__4__45
0
1
2
3
4
5
I know I can split the header and just use the first tag with this code: df.columns = df.columns.str.split('__').str[0]
giving:
A B
0
1
2
3
4
5
Is there a way I can use a combination of the tags, for example 1 and 3, giving:
A__66 B__45
0
1
2
3
4
5
I've tried the below but it's not working:
df.columns=df.columns.str.split('__').str[0]+'__'+df.columns.str.split('__').str[2]
With a specific regex substitution:
In [124]: df.columns.str.replace(r'__[^_]+__', '__')
Out[124]: Index(['A__66', 'B__45'], dtype='object')
Use Index.map with an f-string to select the first and third values of the lists:
df.columns = df.columns.str.split('__').map(lambda x: f'{x[0]}__{x[2]}')
print (df)
   A__66  B__45
0    NaN    NaN
1    NaN    NaN
2    NaN    NaN
3    NaN    NaN
4    NaN    NaN
5    NaN    NaN
You can also try split and join:
df.columns=['__'.join((i[0],i[-1])) for i in df.columns.str.split('__')]
#Columns: [A__66, B__45]
I find your own solution perfectly fine, and probably the most readable. It just needs a little adjustment:
df.columns = df.columns.str.split('__').str[0] + '__' + df.columns.str.split('__').str[-1]
Index(['A__66', 'B__45'], dtype='object')
Or for the sake of efficiency, we do not want to call str.split twice:
lst_split = df.columns.str.split('__')
df.columns = lst_split.str[0] + '__' + lst_split.str[-1]
Index(['A__66', 'B__45'], dtype='object')
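If other tag combinations are needed, the list-comprehension approach generalizes; a quick sketch (the keep positions here are just an example, and every header is assumed to have at least three '__'-separated tags):
# Keep an arbitrary selection of tag positions, here the first and the third.
keep = [0, 2]
df.columns = ['__'.join(parts[i] for i in keep) for parts in df.columns.str.split('__')]
# Index(['A__66', 'B__45'], dtype='object')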

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the order of columns isn't alphabetical but makes logical sense, e.g.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results by appending a dictionary of the values after each run. Is there a way to prevent the df.append() function from ordering the columns alphabetically?
Seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# assign directly
merged.columns = cols
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
example:
In [26]:
from numpy.random import randn
df1 = pd.DataFrame(columns=['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1'], data=randn(5, 3))
df2 = pd.DataFrame(columns=['Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], data=randn(5, 3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
merged.columns = cols
merged
Out[26]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.028935 NaN -0.687143 NaN 1.528579
1 0.943432 NaN -2.055357 NaN -0.720132
2 0.035234 NaN 0.020756 NaN 1.556319
3 1.447863 NaN 0.847496 NaN -1.458852
4 0.132337 NaN -0.255578 NaN -0.222660
0 NaN 0.131085 NaN 0.850022 NaN
1 NaN -1.942110 NaN 0.672965 NaN
2 NaN 0.944052 NaN 1.274509 NaN
3 NaN -1.796448 NaN 0.130338 NaN
4 NaN 0.961545 NaN -0.741825 NaN
Diff_Goals_2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
0 0.727619
1 0.022209
2 -0.350757
3 1.116637
4 1.947526
The same alphabetical sorting of the columns happens with concat as well, so it looks like you have to reorder after appending.
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
Actually, I found "advanced indexing" to work quite well:
df2 = df.loc[:, 'order of columns']  # originally df.ix, which has since been removed; .loc does the same label-based selection
As I see it, the order is lost on append, but the original data should already have the correct order. To maintain it, assuming a DataFrame 'alldata' and a dataframe of data to be appended 'newdata', appending while keeping the column order of 'alldata' would be:
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second')
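In current pandas (DataFrame.append was removed in 2.0) the same pattern works with concat followed by re-imposing the column order; a minimal sketch using made-up Month/Minute/Second frames:
import pandas as pd

alldata = pd.DataFrame({'Month': [1], 'Minute': [2], 'Second': [3]})
newdata = pd.DataFrame({'Second': [6], 'Month': [4], 'Minute': [5]})

# Equivalent of alldata.append(newdata)[list(alldata)]:
# stack the rows, then restore the column order of alldata.
merged = pd.concat([alldata, newdata], ignore_index=True)[list(alldata)]
print(merged)
#    Month  Minute  Second
# 0      1       2       3
# 1      4       5       6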