Create multiple-columns pandas dataframe from list - pandas

I can't figure out how to create a multi-column pandas DataFrame from a list. Some lines begin with the character ">". I want those lines to be column headers. The number of lines after each header is not the same.
My list:
>header
a
b
>header2
c
d
e
f
>header3
g
h
i
Dataframe I want to create:
>header1  >header2  >header3
a         c         g
b         d         h
          e         i
          f

Iterate through the lines and treat every line that starts with '>' as a header. The real challenge is building a DataFrame from a dictionary of lists with unequal lengths.
import pandas as pd

# The given list
lines = [">header", "a", "b", ">header2", "c", "d", "e", "f", ">header3", "g", "h", "i"]
# Iterate through the lines and create a sublist for each header
data = {}
column = ''
for line in lines:
    if line.startswith('>'):
        column = line
        data[column] = []
        continue
    data[column].append(line)
# Create the DataFrame: one row per header, then transpose so headers become columns
df = pd.DataFrame.from_dict(data, orient='index').T
Output:
  >header >header2 >header3
0       a        c        g
1       b        d        h
2    None        e        i
3    None        f     None
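The same padding can also be obtained by wrapping each list in a pd.Series, since pandas aligns Series of different lengths on their index and fills the gaps with NaN. A minimal sketch, assuming a dictionary shaped like the one built by the loop (written out by hand here so the block runs on its own):

```python
import pandas as pd

# Same structure as the dict built by the loop (written out by hand here)
data = {
    ">header": ["a", "b"],
    ">header2": ["c", "d", "e", "f"],
    ">header3": ["g", "h", "i"],
}

# Each list becomes a Series; shorter ones are padded with NaN during alignment
df = pd.DataFrame({name: pd.Series(values) for name, values in data.items()})
print(df)
```

With this variant the missing entries show up as NaN rather than None, which may or may not matter depending on what is done with the frame afterwards.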

I'm assuming you have a text with this list. You can use str.splitlines() to split it and then construct the dataframe with help of itertools.zip_longest:
from itertools import zip_longest
import pandas as pd
text = '''\
>header
a
b
>header2
c
d
e
f
>header3
g
h
i'''
current, data = None, {}
for line in text.splitlines():
    line = line.strip()
    if line.startswith('>'):
        current = line
    else:
        data.setdefault(current, []).append(line)
df = pd.DataFrame(zip_longest(*data.values(), fillvalue=''), columns=list(data))
print(df)
Prints:
  >header >header2 >header3
0       a        c        g
1       b        d        h
2                e        i
3                f

Related

get values of previous rows as list

How to get the values of the previous three rows in a new column?
data = { 'foo':['a','b','c','d','e','f','g']}
df = pd.DataFrame(data)
df = some_function(df)
print(df)
foo bar
1 a None
2 b None
3 c None
4 d ['a','b','c']
5 e ['b','c','d']
6 f ['c','d','e']
7 g ['d','e','f']
I could use the following method, which adds helper columns and then merges them into a new one, but I wonder if there is a better way to do this:
def some_function_v1(df):
    # Temporary columns holding the values 1, 2 and 3 rows back
    df['foo1'] = df.foo.shift(1)
    df['foo2'] = df.foo.shift(2)
    df['foo3'] = df.foo.shift(3)
    # Oldest value first, to match the desired output
    df['bar'] = df.apply(lambda x: [x['foo3'], x['foo2'], x['foo1']], axis=1)
    df = df.drop(columns=['foo1', 'foo2', 'foo3'])
    return df
Try sliding_window_view on foo to create a new DataFrame with the grouped lists:
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
Offset the index:
bar_df.index += window
bar_df:
bar
3 [a, b, c]
4 [b, c, d]
5 [c, d, e]
6 [d, e, f]
7 [e, f, g]
Then join back to the original frame:
out = df.join(bar_df)
out:
foo bar
0 a NaN
1 b NaN
2 c NaN
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
Complete Working Example:
import numpy as np
import pandas as pd
data = {'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
bar_df.index += window
out = df.join(bar_df)
print(out)
We can try a list comprehension to generate the sliding window view:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + [v[i: i + n] for i in range(len(v) - n)]
An alternative approach with the sliding_window_view method:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + list(np.lib.stride_tricks.sliding_window_view(v[:-1], n))
foo bar
0 a None
1 b None
2 c None
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
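Both snippets above store NumPy array slices in bar, which print like lists but are not Python lists. If plain lists are needed, a small helper can build them directly. This is only a sketch; the name previous_rows is made up for illustration and is not part of pandas:

```python
import pandas as pd


def previous_rows(series, n):
    """Return, for each position, the n preceding values as a list (None if fewer exist)."""
    values = series.tolist()
    return [values[i - n:i] if i >= n else None for i in range(len(values))]


df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
df['bar'] = previous_rows(df['foo'], 3)
print(df)
```

Changing the second argument changes the window size, so the same helper covers "previous two rows", "previous five rows", and so on.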
You can use shift with zip to shift and merge lists element-wise instead of creating new columns-
df['bar'] = pd.Series(zip(df.foo.shift(3), df.foo.shift(2), df.foo.shift(1))).apply(lambda x:None if np.nan in x else list(x))
Here's a function to make the shift dynamic-
n_shift = lambda s, n: pd.Series(zip(*[s.shift(x) for x in range(n,0,-1)])).apply(lambda x:None if np.nan in x else list(x))
df['bar'] = n_shift(df.foo, 3)
Output-
foo bar
1 a None
2 b None
3 c None
4 d ['a','b','c']
5 e ['b','c','d']
6 f ['c','d','e']
7 g ['d','e','f']

How can I delete the index from the data?

I was trying to use re.sub() on my data, but it keeps raising a TypeError
(TypeError: expected string or bytes-like object).
This (example) is the data that I'm using:
I was trying to do:
import re
example_sub = re.sub('\n', ' ', example)
example_sub
I tried to resolve it by removing the index using reset_index(), but it didn't work.
What should I do?
Thank you!
You can use pandas.Series.str.replace:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> df.a.str.replace("\n", " ")
0 a a
1 b b
2 c c c c
Name: a, dtype: object
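For more complex substitutions, you can use a compiled regex pattern. Recent pandas versions treat the pattern as a literal string by default, so pass regex=True:
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> pattern = re.compile(r"\n")
>>> df.a.str.replace(pattern, " ", regex=True)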
0 a a
1 b b
2 c c c c
Name: a, dtype: object
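The TypeError in the question comes from handing a whole pandas object to re.sub, which only accepts a single string (or bytes). A minimal sketch of the failure and of applying re.sub element by element instead; the example Series here is made up, since the original data is not shown:

```python
import re

import pandas as pd

# Hypothetical stand-in for the question's data
example = pd.Series(["a\na", "b\nb"])

try:
    re.sub("\n", " ", example)  # a Series is not a string
except TypeError as exc:
    error = exc

# Apply the substitution to each string individually
cleaned = example.apply(lambda text: re.sub("\n", " ", text))
print(cleaned.tolist())
```

This also explains why reset_index() did not help: the index is not the problem, the argument type is.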

How to split dict in dataframe to many columns

I'm working with a DataFrame. How can I split a column that holds a list of dicts into many columns?
I'm new to data processing and have already tried many approaches without success.
import pandas as pd
l = [{'a':1,'b':2},{'a':3,'b':4}]
data = [{'key1':'x','key2':'y','value':l}]
df = pd.DataFrame(data)
data1 = {'key1':['x','x'],'key2':['y','y'],'a':[1,3],'b':[2,4]}
df1 = pd.DataFrame(data1)
df1 is what I need.
comprehension
d1 = df.drop('value', axis=1)
co = d1.columns
d2 = df.value
pd.DataFrame([
    {**dict(zip(co, tup)), **d}
    for tup, D in zip(zip(*map(d1.get, d1)), d2)
    for d in D
])
a b key1 key2
0 1 2 x y
1 3 4 x y
Explode
See post on explode
This is a tad different but close
idx = df.index.repeat(df.value.str.len())
val = np.concatenate(df.value).tolist()
d0 = pd.DataFrame(val)
df.drop('value', axis=1).loc[idx].reset_index(drop=True).join(d0)
a b key1 key2
0 1 2 x y
1 3 4 x y
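When the nested records are still available as plain Python data, pd.json_normalize can flatten them in one call. A hedged sketch using the question's data list; record_path names the nested list and meta names the fields to repeat for each record:

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
data = [{'key1': 'x', 'key2': 'y', 'value': l}]

# One output row per dict in 'value', with key1/key2 repeated alongside
flat = pd.json_normalize(data, record_path='value', meta=['key1', 'key2'])
print(flat)
```

If only the DataFrame is at hand, df.to_dict('records') should give an equivalent list of dicts to pass in.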

Adding dataframes that share the same columns, and expand one more dimension

I want to sum two dataframes that share the same columns
df1=pd.DataFrame(np.random.randn(3,3),index=list("ABC"),columns=list("XYZ"))
df2=pd.DataFrame(np.random.randn(3,3),index=list("abc"),columns=list("XYZ"))
My desired result would be:
X Y Z
A a
A b
A c
....
C c
How can I achieve this?
I have tried the following but didn't get what I wanted.
df1.add(df2,axis="columns")
You can first build a MultiIndex from both indexes with MultiIndex.from_product and then reindex both DataFrames to it, each on its own level:
np.random.seed(45)
df1=pd.DataFrame(np.random.randn(3,3),index=list("ABC"),columns=list("XYZ"))
df2=pd.DataFrame(np.random.randn(3,3),index=list("abc"),columns=list("XYZ"))
mux = pd.MultiIndex.from_product([df1.index, df2.index])
df1 = df1.reindex(mux, level=0)
df2 = df2.reindex(mux, level=1)
print (df1)
X Y Z
A a 0.026375 0.260322 -0.395146
b 0.026375 0.260322 -0.395146
c 0.026375 0.260322 -0.395146
B a -0.204301 -1.271633 -2.596879
b -0.204301 -1.271633 -2.596879
c -0.204301 -1.271633 -2.596879
C a 0.289681 -0.873305 0.394073
b 0.289681 -0.873305 0.394073
c 0.289681 -0.873305 0.394073
print (df2)
X Y Z
A a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
B a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
C a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
df3 = df1.add(df2,axis="columns")
print (df3)
X Y Z
A a 0.961480 0.244637 -0.135550
b -1.446939 1.062248 -2.145898
c -0.468677 -0.748279 -0.369901
B a 0.730805 -1.287317 -2.337283
b -1.677615 -0.469706 -4.347631
c -0.699353 -2.280233 -2.571634
C a 1.224786 -0.888989 0.653669
b -1.183633 -0.071378 -1.356680
c -0.205371 -1.881905 0.419317
IIUC, here's one way: merge on a temporary key k to get every index combination, then group the suffixed columns by their base name and sum.
In [192]: (df1.reset_index().assign(k='k').merge(df2.assign(k='k').reset_index(), on=['k'])
              .set_index(['index_x', 'index_y'])
              .groupby(lambda x:x.split('_')[0], axis=1)
              .sum()
              .drop(columns='k'))
Out[192]:
X Y Z
index_x index_y
A a -2.281005 -1.606760 -0.853813
b -2.683788 -2.487876 2.471459
c -0.333471 -2.155734 1.688883
B a -0.790146 0.074629 -2.368680
b -1.192928 -0.806487 0.956592
c 1.157388 -0.474345 0.174017
C a -2.114412 0.100412 -2.352661
b -2.517195 -0.780704 0.972611
c -0.166878 -0.448562 0.190036
Details
In [193]: df1
Out[193]:
X Y Z
A -1.087129 -1.264522 1.147618
B 0.403731 0.416867 -0.367249
C -0.920536 0.442650 -0.351229
In [194]: df2
Out[194]:
X Y Z
a -1.193876 -0.342237 -2.001431
b -1.596659 -1.223354 1.323841
c 0.753658 -0.891211 0.541265
In [196]: (df1.reset_index().assign(k='k').merge(df2.assign(k='k').reset_index(), on=['k'])
              .set_index(['index_x', 'index_y']))
Out[196]:
X_x Y_x Z_x k X_y Y_y Z_y
index_x index_y
A a -1.087129 -1.264522 1.147618 k -1.193876 -0.342237 -2.001431
b -1.087129 -1.264522 1.147618 k -1.596659 -1.223354 1.323841
c -1.087129 -1.264522 1.147618 k 0.753658 -0.891211 0.541265
B a 0.403731 0.416867 -0.367249 k -1.193876 -0.342237 -2.001431
b 0.403731 0.416867 -0.367249 k -1.596659 -1.223354 1.323841
c 0.403731 0.416867 -0.367249 k 0.753658 -0.891211 0.541265
C a -0.920536 0.442650 -0.351229 k -1.193876 -0.342237 -2.001431
b -0.920536 0.442650 -0.351229 k -1.596659 -1.223354 1.323841
c -0.920536 0.442650 -0.351229 k 0.753658 -0.891211 0.541265
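On pandas 1.2 and later, merge accepts how='cross', which removes the need for the temporary key column. A sketch under that assumption; the level names upper and lower and the integer test data are invented for illustration:

```python
import numpy as np
import pandas as pd

# Small integer frames so the sums are easy to check by eye
df1 = pd.DataFrame(np.arange(9).reshape(3, 3), index=list("ABC"), columns=list("XYZ"))
df2 = pd.DataFrame(10 * np.arange(9).reshape(3, 3), index=list("abc"), columns=list("XYZ"))

# Move each index into a named column so it survives the merge
left = df1.rename_axis("upper").reset_index()
right = df2.rename_axis("lower").reset_index()

# Cross join: every row of df1 paired with every row of df2
pairs = left.merge(right, how="cross", suffixes=("_1", "_2")).set_index(["upper", "lower"])

# Add the matching columns from both sides
total = pd.DataFrame({col: pairs[f"{col}_1"] + pairs[f"{col}_2"] for col in df1.columns})
print(total)
```

Summing the suffixed columns explicitly avoids grouping along the column axis altogether.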

How to build a column-by-column dataframe in pandas

I have a dataframe looking like this example
 A  |  B  |  C
____|_____|_____
 s  |  s  | nan
nan |  x  |  x
I would like to create a table of intersections between columns like this
  |   A   |  B   |   C
__|_______|______|_______
A | True  | True | False
__|_______|______|_______
B | True  | True | True
__|_______|______|_______
C | False | True | True
__|_______|______|_______
Is there an elegant, loop-free way to do it?
Thank you!
Setup
df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
Option 1
You can use numpy broadcasting to evaluate each column by each other column. Then determine if any of the comparisons are True
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).any(0),
    df.columns, df.columns
)
A B C
A True True False
B True True True
C False True True
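To make the broadcasting step in Option 1 easier to follow, the sketch below spells out the intermediate shapes. The variable names left, right, matches and table are introduced here for illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
v = df.to_numpy()

left = v[:, :, None]     # shape (rows, cols, 1)
right = v[:, None]       # shape (rows, 1, cols)
matches = left == right  # broadcasts to (rows, cols, cols)

# Collapse the row axis: True where two columns agree in at least one row
table = pd.DataFrame(matches.any(axis=0), index=df.columns, columns=df.columns)
print(table)
```

Because NaN never compares equal to NaN, missing values do not count as a match, which is why A and C share nothing here.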
By replacing any with sum you can get a count of how many intersections.
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).sum(0),
    df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
Or use np.count_nonzero instead of sum
v = df.values
pd.DataFrame(
    np.count_nonzero(v[:, :, None] == v[:, None], 0),
    df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
Option 2
Fun & Creative way
d = pd.get_dummies(df.stack()).unstack(fill_value=0)
d = d.T.dot(d)
d.groupby(level=1).sum().groupby(level=1, axis=1).sum()
A B C
A 1 1 0
B 1 2 1
C 0 1 1