Create multiple-columns pandas dataframe from list


I can't figure out how to create a pandas dataframe (multiple columns) from a list. Some lines contain the character ">" at the beginning. I want them to be column headers. The number of lines after each header is not the same.
My list:
>header
a
b
>header2
c
d
e
f
>header3
g
h
i
Dataframe I want to create:
>header1 >header2 >header3
a c g
b d h
e i
f

Simply iterate through the lines and treat those starting with '>' as headers. The challenge, though, is creating a DataFrame from a dictionary of lists of unequal length.
import pandas as pd

# The given list
lines = [">header", "a", "b", ">header2", "c", "d", "e", "f", ">header3", "g", "h", "i"]
# Iterate through the lines and create a sublist for each header
data = {}
column = ''
for line in lines:
    if line.startswith('>'):
        column = line
        data[column] = []
        continue
    data[column].append(line)
# Create the DataFrame (orient='index' pads the shorter lists, .T turns headers into columns)
df = pd.DataFrame.from_dict(data, orient='index').T
output:
>header >header2 >header3
0 a c g
1 b d h
2 None e i
3 None f None
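As a variation on the step above, the unequal lists can also be wrapped in pd.Series objects so that pandas aligns them by position and pads the gaps with NaN. This is only a sketch of that alternative, not the answer's own method; the variable names are illustrative.

```python
import pandas as pd

lines = [">header", "a", "b", ">header2", "c", "d", "e", "f", ">header3", "g", "h", "i"]

# Group the values under the most recent header line
data = {}
column = None
for line in lines:
    if line.startswith(">"):
        column = line
        data[column] = []
    elif column is not None:
        data[column].append(line)

# Each list becomes a Series; shorter Series are padded with NaN on alignment
df = pd.DataFrame({name: pd.Series(values) for name, values in data.items()})
print(df)
```

Compared with from_dict(..., orient='index').T, the missing cells here are NaN rather than None.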

I'm assuming the list comes from a block of text. You can use str.splitlines() to split it and then construct the dataframe with the help of itertools.zip_longest:
from itertools import zip_longest
import pandas as pd
text = '''\
>header
a
b
>header2
c
d
e
f
>header3
g
h
i'''
current, data = None, {}
for line in text.splitlines():
    line = line.strip()
    if line.startswith('>'):
        current = line
    else:
        data.setdefault(current, []).append(line)
df = pd.DataFrame(zip_longest(*data.values(), fillvalue=''), columns=list(data))
print(df)
Prints:
>header >header2 >header3
0 a c g
1 b d h
2 e i
3 f
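To see why zip_longest helps here, it may be useful to look at it in isolation: it turns column lists into row tuples and pads the shorter columns with the fill value. A minimal sketch with made-up column names:

```python
from itertools import zip_longest

# Two "columns" of unequal length (illustrative data)
columns = {"x": ["a", "b"], "y": ["c", "d", "e", "f"]}

# Transpose columns into rows, padding the short column with ''
rows = list(zip_longest(*columns.values(), fillvalue=""))
print(rows)
```

A plain zip would stop at the shortest column and silently drop "e" and "f".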

Related

get values of previous rows as list

How to get the values of the previous three rows in a new column?
data = {'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
df = some_function(df)
print(df)
foo bar
1 a None
2 b None
3 c None
4 d ['a','b','c']
5 e ['b','c','d']
6 f ['c','d','e']
7 g ['d','e','f']
I could use the following method, adding helper columns and then merging them into a new one, but I wonder if there is a better way to do this:
def some_function_v1(df):
    df['foo1'] = df.foo.shift(1)
    df['foo2'] = df.foo.shift(2)
    df['foo3'] = df.foo.shift(3)
    # Oldest value first, so each list reads in row order
    df['bar'] = df.apply(lambda x: [x['foo3'], x['foo2'], x['foo1']], axis=1)
    df = df.drop(columns=['foo1', 'foo2', 'foo3'])
    return df

Try sliding_window_view on foo to create a new DataFrame with the grouped lists:
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
Offset the index:
bar_df.index += window
bar_df:
bar
3 [a, b, c]
4 [b, c, d]
5 [c, d, e]
6 [d, e, f]
7 [e, f, g]
Then join back to the original frame:
out = df.join(bar_df)
out:
foo bar
0 a NaN
1 b NaN
2 c NaN
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]
Complete Working Example:
import numpy as np
import pandas as pd
data = {'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}
df = pd.DataFrame(data)
window = 3
bar_df = pd.DataFrame({
    'bar': np.lib.stride_tricks.sliding_window_view(df['foo'], window).tolist()
})
bar_df.index += window
out = df.join(bar_df)
print(out)
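The same idea can be wrapped in a small helper that keeps whatever index the frame already has (the question's frame is shown with an index starting at 1). This is a sketch only; the function name previous_rows is hypothetical and not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd


def previous_rows(series, window):
    """Return, for each row, the list of the `window` preceding values (None where unavailable)."""
    values = series.to_numpy()
    out = [None] * len(values)
    if len(values) > window:
        views = np.lib.stride_tricks.sliding_window_view(values, window)
        # Row i receives the window that ends just before it
        for i in range(window, len(values)):
            out[i] = views[i - window].tolist()
    return pd.Series(out, index=series.index, dtype=object)


df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}, index=range(1, 8))
df['bar'] = previous_rows(df['foo'], 3)
print(df)
```

Because the result is built positionally and then given the original index, no index offset or join is needed.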

We can try a list comprehension to generate the sliding window view:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + [v[i: i + n] for i in range(len(v) - n)]
An alternative approach uses the sliding_window_view method:
n, v = 3, df['foo'].to_numpy()
df['bar'] = [None] * n + list(np.lib.stride_tricks.sliding_window_view(v[:-1], n))
foo bar
0 a None
1 b None
2 c None
3 d [a, b, c]
4 e [b, c, d]
5 f [c, d, e]
6 g [d, e, f]

You can use shift with zip to shift and merge lists element-wise instead of creating new columns:
df['bar'] = pd.Series(zip(df.foo.shift(3), df.foo.shift(2), df.foo.shift(1))).apply(lambda x: None if np.nan in x else list(x))
Here's a function to make the shift dynamic:
n_shift = lambda s, n: pd.Series(zip(*[s.shift(x) for x in range(n, 0, -1)])).apply(lambda x: None if np.nan in x else list(x))
df['bar'] = n_shift(df.foo, 3)
Output-
foo bar
1 a None
2 b None
3 c None
4 d ['a','b','c']
5 e ['b','c','d']
6 f ['c','d','e']
7 g ['d','e','f']

How can I delete the index from the data?

I was trying to use the re.sub() on my data, but it keeps showing the TypeError.
(TypeError: expected string or bytes-like object).
This (example) is the data that I'm using:
I was trying to do:
import re
example_sub = re.sub('\n', ' ', example)
example_sub
I tried to resolve it by removing the index using reset_index(), but it didn't work.
What should I do?
Thank you!

You can use pandas.Series.str.replace:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> df.a.str.replace("\n", " ")
0 a a
1 b b
2 c c c c
Name: a, dtype: object
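For more complex substitutions, you can use a regex pattern. Pass regex=True explicitly: since pandas 2.0 the pattern is treated as a literal string by default, and a compiled pattern is rejected in that mode.
>>> import re
>>> import pandas as pd
>>> df = pd.DataFrame({"a": ["a\na", "b\nb", "c\nc\nc\nc\n"]})
>>> pattern = re.compile(r"\n")
>>> df.a.str.replace(pattern, " ", regex=True)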
0 a a
1 b b
2 c c c c
Name: a, dtype: object
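The original TypeError most likely comes from handing a whole Series to re.sub, which only accepts a single string. The question's data is not shown, so the sketch below assumes `example` is a Series of strings; it reproduces the error and then shows the vectorized and the element-wise fix side by side.

```python
import re
import pandas as pd

# Assumption: the question's `example` is a pandas Series of strings
example = pd.Series(["a\na", "b\nb", "c\nc\nc\nc\n"])

# re.sub expects one string, so passing the Series raises TypeError
try:
    re.sub("\n", " ", example)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Vectorized replacement over every element
replaced_vectorized = example.str.replace(r"\n", " ", regex=True)

# Equivalent element-wise call to re.sub
replaced_elementwise = example.apply(lambda text: re.sub(r"\n", " ", text))

print(error_message)
print(replaced_vectorized)
```

The index has nothing to do with the error, which is why reset_index() did not help.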

How to split dict in dataframe to many columns

I'm using a dataframe. How can I split a list of dicts stored in one column into several columns?
This is for a junior data processor. I've tried many ways in the past.
import pandas as pd
l = [{'a':1,'b':2},{'a':3,'b':4}]
data = [{'key1':'x','key2':'y','value':l}]
df = pd.DataFrame(data)
data1 = {'key1':['x','x'],'key2':['y','y'],'a':[1,3],'b':[2,4]}
df1 = pd.DataFrame(data1)
df1 is what I need.

Comprehension
d1 = df.drop('value', axis=1)
co = d1.columns
d2 = df.value
# One output row per dict in 'value', merged with the other columns of its source row
pd.DataFrame([
    {**dict(zip(co, tup)), **d}
    for tup, D in zip(zip(*map(d1.get, d1)), d2)
    for d in D
])
  key1 key2  a  b
0    x    y  1  2
1    x    y  3  4
Explode
See post on explode
This is a tad different but close
import numpy as np
# Repeat each row index once per dict in its 'value' list
idx = df.index.repeat(df.value.str.len())
val = np.concatenate(df.value).tolist()
d0 = pd.DataFrame(val)
df.drop('value', axis=1).loc[idx].reset_index(drop=True).join(d0)
  key1 key2  a  b
0    x    y  1  2
1    x    y  3  4
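On pandas 0.25 or later, the built-in DataFrame.explode method can do the row repetition directly. The following is a sketch of that route rather than either answer's code; the intermediate names exploded and expanded are illustrative.

```python
import pandas as pd

l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
data = [{'key1': 'x', 'key2': 'y', 'value': l}]
df = pd.DataFrame(data)

# One row per dict in the 'value' list; the other columns are repeated
exploded = df.explode('value').reset_index(drop=True)

# Turn the column of dicts into its own frame, one column per key
expanded = pd.DataFrame(exploded['value'].tolist())

out = exploded.drop(columns='value').join(expanded)
print(out)
```

Resetting the index before the join matters, because explode repeats the original index label for every generated row.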

Adding dataframes that share the same columns, and expand one more dimension

I want to sum two dataframes that share the same columns
df1=pd.DataFrame(np.random.randn(3,3),index=list("ABC"),columns=list("XYZ"))
df2=pd.DataFrame(np.random.randn(3,3),index=list("abc"),columns=list("XYZ"))
My desired result would be:
X Y Z
A a
A b
A c
....
C c
How can I achieve this?
I have tried the following but didn't get what I wanted.
df1.add(df2,axis="columns")

You can first build a MultiIndex from the product of both indexes with MultiIndex.from_product, then reindex each DataFrame against it on the matching level so that both share the same index:
import numpy as np
import pandas as pd
np.random.seed(45)
df1=pd.DataFrame(np.random.randn(3,3),index=list("ABC"),columns=list("XYZ"))
df2=pd.DataFrame(np.random.randn(3,3),index=list("abc"),columns=list("XYZ"))
mux = pd.MultiIndex.from_product([df1.index, df2.index])
df1 = df1.reindex(mux, level=0)
df2 = df2.reindex(mux, level=1)
print (df1)
X Y Z
A a 0.026375 0.260322 -0.395146
b 0.026375 0.260322 -0.395146
c 0.026375 0.260322 -0.395146
B a -0.204301 -1.271633 -2.596879
b -0.204301 -1.271633 -2.596879
c -0.204301 -1.271633 -2.596879
C a 0.289681 -0.873305 0.394073
b 0.289681 -0.873305 0.394073
c 0.289681 -0.873305 0.394073
print (df2)
X Y Z
A a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
B a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
C a 0.935106 -0.015685 0.259596
b -1.473314 0.801927 -1.750752
c -0.495052 -1.008601 0.025244
df3 = df1.add(df2,axis="columns")
print (df3)
X Y Z
A a 0.961480 0.244637 -0.135550
b -1.446939 1.062248 -2.145898
c -0.468677 -0.748279 -0.369901
B a 0.730805 -1.287317 -2.337283
b -1.677615 -0.469706 -4.347631
c -0.699353 -2.280233 -2.571634
C a 1.224786 -0.888989 0.653669
b -1.183633 -0.071378 -1.356680
c -0.205371 -1.881905 0.419317

IIUC, here's one way: merge on a temporary key k to get every index combination, then group the suffixed columns by their base name and sum.
In [192]: (df1.reset_index().assign(k='k').merge(df2.assign(k='k').reset_index(), on=['k'])
              .set_index(['index_x', 'index_y'])
              .groupby(lambda x:x.split('_')[0], axis=1)
              .sum()
              .drop(columns='k'))
Out[192]:
X Y Z
index_x index_y
A a -2.281005 -1.606760 -0.853813
b -2.683788 -2.487876 2.471459
c -0.333471 -2.155734 1.688883
B a -0.790146 0.074629 -2.368680
b -1.192928 -0.806487 0.956592
c 1.157388 -0.474345 0.174017
C a -2.114412 0.100412 -2.352661
b -2.517195 -0.780704 0.972611
c -0.166878 -0.448562 0.190036
Details
In [193]: df1
Out[193]:
X Y Z
A -1.087129 -1.264522 1.147618
B 0.403731 0.416867 -0.367249
C -0.920536 0.442650 -0.351229
In [194]: df2
Out[194]:
X Y Z
a -1.193876 -0.342237 -2.001431
b -1.596659 -1.223354 1.323841
c 0.753658 -0.891211 0.541265
In [196]: (df1.reset_index().assign(k='k').merge(df2.assign(k='k').reset_index(), on=['k'])
.set_index(['index_x', 'index_y']))
Out[196]:
X_x Y_x Z_x k X_y Y_y Z_y
index_x index_y
A a -1.087129 -1.264522 1.147618 k -1.193876 -0.342237 -2.001431
b -1.087129 -1.264522 1.147618 k -1.596659 -1.223354 1.323841
c -1.087129 -1.264522 1.147618 k 0.753658 -0.891211 0.541265
B a 0.403731 0.416867 -0.367249 k -1.193876 -0.342237 -2.001431
b 0.403731 0.416867 -0.367249 k -1.596659 -1.223354 1.323841
c 0.403731 0.416867 -0.367249 k 0.753658 -0.891211 0.541265
C a -0.920536 0.442650 -0.351229 k -1.193876 -0.342237 -2.001431
b -0.920536 0.442650 -0.351229 k -1.596659 -1.223354 1.323841
c -0.920536 0.442650 -0.351229 k 0.753658 -0.891211 0.541265
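Since both frames share the same columns, the pairwise sums can also be computed with NumPy broadcasting and then labelled with the product MultiIndex. This is a different technique from the two answers above, sketched under the assumption that the columns of df1 and df2 are identical and in the same order; the random data is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((3, 3)), index=list("ABC"), columns=list("XYZ"))
df2 = pd.DataFrame(rng.standard_normal((3, 3)), index=list("abc"), columns=list("XYZ"))

# (3, 1, 3) + (1, 3, 3) broadcasts to (3, 3, 3): every df1 row plus every df2 row
summed = df1.to_numpy()[:, None, :] + df2.to_numpy()[None, :, :]

# Flatten the first two axes; the row order matches MultiIndex.from_product
mux = pd.MultiIndex.from_product([df1.index, df2.index])
out = pd.DataFrame(summed.reshape(-1, df1.shape[1]), index=mux, columns=df1.columns)
print(out)
```

This avoids building the two repeated intermediate frames, at the cost of relying on column order instead of label alignment.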

How to build column by column dataframe pandas

I have a dataframe looking like this example
A | B | C
__|___|___
s s nan
nan x x
I would like to create a table of intersections between columns like this
| A | B | C
__|______|____|______
A | True |True| False
__|______|____|______
B | True |True|True
__|______|____|______
C | False|True|True
__|______|____|______
Is there an elegant loop-free way to do it?
Thank you!

Setup
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
Option 1
You can use numpy broadcasting to compare each column against every other column, then check whether any of the row-wise comparisons is True.
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).any(0),
    df.columns, df.columns
)
A B C
A True True False
B True True True
C False True True
By replacing any with sum you can get a count of how many intersections.
v = df.values
pd.DataFrame(
    (v[:, :, None] == v[:, None]).sum(0),
    df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
Or use np.count_nonzero instead of sum
v = df.values
pd.DataFrame(
    np.count_nonzero(v[:, :, None] == v[:, None], 0),
    df.columns, df.columns
)
A B C
A 1 1 0
B 1 2 1
C 0 1 1
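The broadcasting in Option 1 is easier to follow when the intermediate shapes are spelled out. The sketch below repeats the same comparison with named intermediates (left, right and matches are illustrative names) and prints the shape at each step.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=['s', np.nan], B=['s', 'x'], C=[np.nan, 'x']))
v = df.to_numpy()              # shape (rows, cols) = (2, 3)

left = v[:, :, None]           # shape (2, 3, 1): each column along axis 1
right = v[:, None, :]          # shape (2, 1, 3): each column along axis 2
matches = left == right        # shape (2, 3, 3): per row, column i vs column j
print(left.shape, right.shape, matches.shape)

# Collapse the row axis: True if the two columns agree in at least one row
result = pd.DataFrame(matches.any(axis=0), index=df.columns, columns=df.columns)
print(result)
```

NaN never compares equal to NaN, which is why A and C do not intersect even though both contain a missing value.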
Option 2
Fun & Creative way
d = pd.get_dummies(df.stack()).unstack(fill_value=0)
d = d.T.dot(d)
d.groupby(level=1).sum().groupby(level=1, axis=1).sum()
A B C
A 1 1 0
B 1 2 1
C 0 1 1
