Break text files into multiple datasets at each \newline with Pandas


I have a text file, textfile.qdp:
Line to skip 1
Line to skip 2
Line to skip 3
1.25 0.649999976 2.24733017E-2 2.07460159E-3 3.01663446 1.89463757E-2 1.48296626E-2 2.98285842
2.0999999 0.199999988 7.33829737E-2 6.63989689E-3 3.48941302 3.8440533E-2 6.34965161E-3 3.44462299
2.5 0.200000048 0.118000358 8.37391801E-3 2.64556909 3.93543094E-2 6.16234308E-3 2.60005236
2.9000001 0.199999928 0.145619139 9.26280301E-3 2.56852388 4.85827066E-2 6.0398886E-3 2.51390147
3.29999995 0.200000048 0.167878062 9.94068757E-3 2.46484375 5.69529012E-2 6.81256084E-3 2.40107822
3.70000005 0.200000048 0.175842062 1.01562217E-2 2.28405786 6.24930188E-2 8.10874719E-3 2.21345592
4.10000038 0.200000048 0.181325018 1.03028165E-2 2.02467489 6.38177395E-2 1.2183371E-2 1.94867384
4.5 0.199999809 0.157546207 9.59824398E-3 1.76375055 6.11177757E-2 6.072836E-2 1.64190447
4.94999981 0.25 0.156071633 8.54758453E-3 1.51925421 5.52904457E-2 0.149736568 1.3142271
5.5 0.300000191 0.125403479 6.9860979E-3 1.52551162 4.61589135E-2 0.511757791 0.967594922
6.10000038 0.299999952 9.54503566E-2 6.10219687E-3 3.56054449 3.59460302E-2 2.85172343 0.672874987
6.86499977 0.464999914 5.7642214E-2 3.80936684E-3 4.10104704 2.42055673E-2 3.67026114 0.406580269
8.28999996 0.960000038 2.10143197E-2 1.60136714E-3 0.142320022 8.9181494E-3 6.96837786E-4 0.132705033
9.48999977 0.239999771 5.72929019E-3 1.6677354E-3 3.82030606E-2 2.56266794E-3 4.94769251E-4 3.51456255E-2
4.13999987 1.99999809E-2 2.47749758 4.67826687E-2 30.4350224 0.973279834 0.754008532 28.7077332
4.17999983 1.99999809E-2 2.44065595 4.64052781E-2 30.5456734 0.99132967 0.677066088 28.8772774
4.21999979 1.99999809E-2 2.4736743 4.67251018E-2 30.8877811 1.01084304 0.807663918 29.0692749
4.26000023 2.00002193E-2 2.48481822 4.68727946E-2 30.9508438 1.02374947 0.834929705 29.092165
4.30000019 1.99999809E-2 2.54010344 4.73690033E-2 31.119503 1.03903878 0.93061626 29.1498489
4.34000015 1.99999809E-2 2.49571872 4.69326451E-2 31.1599998 1.05370748 0.892735004 29.2135563
4.38000011 1.99999809E-2 2.58409572 4.77907397E-2 31.367794 1.06788957 1.05168498 29.2482204
4.42000008 1.99999809E-2 2.6437602 4.83172201E-2 31.5764256 1.08456028 1.1402396 29.3516254
4.46000004 1.99999809E-2 2.65394902 4.84031737E-2 31.5579567 1.09554553 1.1519351 29.3104763
4.5 1.99999809E-2 2.62269425 4.81106751E-2 31.644083 1.11161876 1.12954116 29.4029236
Each column is a different parameter value, and the text file includes several datasets that, in the end, I want to plot with, e.g., different colors. A new line in the text file marks a different set.
import pandas as pd
import matplotlib as plt
names = ['e', 'de', 'y', 'y_err', 'total', 'model1', 'model2', 'model3']
for j in range(8):
    names.append('model%i' %j)
df = pd.read_table('textfile.qdp', skiprows=3, names=names, delimiter=' ', skip_blank_lines=True)
fig, ax = plt.subplots(figsize=(10, 6))
ax.errorbar(df.e, df.y, xerr=df.de, yerr=df.y_err, fmt='o', label='data') # Here I want to plot different dfs
How do I do that with Pandas? I think it is related to this question where:
dfs = {
    k: pd.read_csv(pd.io.common.StringIO('\n'.join(dat)), delim_whitespace=True)
    for k, *dat in map(str.splitlines, open('my.csv').read().split('\n\n'))
}
but I am not sure how this translates with read_table (also *dat returns Invalid syntax).

You could do it this way.
First, the basics (note that you need to import matplotlib.pyplot, not matplotlib, to access the subplots function used at the end):
import pandas as pd
import matplotlib.pyplot as plt
import io
names = ['e', 'de', 'y', 'y_err', 'total']
for j in range(8):
    names.append('model%i' %j)
Store the read_table arguments in a dict:
kwargs = {
    "names":names,
    "delimiter":' ',
    "skip_blank_lines":True
}
Read and store the content of your file, skipping the rows that are not needed:
skiprows=3
with open('textfile.qdp') as f:
    collect_lines_as_list = f.readlines()
    selected_lines = collect_lines_as_list[skiprows:]
    content = "".join(selected_lines) # Join all lines to get one string only
Then split the content on blank lines (that is, two consecutive newline characters) and wrap each piece in a temporary StringIO, which gives pandas a file-like object it can read. Using a dict comprehension (much like the answer you linked), you can collect all the data in one go:
dfs = {
    k: pd.read_table(io.StringIO(data), **kwargs)
    for k, data
    in enumerate(content.split('\n\n'))
}
Then plot your dataframes the way you want (by iterating through your dict values):
fig, axs = plt.subplots(len(dfs), figsize=(10, 6), squeeze=False)  # squeeze=False: axs is always a 2D array
for ax, df in zip(axs.ravel(), dfs.values()):
    ax.errorbar(df.e, df.y, xerr=df.de, yerr=df.y_err, fmt='o', label='data')
plt.show()
This gives one subplot per dataset.
If you want to plot all data in the same subplot, proceed like this:
fig, ax = plt.subplots(figsize=(10, 6))
for df in dfs.values():
    ax.errorbar(df.e, df.y, xerr=df.de, yerr=df.y_err, fmt='o', label='data')
plt.show()
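A self-contained variant of the same idea can be sketched as follows. It uses an in-memory string in place of textfile.qdp (the values are shortened stand-ins, not the real data), splits on blank lines with a regular expression so that whitespace-only separator lines are tolerated, and passes sep=r"\s+" instead of delimiter=' ', on the assumption that columns may be separated by more than one space.

```python
import io
import re

import pandas as pd

# Hypothetical stand-in for textfile.qdp: three lines to skip, then two
# datasets separated by a blank line (values shortened for illustration).
raw = """Line to skip 1
Line to skip 2
Line to skip 3
1.25 0.65 0.022 0.002
2.10 0.20 0.073 0.007

4.14 0.02 2.477 0.047
4.18 0.02 2.441 0.046
4.22 0.02 2.474 0.047
"""

names = ["e", "de", "y", "y_err"]

# Drop the header lines, then split the remainder wherever a blank line occurs.
body = "\n".join(raw.splitlines()[3:])
blocks = [b for b in re.split(r"\n\s*\n", body) if b.strip()]

# One DataFrame per block, keyed by its position in the file.
dfs = {
    k: pd.read_csv(io.StringIO(block), sep=r"\s+", names=names)
    for k, block in enumerate(blocks)
}

for k, df in dfs.items():
    print(k, df.shape)
```

Filtering out empty blocks also guards against a trailing blank line at the end of the file, which would otherwise produce an empty string that pandas cannot parse.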

Related

'poorly' organized csv file

I have a CSV file that I need to process, and it's a bit of a mess. It's about 20 columns wide, but multiple datasets are concatenated one after another in each column (see the dummy file below).
I'm trying to import each sub-file into a separate pandas DataFrame, but I'm not sure of the best way to parse the CSV other than hardcoding the number of rows to import. Any suggestions? I could loop through the entire file to find where the breaks between blocks are and then read each block, but that doesn't seem very efficient. I have lots of CSV files like this to read.
import pandas as pd
nrows = 20
skiprows = 0 #but this only reads in the first block
df = pd.read_csv(csvfile, nrows=nrows, skiprows=skiprows)
Below is a dummy example:
TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
0.128671067,0.56740537,0.689995486,0.518493779,0.94916205
0.214026742,0.176948186,0.883636252,0.732258971,0.463732841
0.769415726,0.960761306,0.401863804,0.41823372,0.812081565
0.529750933,0.360314266,0.461615009,0.387516958,0.136616263
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
0.897559206,0.659759362,0.691643603,0.155601111,0.713735399
0.860188233,0.805013656,0.772153733,0.809025634,0.257632085
0.844167809,0.268060979,0.015993504,0.95131982,0.321210766
0.86288383,0.236599974,0.279435193,0.311005146,0.037592509
0.938348876,0.941851279,0.582434058,0.900348616,0.381844182
0.344351819,0.821571854,0.187962046,0.218234588,0.376122331
0.829766776,0.869014514,0.434165111,0.051749472,0.766748447
0.327865017,0.938176948,0.216764504,0.216666543,0.278110502
0.243953506,0.030809033,0.450110334,0.097976735,0.762393831
0.484856452,0.312943244,0.443236377,0.017201097,0.038786057
0.803696521,0.328088545,0.764850865,0.090543472,0.023363909
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
0.021809601,0.406001002,0.072701623,0.964640184,0.023427393
0.406226618,0.421944527,0.413150342,0.337243905,0.515996389
0.829989793,0.168974332,0.246064043,0.067662474,0.851182924
0.812736737,0.667154845,0.118274705,0.484017732,0.052666038
0.215947395,0.145078319,0.484063281,0.79414799,0.373845815
0.497877968,0.554808367,0.370429652,0.081553316,0.793608698
0.607612542,0.424703584,0.208995066,0.249033837,0.808169709
0.199613478,0.065853429,0.77236195,0.757789625,0.597225697
0.044167285,0.1024231,0.959682778,0.892311813,0.621810775
0.861175219,0.853442735,0.742542086,0.704287769,0.435969078
0.706544823,0.062501379,0.482065481,0.598698867,0.845585046
0.967217599,0.13127149,0.294860203,0.191045015,0.590202032
0.031666757,0.965674812,0.177792841,0.419935921,0.895265056
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.306849588,0.177454423,0.538670939,0.602747137,0.081221293
0.729747557,0.11762043,0.409064884,0.051577964,0.666653287
0.492543468,0.097222882,0.448642979,0.130965724,0.48613413
0.0802024,0.726352481,0.457476151,0.647556514,0.033820374
0.617976299,0.934428994,0.197735831,0.765364856,0.350880707
0.07660401,0.285816636,0.276995238,0.047003343,0.770284864
0.620820688,0.700434525,0.896417099,0.652364756,0.93838793
0.364233925,0.200229902,0.648342989,0.919306736,0.897029239
0.606100716,0.203585366,0.167232701,0.523079381,0.767224301
0.616600448,0.130377791,0.554714839,0.468486555,0.582775753
0.254480861,0.933534632,0.054558237,0.948978985,0.731855548
0.620161044,0.583061202,0.457991555,0.441254272,0.657127968
0.415874646,0.408141761,0.843133575,0.40991199,0.540792744
0.254903429,0.655739954,0.977873649,0.210656057,0.072451639
0.473680525,0.298845701,0.144989283,0.998560665,0.223980961
0.30605008,0.837920854,0.450681322,0.887787908,0.793229776
0.584644405,0.423279153,0.444505314,0.686058204,0.041154856

from io import StringIO
import pandas as pd
data ="""
TIME,HDRA-1,HDRA-2,HDRA-3,HDRA-4
0.473934934,0.944026678,0.460177668,0.157028404,0.221362174
0.911384892,0.336694914,0.586014563,0.828339071,0.632790473
0.772652589,0.318146985,0.162987171,0.555896202,0.659099194
0.541382917,0.033706768,0.229596419,0.388057901,0.465507295
0.462815443,0.088206108,0.717132904,0.545779038,0.268174922
0.522861489,0.736462083,0.532785319,0.961993893,0.393424116
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.92264286,0.026312552,0.905839375,0.869477136,0.985560264
0.410573341,0.004825381,0.920616162,0.19473237,0.848603523
0.999293171,0.259955029,0.380094352,0.101050014,0.428047493
0.820216119,0.655118219,0.586754951,0.568492346,0.017038336
0.040384337,0.195101879,0.778631044,0.655215972,0.701596844
TIME,HDRB-1,HDRB-2,HDRB-3,HDRB-4
0.342418934,0.290979228,0.84201758,0.690964176,0.927385229
0.173485057,0.214049903,0.27438753,0.433904377,0.821778689
0.982816721,0.094490904,0.105895645,0.894103833,0.34362529
0.738593272,0.423470984,0.343551191,0.192169774,0.907698897
"""
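# Read everything as raw rows; the repeated header lines become ordinary rows
df = pd.read_csv(StringIO(data), header=None)
start_marker = 'TIME'
# Each header row increments the counter, so every block gets its own group id
grouper = (df.iloc[:, 0] == start_marker).cumsum()
groups = df.groupby(grouper)
# Promote the first row of each block to column names
frames = [gr.T.set_index(gr.index[0]).T for _, gr in groups]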
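Because the header rows are read as data, every column in the resulting frames holds strings. A minimal sketch of the same cumulative-sum grouping, with an added conversion back to numeric dtypes, might look like this. The column names A-1, B-1 and so on are made up for the example, and it assumes every non-header cell is numeric.

```python
from io import StringIO

import pandas as pd

# Shortened, made-up data in the same layout: a header row starts each block.
data = """TIME,A-1,A-2
0.1,0.2,0.3
0.4,0.5,0.6
TIME,B-1,B-2
0.7,0.8,0.9
"""

raw = pd.read_csv(StringIO(data), header=None)

# True on header rows; the running total labels each block 1, 2, ...
block_id = (raw.iloc[:, 0] == "TIME").cumsum()

frames = []
for _, block in raw.groupby(block_id):
    header = block.iloc[0].tolist()               # first row holds the names
    body = block.iloc[1:].reset_index(drop=True)  # remaining rows are data
    body.columns = header
    frames.append(body.apply(pd.to_numeric))      # strings -> floats

for frame in frames:
    print(frame.dtypes.tolist(), frame.shape)
```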

journal quality kde plots with seaborn/pandas

I'm trying to do some comparative analysis for a publication. I came across seaborn and pandas and really like the ease with which I can create the analysis that I want. However, I find the manuals a bit thin on the things I'm trying to understand about the example plots and how to modify them to my needs. I'm hoping for some advice on how to get the plots I want. Perhaps pandas/seaborn is not what I need.
So, I would like to create subplots, (3,1) or (2,3), of the following figure:
Questions:
1: I would like the attached plot to have a title on the colorbar. I'm not sure whether this is possible, or exactly what is shown: is it relative frequency, occurrence, a percentage, etc.? How can I put an explanatory title on the colorbar (oriented vertically)?
2: The text is a nice addition. The pearsonr is the correlation, but I'm not sure what p is. My guess is that it shows the lag. If so, how can I remove the p from the text?
3: I would like to make the same kind of figure for different variables and put them all in a subplot.
Here's the code I pieced together from the seaborn manual/examples and from other users here on SO (thanks guys).
import netCDF4 as nc
import pandas as pd
import xarray as xr
import numpy as np
import seaborn as sns
import pdb
import matplotlib.pyplot as plt
from scipy import stats, integrate
import matplotlib as mpl
import matplotlib.ticker as tkr
import matplotlib.gridspec as gridspec
sns.set(style="white")
sns.set(color_codes=True)
octp = [622.0, 640.0, 616.0, 731.0, 668.0, 631.0, 641.0, 589.0, 801.0,
828.0, 598.0, 742.0,665.0, 611.0, 773.0, 608.0, 734.0, 725.0, 716.0,
699.0, 686.0, 671.0, 700.0, 656.0,686.0, 675.0, 678.0, 653.0, 659.0,
682.0, 674.0, 684.0, 679.0, 704.0, 624.0, 727.0,739.0, 662.0, 801.0,
633.0, 896.0, 729.0, 659.0, 741.0, 510.0, 836.0, 720.0, 685.0,430.0,
833.0, 710.0, 799.0, 534.0, 532.0, 605.0, 519.0, 850.0, 357.0, 858.0,
497.0,404.0, 456.0, 448.0, 836.0, 462.0, 381.0, 499.0, 673.0, 642.0,
641.0, 458.0, 809.0,562.0, 742.0, 732.0, 710.0, 658.0, 533.0, 811.0,
853.0, 856.0, 785.0, 659.0, 697.0,654.0, 673.0, 707.0, 711.0, 423.0,
751.0, 761.0, 638.0, 576.0, 538.0, 596.0, 718.0,843.0, 640.0, 647.0,
692.0, 599.0, 607.0, 537.0, 679.0, 712.0, 612.0, 641.0, 665.0,658.0,
722.0, 656.0, 656.0, 742.0, 505.0, 688.0, 805.0]
cctp = [482.0, 462.0, 425.0, 506.0, 500.0, 464.0, 486.0, 473.0, 577.0,
735.0, 390.0, 590.0,464.0, 417.0, 722.0, 410.0, 679.0, 680.0, 711.0,
658.0, 687.0, 621.0, 643.0, 690.0,630.0, 661.0, 608.0, 658.0, 624.0,
646.0, 651.0, 634.0, 612.0, 636.0, 607.0, 539.0,706.0, 614.0, 706.0,
401.0, 720.0, 746.0, 511.0, 700.0, 453.0, 677.0, 637.0, 605.0,454.0,
733.0, 535.0, 725.0, 668.0, 513.0, 470.0, 589.0, 765.0, 596.0, 749.0,
462.0,469.0, 514.0, 511.0, 789.0, 647.0, 324.0, 555.0, 670.0, 656.0,
786.0, 374.0, 757.0,645.0, 744.0, 708.0, 497.0, 654.0, 288.0, 705.0,
703.0, 446.0, 675.0, 440.0, 652.0,589.0, 542.0, 661.0, 631.0, 343.0,
585.0, 632.0, 591.0, 602.0, 365.0, 535.0, 663.0,561.0, 448.0, 582.0,
591.0, 535.0, 475.0, 422.0, 599.0, 594.0, 569.0, 576.0, 622.0,483.0,
539.0, 515.0, 621.0, 443.0, 435.0, 502.0, 443.0]
cctp = pd.Series(cctp, name='CTP [hPa]')
octp = pd.Series(octp, name='CTP [hPa]')
formatter = tkr.ScalarFormatter(useMathText=True)
formatter.set_scientific(True)
formatter.set_powerlimits((-2, 2))
g = sns.jointplot(cctp,octp, kind="kde",size=8,space=0.2,cbar=True,
                  n_levels=50,cbar_kws={"format": formatter})
# add a line x=y
x0, x1 = g.ax_joint.get_xlim()
y0, y1 = g.ax_joint.get_ylim()
lims = [max(x0, y0), min(x1, y1)]
g.ax_joint.plot(lims, lims, ':k')
plt.savefig('test_fig.png')  # save before show(); the figure may be empty afterwards
plt.show()
I know I'm asking a lot here. So I put the questions in order of priority.

1: To set the colorbar label, you can add the label key to the cbar_kws dict:
cbar_kws={"format": formatter, "label": 'My colorbar'}
2: To change the stats label, you need to first slightly modify the stats.pearsonr function to only return the first value, instead of the (pearsonr, p) tuple:
pr = lambda a, b: stats.pearsonr(a, b)[0]
Then, you can change that function using jointplot's stat_func kwarg:
stat_func=pr
and finally, you need to change the annotation to get the label right:
annot_kws={'stat':'pearsonr'}
Putting that all together:
pr = lambda a, b: stats.pearsonr(a, b)[0]
g = sns.jointplot(cctp,octp, kind="kde",size=8,space=0.2,cbar=True,
                  n_levels=50,cbar_kws={"format": formatter, "label": 'My colorbar'},
                  stat_func=pr, annot_kws={'stat':'pearsonr'})
3: I don't think it's possible to put everything in a subplot with jointplot. Happy to be proven wrong there, though.
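On the second question, the p shown next to pearsonr is the p-value of the correlation test, not a lag. A small sketch of what stats.pearsonr returns, using the first six values of octp and cctp from the question as sample input:

```python
from scipy import stats

# First six values of octp and cctp from the question, used only as sample input.
a = [622.0, 640.0, 616.0, 731.0, 668.0, 631.0]
b = [482.0, 462.0, 425.0, 506.0, 500.0, 464.0]

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p = stats.pearsonr(a, b)

# Keeping only the coefficient, which is what the lambda in the answer does.
pr = lambda x, y: stats.pearsonr(x, y)[0]

print(f"r = {r:.3f}, p = {p:.3g}")
```

As far as I am aware, recent seaborn releases no longer accept stat_func or annot_kws in jointplot, so there the coefficient would have to be drawn manually, for example with g.ax_joint.text(...).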

POS Tags to Wordnet in Pandas Dataframe

I am using NLTK on a dataset stored as a pandas dataframe. All the raw text processing steps worked fine until I tried to convert the Treebank POS tags to Wordnet POS tags. This is the code that worked fine for me.
import pandas as pd
import string
import nltk  # needed for the nltk.WordPunctTokenizer and nltk.pos_tag calls below
from nltk import WordPunctTokenizer, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn, stopwords
# Example dataframe
df = pd.DataFrame([[2, "I am new at programming."],
[7, "Leaves are falling from the tree."],
[4, "Sophia has been studying since this morning."]], columns = ['ID', 'Text'])
# Tokenize text
tokenizer = nltk.WordPunctTokenizer()
df["Tokens"] = df["Text"].str.lower().apply(tokenizer.tokenize)
# Remove punctuations
pattern = string.punctuation
print(pattern)
def remove_punctuation(tokens):
    filtered = [word for word in tokens if word not in pattern]
    return filtered
df["Tokens"] = df["Tokens"].apply(remove_punctuation)
# Remove stopwords
stopwords = stopwords.words('english')
def remove_stopwords(tokens):
    filtered_words = [word for word in tokens if word not in stopwords]
    return filtered_words
df["Tokens"] = df["Tokens"].apply(remove_stopwords)
The following lines of code did not work, and I got this error:
ValueError: too many values to unpack (expected 2)
def wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wn.ADJ
    elif pos_tag.startswith('V'):
        return wn.VERB
    elif pos_tag.startswith('N'):
        return wn.NOUN
    elif pos_tag.startswith('R'):
        return wn.ADV
    else:
        return None
def wordnet(tokens):
    pos_tokens = [nltk.pos_tag(token) for token in tokens]
    pos_tokens = [(word, wordnet_pos(pos_tag)) for (word, pos_tag) in pos_tokens]
    return pos_tokens
df["Wordnet"] = df["Tokens"].apply(wordnet)
This is what I had hoped to achieve - to create df["Wordnet"] with the Wordnet POS tags.
print(df["Wordnet"])
0 [(new, a), (programming, n)]
1 [(leaves, n), (falling, v), (tree, n)]
2 [(sophia, n), (studying, v), (since, n), (...
Name: Wordnet, dtype: object
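The error most likely comes from calling nltk.pos_tag on each token separately: pos_tag expects a list of tokens, so a single string is treated as a sequence of characters, and each result is a list of (character, tag) pairs rather than one (word, tag) pair. A minimal sketch of the mapping step under that assumption is below. To stay runnable without the NLTK data files, it uses a hand-written tagged list as a stand-in for the output of nltk.pos_tag(tokens), and plain strings in place of wn.ADJ, wn.VERB, wn.NOUN and wn.ADV (which have these same values). The name to_wordnet is made up for the example.

```python
# Stand-ins for wn.ADJ, wn.VERB, wn.NOUN and wn.ADV from nltk.corpus.wordnet.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"


def wordnet_pos(treebank_tag):
    """Map a Treebank tag to a WordNet POS letter, or None if there is no match."""
    if treebank_tag.startswith("J"):
        return ADJ
    if treebank_tag.startswith("V"):
        return VERB
    if treebank_tag.startswith("N"):
        return NOUN
    if treebank_tag.startswith("R"):
        return ADV
    return None


def to_wordnet(tagged_tokens):
    """Convert (word, treebank_tag) pairs, one per word, to (word, wordnet_pos)."""
    return [(word, wordnet_pos(tag)) for word, tag in tagged_tokens]


# Hand-written stand-in for nltk.pos_tag(["new", "programming"]):
# the whole token list is tagged in one call, giving one pair per word.
tagged = [("new", "JJ"), ("programming", "NN")]
result = to_wordnet(tagged)
print(result)
```

With NLTK and its tagger data available, the column could then be built with df["Wordnet"] = df["Tokens"].apply(lambda toks: to_wordnet(nltk.pos_tag(toks))).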

Iterating over columns in data frame by skipping first column and drawing multiple plots

I have a data frame as following,
df.head()
ID    AS_FP     AC_FP  RP11_FP   RP11_be   AC_be     AS_be     Info
AE02  0.060233  0      0.682884  0.817115  0.591182  0.129252  SAP
AE03  0         0      0         0.889181  0.670113  0.766243  SAP
AE04  0         0      0.033256  0.726193  0.171861  0.103839  others
AE05  0         0      0.034988  0.451329  0.431836  0.219843  others
What I am aiming for is to plot each column pair, from AS_FP to RP11_be, as an lmplot: the x axis is a column ending in FP and the y axis is its corresponding column ending in be.
I want to save each plot as a separate file, so I started iterating through the columns, skipping the first column ID, like this:
for ind, column in enumerate(df.columns):
    if column.split('_')[0] == column.split('_')[0]:
But I got lost on how to continue. I need to plot
sns.lmplot(x, y, data=df, hue='Info',palette=colors, fit_reg=False,
           size=10,scatter_kws={"s": 700},markers=["o", "v"])
and save each image as a separate file.

Straightforward solution:
1) Toy data:
import pandas as pd
from collections import OrderedDict
import matplotlib.pyplot as plt
import seaborn as sns
dct = OrderedDict()
dct["ID"] = ["AE02", "AE03", "AE04", "AE05"]
dct["AS_FP"] = [0.060233, 0, 0, 0]
dct["AC_FP"] = [0, 0,0, 0]
dct["RP11_FP"] = [0.682884, 0, 0.033256, 0.034988]
dct["AS_be"] = [0.129252, 0.766243, 0.103839, 0.219843]
dct["AC_be"] = [0.591182, 0.670113, 0.171861, 0.431836]
dct["RP11_be"] = [0.817115, 0.889181, 0.726193, 0.451329]
dct["Info"] = ["SAP", "SAP", "others", "others"]
df = pd.DataFrame(dct)
2) Iterating through pairs, saving each figure with unique filename:
graph_cols = [col for col in df.columns if ("_FP" in col) or ("_be" in col)]
fps = sorted([col for col in graph_cols if "_FP" in col])
bes = sorted([col for col in graph_cols if "_be" in col])
for x, y in zip(fps, bes):
    snsplot = sns.lmplot(x, y, data=df, fit_reg=False, hue='Info',
                         size=10, scatter_kws={"s": 700})
    snsplot.savefig(x.split("_")[0] + ".png")
You can pass any additional lmplot parameters you need.
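Pairing the two sorted lists works here because every _FP column has a _be partner with the same prefix. A sketch that matches the columns by prefix explicitly, and so does not depend on sort order or on every partner being present, could look like this (pair_columns is a made-up helper name):

```python
def pair_columns(columns, x_suffix="_FP", y_suffix="_be"):
    """Return (x, y) column pairs that share the same prefix."""
    available = set(columns)
    pairs = []
    for col in columns:
        if col.endswith(x_suffix):
            # Swap the x suffix for the y suffix to get the partner's name.
            partner = col[: -len(x_suffix)] + y_suffix
            if partner in available:
                pairs.append((col, partner))
    return pairs


# Column names from the question's data frame.
columns = ["ID", "AS_FP", "AC_FP", "RP11_FP", "RP11_be", "AC_be", "AS_be", "Info"]
pairs = pair_columns(columns)
print(pairs)
```

The plotting loop would then start with for x, y in pair_columns(df.columns). Note also that newer seaborn versions call lmplot's size parameter height and expect x and y as keyword arguments.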

How to render two pd.DataFrames in jupyter notebook side by side?

Is there an easy way to quickly see contents of two pd.DataFrames side-by-side in Jupyter notebooks?
df1 = pd.DataFrame([(1,2),(3,4)], columns=['a', 'b'])
df2 = pd.DataFrame([(1.1,2.1),(3.1,4.1)], columns=['a', 'b'])
df1, df2

You could try this function from Wes McKinney:
def side_by_side(*objs, **kwds):
    ''' Print objects side by side '''
    from pandas.io.formats.printing import adjoin
    space = kwds.get('space', 4)
    reprs = [repr(obj).split('\n') for obj in objs]
    print(adjoin(space, *reprs))
# building a test case of two DataFrame
import pandas as pd
import numpy as np
n, p = (10, 3) # dfs' shape
# dfs indexes and columns labels
index_rowA = [t[0]+str(t[1]) for t in zip(['rA']*n, range(n))]
index_colA = [t[0]+str(t[1]) for t in zip(['cA']*p, range(p))]
index_rowB = [t[0]+str(t[1]) for t in zip(['rB']*n, range(n))]
index_colB = [t[0]+str(t[1]) for t in zip(['cB']*p, range(p))]
# building the df A and B
dfA = pd.DataFrame(np.random.rand(n,p), index=index_rowA, columns=index_colA)
dfB = pd.DataFrame(np.random.rand(n,p), index=index_rowB, columns=index_colB)
side_by_side(dfA,dfB)
Outputs:
          cA0       cA1       cA2              cB0       cB1       cB2
rA0  0.708763  0.665374  0.718613    rB0  0.320085  0.677422  0.722697
rA1  0.120551  0.277301  0.646337    rB1  0.682488  0.273689  0.871989
rA2  0.372386  0.953481  0.934957    rB2  0.015203  0.525465  0.223897
rA3  0.456871  0.170596  0.501412    rB3  0.941295  0.901428  0.329489
rA4  0.049491  0.486030  0.365886    rB4  0.597779  0.201423  0.010794
rA5  0.277720  0.436428  0.533683    rB5  0.701220  0.261684  0.502301
rA6  0.391705  0.982510  0.561823    rB6  0.182609  0.140215  0.389426
rA7  0.827597  0.105354  0.180547    rB7  0.041009  0.936011  0.613592
rA8  0.224394  0.975854  0.089130    rB8  0.697824  0.887613  0.972838
rA9  0.433850  0.489714  0.339129    rB9  0.263112  0.355122  0.447154
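The function above prints plain text. In a notebook, an HTML-based variant may render more nicely. The sketch below builds one HTML string from each frame's to_html() output and places the tables in inline-block containers; side_by_side_html and gap_px are made-up names, and the styling is only one possible choice.

```python
import pandas as pd


def side_by_side_html(*dfs, gap_px=20):
    """Return a single HTML string that lays the given frames out in a row."""
    cells = [
        '<div style="display:inline-block; vertical-align:top; '
        f'margin-right:{gap_px}px">{df.to_html()}</div>'
        for df in dfs
    ]
    return "".join(cells)


df1 = pd.DataFrame([(1, 2), (3, 4)], columns=["a", "b"])
df2 = pd.DataFrame([(1.1, 2.1), (3.1, 4.1)], columns=["a", "b"])
html = side_by_side_html(df1, df2)
```

In a notebook cell, the string can be shown with from IPython.display import HTML, display followed by display(HTML(html)).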

The closest to what you want could be:
> df1.merge(df2, right_index=1, left_index=1, suffixes=("_1", "_2"))
   a_1  b_1  a_2  b_2
0    1    2  1.1  2.1
1    3    4  3.1  4.1
It's not specific to the notebook, but it works and it's not that complicated. Another solution would be to convert your dataframes to images and put them side by side in subplots, but that is a bit far-fetched and complicated.
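A similar result can be sketched with pd.concat, which places the frames next to each other and, with the keys argument, labels each group of columns by its origin instead of adding suffixes. The labels "df1" and "df2" are arbitrary.

```python
import pandas as pd

df1 = pd.DataFrame([(1, 2), (3, 4)], columns=["a", "b"])
df2 = pd.DataFrame([(1.1, 2.1), (3.1, 4.1)], columns=["a", "b"])

# axis=1 places the frames next to each other; keys adds a top column level
# so that the origin of each column stays visible.
both = pd.concat([df1, df2], axis=1, keys=["df1", "df2"])
print(both)
```

Rows are aligned on the index, so frames with different indexes produce NaN where one side has no matching row.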

I ended up using a helper function to quickly compare two data frames:
def cmp(df1, df2, topn=10):
    n = topn
    a = df1.reset_index().head(n=n)
    b = df2.reset_index().head(n=n)
    span = pd.DataFrame(data=[('-',) for _ in range(n)], columns=['sep'])
    a = a.merge(span, right_index=1, left_index=1)
    return a.merge(b, right_index=1, left_index=1, suffixes=['_L', '_R'])
