How to customize headers and column widths of DataFrame display? - pandas

As a rule, I like to use long, descriptive column names (e.g. estimated_background_signal rather than just bg) for DataFrame objects. The one downside of this preference is that the DataFrame's display form has several columns that are much wider than their values require. For example:
In [10]: data.head()
     barcode  estimated_background_signal  inhibitor_code  inhibitor_concentration
0  R00577279                          133             IRB                     0.001
1  R00577279                          189             SNZ                     0.001
2  R00577279                          101             CMY                     0.001
3  R00577279                          112             BRC                     0.001
4  R00577279                          244             ISB                     0.001
It would be nice if the display were narrower. Disregarding the headers, the narrowest display would be:
0  R00577279  133  IRB  0.001
1  R00577279  189  SNZ  0.001
2  R00577279  101  CMY  0.001
3  R00577279  112  BRC  0.001
4  R00577279  244  ISB  0.001
...but eliminating the headers altogether is not an entirely satisfactory solution. A better one would be to make the display wide enough to allow for some headers, possibly taking up several lines:
     barcode  estim  inhib  inhib
              ated_  itor_  itor_
              backg   code  conce
0  R00577279    133    IRB  0.001
1  R00577279    189    SNZ  0.001
2  R00577279    101    CMY  0.001
3  R00577279    112    BRC  0.001
4  R00577279    244    ISB  0.001
It's probably obvious that no single convention would be suitable for all situations, but, in any case, does pandas offer any way to customize the headers and column widths of a DataFrame's display form?

This is a bit of a hack that uses the multi-index feature of pandas in a non-standard way, although I don't see any significant problems with doing that. Of course, there is some increased complexity from using a multi-index rather than a simple index.
cols = df.columns
# split each column name roughly in half
lencols = [int(len(c) / 2) for c in cols]
# a two-level MultiIndex makes each header wrap onto two display lines
df.columns = pd.MultiIndex.from_tuples(
    tuple((c[:ln], c[ln:]) for c, ln in zip(cols, lencols))
)
Results:
         bar   estimated_bac  inhibit   inhibitor_c
        code  kground_signal  or_code  oncentration
0  R00577279             133      IRB         0.001
1  R00577279             189      SNZ         0.001
2  R00577279             101      CMY         0.001
3  R00577279             112      BRC         0.001
4  R00577279             244      ISB         0.001
You could also consider creating a dictionary to convert between long & short names as needed:
Display column name different from dictionary key name in Pandas?
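For example, a minimal sketch of that idea (the short names are made up for the example, and data is assumed to be the DataFrame from the question):
# Illustrative mapping; these short names are invented for the example.
short_names = {
    "estimated_background_signal": "bg_signal",
    "inhibitor_code": "inhib",
    "inhibitor_concentration": "inhib_conc",
}
# Rename only for display; the underlying DataFrame keeps its long names.
print(data.rename(columns=short_names).head())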

There are, of course, pd.set_option display settings you can use. If you're looking for a pandas-specific answer that doesn't involve changing notebook display settings, consider the below.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 2),
                  columns=['Very Long Column Title ' + str(i) for i in range(2)])
# cap the rendered width of the header cells (th) in the notebook's HTML output
df.style.set_table_styles([dict(selector="th", props=[('max-width', '50px')])])
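For completeness, the display-option route mentioned above looks like this. Note that these standard options control value truncation and the overall table width, not how the headers themselves wrap (a sketch):
import pandas as pd

pd.set_option('display.max_colwidth', 20)   # truncate long cell values
pd.set_option('display.width', 100)         # total line width before wrapping the frame
pd.set_option('display.max_columns', 10)    # number of columns shown before eliding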

Related

Remove related row from pandas dataframe

I have the following dataframe:
 id  relatedId  coordinate
123        125          55
125        123          45
128        130          60
132        135          50
130        128          40
135        132          50
So I have 6 rows in this dataframe, but I would like to get rid of the related rows, ending up with 3 rows. The coordinate values of two related rows sum to 100, and I would like to keep the row with the lower value (the one below 50; if both are 50, simply keep one of them). The resulting dataframe would thus be:
 id  relatedId  coordinate
125        123          45
132        135          50
130        128          40
Hopefully someone has a good solution for this problem.
Thanks
You can sort the values and take the first row per group, using a frozenset of the two ids as the grouper:
(df
 .sort_values(by='coordinate')
 # the frozenset {id, relatedId} is identical for the two related rows
 .groupby(df[['id', 'relatedId']].agg(frozenset, axis=1), as_index=False)
 .first()
)
output:
id relatedId coordinate
0 130 128 40
1 125 123 45
2 132 135 50
Alternatively, to keep the original order, and original indices, use idxmin per group:
group = df[['id', 'relatedId']].agg(frozenset, axis=1)
idx = df['coordinate'].groupby(group).idxmin()
df.loc[sorted(idx)]
output:
id relatedId coordinate
1 125 123 45
3 132 135 50
4 130 128 40
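For reference, the question's sample data can be reconstructed like this to try both snippets (values taken from the table above):
import pandas as pd

df = pd.DataFrame({
    "id":         [123, 125, 128, 132, 130, 135],
    "relatedId":  [125, 123, 130, 135, 128, 132],
    "coordinate": [ 55,  45,  60,  50,  40,  50],
})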

Inconsistent Pandas axis labels

I have a pandas DataFrame (df) that includes a label column (the 'Specimens' column here).
     Specimens  Sample  Min_read_lg  Avr_read_lg  Max_read_lg
0  B.pleb_sili       1           32      249.741          488
1  B.pleb_sili       2           30      276.959          489
2  B.conc_sili       3           25      256.294          489
3  B.conc_sili       4           27      277.923          489
4    F1_1_sili       5           34      303.328          489
...
I have tried to plot it as follows, but the labels on the x axis do not match the actual values in the table. Would anyone know why this is the case?
plot=df.plot.area()
plot.set_xlabel("Specimens")
plot.set_ylabel("Read length")
plot.set_xticklabels(df['Specimens'], rotation=90)
I think the "plot.set_xticklabels" method is not right, but I would like to understand why the labels on the x axis are mismatched, and why most of them are missing.
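A likely explanation: set_xticklabels only relabels the ticks matplotlib has already chosen (usually a subset of the rows), so the labels land on the wrong positions and most rows get none. A sketch of a fix, assuming df is the frame shown above, is to pin one tick per row before relabeling:
import matplotlib.pyplot as plt

ax = df.plot.area()                       # x axis is the positional index 0..len(df)-1
ax.set_xlabel("Specimens")
ax.set_ylabel("Read length")
ax.set_xticks(range(len(df)))             # one tick per row
ax.set_xticklabels(df['Specimens'], rotation=90)
plt.show()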

multi-dimensional indexing warning with pandas

import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

x = df.x_value
y = df.y_value
x = x[:, np.newaxis]   # indexing a Series with np.newaxis triggers the warning
y = y[:, np.newaxis]
polynomial_features = PolynomialFeatures(degree=2)
x_transformed = polynomial_features.fit_transform(x)
The above code gives the following warning... how can I avoid it?
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
A full working example with the solution suggested by the warning:
In [194]: df
Out[194]:
age rank height weight
0 20 2 155 53
1 15 7 159 60
2 34 6 180 75
3 40 5 163 80
4 60 1 170 49
In [195]: df.height
Out[195]:
0 155
1 159
2 180
3 163
4 170
Name: height, dtype: int64
In [196]: df.height[:,None]
<ipython-input-196-1af0bb09495a>:1: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
df.height[:,None]
Out[196]:
array([[155],
[159],
[180],
[163],
[170]])
In [197]: df.height.to_numpy()[:,None]
Out[197]:
array([[155],
[159],
[180],
[163],
[170]])
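Applied to the snippet from the question (a sketch; it assumes df has the x_value and y_value columns used in the original code):
x = df.x_value.to_numpy()[:, np.newaxis]   # convert to a NumPy array first, then add the axis
y = df.y_value.to_numpy()[:, np.newaxis]
x_transformed = PolynomialFeatures(degree=2).fit_transform(x)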

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on that quantile percentage. So if one group has q=0.8, I want the lowest 80% of values to be given 1, and the upper 20% to be given 0.
So, given the data like this:
I want objects 1, 2 and 5 to get result 1 and the other 3 to get result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but that requires a constant quantile, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000
# one random quantile per group
qua = np.around(np.random.uniform(size=grp_num), 2)
df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
Answer (this is basically a for loop, so for performance you could try applying numba to it):
def func2(grp):
    # compare each value to its own group's quantile threshold
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
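A possible variant (a sketch, not benchmarked) computes one threshold per group and maps it back onto the rows instead of returning a Series per group:
# One quantile threshold per group, keyed by group id.
thresholds = df.groupby("Group").apply(
    lambda g: g["Value"].quantile(g["Quantile"].iloc[0])
)
# Broadcast each group's threshold back to its rows and compare.
df["result"] = (df["Value"] < df["Group"].map(thresholds)).astype(int)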

Gnuplot conditional plotting

I need to use gnuplot to plot wind direction values (y) against time (x) in a 2D plot using lines and points. This works fine if successive values are close together. If the values are separated by, say, 250 degrees, I need a condition that checks the previous y value and does not draw a line joining the two points. This situation occurs when the wind direction is in the 280 to 20 degree sector, and the plots get messy, e.g. with a north wind. As the data is time dependent I cannot use polar plots except at a specific point in time; I need to show the change in direction over time.
Basically the problem is:
plot y against x ; when (y2-y1)>= 180 then break/erase line joining successive points
Can anyone give me an example of how to do this?
A sample from the data file is:
2014-06-16 16:00:00 0.000 990.081 0.001 0.001 0.001 0.001 0.002 0.001 11.868 308 002.54 292 004.46 00
2014-06-16 16:10:00 0.000 990.047 0.001 0.001 0.001 0.001 0.002 0.001 11.870 303 001.57 300 002.48 00
2014-06-16 16:20:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.961 334 001.04 314 002.07 00
2014-06-16 16:30:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.818 005 001.18 020 002.14 00
2014-06-16 16:40:00 0.000 990.014 0.001 0.001 0.001 0.001 0.002 0.001 11.725 332 001.14 337 002.26 00
and I want to plot column 12 vs time.
You can insert a filtering condition in the using statement and use a value of 1/0 if the condition is not fulfilled. In that case the point is not connected to the others:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
y1 = y2 = 0
plot 'data.dat' using 1:(y1 = y2, y2 = $12, ($0 == 0 || y2 - y1 < 180) ? $12 : 1/0) with lines,\
'data.dat' using 1:12 with points
With your data sample and gnuplot version 4.6.5 I get the plot:
Unfortunately, with this approach you cannot use a combined linespoints style, only separate lines and points plots, and the line segment following the 1/0 point isn't drawn either.
A better approach would be to use awk to insert an empty line when a jump occurs. In a 2D plot, points from different data blocks (separated by a single blank line) aren't connected:
set timefmt '%Y-%m-%d %H:%M:%S'
set xdata time
unset key
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n"); print}'' data.dat' using 1:12 with linespoints
In order to break the joining lines, two conditional statements must be fulfilled, and BOTH must include the newline statement printf("\n"):
plot '< awk ''{y1 = y2; y2 = $12; if (NR > 1 && y2 - y1 >= 180) printf("\n") ; if (NR > 1 && y2 -y1 <= 0) printf("\n"); print}'' /Desktop/plotdata.txt' using 1:12 with linespoints
There is absolutely no need for awk. You can simply "interrupt" the lines by using variable color for the line. For gnuplot<5.0.0 you can use 0xffffff=white (or whatever your background color is) as linecolor and the line will hardly be visible. For gnuplot>=5.0.0 you can use any transparent color, e.g. 0xff123456, i.e. the line is really invisible.
Data: SO24425910.dat
2014-06-16 16:00:00 330
2014-06-16 16:10:00 320
2014-06-16 16:20:00 310
2014-06-16 16:30:00 325
2014-06-16 16:40:00 090
2014-06-16 16:50:00 060
2014-06-16 17:00:00 070
2014-06-16 17:10:00 280
2014-06-16 17:20:00 290
2014-06-16 17:30:00 300
Script: (works for gnuplot>=4.4.0, March 2010)
### conditional interruption of line
reset
FILE = "SO24425910.dat"
set key noautotitle
set yrange[0:360]
set ytics 90
set grid x
set grid y
set xdata time
set timefmt "%Y-%m-%d %H:%M"
set format x "%H:%M"
set multiplot layout 2,1
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var
plot y1=NaN FILE u 1:(y0=y1,y1=$3):(abs(y1-y0)>=180?0xffffff:0xff0000) w l lc rgb var, \
'' u 1:3 w p pt 7 lc rgb "red"
unset multiplot
### end of script
Result: