larray.Array.percentile_by

Array.percentile_by(q, *axes_and_groups, out=None, method='linear', skipna=None, keepaxes=False, **explicit_axes)[source]

Compute the qth percentile of the data for the specified axis.

Parameters
q : int in range of [0, 100] (or sequence of floats)

Percentile to compute, which must be between 0 and 100 inclusive.

*axes_and_groups : None or int or str or Axis or Group or any combination of those

The qth percentile is performed along all axes except the given one(s). For groups, the qth percentile is performed over the labels of each group and along all axes not associated with the given group(s). The default (no axis or group) is to perform the qth percentile over all the dimensions of the input array.

An axis can be referred by:

  • its index (integer). Index can be a negative integer, in which case it counts from the last to the first axis.

  • its name (str or AxisReference). You can use either a simple string (‘axis_name’) or the special variable X (X.axis_name).

  • a variable (Axis). If the axis has been defined previously and assigned to a variable, you can pass it as argument.

You may not want to perform the qth percentile over a whole axis but over a selection of specific labels (a short sketch follows this list). To do so, you have several possibilities:

  • ([‘a1’, ‘a3’, ‘a5’], ‘b1, b3, b5’) : labels separated by commas in a list or a string

  • (‘a1:a5:2’) : select labels using a slice (general syntax is ‘start:end:step’ where ‘step’ is optional and 1 by default).

  • (a=’a1, a2, a3’, X.b[‘b1, b2, b3’]) : in case of possible ambiguity, i.e. if labels can belong to more than one axis, you must specify the axis.

  • (‘a1:a3; a5:a7’, b=’b0,b2; b1,b3’) : create several groups with semicolons. Names are simply given by the concatenation of labels (here: ‘a1,a2,a3’, ‘a5,a6,a7’, ‘b0,b2’ and ‘b1,b3’)

  • (‘a1:a3 >> a123’, ‘b[b0,b2] >> b12’) : the ‘ >> ’ operator allows renaming groups.
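
As an illustration, here is a minimal sketch of these different ways of referring to axes and groups (it assumes the same arr = ndtest((4, 4)) array as in the Examples section below and that X was imported from larray; the res* names are only used to hold the results):

>>> res1 = arr.percentile_by(25, 0)              # axis index
>>> res2 = arr.percentile_by(25, 'a')            # axis name
>>> res3 = arr.percentile_by(25, X.a)            # AxisReference
>>> res4 = arr.percentile_by(25, arr.axes['a'])  # Axis object
>>> # a group restricted to two labels, renamed 'a01'
>>> res5 = arr.percentile_by(25, X.a['a0', 'a1'] >> 'a01')

The first four calls keep the whole axis ‘a’ (the last one keeps only the given group of its labels) and compute the percentile over the remaining axes.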

out : Array, optional

Alternate output array in which to place the result. It must have the same shape as the expected output and its dtype is preserved. Axes and labels can be different, only the shape matters. Defaults to None (create a new array).
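
A minimal sketch of reusing a pre-allocated array as output (this assumes the target already has the expected shape and a float dtype; the names below are illustrative only):

>>> target = arr.percentile_by(25, 'a') * 0.0      # any float array with the expected shape
>>> _ = arr.percentile_by(25, 'a', out=target)     # target now holds the same values as arr.percentile_by(25, 'a')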

method : str, optional

This parameter specifies the method to use for estimating the percentile when the desired percentile lies between two indexes. The different methods supported are described in the Notes section. The options are:

  • ‘inverted_cdf’

  • ‘averaged_inverted_cdf’

  • ‘closest_observation’

  • ‘interpolated_inverted_cdf’

  • ‘hazen’

  • ‘weibull’

  • ‘linear’ (default)

  • ‘median_unbiased’

  • ‘normal_unbiased’

  • ‘lower’

  • ‘higher’

  • ‘midpoint’

  • ‘nearest’

The first three and last four methods are discontinuous. Defaults to ‘linear’.
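
For instance, a minimal sketch contrasting the default ‘linear’ method with the discontinuous ‘lower’ method (reusing the arr = ndtest((4, 4)) array from the Examples section; the values in the comments follow from the formulas in the Notes section):

>>> res = arr.percentile_by(25, 'b', method='lower')
>>> # res['b0'] is 0 (the lower bracketing value of [0, 4, 8, 12]),
>>> # whereas the default 'linear' method interpolates and gives 3.0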

skipna : bool, optional

Whether to skip NaN (null) values. If False, resulting cells will be NaN if any of the aggregated cells is NaN. Defaults to True.
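
A minimal sketch of the skipna behaviour (assuming the arr = ndtest((4, 4)) array from the Examples section, converted to float so that a NaN can be inserted):

>>> arr_f = arr.astype(float)
>>> arr_f['a0', 'b0'] = float('nan')
>>> res_skip = arr_f.percentile_by(50, 'b')                # skipna=True (default): 'b0' is computed from [4, 8, 12]
>>> res_nan = arr_f.percentile_by(50, 'b', skipna=False)   # 'b0' is nan because one aggregated cell is nan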

keepaxes : bool or label-like, optional

Whether reduced axes are left in the result as dimensions with size one. If True, reduced axes will contain a unique label representing the applied aggregation (e.g. ‘sum’, ‘prod’, …). It is possible to override this label by passing a specific value (e.g. keepaxes=’summation’). Defaults to False.
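
A minimal sketch of keepaxes, assuming it behaves like the other _by aggregations (the exact label used for the collapsed axis is an implementation detail):

>>> res = arr.percentile_by(25, 'a', keepaxes=True)     # the reduced 'b' axis is kept with size one
>>> res = arr.percentile_by(25, 'a', keepaxes='q25')    # same, but the single label is 'q25' (a hypothetical label)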

Returns
Array or scalar

Notes

Given a vector V of length n, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V. The values and distances of the two nearest neighbors as well as the method parameter will determine the percentile if the normalized ranking does not match the location of q exactly. This function is the same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.

The optional method parameter specifies the method to use when the desired percentile lies between two indexes i and j = i + 1. In that case, we first determine i + g, a virtual index that lies between i and j, where i is the floor and g is the fractional part of the index. The final result is, then, an interpolation of a[i] and a[j] based on g. During the computation of g, i and j are modified using correction constants alpha and beta whose choices depend on the method used. Finally, note that since Python uses 0-based indexing, the code subtracts another 1 from the index internally.

The following formula determines the virtual index i + g, the location of the percentile in the sorted sample:

\[i + g = \frac{q}{100} \left(n - \alpha - \beta + 1\right) + \alpha\]
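
For instance, with the default linear method (alpha = beta = 1) and the 16 values 0, 1, …, 15 of the array used in the Examples section below, q = 25 gives:

\[i + g = \frac{25}{100} \left(16 - 1 - 1 + 1\right) + 1 = 4.75\]

After subtracting 1 for 0-based indexing, the virtual index is 3.75, so the result is a[3] + 0.75 * (a[4] - a[3]) = 3 + 0.75 = 3.75, which is the value returned by arr.percentile_by(25) in the Examples section.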

The different methods then work as follows:

inverted_cdf:

method 1 of H&F [1]. This method gives discontinuous results:

  • if g > 0 ; then take j

  • if g = 0 ; then take i

averaged_inverted_cdf:

method 2 of H&F [1]. This method gives discontinuous results:

  • if g > 0 ; then take j

  • if g = 0 ; then average between bounds

closest_observation:

method 3 of H&F [1]. This method gives discontinuous results:

  • if g > 0 ; then take j

  • if g = 0 and index is odd ; then take j

  • if g = 0 and index is even ; then take i

interpolated_inverted_cdf:

method 4 of H&F [1]. This method gives continuous results using:

  • alpha = 0

  • beta = 1

hazen:

method 5 of H&F [1]. This method gives continuous results using:

  • alpha = 1/2

  • beta = 1/2

weibull:

method 6 of H&F [1]. This method gives continuous results using:

  • alpha = 0

  • beta = 0

linear:

method 7 of H&F [1]. This method gives continuous results using:

  • alpha = 1

  • beta = 1

median_unbiased:

method 8 of H&F [1]. This is probably the best method if the sample distribution function is unknown (see reference). It gives continuous results using:

  • alpha = 1/3

  • beta = 1/3

normal_unbiased:

method 9 of H&F [1]. This is probably the best method if the sample distribution function is known to be normal. It gives continuous results using:

  • alpha = 3/8

  • beta = 3/8

lower:

NumPy method kept for backwards compatibility. Takes i as the interpolation point.

higher:

NumPy method kept for backwards compatibility. Takes j as the interpolation point.

nearest:

NumPy method kept for backwards compatibility. Takes i or j, whichever is nearest.

midpoint:

NumPy method kept for backwards compatibility. Uses (i + j) / 2.
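
Since these methods come straight from NumPy, they can be checked directly against numpy.percentile (NumPy 1.22 or later). A minimal sketch on the first column of the array used in the Examples section (the values in the comments follow from the formula above):

>>> import numpy as np
>>> v = np.array([0., 4., 8., 12.])                   # column 'b0' of arr
>>> p_lin = np.percentile(v, 25, method='linear')     # alpha = beta = 1: virtual index 1.75 -> 3.0
>>> p_haz = np.percentile(v, 25, method='hazen')      # alpha = beta = 1/2: virtual index 1.5 -> 2.0
>>> p_low = np.percentile(v, 25, method='lower')      # discontinuous: floor of the index -> 0.0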

References

[1] R. J. Hyndman and Y. Fan, “Sample quantiles in statistical packages,” The American Statistician, 50(4), pp. 361-365, 1996.

Examples

>>> arr = ndtest((4, 4))
>>> arr
a\b  b0  b1  b2  b3
 a0   0   1   2   3
 a1   4   5   6   7
 a2   8   9  10  11
 a3  12  13  14  15
>>> arr.percentile_by(25)
3.75
>>> # along axis 'a'
>>> arr.percentile_by(25, 'a')
a    a0    a1    a2     a3
   0.75  4.75  8.75  12.75
>>> # along axis 'b'
>>> arr.percentile_by(25, 'b')
b   b0   b1   b2   b3
   3.0  4.0  5.0  6.0
>>> # several percentile values
>>> arr.percentile_by([25, 50, 75], 'b')
percentile\b   b0    b1    b2    b3
          25  3.0   4.0   5.0   6.0
          50  6.0   7.0   8.0   9.0
          75  9.0  10.0  11.0  12.0

Select some rows only

>>> arr.percentile_by(25, ['a0', 'a1'])
1.75
>>> # or equivalently
>>> # arr.percentile_by(25, 'a0,a1')

Split an axis in several parts

>>> arr.percentile_by(25, (['a0', 'a1'], ['a2', 'a3']))
a  a0,a1  a2,a3
    1.75   9.75
>>> # or equivalently
>>> # arr.percentile_by(25, 'a0,a1;a2,a3')

Same with renaming

>>> arr.percentile_by(25, (X.a['a0', 'a1'] >> 'a01', X.a['a2', 'a3'] >> 'a23'))
a   a01   a23
   1.75  9.75
>>> # or equivalently
>>> # arr.percentile_by(25, 'a0,a1>>a01;a2,a3>>a23')