larray.Array.percentile_by
- Array.percentile_by(q, *axes_and_groups, out=None, method='linear', skipna=None, keepaxes=False, **explicit_axes)
Compute the qth percentile of the data for the specified axis.
- Parameters
- q : int in range of [0,100] (or sequence of floats)
Percentile to compute, which must be between 0 and 100 inclusive.
- *axes_and_groups : None or int or str or Axis or Group or any combination of those
The qth percentile is performed along all axes except the given one(s). For groups, qth percentile is performed along groups and non associated axes. The default (no axis or group) is to perform the qth percentile over all the dimensions of the input array.
An axis can be referred to by:
its index (integer). Index can be a negative integer, in which case it counts from the last to the first axis.
its name (str or AxisReference). You can use either a simple string (‘axis_name’) or the special variable X (X.axis_name).
a variable (Axis). If the axis has been defined previously and assigned to a variable, you can pass it as argument.
You may not want to perform the qth percentile over a whole axis but over a selection of specific labels. To do so, you have several possibilities:
(['a1', 'a3', 'a5'], 'b1, b3, b5') : labels separated by commas in a list or a string
('a1:a5:2') : select labels using a slice (general syntax is 'start:end:step' where 'step' is optional and 1 by default).
(a='a1, a2, a3', X.b['b1, b2, b3']) : in case of possible ambiguity, i.e. if labels can belong to more than one axis, you must specify the axis.
('a1:a3; a5:a7', b='b0,b2; b1,b3') : create several groups with semicolons. Names are simply given by the concatenation of labels (here: 'a1,a2,a3', 'a5,a6,a7', 'b0,b2' and 'b1,b3')
('a1:a3 >> a123', 'b[b0,b2] >> b12') : the ' >> ' operator renames groups.
- out : Array, optional
Alternate output array in which to place the result. It must have the same shape as the expected output and its type is preserved (the result is cast to dtype(out) if necessary). Axes and labels can be different, only the shape matters. Defaults to None (create a new array).
- method : str, optional
This parameter specifies the method to use for estimating the percentile when the desired percentile lies between two indexes. The different methods supported are described in the Notes section. The options are:
‘inverted_cdf’
‘averaged_inverted_cdf’
‘closest_observation’
‘interpolated_inverted_cdf’
‘hazen’
‘weibull’
‘linear’ (default)
‘median_unbiased’
‘normal_unbiased’
‘lower’
‘higher’
‘midpoint’
‘nearest’
The first three and last four methods are discontinuous. Defaults to ‘linear’.
- skipna : bool, optional
Whether to skip NaN (null) values. If False, resulting cells will be NaN if any of the aggregated cells is NaN. Defaults to True.
- keepaxes : bool or label-like, optional
Whether reduced axes are left in the result as dimensions with size one. If True, reduced axes will contain a unique label representing the applied aggregation (e.g. ‘sum’, ‘prod’, …). It is possible to override this label by passing a specific value (e.g. keepaxes=’summation’). Defaults to False.
- Returns
- Array or scalar
See also
Notes
Given a vector V of length n, the q-th percentile of V is the value q/100 of the way from the minimum to the maximum in a sorted copy of V. The values and distances of the two nearest neighbors as well as the method parameter will determine the percentile if the normalized ranking does not match the location of q exactly. This function is the same as the median if q=50, the same as the minimum if q=0 and the same as the maximum if q=100.
The optional method parameter specifies the method to use when the desired percentile lies between two indexes i and j = i + 1. In that case, we first determine i + g, a virtual index that lies between i and j, where i is the floor and g is the fractional part of the index. The final result is, then, an interpolation of a[i] and a[j] based on g. During the computation of g, i and j are modified using correction constants alpha and beta whose choices depend on the method used. Finally, note that since Python uses 0-based indexing, the code subtracts another 1 from the index internally.
The following formula determines the virtual index i + g, the location of the percentile in the sorted sample:
\[i + g = (q / 100) * ( n - alpha - beta + 1 ) + alpha\]
The different methods then work as follows:
- inverted_cdf:
method 1 of H&F [1]. This method gives discontinuous results:
if g > 0 ; then take j
if g = 0 ; then take i
- averaged_inverted_cdf:
method 2 of H&F [1]. This method gives discontinuous results:
if g > 0 ; then take j
if g = 0 ; then average between bounds
- closest_observation:
method 3 of H&F [1]. This method gives discontinuous results:
if g > 0 ; then take j
if g = 0 and index is odd ; then take j
if g = 0 and index is even ; then take i
- interpolated_inverted_cdf:
method 4 of H&F [1]. This method gives continuous results using:
alpha = 0
beta = 1
- hazen:
method 5 of H&F [1]. This method gives continuous results using:
alpha = 1/2
beta = 1/2
- weibull:
method 6 of H&F [1]. This method gives continuous results using:
alpha = 0
beta = 0
- linear:
method 7 of H&F [1]. This method gives continuous results using:
alpha = 1
beta = 1
- median_unbiased:
method 8 of H&F [1]. This method is probably the best method if the sample distribution function is unknown (see reference). This method gives continuous results using:
alpha = 1/3
beta = 1/3
- normal_unbiased:
method 9 of H&F [1]. This method is probably the best method if the sample distribution function is known to be normal. This method gives continuous results using:
alpha = 3/8
beta = 3/8
- lower:
NumPy method kept for backwards compatibility. Takes i as the interpolation point.
- higher:
NumPy method kept for backwards compatibility. Takes j as the interpolation point.
- nearest:
NumPy method kept for backwards compatibility. Takes i or j, whichever is nearest.
- midpoint:
NumPy method kept for backwards compatibility. Uses (i + j) / 2.
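The virtual index formula and the alpha/beta constants listed above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the names `CORRECTIONS` and `sketch_percentile` are hypothetical, only the continuous methods (4 to 9 of H&F) are covered, and clamping the index to the valid range is an assumption made here to handle the edges.

```python
import math

# (alpha, beta) pairs for the continuous methods, as listed in the Notes
CORRECTIONS = {
    "interpolated_inverted_cdf": (0.0, 1.0),
    "hazen": (0.5, 0.5),
    "weibull": (0.0, 0.0),
    "linear": (1.0, 1.0),
    "median_unbiased": (1 / 3, 1 / 3),
    "normal_unbiased": (3 / 8, 3 / 8),
}


def sketch_percentile(values, q, method="linear"):
    """Illustrative q-th percentile of a 1-D sequence (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    alpha, beta = CORRECTIONS[method]
    # 1-based virtual index i + g from the formula in the Notes
    virtual = (q / 100) * (n - alpha - beta + 1) + alpha
    # shift to 0-based indexing; clamping to [0, n - 1] is an assumption
    virtual = min(max(virtual - 1, 0.0), n - 1)
    i = math.floor(virtual)   # lower neighbor
    g = virtual - i           # fractional part
    j = min(i + 1, n - 1)     # upper neighbor
    # interpolate between the two neighbors based on g
    return data[i] + g * (data[j] - data[i])


print(sketch_percentile(range(16), 25))           # 3.75
print(sketch_percentile(range(16), 25, "hazen"))  # 3.5
```

With the default method, the first call reproduces the 3.75 returned by arr.percentile_by(25) in the Examples section, since that array holds the values 0 to 15.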
References
- [1] R. J. Hyndman and Y. Fan, “Sample quantiles in statistical packages,” The American Statistician, 50(4), pp. 361-365, 1996.
Examples
>>> from larray import ndtest, X
>>> arr = ndtest((4, 4))
>>> arr
a\b  b0  b1  b2  b3
 a0   0   1   2   3
 a1   4   5   6   7
 a2   8   9  10  11
 a3  12  13  14  15
>>> arr.percentile_by(25)
3.75
>>> # along axis 'a'
>>> arr.percentile_by(25, 'a')
a    a0    a1    a2     a3
   0.75  4.75  8.75  12.75
>>> # along axis 'b'
>>> arr.percentile_by(25, 'b')
b   b0   b1   b2   b3
   3.0  4.0  5.0  6.0
>>> # several percentile values
>>> arr.percentile_by([25, 50, 75], 'b')
percentile\b   b0    b1    b2    b3
          25  3.0   4.0   5.0   6.0
          50  6.0   7.0   8.0   9.0
          75  9.0  10.0  11.0  12.0
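To make the "all axes except the given one(s)" rule concrete, the following pure-Python sketch recomputes the results above on nested lists. It does not use larray; `linear_percentile` is a hypothetical helper that applies only the default 'linear' method.

```python
import math


def linear_percentile(values, q):
    """q-th percentile with the 'linear' method (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    virtual = (q / 100) * (n - 1)   # 0-based virtual index
    i = math.floor(virtual)
    g = virtual - i
    j = min(i + 1, n - 1)
    return data[i] + g * (data[j] - data[i])


# the 4 x 4 test array: rows are a0..a3, columns are b0..b3
rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

# like percentile_by(25, 'a'): axis a is kept, axis b is aggregated
by_a = [linear_percentile(row, 25) for row in rows]

# like percentile_by(25, 'b'): axis b is kept, axis a is aggregated
by_b = [linear_percentile(col, 25) for col in zip(*rows)]

# like percentile_by(25): no axis is kept, everything is aggregated
overall = linear_percentile([v for row in rows for v in row], 25)

print(by_a)     # [0.75, 4.75, 8.75, 12.75]
print(by_b)     # [3.0, 4.0, 5.0, 6.0]
print(overall)  # 3.75
```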
Select some rows only
>>> arr.percentile_by(25, ['a0', 'a1'])
1.75
>>> # or equivalently
>>> # arr.percentile_by(25, 'a0,a1')
Split an axis in several parts
>>> arr.percentile_by(25, (['a0', 'a1'], ['a2', 'a3']))
a  a0,a1  a2,a3
    1.75   9.75
>>> # or equivalently
>>> # arr.percentile_by(25, 'a0,a1;a2,a3')
Same with renaming
>>> arr.percentile_by(25, (X.a['a0', 'a1'] >> 'a01', X.a['a2', 'a3'] >> 'a23'))
a   a01   a23
   1.75  9.75
>>> # or equivalently
>>> # arr.percentile_by(25, 'a0,a1>>a01;a2,a3>>a23')
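The group examples can be checked by hand with a similar pure-Python sketch, given here only as an illustration of the semantics: each group pools the values of its labels on axis a (together with the whole of axis b) before the percentile is taken. The `linear_percentile` helper and the `data` and `groups` dictionaries are hypothetical and are not part of larray.

```python
import math


def linear_percentile(values, q):
    """q-th percentile with the 'linear' method (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    virtual = (q / 100) * (n - 1)   # 0-based virtual index
    i = math.floor(virtual)
    g = virtual - i
    j = min(i + 1, n - 1)
    return data[i] + g * (data[j] - data[i])


# rows of the 4 x 4 test array, keyed by their label on axis a
data = {
    "a0": [0, 1, 2, 3],
    "a1": [4, 5, 6, 7],
    "a2": [8, 9, 10, 11],
    "a3": [12, 13, 14, 15],
}

# two groups named after the renaming example
groups = {"a01": ["a0", "a1"], "a23": ["a2", "a3"]}

# pool every value belonging to a group, then take its 25th percentile
result = {
    name: linear_percentile([v for label in labels for v in data[label]], 25)
    for name, labels in groups.items()
}

print(result)  # {'a01': 1.75, 'a23': 9.75}
```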