Indexing, Selecting and Assigning¶
Import the LArray library:
[2]:
from larray import *
Check the version of LArray:
[3]:
from larray import __version__
__version__
[3]:
'0.30'
Import the test array pop:
[4]:
# let's start by loading the example population array
pop = load_example_data('demography').pop
pop
[4]:
time geo age sex\nat BE FO
1991 BruCap 0 M 4182 2377
1991 BruCap 0 F 4052 2188
1991 BruCap 1 M 3904 2316
1991 BruCap 1 F 3769 2241
1991 BruCap 2 M 3790 2365
... ... ... ... ... ...
2016 Wal 118 F 0 0
2016 Wal 119 M 0 0
2016 Wal 119 F 0 0
2016 Wal 120 M 0 0
2016 Wal 120 F 0 0
Selecting (Subsets)¶
LArray lets you select a subset of an array either by labels or by indices (positions).
Selecting by Labels¶
To take a subset of an array using labels, use brackets [ ].
Let’s start by selecting a single element:
[5]:
# here we select the value associated with Belgian women
# of age 50 from Brussels region for the year 2015
pop[2015, 'BruCap', 50, 'F', 'BE']
[5]:
4813
Next, select a subset using slices and lists of labels:
[6]:
# here we select the subset associated with Belgian women of age 50, 51 and 52
# from Brussels region for the years 2010 to 2016
pop[2010:2016, 'BruCap', 50:52, 'F', 'BE']
[6]:
time\age 50 51 52
2010 4869 4811 4699
2011 5015 4860 4792
2012 4722 5014 4818
2013 4711 4727 5007
2014 4788 4702 4730
2015 4813 4767 4676
2016 4814 4792 4740
[7]:
# slice bounds are optional:
# if not given, start defaults to the first label and stop to the last one.
# Here we select all years starting from 2010
pop[2010:, 'BruCap', 50:52, 'F', 'BE']
[7]:
time\age 50 51 52
2010 4869 4811 4699
2011 5015 4860 4792
2012 4722 5014 4818
2013 4711 4727 5007
2014 4788 4702 4730
2015 4813 4767 4676
2016 4814 4792 4740
[8]:
# Slices can also have a step (defaults to 1), to take every Nth label
# Here we select all even years starting from 2010
pop[2010::2, 'BruCap', 50:52, 'F', 'BE']
[8]:
time\age 50 51 52
2010 4869 4811 4699
2012 4722 5014 4818
2014 4788 4702 4730
2016 4814 4792 4740
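The slicing rules used in the cells above (optional bounds, optional step, inclusive stop label) can be sketched in plain Python. This is only an illustration of the semantics, not LArray's implementation: `label_slice` is a hypothetical helper and the axis is modelled as a simple list of labels.

```python
def label_slice(labels, start=None, stop=None, step=1):
    """Select labels from start to stop (both inclusive), taking every step-th one.

    Hypothetical helper for illustration only; it is not part of LArray.
    """
    # a missing bound defaults to the first / last label of the axis
    first = 0 if start is None else labels.index(start)
    last = len(labels) - 1 if stop is None else labels.index(stop)
    # the stop label is included, hence the "+ 1" on the positional bound
    return labels[first:last + 1:step]


years = list(range(1991, 2017))            # same labels as the time axis of pop

print(label_slice(years, 2010, 2016))      # the seven years 2010 to 2016
print(label_slice(years, 2010))            # same result: stop defaults to the last label
print(label_slice(years, 2010, step=2))    # [2010, 2012, 2014, 2016]
```

The selected years match the row labels printed in cells [6] to [8].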
[9]:
# one can also use a list of labels to take non-contiguous labels.
# Here we select years 2008, 2010, 2013 and 2015
pop[[2008, 2010, 2013, 2015], 'BruCap', 50:52, 'F', 'BE']
[9]:
time\age 50 51 52
2008 4731 4735 4724
2010 4869 4811 4699
2013 4711 4727 5007
2015 4813 4767 4676
The order of the labels in the key does not matter either, so you usually do not have to remember axis positions during computations. The axis order only matters for the output.
[10]:
# the order of the labels in the key doesn't matter
pop['F', 'BE', 'BruCap', [2008, 2010, 2013, 2015], 50:52]
[10]:
time\age 50 51 52
2008 4731 4735 4724
2010 4869 4811 4699
2013 4711 4727 5007
2015 4813 4767 4676
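One way to picture why the order of the labels does not matter: each label can be looked up in every axis, and if exactly one axis contains it, the axis is known. The sketch below is a simplified model under that assumption, not LArray's actual code; `find_axis` is a hypothetical helper and only three of the five axes of pop are modelled.

```python
def find_axis(label, axes):
    """Return the name of the single axis that contains label.

    axes maps axis names to their list of labels.
    Hypothetical helper, for illustration only.
    """
    matches = [name for name, labels in axes.items() if label in labels]
    if not matches:
        raise ValueError(f"{label!r} is not a valid label for any axis")
    if len(matches) > 1:
        raise ValueError(f"{label!r} is ambiguous (valid in {', '.join(matches)})")
    return matches[0]


axes = {
    'geo': ['BruCap', 'Fla', 'Wal'],
    'sex': ['M', 'F'],
    'nat': ['BE', 'FO'],
}

# the key can be given in any order: each label identifies its own axis
key = ['F', 'BE', 'BruCap']
resolved = {find_axis(label, axes): label for label in key}
print(resolved)    # {'sex': 'F', 'nat': 'BE', 'geo': 'BruCap'}
```

In this model, a label that appears on more than one axis cannot be resolved and raises an error.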
Warning: Selecting by labels as above works well as long as there is no ambiguity. When two or more axes have common labels, the selection may raise an error. The solution is then to specify which axis the labels belong to.
[11]:
# let us now create an array with the same labels on several axes
age, weight, size = Axis('age=0..80'), Axis('weight=0..120'), Axis('size=0..200')
arr_ws = ndtest([age, weight, size])
[12]:
# let's try to select teenagers with size between 1 m 60 and 1 m 65 and weight up to 80 kg.
# In this case the subset is ambiguous and this results in an error:
arr_ws[10:18, :80, 160:165]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-139cd48d3ba8> in <module>
1 # let's try to select teenagers with size between 1 m 60 and 1 m 65 and weight > 80 kg.
2 # In this case the subset is ambiguous and this results in an error:
----> 3 arr_ws[10:18, :80, 160:165]
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/array.py in __getitem__(self, key, collapse_slices, translate_key)
2087 # FIXME: I have a huge problem with boolean axis labels + non points
2088 raw_broadcasted_key, res_axes, transpose_indices = self.axes._key_to_raw_and_axes(key, collapse_slices,
-> 2089 translate_key)
2090 res_data = data[raw_broadcasted_key]
2091 if res_axes:
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _key_to_raw_and_axes(self, key, collapse_slices, translate_key)
2803
2804 if translate_key:
-> 2805 key = self._translated_key(key)
2806 assert isinstance(key, tuple) and len(key) == self.ndim
2807
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _translated_key(self, key)
2763 """
2764 # any key -> (IGroup, IGroup, ...)
-> 2765 igroup_key = self._key_to_igroups(key)
2766
2767 # extract axis from Group keys
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _key_to_igroups(self, key)
2743
2744 # translate all keys to IGroup
-> 2745 return tuple(self._translate_axis_key(axis_key) for axis_key in key)
2746
2747 def _translated_key(self, key):
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in <genexpr>(.0)
2743
2744 # translate all keys to IGroup
-> 2745 return tuple(self._translate_axis_key(axis_key) for axis_key in key)
2746
2747 def _translated_key(self, key):
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _translate_axis_key(self, axis_key)
2683 return self._translate_axis_key_chunk(axis_key)
2684 else:
-> 2685 return self._translate_axis_key_chunk(axis_key)
2686
2687 def _key_to_igroups(self, key):
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _translate_axis_key_chunk(self, axis_key)
2615 valid_axes = ', '.join(a.name if a.name is not None else '{{{}}}'.format(self.index(a))
2616 for a in valid_axes)
-> 2617 raise ValueError('%s is ambiguous (valid in %s)' % (axis_key, valid_axes))
2618 return valid_axes[0].i[axis_pos_key]
2619
ValueError: slice(10, 18, None) is ambiguous (valid in age, weight, size)
[13]:
# the solution is simple: specify the axes on which you make a selection
arr_ws[age[10:18], weight[:80], size[160:165]]
[13]:
age weight\size 160 161 162 163 164 165
10 0 243370 243371 243372 243373 243374 243375
10 1 243571 243572 243573 243574 243575 243576
10 2 243772 243773 243774 243775 243776 243777
10 3 243973 243974 243975 243976 243977 243978
10 4 244174 244175 244176 244177 244178 244179
... ... ... ... ... ... ... ...
18 76 453214 453215 453216 453217 453218 453219
18 77 453415 453416 453417 453418 453419 453420
18 78 453616 453617 453618 453619 453620 453621
18 79 453817 453818 453819 453820 453821 453822
18 80 454018 454019 454020 454021 454022 454023
Ambiguous Cases - Specifying Axes Using The Special Variable X¶
When selecting, assigning or using aggregate functions, an axis can be referred to via the special variable X:
pop[X.age[:20]]
pop.sum(X.age)
This gives you access to the axes of the array you are manipulating. The main drawback of using X is that you lose the autocompletion available in many editors. It only works with non-anonymous axes whose names do not contain whitespace or special characters.
[14]:
# the previous example could also have been written as
arr_ws[X.age[10:18], X.weight[:80], X.size[160:165]]
[14]:
age weight\size 160 161 162 163 164 165
10 0 243370 243371 243372 243373 243374 243375
10 1 243571 243572 243573 243574 243575 243576
10 2 243772 243773 243774 243775 243776 243777
10 3 243973 243974 243975 243976 243977 243978
10 4 244174 244175 244176 244177 244178 244179
... ... ... ... ... ... ... ...
18 76 453214 453215 453216 453217 453218 453219
18 77 453415 453416 453417 453418 453419 453420
18 78 453616 453617 453618 453619 453620 453621
18 79 453817 453818 453819 453820 453821 453822
18 80 454018 454019 454020 454021 454022 454023
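The restrictions mentioned above (no autocompletion, names without whitespace or special characters) follow naturally if X builds axis references from attribute names on the fly. The toy classes below illustrate that idea only; `AxisRef`, `AxisFactory` and `X_toy` are hypothetical names and do not reflect how LArray implements X.

```python
class AxisRef:
    """Reference to an axis by name only (toy stand-in, not LArray's class)."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        # remember which axis the key applies to
        return (self.name, key)


class AxisFactory:
    """Creates an AxisRef for any attribute name, e.g. X_toy.age."""

    def __getattr__(self, name):
        # called for attributes that do not exist yet, so any valid
        # Python identifier can be used as an axis name
        return AxisRef(name)


X_toy = AxisFactory()

print(X_toy.age[10:18])     # ('age', slice(10, 18, None))
print(X_toy.weight[:80])    # ('weight', slice(None, 80, None))
```

Because the attributes are created dynamically, an editor has nothing to complete, and a name such as "my axis" cannot be written after the dot.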
Selecting by Indices¶
Sometimes it is more practical to use indices (positions) along an axis instead of labels. To do so, add the character i before the brackets: .i[indices]. As for selection with labels, you can use a single index, a slice or a list of indices. Indices can also be negative (-1 represents the last element of an axis).
Note: Remember that indices (positions) are always 0-based in Python. So the first element is at index 0, the second is at index 1, etc.
[15]:
# here we select the subset associated with Belgian women of age 50, 51 and 52
# from Brussels region for the first 3 years
pop[X.time.i[:3], 'BruCap', 50:52, 'F', 'BE']
[15]:
time\age 50 51 52
1991 3739 4138 4101
1992 3373 3665 4088
1993 3648 3335 3615
[16]:
# same but for the last 3 years
pop[X.time.i[-3:], 'BruCap', 50:52, 'F', 'BE']
[16]:
time\age 50 51 52
2014 4788 4702 4730
2015 4813 4767 4676
2016 4814 4792 4740
[17]:
# using a list of indices
pop[X.time.i[-9,-7,-4,-2], 'BruCap', 50:52, 'F', 'BE']
[17]:
time\age 50 51 52
2008 4731 4735 4724
2010 4869 4811 4699
2013 4711 4727 5007
2015 4813 4767 4676
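Positional selection follows ordinary Python indexing rules. As a quick check, the same positions applied to a plain list holding the labels of the time axis (1991 to 2016) give the years shown in cells [15] to [17]:

```python
years = list(range(1991, 2017))    # labels of the time axis, positions 0..25

first_three = years[:3]                              # positions 0, 1, 2
last_three = years[-3:]                              # the three last positions
picked = [years[i] for i in (-9, -7, -4, -2)]        # negative positions count from the end

print(first_three)    # [1991, 1992, 1993]
print(last_three)     # [2014, 2015, 2016]
print(picked)         # [2008, 2010, 2013, 2015]
```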
Warning: The end index (position) is EXCLUSIVE while the end label is INCLUSIVE.
[18]:
# with labels (3 is included)
pop[2015, 'BruCap', X.age[:3], 'F', 'BE']
[18]:
age 0 1 2 3
6020 5882 6023 5861
[19]:
# with indices (3 is out)
pop[2015, 'BruCap', X.age.i[:3], 'F', 'BE']
[19]:
age 0 1 2
6020 5882 6023
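The difference is easy to overlook on the age axis of pop, where labels and positions coincide (label 0 sits at position 0). With an axis whose labels start at 100, the two conventions are easier to tell apart. A minimal sketch with a plain list standing in for the axis:

```python
ages = [100, 101, 102, 103, 104, 105]    # labels of a small age axis

# label-based: "up to label 102" includes 102 itself
by_label = ages[:ages.index(102) + 1]

# position-based: "up to position 2" stops before position 2
by_position = ages[:2]

print(by_label)       # [100, 101, 102]
print(by_position)    # [100, 101]
```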
You can use .i[] selection directly on an array instead of on its axes. In this context, if you want to select a subset of the first and third axes for example, you must use a full slice : for the second one.
[20]:
# here we select the last year and first 3 ages
# equivalent to: pop.i[-1, :, :3, :, :]
pop.i[-1, :, :3]
[20]:
geo age sex\nat BE FO
BruCap 0 M 6155 3104
BruCap 0 F 5900 2817
BruCap 1 M 6165 3068
BruCap 1 F 5916 2946
BruCap 2 M 6053 2918
BruCap 2 F 5736 2776
Fla 0 M 29993 3717
Fla 0 F 28483 3587
Fla 1 M 31292 3716
Fla 1 F 29721 3575
Fla 2 M 31718 3597
Fla 2 F 30353 3387
Wal 0 M 17869 1472
Wal 0 F 17242 1454
Wal 1 M 18820 1432
Wal 1 F 17604 1443
Wal 2 M 19076 1444
Wal 2 F 18189 1358
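The comment in cell [20] states that the short key is equivalent to one with full slices on the remaining axes. That completion rule can be sketched as follows; `pad_key` is a hypothetical helper written for this illustration, not an LArray function.

```python
def pad_key(key, ndim):
    """Complete a positional key with full slices for the missing trailing axes.

    Hypothetical helper for illustration only.
    """
    key = tuple(key)
    return key + (slice(None),) * (ndim - len(key))


# pop has 5 axes: time, geo, age, sex, nat
# (-1, :, :3) written with explicit slice objects
full_key = pad_key((-1, slice(None), slice(None, 3)), ndim=5)

print(len(full_key))    # 5
print(full_key[3:])     # two full slices, one for sex and one for nat
```

Axes that come before a selected axis still need an explicit full slice, since positions in the key are matched to axes in order.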
Using Groups In Selections¶
[21]:
teens = pop.age[10:20]
pop[2015, 'BruCap', teens, 'F', 'BE']
[21]:
age 10 11 12 13 14 15 16 17 18 19 20
5124 4865 4758 4807 4587 4593 4429 4466 4517 4461 4464
Assigning subsets¶
Assigning A Value¶
Assign a value to a subset
[22]:
# let's take a smaller array
pop = load_example_data('demography').pop[2016, 'BruCap', 100:105]
pop2 = pop
pop2
[22]:
age sex\nat BE FO
100 M 12 0
100 F 60 3
101 M 12 2
101 F 66 5
102 M 8 0
102 F 26 1
103 M 2 1
103 F 17 2
104 M 2 1
104 F 14 0
105 M 0 0
105 F 2 2
[23]:
# set all data corresponding to age >= 102 to 0
pop2[102:] = 0
pop2
[23]:
age sex\nat BE FO
100 M 12 0
100 F 60 3
101 M 12 2
101 F 66 5
102 M 0 0
102 F 0 0
103 M 0 0
103 F 0 0
104 M 0 0
104 F 0 0
105 M 0 0
105 F 0 0
One very important gotcha though…
Warning: Modifying a slice of an array in-place like we did above should be done with care, otherwise you could get unexpected effects. The reason is that taking a slice subset of an array does not return a copy of that array, but rather a view on that array. To avoid such behavior, use the .copy() method.
Remember:
taking a slice subset of an array is extremely fast (no data is copied)
if one modifies that subset in-place, one also modifies the original array
.copy() returns a copy of the subset (costs time and memory) but allows you to change the subset without modifying the original array at the same time
[24]:
# indeed, data from the original array have also changed
pop
[24]:
age sex\nat BE FO
100 M 12 0
100 F 60 3
101 M 12 2
101 F 66 5
102 M 0 0
102 F 0 0
103 M 0 0
103 F 0 0
104 M 0 0
104 F 0 0
105 M 0 0
105 F 0 0
[25]:
# the right way
pop = load_example_data('demography').pop[2016, 'BruCap', 100:105]
pop2 = pop.copy()
pop2[102:] = 0
pop2
[25]:
age sex\nat BE FO
100 M 12 0
100 F 60 3
101 M 12 2
101 F 66 5
102 M 0 0
102 F 0 0
103 M 0 0
103 F 0 0
104 M 0 0
104 F 0 0
105 M 0 0
105 F 0 0
[26]:
# this time, data from the original array have not changed
pop
[26]:
age sex\nat BE FO
100 M 12 0
100 F 60 3
101 M 12 2
101 F 66 5
102 M 8 0
102 F 26 1
103 M 2 1
103 F 17 2
104 M 2 1
104 F 14 0
105 M 0 0
105 F 2 2
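The distinction between a second name for the same object, a view and a copy can be reproduced with standard-library types alone. The sketch below uses `array.array` and `memoryview` as stand-ins; it illustrates the general mechanism and makes no claim about LArray's internals. Note that a plain assignment such as `pop2 = pop` in cell [22] binds a second name to the same object, which has the same visible effect as a view.

```python
from array import array

original = array('i', [12, 60, 12, 66, 8, 26])

alias = original                      # same object under a second name
view = memoryview(original)[4:]       # view on the last two items, no data copied
independent = array('i', original)    # a real copy with its own memory

view[0] = 0                           # writes through to the original
alias[5] = 0                          # so does any change made via the alias

print(original.tolist())              # [12, 60, 12, 66, 0, 0]
print(independent.tolist())           # [12, 60, 12, 66, 8, 26]
```

Only the copy keeps the initial values, at the cost of duplicating the data.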
Assigning Arrays And Broadcasting¶
Instead of a value, we can also assign an array to a subset. In that case, that array can have fewer axes than the target, but those that are present must be compatible with the subset being targeted.
[27]:
sex, nat = Axis('sex=M,F'), Axis('nat=BE,FO')
new_value = LArray([[1, -1], [2, -2]], [sex, nat])
new_value
[27]:
sex\nat BE FO
M 1 -1
F 2 -2
[28]:
# this assigns 1, -1 to Belgian, Foreigner men
# and 2, -2 to Belgian, Foreigner women for all
# people aged 102 and older
pop[102:] = new_value
pop
[28]:
age sex\nat BE FO
100 M 12 0
100 F 60 3
101 M 12 2
101 F 66 5
102 M 1 -1
102 F 2 -2
103 M 1 -1
103 F 2 -2
104 M 1 -1
104 F 2 -2
105 M 1 -1
105 F 2 -2
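Broadcasting here means that the assigned array, which has no age axis, is repeated for every selected age. The following plain-Python model shows the idea with dictionaries keyed by labels; the initial value 9 is an arbitrary placeholder, not data from the example array.

```python
sexes, nats = ['M', 'F'], ['BE', 'FO']
ages = [100, 101, 102, 103]

# target: one value per (age, sex, nat) combination (placeholder value 9)
target = {(age, sex, nat): 9 for age in ages for sex in sexes for nat in nats}

# value without an age axis: one number per (sex, nat) combination
new_value = {('M', 'BE'): 1, ('M', 'FO'): -1, ('F', 'BE'): 2, ('F', 'FO'): -2}

# broadcasting: the missing age axis is filled by repeating
# the value for every selected age (102 and older)
for age in ages:
    if age >= 102:
        for (sex, nat), value in new_value.items():
            target[age, sex, nat] = value

print(target[103, 'F', 'FO'])    # -2
print(target[101, 'F', 'FO'])    # 9 (not selected, so untouched)
```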
Warning: The array being assigned must have axes compatible with the target subset (i.e. same axis names and same labels).
[29]:
# assume we define the following array with shape 3 x 2 x 2
new_value = zeros(['age=100..102', sex, nat])
new_value
[29]:
age sex\nat BE FO
100 M 0.0 0.0
100 F 0.0 0.0
101 M 0.0 0.0
101 F 0.0 0.0
102 M 0.0 0.0
102 F 0.0 0.0
[30]:
# now let's try to assign the previous array to the subset from age 103 to 105
pop[103:105] = new_value
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-30-63d0ef0af080> in <module>
1 # now let's try to assign the previous array in a subset from age 103 to 105
----> 2 pop[103:105] = new_value
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/array.py in __setitem__(self, key, value, collapse_slices, translate_key)
2107 # TODO: the check_compatible should be included in broadcast_with
2108 value = value.broadcast_with(target_axes)
-> 2109 value.axes.check_compatible(target_axes)
2110
2111 # replace incomprehensible error message "could not broadcast input array from shape XX into shape YY"
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in check_compatible(self, axes)
1983 local_axis = self.get_by_pos(axis, i)
1984 if not local_axis.iscompatible(axis):
-> 1985 raise ValueError("incompatible axes:\n{!r}\nvs\n{!r}".format(axis, local_axis))
1986
1987 # XXX: deprecate method (functionality is duplicated in union)?
ValueError: incompatible axes:
Axis([103, 104, 105], 'age')
vs
Axis([100, 101, 102], 'age')
[31]:
# but this works
pop[100:102] = new_value
pop
[31]:
age sex\nat BE FO
100 M 0 0
100 F 0 0
101 M 0 0
101 F 0 0
102 M 0 0
102 F 0 0
103 M 1 -1
103 F 2 -2
104 M 1 -1
104 F 2 -2
105 M 1 -1
105 F 2 -2
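The compatibility rule behind the error above can be modelled as a label-by-label comparison of the axes that both sides share. This is a simplified sketch under that assumption; `check_compatible` and `is_compatible` are hypothetical helpers written for this illustration.

```python
def check_compatible(target_axes, value_axes):
    """Raise ValueError if an axis present on both sides has different labels.

    Both arguments map axis names to lists of labels.
    Hypothetical helper, for illustration only.
    """
    for name, labels in value_axes.items():
        if name in target_axes and target_axes[name] != labels:
            raise ValueError(
                f"incompatible axes: {name} {target_axes[name]} vs {labels}")


value_axes = {'age': [100, 101, 102], 'sex': ['M', 'F'], 'nat': ['BE', 'FO']}


def is_compatible(target_ages):
    """Check the value against a target subset with the given age labels."""
    target_axes = {'age': target_ages, 'sex': ['M', 'F'], 'nat': ['BE', 'FO']}
    try:
        check_compatible(target_axes, value_axes)
        return True
    except ValueError:
        return False


print(is_compatible([103, 104, 105]))    # False: the age labels differ
print(is_compatible([100, 101, 102]))    # True
```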
Boolean Filtering¶
Boolean filtering can be used to extract subsets.
[32]:
# let's focus on the population living in Brussels during the year 2016
pop = load_example_data('demography').pop[2016, 'BruCap']
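# here we try to select males aged 5 or less and females aged 10 or less
# (the sex labels are 'M' and 'F', so the 'H' test below matches nothing
# and only females appear in the result)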
subset = pop[((X.sex == 'H') & (X.age <= 5)) | ((X.sex == 'F') & (X.age <= 10))]
subset
[32]:
sex_age\nat BE FO
F_0 5900 2817
F_1 5916 2946
F_2 5736 2776
F_3 5883 2734
F_4 5784 2523
F_5 5780 2521
F_6 5759 2290
F_7 5518 2234
F_8 5474 2066
F_9 5354 1896
F_10 5200 1785
Note: Be aware that after boolean filtering, several axes may have merged.
[33]:
# 'age' and 'sex' axes have been merged together
subset.info
[33]:
11 x 2
sex_age [11]: 'F_0' 'F_1' 'F_2' ... 'F_8' 'F_9' 'F_10'
nat [2]: 'BE' 'FO'
dtype: int64
memory used: 176 bytes
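A filter that combines conditions on two axes generally keeps an irregular set of (sex, age) pairs, which no longer forms a full sex by age rectangle. One way to represent such a result is a single axis with one combined label per kept pair. The sketch below models this in plain Python; the label order is an assumption of the sketch and may differ from what LArray produces.

```python
sexes = ['M', 'F']
ages = range(0, 12)


def keep(sex, age):
    # males aged 5 or less, females aged 10 or less
    return (sex == 'M' and age <= 5) or (sex == 'F' and age <= 10)


# the kept pairs are listed one by one along a single combined axis
merged_labels = [f"{sex}_{age}" for sex in sexes for age in ages if keep(sex, age)]

print(len(merged_labels))     # 17 (6 for males + 11 for females)
print(merged_labels[:3])      # ['M_0', 'M_1', 'M_2']

# with a test on 'H', which is not a label of the sex axis, no male is kept
only_f = [f"{sex}_{age}" for sex in sexes for age in ages
          if (sex == 'H' and age <= 5) or (sex == 'F' and age <= 10)]
print(len(only_f))            # 11
```

The second count matches the 11 labels of the sex_age axis reported by subset.info.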
This may not be what you want, because previous selections on the merged axes are no longer valid:
[34]:
# now let's try to calculate the proportion of females aged 10 or less
subset['F'].sum() / pop['F'].sum()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-d9f443e5c9e1> in <module>
1 # now let's try to calculate the proportion of females with age less than 10
----> 2 subset['F'].sum() / pop['F'].sum()
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/array.py in __getitem__(self, key, collapse_slices, translate_key)
2087 # FIXME: I have a huge problem with boolean axis labels + non points
2088 raw_broadcasted_key, res_axes, transpose_indices = self.axes._key_to_raw_and_axes(key, collapse_slices,
-> 2089 translate_key)
2090 res_data = data[raw_broadcasted_key]
2091 if res_axes:
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _key_to_raw_and_axes(self, key, collapse_slices, translate_key)
2803
2804 if translate_key:
-> 2805 key = self._translated_key(key)
2806 assert isinstance(key, tuple) and len(key) == self.ndim
2807
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _translated_key(self, key)
2763 """
2764 # any key -> (IGroup, IGroup, ...)
-> 2765 igroup_key = self._key_to_igroups(key)
2766
2767 # extract axis from Group keys
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _key_to_igroups(self, key)
2743
2744 # translate all keys to IGroup
-> 2745 return tuple(self._translate_axis_key(axis_key) for axis_key in key)
2746
2747 def _translated_key(self, key):
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in <genexpr>(.0)
2743
2744 # translate all keys to IGroup
-> 2745 return tuple(self._translate_axis_key(axis_key) for axis_key in key)
2746
2747 def _translated_key(self, key):
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _translate_axis_key(self, axis_key)
2683 return self._translate_axis_key_chunk(axis_key)
2684 else:
-> 2685 return self._translate_axis_key_chunk(axis_key)
2686
2687 def _key_to_igroups(self, key):
~/checkouts/readthedocs.org/user_builds/larray/conda/0.30/lib/python3.6/site-packages/larray-0.30-py3.6.egg/larray/core/axis.py in _translate_axis_key_chunk(self, axis_key)
2609 continue
2610 if not valid_axes:
-> 2611 raise ValueError("%s is not a valid label for any axis" % axis_key)
2612 elif len(valid_axes) > 1:
2613 # TODO: make an AxisCollection.display_name(axis) method out of this
ValueError: F is not a valid label for any axis
Therefore, it is sometimes more useful not to select, but rather to set non-matching elements to 0 (or another value):
[35]:
subset = pop.copy()
subset[((X.sex == 'F') & (X.age > 10))] = 0
subset['F', :20]
[35]:
age\nat BE FO
0 5900 2817
1 5916 2946
2 5736 2776
3 5883 2734
4 5784 2523
5 5780 2521
6 5759 2290
7 5518 2234
8 5474 2066
9 5354 1896
10 5200 1785
11 0 0
12 0 0
13 0 0
14 0 0
15 0 0
16 0 0
17 0 0
18 0 0
19 0 0
20 0 0
[36]:
# now we can calculate the proportion of females aged 10 or less
subset['F'].sum() / pop['F'].sum()
[36]:
0.14618110657051941
Boolean filtering can also mix axes and arrays. The example above could also have been written as:
[37]:
age_limit = sequence('sex=M,F', initial=5, inc=5)
age_limit
[37]:
sex M F
5 10
[38]:
age = pop.axes['age']
(age <= age_limit)[:20]
[38]:
age\sex M F
0 True True
1 True True
2 True True
3 True True
4 True True
5 True True
6 False True
7 False True
8 False True
9 False True
10 False True
11 False False
12 False False
13 False False
14 False False
15 False False
16 False False
17 False False
18 False False
19 False False
20 False False
[39]:
subset = pop.copy()
subset[X.age > age_limit] = 0
subset['F'].sum() / pop['F'].sum()
[39]:
0.14618110657051941
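Comparing an axis with an array defined on another axis yields one boolean per combination of labels, as in the output of cell [38]. A plain-Python model of that comparison, using a dictionary for the per-sex limits:

```python
age_limit = {'M': 5, 'F': 10}    # one limit per label of the sex axis
ages = range(0, 21)

# one boolean per (age, sex) pair: is the age within the limit for that sex?
mask = {(age, sex): age <= limit
        for age in ages
        for sex, limit in age_limit.items()}

print(mask[5, 'M'], mask[6, 'M'])      # True False
print(mask[10, 'F'], mask[11, 'F'])    # True False
```

These four values agree with the corresponding rows of the table printed by cell [38].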
Finally, you can choose to filter on data instead of axes:
[40]:
# let's focus on females aged 90 to 110
subset = pop['F', 90:110].copy()
subset
[40]:
age\nat BE FO
90 1477 136
91 1298 105
92 1141 78
93 906 74
94 739 65
95 566 53
96 327 25
97 171 21
98 135 9
99 92 8
100 60 3
101 66 5
102 26 1
103 17 2
104 14 0
105 2 2
106 3 3
107 1 2
108 1 0
109 0 0
110 0 0
[41]:
# here we set to 0 all data < 10
subset[subset < 10] = 0
subset
[41]:
age\nat BE FO
90 1477 136
91 1298 105
92 1141 78
93 906 74
94 739 65
95 566 53
96 327 25
97 171 21
98 135 0
99 92 0
100 60 0
101 66 0
102 26 0
103 17 0
104 14 0
105 0 0
106 0 0
107 0 0
108 0 0
109 0 0
110 0 0
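Filtering on data tests each value rather than each label. The same replacement can be checked by hand on a few rows copied from the output of cell [40], here stored as (BE, FO) pairs keyed by age:

```python
# a few rows from the output of cell [40]: age -> [BE, FO]
subset = {98: [135, 9], 99: [92, 8], 104: [14, 0], 105: [2, 2]}

# keep values of 10 or more, replace the others by 0
filtered = {age: [v if v >= 10 else 0 for v in row]
            for age, row in subset.items()}

print(filtered[98])     # [135, 0]
print(filtered[105])    # [0, 0]
```

The results agree with the rows for ages 98 and 105 in the output of cell [41].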