Indexing, Selecting and Assigning
Import the LArray library:
[1]:
from larray import *
Import the test array population:
[2]:
# let's start by loading the 'population' example array
population = load_example_data('demography_eurostat').population
population
[2]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5569264 5589272
Belgium Female 5665118 5687048 5713206 5741853 5762455
France Male 31772665 32045129 32174258 32247386 32318973
France Female 33827685 34120851 34283895 34391005 34485148
Germany Male 39380976 39556923 39835457 40514123 40697118
Germany Female 41142770 41210540 41362080 41661561 41824535
Selecting (Subsets)
The Array class allows you to select a subset either by labels or by indices (positions).
Selecting by Labels
To take a subset of an array using labels, use brackets [ ].
Let’s start by selecting a single element:
[3]:
population['Belgium', 'Female', 2017]
[3]:
5762455
As long as there is no ambiguity (i.e. no two axes share one or more labels), the order of the labels inside the brackets does not matter. You therefore usually do not need to remember axis positions during computations; axis order only matters for output.
[4]:
# the order of the labels doesn't matter
population['Female', 2017, 'Belgium']
[4]:
5762455
Selecting a subset is done by using slices or lists of labels:
[5]:
population[['Belgium', 'Germany'], 2014:2016]
[5]:
country gender\time 2014 2015 2016
Belgium Male 5493792 5524068 5569264
Belgium Female 5687048 5713206 5741853
Germany Male 39556923 39835457 40514123
Germany Female 41210540 41362080 41661561
Slice bounds are optional: if omitted, start defaults to the first label and stop to the last one.
[6]:
# select all years starting from 2015
population[2015:]
[6]:
country gender\time 2015 2016 2017
Belgium Male 5524068 5569264 5589272
Belgium Female 5713206 5741853 5762455
France Male 32174258 32247386 32318973
France Female 34283895 34391005 34485148
Germany Male 39835457 40514123 40697118
Germany Female 41362080 41661561 41824535
[7]:
# select all years up to and including 2015
population[:2015]
[7]:
country gender\time 2013 2014 2015
Belgium Male 5472856 5493792 5524068
Belgium Female 5665118 5687048 5713206
France Male 31772665 32045129 32174258
France Female 33827685 34120851 34283895
Germany Male 39380976 39556923 39835457
Germany Female 41142770 41210540 41362080
Slices can also have a step (which defaults to 1), to take every Nth label:
[8]:
# select all even years starting from 2014
population[2014::2]
[8]:
country gender\time 2014 2016
Belgium Male 5493792 5569264
Belgium Female 5687048 5741853
France Male 32045129 32247386
France Female 34120851 34391005
Germany Male 39556923 40514123
Germany Female 41210540 41661561
Warning: Selecting by labels as in the examples above works well as long as there is no ambiguity. When two or more axes have labels in common, the selection raises an error. The solution is to specify which axis each label belongs to.
[9]:
immigration = load_example_data('demography_eurostat').immigration
# the 'immigration' array has two axes (country and citizenship) which share the same labels
immigration
[9]:
country citizenship gender\time 2013 2014 2015 2016 2017
Belgium Belgium Male 8822 10512 11378 11055 11082
Belgium Belgium Female 5727 6301 6486 6560 6454
Belgium Luxembourg Male 102 117 105 130 110
Belgium Luxembourg Female 117 123 114 108 118
Belgium Netherlands Male 4185 4222 4183 4199 4138
Belgium Netherlands Female 3737 3844 3942 3664 3632
Luxembourg Belgium Male 896 937 880 762 781
Luxembourg Belgium Female 574 655 622 558 575
Luxembourg Luxembourg Male 694 722 660 740 650
Luxembourg Luxembourg Female 607 586 535 591 549
Luxembourg Netherlands Male 160 165 147 141 167
Luxembourg Netherlands Female 92 97 85 94 119
Netherlands Belgium Male 1063 1141 1113 1364 1493
Netherlands Belgium Female 980 1071 1181 1340 1449
Netherlands Luxembourg Male 23 43 59 70 83
Netherlands Luxembourg Female 24 34 46 60 97
Netherlands Netherlands Male 19374 20037 21119 22707 23750
Netherlands Netherlands Female 16945 17411 18084 19815 20894
[10]:
# LArray doesn't use the position of the labels inside the brackets
# to determine the corresponding axes. Instead, LArray tries to guess the
# corresponding axis for each label, whatever its position.
# If a label is shared by two or more axes, LArray cannot
# choose between the possible axes and raises an error.
try:
    immigration['Belgium', 'Netherlands']
except Exception as e:
    print(type(e).__name__, ':', e)
ValueError : 'Belgium' is ambiguous, it is valid in the following axes:
country [3]: 'Belgium' 'Luxembourg' 'Netherlands'
citizenship [3]: 'Belgium' 'Luxembourg' 'Netherlands'
[11]:
# the solution is simple: specify the axes on which you make the selection
immigration[immigration.country['Belgium'], immigration.citizenship['Netherlands']]
[11]:
gender\time 2013 2014 2015 2016 2017
Male 4185 4222 4183 4199 4138
Female 3737 3844 3942 3664 3632
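The ambiguity rule above can be pictured with a few lines of plain Python. This is only a simplified, hypothetical model of label-to-axis resolution: the function `find_axis` and the dictionaries of axes are illustrative names, not part of LArray, and the real library does considerably more.

```python
# Hypothetical, simplified model of label-to-axis resolution (not LArray's code).
def find_axis(label, axes):
    """Return the name of the single axis containing `label`.

    `axes` maps axis names to lists of labels.
    """
    matches = [name for name, labels in axes.items() if label in labels]
    if not matches:
        raise ValueError(f"{label!r} is not a valid label for any axis")
    if len(matches) > 1:
        raise ValueError(f"{label!r} is ambiguous, it is valid in the following axes: "
                         + ", ".join(matches))
    return matches[0]


population_axes = {
    'country': ['Belgium', 'France', 'Germany'],
    'gender': ['Male', 'Female'],
    'time': [2013, 2014, 2015, 2016, 2017],
}
immigration_axes = {
    'country': ['Belgium', 'Luxembourg', 'Netherlands'],
    'citizenship': ['Belgium', 'Luxembourg', 'Netherlands'],
    'gender': ['Male', 'Female'],
    'time': [2013, 2014, 2015, 2016, 2017],
}

# Unique labels resolve to their axis regardless of the order in which they are given.
print([find_axis(label, population_axes) for label in ('Female', 2017, 'Belgium')])

# A label shared by two axes cannot be resolved without naming the axis.
try:
    find_axis('Belgium', immigration_axes)
except ValueError as e:
    print(type(e).__name__, ':', e)
```

In this model, naming the axis explicitly (as in the cell above) simply bypasses the search, which is why it removes the ambiguity.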
Ambiguous Cases - Specifying Axes Using The Special Variable X
When selecting, assigning or using aggregate functions, an axis can be referred to via the special variable X:
population[X.time[2015:]]
population.sum(X.time)
This gives you access to the axes of the array you are manipulating. The main drawback of using X is that you lose the autocompletion available in many editors. It only works with non-anonymous axes whose names contain no whitespace or special characters.
[12]:
# the previous example can also be written as
immigration[X.country['Belgium'], X.citizenship['Netherlands']]
[12]:
gender\time 2013 2014 2015 2016 2017
Male 4185 4222 4183 4199 4138
Female 3737 3844 3942 3664 3632
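The two limitations of X mentioned above follow naturally if X is thought of as a factory of deferred axis references. The sketch below is a hypothetical, much simplified model (the classes `AxisRef` and `AxisRefFactory` and the name `X_like` are invented for illustration and are not LArray's implementation): the reference is created from the attribute name alone and is only matched against real axes later.

```python
# Hypothetical, simplified model of a deferred axis reference (not LArray's code).
class AxisRef:
    def __init__(self, name):
        self.name = name

    def evaluate(self, axes):
        # `axes` maps axis names to label lists; the lookup only happens here.
        if self.name not in axes:
            raise KeyError(f"axis '{self.name}' not found in {list(axes)}")
        return axes[self.name]


class AxisRefFactory:
    def __getattr__(self, name):
        # Any attribute name is accepted: nothing is known about the array yet.
        return AxisRef(name)


X_like = AxisRefFactory()  # stand-in for X, named differently on purpose

axes = {'country': ['Belgium', 'France', 'Germany'],
        'time': [2013, 2014, 2015, 2016, 2017]}

ref = X_like.time          # no array is involved at this point
print(ref.evaluate(axes))  # resolved against a concrete set of axes
```

Because the attribute is created on demand, an editor has nothing to autocomplete, and because attribute syntax requires a valid Python identifier, an axis name containing whitespace cannot be written this way.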
Selecting by Indices
Sometimes it is more practical to use indices (positions) along the axis instead of labels. To do so, add the character i before the brackets: .i[indices]. As for selection with labels, you can use a single index, a slice or a list of indices. Indices can also be negative (-1 represents the last element of an axis).
Note: Remember that indices (positions) are always 0-based in Python. So the first element is at index 0, the second is at index 1, etc.
[13]:
# select the last year
population[X.time.i[-1]]
[13]:
country\gender Male Female
Belgium 5589272 5762455
France 32318973 34485148
Germany 40697118 41824535
[14]:
# same but for the last 3 years
population[X.time.i[-3:]]
[14]:
country gender\time 2015 2016 2017
Belgium Male 5524068 5569264 5589272
Belgium Female 5713206 5741853 5762455
France Male 32174258 32247386 32318973
France Female 34283895 34391005 34485148
Germany Male 39835457 40514123 40697118
Germany Female 41362080 41661561 41824535
[15]:
# using a list of indices
population[X.time.i[0, 2, 4]]
[15]:
country gender\time 2013 2015 2017
Belgium Male 5472856 5524068 5589272
Belgium Female 5665118 5713206 5762455
France Male 31772665 32174258 32318973
France Female 33827685 34283895 34485148
Germany Male 39380976 39835457 40697118
Germany Female 41142770 41362080 41824535
Warning: The end index (position) is EXCLUSIVE while the end label is INCLUSIVE.
[16]:
year = 2015
# with labels
population[X.time[:year]]
[16]:
country gender\time 2013 2014 2015
Belgium Male 5472856 5493792 5524068
Belgium Female 5665118 5687048 5713206
France Male 31772665 32045129 32174258
France Female 33827685 34120851 34283895
Germany Male 39380976 39556923 39835457
Germany Female 41142770 41210540 41362080
[17]:
# with indices (i.e. using the .i[indices] syntax)
index_year = population.time.index(year)
population[X.time.i[:index_year]]
[17]:
country gender\time 2013 2014
Belgium Male 5472856 5493792
Belgium Female 5665118 5687048
France Male 31772665 32045129
France Female 33827685 34120851
Germany Male 39380976 39556923
Germany Female 41142770 41210540
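The off-by-one difference between the two cells above can be checked on a plain Python list holding the time labels. The helper `label_stop_to_index_stop` is a hypothetical name introduced only for this sketch; it is not an LArray function.

```python
years = [2013, 2014, 2015, 2016, 2017]  # labels of the time axis used above


def label_stop_to_index_stop(labels, stop_label):
    """Exclusive positional stop that selects up to and including `stop_label`."""
    return labels.index(stop_label) + 1


year = 2015
index_year = years.index(year)

# The positional stop is exclusive: 2015 itself is left out.
print(years[:index_year])                             # [2013, 2014]

# Adding 1 to the position reproduces the inclusive label slice [:2015].
print(years[:label_stop_to_index_stop(years, year)])  # [2013, 2014, 2015]
```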
You can use .i[] selection directly on the array instead of on its axes. In this context, if you want to select a subset of the first and third axes for example, you must use a full slice : for the second one.
[18]:
# select first country and last three years
population.i[0, :, -3:]
[18]:
gender\time 2015 2016 2017
Male 5524068 5569264 5589272
Female 5713206 5741853 5762455
Using Groups In Selections
[19]:
even_years = population.time[2014::2]
population[even_years]
[19]:
country gender\time 2014 2016
Belgium Male 5493792 5569264
Belgium Female 5687048 5741853
France Male 32045129 32247386
France Female 34120851 34391005
Germany Male 39556923 40514123
Germany Female 41210540 41661561
Boolean Filtering
Boolean filtering can be used to extract subsets. Filtering can be done on axes:
[20]:
# select even years
population[X.time % 2 == 0]
[20]:
country gender\time 2014 2016
Belgium Male 5493792 5569264
Belgium Female 5687048 5741853
France Male 32045129 32247386
France Female 34120851 34391005
Germany Male 39556923 40514123
Germany Female 41210540 41661561
or data:
[21]:
# select population for the year 2017
population_2017 = population[2017]
# select all data with a value greater than 30 million
population_2017[population_2017 > 30e6]
[21]:
country_gender France_Male France_Female Germany_Male Germany_Female
32318973 34485148 40697118 41824535
Note: Be aware that after boolean filtering, several axes may have merged.
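Why axes merge after filtering on data can be illustrated without LArray. In general, the cells that pass a value-based filter do not form a rectangular block, so each remaining cell has to be identified by its combined coordinates. The sketch below uses a plain dictionary with the 2017 values shown earlier; the underscore-joined labels mimic the output above but the code is only an illustration, not the library's mechanism.

```python
# Values for 2017 copied from the 'population' array shown earlier.
population_2017 = {
    ('Belgium', 'Male'): 5589272, ('Belgium', 'Female'): 5762455,
    ('France', 'Male'): 32318973, ('France', 'Female'): 34485148,
    ('Germany', 'Male'): 40697118, ('Germany', 'Female'): 41824535,
}

# Keep the cells above 30 million; each one is labelled by its (country, gender)
# pair joined into a single label, i.e. a single combined axis.
filtered = {
    '_'.join(key): value
    for key, value in population_2017.items()
    if value > 30e6
}
print(filtered)
```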
Arrays can also be used to create boolean filters:
[22]:
start_year = Array([2015, 2016, 2017], axes=population.country)
start_year
[22]:
country Belgium France Germany
2015 2016 2017
[23]:
population[X.time >= start_year]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[23], line 1
----> 1 population[X.time >= start_year]
(intermediate frames in larray/core/expr.py and larray/core/axis.py omitted)
KeyError: "axis 'time' not found in {country}"
Note: In this build of the documentation (LArray 0.34.4), the cell above fails. Because start_year is an Array, X.time is evaluated eagerly against the axes of start_year, which only has a country axis, hence the KeyError. Referring to the axis through the array itself (population.time >= start_year) avoids that lookup and should build the intended filter.
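The comparison with start_year is meant to pair each country with its own first year. The plain-Python sketch below spells out that intent, namely which (country, year) cells such a filter is expected to keep. It is an illustration built on dictionaries, not LArray code, and the names `mask` and `selected` are chosen for this example only.

```python
years = [2013, 2014, 2015, 2016, 2017]
start_year = {'Belgium': 2015, 'France': 2016, 'Germany': 2017}

# For each country, a cell is kept when its year is at or after
# that country's own start year.
mask = {
    (country, year): year >= first
    for country, first in start_year.items()
    for year in years
}
selected = [key for key, keep in mask.items() if keep]
print(selected)
```

The mask depends on both the country and the time axes, which is why it can only be built from an array and an axis together.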
Iterating over an axis
Iterating over an axis is straightforward:
[24]:
for year in population.time:
    print(year)
2013
2014
2015
2016
2017
Assigning subsets
Assigning A Value
Assigning a value to a subset is simple:
[25]:
population[2017] = 0
population
[25]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5569264 0
Belgium Female 5665118 5687048 5713206 5741853 0
France Male 31772665 32045129 32174258 32247386 0
France Female 33827685 34120851 34283895 34391005 0
Germany Male 39380976 39556923 39835457 40514123 0
Germany Female 41142770 41210540 41362080 41661561 0
Now, let’s store a subset in a new variable and modify it:
[26]:
# store the data associated with the year 2016 in a new variable
population_2016 = population[2016]
population_2016
[26]:
country\gender Male Female
Belgium 5569264 5741853
France 32247386 34391005
Germany 40514123 41661561
[27]:
# now, we modify the new variable
population_2016['Belgium'] = 0
# and we can see that the original array has also been modified
population
[27]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 0 0
Belgium Female 5665118 5687048 5713206 0 0
France Male 31772665 32045129 32174258 32247386 0
France Female 33827685 34120851 34283895 34391005 0
Germany Male 39380976 39556923 39835457 40514123 0
Germany Female 41142770 41210540 41362080 41661561 0
One very important gotcha though…
Warning: Storing a subset of an array in a new variable and modifying it afterwards may also impact the original array. The reason is that selecting a contiguous subset of the data does not return a copy of the selected subset, but rather a view on a subset of the array. To avoid such behavior, use the .copy() method.
Remember:
taking a contiguous subset of an array is extremely fast (no data is copied)
if one modifies that subset, one also modifies the original array
.copy() returns a copy of the subset (which costs time and memory) but allows you to change the subset without modifying the original array at the same time
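The view behaviour summarised above can be reproduced with the standard library alone. The sketch below is an analogy, not LArray code: a memoryview slice plays the role of the view on a contiguous subset, and a slice of array.array (which copies) plays the role of .copy().

```python
from array import array

# Belgium, Male, 2013-2017 (values from the 'population' array)
data = array('q', [5472856, 5493792, 5524068, 5569264, 5589272])

view = memoryview(data)[3:]  # contiguous subset: no data is copied
view[0] = 0                  # writes through to `data`
print(data.tolist())         # [5472856, 5493792, 5524068, 0, 5589272]

independent = data[:3]       # slicing an array.array copies the data
independent[0] = -1          # `data` is left untouched
print(data.tolist())         # [5472856, 5493792, 5524068, 0, 5589272]
```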
The same warning applies to entire arrays:
[28]:
# reload the 'population' array
population = load_example_data('demography_eurostat').population
# create a second 'population2' variable
population2 = population
population2
[28]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5569264 5589272
Belgium Female 5665118 5687048 5713206 5741853 5762455
France Male 31772665 32045129 32174258 32247386 32318973
France Female 33827685 34120851 34283895 34391005 34485148
Germany Male 39380976 39556923 39835457 40514123 40697118
Germany Female 41142770 41210540 41362080 41661561 41824535
[29]:
# set all data corresponding to the year 2017 to 0
population2[2017] = 0
population2
[29]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5569264 0
Belgium Female 5665118 5687048 5713206 5741853 0
France Male 31772665 32045129 32174258 32247386 0
France Female 33827685 34120851 34283895 34391005 0
Germany Male 39380976 39556923 39835457 40514123 0
Germany Female 41142770 41210540 41362080 41661561 0
[30]:
# and now take a look at what happened to the original array 'population'
# after modifying the 'population2' array
population
[30]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5569264 0
Belgium Female 5665118 5687048 5713206 5741853 0
France Male 31772665 32045129 32174258 32247386 0
France Female 33827685 34120851 34283895 34391005 0
Germany Male 39380976 39556923 39835457 40514123 0
Germany Female 41142770 41210540 41362080 41661561 0
Warning: The syntax new_array = old_array does not create a new array but rather an ‘alias’ variable. To actually create a new array as a copy of a previous one, the .copy() method must be called.
[31]:
# reload the 'population' array
population = load_example_data('demography_eurostat').population
# copy the 'population' array and store the copy in a new variable
population2 = population.copy()
# modify the copy
population2[2017] = 0
population2
[31]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5569264 0
Belgium Female 5665118 5687048 5713206 5741853 0
France Male 31772665 32045129 32174258 32247386 0
France Female 33827685 34120851 34283895 34391005 0
Germany Male 39380976 39556923 39835457 40514123 0
Germany Female 41142770 41210540 41362080 41661561 0
[32]:
# the data from the original array have not been modified
population
[32]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5569264 5589272
Belgium Female 5665118 5687048 5713206 5741853 5762455
France Male 31772665 32045129 32174258 32247386 32318973
France Female 33827685 34120851 34283895 34391005 34485148
Germany Male 39380976 39556923 39835457 40514123 40697118
Germany Female 41142770 41210540 41362080 41661561 41824535
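The alias behaviour is not specific to LArray: in Python, assignment binds a second name to the same object. A minimal sketch with a plain list (the variable names are chosen for this example) shows the same effect and the same remedy.

```python
original = [5472856, 5493792, 5524068]

alias = original             # second name for the same object
duplicate = original.copy()  # new, independent object

alias[0] = 0                 # modifies the one shared list

print(alias is original, duplicate is original)  # True False
print(original)                                  # [0, 5493792, 5524068]
print(duplicate)                                 # [5472856, 5493792, 5524068]
```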
Assigning Arrays And Broadcasting
Instead of a value, we can also assign an array to a subset. In that case, the array can have fewer axes than the target, but those that are present must be compatible with the subset being targeted.
[33]:
# select population for the year 2015
population_2015 = population[2015]
# propagate population for the year 2015 to all following years
population[2016:] = population_2015
population
[33]:
country gender\time 2013 2014 2015 2016 2017
Belgium Male 5472856 5493792 5524068 5524068 5524068
Belgium Female 5665118 5687048 5713206 5713206 5713206
France Male 31772665 32045129 32174258 32174258 32174258
France Female 33827685 34120851 34283895 34283895 34283895
Germany Male 39380976 39556923 39835457 39835457 39835457
Germany Female 41142770 41210540 41362080 41362080 41362080
Warning: The array being assigned must have axes compatible with the target subset (i.e. same axis names and same labels).
[34]:
# replace 'Male' and 'Female' labels by 'M' and 'F'
population_2015 = population_2015.set_labels('gender', 'M,F')
population_2015
[34]:
country\gender M F
Belgium 5524068 5713206
France 32174258 34283895
Germany 39835457 41362080
[35]:
# now let's try to repeat the assignment operation above with the new labels.
# An error is raised because of incompatible axes
try:
    population[2016:] = population_2015
except Exception as e:
    print(type(e).__name__, ':', e)
ValueError : incompatible axes:
Axis(['Male', 'Female'], 'gender')
vs
Axis(['M', 'F'], 'gender')
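The two rules of this section (missing axes are broadcast, present axes must match) can be summarised in a small plain-Python model. Everything here is hypothetical and for illustration only: `assign_broadcast` is not an LArray function, arrays are represented as dictionaries keyed by label tuples, and only Belgium is included to keep the data short.

```python
def assign_broadcast(target, target_axes, years, source, source_axes):
    """Copy `source` (country x gender) into every listed year of `target`.

    `target` is keyed by (country, gender, year), `source` by (country, gender).
    Both `*_axes` arguments map axis names to label lists.
    """
    # Axes present in the source must have the same labels as in the target.
    for name, labels in source_axes.items():
        if target_axes.get(name) != labels:
            raise ValueError(f"incompatible axes: {name} {target_axes.get(name)} vs {labels}")
    # The missing time axis is broadcast: the same value is repeated for each year.
    for (country, gender), value in source.items():
        for year in years:
            target[country, gender, year] = value


target_axes = {'country': ['Belgium'], 'gender': ['Male', 'Female'],
               'time': [2015, 2016, 2017]}
target = {
    ('Belgium', 'Male', 2015): 5524068, ('Belgium', 'Male', 2016): 5569264,
    ('Belgium', 'Male', 2017): 5589272, ('Belgium', 'Female', 2015): 5713206,
    ('Belgium', 'Female', 2016): 5741853, ('Belgium', 'Female', 2017): 5762455,
}
source_axes = {'country': ['Belgium'], 'gender': ['Male', 'Female']}
source = {('Belgium', 'Male'): 5524068, ('Belgium', 'Female'): 5713206}

assign_broadcast(target, target_axes, [2016, 2017], source, source_axes)
print(target['Belgium', 'Male', 2017])  # 5524068

# Same data, but the gender axis is declared with different labels.
relabelled_axes = {'country': ['Belgium'], 'gender': ['M', 'F']}
try:
    assign_broadcast(target, target_axes, [2016, 2017], source, relabelled_axes)
except ValueError as e:
    print(type(e).__name__, ':', e)
```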