Indexing, Selecting and Assigning

Import the LArray library:

[1]:
from larray import *

Load the example array population:

[2]:
# load the 'population' array from the example dataset
population = load_example_data('demography_eurostat').population
population
[2]:
country  gender\time      2013      2014      2015      2016      2017
Belgium         Male   5472856   5493792   5524068   5569264   5589272
Belgium       Female   5665118   5687048   5713206   5741853   5762455
 France         Male  31772665  32045129  32174258  32247386  32318973
 France       Female  33827685  34120851  34283895  34391005  34485148
Germany         Male  39380976  39556923  39835457  40514123  40697118
Germany       Female  41142770  41210540  41362080  41661561  41824535

Selecting (Subsets)

The Array class lets you select a subset either by labels or by indices (positions).

Selecting by Labels

To take a subset of an array using labels, use brackets [ ].

Let’s start by selecting a single element:

[3]:
population['Belgium', 'Female', 2017]
[3]:
5762455

As long as there is no ambiguity (i.e. no label is shared by several axes), the order of the labels inside the brackets does not matter. You therefore usually do not need to remember axis positions during computations; axis order only matters for output.

[4]:
# the order of the labels doesn't matter
population['Female', 2017, 'Belgium']
[4]:
5762455

Selecting a subset is done by using slices or lists of labels:

[5]:
population[['Belgium', 'Germany'], 2014:2016]
[5]:
country  gender\time      2014      2015      2016
Belgium         Male   5493792   5524068   5569264
Belgium       Female   5687048   5713206   5741853
Germany         Male  39556923  39835457  40514123
Germany       Female  41210540  41362080  41661561

Slice bounds are optional: if omitted, start defaults to the first label and stop to the last one.

[6]:
# select all years starting from 2015
population[2015:]
[6]:
country  gender\time      2015      2016      2017
Belgium         Male   5524068   5569264   5589272
Belgium       Female   5713206   5741853   5762455
 France         Male  32174258  32247386  32318973
 France       Female  34283895  34391005  34485148
Germany         Male  39835457  40514123  40697118
Germany       Female  41362080  41661561  41824535
[7]:
# select all years up to and including 2015
population[:2015]
[7]:
country  gender\time      2013      2014      2015
Belgium         Male   5472856   5493792   5524068
Belgium       Female   5665118   5687048   5713206
 France         Male  31772665  32045129  32174258
 France       Female  33827685  34120851  34283895
Germany         Male  39380976  39556923  39835457
Germany       Female  41142770  41210540  41362080

Slices can also have a step (defaults to 1), to take every Nth label:

[8]:
# select all even years starting from 2014
population[2014::2]
[8]:
country  gender\time      2014      2016
Belgium         Male   5493792   5569264
Belgium       Female   5687048   5741853
 France         Male  32045129  32247386
 France       Female  34120851  34391005
Germany         Male  39556923  40514123
Germany       Female  41210540  41661561

Warning: Selecting by labels as in the examples above works as long as there is no ambiguity. When two or more axes share labels, the selection raises an error. The solution is to specify which axis each label belongs to.

[9]:
immigration = load_example_data('demography_eurostat').immigration

# the 'immigration' array has two axes (country and citizenship) which share the same labels
immigration
[9]:
    country  citizenship  gender\time   2013   2014   2015   2016   2017
    Belgium      Belgium         Male   8822  10512  11378  11055  11082
    Belgium      Belgium       Female   5727   6301   6486   6560   6454
    Belgium   Luxembourg         Male    102    117    105    130    110
    Belgium   Luxembourg       Female    117    123    114    108    118
    Belgium  Netherlands         Male   4185   4222   4183   4199   4138
    Belgium  Netherlands       Female   3737   3844   3942   3664   3632
 Luxembourg      Belgium         Male    896    937    880    762    781
 Luxembourg      Belgium       Female    574    655    622    558    575
 Luxembourg   Luxembourg         Male    694    722    660    740    650
 Luxembourg   Luxembourg       Female    607    586    535    591    549
 Luxembourg  Netherlands         Male    160    165    147    141    167
 Luxembourg  Netherlands       Female     92     97     85     94    119
Netherlands      Belgium         Male   1063   1141   1113   1364   1493
Netherlands      Belgium       Female    980   1071   1181   1340   1449
Netherlands   Luxembourg         Male     23     43     59     70     83
Netherlands   Luxembourg       Female     24     34     46     60     97
Netherlands  Netherlands         Male  19374  20037  21119  22707  23750
Netherlands  Netherlands       Female  16945  17411  18084  19815  20894
[10]:
# LArray doesn't use the position of the labels inside the brackets
# to determine the corresponding axes. Instead, LArray tries to guess the
# corresponding axis for each label, whatever its position.
# If a label is shared by two or more axes, LArray cannot choose
# between the possible axes and raises an error.
try:
    immigration['Belgium', 'Netherlands']
except Exception as e:
    print(type(e).__name__, ':', e)
ValueError : 'Belgium' is ambiguous, it is valid in the following axes:
 country [3]: 'Belgium' 'Luxembourg' 'Netherlands'
 citizenship [3]: 'Belgium' 'Luxembourg' 'Netherlands'
[11]:
# the solution is simple: specify the axes on which you make the selection
immigration[immigration.country['Belgium'], immigration.citizenship['Netherlands']]
[11]:
gender\time  2013  2014  2015  2016  2017
       Male  4185  4222  4183  4199  4138
     Female  3737  3844  3942  3664  3632
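
The axis-guessing behaviour described above can be pictured with a plain-Python sketch. This is not LArray's implementation: find_axis and the dictionaries of axes are hypothetical names, used only to illustrate why a label shared by two axes cannot be resolved unless the axis is named.

```python
# Hypothetical sketch (not LArray code): map a label to the single axis containing it.
def find_axis(label, axes):
    """Return the name of the only axis whose labels contain `label`.

    `axes` maps axis names to lists of labels.
    """
    matches = [name for name, labels in axes.items() if label in labels]
    if not matches:
        raise ValueError(f"{label!r} is not a valid label for any axis")
    if len(matches) > 1:
        raise ValueError(f"{label!r} is ambiguous, it is valid in: {', '.join(matches)}")
    return matches[0]


population_axes = {
    'country': ['Belgium', 'France', 'Germany'],
    'gender': ['Male', 'Female'],
    'time': [2013, 2014, 2015, 2016, 2017],
}
immigration_axes = {
    'country': ['Belgium', 'Luxembourg', 'Netherlands'],
    'citizenship': ['Belgium', 'Luxembourg', 'Netherlands'],
    'gender': ['Male', 'Female'],
    'time': [2013, 2014, 2015, 2016, 2017],
}

# 'Female' appears in one axis only, so its position in the key is irrelevant.
unambiguous = find_axis('Female', population_axes)

# 'Belgium' appears in both 'country' and 'citizenship': no way to choose.
try:
    find_axis('Belgium', immigration_axes)
    ambiguous_error = None
except ValueError as exc:
    ambiguous_error = str(exc)
```

Naming the axis explicitly, as in the cell above, removes the need for any guessing.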

Ambiguous Cases - Specifying Axes Using The Special Variable X

When selecting, assigning or using aggregate functions, an axis can be referred to via the special variable X:

  • population[X.time[2015:]]

  • population.sum(X.time)

This gives you access to the axes of the array you are manipulating. The main drawback of using X is that you lose the autocompletion offered by many editors. It only works with non-anonymous axes whose names do not contain whitespace or special characters.

[12]:
# the previous example can also be written as
immigration[X.country['Belgium'], X.citizenship['Netherlands']]
[12]:
gender\time  2013  2014  2015  2016  2017
       Male  4185  4222  4183  4199  4138
     Female  3737  3844  3942  3664  3632

Selecting by Indices

Sometimes it is more practical to use indices (positions) along an axis instead of labels. To do so, add .i before the brackets: .i[indices]. As with selection by labels, you can use a single index, a slice or a list of indices. Indices can also be negative (-1 represents the last element of an axis).

Note: Remember that indices (positions) are always 0-based in Python. So the first element is at index 0, the second is at index 1, etc.

[13]:
# select the last year
population[X.time.i[-1]]
[13]:
country\gender      Male    Female
       Belgium   5589272   5762455
        France  32318973  34485148
       Germany  40697118  41824535
[14]:
# same but for the last 3 years
population[X.time.i[-3:]]
[14]:
country  gender\time      2015      2016      2017
Belgium         Male   5524068   5569264   5589272
Belgium       Female   5713206   5741853   5762455
 France         Male  32174258  32247386  32318973
 France       Female  34283895  34391005  34485148
Germany         Male  39835457  40514123  40697118
Germany       Female  41362080  41661561  41824535
[15]:
# using a list of indices
population[X.time.i[0, 2, 4]]
[15]:
country  gender\time      2013      2015      2017
Belgium         Male   5472856   5524068   5589272
Belgium       Female   5665118   5713206   5762455
 France         Male  31772665  32174258  32318973
 France       Female  33827685  34283895  34485148
Germany         Male  39380976  39835457  40697118
Germany       Female  41142770  41362080  41824535

Warning: The end index (position) of a slice is EXCLUSIVE, while the end label is INCLUSIVE.

[16]:
year = 2015

# with labels
population[X.time[:year]]
[16]:
country  gender\time      2013      2014      2015
Belgium         Male   5472856   5493792   5524068
Belgium       Female   5665118   5687048   5713206
 France         Male  31772665  32045129  32174258
 France       Female  33827685  34120851  34283895
Germany         Male  39380976  39556923  39835457
Germany       Female  41142770  41210540  41362080
[17]:
# with indices (i.e. using the .i[indices] syntax)
index_year = population.time.index(year)
population[X.time.i[:index_year]]
[17]:
country  gender\time      2013      2014
Belgium         Male   5472856   5493792
Belgium       Female   5665118   5687048
 France         Male  31772665  32045129
 France       Female  33827685  34120851
Germany         Male  39380976  39556923
Germany       Female  41142770  41210540
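
The inclusive/exclusive difference can be reproduced with an ordinary Python list standing in for the labels of the time axis. The two helper functions below are hypothetical and only illustrate the rule; they are not how LArray translates labels internally.

```python
# Stand-in for the labels of the time axis.
years = [2013, 2014, 2015, 2016, 2017]


def select_until_label(labels, stop_label):
    """Label-style selection: the stop label is INCLUDED in the result."""
    stop_pos = labels.index(stop_label)
    return labels[:stop_pos + 1]


def select_until_index(labels, stop_pos):
    """Index-style selection: the stop position is EXCLUDED (plain Python slicing)."""
    return labels[:stop_pos]


by_label = select_until_label(years, 2015)               # [2013, 2014, 2015]
by_index = select_until_index(years, years.index(2015))  # [2013, 2014]
```

To get the same result with indices as with labels, the stop position has to be one past the position of the last wanted label.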

You can also use .i[] directly on an array instead of on an axis. In that case, to select a subset of the first and third axes for example, you must use a full slice (:) for the second one.

[18]:
# select first country and last three years
population.i[0, :, -3:]
[18]:
gender\time     2015     2016     2017
       Male  5524068  5569264  5589272
     Female  5713206  5741853  5762455

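Using Groups In Selections

A selection of labels made on an axis (a group) can be stored in a variable and then used to select a subset:
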
[19]:
even_years = population.time[2014::2]

population[even_years]
[19]:
country  gender\time      2014      2016
Belgium         Male   5493792   5569264
Belgium       Female   5687048   5741853
 France         Male  32045129  32247386
 France       Female  34120851  34391005
Germany         Male  39556923  40514123
Germany       Female  41210540  41661561

Boolean Filtering

Boolean filtering can be used to extract subsets. Filtering can be done on axes:

[20]:
# select even years
population[X.time % 2 == 0]
[20]:
country  gender\time      2014      2016
Belgium         Male   5493792   5569264
Belgium       Female   5687048   5741853
 France         Male  32045129  32247386
 France       Female  34120851  34391005
Germany         Male  39556923  40514123
Germany       Female  41210540  41661561

or data:

[21]:
# select population for the year 2017
population_2017 = population[2017]

# select all data with a value greater than 30 million
population_2017[population_2017 > 30e6]
[21]:
country_gender  France_Male  France_Female  Germany_Male  Germany_Female
                   32318973       34485148      40697118        41824535

Note: Be aware that after boolean filtering, several axes may have merged.
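
Why the axes merge can be seen in a plain-Python sketch: the cells that pass the filter generally do not form a rectangular block, so the result is flattened to one dimension and each remaining cell is identified by the combination of its original labels. The dictionary below is a stand-in for the 2017 data shown at the top of this page; joining the labels with an underscore mimics the output above and is not LArray's own code.

```python
# Stand-in for population[2017]: keys are (country, gender) label pairs.
population_2017 = {
    ('Belgium', 'Male'): 5589272,
    ('Belgium', 'Female'): 5762455,
    ('France', 'Male'): 32318973,
    ('France', 'Female'): 34485148,
    ('Germany', 'Male'): 40697118,
    ('Germany', 'Female'): 41824535,
}

# Keep only the cells above 30 million. The surviving cells are described by
# a single combined label, so the two axes collapse into one.
filtered = {
    f"{country}_{gender}": value
    for (country, gender), value in population_2017.items()
    if value > 30e6
}
```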

Arrays can also be used to create boolean filters:

[22]:
start_year = Array([2015, 2016, 2017], axes=population.country)
start_year
[22]:
country  Belgium  France  Germany
            2015    2016     2017
[23]:
population[X.time >= start_year]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[23], line 1
----> 1 population[X.time >= start_year]

KeyError: "axis 'time' not found in {country}"

Note: In the build used to render this page (larray 0.34.4 on Python 3.11, according to the traceback), this cell fails instead of returning a filtered array. The traceback shows that X.time is evaluated eagerly against the axes of start_year, which only has the country axis, hence the KeyError.

Iterating Over An Axis

Iterating over an axis is straightforward:

[24]:
for year in population.time:
    print(year)
2013
2014
2015
2016
2017

Assigning Subsets

Assigning A Value

Assigning a value to a subset is simple:

[25]:
population[2017] = 0
population
[25]:
country  gender\time      2013      2014      2015      2016  2017
Belgium         Male   5472856   5493792   5524068   5569264     0
Belgium       Female   5665118   5687048   5713206   5741853     0
 France         Male  31772665  32045129  32174258  32247386     0
 France       Female  33827685  34120851  34283895  34391005     0
Germany         Male  39380976  39556923  39835457  40514123     0
Germany       Female  41142770  41210540  41362080  41661561     0

Now, let’s store a subset in a new variable and modify it:

[26]:
# store the data associated with the year 2016 in a new variable
population_2016 = population[2016]
population_2016
[26]:
country\gender      Male    Female
       Belgium   5569264   5741853
        France  32247386  34391005
       Germany  40514123  41661561
[27]:
# now, we modify the new variable
population_2016['Belgium'] = 0

# and we can see that the original array has also been modified
population
[27]:
country  gender\time      2013      2014      2015      2016  2017
Belgium         Male   5472856   5493792   5524068         0     0
Belgium       Female   5665118   5687048   5713206         0     0
 France         Male  31772665  32045129  32174258  32247386     0
 France       Female  33827685  34120851  34283895  34391005     0
Germany         Male  39380976  39556923  39835457  40514123     0
Germany       Female  41142770  41210540  41362080  41661561     0

One very important gotcha though…

Warning: Storing a subset of an array in a new variable and modifying it afterwards may also modify the original array. The reason is that selecting a contiguous subset of the data does not return a copy of the selected subset, but rather a view on a subset of the array. To avoid this behavior, use the .copy() method.

Remember:

  • taking a contiguous subset of an array is extremely fast (no data is copied)

  • if one modifies that subset, one also modifies the original array

  • .copy() returns a copy of the subset (which costs time and memory) but lets you change the subset without modifying the original array at the same time
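
The view/copy distinction is not specific to LArray. As a minimal standard-library sketch, memoryview is used below as a stand-in for a view and a plain slice of array.array as a stand-in for a copy; none of this is LArray code, and the values are simply taken from the 2016 column.

```python
from array import array

# Four 64-bit integers stored in one contiguous buffer.
data = array('q', [5569264, 5741853, 32247386, 34391005])

# A memoryview slice shares memory with `data`: nothing is copied.
view = memoryview(data)[0:2]
view[0] = 0
after_view_write = data[0]   # the original buffer was modified through the view

# Slicing the array itself creates an independent copy.
copied = data[2:4]
copied[0] = 0
after_copy_write = data[2]   # the original buffer is untouched
```

Creating the view is cheap because no element is duplicated; the copy costs extra memory but can be changed freely.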

The same warning applies to entire arrays:

[28]:
# reload the 'population' array
population = load_example_data('demography_eurostat').population

# create a second 'population2' variable
population2 = population
population2
[28]:
country  gender\time      2013      2014      2015      2016      2017
Belgium         Male   5472856   5493792   5524068   5569264   5589272
Belgium       Female   5665118   5687048   5713206   5741853   5762455
 France         Male  31772665  32045129  32174258  32247386  32318973
 France       Female  33827685  34120851  34283895  34391005  34485148
Germany         Male  39380976  39556923  39835457  40514123  40697118
Germany       Female  41142770  41210540  41362080  41661561  41824535
[29]:
# set all data corresponding to the year 2017 to 0
population2[2017] = 0
population2
[29]:
country  gender\time      2013      2014      2015      2016  2017
Belgium         Male   5472856   5493792   5524068   5569264     0
Belgium       Female   5665118   5687048   5713206   5741853     0
 France         Male  31772665  32045129  32174258  32247386     0
 France       Female  33827685  34120851  34283895  34391005     0
Germany         Male  39380976  39556923  39835457  40514123     0
Germany       Female  41142770  41210540  41362080  41661561     0
[30]:
# and now take a look at what happened to the original array 'population'
# after modifying the 'population2' array
population
[30]:
country  gender\time      2013      2014      2015      2016  2017
Belgium         Male   5472856   5493792   5524068   5569264     0
Belgium       Female   5665118   5687048   5713206   5741853     0
 France         Male  31772665  32045129  32174258  32247386     0
 France       Female  33827685  34120851  34283895  34391005     0
Germany         Male  39380976  39556923  39835457  40514123     0
Germany       Female  41142770  41210540  41362080  41661561     0

Warning: The syntax new_array = old_array does not create a new array but rather an ‘alias’ variable. To actually create a new array as a copy of a previous one, the .copy() method must be called.

[31]:
# reload the 'population' array
population = load_example_data('demography_eurostat').population

# copy the 'population' array and store the copy in a new variable
population2 = population.copy()

# modify the copy
population2[2017] = 0
population2
[31]:
country  gender\time      2013      2014      2015      2016  2017
Belgium         Male   5472856   5493792   5524068   5569264     0
Belgium       Female   5665118   5687048   5713206   5741853     0
 France         Male  31772665  32045129  32174258  32247386     0
 France       Female  33827685  34120851  34283895  34391005     0
Germany         Male  39380976  39556923  39835457  40514123     0
Germany       Female  41142770  41210540  41362080  41661561     0
[32]:
# the data from the original array have not been modified
population
[32]:
country  gender\time      2013      2014      2015      2016      2017
Belgium         Male   5472856   5493792   5524068   5569264   5589272
Belgium       Female   5665118   5687048   5713206   5741853   5762455
 France         Male  31772665  32045129  32174258  32247386  32318973
 France       Female  33827685  34120851  34283895  34391005  34485148
Germany         Male  39380976  39556923  39835457  40514123  40697118
Germany       Female  41142770  41210540  41362080  41661561  41824535
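
The alias behaviour shown above is ordinary Python name binding rather than something specific to arrays. A minimal sketch with a plain list (the variable names are illustrative only):

```python
original = [5589272, 5762455]

alias = original                # no new object: both names refer to the same list
independent = original.copy()   # a new list with the same content

alias[0] = 0          # also visible through `original`
independent[1] = 0    # only changes the copy

same_object = alias is original            # True: one object, two names
separate_object = independent is original  # False: the copy is a distinct object
```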

Assigning Arrays And Broadcasting

Instead of a value, we can also assign an array to a subset. In that case, the array can have fewer axes than the target, but those that are present must be compatible with the subset being targeted.

[33]:
# select population for the year 2015
population_2015 = population[2015]

# propagate population for the year 2015 to all next years
population[2016:] = population_2015

population
[33]:
country  gender\time      2013      2014      2015      2016      2017
Belgium         Male   5472856   5493792   5524068   5524068   5524068
Belgium       Female   5665118   5687048   5713206   5713206   5713206
 France         Male  31772665  32045129  32174258  32174258  32174258
 France       Female  33827685  34120851  34283895  34283895  34283895
Germany         Male  39380976  39556923  39835457  39835457  39835457
Germany       Female  41142770  41210540  41362080  41362080  41362080

Warning: The array being assigned must have compatible axes (i.e. same axes names and same labels) with the target subset.

[34]:
# replace 'Male' and 'Female' labels by 'M' and 'F'
population_2015 = population_2015.set_labels('gender', 'M,F')
population_2015
[34]:
country\gender         M         F
       Belgium   5524068   5713206
        France  32174258  34283895
       Germany  39835457  41362080
[35]:
# now let's try to repeat the assignment above with the new labels.
# An error is raised because the axes are incompatible
try:
    population[2016:] = population_2015
except Exception as e:
    print(type(e).__name__, ':', e)
ValueError : incompatible axes:
Axis(['Male', 'Female'], 'gender')
vs
Axis(['M', 'F'], 'gender')
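
The compatibility rule can be summarised with a plain-Python sketch. It is a simplification under stated assumptions, not LArray's actual check: check_assignable is a hypothetical helper, and axes are represented as dictionaries mapping axis names to lists of labels.

```python
# Hypothetical helper (not LArray code): every axis of the assigned value must
# exist in the target subset with exactly the same labels. Axes missing from
# the value are allowed; the value is repeated (broadcast) along them.
def check_assignable(target_axes, value_axes):
    for name, labels in value_axes.items():
        if name not in target_axes:
            raise ValueError(f"axis {name!r} not found in target")
        if list(labels) != list(target_axes[name]):
            raise ValueError(
                f"incompatible axes: {target_axes[name]} vs {list(labels)} for axis {name!r}"
            )


# Axes of the target subset population[2016:]
target = {
    'country': ['Belgium', 'France', 'Germany'],
    'gender': ['Male', 'Female'],
    'time': [2016, 2017],
}
# Axes of population[2015] before and after relabelling the gender axis
value_ok = {'country': ['Belgium', 'France', 'Germany'], 'gender': ['Male', 'Female']}
value_bad = {'country': ['Belgium', 'France', 'Germany'], 'gender': ['M', 'F']}

check_assignable(target, value_ok)  # passes: the missing time axis is broadcast

try:
    check_assignable(target, value_bad)
    error = None
except ValueError as exc:
    error = str(exc)
```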