{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Indexing, Selecting and Assigning\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the LArray library:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from larray import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the test array ``population``:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# let's start with\n", "population = load_example_data('demography_eurostat').population\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting (Subsets)\n", "\n", "The ``Array`` class allows to select a subset either by labels or indices (positions)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting by Labels\n", "\n", "To take a subset of an array using labels, use brackets [ ].\n", "\n", "Let's start by selecting a single element:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population['Belgium', 'Female', 2017]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As long as there is no ambiguity (i.e. axes sharing one or several same label(s)), the order of indexing does not matter. \n", "So you usually do not care/have to remember about axes positions during computation. It only matters for output." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# order of index doesn't matter\n", "population['Female', 2017, 'Belgium']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting a subset is done by using slices or lists of labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population[['Belgium', 'Germany'], 2014:2016]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slices bounds are optional:\n", "if not given, start is assumed to be the first label and stop is the last one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select all years starting from 2015\n", "population[2015:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select all first years until 2015\n", "population[:2015]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Slices can also have a step (defaults to 1), to take every Nth labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select all even years starting from 2014\n", "population[2014::2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Warning:** Selecting by labels as in above examples works well as long as there is no ambiguity.\n", " When two or more axes have common labels, it leads to a crash.\n", " The solution is then to precise to which axis belong the labels.\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "immigration = load_example_data('demography_eurostat').immigration\n", "\n", "# the 'immigration' array has two axes (country and citizenship) which share the same labels\n", "immigration" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# LArray doesn't use the position of the labels used inside the brackets \n", "# to determine the corresponding axes. Instead LArray will try to guess the \n", "# corresponding axis for each label whatever is its position.\n", "# Then, if a label is shared by two or more axes, LArray will not be able \n", "# to choose between the possible axes and will raise an error.\n", "try:\n", " immigration['Belgium', 'Netherlands']\n", "except Exception as e:\n", " print(type(e).__name__, ':', e)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the solution is simple. You need to precise the axes on which you make a selection\n", "immigration[immigration.country['Belgium'], immigration.citizenship['Netherlands']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ambiguous Cases - Specifying Axes Using The Special Variable X\n", "\n", "When selecting, assigning or using aggregate functions, an axis can be\n", "referred via the special variable ``X``:\n", "\n", "- population[X.time[2015:]]\n", "- population.sum(X.time)\n", "\n", "This gives you access to axes of the array you are manipulating. The main\n", "drawback of using ``X`` is that you lose the autocompletion available from\n", "many editors. It only works with non-anonymous axes for which names do not contain whitespaces or special characters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the previous example can also be written as\n", "immigration[X.country['Belgium'], X.citizenship['Netherlands']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting by Indices\n", "\n", "Sometimes it is more practical to use indices (positions) along the axis, instead of labels.\n", "You need to add the character ``i`` before the brackets: ``.i[indices]``.\n", "As for selection with labels, you can use a single index, a slice or a list of indices.\n", "Indices can be also negative (-1 represent the last element of an axis).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Note:** Remember that indices (positions) are always **0-based** in Python.\n", "So the first element is at index 0, the second is at index 1, etc.\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select the last year\n", "population[X.time.i[-1]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same but for the last 3 years\n", "population[X.time.i[-3:]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using a list of indices\n", "population[X.time.i[0, 2, 4]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Warning:** The end *indice* (position) is EXCLUSIVE while the end label is INCLUSIVE.\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "year = 2015\n", "\n", "# with labels\n", "population[X.time[:year]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# with indices (i.e. using the .i[indices] syntax)\n", "index_year = population.time.index(year)\n", "population[X.time.i[:index_year]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use ``.i[]`` selection directly on array instead of axes.\n", "In this context, if you want to select a subset of the first and third axes for example, you must use a full slice ``:`` for the second one.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select first country and last three years\n", "population.i[0, :, -3:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using Groups In Selections\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "even_years = population.time[2014::2]\n", "\n", "population[even_years]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Boolean Filtering\n", "\n", "Boolean filtering can be used to extract subsets. Filtering can be done on axes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select even years\n", "population[X.time % 2 == 0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select population for the year 2017\n", "population_2017 = population[2017]\n", "\n", "# select all data with a value greater than 30 million\n", "population_2017[population_2017 > 30e6]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Note:** Be aware that after boolean filtering, several axes may have merged.\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arrays can also be used to create boolean filters:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "start_year = Array([2015, 2016, 2017], axes=population.country)\n", "start_year" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population[X.time >= start_year]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterating over an axis\n", "\n", "Iterating over an axis is straightforward:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for year in population.time:\n", " print(year)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Assigning subsets\n", "\n", "### Assigning A Value\n", "\n", "Assigning a value to a subset is simple:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "population[2017] = 0\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's store a subset in a new variable and modify it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# store the data associated with the year 2016 in a new variable\n", "population_2016 = population[2016]\n", "population_2016" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now, we modify the new variable\n", "population_2016['Belgium'] = 0\n", "\n", "# and we can see that the original array has been also modified\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One very important gotcha though...\n", "\n", "
\n", "**Warning:** Storing a subset of an array in a new variable and modifying it after may also impact the original array. The reason is that selecting a contiguous subset of the data does not return a copy of the selected subset, but rather a view on a subset of the array. To avoid such behavior, use the ``.copy()`` method.\n", "
\n", "\n", "Remember:\n", "\n", "- taking a contiguous subset of an array is extremely fast (no data is copied)\n", "- if one modifies that subset, one also **modifies the original array**\n", "- **.copy()** returns a copy of the subset (takes speed and memory) but\n", " allows you to change the subset without modifying the original array\n", " in the same time\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same warning apply for entire arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reload the 'population' array\n", "population = load_example_data('demography_eurostat').population\n", "\n", "# create a second 'population2' variable\n", "population2 = population\n", "population2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set all data corresponding to the year 2017 to 0\n", "population2[2017] = 0\n", "population2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and now take a look of what happened to the original array 'population'\n", "# after modifying the 'population2' array\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Warning:** The syntax ``new_array = old_array`` does not create a new array but rather an 'alias' variable. To actually create a new array as a copy of a previous one, the ``.copy()`` method must be called.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# reload the 'population' array\n", "population = load_example_data('demography_eurostat').population\n", "\n", "# copy the 'population' array and store the copy in a new variable\n", "population2 = population.copy()\n", "\n", "# modify the copy\n", "population2[2017] = 0\n", "population2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# the data from the original array have not been modified\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assigning Arrays And Broadcasting\n", "\n", "Instead of a value, we can also assign an array to a subset. In that\n", "case, that array can have less axes than the target but those which are\n", "present must be compatible with the subset being targeted.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# select population for the year 2015\n", "population_2015 = population[2015]\n", "\n", "# propagate population for the year 2015 to all next years\n", "population[2016:] = population_2015\n", "\n", "population" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Warning:** The array being assigned must have compatible axes (i.e. same axes names and same labels) with the target subset.\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# replace 'Male' and 'Female' labels by 'M' and 'F'\n", "population_2015 = population_2015.set_labels('gender', 'M,F')\n", "population_2015" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# now let's try to repeat the assignement operation above with the new labels.\n", "# An error is raised because of incompatible axes\n", "try:\n", " population[2016:] = population_2015\n", "except Exception as e:\n", " print(type(e).__name__, ':', e)" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "livereveal": { "autolaunch": false, "scroll": true } }, "nbformat": 4, "nbformat_minor": 2 }