
Getting Started

This Getting Started section gives a quick overview of the main objects and features of the LArray library. For a more detailed presentation of all of LArray's capabilities, read the next sections of the tutorial.

The API Reference section of the documentation gives you the list of all objects, methods and functions with their individual documentation and examples.

To use the LArray library, the first thing to do is to import it:

[1]:
from larray import *

To check which version of the LArray library is installed on your machine, type:

[2]:
from larray import __version__
__version__
[2]:
'0.32'

Create an array

Working with the LArray library mainly consists of manipulating Array data structures. They represent N-dimensional labelled arrays and are composed of raw data (NumPy ndarray), axes and optionally some metadata.

An Axis object represents a dimension of an array. It contains a list of labels and has a name:

[3]:
# define some axes to be used later
age = Axis(['0-9', '10-17', '18-66', '67+'], 'age')
gender = Axis(['female', 'male'], 'gender')
time = Axis([2015, 2016, 2017], 'time')

Labels allow you to select subsets and to manipulate the data without working directly with the positions of array elements.
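To make the idea of working with labels rather than positions concrete, here is a minimal plain-Python sketch. It is not how LArray is implemented; the names position_of and select are hypothetical and exist only for this illustration. The values are the 2015 totals per age class from the population data used throughout this section.

```python
# Illustrative only: a plain-Python stand-in for a labelled axis.
age_labels = ['0-9', '10-17', '18-66', '67+']

# Map each label to its position once, so later lookups go through labels.
position_of = {label: pos for pos, label in enumerate(age_labels)}

# One value per age label (2015 totals, both genders, in thousands).
values = [1296, 989, 7172, 1779]


def select(label):
    """Return the value stored for `label` without exposing its position."""
    return values[position_of[label]]


print(select('18-66'))  # 7172
```

The caller never needs to know that '18-66' sits at position 2, which is the convenience that labelled axes provide.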

To create an array from scratch, you need to supply data and axes:

[4]:
# define some data. This is the Belgian population (in thousands). Source: Eurostat.
data = [[[633, 635, 634],
         [663, 665, 664]],
        [[484, 486, 491],
         [505, 511, 516]],
        [[3572, 3581, 3583],
         [3600, 3618, 3616]],
        [[1023, 1038, 1053],
         [756, 775, 793]]]

# create an Array object
pop = Array(data, axes=[age, gender, time])
pop
[4]:
  age  gender\time  2015  2016  2017
  0-9       female   633   635   634
  0-9         male   663   665   664
10-17       female   484   486   491
10-17         male   505   511   516
18-66       female  3572  3581  3583
18-66         male  3600  3618  3616
  67+       female  1023  1038  1053
  67+         male   756   775   793

You can optionally attach some metadata to an array:

[5]:
# attach some metadata to the pop array
pop.meta.title = 'population by age, sex and year'
pop.meta.source = 'Eurostat'

# display metadata
pop.meta
[5]:
title: population by age, sex and year
source: Eurostat

To get a short summary of an array, type:

[6]:
# Array summary: metadata + dimensions + description of axes
pop.info
[6]:
title: population by age, sex and year
source: Eurostat
4 x 2 x 3
 age [4]: '0-9' '10-17' '18-66' '67+'
 gender [2]: 'female' 'male'
 time [3]: 2015 2016 2017
dtype: int64
memory used: 192 bytes

Create an array filled with predefined values

Arrays filled with predefined values can be generated through dedicated functions:

  • zeros : creates an array filled with 0

  • ones : creates an array filled with 1

  • full : creates an array filled with a given value

  • sequence : creates an array by sequentially applying modifications along an axis, starting from an initial value (in the example below, starting at 0 and adding 1 at each step)

  • ndtest : creates a test array with increasing numbers as data

[7]:
zeros([age, gender])
[7]:
age\gender  female  male
       0-9     0.0   0.0
     10-17     0.0   0.0
     18-66     0.0   0.0
       67+     0.0   0.0
[8]:
ones([age, gender])
[8]:
age\gender  female  male
       0-9     1.0   1.0
     10-17     1.0   1.0
     18-66     1.0   1.0
       67+     1.0   1.0
[9]:
full([age, gender], fill_value=10.0)
[9]:
age\gender  female  male
       0-9    10.0  10.0
     10-17    10.0  10.0
     18-66    10.0  10.0
       67+    10.0  10.0
[10]:
sequence(age)
[10]:
age  0-9  10-17  18-66  67+
       0      1      2    3
[11]:
ndtest([age, gender])
[11]:
age\gender  female  male
       0-9       0     1
     10-17       2     3
     18-66       4     5
       67+       6     7

Save/Load an array

The LArray library offers many I/O functions to read and write arrays in various formats (CSV, Excel, HDF5). For example, to save an array in a CSV file, call the method to_csv:

[12]:
# save our pop array to a CSV file
pop.to_csv('belgium_pop.csv')

The content of the CSV file is then:

age,gender\time,2015,2016,2017
0-9,female,633,635,634
0-9,male,663,665,664
10-17,female,484,486,491
10-17,male,505,511,516
18-66,female,3572,3581,3583
18-66,male,3600,3618,3616
67+,female,1023,1038,1053
67+,male,756,775,793

Note: In CSV or Excel files, the last dimension is horizontal and the names of the last two dimensions are separated by a backslash (\).
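The layout described in the note can be checked with Python's standard csv module. The sketch below only illustrates how the header line encodes the axis names; it is not the code used by read_csv, and the variable names are assumptions made for this example.

```python
import csv
import io

# First two lines of the file shown above, embedded so the sketch is self-contained.
text = "age,gender\\time,2015,2016,2017\n0-9,female,633,635,634\n"

rows = list(csv.reader(io.StringIO(text)))
header, first_row = rows[0], rows[1]

# The header cell containing a backslash names the last two axes.
split_at = next(i for i, cell in enumerate(header) if '\\' in cell)
row_axis, column_axis = header[split_at].split('\\')

# Axis names in order, and the labels of the last (horizontal) axis.
axis_names = header[:split_at] + [row_axis, column_axis]
column_labels = [int(label) for label in header[split_at + 1:]]

print(axis_names)      # ['age', 'gender', 'time']
print(column_labels)   # [2015, 2016, 2017]
```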

To load a saved array, call the function read_csv:

[13]:
pop = read_csv('belgium_pop.csv')
pop
[13]:
  age  gender\time  2015  2016  2017
  0-9       female   633   635   634
  0-9         male   663   665   664
10-17       female   484   486   491
10-17         male   505   511   516
18-66       female  3572  3581  3583
18-66         male  3600  3618  3616
  67+       female  1023  1038  1053
  67+         male   756   775   793

Other input/output functions are described in the Input/Output section of the API documentation.

Selecting a subset

To select an element or a subset of an array, use brackets [ ]. In Python we usually use the term indexing for this operation.

Let us start by selecting a single element:

[14]:
pop['67+', 'female', 2017]
[14]:
1053

Labels can be given in arbitrary order:

[15]:
pop[2017, 'female', '67+']
[15]:
1053

When selecting a larger subset the result is an array:

[16]:
pop['female']
[16]:
age\time  2015  2016  2017
     0-9   633   635   634
   10-17   484   486   491
   18-66  3572  3581  3583
     67+  1023  1038  1053

When selecting several labels for the same axis, they must be given as a list (enclosed by [ ]):

[17]:
pop['female', ['0-9', '10-17']]
[17]:
age\time  2015  2016  2017
     0-9   633   635   634
   10-17   484   486   491

You can also select slices, that is, all labels between two bounds (usually called the start and stop bounds). Both bounds are optional: when omitted, start defaults to the first label of the corresponding axis and stop to the last one:

[18]:
# in this case '10-17':'67+' is equivalent to ['10-17', '18-66', '67+']
pop['female', '10-17':'67+']
[18]:
age\time  2015  2016  2017
   10-17   484   486   491
   18-66  3572  3581  3583
     67+  1023  1038  1053
[19]:
# :'18-66' selects all labels between the first one and '18-66'
# 2017: selects all labels between 2017 and the last one
pop[:'18-66', 2017:]
[19]:
  age  gender\time  2017
  0-9       female   634
  0-9         male   664
10-17       female   491
10-17         male   516
18-66       female  3583
18-66         male  3616

Note: Contrary to slices on normal Python lists, the stop bound is included in the selection.
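The difference between the two slicing conventions can be sketched in plain Python. The helper label_slice below is hypothetical and written only for this comparison; it mimics the inclusive behaviour described in the note and is not part of LArray.

```python
age_labels = ['0-9', '10-17', '18-66', '67+']


def label_slice(labels, start=None, stop=None):
    """Return the labels from `start` to `stop`, both bounds included."""
    first = 0 if start is None else labels.index(start)
    last = len(labels) - 1 if stop is None else labels.index(stop)
    return labels[first:last + 1]


# Label-based slice: the stop bound '18-66' is part of the result.
print(label_slice(age_labels, stop='18-66'))   # ['0-9', '10-17', '18-66']

# Position-based slice on a plain list: the stop position 2 is excluded.
print(age_labels[:2])                          # ['0-9', '10-17']
```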

Selecting by labels as above only works as long as there is no ambiguity. When several axes have some labels in common and you do not specify explicitly on which axis to work, it fails with an error ending with something like:

ValueError: <label> is ambiguous (valid in <axis1>, <axis2>)

For example, imagine you need to work with an ‘immigration’ array containing two axes sharing some common labels:

[20]:
country = Axis(['Belgium', 'Netherlands', 'Germany'], 'country')
citizenship = Axis(['Belgium', 'Netherlands', 'Germany'], 'citizenship')

immigration = ndtest((country, citizenship, time))

immigration
[20]:
    country  citizenship\time  2015  2016  2017
    Belgium           Belgium     0     1     2
    Belgium       Netherlands     3     4     5
    Belgium           Germany     6     7     8
Netherlands           Belgium     9    10    11
Netherlands       Netherlands    12    13    14
Netherlands           Germany    15    16    17
    Germany           Belgium    18    19    20
    Germany       Netherlands    21    22    23
    Germany           Germany    24    25    26

If we try to get the number of Belgians living in the Netherlands for the year 2017, we might try something like:

immigration['Netherlands', 'Belgium', 2017]

… but we receive back a volley of insults:

[some long error message ending with the line below]
[...]
ValueError: Netherlands is ambiguous (valid in country, citizenship)

In that case, we have to specify explicitly which axis each of the ‘Netherlands’ and ‘Belgium’ labels belongs to:

[21]:
immigration[country['Netherlands'], citizenship['Belgium'], 2017]
[21]:
11
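Why the unqualified selection fails can be sketched in a few lines of plain Python. This is an assumption about the general idea (look the label up in every axis and refuse to guess when more than one matches), not LArray's actual code; find_axis and the "not a valid label" message are hypothetical.

```python
# Axis names and labels of the immigration array, as plain Python data.
axes = {
    'country': ['Belgium', 'Netherlands', 'Germany'],
    'citizenship': ['Belgium', 'Netherlands', 'Germany'],
    'time': [2015, 2016, 2017],
}


def find_axis(label):
    """Return the name of the only axis containing `label`, or raise ValueError."""
    matches = [name for name, labels in axes.items() if label in labels]
    if not matches:
        # Hypothetical message, not taken from LArray.
        raise ValueError(f"{label} is not a valid label for any axis")
    if len(matches) > 1:
        raise ValueError(f"{label} is ambiguous (valid in {', '.join(matches)})")
    return matches[0]


print(find_axis(2017))   # time

try:
    find_axis('Netherlands')
except ValueError as err:
    print(err)           # Netherlands is ambiguous (valid in country, citizenship)
```

Qualifying the label with its axis, as in country['Netherlands'], removes the need for this lookup altogether.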

Aggregation

The LArray library includes many aggregation methods: sum, mean, min, max, std, var, …

For example, assuming we still have an array in the pop variable:

[22]:
pop
[22]:
  age  gender\time  2015  2016  2017
  0-9       female   633   635   634
  0-9         male   663   665   664
10-17       female   484   486   491
10-17         male   505   511   516
18-66       female  3572  3581  3583
18-66         male  3600  3618  3616
  67+       female  1023  1038  1053
  67+         male   756   775   793

We can sum along the ‘gender’ axis using:

[23]:
pop.sum(gender)
[23]:
age\time  2015  2016  2017
     0-9  1296  1300  1298
   10-17   989   997  1007
   18-66  7172  7199  7199
     67+  1779  1813  1846

Or sum along both ‘age’ and ‘gender’:

[24]:
pop.sum(age, gender)
[24]:
time   2015   2016   2017
      11236  11309  11350

It is sometimes more convenient to aggregate along all axes except some. In that case, use the aggregation methods ending with _by. For example:

[25]:
pop.sum_by(time)
[25]:
time   2015   2016   2017
      11236  11309  11350
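To see what these aggregations compute, the same totals can be reproduced with nested lists in plain Python. This is only a cross-check of the arithmetic under the data[age][gender][time] layout used when the array was created, not a description of how LArray performs the computation.

```python
# Same figures as the pop array, indexed as data[age][gender][time].
data = [[[633, 635, 634], [663, 665, 664]],
        [[484, 486, 491], [505, 511, 516]],
        [[3572, 3581, 3583], [3600, 3618, 3616]],
        [[1023, 1038, 1053], [756, 775, 793]]]

n_time = 3

# Counterpart of pop.sum(gender): collapse the gender level, keep age and time.
sum_gender = [[sum(gender_row[t] for gender_row in age_block) for t in range(n_time)]
              for age_block in data]

# Counterpart of pop.sum(age, gender) and of pop.sum_by(time): keep only time.
sum_by_time = [sum(gender_row[t] for age_block in data for gender_row in age_block)
               for t in range(n_time)]

print(sum_gender[0])   # [1296, 1300, 1298]
print(sum_by_time)     # [11236, 11309, 11350]
```

Summing over every axis except time gives the same result as summing over age and gender explicitly, which is why both calls above display identical output.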

Groups

A Group object represents a subset of labels or positions of an axis:

[26]:
children = age['0-9', '10-17']
children
[26]:
age['0-9', '10-17']

It is often useful to attach an explicit name to them using the >> operator:

[27]:
working = age['18-66'] >> 'working'
working
[27]:
age['18-66'] >> 'working'
[28]:
nonworking = age['0-9', '10-17', '67+'] >> 'nonworking'
nonworking
[28]:
age['0-9', '10-17', '67+'] >> 'nonworking'

Still using the same pop array:

[29]:
pop
[29]:
  age  gender\time  2015  2016  2017
  0-9       female   633   635   634
  0-9         male   663   665   664
10-17       female   484   486   491
10-17         male   505   511   516
18-66       female  3572  3581  3583
18-66         male  3600  3618  3616
  67+       female  1023  1038  1053
  67+         male   756   775   793

Groups can be used in selections:

[30]:
pop[working]
[30]:
gender\time  2015  2016  2017
     female  3572  3581  3583
       male  3600  3618  3616
[31]:
pop[nonworking]
[31]:
  age  gender\time  2015  2016  2017
  0-9       female   633   635   634
  0-9         male   663   665   664
10-17       female   484   486   491
10-17         male   505   511   516
  67+       female  1023  1038  1053
  67+         male   756   775   793

or aggregations:

[32]:
pop.sum(nonworking)
[32]:
gender\time  2015  2016  2017
     female  2140  2159  2178
       male  1924  1951  1973

When aggregating several groups, the names we set above using >> determine the labels on the aggregated axis. Since we did not give a name to the children group, the resulting label is generated automatically:

[33]:
pop.sum((children, working, nonworking))
[33]:
       age  gender\time  2015  2016  2017
 0-9,10-17       female  1117  1121  1125
 0-9,10-17         male  1168  1176  1180
   working       female  3572  3581  3583
   working         male  3600  3618  3616
nonworking       female  2140  2159  2178
nonworking         male  1924  1951  1973
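The labelling rule can be illustrated with a small plain-Python sketch restricted to the female population in 2015. The (name, labels) pairs are hypothetical stand-ins for Group objects, and joining the member labels with a comma is inferred from the output above rather than taken from LArray's code.

```python
age_labels = ['0-9', '10-17', '18-66', '67+']
# Female population for 2015, one value per age label (from the pop array).
female_2015 = [633, 484, 3572, 1023]

# Hypothetical stand-ins for the groups: a name (or None) and the labels covered.
groups = [
    (None, ['0-9', '10-17']),                  # unnamed, like `children`
    ('working', ['18-66']),
    ('nonworking', ['0-9', '10-17', '67+']),
]

result = {}
for name, labels in groups:
    # Unnamed groups fall back to a label built from their members.
    key = name if name is not None else ','.join(labels)
    result[key] = sum(female_2015[age_labels.index(label)] for label in labels)

print(result)  # {'0-9,10-17': 1117, 'working': 3572, 'nonworking': 2140}
```

Note that groups may overlap: '0-9' and '10-17' contribute to both the unnamed group and nonworking.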

Grouping arrays in a Session

Arrays may be grouped in Session objects. A session is an ordered dict-like container of Array objects with special I/O methods. To create a session, pass the arrays to the Session constructor, either as keyword arguments (as below) or as a list of pairs (array_name, array):

[34]:
pop = zeros([age, gender, time])
births = zeros([age, gender, time])
deaths = zeros([age, gender, time])

# create a session containing the three arrays 'pop', 'births' and 'deaths'
demo = Session(pop=pop, births=births, deaths=deaths)

# displays names of arrays contained in the session
demo.names
# get an array
demo['pop']
# add/modify an array
demo['foreigners'] = zeros([age, gender, time])

If you are using a Python version prior to 3.6, you have to pass a list of pairs to the Session constructor; otherwise, the arrays will be stored in an arbitrary order in the new session. In that case, the session above must be created using the syntax:

demo = Session([('pop', pop), ('births', births), ('deaths', deaths)])

One of the main advantages of sessions is that they let you save and load many arrays at once:

[35]:
# dump all arrays contained in the session 'demo' in one HDF5 file
demo.save('demo.h5')
# load all arrays saved in the HDF5 file 'demo.h5' and store them in the session 'demo'
demo = Session('demo.h5')

Graphical User Interface (viewer)

The LArray project provides an optional package called larray-editor allowing users to explore and edit arrays through a graphical interface.

The larray-editor tool is automatically available when installing the larrayenv metapackage from conda.

To explore the content of arrays in read-only mode, call the view function:

# shows the arrays of a given session in a graphical user interface
view(demo)

# the session may be directly loaded from a file
view('demo.h5')

# creates a session with all existing arrays from the current namespace
# and shows its content
view()

To open the user interface in edit mode, call the edit function instead.

Finally, you can also visually compare two arrays or sessions using the compare function:

arr0 = ndtest((3, 3))
arr1 = ndtest((3, 3))
arr1[['a1', 'a2']] = -arr1[['a1', 'a2']]
compare(arr0, arr1)

For Windows Users

Installing the larray-editor package on Windows will create an LArray menu in the Windows Start Menu. This menu contains:

  • a shortcut to open the documentation of the latest stable version of the library.

  • a shortcut to open the graphical interface in edit mode.

  • a shortcut to update larrayenv.

Once the graphical interface is open, all LArray objects and functions are directly accessible: there is no need to start with from larray import *.