Load And Dump Arrays
The LArray library provides methods and functions to load and dump Array, Session, Axis and Group objects to several formats such as Excel, CSV and HDF5. The HDF5 file format is designed to store and organize large amounts of data, and allows reading and writing data much faster than with CSV and Excel files.
[1]:
# first of all, import the LArray library
from larray import *
Loading Arrays - Basic Usage (CSV, Excel, HDF5)
To read an array from a CSV file, you must use the read_csv function:
[2]:
csv_dir = get_example_filepath('examples')
# read the array population from the file 'population.csv'.
# The data of the array below is derived from a subset of the demo_pjan table from Eurostat
population = read_csv(csv_dir / 'population.csv')
population
[2]:
country gender\time 2013 2014 2015
Belgium Male 5472856 5493792 5524068
Belgium Female 5665118 5687048 5713206
France Male 31772665 32045129 32174258
France Female 33827685 34120851 34283895
Germany Male 39380976 39556923 39835457
Germany Female 41142770 41210540 41362080
To read an array from a sheet of an Excel file, you can use the read_excel function:
[3]:
filepath_excel = get_example_filepath('examples.xlsx')
# read the array from the sheet 'births' of the Excel file 'examples.xlsx'
# The data of the array below is derived from a subset of the demo_fasec table from Eurostat
births = read_excel(filepath_excel, 'births')
births
[3]:
country gender\time 2013 2014 2015
Belgium Male 64371 64173 62561
Belgium Female 61235 60841 59713
France Male 415762 418721 409145
France Female 396581 400607 390526
Germany Male 349820 366835 378478
Germany Female 332249 348092 359097
The open_excel function in combination with the load method allows you to load several arrays from the same Workbook without opening and closing it several times:
# open the Excel file 'examples.xlsx' and keep it open for the duration of the indented block.
# The Python keyword ``with`` ensures that the Excel file is properly closed even if an error occurs
with open_excel(filepath_excel) as wb:
# load the array 'population' from the sheet 'population'
population = wb['population'].load()
# load the array 'births' from the sheet 'births'
births = wb['births'].load()
# load the array 'deaths' from the sheet 'deaths'
deaths = wb['deaths'].load()
# the Workbook is automatically closed when leaving the block defined by the with statement
Warning: open_excel only works on Windows and requires the xlwings library to be installed.
The HDF5 file format is specifically designed to store and organize large amounts of data. Reading and writing data in this file format is much faster than with CSV or Excel. An HDF5 file can contain multiple arrays, each array being associated with a key. To read an array from an HDF5 file, you must use the read_hdf function and provide the key associated with the array:
[4]:
filepath_hdf = get_example_filepath('examples.h5')
# read the array from the file 'examples.h5' associated with the key 'deaths'
# The data of the array below is derived from a subset of the demo_magec table from Eurostat
deaths = read_hdf(filepath_hdf, 'deaths')
deaths
Dumping Arrays - Basic Usage (CSV, Excel, HDF5)
To write an array to a CSV file, you must use the to_csv method:
[5]:
# save the array population in the file 'population.csv'
population.to_csv('population.csv')
To write an array to a sheet of an Excel file, you can use the to_excel method:
[6]:
# save the array population in the sheet 'population' of the Excel file 'population.xlsx'
population.to_excel('population.xlsx', 'population')
Note that to_excel creates a new Excel file if it does not exist yet. If the file already exists, a new sheet is added after the existing ones if that sheet does not already exist:
[7]:
# add a new sheet 'births' to the file 'population.xlsx' and save the array births in it
births.to_excel('population.xlsx', 'births')
To reset an Excel file, you simply need to set the overwrite_file argument to True:
[8]:
# 1. reset the file 'population.xlsx' (all sheets are removed)
# 2. create a sheet 'population' and save the array population in it
population.to_excel('population.xlsx', 'population', overwrite_file=True)
The open_excel function in combination with the dump() method allows you to open a Workbook and export several arrays at once. If the Excel file doesn't exist, the overwrite_file argument must be set to True.
Warning: The save method must be called at the end of the block defined by the with statement to actually write data to the Excel file, otherwise you will end up with an empty file.
# to create a new Excel file, argument overwrite_file must be set to True
with open_excel('population.xlsx', overwrite_file=True) as wb:
# add a new sheet 'population' and dump the array population in it
wb['population'] = population.dump()
# add a new sheet 'births' and dump the array births in it
wb['births'] = births.dump()
# add a new sheet 'deaths' and dump the array deaths in it
wb['deaths'] = deaths.dump()
# actually write data in the Workbook
wb.save()
# the Workbook is automatically closed when leaving the block defined by the with statement
To write an array to an HDF5 file, you must use the to_hdf method and provide the key that will be associated with the array:
[9]:
# save the array population in the file 'population.h5' and associate it with the key 'population'
population.to_hdf('population.h5', 'population')
Specifying Wide VS Narrow format (CSV, Excel)
By default, all reading functions assume that arrays are stored in the wide format, meaning that their last axis is represented horizontally:
| country \ time | 2013 | 2014 | 2015 |
|---|---|---|---|
| Belgium | 11137974 | 11180840 | 11237274 |
| France | 65600350 | 65942267 | 66456279 |
By setting the wide argument to False, reading functions will assume instead that arrays are stored in the narrow format, i.e. one column per axis plus one value column:
| country | time | value |
|---|---|---|
| Belgium | 2013 | 11137974 |
| Belgium | 2014 | 11180840 |
| Belgium | 2015 | 11237274 |
| France | 2013 | 65600350 |
| France | 2014 | 65942267 |
| France | 2015 | 66456279 |
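The relationship between the two layouts can be sketched independently of larray, using only plain Python: each wide row explodes into one narrow row per label of the last axis. All names below are illustrative, not part of the larray API.

```python
# Wide layout: one row per 'country' label, one column per 'time' label.
wide_header = ["country\\time", "2013", "2014", "2015"]
wide_rows = [
    ["Belgium", "11137974", "11180840", "11237274"],
    ["France", "65600350", "65942267", "66456279"],
]

# Narrow layout: one column per axis plus a single 'value' column.
# Each wide row yields one narrow row per (time, value) pair.
narrow_header = ["country", "time", "value"]
narrow_rows = [
    [country, time, value]
    for country, *values in wide_rows
    for time, value in zip(wide_header[1:], values)
]

print(narrow_rows[0])   # ['Belgium', '2013', '11137974']
```

The wide layout stores the same information in fewer cells; the narrow layout is the one typically expected by databases and plotting libraries.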
[10]:
# set 'wide' argument to False to indicate that the array is stored in the 'narrow' format
population_BE_FR = read_csv(csv_dir / 'population_narrow_format.csv', wide=False)
population_BE_FR
[10]:
country\time 2013 2014 2015
Belgium 11137974 11180840 11237274
France 65600350 66165980 66458153
[11]:
# same for the read_excel function
population_BE_FR = read_excel(filepath_excel, sheet='population_narrow_format', wide=False)
population_BE_FR
[11]:
country\time 2013 2014 2015
Belgium 11137974 11180840 11237274
France 65600350 66165980 66458153
By default, writing functions set the name of the column containing the data to 'value'. You can choose another name for this column with the value_name argument. For example, using value_name='population' you can export the previous array as:
| country | time | population |
|---|---|---|
| Belgium | 2013 | 11137974 |
| Belgium | 2014 | 11180840 |
| Belgium | 2015 | 11237274 |
| France | 2013 | 65600350 |
| France | 2014 | 65942267 |
| France | 2015 | 66456279 |
[12]:
# dump the array population_BE_FR in a narrow format (one column per axis plus one value column).
# By default, the name of the column containing data is set to 'value'
population_BE_FR.to_csv('population_narrow_format.csv', wide=False)
# same but replace 'value' by 'population'
population_BE_FR.to_csv('population_narrow_format.csv', wide=False, value_name='population')
[13]:
# same for the to_excel method
population_BE_FR.to_excel('population.xlsx', 'population_narrow_format', wide=False, value_name='population')
Like with the to_excel method, it is possible to export arrays in the narrow format using open_excel. To do so, you must set the wide argument of the dump method to False:
with open_excel('population.xlsx') as wb:
# dump the array population_BE_FR in a narrow format:
# one column per axis plus one value column.
# Argument value_name can be used to change the name of the
# column containing the data (default name is 'value')
wb['population_narrow_format'] = population_BE_FR.dump(wide=False, value_name='population')
# don't forget to call save()
wb.save()
# in the sheet 'population_narrow_format', data is written as:
# | country | time | population |
# | ------- | ---- | ---------- |
# | Belgium | 2013 | 11137974 |
# | Belgium | 2014 | 11180840 |
# | Belgium | 2015 | 11237274 |
# | France | 2013 | 65600350 |
# | France | 2014 | 65942267 |
# | France | 2015 | 66456279 |
Specifying Position in Sheet (Excel)
If you want to read an array from an Excel sheet which does not start at cell A1 (for example, when more than one array is stored in the same sheet), you will need to use the range argument.
Warning: Note that the range argument is only available if you have the xlwings library installed (Windows only).
# the 'range' argument must be used to load data not starting at cell A1.
# This is useful when there are several arrays stored in the same sheet
births = read_excel(filepath_excel, sheet='population_births_deaths', range='A9:E15')
Using open_excel, ranges are passed in brackets:
with open_excel(filepath_excel) as wb:
# store sheet 'population_births_deaths' in a temporary variable sh
sh = wb['population_births_deaths']
# load the array population from range A1:E7
population = sh['A1:E7'].load()
# load the array births from range A9:E15
births = sh['A9:E15'].load()
# load the array deaths from range A17:E23
deaths = sh['A17:E23'].load()
# the Workbook is automatically closed when leaving the block defined by the with statement
When exporting arrays to Excel files, data is written starting at cell A1 by default. Using the position argument of the to_excel method, it is possible to specify the top left cell of the dumped data. This can be useful when you want to export several arrays to the same sheet, for example.
Warning: Note that the position argument is only available if you have the xlwings library installed (Windows only).
filename = 'population.xlsx'
sheetname = 'population_births_deaths'
# save the arrays population, births and deaths in the same sheet 'population_births_deaths'.
# The 'position' argument is used to shift the location of the second and third arrays to be dumped
population.to_excel(filename, sheetname)
births.to_excel(filename, sheetname, position='A9')
deaths.to_excel(filename, sheetname, position='A17')
Using open_excel, the position is passed in brackets (this also allows you to add extra information):
with open_excel('population.xlsx') as wb:
# add a new sheet 'population_births_deaths' and write 'population' in the first cell
# note: you can use wb['new_sheet_name'] = '' to create an empty sheet
wb['population_births_deaths'] = 'population'
# store sheet 'population_births_deaths' in a temporary variable sh
sh = wb['population_births_deaths']
# dump the array population in sheet 'population_births_deaths' starting at cell A2
sh['A2'] = population.dump()
# add 'births' in cell A10
sh['A10'] = 'births'
# dump the array births in sheet 'population_births_deaths' starting at cell A11
sh['A11'] = births.dump()
# add 'deaths' in cell A19
sh['A19'] = 'deaths'
# dump the array deaths in sheet 'population_births_deaths' starting at cell A20
sh['A20'] = deaths.dump()
# don't forget to call save()
wb.save()
# the Workbook is automatically closed when leaving the block defined by the with statement
Exporting data without headers (Excel)
For some reason, you may want to export only the data of an array, without its axes. For example, you may want to insert a new column containing extra information. As an exercise, let us assume we want to add the capital city for each country present in the array containing the total population by country:
| country | capital city | 2013 | 2014 | 2015 |
|---|---|---|---|---|
| Belgium | Brussels | 11137974 | 11180840 | 11237274 |
| France | Paris | 65600350 | 65942267 | 66456279 |
| Germany | Berlin | 80523746 | 80767463 | 81197537 |
Assuming you have prepared an Excel sheet as below:
| country | capital city | 2013 | 2014 | 2015 |
|---|---|---|---|---|
| Belgium | Brussels | | | |
| France | Paris | | | |
| Germany | Berlin | | | |
you can then dump the data in the right place by setting the header argument of to_excel to False and specifying the position of the data in the sheet:
population_by_country = population.sum('gender')
# export only the data of the array population_by_country starting at cell C2
population_by_country.to_excel('population.xlsx', 'population_by_country', header=False, position='C2')
Using open_excel, you can easily prepare the sheet and then export only the data in the right place, either by setting the header argument of the dump method to False or by not calling dump at all:
with open_excel('population.xlsx') as wb:
# create new empty sheet 'population_by_country'
wb['population_by_country'] = ''
# store sheet 'population_by_country' in a temporary variable sh
sh = wb['population_by_country']
# write extra information (description)
sh['A1'] = 'Population at 1st January by country'
# export column names
sh['A2'] = ['country', 'capital city']
sh['C2'] = population_by_country.time.labels
# export countries as first column
sh['A3'].options(transpose=True).value = population_by_country.country.labels
# export capital cities as second column
sh['B3'].options(transpose=True).value = ['Brussels', 'Paris', 'Berlin']
# export only data of population_by_country
sh['C3'] = population_by_country.dump(header=False)
# or equivalently
sh['C3'] = population_by_country
# don't forget to call save()
wb.save()
# the Workbook is automatically closed when leaving the block defined by the with statement
Specifying the Number of Axes at Reading (CSV, Excel)
By default, read_csv and read_excel search the header line for the position of the first cell containing the special character \, which is used to separate the names of the two last axes, in order to determine the number of axes of the array to read. If there is no \ character, read_csv and read_excel will consider that the array to read has only one dimension. For an array stored as:
| country | gender \ time | 2013 | 2014 | 2015 |
|---|---|---|---|---|
| Belgium | Male | 5472856 | 5493792 | 5524068 |
| Belgium | Female | 5665118 | 5687048 | 5713206 |
| France | Male | 31772665 | 31936596 | 32175328 |
| France | Female | 33827685 | 34005671 | 34280951 |
| Germany | Male | 39380976 | 39556923 | 39835457 |
| Germany | Female | 41142770 | 41210540 | 41362080 |
read_csv and read_excel will find the special character \ in the second cell, meaning they expect three axes (country, gender and time).
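This detection rule can be sketched as a small standalone function. Note that guess_nb_axes is a hypothetical helper written for illustration, not part of the larray API:

```python
def guess_nb_axes(header):
    """Guess the number of axes from a header row, mimicking the rule
    described above: the first cell containing '\\' names the two last
    axes, and every cell before it is one extra axis; if no cell
    contains '\\', assume a single axis."""
    for pos, cell in enumerate(header):
        if "\\" in cell:
            # 'pos' leading axis columns, plus the two axes named in this cell
            return pos + 2
    return 1

print(guess_nb_axes(["country", "gender\\time", "2013", "2014", "2015"]))  # 3
print(guess_nb_axes(["country\\time", "2013", "2014"]))                    # 2
print(guess_nb_axes(["time", "2013", "2014"]))                             # 1
```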
Sometimes, you need to read an array for which the name of the last axis is implicit:
| country | gender | 2013 | 2014 | 2015 |
|---|---|---|---|---|
| Belgium | Male | 5472856 | 5493792 | 5524068 |
| Belgium | Female | 5665118 | 5687048 | 5713206 |
| France | Male | 31772665 | 31936596 | 32175328 |
| France | Female | 33827685 | 34005671 | 34280951 |
| Germany | Male | 39380976 | 39556923 | 39835457 |
| Germany | Female | 41142770 | 41210540 | 41362080 |
In such cases, you have to tell read_csv and read_excel the number of axes of the output array by setting the nb_axes argument:
[14]:
# read the 3 x 2 x 3 array stored in the file 'population_missing_axis_name.csv' without using the 'nb_axes' argument.
population = read_csv(csv_dir / 'population_missing_axis_name.csv')
# shape and data type of the output array are not what we expected
population.info
[14]:
6 x 4
country [6]: 'Belgium' 'Belgium' 'France' 'France' 'Germany' 'Germany'
{1} [4]: 'gender' '2013' '2014' '2015'
dtype: object
memory used: 192 bytes
[15]:
# by setting the 'nb_axes' argument, you can indicate to read_csv the number of axes of the output array
population = read_csv(csv_dir / 'population_missing_axis_name.csv', nb_axes=3)
# give a name to the last axis
population = population.rename(-1, 'time')
# shape and data type of the output array are what we expected
population.info
[15]:
3 x 2 x 3
country [3]: 'Belgium' 'France' 'Germany'
gender [2]: 'Male' 'Female'
time [3]: 2013 2014 2015
dtype: int64
memory used: 144 bytes
[16]:
# same for the read_excel function
population = read_excel(filepath_excel, sheet='population_missing_axis_name', nb_axes=3)
population = population.rename(-1, 'time')
population.info
[16]:
3 x 2 x 3
country [3]: 'Belgium' 'France' 'Germany'
gender [2]: 'Male' 'Female'
time [3]: 2013 2014 2015
dtype: int64
memory used: 144 bytes
NaNs and Missing Data Handling at Reading (CSV, Excel)
Sometimes, there is no data available for some label combinations. In the example below, the rows corresponding to France - Male
and Germany - Female
are missing:
| country | gender \ time | 2013 | 2014 | 2015 |
|---|---|---|---|---|
| Belgium | Male | 5472856 | 5493792 | 5524068 |
| Belgium | Female | 5665118 | 5687048 | 5713206 |
| France | Female | 33827685 | 34005671 | 34280951 |
| Germany | Male | 39380976 | 39556923 | 39835457 |
By default, read_csv and read_excel fill cells associated with missing label combinations with NaN. Be aware that, in that case, an integer array will be converted to a float array.
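What this filling amounts to can be illustrated without larray: build the full cross-product of labels and fill the combinations absent from the file. It also shows why NaN forces a float result: NaN is itself a float, so a column containing it can no longer stay integer. The names below are illustrative, not larray API.

```python
from itertools import product

# Rows actually present in the file (France-Male and Germany-Female missing).
present = {
    ("Belgium", "Male"): [5472856, 5493792, 5524068],
    ("Belgium", "Female"): [5665118, 5687048, 5713206],
    ("France", "Female"): [33827685, 34005671, 34280951],
    ("Germany", "Male"): [39380976, 39556923, 39835457],
}

fill_value = float("nan")          # NaN is a float, hence the dtype change
countries = ["Belgium", "France", "Germany"]
genders = ["Male", "Female"]

# Build the full cross-product of labels, filling missing combinations.
full = {combo: present.get(combo, [fill_value] * 3)
        for combo in product(countries, genders)}

print(full[("France", "Male")])   # [nan, nan, nan]
```

Passing fill_value=0 to the reading functions corresponds to replacing float("nan") above with 0, which keeps the array integer.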
[17]:
# by default, cells associated with missing label combinations are filled with NaN.
# In that case, the output array is converted to a float array
read_csv(csv_dir / 'population_missing_values.csv')
[17]:
country gender\time 2013 2014 2015
Belgium Male 5472856.0 5493792.0 5524068.0
Belgium Female 5665118.0 5687048.0 5713206.0
France Male nan nan nan
France Female 33827685.0 34120851.0 34283895.0
Germany Male 39380976.0 39556923.0 39835457.0
Germany Female nan nan nan
However, it is possible to choose which value to use to fill missing cells using the fill_value argument:
[18]:
read_csv(csv_dir / 'population_missing_values.csv', fill_value=0)
[18]:
country gender\time 2013 2014 2015
Belgium Male 5472856 5493792 5524068
Belgium Female 5665118 5687048 5713206
France Male 0 0 0
France Female 33827685 34120851 34283895
Germany Male 39380976 39556923 39835457
Germany Female 0 0 0
[19]:
# same for the read_excel function
read_excel(filepath_excel, sheet='population_missing_values', fill_value=0)
[19]:
country gender\time 2013 2014 2015
Belgium Male 5472856 5493792 5524068
Belgium Female 5665118 5687048 5713206
France Male 0 0 0
France Female 33827685 34120851 34283895
Germany Male 39380976 39556923 39835457
Germany Female 0 0 0
Sorting Axes at Reading (CSV, Excel, HDF5)
The sort_rows and sort_columns arguments of the reading functions allow you to sort rows and columns alphabetically:
[20]:
# sort labels at reading --> Male and Female labels are inverted
read_csv(csv_dir / 'population.csv', sort_rows=True)
[20]:
country gender\time 2013 2014 2015
Belgium Female 5665118 5687048 5713206
Belgium Male 5472856 5493792 5524068
France Female 33827685 34120851 34283895
France Male 31772665 32045129 32174258
Germany Female 41142770 41210540 41362080
Germany Male 39380976 39556923 39835457
[21]:
read_excel(filepath_excel, sheet='births', sort_rows=True)
[21]:
country gender\time 2013 2014 2015
Belgium Female 61235 60841 59713
Belgium Male 64371 64173 62561
France Female 396581 400607 390526
France Male 415762 418721 409145
Germany Female 332249 348092 359097
Germany Male 349820 366835 378478
[22]:
read_hdf(filepath_hdf, key='deaths').sort_labels()
Metadata (HDF5)
It is possible to add metadata to arrays:
[23]:
population.meta.title = 'Population at 1st January'
population.meta.origin = 'Table demo_pjan from Eurostat'
population.info
[23]:
title: Population at 1st January
origin: Table demo_pjan from Eurostat
3 x 2 x 3
country [3]: 'Belgium' 'France' 'Germany'
gender [2]: 'Male' 'Female'
time [3]: 2013 2014 2015
dtype: int64
memory used: 144 bytes
These metadata are automatically saved and loaded when working with the HDF5 file format:
[24]:
population.to_hdf('population.h5', 'population')
new_population = read_hdf('population.h5', 'population')
new_population.info
Warning: Currently, metadata associated with arrays cannot be saved and loaded when working with CSV and Excel files. This restriction does not apply, however, to metadata associated with sessions.