ArrayViews

Introduction

Many array libraries implement objects for storing data in allocated memory areas. In the Python ecosystem alone there are more than a few such libraries (see below), and some of them are designed to reference both host RAM and the memory of accelerator devices such as GPUs. These packages implement various computational algorithms that one may wish to apply to data stored in an array object other than the one the algorithms use.

Many array object implementations support the Python Buffer Protocol (PEP 3118), which makes it possible to create array objects from other array object implementations without copying the memory. This is called creating array views.
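As a minimal illustration of what a view means, the sketch below uses only the Python standard library (`array.array` and `memoryview`), not arrayviews itself. The variable names are chosen for this example only.

```python
from array import array

# A host-memory array of 64-bit floats that exposes the buffer protocol.
source = array("d", [0.0, 1.0, 2.0, 3.0, 4.0])

# memoryview() creates a view of the same memory; nothing is copied.
view = memoryview(source)

source[2] = 999.0      # change the original object ...
print(view[2])         # ... and the view sees the change: 999.0

view[0] = -1.0         # writing through the view ...
print(source[0])       # ... changes the original: -1.0

# The same memory can also be reinterpreted as raw bytes, still without copying.
raw = view.cast("B")
print(raw.nbytes)      # 5 items * 8 bytes = 40
```

Both objects refer to one memory area, so a change made through either of them is visible through the other.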

As a side note, the Python Buffer Protocol is unfortunately incomplete when it comes to data storage in device memory. PEP 3118 lacks a device concept, which makes it almost impossible to use existing array storage implementations to hold memory pointers of such devices. This has resulted in the emergence of a number of new array libraries specifically designed for holding pointers to device memory. However, reimplementing the array storage objects from scratch for each different device does not scale well, because the only essential restriction concerns the interpretation of a memory pointer: whether or not the pointer value can be dereferenced in a (host or device) process to use the data. The rest of the array object implementation would remain the same. Instead, the Python Buffer Protocol should be extended with a device concept. Hopefully we'll see it happen in the future. Meanwhile...

The aim of this project is to establish a connection between different data storage object implementations while avoiding copying the data in host or device memory. The following packages are supported:

| Package | Tested versions | Storage on host | Storage on CUDA device |
|---|---|---|---|
| numpy | 1.16.1 | ndarray | N/A |
| pandas | 0.24.1 | Series | N/A |
| pyarrow | 0.12.1.dev120+g7f9... | Array | CudaBuffer |
| xnd | 0.2.0dev3 | xnd | xnd |
| numba | 0.41.0 | N/A | DeviceNDArray |
| cupy | 5.2.0 | N/A | ndarray, cuda.MemoryPointer |
| cudf | 0.6-branch | N/A | Series |

Basic usage

To use the arrayviews package for host memory, import the needed data storage support modules, for instance,

from arrayviews import (
  numpy_ndarray_as,
  pandas_series_as,
  pyarrow_array_as,
  xnd_xnd_as
  )

For CUDA-based device memory, one can use the following import statement:

from arrayviews.cuda import (
  cupy_ndarray_as,
  numba_cuda_DeviceNDArray_as,
  pyarrow_cuda_buffer_as,
  xnd_xnd_cuda_as,
  cudf_Series_as,
  )
...

The general pattern of creating a specific view of another storage object is:

data_view = <data storage object>_as.<view data storage object>(data)

For example,

>>> import numpy as np
>>> import pyarrow as pa
>>> from arrayviews import numpy_ndarray_as
>>> np_arr = np.arange(5)
>>> pa_arr = numpy_ndarray_as.pyarrow_array(np_arr)
>>> print(pa_arr)
[
  0,
  1,
  2,
  3,
  4
]
>>> np_arr[2] = 999    # change numpy array
>>> print(pa_arr)
[
  0,
  1,
  999,
  3,
  4
]

Supported array views - host memory

The following table summarizes the support for creating a specific array view (top row) of a given array storage object (left-hand column).

| Objects \ Views | numpy.ndarray | pandas.Series | pyarrow.Array | xnd.xnd |
|---|---|---|---|---|
| numpy.ndarray | | OPTIMAL, FULL | GENBITMAP, FULL | OPTIMAL, PARTIAL |
| pandas.Series | OPTIMAL, FULL | | GENBITMAP, FULL | OPTIMAL, PARTIAL |
| pyarrow.Array | OPTIMAL, PARTIAL | OPTIMAL, PARTIAL | | OPTIMAL, PARTIAL |
| xnd.xnd | OPTIMAL, PARTIAL | OPTIMAL, PARTIAL | OPTIMAL, PARTIAL | |

Comments

  1. In numpy.ndarray and pandas.Series, the numpy.nan value is interpreted as null value.
  2. OPTIMAL means that view creation does not require processing of array data
  3. GENBITMAP means that view creation requires processing of array data in the presence of null or nan values. By default, such processing is disabled.
  4. FULL means that view creation supports the inputs with null values.
  5. PARTIAL means that view creation does not support the inputs with null values.
  6. For the implementation of view constructions, hover over a table cell or click on the links to the arrayviews package source code.
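To illustrate why GENBITMAP entries need to process the array data: Arrow-style arrays record nulls in a separate validity bitmap with one bit per element, least-significant bit first. Producing that bitmap from nan values means scanning every element. The sketch below is a plain-Python illustration of that scan, not the arrayviews implementation; `validity_bitmap` is a hypothetical helper name.

```python
import math

def validity_bitmap(values):
    """Return a bitmap with bit i set when values[i] is not nan.

    Bits are numbered least-significant first within each byte,
    matching the Arrow validity bitmap layout.
    """
    bitmap = bytearray((len(values) + 7) // 8)   # one bit per element, padded to bytes
    for i, value in enumerate(values):
        if not math.isnan(value):
            bitmap[i // 8] |= 1 << (i % 8)       # mark element i as valid
    return bytes(bitmap)

data = [1.0, float("nan"), 3.0]
bitmap = validity_bitmap(data)
print(bin(bitmap[0]))   # 0b101: elements 0 and 2 are valid, element 1 is null
```

The cost of this pass grows with the array length, which is why views that need it cannot be OPTIMAL.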

Benchmark: creating array views - host memory

| Objects \ Views | numpy.ndarray | pandas.Series | pyarrow.Array | xnd.xnd |
|---|---|---|---|---|
| numpy.ndarray | 0.99(0.98) | 304.57(304.49) | 54.38(54.58) | 14.97(14.93) |
| pandas.Series | 29.86(29.68) | 1.01(1.0) | 110.25(110.86) | 48.47(48.37) |
| pyarrow.Array | 17.61(N/A) | 350.51(N/A) | 1.0(1.0) | 25.71(N/A) |
| xnd.xnd | 14.22(N/A) | 331.47(N/A) | 80.88(N/A) | 1.0(1.0) |

Comments

  1. The numbers in the table are <elapsed time to create a view of an obj>/<elapsed time to call 'def dummy(obj): return obj'>.
  2. Results in parentheses correspond to objects with nulls or nans. No attempt is made to convert nans to nulls.
  3. Test arrays are 64-bit float arrays of size 51200.
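The ratio described in comment 1 could be measured along the following lines. This is a sketch using the standard `timeit` module, not the project's actual benchmark script; `relative_cost` is a hypothetical helper, and `memoryview` over a `bytearray` stands in for a real view constructor.

```python
import timeit

def dummy(obj):
    return obj

def relative_cost(make_view, obj, number=10000, repeat=5):
    """Time of make_view(obj) divided by time of dummy(obj)."""
    # Take the best of several repeats to reduce noise from other processes.
    t_view = min(timeit.repeat(lambda: make_view(obj), number=number, repeat=repeat))
    t_dummy = min(timeit.repeat(lambda: dummy(obj), number=number, repeat=repeat))
    return t_view / t_dummy

# Stand-in data: a buffer of 51200 bytes, viewed through memoryview.
data = bytearray(51200)
ratio = relative_cost(memoryview, data)
print(f"view creation costs {ratio:.2f}x a no-op call")
```

A ratio close to 1 means that creating the view costs about as much as an empty function call.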

Supported array views - CUDA device memory

| Objects \ Views | pyarrow CudaBuffer | numba DeviceNDArray | cupy.ndarray | cupy MemoryPointer | xnd.xnd CUDA | cudf Series |
|---|---|---|---|---|---|---|
| pyarrow CudaBuffer | | OPTIMAL, FULL | OPTIMAL, FULL | OPTIMAL, FULL | OPTIMAL, FULL | NOT IMPL |
| numba DeviceNDArray | OPTIMAL, FULL | | OPTIMAL, FULL | OPTIMAL, FULL | DERIVED, FULL | OPTIMAL, FULL |
| cupy.ndarray | OPTIMAL, FULL | OPTIMAL, FULL | | OPTIMAL, FULL | NOT IMPL | NOT IMPL |
| cupy MemoryPointer | NOT IMPL | NOT IMPL | NOT IMPL | | NOT IMPL | NOT IMPL |
| xnd.xnd CUDA | OPTIMAL, FULL | DERIVED, FULL | OPTIMAL, FULL | OPTIMAL, FULL | | NOT IMPL |
| cudf Series | NOT IMPL | OPTIMAL, FULL | NOT IMPL | NOT IMPL | NOT IMPL | |

Benchmark: creating array views - CUDA device memory

| Objects \ Views | pyarrow CudaBuffer | numba DeviceNDArray | cupy.ndarray | cupy MemoryPointer | xnd.xnd CUDA | cudf Series |
|---|---|---|---|---|---|---|
| pyarrow CudaBuffer | 0.99 | 381.57 | 25.89 | 14.32 | 22.34 | NOT IMPL |
| numba DeviceNDArray | 53.68 | 0.99 | 37.55 | 22.68 | 92.66 | 154.96 |
| cupy.ndarray | 34.96 | 356.44 | 1.0 | 1.17 | NOT IMPL | NOT IMPL |
| cupy MemoryPointer | NOT IMPL | NOT IMPL | NOT IMPL | 1.0 | NOT IMPL | NOT IMPL |
| xnd.xnd CUDA | 45.92 | 452.21 | 42.64 | 29.79 | 0.99 | NOT IMPL |
| cudf Series | NOT IMPL | 3.39 | NOT IMPL | NOT IMPL | NOT IMPL | 0.99 |

Comments

  1. Test arrays are 8-bit unsigned integer arrays of size 51200.