What is the use case of xnd over Apache Arrow? #6
Primarily any container that is described by datashape which is more
general than arrow --- as far as I am aware.
We will focus initially on tensor use-cases and look to be compatible in
areas of overlap.
Xnd is a general purpose meta-container more than a single container.
Travis
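[Editorial aside: a minimal sketch of what "described by datashape" can look like in practice, using the ndtypes Python bindings (assuming the `ndtypes` package is installed). The specific types are illustrative examples, not a claim about either library's full scope.]

```python
# Sketch of datashape type descriptions via ndtypes (illustrative only).
from ndtypes import ndt

# Fixed-dimension tensor: something NumPy also covers.
t1 = ndt("2 * 3 * float64")

# Variable (ragged) dimensions, n-dimensional.
t2 = ndt("var * var * float64")

# A record of arrays: one way to model columnar data.
t3 = ndt("{session_id: 4 * int64, source_ip: 4 * string}")

# A symbolic (abstract) type with dimension variables, usable in signatures.
t4 = ndt("N * M * float64")

for t in (t1, t2, t3, t4):
    print(t)
```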
On Mar 7, 2018 5:19 PM, "Naveen Michaud-Agrawal" wrote:
Hello, was wondering what you plan on doing with xnd that isn't well
supported by the arrow format. A github issue is probably not the best
place for this discussion, but I couldn't find a mailing list. Thanks.
Ah ok, it seems most of the underlying types and structured types (lists, structs, ragged hierarchies) are already well supported in Arrow. Anyway, looking at the docs, it should be pretty cheap to convert from one format to another through memory copying.
Are they? I thought Arrow was limited to 32-bit sizes and offsets. Also, the types that xnd uses (ndtypes descriptors) are more general.
That said, we might translate Arrow to ndtypes in the areas of overlap.
This isn't quite true, see https://issues.apache.org/jira/browse/ARROW-750 -- support for very large variable-length collections is something we will eventually need to add to the format whenever there is demand for it. In general, datasets will not be expected to be in a contiguous columnar memory block, but instead split across a collection of smaller chunks. We have discussed the 32- vs 64-bit issue for encoding collection lengths, and the consensus has been that it is not worth the extra 4 bytes of overhead per value when the "large collection" case represents a very small percentage of use cases.
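[Editorial aside: to make the overhead trade-off concrete, a back-of-the-envelope sketch comparing int32 and int64 offset buffers for a variable-length column. The numbers are illustrative, not from the thread.]

```python
# Rough arithmetic for offset-buffer overhead, assuming one offset per value
# plus one terminal offset, as in Arrow's variable-length list layout.
def offset_buffer_bytes(num_values, offset_width_bytes):
    return (num_values + 1) * offset_width_bytes

n = 100_000_000  # 100 million variable-length values
print(f"int32 offsets: {offset_buffer_bytes(n, 4) / 1e9:.1f} GB")  # ~0.4 GB
print(f"int64 offsets: {offset_buffer_bytes(n, 8) / 1e9:.1f} GB")  # ~0.8 GB

# int32 offsets cap each contiguous chunk at 2**31 - 1 entries, which is one
# reason large datasets are split across a collection of smaller chunks.
```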
Additionally, we changed 1-dimensional array sizes to use int64 almost a year ago: apache/arrow@ced9d76#diff-520b20e87eb508faa3cc7aa9855030d7
On Wed, Mar 07, 2018 at 07:49:25PM +0000, Wes McKinney wrote:
Additionally, we changed 1-dimensional array sizes to use int64 almost a year ago: apache/arrow@ced9d76#diff-520b20e87eb508faa3cc7aa9855030d7
Thanks, I think I looked at the Arrow layout shortly before that date. We still need n-dimensional large fixed arrays though.
BTW, our ragged array type also uses int32_t, and is Arrow compatible in the offset and bitmap layout. I agree that it makes sense to use int32_t for the offsets, since the offset arrays themselves can get quite large.
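[Editorial aside: a small sketch with pyarrow (assuming it is installed; exact buffer ordering is a version-dependent Arrow detail) that inspects the offsets-plus-validity-bitmap layout being discussed.]

```python
# Sketch: inspect the offset / validity-bitmap layout of an Arrow list array.
import pyarrow as pa

arr = pa.array([[1, 2], None, [3, 4, 5]])  # list<int64> with one null entry
print(arr.type)        # list<item: int64>
print(arr.null_count)  # 1

# buffers() returns the validity bitmap and the offsets buffer, followed by
# the child (values) buffers.
for i, buf in enumerate(arr.buffers()):
    print(i, None if buf is None else buf.size)
```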
Makes sense. In our experience, tensors are a different beast and use case from structured columnar data, so we are handling ndarrays / tensors with metadata separately from 1D record batches: https://github.com/apache/arrow/blob/master/format/Tensor.fbs#L35. These use 64-bit shapes and strides. This is used actively by the Ray project.
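[Editorial aside: a minimal sketch of the Tensor metadata path mentioned above, using pyarrow and NumPy (assuming both are installed).]

```python
# Sketch: wrap an ndarray as an Arrow Tensor (zero-copy over the NumPy buffer).
import numpy as np
import pyarrow as pa

nd = np.arange(12, dtype=np.float64).reshape(3, 4)
tensor = pa.Tensor.from_numpy(nd)

print(tensor.shape)    # (3, 4)  -- 64-bit shape
print(tensor.strides)  # byte strides, also 64-bit
print(tensor.type)     # double
```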
Ideally we can convert without memory copying. I have not seen that Arrow is general enough.
Ndtypes should support an Arrow memory container. Arrow does not support general ndtypes descriptors.
One simplistic analogy: Arrow is a generalization of pandas. Xnd is a generalization of NumPy.
Arrow could use libxnd for functionality, and the two should be complementary. There is overlap but the tradeoffs are quite different.
Also, if we construct libgumath correctly, Arrow could use it for chaining graphs of functionality -- or at least the two could be used together.
I am happy to be shown how the vision of xnd could be accomplished with just Arrow. But Stefan and I don't see that yet.
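[Editorial aside: a rough sketch of what applying and chaining gumath kernels over xnd containers can look like from Python (assuming the `xnd` and `gumath` packages are installed; the specific kernel shown is just an example).]

```python
# Sketch: dispatch gumath kernels over an xnd container and chain the results.
from xnd import xnd
import gumath.functions as fn

x = xnd([[1.0, 2.0], [3.0, 4.0]])

y = fn.sin(x)   # kernel dispatched over the typed memory
z = fn.sin(y)   # chain another kernel on the result
print(z.type)
```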
Agreed. We should look for opportunities to share code and infrastructure where possible. Note that the Arrow columnar format is but one type of data structure that we support -- it's a very important one for databases, Spark, pandas, etc. In order to implement zero-overhead memory sharing for structured datasets, many lower levels of platform tooling must be created. I want to make sure we don't miss out on collaboration opportunities just because we haven't agreed on a "universal" data structure. The Arrow columnar format was never intended to be a universal data structure.
[Repost because of broken markdown in email replies.]

I agree it would be nice to have a standard low-level data structure. For C, ndtypes is pretty standard: it describes all basic C types (including nested types and pointer types) using a regular algebraic data type. One could use it, e.g., for the type part in "Modern Compiler Implementation in C" (Appel et al.) without changes. The tagged union convention is also the same as in the quoted book, and incidentally also the same as in Python's own compiler (whose author probably also read Appel, given that he used ASDL to describe the AST :).

I think columnar data can be modeled in ndtypes as a record of arrays. The example from the Arrow home page:

```python
>>> data = {'session_id': [1331247700, 1331247702, 1331247709, 1331247799],
...         'timestamp': [1515529735.4895875, 1515529746.2128427, 1515529756.4485607, 1515529766.2181058],
...         'source_ip': ['8.8.8.100', '100.2.0.11', '99.101.22.222', '12.100.111.200']}
>>> x = xnd(data)
>>> x.type
ndt("{session_id : 4 * int64, timestamp : 4 * float64, source_ip : 4 * string}")
```

There is categorical data, the representation of which is an array of indices into the categories:

```python
>>> levels = ['January', 'August', 'December', None]
>>> x = xnd(['January', 'January', None, 'December', 'August', 'December', 'December'], levels=levels)
>>> x.value
['January', 'January', None, 'December', 'August', 'December', 'December']
>>> x.type
ndt("7 * categorical('January', 'August', 'December', NA)")
```

There are nested tuples, which are more general than ragged arrays:

```python
>>> unbalanced_tree = (((1.0, 2.0), (3.0)), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x = xnd(unbalanced_tree)
>>> x.value
(((1.0, 2.0), 3.0), 4.0, ((5.0, 6.0, 7.0), ()))
>>> x.type
ndt("(((float64, float64), float64), float64, ((float64, float64, float64), ()))")
>>>
>>> x[0]
xnd(((1.0, 2.0), 3.0), type="((float64, float64), float64)")
>>> x[0][0]
xnd((1.0, 2.0), type="(float64, float64)")
```

In general, xnd just takes any basic Python value -- nested or not -- and unpacks it into typed memory.
I am skeptical about the idea of an all-powerful / can-describe-anything data structure. With generalization comes added complexity for computational frameworks and producers/consumers.

@teoliphant stated "I have not seen that arrow is general enough." What does this mean? At this point, the Arrow columnar format is only one part of a much larger project. I think this means "the Arrow columnar format is not a universal data structure", which I agree with, but that was never the goal.

I see the work here in libndtypes / xnd as complementary and not in conflict -- there are problems being solved (extending the notion of NumPy's structured dtypes to support things like variable-length cells and pointers) that were never in scope for Arrow's columnar format. The columnar format was the focus of the project at the outset because that was the most immediate and high-value problem to solve around data interoperability and in-memory analytics. The rapid uptake of the project and developer community growth suggests we made a good bet on this.

At this point Arrow is a multi-layered project covering memory management, shared memory (Plasma), metadata serialization, IO, streaming messaging, memory formats (including the columnar format), file format interop, computation kernels, etc.

The work that is being done here could even become an additional component of Apache Arrow if you wanted to work with a larger developer community. At a minimum, it would be helpful to have a broader design/architecture discussion about problems and use cases in a public venue.
As demonstrated in ArrayViews, one can wrap Arrow arrays with xnd, and vice versa, without memory copying. However, the wrapping currently does not support null buffers (Arrow) or bitmaps (xnd), because xnd does not expose bitmaps.
@pearu the cases where the memory is compatible IMHO reflect a minority (and a small minority at that) of real-world uses of Arrow. To suggest "compatible, with some exceptions" will mislead people.
@wesm, I am not sure I follow your comment's meaning. If you are referring to the fact that xnd does not expose bitmaps, that can easily be fixed, since the xnd bitmap is compatible with the Arrow null buffer. I guess the reason for not exposing xnd bitmaps is that they are considered an internal structure, whereas in Arrow the null buffer is not.
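[Editorial aside: for context, both layouts under discussion use the same least-significant-bit-first validity convention (1 = valid, 0 = null), per the compatibility noted above. A tiny sketch of reading such a bitmap, independent of either library:]

```python
# Sketch: reading an Arrow-style validity bitmap (LSB-first, 1 = valid).
def is_valid(bitmap: bytes, i: int) -> bool:
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

# Indices 0, 2, 3 valid; index 1 null -> byte 0b00001101.
bitmap = bytes([0b00001101])
print([is_valid(bitmap, i) for i in range(4)])  # [True, False, True, True]
```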
You stated that one can wrap Arrow arrays with xnd, and vice versa, without memory copying.
I think it's worth making a list of different Arrow use cases:
Do you support them all and export all of their semantics in xnd? If the answer is "no", then I think you need to qualify the statement to say "In certain limited cases, one can wrap Arrow arrays [and expose their semantics] without memory copying".