Questions regarding using Lance for OLAP workloads #3252

manhld0206 · 2024-12-16T05:21:44Z


Hello everyone. I recently had the opportunity to evaluate Lance as a data format for building a data lake for both ML and OLAP workloads. I love that Lance has much better support than Parquet for both fast random access and huge columns!

From that experiment I have some questions about using Lance for OLAP workloads and would love to hear your thoughts on them 🙏

1. Lance doesn't have any notion of partitioning or clustering (for example Hive partitioning, or data layout optimizations like Z-ordering in Delta Lake). I assume that we can achieve the same optimization with secondary indexes. Is my assumption correct?
2. To query Lance data with a query engine like DataFusion or DuckDB, I think for now I need to load all the data into PyArrow format first. I wonder what the better way of querying Lance data will be in the future? Right now I can think of two ways, but I lack the technical understanding to know which is easier:
   - Converting Lance to a PyArrow dataset (like delta-rs does) so the query engine can do data skipping at the storage level. The beauty is that many query engines already know how to query a PyArrow dataset, so the transition is clear.
   - Creating a dedicated table provider extension so that the query engine can read Lance data more efficiently.

I'm looking forward to the answer and thank you all for your work!

Answered by wjones127

wjones127 · 2024-12-16T19:22:12Z

Maintainer

> Lance doesn't have any notion of partitioning or clustering (for example hive partitioning or data layout optimization like Z-ordering in Delta Lake). I assume that we can achieve the same optimization with secondary indexes. Is my assumption correct?

Yes. We've previously considered partitioning or clustering, but likely won't implement that. For fast filter performance, secondary indices are generally better.
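To illustrate the reasoning, here is a toy model in plain Python of why a secondary index can speed up filters without any partitioning or clustering of the data itself. It is only a sketch: the sorted `(value, row_id)` list stands in for a B-tree-style scalar index, the names `rows`, `lookup_range`, and `full_scan` are made up for this example, and none of it reflects how Lance implements its indices internally.

```python
import bisect
import random

# Rows stored in arbitrary (unclustered) order, as in a table with no partitioning.
random.seed(0)
rows = [{"row_id": i, "user_id": random.randrange(10_000)} for i in range(50_000)]

# Secondary index: (value, row_id) pairs sorted by value, kept separate from the data.
index = sorted((r["user_id"], r["row_id"]) for r in rows)
keys = [k for k, _ in index]


def lookup_range(lo, hi):
    """Return row ids with lo <= user_id < hi via binary search on the index."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_left(keys, hi)
    return sorted(row_id for _, row_id in index[start:end])


def full_scan(lo, hi):
    """Baseline without an index: inspect every row."""
    return [r["row_id"] for r in rows if lo <= r["user_id"] < hi]


# Both strategies find the same rows; the index touches only the matching entries.
assert lookup_range(100, 110) == full_scan(100, 110)
```

The index lookup costs two binary searches plus the size of the result, regardless of how the rows are physically laid out, whereas partitioning or Z-ordering only helps filters on the columns the layout was chosen for.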

> I wonder what is the better way in the future for querying lance data?

I think having a proper table provider is probably the best path forward. We've started some of that work recently, but I think there's more work to do for improved pushdown.
