map_partitions (almost) only uses single core #319

Open
sehHeiden opened this issue Nov 30, 2024 · 0 comments

I have two GeoDataFrames and want to compute a difference overlay between them.
Because GeoPandas uses a spatial index internally, I partition the larger GeoDataFrame and use map_partitions to run the overlay on several cores.

I once achieved this for an intersection overlay (see the github-project code).

On my current dataset, I use:

The left (larger) GeoDataFrame has 944,420 rows.
The right (smaller) GeoDataFrame has 265,691 rows.

From within my functions I call fast_difference:

import logging

import dask_geopandas as d_gpd
import geopandas as gpd

def difference_partitions(part: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Helper function to calculate the difference overlay with dask."""
    return gpd.overlay(part, right, how="difference", keep_geom_type=True, make_valid=True)


def fast_difference(left: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """Execute the difference overlay with dask_geopandas."""
    left_dgdf = d_gpd.from_geopandas(left, npartitions=8)
    logging.info(left_dgdf.crs)
    return left_dgdf.map_partitions(difference_partitions, right).compute()

But it only runs on a single core. With the logging level set to DEBUG, I get:
pyproj - DEBUG - PROJ_ERROR: proj_create: unrecognized format / unknown name

at about the time I call the from_geopandas method. That might be the only hint I have. All vector files are in EPSG:25832 (verified via logging).

I tried 3, 4, 6, 8, and 16 partitions in both run and debug modes. I assume multiple cores are used to some extent: in btop4win I see many cores at 10 to 30 % and a single core reaching up to 50 %, but I seldom see spikes in the overall load.
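To compare the partition counts by wall-clock time rather than by watching the load in btop4win, a small timing helper could look like the sketch below. It uses only the standard library; time_partition_counts and its run argument are hypothetical names, and run is assumed to be any callable that takes the number of partitions and performs the overlay.

```python
import time
from typing import Callable, Dict, Iterable


def time_partition_counts(
    run: Callable[[int], object],
    partition_counts: Iterable[int],
) -> Dict[int, float]:
    """Call run(n) once per partition count and return the elapsed seconds per count."""
    timings: Dict[int, float] = {}
    for n in partition_counts:
        start = time.perf_counter()
        run(n)  # result is discarded; only the elapsed time is recorded
        timings[n] = time.perf_counter() - start
    return timings
```

With a version of fast_difference that accepts npartitions as a parameter (hypothetical; the function in the post hard-codes 8), this could be called as `time_partition_counts(lambda n: fast_difference(left, right, npartitions=n), [3, 4, 6, 8, 16])`. If the times barely change with the partition count, that would be consistent with the work not actually running in parallel.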
