I have two GeoDataFrames and want to compute a difference overlay between them.
Because GeoPandas uses an index internally, I am trying to use map_partitions to partition the larger GeoDataFrame and run the overlay on several cores.
Left larger GeoDataFrame has 944'420 rows.
Right smaller GeoDataFrame has 265'691 rows.
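To make the expected workload per partition concrete, here is a rough sketch in plain Python. It assumes the rows are split into roughly equal contiguous chunks, which may not be exactly how dask_geopandas partitions the data; `partition_sizes` is a hypothetical helper, not part of any library.

```python
def partition_sizes(n_rows: int, n_partitions: int) -> list[int]:
    """Sizes of roughly equal contiguous partitions (illustrative assumption)."""
    base, extra = divmod(n_rows, n_partitions)
    # The first `extra` partitions receive one additional row.
    return [base + 1 if i < extra else base for i in range(n_partitions)]


LEFT_ROWS = 944_420   # larger GeoDataFrame, split into partitions
RIGHT_ROWS = 265_691  # smaller GeoDataFrame, passed whole to every partition

sizes = partition_sizes(LEFT_ROWS, 8)
print(sizes)
print(sum(sizes) == LEFT_ROWS)
```

Under that assumption, each of the 8 partitions holds about 118,000 rows of the left GeoDataFrame, and each one is overlaid against all 265'691 rows of the right one.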
From within my functions I call fast_difference:
    import logging

    import dask_geopandas as d_gpd
    import geopandas as gpd


    def difference_partitions(part: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
        """Helper function to calculate the difference overlay with dask."""
        return gpd.overlay(part, right, how="difference", keep_geom_type=True, make_valid=True)


    def fast_difference(left: gpd.GeoDataFrame, right: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
        """Execute the difference overlay with dask_geopandas."""
        # Split the larger GeoDataFrame into partitions; `right` is passed whole to each one.
        left_dgdf = d_gpd.from_geopandas(left, npartitions=8)
        logging.info(left_dgdf.crs)
        return left_dgdf.map_partitions(difference_partitions, right).compute()
But it only seems to run on a single core. I set the logging level to DEBUG, and with that I get:
pyproj - DEBUG - PROJ_ERROR: proj_create: unrecognized format / unknown name
at about the time I call the from_geopandas method. That might be the only hint I got. All vector files are in EPSG:25832 (tested via logging).
I tried 3, 4, 6, 8 and 16 partitions in run and debug modes. Judging from btop4win, I assume multiple cores are used to some degree: many cores reach 10 to 30 %, and a single core reaches up to 50 %, but I seldom see spikes in the overall load.
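One way to check whether partitions are really handled by different workers, instead of reading it off the CPU monitor, is to record the worker identity inside the per-partition function. The following is only a sketch: it uses a standard-library thread pool as a stand-in for the dask scheduler, and `process_partition` and `run_partitions` are hypothetical names. In the real code, the same `os.getpid()` / `threading.get_ident()` logging would go into difference_partitions.

```python
import os
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition_id: int) -> tuple[int, int, int]:
    """Stand-in for difference_partitions: report which worker ran it."""
    time.sleep(0.05)  # placeholder for the real overlay work
    return partition_id, os.getpid(), threading.get_ident()


def run_partitions(n_partitions: int, max_workers: int) -> list[tuple[int, int, int]]:
    """Run all partitions on a thread pool and collect the worker identities."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_partition, range(n_partitions)))


results = run_partitions(n_partitions=8, max_workers=4)
distinct_threads = {thread_id for _, _, thread_id in results}
print(f"{len(results)} partitions ran on {len(distinct_threads)} thread(s)")
```

If every partition reports the same process and thread, the work is effectively serial. Several threads within one process would point to thread-based parallelism, where Python-level work can still be limited by the GIL.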
I once achieved this with an intersection overlay (github-project code).