In the near future, we might be faced with RAMP problems whose target dimension is too big to be handled by the existing workflow without making the database explode. A simple example is an image-to-image workflow. These problems need a huge training / testing sample, making each prediction equally big (on the order of a few GB), while the current total database size is 100 GB.
This leaves us with two options:
1. modify the database model and migrate it,
2. find a smart way of storing and scoring the predictions for these specific problems.
We would like to avoid option 1 for now if possible, so here is our take on option 2.
Since the target is a pixel-by-pixel prediction, we would sample the prediction, e.g., taking a random sub-grid of pixels on which to compute the score. To avoid cheating, we would use a different random sub-grid for the public and the backend datasets.
Practically, this would mean creating a specific SamplingScore class which uses a hash of the input dataset as a seed to generate the scoring grid, and then passes the grid to the scoring method in y_pred.
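To make the idea concrete, here is a minimal sketch of what such a class could look like. This is not an implementation proposal, just an illustration: the constructor arguments (`base_score`, `n_samples`) and the way the grid is applied are assumptions, and in the real workflow the grid would presumably travel inside `y_pred` rather than being applied here.

```python
import hashlib

import numpy as np


class SamplingScore:
    """Hypothetical sketch: score predictions on a random sub-grid of pixels,
    seeded by a hash of the input dataset so that the public and backend
    datasets get different grids."""

    def __init__(self, base_score, n_samples=10_000):
        # base_score: any scalar score function f(y_true, y_pred); assumed here.
        self.base_score = base_score
        self.n_samples = n_samples

    def _grid(self, X, n_pixels):
        # Derive a deterministic seed from the dataset contents: same dataset
        # -> same grid, different dataset (public vs. backend) -> different grid.
        digest = hashlib.sha256(np.ascontiguousarray(X).tobytes()).digest()
        seed = int.from_bytes(digest[:4], "little")
        rng = np.random.default_rng(seed)
        # Random sub-grid: indices of the pixels kept for scoring.
        size = min(self.n_samples, n_pixels)
        return rng.choice(n_pixels, size=size, replace=False)

    def __call__(self, X, y_true, y_pred):
        n = len(y_true)
        yt = y_true.reshape(n, -1)
        yp = y_pred.reshape(n, -1)
        idx = self._grid(X, yt.shape[1])
        # Only the sampled pixels enter the score, so only a small fraction
        # of each multi-GB prediction actually needs to be compared.
        return self.base_score(yt[:, idx], yp[:, idx])
```

For example, wrapping a pixel-wise RMSE in `SamplingScore(rmse, n_samples=1000)` would evaluate submissions on 1000 hashed-out pixels instead of the full image, while keeping the grid reproducible per dataset.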
This is a summary of a discussion we just had with @kegl, on which we'd like to have comments, opinions, and ideas from @jorisvandenbossche @glemaitre @agramfort.