
Store path provenance tracking #11749

Draft · wants to merge 4 commits into master

Conversation

edolstra
Member

Motivation

Nix has historically been bad at answering the question "where did this store path come from?", i.e. at providing traceability from a store path back to the Nix expression from which it was built. Nix tracks the "deriver" of a store path (the .drv file that built it), but that's pretty useless in practice, since it doesn't link back to the Nix expressions.

So this PR adds a "provenance" field (a JSON object) to the ValidPaths table and to .narinfo files that describes where the store path came from and how it can be reproduced.

There are currently 3 types of provenance:

  • copied: Records that the store path was copied or substituted from another store (typically a binary cache). Its "from" field is the URL of the origin store. Its "provenance" field propagates the provenance of the store path on the origin store.

  • derivation: Records that the store path is the output of a .drv file. This is equivalent to the "deriver" field, but it has a nested "provenance" field that records how the .drv file was created.

  • flake: Records that the store path was created during the evaluation of a flake output.

Example:

$ nix path-info --json /nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0
{
  "/nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0": {
    "provenance": {
      "from": "https://cache.example.org/",
      "provenance": {
        "drv": "rlabxgjx88bavjkc694v1bqbwslwivxs-patchelf-0.18.0.drv",
        "output": "out",
        "provenance": {
          "flake": {
            "lastModified": 1729856604,
            "narHash": "sha256-obmE2ZI9sTPXczzGMerwQX4SALF+ABL9J0oB371yvZE=",
            "owner": "NixOS",
            "repo": "patchelf",
            "rev": "689f19e499caee8e5c3d387008bbd4ed7f8dc3a9",
            "type": "github",
          },
          "output": "packages.x86_64-linux.default",
          "type": "flake"
        },
        "type": "derivation"
       },
       "type": "copied"
    },
    ...
  }
}

This specifies that the store path was copied from the binary cache https://cache.example.org/ and that it's the "out" output of a store derivation that was produced by evaluating the flake output packages.x86_64-linux.default of some revision of the patchelf GitHub repository.
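
For instance (a sketch, assuming jq is available and the three-level provenance chain shown above; the nesting depth depends on the actual chain), the originating flake can be pulled out of the path-info output:

$ nix path-info --json /nix/store/xcqzb13bd60zmfw6wv0z4242b9mfw042-patchelf-0.18.0 \
    | jq '.[].provenance.provenance.provenance.flake'

This would print the "flake" attributes (owner, repo, rev, …) from the example above.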

Depends on #11668.

Context

Priorities and Process

Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions bot added the labels store (Issues and pull requests concerning the Nix store) and fetching (Networking with the outside (non-Nix) world, input locking) on Oct 25, 2024
@edolstra marked this pull request as draft on October 25, 2024 12:53
@edolstra force-pushed the provenance branch 5 times, most recently from f2b796f to 31d1d7e, on October 26, 2024 15:49
@github-actions bot added the label with-tests (Issues related to testing. PRs with tests have some priority) on Oct 26, 2024
@johnrichardrinehart

johnrichardrinehart commented Oct 26, 2024

This looks like a cool idea. How does it help me determine which expression (which line of which file) in the checkout of some repository defines the .drv?

Like, you implied this would support tracking the store path back to the expression. And, in the flake case I guess someone could make an argument that that's good enough. But, what about in the case of an ad-hoc derivation floating around on my filesystem that I realise with nix-build and which gets post-build-hooked to a substituter? Seems like the provenance might be hard in that case? I should play around with this because I'll probably be able to answer my own questions.

@edolstra
Member Author

How does it help me determine which expression (which line of which file) in the checkout of some repository defines the .drv?

It doesn't currently, since that information wouldn't be enough to reproduce the store derivation (i.e. a package function in Nixpkgs requires arguments to be able to reproduce its output, not to mention stuff like overrides). But storing the top-level flake + flake output name that caused the store derivation to be created does allow the store derivation to be reproduced.
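
For instance, with the flake provenance recorded in the example above, reproducing the store derivation would amount to something like (a sketch, using the revision and output name from that example):

$ nix build github:NixOS/patchelf/689f19e499caee8e5c3d387008bbd4ed7f8dc3a9#packages.x86_64-linux.default

Since the flake reference is pinned to an exact revision, re-evaluating it should yield the same store derivation.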

But, what about in the case of an ad-hoc derivation floating around on my filesystem

The problem there is that evaluation of non-flake expressions is not hermetic, so we really do need something like flakes for provenance.

@roberth
Member

roberth commented Nov 6, 2024

not hermetic

It will be less likely that you can verify the provenance, but something could be recorded nonetheless.
Expressions written with "purity" in mind may actually verify just fine if, say, a git revision is stored when e.g. a default.nix is in a git repo.

@roberth
Member

roberth commented Nov 6, 2024

(I haven't read the whole diff yet, so apologies for questions I could have answered myself, but these will need to be documented anyway, so also you're welcome :) )

  • flake: Records that the store path was created during the evaluation of a flake output.

Many evaluations will produce the same paths. How do we deal with that? I suppose we only need a flake provenance for the outputs that are immediately in the flake outputs, and we can find provenance of the closure by following the referrers relation.
Denormalizing all this into the closure is too expensive.

Another solution is to only store the first provenance, but this is too arbitrary IMO, and can also be achieved with a "first referrer" field if we feel like storing all referrer edges is too expensive or impractical for "non-enumerating" stores like the binary cache stores.

Putting new appendable data into the stores, including the binary cache stores, is quite a step.

Do we really need this to be in the binary cache?

A lot of the value of this feature could instead be produced by a local database, since that's where evaluation and realisation ultimately happen anyway.
It's only when you're doing deployments with store-level-only operations like closure copying that you lose this info, but I think this is fine. Deployment targets don't need to know their evaluation provenance; only the machines that manage those targets really need to know.

Some questions

Things to be documented and/or implemented

  • How do we deal with the many-to-one relationship between evaluations and a product of those evaluations?
  • How does this work for ca-derivations realisations?
  • Documentation in the protocols section of the manual

struct ProvFlake
{
    std::shared_ptr<nlohmann::json> flake; // FIXME: change to Attrs
    std::string flakeOutput;
Member

Suggested change:
-    std::string flakeOutput;
+    std::vector<std::string> flakeOutput;

 * derivation input source) that was produced by the evaluation of
 * a flake.
 */
struct ProvFlake
Member

This is a layer violation. We could define something like struct ProvOther { std::string type; nlohmann::json value; } at the store layer and refine this in upper layers.

Member Author

I'm thinking about getting rid of all the Prov* types and just passing provenance around as a JSON value.

@edolstra
Member Author

edolstra commented Nov 6, 2024

It will be less likely that you can verify the provenance, but something could be recorded nonetheless.

Indeed provenance doesn't need to be hermetic or reproducible, so we could certainly have a provenance type for non-flake evaluations.

Many evaluations will produce the same paths. How do we deal with that?

The provenance is the evaluation that produced the store path, i.e. the first one. There can of course be many other evaluations that produce the same store path, but those are not the provenance for that particular store / binary cache. (The same applies to other types of provenance like substitution: a path can be substituted from many binary caches, but we only record the one we actually used.)

Recording other provenances makes the metadata for a store path potentially grow without bounds. And in the case of .narinfo files, we really don't want to update them after creation due to caching etc.

This is the same semantics as the deriver field BTW.

Do we really need this to be in the binary cache?

I think so, because without that you can't query the ultimate provenance of a store path in a binary cache like cache.nixos.org.

@mschwaig
Member

mschwaig commented Nov 7, 2024

I do not like this PR as a solution to the problem of provenance tracking. I think this approach is something that could be implemented in any ecosystem, while Nix is in the unique position that it could really do so much better.

On the origin of build outputs

Do we really need this to be in the binary cache?

I think so, because without that you can't query the ultimate provenance of a store path in a binary cache like cache.nixos.org.

The signatures we already use for transport security in binary caches provide this kind of information, because they are evidence of where you got an output, and I think it is a mistake to design a solution to this problem that just sidesteps them.
I did some work recently on how we could extend the existing signing scheme to be even more suitable for solving issues like this end to end, by making it possible for the signer to claim to be the builder (and generally attach other arbitrary metadata to the signature to make it attributable/verifiable). See my recent paper about this, or my talk about that work at NixCon 2024. #9644 is a bit out of date, but relevant, and #3023 would nicely complement that approach by making it possible to track local builds in the same way.

Extending the signing scheme would give us an actual cryptographic basis on which we can attribute the outputs of individual build steps to their actual builders, while this PR only propagates second-hand information down the chain with a kind of attribution that is not really trustworthy, because any link in the chain can just alter it.

Because signed information is attributable to the signer, it works much better across systems.

On attributing build outputs to derivations, and derivations to flakes

Attributing build outputs to derivations, and derivations to flakes, is a problem that can be solved better locally.
I see the fundamental data structure in Nix as mappings from derivation hashes to either store contents or NAR hashes.

derivation hash -> store content / NAR hash of output

This naturally makes it difficult to keep track of derivations themselves, because the derivation hash is computed from the derivation, and so you already have to know the derivation to look up anything.

Instead of attaching flake references to the destination of this mapping, we can view flakes as a higher-level mapping, which summarizes and tracks subtrees of build steps, including the derivations involved.
This mapping would be reflected in a new table in the DB, where we add new entries every time we evaluate a flake.

flake-url, commit hash, flake-output -> derivation hash of flake output

It would also be possible to record dependencies between flakes that way, by making a DB entry whenever we cross a boundary to another upstream flake while building or substituting.

Similarly, during evaluation we can record (kind of, but not really) the inverse of derivation hash -> NAR hash of output (which we have evidence for in the form of a signature) in another DB table, because we still have access to the derivation we are evaluating at that point:

derivation hash -> [ NAR hash of each input ] 

Based on all three of these relations, you can start at the NAR hash of any output and walk through its reverse dependencies until you hit a flake output, or continue walking until you find them all.
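
As a rough sketch (table and column names are invented for illustration; this is not part of Nix's actual database schema), the two proposed relations could be plain SQLite tables:

$ sqlite3 flake-provenance.db <<'EOF'
-- flake-url, commit hash, flake-output -> derivation hash of flake output
CREATE TABLE FlakeOutputs (
    flakeUrl    TEXT NOT NULL,
    commitHash  TEXT NOT NULL,
    flakeOutput TEXT NOT NULL,
    drvHash     TEXT NOT NULL
);
-- derivation hash -> NAR hash of each input (one row per input)
CREATE TABLE DrvInputs (
    drvHash      TEXT NOT NULL,
    inputNarHash TEXT NOT NULL
);
EOF

Walking backwards would then mean looking up a NAR hash in DrvInputs, finding the derivation(s) that consumed it, and repeating until a drvHash shows up in FlakeOutputs.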

One side benefit of attributing paths to flakes in any way would be that it makes the contents of the local store of a system less opaque.
This same information could be used to outline how much each version of each flake takes up in disk space exclusively and in total, and to prioritize flake versions with no commit hash in version control during GC.

EDIT: It might also be possible to use .drv files for this instead of the second proposed table, but I am not familiar with the lifecycle of those.


I did read through the code in this PR a few days ago. I hope that I have understood the gist of it correctly, and I hope this makes sense to you. In any case, I would really appreciate it if you could give me the benefit of the doubt and we could discuss this, and my work on issues like this, further somewhere.


dpulls bot commented Nov 20, 2024

🎉 All dependencies have been resolved!
