
Scaling and indexing #38

draggett commented May 9, 2021

This issue is about enabling large-scale cognitive databases and large rule sets. For now, I am focusing on RAM-based databases, but I also want to understand the design choices for secondary storage, which could be measured in terabytes.

Right now chunks and rules are implemented as an open-source JavaScript library with the following maps (see the sketch after the list):

  • From chunk ID to the corresponding chunk, where IDs uniquely identify chunks within a given chunk graph
  • From chunk type to the set of chunks with that type
  • From chunk type and non-literal property values to the set of chunks in which they appear
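
For concreteness, here is a minimal sketch of what those three maps might look like. The class and property names are illustrative rather than the library's actual API, and treating string-valued properties as non-literal references is a simplification:

```javascript
// Illustrative sketch only — names and structure are hypothetical,
// not the actual chunks library API.
class ChunkIndex {
  constructor () {
    this.chunkById = new Map();       // chunk ID -> chunk
    this.idsByType = new Map();       // chunk type -> array of chunk IDs
    this.idsByTypeValue = new Map();  // "type|value" -> array of chunk IDs
  }

  addToIndex (index, key, id) {
    if (!index.has(key))
      index.set(key, []);
    index.get(key).push(id);
  }

  add (chunk) {
    this.chunkById.set(chunk.id, chunk);
    this.addToIndex(this.idsByType, chunk.type, chunk.id);

    for (const name in chunk.properties) {
      const value = chunk.properties[name];
      if (typeof value === "string")  // treat as a non-literal (named) value
        this.addToIndex(this.idsByTypeValue, chunk.type + "|" + value, chunk.id);
    }
  }
}
```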

These indexes are just a start and need improvement. The latter two maps currently use simple arrays for the set of chunks, so retrieving chunks with a given type and given values for several properties means computing the intersection of multiple arrays, i.e. finding the IDs that occur in every one of them.

What are good algorithms for that? One idea is to use a JavaScript Map from chunk ID to an integer count: iterate through each array, incrementing the count for every ID you see, and whenever an ID's count reaches the number of arrays, push it onto the results. The map can then be discarded.
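
A minimal sketch of that counting approach, assuming each index lookup yields an array of chunk IDs with no duplicates within any one array:

```javascript
// Intersect several arrays of chunk IDs by counting how many arrays
// each ID appears in. Assumes no array contains duplicate IDs.
function intersectIds (arrays) {
  const counts = new Map();       // ID -> number of arrays it occurs in
  const results = [];
  for (const ids of arrays) {
    for (const id of ids) {
      const n = (counts.get(id) || 0) + 1;
      counts.set(id, n);
      if (n === arrays.length)    // seen in every array
        results.push(id);
    }
  }
  return results;                 // the map can now be discarded
}
```

One refinement would be to start from the smallest array and, for the remaining arrays, only count IDs that are already in the map, so the map never grows beyond the smallest candidate set.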

This relies on JavaScript's implementation of maps. Would Bloom filters be better for really large databases? They sometimes give false positives, but never false negatives, so you need to verify the candidates they return; in exchange, the storage required is small enough that the filter itself fits in RAM even when the underlying set is too large to hold there.
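
For illustration only, a toy Bloom filter along those lines; the hash function and the sizing defaults are placeholders rather than a tuned design:

```javascript
// Minimal Bloom filter sketch: k hash functions over an m-bit array.
// A real deployment would size m and k from the expected set size
// and the acceptable false-positive rate.
class BloomFilter {
  constructor (m = 1 << 20, k = 4) {
    this.m = m;
    this.k = k;
    this.bits = new Uint8Array(Math.ceil(m / 8));
  }

  // Simple FNV-1a style hash, salted per hash function — a placeholder.
  hash (value, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < value.length; ++i) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.m;
  }

  add (id) {
    for (let i = 0; i < this.k; ++i) {
      const bit = this.hash(id, i);
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }

  // true => possibly present (verify against the store); false => definitely absent
  mightContain (id) {
    for (let i = 0; i < this.k; ++i) {
      const bit = this.hash(id, i);
      if (!(this.bits[bit >> 3] & (1 << (bit & 7))))
        return false;
    }
    return true;
  }
}
```

One filter per index key (for example, per chunk type) could then rule out most lookups before touching secondary storage, at the cost of a verification step for the hits.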

Another challenge is how to efficiently find which rules match the chunks held in the cognitive module buffers. Charles Forgy’s Rete algorithm is promising. A related idea is to compile rule conditions into a discrimination network: whenever a module buffer is updated, the network updates the set of matching rules.
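
As a crude first level of discrimination, rules could be indexed by the module and chunk type that each condition tests, so that a buffer update only touches rules that could possibly match. The sketch below assumes a hypothetical rule representation with a conditions array; a full Rete-style network would go further and share tests on property values:

```javascript
// Index rules by the (module, chunk type) pairs their conditions test.
// This is a crude first level of discrimination, not a full Rete network.
class RuleIndex {
  constructor () {
    this.byModuleType = new Map();  // "module|type" -> Set of rules
  }

  addRule (rule) {
    for (const cond of rule.conditions) {
      const key = cond.module + "|" + cond.type;
      if (!this.byModuleType.has(key))
        this.byModuleType.set(key, new Set());
      this.byModuleType.get(key).add(rule);
    }
  }

  removeRule (rule) {
    for (const cond of rule.conditions) {
      const set = this.byModuleType.get(cond.module + "|" + cond.type);
      if (set) set.delete(rule);
    }
  }

  // Called whenever a module buffer is updated with a new chunk.
  candidates (module, chunk) {
    return this.byModuleType.get(module + "|" + chunk.type) || new Set();
  }
}
```

The candidates returned still need their full conditions checked against the buffers, including any variable bindings, but rules whose conditions never mention the updated module and type are skipped entirely.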

Forgy’s OPS5 rule language matched rules against the entire state of the database. In the cognitive rule language, rule conditions operate on module buffers that each hold a single chunk. There are only a few such modules, so in principle we should be able to scale to much larger databases than OPS5 could. One complication is the need to update the discrimination network when rules are added, removed or revised.

I think there are opportunities to improve performance using information about the expected utility of rules. This can be computed using reinforcement learning, following the work of John Anderson and colleagues at CMU.
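
For instance, ACT-R's utility learning nudges a rule's utility towards the reward it receives each time it fires. A sketch along those lines, with the learning rate and reward signal left as placeholders:

```javascript
// ACT-R style utility learning: move a rule's utility towards the
// reward received when it fires. alpha is the learning rate.
// Noise and the reward model itself are left out of this sketch.
function updateUtility (rule, reward, alpha = 0.2) {
  rule.utility = rule.utility + alpha * (reward - rule.utility);
}

// When several rules match, prefer the one with the highest utility.
// Assumes matchingRules is non-empty.
function selectRule (matchingRules) {
  return matchingRules.reduce(
    (best, rule) => (rule.utility > best.utility ? rule : best));
}
```

Rules that repeatedly lead to rewarded outcomes then tend to win conflict resolution.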

Any advice you have on indexing algorithms would be much appreciated.
