Use Prolog to improve LLM's reasoning
https://shchegrikovich.substack.com/p/use-prolog-to-improve-llms-reasoning
Prolog is one of the few languages that is simultaneously a standalone logical formalism and a standalone representation of computation (with caveats and exceptions, I know). So a Prolog program can stand in as a document of all the facts, rules and relations that a person or organization understands or declares to be true. Even if AI writes code for us, we should expect to have it presented and manipulated as a logical formalism.
Now if someone cares to argue that some other language/compiler is better at generating more performant code on certain architectures, then that person can declare their arguments in a logical formalism (Prolog) and we can use Prolog to translate between language representations, compile, optimize, etc.
Also, the experimental method has some flaws. The problems are hand-picked from a random subset of the full set. Why not run the full set?
It seems like you used an interpreter in the loop, which is likely to help. I'd also be interested to see how o1 would do in a task like this, or whether it even makes sense to use something like Prolog if the models can backtrack during the "thinking" phase.
I’ve been playing with using GPT-4 together with the Wolfram Alpha plugin, and by working together the two can reliably solve difficult quantitative problems that neither can solve individually, much like a human using a calculator.
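For readers wondering what "an interpreter in the loop" might look like, here is a minimal sketch in Python. Everything in it is an assumption for illustration and not taken from the article: `generate` stands in for an LLM call and `run` for a Prolog interpreter, and both are passed in as plain callables.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the article):
#   generate(problem, feedback) -> program text produced by an LLM
#   run(program) -> (ok, output); ok is False when the interpreter reports an error
Generate = Callable[[str, Optional[str]], str]
Run = Callable[[str], Tuple[bool, str]]


def solve_with_interpreter(problem: str, generate: Generate, run: Run,
                           max_attempts: int = 3) -> Optional[str]:
    """Ask for a program, run it, and feed any error back into the next attempt."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        program = generate(problem, feedback)
        ok, output = run(program)
        if ok:
            return output
        feedback = output  # the error text becomes context for the retry
    return None  # give up after max_attempts failed runs
```

The point of the loop is that a failed run is not wasted: the interpreter's error message is extra information the model would not have had on a single pass.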
Some thoughts:
1. Getting an LLM to model a problem accurately is a significant prompting exercise. Bridging casual logical statements and formal logic is difficult. E.g., "or" in English usually means exclusive or ("xor"), whereas "or" in formal logic is inclusive.
2. Domains usually have their own language expectations. I was doing Zebra puzzles (https://en.wikipedia.org/wiki/Zebra_Puzzle) and they have a very specific pattern and language. I don't think it's fair to really call it intuitive or even entirely unambiguous, it's something you have to learn. The LLM has to learn it too. They have seen this kind of puzzle (and I think most can reproduce the original Zebra puzzle from memory), but they lack a really firm familiarity.
3. Arguably some of the familiarity is about contextualizing the problem, which is itself a prompting task. People don't solve Zebra puzzles that they stumble on organically; we encounter them in specific contexts (like a puzzle book), which is not so dissimilar from prompting.
4. Incidentally, Claude 3.5 Sonnet has a substantial lead, and o1 is not much better than GPT-4o. In some sense I think o1 is a kind of self-prompting, an attempt to create its own context; so if you already have a well-worded prompt with instructions, then o1 isn't that good at improving performance over 4o.
5. A lot of the prompting is really intended to slow down the LLM, to keep it from jumping to conclusions or solving a task too quickly (and incorrectly). Which again is a case of the prompt doing what o1 tries to do generally.
6. I'm not sure what tasks call for this kind of logical reasoning. Not that I don't think they exist, I just don't know how to recognize them. Planning tasks? Highly formalized and artificially constructed problems don't seem all that interesting... and the whole point of adding an LLM to the process is to formalize the informal.
7. Perhaps it's hard to see because real-world problems seldom have conveniently exact solutions. But that's not a blocker... Prolog (and Z3) can take constraints as a form of elimination, providing lists of possible answers, and maybe just reducing the search space is enough to move forward on some kinds of problems.
8. For instance, when I give my pipeline really hard Zebra problems it usually doesn't succeed; one bug in one rule will kill the whole thing. Also, I think the LLMs have a hard time keeping track of large problems: a context size problem, even though the problems don't approach their formal context limits. But I can imagine building the pipeline so it also tries to mark low-confidence rules. Given that, I can imagine removing those rules, sampling the resulting (non-unique, sometimes incorrect) answers, and using that to revisit and perhaps correct some of those rules.
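Points 7 and 8 can be illustrated with a toy sketch in Python: treat the rules as filters over candidate answers, and see what survives when a low-confidence rule is dropped. The puzzle, the rule names and the `confident` flag are all made up for illustration; a real pipeline would hand the constraints to Prolog or Z3 instead of enumerating permutations.

```python
from itertools import permutations
from typing import Callable, Dict, List, NamedTuple

Assignment = Dict[str, int]  # colour -> house position (0, 1, 2)


class Rule(NamedTuple):
    name: str
    check: Callable[[Assignment], bool]
    confident: bool  # hypothetical flag an LLM might attach to each rule


def candidates(rules: List[Rule], keep_low_confidence: bool = True) -> List[Assignment]:
    """Return every assignment that survives the active rules."""
    active = [r for r in rules if keep_low_confidence or r.confident]
    colours = ("red", "green", "blue")
    result = []
    for positions in permutations(range(3)):
        assignment = dict(zip(colours, positions))
        if all(rule.check(assignment) for rule in active):
            result.append(assignment)
    return result


rules = [
    Rule("red is left of green", lambda a: a["red"] < a["green"], True),
    Rule("blue is not in the middle", lambda a: a["blue"] != 1, True),
    # Deliberately buggy rule, marked low-confidence: it contradicts the first rule.
    Rule("green is first", lambda a: a["green"] == 0, False),
]

print(candidates(rules))                             # all rules active: nothing survives
print(candidates(rules, keep_low_confidence=False))  # buggy rule dropped: two candidates remain
```

With the buggy rule in place the whole thing fails; without it the answer is no longer unique, but the search space is down from six assignments to two, which may be enough to go back and repair the suspect rule.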
Really I'd be most interested to hear thoughts on where this logic programming might actually be applied... artificial puzzles are an interesting exercise, but I can't really motivate myself to go too deep.
Garbage is garbage, and failure to reason is failure to reason, no matter the language. If your LLM can't translate your problem to a Prolog program that solves your problem, Prolog can't solve your problem.
Using just an LLM did not produce reliable queries, despite trying many many prompts, so being an old Prolog hacker I wondered if using it might impose more 'logic' on the LLM. So we precede the textual description of the constraints with the following prompt:
-------------
Now consider the following Prolog predicates:
biomarker(Name, Status) where Status will be one of the following integers -
Wildtype = 0 Mutated = 1 Methylated = 2 Unmethylated = 3 Amplified = 4 Deleted = 5 Positive = 6 Negative = 7
tumor(Name, Status) where Status will be one of the following integers if known else left unbound -
Newly diagnosed = 1 Recurrence = 2 Metastasized = 3 Progression = 4
chemo(Name)
surgery(Name) Where Name may be an unbound variable
other_treatment(Name)
radiation(Name) Where Name may be an unbound variable
Assume you are given a predicate atMost(T, N) where T is a compound term and N is an integer. It will return true if the number of 'occurrences' of T is less than or equal to N, else it will fail.
Assume you are given a predicate atLeastOneOf(L) where L is a list of compound terms. It will succeed if at least one of the compound terms, when executed as a predicate returns true.
Assume you are given a predicate age(Min, Max) which will return true if the patient's age is in between Min and Max.
Assume you have a predicate not(T) which returns true if predicate T evaluates false and vice versa. i.e. rather than '\\+ A' use not(A).
Do not implement the above helper functions.
VERY IMPORTANT: Use 'atLeastOneOf()' whenever you would otherwise use ';' to represent 'OR'. i.e. rather than 'A ; B' use atLeastOneOf([A, B]).
EXAMPLE INPUT: Patient must have recurrent GBM, methylated MGMT and wildtype EGFR. Patient must not have mutated KRAS.
EXAMPLE OUTPUT: tumor('gbm', 2), biomarker('MGMT', 2), biomarker('EGFR', 0), not(biomarker('KRAS', 1))
------------------
The Prolog predicates, when evaluated, generate the required underlying query (of course the Prolog is itself a form of query).
Anyway - the upshot was a vast improvement in the accuracy of the generated query (I've yet to see a bad one). Somewhere in its bowels, being told to generate Prolog 'focused' the LLM. Perhaps LLMs are happier with declarative languages rather than imperative ones (I know I am :) ).
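To make the semantics of the helper predicates in the prompt concrete, here is a small Python sketch that evaluates the same kind of terms directly against a patient record. This is not the pipeline described above, which generates a query and is not shown in the thread. The tuple encoding, the `holds` and `satisfies` functions, the patient record, the use of `None` for an unbound variable and the inclusive age bounds are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

Term = Tuple[Any, ...]  # e.g. ("biomarker", "MGMT", 2); None plays the role of an unbound variable


def matches(pattern: Term, fact: Term) -> bool:
    """A fact matches when arity agrees and every bound argument is equal."""
    return len(pattern) == len(fact) and all(
        p is None or p == f for p, f in zip(pattern, fact)
    )


def count(term: Term, patient: Dict[str, Any]) -> int:
    """Number of recorded facts that match the term."""
    return sum(matches(term, fact) for fact in patient["facts"])


def holds(term: Term, patient: Dict[str, Any]) -> bool:
    """Evaluate one term against a patient record."""
    functor = term[0]
    if functor == "not":
        return not holds(term[1], patient)
    if functor == "atLeastOneOf":
        return any(holds(t, patient) for t in term[1])
    if functor == "atMost":
        return count(term[1], patient) <= term[2]
    if functor == "age":
        return term[1] <= patient["age"] <= term[2]  # inclusive bounds: an assumption
    return count(term, patient) >= 1  # plain fact lookup


def satisfies(criteria: List[Term], patient: Dict[str, Any]) -> bool:
    """A comma-separated Prolog body is a conjunction."""
    return all(holds(t, patient) for t in criteria)


# The EXAMPLE OUTPUT from the prompt, transcribed as tuples.
criteria = [
    ("tumor", "gbm", 2),
    ("biomarker", "MGMT", 2),
    ("biomarker", "EGFR", 0),
    ("not", ("biomarker", "KRAS", 1)),
]

# Hypothetical patient record, invented for illustration.
patient = {
    "age": 54,
    "facts": [("tumor", "gbm", 2), ("biomarker", "MGMT", 2), ("biomarker", "EGFR", 0)],
}

print(satisfies(criteria, patient))
```

Routing "OR" through atLeastOneOf and negation through not, as the prompt insists, means each connective is just one more named term to dispatch on, which is part of why the output is easy to process downstream.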