Use Prolog to improve LLM's reasoning

https://shchegrikovich.substack.com/p/use-prolog-to-improve-llms-reasoning

197 | shchegrikovich | 4 days ago | 54 | HN

I've come to appreciate, over the past 2 years of heavy Prolog use, that all coding should (eventually) be done in Prolog.

It's one of the few languages that is simultaneously a standalone logical formalism and a standalone representation of computation. (With caveats and exceptions, I know.) So a Prolog program can stand in as a document of all the facts, rules and relations that a person or organization understands or declares to be true. Even if AI writes code for us, we should expect to have it presented and manipulated as a logical formalism.

Now if someone cares to argue that some other language/compiler is better at generating more performant code on certain architectures, then that person can declare their arguments in a logical formalism (Prolog) and we can use Prolog to translate between language representations, compile, optimize, etc.

gorkempacaci | 4 hours ago

The generated programs are only technically Prolog programs. They use CLPFD, which makes these constraint programs. Prolog programs are quite a bit more tricky with termination issues. I wouldn’t have nitpicked if it wasn’t in the title.

Also, the experiment method has some flaws. Problems are hand-picked out of a random subset of the full set. Why not run the full set?

fsndz | 5 hours ago

This is basically the LLM-Modulo approach recommended by Prof. Subbarao Kambhampati. Interesting, but it mostly only works for problems that have some math or first-order logic puzzle at their heart. It will fail to improve performance on ARC-AGI, for example... It is difficult to mimic reasoning by basic trial and error and then hoping for the best: https://www.lycee.ai/blog/why-sam-altman-is-wrong

pjmlp | 7 hours ago

So we are back to the Japanese Fifth Generation plan from the 1980s. :)

linguae | 7 hours ago

This time around we have all sorts of parallel processing capabilities in the form of GPUs. If I recall correctly, the Fifth Generation project envisioned highly parallel machines performing symbolic AI. From a hardware standpoint, those researchers were way ahead of their time.

tokinonagare | 7 hours ago

Missing some LISP but yeah it's funny how old things are new again (same story with wasm, RISC archs, etc.)

thelastparadise | 7 hours ago

Watson did it too, a while back.

a1j9o94 | 7 hours ago

I tried an experiment with this using a Prolog interpreter with GPT-4 to try to answer complex logic questions. I found that it was really difficult because the model didn't seem to know Prolog well enough to write a description of any complexity.

It seems like you used an interpreter in the loop, which is likely to help. I'd also be interested to see how o1 would do in a task like this, or whether it even makes sense to use something like Prolog if the models can backtrack during the "thinking" phase.
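For readers unfamiliar with the "interpreter in the loop" setup mentioned here, the idea can be sketched roughly as follows. This is a minimal Python sketch, not the article's implementation: `generate` and `run` are hypothetical stand-ins for an LLM call and a Prolog interpreter invocation, and the stubs exist only so the loop can run without either.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, not taken from the article:
#   generate(question, feedback) -> program text (would wrap an LLM call)
#   run(program) -> (ok, output)  (would wrap a Prolog interpreter)
Generate = Callable[[str, Optional[str]], str]
Run = Callable[[str], Tuple[bool, str]]


def solve_with_retries(question: str, generate: Generate, run: Run,
                       max_attempts: int = 3) -> Optional[str]:
    """Ask for a program, execute it, and feed errors back until it runs."""
    feedback = None
    for _ in range(max_attempts):
        program = generate(question, feedback)
        ok, output = run(program)
        if ok:
            return output
        feedback = output  # the error text becomes part of the next request
    return None  # give up after max_attempts failures


# Stubs so the sketch runs without an LLM or a Prolog installation.
def fake_generate(question, feedback):
    # The first attempt has a syntax error; the retry fixes it.
    return "answer(42)." if feedback else "answer(42"


def fake_run(program):
    if program.endswith(")."):
        return True, "42"
    return False, "syntax error: missing ')'"


result = solve_with_retries("What is 6 * 7?", fake_generate, fake_run)
print(result)
```

The point of the loop is that the model gets the interpreter's error message on each retry rather than having to produce a correct program in one shot.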

lukasb | 6 hours ago

I bet one person could probably build a pretty good synthetic NL->Prolog dataset. The ROI for paying that person would be high if you were building a foundation model (i.e., there would be benefits beyond being able to output Prolog).

UniverseHacker | 5 hours ago

I think this general idea is going to be the key to really making LLMs widely useful for solving real problems.

I’ve been playing with using GPT-4 together with the Wolfram Alpha plugin, and by working together the two can reliably solve difficult quantitative problems that neither can solve individually, much like a human using a calculator.

nonamepcbrand1 | 6 hours ago

Is this why GitHub CodeQL and Copilot assistance are working better for everyone? Basically, CodeQL uses a variant of Prolog (Datalog) to query source code and generate better results.
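For context on what Datalog-style querying over code looks like, here is a rough Python sketch of naive bottom-up evaluation of a recursive rule. It is not CodeQL's syntax or engine; the `calls` facts and function names are invented for illustration.

```python
# Facts: calls(Caller, Callee). Invented example data.
calls = {("main", "parse"), ("parse", "read_file"), ("read_file", "open")}


def reaches(calls):
    """Naive fixpoint for the Datalog rules:
         reaches(X, Y) :- calls(X, Y).
         reaches(X, Z) :- calls(X, Y), reaches(Y, Z).
    """
    derived = set(calls)
    while True:
        # Join calls(X, Y) with reaches(Y, Z) to derive reaches(X, Z).
        new = {(x, z) for (x, y) in calls for (y2, z) in derived if y == y2}
        if new <= derived:  # nothing new was derived: fixpoint reached
            return derived
        derived |= new


# Query: which functions can end up calling `open`, directly or indirectly?
callers_of_open = sorted(x for (x, y) in reaches(calls) if y == "open")
print(callers_of_open)
```

Because the set of possible facts is finite, this style of evaluation always terminates, which is one of the practical differences between Datalog and full Prolog.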

DeborahWrites | 5 hours ago

You're telling me the seemingly arbitrary 6 weeks of Prolog on my comp sci course 11yrs ago is suddenly about to be relevant? I did not see this one coming . . .

baq | 7 hours ago

Patiently waiting for z3-guided generation, but this is a welcome, if obvious, development. Results are a bit surprising and sound too optimistic, though.

de6u99er | 5 hours ago

I always thought that Prolog is great for reasoning in the semantic web. It doesn't surprise me that LLM people stumble on it.

ianbicking | 4 hours ago

I made a pipeline using Z3 (another prover language) to get LLMs to solve very specific puzzle problems: https://youtu.be/UjSf0rA1blc (and a presentation: https://youtu.be/TUAmfi8Ws1g)

Some thoughts:

1. Getting an LLM to model a problem accurately is a significant prompting exercise. Bridging casual logical statements and formal logic is difficult. E.g., "or" statements in English usually mean "xor" in logic.

2. Domains usually have their own language expectations. I was doing Zebra puzzles (https://en.wikipedia.org/wiki/Zebra_Puzzle) and they have a very specific pattern and language. I don't think it's fair to really call it intuitive or even entirely unambiguous, it's something you have to learn. The LLM has to learn it too. They have seen this kind of puzzle (and I think most can reproduce the original Zebra puzzle from memory), but they lack a really firm familiarity.

3. Arguably some of the familiarity is about contextualizing the problem, which is itself a prompting task. People don't naturally solve Zebra puzzles that we find organically, it's something we encounter in specific contexts (like a puzzle book) which is not so dissimilar from prompting.

4. Incidentally Claude Sonnet 3.5 has a substantial lead. And GPT o1 is not much better than GPT 4o. In some sense I think o1 is a kind of self-prompting, an attempt to create its own context; so if you already have a well-worded prompt with instructions then o1 isn't that good at improving performance over 4o.

5. A lot of the prompting is really intended to slow down the LLM, to keep it from jumping to conclusions or solving a task too quickly (and incorrectly). Which again is a case of the prompt doing what o1 tries to do generally.

6. I'm not sure what tasks call for this kind of logical reasoning. Not that I don't think they exist, I just don't know how to recognize them. Planning tasks? Highly formalized and artificially constructed problems don't seem all that interesting... and the whole point of adding an LLM to the process is to formalize the informal.

7. Perhaps it's hard to see because real-world problems seldom have conveniently exact solutions. But that's not a blocker... Prolog (and Z3) can take constraints as a form of elimination, providing lists of possible answers, and maybe just reducing the search space is enough to move forward on some kinds of problems.

8. For instance, when I give my pipeline really hard Zebra problems it usually doesn't succeed; one bug in one rule will kill the whole thing. Also, I think the LLMs have a hard time keeping track of large problems; it behaves like a context size problem, even though the problems don't approach their formal context limits. But I can imagine building the pipeline so it also tries to mark low-confidence rules. Given that, I can imagine removing those rules, sampling the resulting (non-unique, sometimes incorrect) answers and using that to revisit and perhaps correct some of those rules.
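Points 7 and 8 can be illustrated with a toy example. The following is a minimal Python sketch that uses brute-force enumeration in place of Z3 or Prolog; the three-house puzzle, the rules, and the confidence flags are all invented for illustration and are not from the pipeline described above.

```python
from itertools import permutations

COLORS = ("red", "green", "blue")

# Each rule: (description, test over a {color: house_number} mapping, confident?)
RULES = [
    ("green is immediately right of red", lambda p: p["green"] == p["red"] + 1, True),
    ("blue is not in house 1", lambda p: p["blue"] != 1, True),
    ("red is in house 2", lambda p: p["red"] == 2, False),  # suspect rule
]


def candidates(rules):
    """Return every assignment of colors to houses 1..3 that satisfies all rules."""
    out = []
    for perm in permutations((1, 2, 3)):
        p = dict(zip(COLORS, perm))
        if all(test(p) for _, test, _ in rules):
            out.append(p)
    return out


all_rules = candidates(RULES)                         # includes the suspect rule
trusted_only = candidates([r for r in RULES if r[2]])  # suspect rule dropped
first_rule_only = candidates(RULES[:1])                # constraints as elimination
print(len(all_rules), len(trusted_only), len(first_rule_only))
```

With the suspect rule included there are no solutions at all, which is the "one bug kills the whole thing" failure. Dropping it leaves a single answer, and a single trusted rule on its own already narrows six possible assignments to two.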

Really I'd be most interested to hear thoughts on where this logic programming might actually be applied... artificial puzzles are an interesting exercise, but I can't really motivate myself to go too deep.

sgt101 | 7 hours ago

Building on this idea, people have grounded LLM-generated reasoning logic with perceptual information from other networks: https://web.stanford.edu/~joycj/projects/left_neurips_2023

mise_en_place | 5 hours ago

I really enjoyed tinkering with languages like Prolog and Coq. Interactive theorem proving with LLMs would be awesome to try out, if possible.

YeGoblynQueenne | 5 hours ago

That's not going to work. Garbage in - Garbage out is success-set equivalent to Garbage in - Prolog out.

Garbage is garbage and failure to reason is failure to reason, no matter the language. If your LLM can't translate your problem to a Prolog program that solves your problem, Prolog can't solve your problem.

arjun_khamkar | 5 hours ago

Would creating a Prolog dataset be beneficial, so that future LLMs can be trained on it and then be able to output Prolog code?

bytebach | 4 hours ago

An application I am developing for a customer needed to read constraints around clinical trials and essentially build a query from them. Constraints involve prior treatments, biomarkers, type of disease (cancers) etc.

Using just an LLM did not produce reliable queries, despite trying many many prompts, so being an old Prolog hacker I wondered if using it might impose more 'logic' on the LLM. So we precede the textual description of the constraints with the following prompt:

-------------

Now consider the following Prolog predicates:

biomarker(Name, Status) where Status will be one of the following integers -

Wildtype = 0, Mutated = 1, Methylated = 2, Unmethylated = 3, Amplified = 4, Deleted = 5, Positive = 6, Negative = 7

tumor(Name, Status) where Status will be one of the following integers if known, else left unbound -

Newly diagnosed = 1, Recurrence = 2, Metastasized = 3, Progression = 4

chemo(Name)

surgery(Name) Where Name may be an unbound variable

other_treatment(Name)

radiation(Name) Where Name may be an unbound variable

Assume you are given a predicate atMost(T, N) where T is a compound term and N is an integer. It will return true if the number of 'occurrences' of T is less than or equal to N, else it will fail.

Assume you are given a predicate atLeastOneOf(L) where L is a list of compound terms. It will succeed if at least one of the compound terms, when executed as a predicate returns true.

Assume you are given a predicate age(Min, Max) which will return true if the patient's age is in between Min and Max.

Assume you have a predicate not(T) which returns true if predicate T evaluates false and vice versa. i.e. rather than '\\+ A' use not(A).

Do not implement the above helper functions.

VERY IMPORTANT: Use 'atLeastOneOf()' whenever you would otherwise use ';' to represent 'OR'. i.e. rather than 'A ; B' use atLeastOneOf([A, B]).

EXAMPLE INPUT: Patient must have recurrent GBM, methylated MGMT and wildtype EGFR. Patient must not have mutated KRAS.

EXAMPLE OUTPUT: tumor('gbm', 2), biomarker('MGMT', 2), biomarker('EGFR', 0), not(biomarker('KRAS', 1))

------------------

The Prolog predicates, when evaluated, generate the required underlying query (of course the Prolog is itself a form of query).

Anyway - the upshot was a vast improvement in the accuracy of the generated query (I've yet to see a bad one). Somewhere in its bowels, being told to generate Prolog 'focused' the LLM. Perhaps LLMs are happier with declarative languages rather than imperative ones (I know I am :) ).
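To make the intended meaning of the helper predicates in the prompt concrete, here is a rough sketch in Python rather than Prolog. It is not the author's implementation: their helpers generate a query, whereas this sketch evaluates the terms directly against an in-memory patient record, purely to illustrate the semantics. The tuple encoding, the use of `None` for an unbound variable, the inclusive age bounds, and the patient data are all assumptions.

```python
# Terms are tuples: (functor, arg1, ...). None stands in for an unbound variable.

def matches(term, fact):
    """True if term and fact have the same shape and agree wherever term is bound."""
    return len(term) == len(fact) and all(
        t is None or t == f for t, f in zip(term, fact))


def holds(term, facts, age=None):
    functor = term[0]
    if functor == "not":
        return not holds(term[1], facts, age)
    if functor == "atLeastOneOf":
        return any(holds(t, facts, age) for t in term[1])
    if functor == "atMost":
        _, inner, n = term
        return sum(matches(inner, f) for f in facts) <= n
    if functor == "age":
        _, lo, hi = term
        return age is not None and lo <= age <= hi  # inclusive bounds assumed
    return any(matches(term, f) for f in facts)


def eligible(criteria, facts, age=None):
    """A comma-separated Prolog goal is a conjunction: every term must hold."""
    return all(holds(t, facts, age) for t in criteria)


# The EXAMPLE OUTPUT from the prompt, written as tuples.
criteria = [("tumor", "gbm", 2), ("biomarker", "MGMT", 2),
            ("biomarker", "EGFR", 0), ("not", ("biomarker", "KRAS", 1))]

# Invented patient record.
patient = [("tumor", "gbm", 2), ("biomarker", "MGMT", 2),
           ("biomarker", "EGFR", 0), ("chemo", "temozolomide")]

print(eligible(criteria, patient))
```

Under this encoding, a criterion such as "at most one prior chemotherapy" would be written `("atMost", ("chemo", None), 1)`, mirroring `atMost(chemo(_), 1)` in the prompt's vocabulary.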

anthk | 6 hours ago

Use Constraint Satisfaction Problem solvers. It comes up with Common Lisp with ease.
