I tried an experiment using a Prolog interpreter with GPT-4 to answer complex logic questions. It was really difficult because the model didn't seem to know Prolog well enough to write a program of any complexity.
It seems like you used an interpreter in the loop, which likely helps. I'd also be interested to see how o1 would do at a task like this, or whether it even makes sense to use something like Prolog if the models can backtrack during the "thinking" phase.
I also wrote an LLM-to-Prolog interpreter for a hackathon, called "Logical".
With a few hours' effort I'm sure it could be improved.
https://github.com/Hendler/logical
I think while LLMs may approach completeness here, it's good to have an interpretable system to audit/verify and reproduce results.
I bet one person could build a pretty good synthetic NL->Prolog dataset. The ROI on paying that person would be high if you were building a foundation model (i.e., benefits beyond being able to output Prolog).
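As a toy illustration of how such a dataset could be bootstrapped, here is a template-based pair generator. Everything below (the predicates, names, and phrasings) is invented for illustration; a real dataset would need far more structural variety, and likely LLM-paraphrased natural-language sides rather than fixed templates.

```python
import itertools
import json

def make_pairs(n: int = 5) -> list:
    """Generate synthetic NL->Prolog training pairs from templates."""
    names = ["alice", "bob", "carol"]
    # (Prolog predicate, English phrasing) template pairs.
    rels = [("parent", "is a parent of"), ("likes", "likes")]
    pairs = []
    for a, b in itertools.permutations(names, 2):
        for pred, phrase in rels:
            nl = f"{a.capitalize()} {phrase} {b.capitalize()}."
            pl = f"{pred}({a}, {b})."
            pairs.append({"nl": nl, "prolog": pl})
    return pairs[:n]

for pair in make_pairs(3):
    print(json.dumps(pair))
```

Templates like this only cover ground facts; the valuable (and harder) part of the dataset would be rules, recursion, and multi-clause programs paired with genuinely varied English.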