Friday, May 31, 2024

AI/ML, Guessing, and Knowledge

Dear Blog,

The scene is a room in Morton Hall, Ohio University. 

Tutor - "So that makes this 2,2-dimethylbut-____? What kind of compound is this, an al-____?"
Student - "alkene?"
Tutor - "No"
Student - "alkyne?"
Tutor - "You're guessing!"

The tutor was working to instill knowledge in the students. They had worked through the naming with the two 2s, the di- prefix, the methyl substituents, and the but- root. The final suffix follows a rule: there is something evident in the compound's bonds, and there is an appropriate suffix for each case. Some numbers mean you've made a mistake, or that it isn't in that compound group and you need to back up, but for basic homework problems you can reasonably assume it will work out to one of the suffixes covered: an alkene, an alkyne, maybe an alkane.

I remember this particular scene because I was so impressed with the tutor for shutting down the guessing attempts rather than ultimately rewarding the student for guessing correctly (I don't know chemistry, but I think there are only a handful of options here, so guess-and-check would have worked relatively quickly). The correct suffix that time was 'ane.' The compound was 2,2-dimethylbutane, among all the possible alkanes. 2,2-dimethylbutene is a meaningful sequence of glyphs, but a proper chemist would have to speak to whether or not atoms would ever actually do that (they would not; I had a chemist verify).
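
(For the rule itself, in toy form: the suffix encodes the carbon-carbon bonds. I'm not a chemist, so treat this as an approximate rendering of the rule rather than proper IUPAC nomenclature.)

    # A non-chemist's toy version of the suffix rule: the point is that it is
    # a rule you apply, not an answer you guess at. Details approximate.
    def suffix(highest_carbon_carbon_bond: int) -> str:
        return {1: "ane",   # only single C-C bonds    -> alkane
                2: "ene",   # at least one double bond -> alkene
                3: "yne",   # at least one triple bond -> alkyne
                }[highest_carbon_carbon_bond]

    print("2,2-dimethylbut" + suffix(1))   # 2,2-dimethylbutane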

Unfortunately, our K-12 educations have a bad history of reinforcing guessing as a primary strategy over learning principles to be applied to problems. I've observed high-school classrooms where teachers ask a question and reward the first student to get it right, fielding and rejecting a half dozen wildly nonsensical answers before getting the correct one. Why the reward? The student was simply lucky; there's no evidence that they understood the basic principle. A group of 20-35 students throwing out common math answers will always eventually get the correct answer to enough problems to pass with a C. So why bother learning why the answers are what they are, if that's harder than getting good at guessing? At noticing soft correlations between certain words in the problem and certain answers? For the online homework, whenever a certain store or name is mentioned, the answer is usually 12.50 (because of lazy programming, or limitations imposed on the programmer of the assignment, etc.). Thus, a reliable strategy for getting at least 70% of answers correct is to guess and check and develop an intuition for which keywords lead to which answers on the platform. This strategy is quite general: it works on ALEKS, and it works on IXL and Edfinity. It works when you're in third grade doing simple fraction problems, and it works when you're in tenth grade working problems in quadratics.

What the tutor wanted, what educators generally are looking for, is for the student to apply the rule. There is a naming definition he was helping the students learn, and the output 'alkane' isn't supposed to be associated with 'molecules that look vaguely like the examples from lecture that were alkanes.' It's supposed to be associated with molecules whose carbon chain contains only single bonds. That's a property that was new to the student, a strategy they weren't familiar with. The student was already familiar with guess-and-check, so they were applying what they did know -- what had been rewarded in the past. And given enough examples, they'd probably start to pick up even better correlations, like associating a single line in the structure drawing with the alkanes, rather than double lines or higher. Unfortunately, they won't actually know the rule, and so their knowledge is garbage. It can't be extended, and it can't be applied outside the context of the sample problems from that lecture and the online assignment. If they ever have to do chemistry in life, they will have to relearn that property, or else cause problems in the lab and hopefully be fired before they kill someone.

A better outcome, I think all readers will agree, would be for the student to say, 'I don't know, I didn't understand the instructor [I missed class], can you explain how to figure out the suffix?' After all, this is a student in supplemental instruction. I don't expect them to correctly apply the rule; that's an outcome we hope for after the instruction, after the learning process is complete.

I've written many words, and not spoken yet at all about AI or machine learning. I wish now to extract the key portions of a very long and deep email exchange I had with John Hempfling (my ex's ex). John is of the opinion that Chat-GPT and better LLMs represent artificial general intelligence. I'll include the unabridged email chain in comments on this blog. 

John:

Characterizing ChatGPT's understanding and production of language

One can say that AIs are guessing effectively, or that they are understanding and producing language.

One can say that language is an extremely complex set of both syntactic rules and semantic meanings. (To demonstrate the necessity of semantic understanding to language parsing and production, I'll steal an example from Otherwords: there is no syntactic difference between "The girl drew the dog with a pencil" and "The girl drew the dog with a bone," but if you know the meaning of the words, then you can conclude the first statement indicates that the girl used a pencil to draw, while the second statement indicates that the dog in the drawing had a bone in its possession.) Talking about language in terms of meaning and understanding we could call the "realist position"; it takes language seriously as a series of signifiers that have a relationship with things that are signified. Similarly, talking about humans using language to express and understand meaning seems to take humans' use of language seriously.

One can also say that language is a set of basically arbitrary signs that occur in certain regular patterns. The patterns cause the signs to appear to have certain more-or-less well-defined relationships with each other and with other non-linguistic phenomena. The apprehension of these signs and their relationships allows for language comprehension, and the manipulation of these signs and patterns allows for language production. Talking about language in terms of signs and patterns is the "antirealist" position; from this position it is easier to regard the meaning of language as inherently tenuous, since the regularity of the patterns that give it meaning will constantly be intentionally or unintentionally disrupted. Similarly, talking about LLMs as seeing language as a series of patterns and statistical relationships that it can use to predict how a prompt would most likely be continued seems to take LLMs' use of language less seriously.

However, there's no contradiction between these two ways of regarding language, they're two sides of the same coin. Or, to speak more precisely, the meanings and understandings are a layer of abstraction that sits on top of the patterns and statistical relationships. (We could also say that the "realist position" on language is a layer of abstraction that sits on top of the "antirealist position"--this is typical of the relationship between realist and antirealist perspectives.) Meaning and understanding can be seen as emerging from patterns and regularities as long as those patterns and regularities adhere to certain assumptions, in the same way that we can recognize a computer as executing a program that contains objects with methods and properties, so long as the computer's hardware, logical operations and everything else in the lower layers of abstraction are functioning as expected.
...

For an LLM to respond with a piece of text that it predicts is very likely to follow a prompt demonstrates a capacity to understand the prompt and respond appropriately. As the quality of responses improve and lengthen, the number of paths that would allow the AI to shortcut what we would regard as the "true understanding" with a weak facsimile of understanding eventually dwindles to zero.
...

[John spent a few paragraphs attempting to illustrate that with certain modifications and enhancements, Chat-GPT would be capable of operating as a therapist.]
...

 Maybe we're all pragmatists that agree with the proposition that if an LLM consistently uses words meaningfully and correctly, then the LLM understands the words that it's using. In this case, I think I must be overrating ChatGPT's closeness to AGI either because:

(1) I'm missing important language use failures where ChatGPT is less fluent than the average human. Or,

(2) I'm overrating the value of understanding language, and underrating both the number of other elements necessary for an AGI, and the difficulty of adding those other elements.
...

[John produced a detailed anecdote of Helen Keller's understanding of the word 'red' which is worth reading and provides depth to his argument.]
...

I have one simpler, maybe dumber, maybe more important reason for why I think it can be said that ChatGPT really understands what it is saying: I can tell when I'm talking to someone who really understands what they're talking about, and one that is bullshitting, and this one understands what it's talking about.

[end]

My first response to John:

"As the quality of responses improve and lengthen, the number of paths that would allow the AI to shortcut what we would regard as the "true understanding" with a weak facsimile of understanding eventually dwindles to zero."

False. What you are missing here is the scope of how many ways the system can be wrong while appearing right, probably due to the scale of the number of dimensions involved. A working English vocabulary runs to roughly 5,000 words (skipping drug and chemical names, etc). The human ability to construct a concept by concatenating words vanishes after about 100 words. So we could say that there are at most 5000^100 possible meaningful constructs. Indeed, the vast majority of those combinations are fundamentally meaningless, and I estimate that we are dealing with a number closer to 5000^25. Both of these are large numbers, but the problem is that large language models have almost unfathomably many dimensions to operate in. GPT-3 has 175 billion parameters, but they are linked as a deep network, dozens of layers deep, such that it is capable of modeling objects in a space with potentially trillions of dimensions, compared to English's feeble millions. To say that GPT approximates human language so well that it can't possibly be anything but a well fit model of human language, is to say that the intersection of a three dimensional shape with a plane is a square, therefore it must be a cube (and not a pyramid, bi-pyramid, finite, infinite, regular, irregular, etc, but one thing is for sure, at some axis it has a slice that is square). Comparing 2 to 3 dimensions makes the problem plain, comparing billions to trillions is often confusing.
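
(To put those numbers on a log scale, here is the back-of-the-envelope arithmetic. The 5,000-word vocabulary and the 25-to-100-word concept length are my own loose estimates from above, nothing more.)

    import math

    # Orders of magnitude for the estimates above.
    vocab = 5_000
    print(100 * math.log10(vocab))   # log10(5000^100) ~ 370: loose upper bound on constructs
    print(25 * math.log10(vocab))    # log10(5000^25)  ~ 92.5: my tighter estimate
    print(math.log10(175e9))         # log10 of GPT-3's parameter count ~ 11.2, for raw scale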

"In the end, the only way to provide a lengthy continuation of a sufficiently complicated and specific prompt is to actually understand the prompt." 

False. Again, the lengthy continuation of a fiddly prompt is just a subset of an n-dimensional space, like a hypercube, or a 50k-dimensional star that has a different point in each cardinal direction. There are infinitely many 50,001-dimensional shapes that intersect 50k-dimensional space such that the exact star I described washes out, but in the remaining dimension, anything goes. It's the same as how expanding a square into 3-D allows infinitely many possibilities, and a cube is only likely insofar as humans thinking about squares usually think about cubes.

While I have no beef with, nor shade to throw at, your post-structuralist sense (not nonsense, since it is intelligible and possibly a very strong outlining of ideas, at least for humans), let's consider a very weird way to think about language: as a projection from the n-dimensional space of human thought and interaction with the world into the space of finite collections of words. This may provide us extremely little insight into HOW the projection works for humans or computers, but it gives us a clear idea of the number of dimensions, and so makes plain the analogy I want to make. To say that because the model accurately builds projections into the space of finite collections of words, it must therefore have the same pre-image (human understanding of the concepts being projected), is the same as saying that every object that casts a circular shadow is the same three-dimensional shape, or the same up to a few variations: spheres and cylinders. However, a totem pole casts a circular shadow at noon. And indeed, there are whole classes of shapes that are rotationally projection-invariant, yet we cannot conclude much of anything about their pre-image.
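
(A toy numerical version of the shadow point, for readers who like to see it run. The shapes and numbers are made up for illustration; the only claim is that identical projections can have wildly different pre-images.)

    import numpy as np

    rng = np.random.default_rng(0)

    def shadow_radius(points_xyz):
        # Project onto the ground plane by dropping z, then measure the
        # radius of the resulting "shadow" (max distance from the origin).
        return np.linalg.norm(points_xyz[:, :2], axis=1).max()

    # A sphere and a cylinder, both of radius 1...
    n = 10_000
    sphere = rng.normal(size=(n, 3))
    sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)

    theta = rng.uniform(0, 2 * np.pi, n)
    cylinder = np.stack([np.cos(theta), np.sin(theta),
                         rng.uniform(-3, 3, n)], axis=1)

    # ...cast indistinguishable circular shadows: same projection, different pre-image.
    print(shadow_radius(sphere), shadow_radius(cylinder))   # both ~1.0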

Consider two cases, one in which we ask ChatGPT to describe the three-dimensional shape depicted in an SVG (an image format rendered in text, so it can be parsed by LLMs); let's say the image is of a cube. The response might be: "The image depicts a cube, a three dimensional regular solid, with 8 corners and 6 sides, each side a square, the top is shaded green and the two other visible sides are red, but the bottom and back sides are not visible."

In the other case, we prompt: Repeat after me, "The image depicts a cube, a three dimensional regular solid, with 8 corners and 6 sides, each side a square, the top is shaded green and the two other visible sides are red, but the bottom and back sides are not visible."

In the second case the output is identical, so from the response alone we can determine neither the prompt nor whether the machine understood it.

"(2) I'm overrating the value of understanding language, and underrating both the number of other elements necessary for an AGI, and the difficulty of adding those other elements."

Yes. AGI has to be an agent that has goals in the real world. ChatGPT only has goals in the space of language; its only goal is to provide apparently meaningful and correct responses to prompts rendered in language. Being a good therapist is actually an example of something that a carefully tuned and configured LLM would do pretty easily. What it won't do at all is operate outside that space.

Consider that ChatGPT is loaded into a mobile robot and given eyes and ears. Will it join you on a walk if you gesture? If so, why? Because it enjoys walks? Because it values time with us? Because it believes that we value time with it? Because someone wrote an interpreter that allows it to express some language in movement, and the movements associated with joining you on a walk are consistent with the prompt that was mapped to by the visual processors that picked up the gesture? Or because it has a fixed set of 175 billion parameters that spit out servo instructions rendering walking as an output, which it was trained to do by gradient descent over billions of examples it was provided?

Mostly the last one, and it would take years of development to get it to work once, at this point. 

Don't forget, I trained a model on 50,000 images of noise, and it was able to guess at images from 10 categories with 30% accuracy, statistically significantly better than chance. Does that model 'understand' what a newt is? No. It just has the result of firing thousands of shots around the images of newts into a peg-board. There's a newt-shaped void, and it has that. It's aluminum foil wrapped around meaning and squished by math until it looks the same.

[In the last paragraph I referenced a conversation we had had in person. At work, I tasked resnet-50 with labeling 50k 'images' (pure random values in an appropriate format to be interpretable as a bitmap). I subsequently trained a 6-layer neural network on the (random-noise, resnet50-label) pairs for various lengths of time (up to 75k epochs). Shockingly, the model, which had been trained exclusively on noise, was able to significantly beat random chance when classifying actual images of newts and dogs: 30% accuracy on a sample of, I think, 6,000 images in 10 categories. This experiment forced me to reckon with the meaning of information and knowledge.]
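
[For the technically inclined, here is the shape of that experiment as a minimal PyTorch sketch. It is illustrative only: the image sizes, the three-layer student here (mine had six layers), and the hyperparameters are stand-ins, not my actual code.]

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, ResNet50_Weights

    # 1. A pretrained teacher labels a fixed set of pure-noise "images".
    #    (Scaled down here; the real run used 50,000 full-size noise bitmaps.)
    teacher = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
    noise = torch.rand(2_000, 3, 64, 64)
    with torch.no_grad():
        labels = torch.cat([teacher(b).argmax(dim=1) for b in noise.split(128)])

    # 2. A small student network is trained on the (noise, teacher-label) pairs only.
    student = nn.Sequential(
        nn.Flatten(),
        nn.Linear(3 * 64 * 64, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 1000),
    )
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(200):                      # the real run went up to 75k epochs
        for xb, yb in zip(noise.split(128), labels.split(128)):
            opt.zero_grad()
            loss_fn(student(xb), yb).backward()
            opt.step()

    # 3. Evaluate the student on real images it has never seen (not shown here);
    #    in my run it beat chance on a 10-category subset despite training only on noise.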

[end]

John's Response:

Yes, yes, it seems we don't disagree regarding ChatGPT and its capabilities, just regarding human thought, language and the definition of AGI.

Your multidimensional analogy is perfect for explaining this distinction. You wrote,

"To say that GPT approximates human language so well that it can't possibly be anything but a well fit model of human language, is to say that the intersection of a three dimensional shape with a plane is a square, therefore it must be a cube (and not a pyramid, bi-pyramid, finite, infinite, regular, irregular, etc, but one thing is for sure, at some axis it has a slice that is square). Comparing 2 to 3 dimensions makes the problem plain, comparing billions to trillions is often confusing. "

I am not trying to make any statement about the 3 dimensional object, it is a matter of indifference to me whether it is a cube, a pyramid, bi-pyramid, etc. My position is that linguistic meaning and reasoning does not extend beyond the limits of the 2 dimensional object of your analogy. Essentially I'm asserting that if the intersection of a plane with a three dimensional object is a quadrilateral with four right angles, then that intersection really is a rectangle, and whatever the shape of the 3 dimensional object is, it is not particularly important. Semantic meanings in utterances are limited, and understanding the connections between the utterance and the deeper structures in the speaker's mind is probably impossible, and certainly not necessary to properly interpret the meaning of the speaker's utterances.

Before I got into linguistic turn philosophy, I studied pragmatist philosophy, a much earlier and simpler set of ideas about language and thought. The central tenet of pragmatist philosophy is that the entire meaning of a word or a sentence is contained in the ways that that word or sentence can be used. Ordinary language philosophy is based on a similar view. Post linguistic turn philosophy also likes to make statements about the limits of language, while at the same time taking the view that there is nothing that is knowable (or discussable) outside of the realm of language. The fundamental move here is to argue that there are not really deeper meanings beyond those that are accessible to the ordinary person with an ordinary grasp of language, and that gestures to such meanings are due to either some misunderstanding of ordinary language, or some philosophical mysticism. A somewhat less comfortable approach is to emphasize that we have no knowledge of real objects, and so our facts and theories can be true or false only in the sense of their practical implications for action. One of my favorite analyses from this approach is Lakatos's philosophy of the history of science, which takes the approach of evaluating the success of different scientific research projects across the history of science based on how well they are able to make novel predictions that are later proven true.
...

[end]

I subsequently asked John's permission to quote him for the purposes of this blog entry and received it. I discontinued the conversation because we had arrived at a standstill over the basic premises of the discussion: what is knowledge, what is intelligence, what is the value of language and its relationship with anything outside of language? He and I fundamentally disagree on the answers to these questions. The section I highlighted in purple is, to me, simply wrong, and cannot be argued with. He is the sort of person who has been known to read a book while riding a bicycle, so it shouldn't be surprising that he would have a greater appreciation than I do of an entity capable of constructing interesting sequences of words. I appreciate more the bicycle's capacity to transport my physical form from one real geophysical location to another, while enabling the sequence of hormone shifts that lets my animal self cease the internal narration of my mind (the emergent temporal property of an anxious human).

Can an entity carry on meaningful discourse in post-structuralism by guessing effectively? Absolutely - I think this says a lot about post-structuralism. Can an entity pass the bar exam by guessing effectively? I think the answer right now is 'almost.' Again, I think this says more about the quality of the bar exam as a test of legal knowledge than about the machine. Is there something complex and deep behind a large language model capable of passing the bar exam? Yes, it is complex and deep. It is a remarkable thing. It still doesn't know law, because it doesn't know anything. It guesses at law effectively. It might know post-structuralism, because post-structuralism may fail to derive from phenomena outside the realm of finite sequences of bullshit.

It's my blog, so I get to generate the final sequence of words. My training is in mathematics, so if I'm going to get into a pissing contest with an entity, I'm going to throw down in mathematics. In particular, Chat-GPT and subsequent large language models are extremely, disastrously bad at math. To be precise, they are disastrous at performing arithmetic operations correctly, and they regularly produce nearly insane-sounding geometric arguments. This is absurd, given that they purport to have learned the entirety of Wikipedia and other sources, and given that they are constructed and executed on machines able to perform basic arithmetic with mind-boggling speed and accuracy. If you ask Chat-GPT how to add, it will describe how to do it: it will use English to lay out the algorithm, the rules for adding digits, the rule for carries, and so on. If you then ask it to add two particular numbers, it will not carry out that algorithm. It can't; it doesn't have an adder. It only has a deep neural network of feed-forward layers and so on. So it will produce a probability distribution over the space of words, with maybe the right answer winning out, and maybe not. If you ask it a sufficiently weird arithmetic problem (what is the third digit after the decimal of the product of e with the current time in seconds since epoch, in base 60), it will fail every time. Why?
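
(For contrast, the algorithm it can describe but cannot reliably execute fits in a dozen lines. A sketch of ordinary schoolbook addition with carries, in Python; nothing here is specific to any model.)

    def add_by_digits(a: str, b: str) -> str:
        """Schoolbook addition: add digit pairs right-to-left, propagating a carry."""
        a, b = a.zfill(len(b)), b.zfill(len(a))   # pad the shorter number with zeros
        carry, digits = 0, []
        for da, db in zip(reversed(a), reversed(b)):
            total = int(da) + int(db) + carry
            digits.append(str(total % 10))        # the digit that stays in this column
            carry = total // 10                   # the carry that moves one column left
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    print(add_by_digits("987654321", "123456789"))   # 1111111110, every single time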

Because it is guessing!

That's how it was trained, and so that's all it can possibly do. That's how the majority of these models are trained: first they guess - and then they are rewarded, and they guess again.
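
(To make 'guess, get rewarded, guess again' concrete: at its core, the training loop is next-token prediction pushed around by gradient descent. A toy, runnable version with a made-up vocabulary and corpus; real training differs in scale, not in the shape of the loop.)

    import torch
    import torch.nn as nn

    # A toy next-token predictor: guess the next token id from the previous one.
    vocab_size = 100
    model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()

    # A made-up "corpus": a repeating sequence, so the next token is learnable.
    corpus = torch.arange(1000) % vocab_size
    prev, nxt = corpus[:-1], corpus[1:]      # (context token, the token that actually came next)

    for step in range(500):
        logits = model(prev)                 # the guess: a distribution over the whole vocabulary
        loss = loss_fn(logits, nxt)          # how wrong the guess was
        opt.zero_grad()
        loss.backward()                      # nudge the parameters downhill...
        opt.step()                           # ...and guess again on the next pass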

As an educator, I deeply disrespect any entity that becomes exceedingly good at guessing, at great expense, where a limited amount of effort toward understanding a rule would yield the superior result of knowledge. I fundamentally disagree that gradient descent is a venture that humanity ought to expend resources upon. We should invest more time in building human understanding than in burning resources to get machines to pretend that they understand.

Anyway, it's nice out. I think I'll go on a bike ride. 

-Ian Hogan, PhD

Cyber AI/ML Research Scientist (Tech-bro)