Leben Ist Änderung

Saturday, November 2, 2024

Bang Bang

Dear Blog,

I discharged a firearm for the first time today. So did my girlfriend (hot date, yeah?)

I signed up for an introduction class -- one that seemed to advertise and focus on safety over fun or 'tactical situations'. The lesson included a rental firearm (Glock 44, .22 cal.), 50 rounds of ammunition, safety pamphlets, safety instruction, and instruction on how to load, arm, stand, aim, and fire.

My first impression, after discharging a round: people shouldn't have these things. Citizens should not have these things - cops shouldn't either (if a traffic stop requires firearms, call in SWAT). I've never fired one before, and after 90 minutes of instruction, I unloaded 50 rounds into a piece of paper no larger than a person, at various distances 3-8 meters, and didn't miss once. (Really it was 47 rounds, the piece jammed every five fires and some good rounds ended up ejected on the floor - cheap ammo, cheap class.) It's really upsetting. It was upsetting to discharge. It is upsetting to think about.

But

People do have these things. Short of moving to Australia, that isn't super likely to change. As well, a large number of people who have these things are listening to a political candidate who incites violence against 'the enemy within.' I feel, sadly, that I may need to be able to use a firearm against a misled Trump supporter who thinks my family represents an enemy of the state, or against someone 'just following orders' to take out me or my family.

I also want to get into hunting. I eat meat, though not very much, maybe a pound per week, but it's all factory farm (organic, but still factory farm). Our state has to spend large sums rounding up and executing deer, which have become a pest species due to our elimination of wolves. We might as well eat them. After today, I may consider bow hunting. I'm not sure I want to deal with a rifle or shotgun. Then again, I may have to. At least I have started to learn how.

'Merica,

right?

-Ian Hogan

Monday, July 8, 2024

Vlog Vier, Hurricane Ridge

Dear Blog,

Your Obedient,

Ian Hogan, PhD

Friday, May 31, 2024

AI/ML, Guessing, and Knowledge

Dear Blog,

The scene is a room in Morton Hall, Ohio University.

Tutor - "So that makes this 2,2-dimethylbut-____? What kind of compound is this, an al-____?"
Student - "alkene?"
Tutor - "No"
Student - "alkyne?"
Tutor - "You're guessing!"

The tutor was working to instill knowledge in the students. They had worked through the naming with the two 2s, the di prefix, methyl root, and the but prefix. The final suffix name follows a rule. There is something evident in the compound, and there's an appropriate suffix depending on the bonds. Some numbers mean you've made a mistake, or that it isn't in that compound group and you need to back up, but for basic homework problems you can reasonably assume that it'll work out to one of the suffixes covered, an alkene, alkyne, maybe an alkane. I remember this particular scene because I was so impressed with the tutor for shutting down the guessing attempts rather than ultimately rewarding the student for guessing correctly (I don't know chemistry, but I think there are only a handful of options here - so guess and check would have worked relatively quickly). The correct suffix that time was 'ane.' The compound was 2,2-dimethylbutane, among all the possible alkanes. 2,2-dimethylbutene is a meaningful sequence of glyphs, but a proper chemist would have to speak to whether or not atoms would ever actually do that (they would not, I had a chemist verify).

Unfortunately, our K-12 educations have a bad history of reinforcing guessing as a primary strategy over learning principles to be applied to problems. I've observed in high-school classrooms where teachers ask a question and reward the first student to get it right, fielding and rejecting a half dozen wildly nonsensical answers before getting the correct one. Why the reward? The student was simply lucky - there's no evidence that they understood the basic principle. A group of 20-35 students throwing out common math answers will always eventually get the correct answer to enough problems to pass with a C. So why bother learning Why the answers are what they are, if that's harder than getting good at guessing? At noticing soft correlations between certain words in the problem and certain answers? For the online homework, whenever a certain store or name is mentioned, the answer is usually 12.50 (because of lazy programming - or limitations imposed on the programmer of the assignment, etc). Thus, a reliable strategy for getting at least 70% of answers correct is to guess and check and develop an intuition for the kinds of keywords that lead to certain answers on the platform. This strategy is quite general: it works on ALEKS and it works on IXL and Edfinity. It works when you're in 3rd grade doing simple fraction problems and it works when you're in tenth grade working problems in quadratics.

What the tutor wanted, what educators generally are looking for, is for application of the rule. There is a naming definition he was helping the students learn, and the output 'alkane' isn't supposed to be associated with 'molecules that look vaguely like the examples from lecture that were alkanes.' It's supposed to be associated with molecules that have a single bond between carbon atoms in a particular position of a chain. That's a property that was new to the student, a strategy the student wasn't familiar with. The student was already familiar with guess and check, so they were applying what they did know -- what had been rewarded in the past. And given enough examples, they'd probably start to associate even better correlations, like the presence of a single line with the alkanes, rather than double lines or higher. Unfortunately, they won't actually know the rule, and so their knowledge is garbage. It can't be extended, it can't be applied outside the context of the sample problems from that lecture and the online assignment. If they ever have to do chemistry in life, they will have to relearn that property, or else cause problems in the lab and hopefully be fired before they kill someone.
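To make "application of the rule" concrete, here is a minimal Python sketch of the suffix step only. It assumes a simple hydrocarbon whose carbon-carbon bond orders are already known, and it deliberately refuses the mixed double-and-triple-bond case, which has its own naming conventions. The function name and the list-of-bond-orders input are inventions for illustration, not anything from the tutoring session.

```python
def hydrocarbon_suffix(carbon_bond_orders):
    """Pick the suffix from the carbon-carbon bonds, not from a guess.

    carbon_bond_orders: one integer per carbon-carbon bond
    (1 = single, 2 = double, 3 = triple). Hypothetical input format.
    """
    orders = set(carbon_bond_orders)
    if not orders:
        raise ValueError("need at least one carbon-carbon bond")
    if not orders <= {1, 2, 3}:
        raise ValueError("bond orders must be 1, 2, or 3")
    if 2 in orders and 3 in orders:
        # Molecules with both double and triple bonds are outside this sketch.
        raise ValueError("mixed double/triple bonds are not covered here")
    if 3 in orders:
        return "yne"  # at least one triple bond: alkyne
    if 2 in orders:
        return "ene"  # at least one double bond: alkene
    return "ane"      # only single bonds: alkane


# 2,2-dimethylbutane: six carbons joined by five single bonds.
print("2,2-dimethylbut" + hydrocarbon_suffix([1, 1, 1, 1, 1]))
```

The point of the sketch is that the answer is computed from a property of the molecule. There is nothing to guess, and an input the rule does not cover produces an error instead of a plausible-looking suffix.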

A better outcome, I think all readers will agree, would be for the student to say, 'I don't know, I didn't understand the instructor [I missed class], can you explain how to figure out the suffix?' After all, this is a student in supplemental instruction. I don't expect them to correctly apply the rule, that's an outcome we hope for after the instruction, after the learning process is complete.

I've written many words, and not spoken yet at all about AI or machine learning. I wish now to extract the key portions of a very long and deep email exchange I had with John Hempfling (my ex's ex). John is of the opinion that Chat-GPT and better LLMs represent artificial general intelligence. I'll include the unabridged email chain in comments on this blog.

John:

Characterizing ChatGPT's understanding and production of language

One can say that AIs are guessing effectively, or that they are understanding and producing language.

One can say that language is an extremely complex set of both syntactic rules and semantic meanings. (To demonstrate the necessity of semantic understanding to language parsing and production, I'll steal an example from Otherwords: there is no syntactic difference between "The girl drew the dog with a pencil" and "The girl drew the dog with a bone," but if you know the meaning of the words, then you can conclude the first statement indicates that the girl used a pencil to draw, while the second statement indicates that the dog in the drawing had a bone in its possession.) Talking about language in terms of meaning and understanding we could call the "realist position"; it takes language seriously as a series of signifiers that have a relationship with things that are signified. Similarly, talking about humans using language to express and understand meaning seems to take humans' use of language seriously.

One can also say that language is a set of basically arbitrary signs that occur in certain regular patterns. The patterns cause the signs to appear to have certain more-or-less well-defined relationships with each other and with other non-linguistic phenomena. The apprehension of these signs and their relationships allows for language comprehension, and the manipulation of these signs and patterns allows for language production. Talking about language in terms of signs and patterns is the "antirealist" position; from this position it is easier to regard the meaning of language as inherently tenuous, since the regularity of the patterns that give it meaning will constantly be intentionally or unintentionally disrupted. Similarly, talking about LLMs as seeing language as a series of patterns and statistical relationships that it can use to predict how a prompt would most likely be continued seems to take LLMs' use of language less seriously.

However, there's no contradiction between these two ways of regarding language, they're two sides of the same coin. Or, to speak more precisely, the meanings and understandings are a layer of abstraction that sits on top of the patterns and statistical relationships. (We could also say that the "realist position" on language is a layer of abstraction that sits on top of the "antirealist position"--this is typical of the relationship between realist and antirealist perspectives.) Meaning and understanding can be seen as emerging from patterns and regularities as long as those patterns and regularities adhere to certain assumptions, in the same way that we can recognize a computer as executing a program that contains objects with methods and properties, so long as the computer's hardware, logical operations and everything else in the lower layers of abstraction are functioning as expected.
...

For an LLM to respond with a piece of text that it predicts is very likely to follow a prompt demonstrates a capacity to understand the prompt and respond appropriately. As the quality of responses improve and lengthen, the number of paths that would allow the AI to shortcut what we would regard as the "true understanding" with a weak facsimile of understanding eventually dwindles to zero.
...

[John spent a few paragraphs attempting to illustrate that with certain modifications and enhancements, Chat-GPT would be capable of operating as a therapist.]
...

Maybe we're all pragmatists that agree with the proposition that if an LLM consistently uses words meaningfully and correctly, then the LLM understands the words that it's using. In this case, I think I must be overrating ChatGPT's closeness to AGI either because:

(1) I'm missing important language use failures where ChatGPT is less fluent than the average human. Or,

(2) I'm overrating the value of understanding language, and underrating both the number of other elements necessary for an AGI, and the difficulty of adding those other elements.
...

[John produced a detailed anecdote of Helen Keller's understanding of the word 'red' which is worth reading and provides depth to his argument.]
...

I have one simpler, maybe dumber, maybe more important reason for why I think it can be said that ChatGPT really understands what it is saying: I can tell when I'm talking to someone who really understands what they're talking about, and one that is bullshitting, and this one understands what it's talking about.

[end]

My first response to John:

"As the quality of responses improve and lengthen, the number of paths that would allow the AI to shortcut what we would regard as the "true understanding" with a weak facsimile of understanding eventually dwindles to zero."

False. What you are missing here is the scope of how many ways the system can be wrong while appearing right, probably due to the scale of the number of dimensions involved. English has 5000 words (approximately, skipping drug and chemical names, etc). The human ability to construct a concept by concatenating words vanishes after about 100 words. So we could say that there are at most 5000^100 possible meaningful constructs. Indeed, the vast majority of those combinations are fundamentally meaningless, and I estimate that we are dealing with a number closer to 5000^25. Both of these are large numbers, but the problem is that large language models have almost unfathomably many dimensions to operate in. GPT-3 has 175 billion parameters, but they are linked as a deep network, dozens of layers, such that it is capable of modeling objects in a space with potentially trillions of dimensions, compared to English's feeble millions. To say that GPT approximates human language so well that it can't possibly be anything but a well fit model of human language, is to say that the intersection of a three dimensional shape with a plane is a square, therefore it must be a cube (and not a pyramid, bi-pyramid, finite, infinite, regular, irregular, etc, but one thing is for sure, at some axis it has a slice that is square). Comparing 2 to 3 dimensions makes the problem plain, comparing billions to trillions is often confusing.

"In the end, the only way to provide a lengthy continuation of a sufficiently complicated and specific prompt is to actually understand the prompt."

False. Again, the lengthy continuation of a fiddly prompt is just a subset of an n-dimensional space, like a hyper cube or a 50k dimensional star that has a different point in each cardinal direction. There are infinitely many 50,001 dimensional shapes that intersect 50k dimensional space such that the exact star I described washes out, but in the remaining dimension, anything goes. Same as expanding a square into 3-d allows infinities of possibilities, and Cube is only likely insofar as humans thinking about squares usually think about cubes.

While I have no beef nor shade to throw at your post-structuralist sense (not, nonsense, since it is intelligible and possibly a very strong outlining of ideas, at least for humans), let's consider a very weird way to consider language. As a projection from the n-dimensional space of human thought and interaction with the world, into the space of finite collections of words. This may provide us extremely little insight into HOW the projection works for humans or computers, but it provides us a clear idea of the number of dimensions, and so makes plain the analogy I want to make. To say that because the model accurately builds projections into the space of finite collections of words, therefore it must have the same pre-image (human-understanding of the concepts being projected) is the same as to say that every object that casts a circular shadow is the same 3 dimensional shape, or up to a few variations the same: spheres and cylinders. However, a totem casts a circular shadow at noon. And indeed, there are whole classes of shapes that are rotationally projection invariant, but we cannot conclude anything much about their pre-image.

Consider two cases, one in which we ask ChatGPT to describe the 3 dimensional shape depicted in an svg (an image file rendered in text, so it can be parsed by LLMs), let's say the image is of a cube. The response might be: "The image depicts a cube, a three dimensional regular solid, with 6 corners and sides, each side a square, the top is shaded green and the two other visible sides are red, but the bottom and back sides are not visible."

In the other case, we prompt: Repeat after me, "The image depicts a cube, a three dimensional regular solid, with 6 corners and sides, each side a square, the top is shaded green and the two other visible sides are red, but the bottom and back sides are not visible."

In this case, we can neither determine the prompt from the answer, nor whether the machine understood the prompt.

"(2) I'm overrating the value of understanding language, and underrating both the number of other elements necessary for an AGI, and the difficulty of adding those other elements."

Yes. AGI has to be an agent that has goals in the real world. ChatGPT only has goals in the space of language, its only goal is to provide apparently meaningful and correct responses to prompts rendered in language. Being a good therapist is actually an example of something that an LLM carefully tuned and configured would do pretty easily. What it won't do at all, is operate outside that space.

Consider that ChatGPT is loaded into a mobile robot, and given eyes and ears. Will it join you on a walk if you gesture? If so, why? Because it enjoys walks? Because it values time with us? Because it believes that we value time with it? Because someone wrote an interpreter that allows it to express some language in movement, and the movements associated with joining you on a walk are consistent with the prompt that was mapped to by the visual processors that picked up the gesture? Or because it has a fixed 175 billion node set of parameters that spit out servo instructions to render walking as an output that it was trained to by gradient descent over billions of examples it was provided?

Mostly the last one, and it would take years of development to get it to work once, at this point.

Don't forget, I trained a model on 50,000 images of noise, and it was able to guess at images from 10 categories with 30% accuracy, statistically significantly better than chance. Does that model 'understand' what a newt is? No. It just has the result of firing thousands of shots around the images of newts into a peg-board. There's a newt-shaped void, and it has that. It's aluminum foil wrapped around meaning and squished by math until it looks the same.

[In the last paragraph I referenced a conversation we had had in person. At work, I tasked resnet-50 with labeling 50k 'images' (pure random values in an appropriate format to be interpretable as a bitmap). I subsequently trained a 6 layer neural network on the (random-noise, resnet50-label) pairs for various lengths of time (up to 75k epochs). Shockingly, the model which had been trained exclusively on noise was able to significantly beat random chance classification of actual images of newts and dogs, 30% accuracy on a sample of I think 6000 images in 10 categories. This experiment forced me to reckon with the meaning of information and knowledge.]

[end]

John's Response:

Yes, yes, it seems we don't disagree regarding ChatGPT and its capabilities, just regarding human thought, language and the definition of AGI.

Your multidimensional analogy is perfect for explaining this distinction. You wrote,

"To say that GPT approximates human language so well that it can't possibly be anything but a well fit model of human language, is to say that the intersection of a three dimensional shape with a plane is a square, therefore it must be a cube (and not a pyramid, bi-pyramid, finite, infinite, regular, irregular, etc, but one thing is for sure, at some axis it has a slice that is square). Comparing 2 to 3 dimensions makes the problem plain, comparing billions to trillions is often confusing. "

I am not trying to make any statement about the 3 dimensional object, it is a matter of indifference to me whether it is a cube, a pyramid, bi-pyramid, etc. My position is that linguistic meaning and reasoning does not extend beyond the limits of the 2 dimensional object of your analogy. Essentially I'm asserting that if the intersection of a plane with a three dimensional object is a quadrilateral with four right angles, then that intersection really is a rectangle, and whatever the shape of the 3 dimensional object is not particularly important. Semantic meanings in utterances are limited, and understanding the connections between the utterance and the deeper structures in the speaker's mind is probably impossible, and certainly not necessary to properly interpret the meaning of the speaker's utterances.

Before I got into linguistic turn philosophy, I studied pragmatist philosophy, a much earlier and simpler set of ideas about language and thought. The central tenet of pragmatist philosophy is that the entire meaning of a word or a sentence is contained in the ways that that word or sentence can be used. Ordinary language philosophy is based on a similar view. Post linguistic turn philosophy also likes to make statements about the limits of language, while at the same time taking the view that there is nothing that is knowable (or discussable) outside of the realm of language. The fundamental move here is to argue that there are not really deeper meanings beyond those that are accessible to the ordinary person with an ordinary grasp of language, and that gestures to such meanings are due to either some misunderstanding of ordinary language, or some philosophical mysticism. A somewhat less comfortable approach is to emphasize that we have no knowledge of real objects, and so our facts and theories can be true or false only in the sense of their practical implications for action. One of my favorite analyses from this approach is Lakatos's philosophy of the history of science, which takes the approach of evaluating the success of different scientific research projects across the history of science based on how well they are able to make novel predictions that are later proven true.
...

[end]

I subsequently asked John's permission to quote for the purposes of this blog entry and received it. I discontinued conversation because we had arrived at a standstill of difference of opinion as to the basic premises of discussion: what is knowledge, what is intelligence, what is the value of language and its relationship with anything outside of language? He and I fundamentally disagree on the answers to these questions. The section I highlighted in purple to me is simply wrong, and cannot be argued with. He is the sort of person who has been known to read a book while riding a bicycle, so it shouldn't be surprising that he would have a greater appreciation of an entity capable of constructing interesting sequences of words than I, someone who appreciates more the bicycle's capacity to simultaneously transport my physical form from one real geophysical location to another, while enabling the correct sequence of hormone shifts of my animal self to cease the internal narration of my mind (the emergent temporal property of an anxious human).

Can an entity carry on meaningful discourse in post-structuralism by guessing effectively? Absolutely - I think this says a lot about post-structuralism. Can an entity pass the bar exam by guessing effectively? I think the answer right now is 'almost.' Again, I think this says more about the quality of the bar exam as an effective test of legal knowledge. Is there something complex and deep behind a large language model capable of passing the bar exam? Yes, it is complex and deep. It is a remarkable thing. It still doesn't know law, because it doesn't know anything. It guesses at law effectively. It might know post-structuralism, because post-structuralism may fail to derive from phenomena outside the realm of finite sequences of bullshit.

It's my blog, so I get to generate the final sequence of words. My training is in mathematics. If I'm going to get into a pissing contest with an entity, I'm going to throw down in mathematics. In particular, Chat-GPT and subsequent large language models are extremely, disastrously bad at math. To be precise, they are disastrous at performing arithmetic operations correctly. As well, they regularly produce nearly insane sounding geometric arguments. This is absurd, given that they purport to have learned the entirety of Wikipedia and other sources, as well they are constructed and executed upon machines with the ability to perform basic arithmetic with mind boggling speed and accuracy. If you ask Chat-GPT how to add, it will describe how to do it - use English to describe the algorithms, the rules for addition of digits, the rule of carries, etc. If you then ask it to add two particular numbers, it will not carry out the algorithm. It can't, it doesn't have an adder. It only has a deep neural network of feed forward layers and so on. So it will produce a probability distribution over the space of words with maybe the right answer winning out, and maybe not. If you ask it a sufficiently weird arithmetic problem (what is the third digit after the decimal of the product of e with the current time in seconds since epoch, in base 60), it will fail every time. Why?

Because it is guessing!

That's how it was trained, and so that's all it can possibly do. That's how the majority of all these models are trained: first they guess - and then they are rewarded and they guess again.
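For contrast, the rule the model can describe but not execute (add the digits column by column, carry when a column reaches ten) fits in a few lines. This is a minimal sketch of the schoolbook algorithm on decimal strings, with a made-up function name. It says nothing about how any particular LLM works internally.

```python
def add_with_carries(a: str, b: str) -> str:
    """Schoolbook addition of two non-negative decimal integers given as strings."""
    for text in (a, b):
        if not (text.isascii() and text.isdigit()):
            raise ValueError("inputs must be non-empty strings of digits 0-9")

    # Pad the shorter number with leading zeros so the columns line up.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    carry = 0
    digits = []
    # Work right to left, one column at a time.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))  # digit written in this column
        carry = total // 10             # carried into the next column
    if carry:
        digits.append(str(carry))

    # Digits were collected least significant first; strip padding zeros.
    return "".join(reversed(digits)).lstrip("0") or "0"


print(add_with_carries("999", "1"))
```

Following the rule is right for every input, including ones nobody has written down before. It is cheap because nothing is being estimated.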

As an educator, I deeply disrespect any entity that becomes exceedingly good at guessing, at great expense, where a limited amount of effort to understanding a rule would yield a superior result of knowledge. I fundamentally disagree that gradient descent is a venture that humanity ought to expend resources upon. We should invest more time in building human understanding than burning resources to get machines to pretend that they understand.

Anyway, it's nice out. I think I'll go on a bike ride.

-Ian Hogan, PhD

Cyber AI/ML Research Scientist (Tech-bro)

Wednesday, December 27, 2023

Waxing Philosophic and Dating, the Sequel

Dear Blog,

There is a perspective making phenomenon, or meme, or trend. It is a common response to people potentially being overly invested in some argument or grudge or inconvenience. So the perspective making goes, in the scheme of things, nations rising and falling, species evolving and going extinct, the universe expanding and galaxies of billions of years fading to darkness -- your hurts, your anger, your quibbles are so small. So, sure, but why is that the correct vantage? From the frame of reference of you, yourself, a human, an animal, a thinking social being interacting for finite time with other thinking, social beings, should we be predominantly preoccupied with the events at the scale of our existence? Our work, our relationships, our commutes and food prep. If your meal is burned, that's at scale of your life, in the moment, the most important thing in the whole universe to you. Don't make insignificant what is real to you, now. Exist, feel things as they are.

Since I had this one up in draft for some time, it came across, the above view, in conversation over Christmas dinner with an older boomer. She essentially rejected it out of hand, recounting a story of losing all of her teeth and telling herself to quit bitching because her next door neighbor had no legs. It occurs to me that this is how people of any amount of privilege end up denying the validity of their own trauma (big T or little t as it may be). Once you deny your own trauma, that's when it stays there and festers. This affluent woman lost all of her teeth. That's horrifying. No one wants to exist with only dentures, have to worry over every bite, worry about cleaning them, worry about their appearance. And though she may persevere for her remaining years, it's even more likely, I think, that she will have hidden horror eating her from the inside about it, because her neighbor has no legs. Well, that's horrifying too, and as I've said many times, the existence of Jupiter does not mean that the earth is not large.

(Purportedly unrelated)

I'm dating again. Or rather, I have been on the apps, a select handful of them, making steady but slow progress in accumulating the often silent 'no thanks' of mid-late thirties women in the south west of Ohio. I had adjusted my expectations significantly from the last go around, in 2019 -- I prepared myself in much better ways, ensuring adequate self-care abilities, social network foundation for support in and out of possible relationships and ends thereof. I made a list of things to do before and crossed all the items off of it.

(It goes well, only from a vantage that is designed for it to look well. If you look at a broken table deeply, with thought to its art, its being, its history, what it can uniquely state in space-time, touch it, feel its smoothness, you can be in awe. But you still need a new table.)

The reality is that I carry limited appeal. I'm decent looking and a great singer, but I'm intense and complicated -- I can't hide it and I don't really want to anyway. I have a kid and won't facilitate more, and honestly that's the biggest barrier. It seems less a barrier now that I'm a little older, more of the target demographic of het femme cis gendered folks are ok with the prospect of not having kids or more kids now than 4+ years ago. More - not all; if I weren't sterile, I'd still have a bigger pool to draw from.

The reality of the region is that it's a predominantly pretty boring area. People watch sporting events and drink beer and chat about idle bull en masse -- this is not inherently bad, but to me it's fundamentally uninteresting. The out-there artists and professors and musicians are thin on the ground here, unlike a bigger city or a coastal town. And just as the region probably wouldn't really care to hear my opinions of their flaccid appeal, so too am I tiring of being presented with daily evidence that I'm not a hot item on market.

(Anywhat)

My life is pretty much prime. I'm saving money, looking to buy a new car next year, I eat whatever I want, work flexible hours. I'm making music for my community and they are enjoying it, being fulfilled by it - this is like approximately nothing I've ever experienced before. I have so much luxury, time to drive off to woods in five different directions for a hike over lunch, sometimes with my kid, and she rambles about her BS and we look at the mushrooms and bent pieces of wood and explore together. I'm sober, damn near 9 months, and it's the best way I've been in my whole life.

But dating sucks. It's kind of unhealthy, and I'm having trouble balancing it out, turning it off when I should be doing other things. Just because my life is amazing, doesn't mean that being rejected quietly for long periods isn't its own kind of hell. I acknowledge this little hell, and I place it now on the outside, so that it doesn't eat me from the inside.

(In other news)

I've been fostering a cat. He belongs to a church and chorus member who had a stroke and had to go to hospital and rehab for nearly 3 weeks. I am considering fostering cats rather than procuring my own. I'm also considering procuring my own.

The church voted to list the campus for sale. I get to be the treasurer during the congregation's greatest financial struggle. This is also its own precious little hell.

People are dying in wars abroad. Lots of them.

2023 was the hottest year on record.

The sun is shining, therefore, it's possible to have some hope. I prayed today, and I'll pray again. Not for peace, just for the strength to carry on myself. Peace will only happen if we do something to make it happen, to heal the wounds and set aside our pride and our hurts, as a species.

Your Obedient,

Ian Hogan, PhD

Wednesday, September 20, 2023

Deja Vu (Data De-duplication)

Dear Blog,

"So when you gonna tell her

That we did that, too?

She thinks it's special

But it's all reused"

- Olivia Rodrigo, Deja Vu

YouTube link

Spotify link

I send a lot of songs to whatever girlfriend I have at the moment when I want to share that song. More than one (ex-) girlfriend has expressed that they don't want me sending them songs that I've sent to previous girlfriends.

Repeats? Surely not. I would never re-use a song...

Guilty. As. Charged.

It gets worse, from Olivia, "I bet she knows Billy Joel..."

Lullaby (Goodnight My Angel) - Billy Joel. I even sang that song with my barbershop quartet to GF3.

Dang it, I want to send this song, and I don't remember if I sent it to someone before. How can I check? I did not have the above screen shot playlists ready when I asked this question of myself. All I had was the chat histories themselves, some of which number in the tens of thousands of words. There's no way I'm going to scroll through all that to comb out every song. Luckily, I'm a full stack web developer and a data analyst. Comparing log data and identifying duplicate references is actually a part of an ongoing research effort at work!

Let's break down the One Soul Per Song project. I exported all my chat histories, mostly on disparate systems. I have three versions of the following method, one for FB Chat export, one for the old Google Hangouts export, and one for Google Chat export.

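def extractURIfromGoogleMessages(chatJson, verbose=False):
    # Collect every whitespace-separated token from every message that has text.
    bagOfWords = []
    for item in chatJson['messages']:
        if item.get('text'):
            # print(item['text'])  # debug: uncomment to echo every raw message
            # split() with no argument also breaks on newlines and tabs,
            # so a link at the end of a line is not glued to the next word
            bagOfWords.extend(item['text'].split())

    # Keep only tokens that look like links to a music platform.
    # uri_validator is a helper defined elsewhere (not shown here).
    allURI = []
    for word in bagOfWords:
        if uri_validator(word):
            # throw out news articles and whatnot
            if "yout" in word or "sound" in word or "spotif" in word:
                allURI.append(word)

    if verbose:
        print("All urls")
        for url in allURI:
            print(url)

    return allURI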
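
The uri_validator helper is not shown in this post. A minimal sketch of what such a helper could look like, assuming all it needs to do is accept absolute http(s) links and reject ordinary words (the real one may well differ):

```python
from urllib.parse import urlparse


def uri_validator(word):
    """Return True if `word` parses as an absolute http(s) URL (assumed behavior)."""
    try:
        parsed = urlparse(word)
    except ValueError:
        # urlparse raises ValueError on some malformed input, e.g. an unclosed IPv6 bracket
        return False
    # Require both a web scheme and a host, so plain words like "youtube" are rejected
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

With this version, a token such as 'https://www.youtube.com/watch?v=4spkVX8z-vs' passes, while a bare word like 'youtube' does not.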

Hangouts required some poking around to figure out which key belonged to which person. That exploration was enlightening in itself. Finding my brother among the chats was instant. Separating one girlfriend from another took a fair amount of digging.

Once I had run the appropriate method on each particular (ex-) girlfriend history, I needed a simple method for checking for repeats:

def checkDuplicates(list1, list2):
    # simple and easy to understand. 
    duplicates = []
    for item in list1:
        if item in list2:
            duplicates.append(item)
    return duplicates
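
At a few hundred links per list the loop above is plenty fast. For much longer lists, one possible variant (a sketch, with the hypothetical name check_duplicates_fast) turns the second list into a set, so each membership test no longer scans the whole list. It is meant to return the same result as checkDuplicates, in the same order:

```python
def check_duplicates_fast(list1, list2):
    """Same result as checkDuplicates, using a set for the membership test."""
    seen_in_list2 = set(list2)
    # Keep list1's order; an item repeated in list1 is reported each time it appears
    return [item for item in list1 if item in seen_in_list2]
```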

Put it all together:

# testing
import json

if __name__=="__main__":
    # Read in data. Each export format has its own extractor:
    # extractURLsHangoutsJson and extractUrlsFromFbHTML are the Hangouts
    # and FB Chat versions of the method above.
    with open('Hangouts.json', encoding="utf8") as f:
        data = json.load(f)
    gf2 = extractURLsHangoutsJson(data)
    print(len(gf2))

    with open("message.html", "r", encoding="utf8") as h:
        html = h.read()
    gf1 = extractUrlsFromFbHTML(html)
    print(len(gf1))

    with open('messages.json', encoding="utf-8") as g:
        chatJson = json.load(g)
    gf3 = extractURIfromGoogleMessages(chatJson)
    print(len(gf3))

    # Compare every pair of histories.
    print(checkDuplicates(gf1, gf2))
    print(checkDuplicates(gf2, gf3))
    print(checkDuplicates(gf1, gf3))

Output of the above:

122
220
263
['https://www.youtube.com/watch?v=4spkVX8z-vs',
 'https://www.youtube.com/watch?v=xwtdhWltSIg',
 'https://www.youtube.com/watch?v=S28-OgVDAek',
 'https://www.youtube.com/watch?v=naoBTy1Rh0I',
 'https://www.youtube.com/watch?v=87YL0bhqFSw',
 'https://www.youtube.com/watch?v=8-FUkhVtveU',
 'https://www.youtube.com/watch?v=aINFvGESX8I',
 'https://www.youtube.com/watch?v=lBUUOJpFg9Y',
 'https://www.youtube.com/watch?v=uLVFptybalY',
 'https://www.youtube.com/watch?v=NYoTgxOQjCg']
['https://www.youtube.com/watch?v=HKlHABc8HTE']
['https://www.youtube.com/watch?v=aKJIhZh_L-s',
 'https://www.youtube.com/watch?v=wm98afryPf4']

Hey, not bad! 13 repeats out of 605 songs is a pretty low reuse factor! Unfortunately, there are multiple platforms, and multiple versions of each song. So just because two links aren't identical, that doesn't mean they aren't the same song. We need to generate playlists and perform exploratory data analytics. For that consider this online tool.

I did not use this tool myself. I didn't like that the resulting playlists were anonymously owned, so I could not modify them or set privacy levels. I looked at scripts to roll my own, but I ran into the API's 50-song cap really fast, and there were so many non-song links, dead or removed songs, and other issues that I manually constructed all of the playlists from the raw outputs of the above methods (with verbose=True). The manual process probably amounted to about 8 hours of work, but I wanted the playlists, and I wanted them to be correct.
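
For anyone following along in Python, one partial shortcut for the "different link, same video" problem is to compare YouTube video IDs instead of raw URLs. This is a minimal sketch, not part of the process described here: the function names are hypothetical, and it assumes links come in only the youtu.be/<id> and youtube.com/watch?v=<id> shapes. It will not catch two different uploads of the same song, or the same song on Spotify, which is why hand-built playlists were still the reliable route.

```python
from urllib.parse import urlparse, parse_qs


def youtube_video_id(url):
    """Return the video ID for common YouTube link shapes, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard links carry the ID in the v= query parameter
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


def shared_video_ids(list1, list2):
    """Return the set of YouTube video IDs that appear in both lists of links."""
    ids1 = {youtube_video_id(u) for u in list1} - {None}
    ids2 = {youtube_video_id(u) for u in list2} - {None}
    return ids1 & ids2
```

Calling shared_video_ids(gf1, gf2) would then treat a short youtu.be link and a full watch link to the same video as one repeat.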

I don't have clean statistics for you on duplicates. I'm not getting paid, so rough numbers are what you're going to get. There are about 25. Which given 600+ songs, really isn't that bad. I have honestly taken the request to heart, and done my best to only find new songs for familiar feelings. Some of the repeats were actually sent by the girlfriend at the time, which is out of my control entirely (including Cosmic Love, the comedic triple from the head of the post).

Some of the repeats surprised me: "I thought that was a GF2 song..." Is there a re-apportionment process for this? I would like to de-allocate soul one from song 89...

I'm extremely glad that I put in the work. The playlists are long, varied, delicious, and carry an incredible history of feelings, relationship, and more. Listening to them (an ongoing process) is healing old wounds.

A technical remark: The push for migration from Hangouts to Google Chat became abundantly clear after looking at the backend data storage. Hangouts data storage was workable, but clunky, and probably really slow at scale, given how nested it was. It was a nightmare to parse. Google Chat is minimally nested, clear, and can be debugged and diagnosed at a glance. It's probably a lot faster at scale.

(Anywhat)

Now I can send that new song to some future girlfriend, and she can be assured that it's not Deja Vu. And I've shared enough code snippets that if you know a touch of Python, you can too.

Your Obedient,

-Ian Hogan, PhD

Sunday, September 17, 2023

Rocket 88

Dear Blog,

Today I finished a years-long endeavor to cycle through all 88 counties of Ohio. I woke up in Noble County, in a tent in the rain. I rode back to my car, and drove to Coshocton for the final lap. Stats for the weekend: 5 counties, 100 miles, 7:35 riding time, top speed 71.83 kph.

If I was British, I would probably say, "I'm quite pleased with that." But I'm American so I'm going to say

BITCH, I'm a fucking LEGEND. Come at me bro! What you got, 10000 kilometers of stop signs, hairpin turns, 11 percent grades, gravel and pot-holes, assholes in trucks, bad signage? I'll grind over all your washed out coal towns, your downed trees, your 1500 sparkling creeks, past 25000 cornfields, 32000 sycamores, 35000000 ticks and mosquitos. I'll eat your headwinds for a snack without slowing down. Fuck your unleashed dogs.

In terms of cycling counties, this has been my best year to date. I biked in 20 new counties, and repeated an additional 11 (Cuyahoga, Erie, Huron, Lorain, Miami, Clark, Montgomery, Greene, Warren, Clinton, and Preble).

How long did the whole project take? Well, in years, I started when I was eight years old, so 29 years. If we instead count the shortest window in which I've ridden all 88, including repeats, then I've biked all 88 in the last 10 years. In hours? I didn't track, but several hundred. A typical county would take 2-6 hours to cycle across, so at an average of 4 hours, 4*88=352. Several of them I did much faster.

So I'll take some time now, and revel in a win. But also, the only way this works, is if the journey is the win. So, I've been winning this whole time. It's a beautiful state. Every county has something unique to itself. I never could have guessed at so many of the things I've seen. Every view, every picture of sparkling streams, every surprising weasel or musk-rat or blue-fish, every warm breeze and relief of a cloud on a hot day. It's been so fun, so pleasant, so calming, and I would do it again, and again.

And also, I am what I am. I'm forever seeking the next goal, so, Indiana, here I come.

I have biked around Indianapolis two different times, with my cousin Kathryn each time. So much white to paint-bucket fill. I'm excited.

Very excited.

Let's go,

Ian Hogan, PhD

Monday, September 4, 2023

Elf Mix 2

Dear Blog,

I made a compilation CD many years ago, when I was approximately 19 years old. The only thing written on the disc is the title Elf Mix 2, in scrawling red marker. I've recreated the mix as Spotify and YouTube playlists, depending on the reader's preference. Also, here is the song list:

Miserlou - Dick Dale

Fire - Jimi Hendrix

Mr. Brightside - The Killers

Drain You - Nirvana

Say It Ain't So - Weezer

The Man Who Sold The World - David Bowie (Nirvana Unplugged Version)

Have You Ever - The Offspring

Jeremy - Pearl Jam

Within You Without You - The Beatles

Romeo's Seance - Elvis Costello and the Brodsky Quartet (from The Juliet Letters)

Black Angel's Death Song - The Velvet Underground

I Fought Piranhas - The White Stripes

I Can't Quit You Baby - Led Zeppelin

Dazed and Confused - Led Zeppelin

Warmth Of The Sun - The Beach Boys

Dueling Banjos - Eric Weissberg (famously heard in Deliverance)

I have two observations. First, I'm quite a consumer of music. I listen to multiple genres of music every day, usually several hours per day, ranging from baroque, electronic, bluegrass, classical guitar, traditional and modern folk, a cappella jazz, barbershop, indie pop, hard rock, blues, jazz piano/trio, musical theater and some others. I have been heard to say that some music is 'good' and other music 'not great.' I wouldn't identify as a music snob, but I'm sure at least a few people have considered me one.

Second, I've noticed that many people my age and older, Millennials and Generation X, are often very self-conscious of their young adult and late adolescent selves. They hide their journals, their drawings, their love letters, pictures of their hair.

Somewhere between these two observations, one might expect me to have great distaste for my late teenage mix tape, to poo-poo it as adolescent pop punk bullshit. But no, it's amazing. Every song on it is fantastic. I would say its only flaw is that it's almost all Up tunes -- that is, I didn't take it down a notch until the second-to-last song, except perhaps for Within You Without You at track 9. I can give myself grace for having missed that subtle and important point in making a good mix tape, on my second ever.

Teenagers know quite a lot. They feel quite a lot. It's all real, and their ability to express it is truly something to be respected. As we age, we ought to pay more attention to the youth, especially if it is our own teenage selves. Everyone, love your youthful constructs. Read the journal, show the drawings to friends. Listen to your old crush songs, and old breakup songs, and old dance-like-your-hip-doesn't-hurt-all-the-time-yet songs.

Your Obedient,

-Ian Hogan, PhD
