Wednesday, September 20, 2023

Deja Vu (Data De-duplication)

Dear Blog,

"So when you gonna tell her
That we did that, too?
She thinks it's special
But it's all reused"
- Olivia Rodrigo, Deja Vu

I send a lot of songs to whatever girlfriend I have at the moment when I want to share that song. More than one (ex-) girlfriend has expressed that they don't want me sending them songs that I've sent to previous girlfriends. 

Repeats? Surely not. I would never re-use a song...




Guilty. As. Charged.

It gets worse, from Olivia, "I bet she knows Billy Joel..."




Lullaby (Goodnight My Angel) - Billy Joel. I even sang that song with my barbershop quartet to GF3. 

Dang it, I want to send this song, and I don't remember if I sent it to someone before. How can I check? I did not have the above screen shot playlists ready when I asked this question of myself. All I had was the chat histories themselves, some of which number in the tens of thousands of words. There's no way I'm going to scroll through all that to comb out every song. Luckily, I'm a full stack web developer and a data analyst. Comparing log data and identifying duplicate references is actually a part of an ongoing research effort at work!


Let's break down the One Soul Per Song project. I exported all my chat histories, mostly on disparate systems. I have three versions of the following method, one for FB Chat export, one for the old Google Hangouts export, and one for Google Chat export. 

def extractURIfromGoogleMessages(chatJson, verbose=False):
    bagOfWords = []
    for item in chatJson['messages']:
        if item.get('text'):
            print(item['text'])
            bagOfWords.extend(item['text'].split(' '))

    allURI = []
    for word in bagOfWords:
        if uri_validator(word):
        # throw out news articles and whatnot
            if "yout" in word or "sound" in word or "spotif" in word:
                allURI.append(word)

    if verbose:
        print("All urls")
        for url in allURI:
            print(url)
    
    return allURI

Hangouts required some poking around to figure out which key belonged to which person. That exploration was enlightening in itself. Finding my brother among the chats was instant. Separating one girlfriend from another took a fair amount of digging.

Once I ran the appropriate method on each particular (ex-) girlfriend history, I need a simple method for checking for repeats:

def checkDuplicates(list1, list2):
    # simple and easy to understand. 
    duplicates = []
    for item in list1:
        if item in list2:
            duplicates.append(item)
    return duplicates
 
Put it all together:

# testing
if __name__=="__main__":
    # read in data. 
    f = open('Hangouts.json', encoding="utf8") 
    data = json.load(f)
    gf2 = extractURLsHangoutsJson(data)
    print(len(gf2))

    html = open("message.html", "r", encoding="utf8").read()
    gf1 = extractUrlsFromFbHTML(html)
    print(len(gf1))

    g = open('messages.json', encoding ="utf-8")
    chatJson = json.load(g)
    gf3 = extractURIfromGoogleMessages(chatJson)
    print(len(gf3))

    print(checkDuplicates(gf1, gf2))
    print(checkDuplicates(gf2, gf3))
    print(checkDuplicates(gf1, gf3))

Output of the above:

122
220
263
['https://www.youtube.com/watch?v=4spkVX8z-vs',
'https://www.youtube.com/watch?v=xwtdhWltSIg',  
'https://www.youtube.com/watch?v=S28-OgVDAek', 
'https://www.youtube.com/watch?v=naoBTy1Rh0I',  
'https://www.youtube.com/watch?v=87YL0bhqFSw', 
'https://www.youtube.com/watch?v=8-FUkhVtveU', 
'https://www.youtube.com/watch?v=aINFvGESX8I', 
'https://www.youtube.com/watch?v=lBUUOJpFg9Y', 
'https://www.youtube.com/watch?v=uLVFptybalY', 
'https://www.youtube.com/watch?v=NYoTgxOQjCg']
['https://www.youtube.com/watch?v=HKlHABc8HTE']
['https://www.youtube.com/watch?v=aKJIhZh_L-s', 
'https://www.youtube.com/watch?v=wm98afryPf4']

Hey, not bad! 13 repeats out of 605 songs is a pretty low reuse factor! Unfortunately, there are multiple platforms, and multiple versions of each song. So just because two links aren't identical, that doesn't mean they aren't the same song. We need to generate playlists and perform exploratory data analytics. For that consider this online tool

I did not use this tool myself. I didn't like that the resulting playlist was anonymously owned, so I could not modify them or set privacy levels. I looked at scripts to roll my own, but also, for me, I ran into the API's 50 song cap really fast, and there were so many non-song links, dead or removed songs, and other issues, that I manually constructed all of the playlists from the raw outputs of the above methods (with verbose=True). The manual process probably amounted to about 8 hours of work, but I wanted the playlists, and I wanted them to be correct. 

I don't have clean statistics for you on duplicates. I'm not getting paid, so rough numbers are what you're going to get. There are about 25. Which given 600+ songs, really isn't that bad. I have honestly taken the request to heart, and done my best to only find new songs for familiar feelings. Some of the repeats were actually sent by the girlfriend at the time, which is out of my control entirely (including Cosmic Love, the comedic triple from the head of the post). 

Some of the repeats I found I was surprised to see, "I thought that was a GF2 song..."  is there a re-apportionment process for this? I would like to de-allocate soul one from song 89...

I'm extremely glad that I put in the work. The playlists are long, varied, delicious, and carry an incredible history of feelings, relationship, and more. Listening to them (an ongoing process) is healing old wounds. 

A technical remark: The push for migration from Hangouts to Google Chat became abundantly clear after looking at the backend data storage. Hangouts data storage was workable, but clunky, and probably really slow at scale, given how nested it was. It was a nightmare to parse at a glance. Google chat is minimally nested, clear, and can be debugged and diagnosed at a glance. It's probably a lot faster at scale. 

(Anywhat)

Now I can send that new song to some future girlfriend, and she can be assured that it's not Deja Vu. And I've shared enough code snippets that if you know a touch of Python, you can too.

Your Obedient, 
-Ian Hogan, PhD

Sunday, September 17, 2023

Rocket 88

Dear Blog,

Today I finished a years long endeavor to cycle through all 88 counties of Ohio. I woke up in Noble county, in a tent in the rain. I rode back to my car, and drove to Coshocton for the final lap. Stats for the weekend, 5 counties, 100 miles, 7:35 riding time, top speed 71.83kph. 


If I was British, I would probably say, "I'm quite pleased with that." But I'm American so I'm going to say

BITCH, I'm a fucking LEGEND. Come at me bro! What you got, 10000 kilometers of stop signs, hairpin turns, 11 percent grades, gravel and pot-holes, assholes in trucks, bad signage? I'll grind over all your washed out coal towns, your downed trees, your 1500 sparkling creeks, past 25000 cornfields, 32000 sycamores, 35000000 ticks and mosquitos. I'll eat your headwinds for a snack without slowing down. Fuck your unleashed dogs.

In terms of cycling counties, this has been my best year to date. I biked in 20 new counties, and repeated an additional 11 (Cuyahoga, Erie, Huron, Lorain, Miami, Clark, Montgomery, Greene, Warren, Clinton, and Preble). 

How long did the whole project take? Well, in years, I started when I was eight years old, so 29 years. If we count the shortest time since I've done all 88 including repeats, then I've biked all 88 in the last 10 years. In hours? I didn't track, but several hundred. A typical county would take 2-6 hours to cycle across, so 4*88=352. Several of them I did much faster. 

So I'll take some time now, and revel in a win. But also, the only way this works, is if the journey is the win. So, I've been winning this whole time. It's a beautiful state. Every county has something unique to itself. I never could have guessed at so many of the things I've seen. Every view, every picture of sparkling streams, every surprising weasel or musk-rat or blue-fish, every warm breeze and relief of a cloud on a hot day.  It's been so fun, so pleasant, so calming, and I would do it again, and again.  

And also, I am what I am. I'm forever seeking the next goal, so, Indiana, here I come. 


I have biked around Indianapolis two different times, with my cousin Kathryn each time. So much white to paint-bucket fill. I'm excited. 

Very excited. 

Let's go,

Ian Hogan, PhD

Monday, September 4, 2023

Elf Mix 2

 Dear Blog,

I made a compilation CD many years ago, when I was approximately 19 years old. The only thing written on the disk is the title Elf Mix 2, in scrawling red marker. I've recreated the mix as Spotify and YouTube playlists, depending the reader's preference. Also, here is the song list: 

Miserlou - Dick Dale
Fire - Jimi Hendrix
Mr. Brightside - The Killers
Drain You - Nirvana
Say It Ain't So - Weezer
The Man Who Sold The World - David Bowie (Nirvana Unplugged Version)
Have You Ever - The Offspring
Jeremy - Pearl Jam
Within You Without You - The Beatles
Romeo's Seance - The Juliette Letters
Black Angel's Death Song - The Velvet Underground
I Fought Piranhas - The White Stripes
I Can't Quit You Baby - Led Zeppelin
Daze and Confused - Led Zeppelin
Warmth Of The Sun - The Beach Boys
Dueling Banjos - Eric Weissberg (famously heard in Deliverance)

I have two observations. First, I'm quite a consumer of music. I listen to multiple genres of music every day, usually several hours per day, ranging from baroque, electronic, bluegrass, classical guitar, traditional and modern folk, acapella jazz, barbershop, indie pop, hard rock, blues, jazz piano/trio, musical theater and some others. I have been heard to say that some music is 'good' and other music 'not great.' I wouldn't identify as a music snob, but I'm sure at least a few people have considered me one. 

Also, I've noticed that many people my age and older, Millennials and Generation X, they are often very self-conscious of their young adult and late adolescent selves. They hide their journals, their drawings, their love letters, pictures of their hair. 

Somewhere between these two observations, one might expect me to have great distaste for my late teenage mix tape, to poo-poo it as adolescent pop punk bullshit. But no, it's amazing. Every song on it is fantastic. I would say its only flaw is that it's almost all Up tunes -- that is, I didn't take it down a notch until the second to last song, perhaps except for Within You Without You at track 9. I can give myself grace to have missed that subtle and important point in making a good mix-tape, on my second ever. 

Teenagers know quite a lot. They feel quite a lot. It's all real, and their ability to express it is truly something to be respected. As we age, we ought pay more attention to the youth, especially if it is our own teenage selves. Everyone, love your youthful constructs. Read the journal, show the drawings to friends. Listen to your old crush songs, and old breakup songs, and old dance like your hip doesn't hurt all the time yet songs. 

Your Obedient,
-Ian Hogan, PhD