Wednesday, September 20, 2023

Deja Vu (Data De-duplication)

Dear Blog,

"So when you gonna tell her
That we did that, too?
She thinks it's special
But it's all reused"
- Olivia Rodrigo, Deja Vu

I send a lot of songs to whichever girlfriend I have at the moment. More than one (ex-)girlfriend has said that she doesn't want me sending her songs that I've already sent to previous girlfriends.

Repeats? Surely not. I would never re-use a song...




Guilty. As. Charged.

It gets worse, from Olivia, "I bet she knows Billy Joel..."




Lullaby (Goodnight My Angel) - Billy Joel. I even sang that song with my barbershop quartet to GF3. 

Dang it, I want to send this song, and I don't remember if I sent it to someone before. How can I check? I did not have the playlists screenshotted above ready when I asked myself this question. All I had was the chat histories themselves, some of which run to tens of thousands of words. There's no way I'm going to scroll through all that to comb out every song. Luckily, I'm a full stack web developer and a data analyst. Comparing log data and identifying duplicate references is actually part of an ongoing research effort at work!


Let's break down the One Soul Per Song project. I exported all my chat histories, which mostly live on disparate systems. I have three versions of the following method: one for the FB Chat export, one for the old Google Hangouts export, and one for the Google Chat export.

def extractURIfromGoogleMessages(chatJson, verbose=False):
    # Collect every space-separated token from every message that has text.
    bagOfWords = []
    for item in chatJson['messages']:
        if item.get('text'):
            bagOfWords.extend(item['text'].split(' '))

    # Keep only tokens that pass uri_validator, a helper that is not part
    # of this snippet and accepts only well-formed URIs.
    allURI = []
    for word in bagOfWords:
        if uri_validator(word):
            # throw out news articles and whatnot
            if "yout" in word or "sound" in word or "spotif" in word:
                allURI.append(word)

    if verbose:
        print("All urls")
        for url in allURI:
            print(url)

    return allURI
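
The uri_validator helper is called above but never shown in this post. Here is a minimal sketch of what such a helper might look like, built on the standard library's urlparse. The name matches the call above, but the body is an assumption, and so is the choice to accept only http and https links.

```python
from urllib.parse import urlparse


def uri_validator(word):
    """Return True if `word` looks like an absolute http(s) URL.

    Assumed implementation: the original helper is not shown in the post.
    """
    try:
        parts = urlparse(word)
    except ValueError:
        # urlparse raises on some malformed input, e.g. an unclosed "[" in the host
        return False
    # A usable link needs both a web scheme and a host name.
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

One caveat with the extractor itself: splitting on a single space leaves a link glued to whatever follows it across a newline. Calling split() with no arguments would break on any whitespace instead.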

Hangouts required some poking around to figure out which key belonged to which person. That exploration was enlightening in itself. Finding my brother among the chats was instant. Separating one girlfriend from another took a fair amount of digging.

Once I had run the appropriate method on each particular (ex-)girlfriend's history, I needed a simple method for checking for repeats:

def checkDuplicates(list1, list2):
    # simple and easy to understand. 
    duplicates = []
    for item in list1:
        if item in list2:
            duplicates.append(item)
    return duplicates
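
With more than a couple of histories, the comparisons multiply quickly. As a sketch only (the names shared_links and pairwise_duplicates are hypothetical, not from my original script), the same check can be run over every pair of named histories, using a set for the lookups. At a few hundred links per history the speed difference is academic; the point is mostly to avoid writing out each pair by hand.

```python
from itertools import combinations


def shared_links(list1, list2):
    """Return the items of list1 that also appear in list2, in list1's order."""
    lookup = set(list2)  # set membership is O(1) on average
    return [item for item in list1 if item in lookup]


def pairwise_duplicates(histories):
    """Map each pair of history names to the links the two histories share.

    `histories` is a dict such as {"gf1": [...], "gf2": [...]} (hypothetical shape).
    """
    return {
        (name_a, name_b): shared_links(histories[name_a], histories[name_b])
        for name_a, name_b in combinations(histories, 2)
    }
```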
 
Put it all together:

# testing
import json

if __name__=="__main__":
    # read in data.
    with open('Hangouts.json', encoding="utf8") as f:
        data = json.load(f)
    gf2 = extractURLsHangoutsJson(data)
    print(len(gf2))

    with open("message.html", "r", encoding="utf8") as h:
        html = h.read()
    gf1 = extractUrlsFromFbHTML(html)
    print(len(gf1))

    with open('messages.json', encoding="utf-8") as g:
        chatJson = json.load(g)
    gf3 = extractURIfromGoogleMessages(chatJson)
    print(len(gf3))

    # compare every pair of histories
    print(checkDuplicates(gf1, gf2))
    print(checkDuplicates(gf2, gf3))
    print(checkDuplicates(gf1, gf3))

Output of the above:

122
220
263
['https://www.youtube.com/watch?v=4spkVX8z-vs',
'https://www.youtube.com/watch?v=xwtdhWltSIg',  
'https://www.youtube.com/watch?v=S28-OgVDAek', 
'https://www.youtube.com/watch?v=naoBTy1Rh0I',  
'https://www.youtube.com/watch?v=87YL0bhqFSw', 
'https://www.youtube.com/watch?v=8-FUkhVtveU', 
'https://www.youtube.com/watch?v=aINFvGESX8I', 
'https://www.youtube.com/watch?v=lBUUOJpFg9Y', 
'https://www.youtube.com/watch?v=uLVFptybalY', 
'https://www.youtube.com/watch?v=NYoTgxOQjCg']
['https://www.youtube.com/watch?v=HKlHABc8HTE']
['https://www.youtube.com/watch?v=aKJIhZh_L-s', 
'https://www.youtube.com/watch?v=wm98afryPf4']

Hey, not bad! 13 repeats out of 605 songs is a pretty low reuse factor! Unfortunately, there are multiple platforms, and multiple versions of each song, so two links that aren't identical can still be the same song. We need to generate playlists and perform exploratory data analytics. For that, consider this online tool.

I did not use this tool myself. I didn't like that the resulting playlists were anonymously owned, so I could not modify them or set privacy levels. I looked at scripts to roll my own, but I ran into the API's 50 song cap really fast, and there were so many non-song links, dead or removed songs, and other issues that I manually constructed all of the playlists from the raw outputs of the above methods (with verbose=True). The manual process probably amounted to about 8 hours of work, but I wanted the playlists, and I wanted them to be correct.
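
One slice of the "different link, same song" problem could be automated before any manual work: the same YouTube video shows up under several URL shapes (youtube.com/watch?v=..., youtu.be/..., m.youtube.com). The sketch below reduces those to a single key before comparing. The function names are hypothetical and I did not run this on my own data. It only catches the same video under different URLs; different uploads, covers, or a Spotify link versus a YouTube link would still need the manual pass.

```python
from urllib.parse import urlparse, parse_qs


def youtube_video_id(url):
    """Return the video ID for common YouTube URL shapes, or None."""
    try:
        parts = urlparse(url)
    except ValueError:
        return None  # malformed link, nothing to extract
    host = parts.netloc.lower()
    # Treat www. and mobile hosts the same as the bare domain.
    for prefix in ("www.", "m."):
        if host.startswith(prefix):
            host = host[len(prefix):]
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        return parts.path.strip("/").split("/")[0] or None
    if host in ("youtube.com", "music.youtube.com") and parts.path == "/watch":
        # Regular links carry the ID in the "v" query parameter.
        return parse_qs(parts.query).get("v", [None])[0]
    return None


def normalize_link(url):
    """Collapse YouTube links to one key; leave every other link untouched."""
    video_id = youtube_video_id(url)
    return f"youtube:{video_id}" if video_id else url
```

Mapping each history through normalize_link before checking for duplicates would let a youtu.be link and a full youtube.com link to the same video count as a repeat.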

I don't have clean statistics for you on duplicates. I'm not getting paid, so rough numbers are what you're going to get. There are about 25, which, given 600+ songs, really isn't that bad. I have honestly taken the request to heart, and done my best to find only new songs for familiar feelings. Some of the repeats were actually sent by the girlfriend at the time, which is out of my control entirely (including Cosmic Love, the comedic triple from the head of the post).

Some of the repeats surprised me: "I thought that was a GF2 song..." Is there a re-apportionment process for this? I would like to de-allocate soul one from song 89...

I'm extremely glad that I put in the work. The playlists are long, varied, delicious, and carry an incredible history of feelings, relationship, and more. Listening to them (an ongoing process) is healing old wounds. 

A technical remark: the reason for the push to migrate from Hangouts to Google Chat became abundantly clear once I looked at the backend data storage. The Hangouts format was workable but clunky, and probably really slow at scale, given how deeply nested it was. It was a nightmare to parse. Google Chat's format is minimally nested and clear, and can be debugged and diagnosed at a glance. It's probably a lot faster at scale.

(Anywhat)

Now I can send that new song to some future girlfriend, and she can be assured that it's not Deja Vu. And I've shared enough code snippets that if you know a touch of Python, you can too.

Your Obedient, 
-Ian Hogan, PhD

1 comment:

ElvisMansonCPA said...

F that. I have a finite number of magic songs, the ones that define my existence, that embody emotions and feelings as individual as a fingerprint. No partner gets dibs in perpetuity to one song. You break up with me, you release all ownership rights.