Repeats? Surely not. I would never re-use a song...
Let's break down the One Soul Per Song project. I exported all my chat histories, which mostly live on disparate systems. I have three versions of the following function: one for the FB Chat export, one for the old Google Hangouts export, and one for the Google Chat export.
def extractURIfromGoogleMessages(chatJson, verbose=False):
    # collect every whitespace-separated token from every message that has text
    bagOfWords = []
    for item in chatJson['messages']:
        if item.get('text'):
            if verbose:
                print(item['text'])
            # split() with no argument also splits on newlines and tabs
            bagOfWords.extend(item['text'].split())

    allURI = []
    for word in bagOfWords:
        if uri_validator(word):
            # throw out news articles and whatnot
            if "yout" in word or "sound" in word or "spotif" in word:
                allURI.append(word)

    if verbose:
        print("All urls")
        for url in allURI:
            print(url)
    return allURI
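The function above leans on a uri_validator helper that isn't shown in this post. A minimal sketch of what such a helper could look like, assuming all it needs to do is confirm that a token parses as a URL with a scheme and a host (this is a guess at its behavior, not the original implementation):

```python
from urllib.parse import urlparse


def uri_validator(word):
    """Return True if `word` parses as a URL with both a scheme and a host.

    Assumed behavior: the real helper is not shown in the post.
    """
    try:
        parsed = urlparse(word)
    except ValueError:
        # urlparse raises on some malformed input, e.g. an unbalanced "[" in the host
        return False
    return bool(parsed.scheme) and bool(parsed.netloc)
```

Under this sketch a bare "youtube.com/watch" with no "https://" in front would be rejected, so links pasted without a scheme would slip through.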
Once I had run the appropriate method on each particular (ex-)girlfriend's history, I needed a simple method for checking for repeats:
def checkDuplicates(list1, list2):
    # simple and easy to understand.
    duplicates = []
    for item in list1:
        if item in list2:
            duplicates.append(item)
    return duplicates
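For long chat histories, the same check can be done with a set lookup instead of scanning list2 once per item. A small sketch that should return the same result as checkDuplicates (check_duplicates_fast is a made-up name for illustration):

```python
def check_duplicates_fast(list1, list2):
    """Return the items of list1 that also appear in list2, in list1's order.

    Hypothetical variant of checkDuplicates: same output, but each membership
    test is a set lookup rather than a scan of list2.
    """
    seen = set(list2)
    return [item for item in list1 if item in seen]
```

Either version compares the URL strings exactly, so the same song shared once as a youtu.be link and once as a youtube.com link would not count as a repeat.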
# testing
import json

if __name__ == "__main__":
    # read in data.
    with open('Hangouts.json', encoding="utf8") as f:
        data = json.load(f)
    gf2 = extractURLsHangoutsJson(data)
    print(len(gf2))

    with open("message.html", "r", encoding="utf8") as h:
        html = h.read()
    gf1 = extractUrlsFromFbHTML(html)
    print(len(gf1))

    with open('messages.json', encoding="utf-8") as g:
        chatJson = json.load(g)
    gf3 = extractURIfromGoogleMessages(chatJson)
    print(len(gf3))

    # compare every pair of histories
    print(checkDuplicates(gf1, gf2))
    print(checkDuplicates(gf2, gf3))
    print(checkDuplicates(gf1, gf3))
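The three comparisons at the end cover every pair of histories by hand. If more histories ever got added, the pairing could be generated instead. A sketch, assuming the URL lists are kept in a dict keyed by a label (all_pairwise_duplicates and the labels are hypothetical):

```python
from itertools import combinations


def all_pairwise_duplicates(histories):
    """Map each pair of labels to the URLs of the first that also appear in the second.

    `histories` is assumed to be a dict of label -> list of URLs.
    """
    results = {}
    for (name_a, urls_a), (name_b, urls_b) in combinations(histories.items(), 2):
        seen = set(urls_b)
        results[(name_a, name_b)] = [url for url in urls_a if url in seen]
    return results
```

Called as all_pairwise_duplicates({"gf1": gf1, "gf2": gf2, "gf3": gf3}), this would produce the same three comparisons as the prints above, labeled by pair.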