Let Them Sing it For You
As my more frequent posts on the subject suggest (including one just today about my Synesthesizer), I'm intrigued by the concept of automated music generation.
A few weeks ago, I stumbled upon this cool little web app (I forget where I found it), called 'Let Them Sing it For You'.
It's a Flash app that will try to form music from words. You type in words -- any set of words you might like -- and then click 'play'. LTSIFU will then search its database, cross-referencing the words you've typed with the songs in which those words appear.
It appears that LTSIFU searches word-by-word -- that is, it does not search for whole phrases, but instead occurrences of the individual word. And, if you type in a word that doesn't exist in the database yet, it will try to break that word down into smaller words or sounds that might exist. And, if that doesn't work, it'll insert a 'beep' into your track. If you know of an occurrence of a missing word, then you can send that information to the creators for inclusion.
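To make that lookup behavior concrete, here is a minimal sketch of how I imagine it working. This is a guess based on playing with the app, not LTSIFU's actual code: the clip table, the clip names, and the greedy splitting strategy are all assumptions made up for illustration.

```python
# Hypothetical clip index: word -> clip identifier.
# LTSIFU's real database is not public; these entries are invented.
CLIPS = {
    "let": "clip_let",
    "them": "clip_them",
    "sing": "clip_sing",
    "sun": "clip_sun",
    "shine": "clip_shine",
}

BEEP = "beep"


def split_into_known(word, clips):
    """Greedily split a word into the longest known pieces, left to right.

    Returns a list of pieces, or None if some part of the word cannot be
    covered. Greedy matching is an assumption; it can miss splits that a
    backtracking search would find.
    """
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in clips:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return None  # no known piece starts at this position
    return pieces


def sing(text, clips=CLIPS):
    """Turn text into a list of clip identifiers, one word at a time."""
    track = []
    for word in text.lower().split():
        if word in clips:
            track.append(clips[word])
            continue
        pieces = split_into_known(word, clips)
        if pieces:
            track.extend(clips[p] for p in pieces)
        else:
            track.append(BEEP)  # unknown word: insert a beep
    return track
```

With this toy table, `sing("Let them sing sunshine xyzzy")` finds the first three words directly, builds 'sunshine' out of 'sun' and 'shine', and beeps for 'xyzzy'.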
The result? It's a lot of fun to play with, but, well, let's just say that what they sing for you is something far short of music.
If LTSIFU searched for larger phrases (say of two to five words) then you'd have a bit more consistency -- and the result would sound better. But, of course, searching for entire phrases is a sort of cheat to create better music, given the rules of this system -- this system is assembling tunes from multiple songs, to create a new single tune, using individual words as the lookup keys.
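For what it's worth, the phrase-first lookup could be sketched roughly like this: try the longest phrase (up to five words) starting at the current position, and only fall back to shorter phrases and single words when nothing longer matches. The phrase table and clip names below are hypothetical, and this is just one way to do it, not a description of anything LTSIFU does.

```python
# Hypothetical index keyed by word tuples, so phrases and single words
# live in the same table. Entries are invented for illustration.
PHRASE_CLIPS = {
    ("the", "rain", "in", "spain"): "song_a:the rain in spain",
    ("the",): "song_b:the",
    ("rain",): "song_c:rain",
    ("falls",): "song_d:falls",
}


def assemble(text, clips=PHRASE_CLIPS, max_len=5):
    """Cover the input with the longest known phrases, left to right."""
    words = text.lower().split()
    track = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink down to one word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in clips:
                track.append(clips[key])
                i += n
                break
        else:
            track.append("beep")  # not even the single word is known
            i += 1
    return track
```

Here `assemble("The rain in Spain falls")` pulls the first four words from a single song and only then switches songs for 'falls', which is exactly the added consistency (and the cheat) described above.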
When you treat each word as an individual and completely distinct piece of data -- as LTSIFU does -- the words lose their context, and thus their meaning. As a result, you can't really hope to establish any pattern -- the information is random, and randomness defies patterns.
Take the word 'apple'. If I enter into LTSIFU 'an apple a day keeps the doctor away' or 'I love life in the big apple' -- should LTSIFU interpret 'apple' in the same way? Although they are both instances of the same word, they clearly mean different things as part of different messages. Or, in an example that perhaps provides more clarity, take the word 'bear' in 'that bear ate all the honey' or 'the right to bear arms'. In this case, not only is the overall message different, but because bear is a homonym, each instance actually means something different that would be impossible to intuit without the larger context of the sentence in which the word 'bear' appears.
Stated another way, it is impossible to create an accurate translation of a page of text from one language to another, if you analyze and translate each word individually. Instead, the translator must inspect the sentences and paragraphs, and the ideas and emotions that are created by those messages, and attempt to translate those into the other language. That's what creates a good translation.
So, for this LTSIFU translation to work more accurately, it must interpret the words more like humans do -- as part of a larger message. How? Well, there are numerous ways. Language is, of course, a huge construct -- English alone has thousands upon thousands of words -- and they are assembled into literally infinite numbers of larger patterns. How on earth do you begin?
Well, you begin any endeavor like this by looking for existing knowledge in the field. In this case, the field is linguistics. There are any number of rules from which you might begin, but let's take one as an example. It might interest you to know that, while there are many thousands of words in English (the Second Edition of the Oxford English Dictionary contains full entries for 171,476 words in current use, and 47,156 obsolete words), there is a very small subset of these words (fewer than 200), called non-content words, that form over 60% of the words on any given page of text: 'a' and 'the' and 'some' and 'many' and others.
Two things distinguish non-content words: first, how often they occur; and second, the fact that they have no tangible meaning on their own -- their meaning relies entirely on their context.
So these words occur a lot (both in real life and, one assumes, as input to LTSIFU), and they have no inherent meaning. One promising starting point, then, would be to look for patterns in where these non-content words occur. Take the word 'the' in the phrase 'the rain in Spain': perhaps you'd search for instances of the word 'the' that occur at the start of sentences; or instances of 'the' that occur within two words of a weather word (here, 'rain'), or within five words of a country name. These are just examples of checks that would establish more relevance in the way this translation engine interprets input.
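Those three checks on 'the' are simple enough to sketch in code. The word lists here are tiny stand-ins I made up (a real system would need proper lexicons), and the feature names are hypothetical; the point is only to show how a non-content word could pick up some context from its neighbors.

```python
# Tiny illustrative vocabularies. The non-content words are the ones named
# in this post; the weather and country lists are invented placeholders.
NON_CONTENT = {"a", "the", "some", "many"}
WEATHER = {"rain", "snow", "wind", "storm"}
COUNTRIES = {"spain", "france", "japan"}


def near(words, i, vocabulary, distance):
    """True if any word within `distance` positions of index i is in vocabulary."""
    lo = max(0, i - distance)
    hi = min(len(words), i + distance + 1)
    return any(words[j] in vocabulary for j in range(lo, hi) if j != i)


def context_features(sentence):
    """Describe the context of each non-content word in a sentence."""
    words = sentence.lower().split()
    features = []
    for i, word in enumerate(words):
        if word in NON_CONTENT:
            features.append({
                "word": word,
                "starts_sentence": i == 0,
                "near_weather": near(words, i, WEATHER, 2),
                "near_country": near(words, i, COUNTRIES, 5),
            })
    return features
```

For 'the rain in spain', the single 'the' comes back flagged as starting the sentence, sitting within two words of a weather word, and within five words of a country. A lookup could then prefer clips of 'the' that were sung in a similar context.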
That's part one -- interpreting the input.
Part two -- generating the output -- is a separate matter. Again, it seems to me that for the translation to be meaningful, it must consider patterns, both from the input and the output. In the input, I suggested seeking patterns from knowledge in the arena of linguistics. In this case, since the output is music, we should seek some rules from that arena on which to rely.
There are a few ways to approach this. As it stands, LTSIFU searches for instances of individual words in existing tracks, but just as it does not analyze the individual words in any broader context, it does not analyze the songs from which it draws the utterances in any broader context either. Yet, as explained above in one simple example, the word 'bear' can have very different meanings -- why should that word always lead to the same output, regardless of the larger message in which the word occurs?
Let's say, for example, that LTSIFU were to be able to cross-reference with a database similar to the Music Genome Project (which powers Pandora), which is establishing a genealogy of music by breaking down the individual 'DNA' strands that flow through the artists and work we enjoy, defining the connections between various works and artists and genres over time. Instead of picking from a catalog of all songs, LTSIFU could pick utterances from a subset of related songs.
Taking this a bit further, LTSIFU could pick the specific musical genre based on the overall message -- messages with depressing words (like, well, 'depression') might lead LTSIFU to select from blues works, while messages with vibrant words (such as 'explosive') might lead LTSIFU to select from genres with faster tempos (higher BPMs). Again, these are just examples.
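A crude version of that genre selection might look like the following. The mood lexicon and the genre labels are made up for this post; a real system would need a far richer vocabulary and probably something smarter than counting votes.

```python
from collections import Counter

# Hypothetical mood lexicon: word -> genre bucket. Only 'depression' and
# 'explosive' come from the examples above; the rest are invented.
MOOD_WORDS = {
    "depression": "blues",
    "lonely": "blues",
    "explosive": "uptempo",
    "wild": "uptempo",
}


def pick_genre(message, default="any"):
    """Pick the genre bucket that the most mood words in the message vote for."""
    words = (w.strip(".,!?'\"") for w in message.lower().split())
    votes = Counter(MOOD_WORDS[w] for w in words if w in MOOD_WORDS)
    if not votes:
        return default  # no mood words: do not restrict the catalog
    return votes.most_common(1)[0][0]
```

So `pick_genre("My depression feels lonely and wild")` lands on 'blues' (two votes to one), and a message with no mood words at all falls back to the whole catalog.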
Now, a process such as this would establish more relevance and meaning in the translation. And it might even sound better. But we're still not generating something that we'd call 'music'.
If generating something approximating 'music' is the goal here (and I'm not claiming that it is or should be, but let's just take that as a given for what follows), then how might you accomplish that? Humans derive meaning from music, as from language, based on larger patterns. If the goal here is to generate music, then you'd have to apply some of these patterns to the translation.
Meaning, for example, that LTSIFU would need to consider aspects such as key and rhythm when splicing together multiple tracks. If I have a choice of 8 tracks from which I can grab the utterance 'car', I should pick the one that best matches the musical key and BPM of the other utterances I'm working with.
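Picking the best of those candidate tracks could be sketched as a simple scoring problem. Everything here is an assumption on my part: the `Clip` record, measuring key distance as steps around the circle of fifths, ignoring major versus minor, and the arbitrary weight that trades tempo mismatch against key mismatch.

```python
from dataclasses import dataclass

# Pitch classes in semitones above C (sharps only, to keep the sketch short).
PITCH_CLASSES = {
    "C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11,
}


@dataclass
class Clip:
    """Hypothetical record for one sung utterance of a word."""
    song: str
    key: str    # tonic of the source song, e.g. "G"
    bpm: float  # tempo of the source song


def key_distance(a, b):
    """Steps between two keys around the circle of fifths (0 to 6)."""
    # Multiplying a pitch class by 7 (mod 12) gives its circle-of-fifths position.
    pos_a = PITCH_CLASSES[a] * 7 % 12
    pos_b = PITCH_CLASSES[b] * 7 % 12
    d = abs(pos_a - pos_b)
    return min(d, 12 - d)


def mismatch(clip, target_key, target_bpm, bpm_weight=0.1):
    """Lower is better. bpm_weight is an arbitrary choice, not a tuned value."""
    return key_distance(clip.key, target_key) + bpm_weight * abs(clip.bpm - target_bpm)


def best_clip(candidates, target_key, target_bpm):
    """Choose the candidate that best fits the key and tempo already in use."""
    return min(candidates, key=lambda c: mismatch(c, target_key, target_bpm))
```

If the track so far is in C at 120 BPM, a 'car' sung in G at 124 BPM would beat one sung in F# at exactly 120 BPM, and also one sung in C at 180 BPM, because it is close on both counts rather than perfect on one and far off on the other.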
I'm not claiming any of this is easy -- it is labor- and processor-intensive to implement translators such as these. But this is where I'd look to take LTSIFU if it were my baby.
Share and enjoy!
-r


