Markov Chain Generation


Markov Chain Generator Review

by Harrison Massey

Description

Markov Chain Generators are programs that analyze a text file and then provide a random string of text that (supposedly) sounds like it came from the same source as the text you fed into it.  The results are frequently nonsensical, but do emulate the original text file author's writing style and word/phrase choices to a certain degree.

 

To do this, the generator program looks at each word in the text and the word that immediately follows it, and records one occurrence of that pair.  It then moves forward a word and repeats the process until every word has been counted.  The result is a dataset whose keys are words and whose values are the words that follow them, each with a frequency count.  For instance, part of the dataset for the first paragraph of this review would look like this:

 

{

     "text" : {

                    "file": 2,

                    "you": 1,

                    "that": 1

               }

}

 

This indicates that the word "file" appears after "text" twice, "you" once, and "that" once.  With this data, given a starting word, the program generates text by treating these counts as probabilities and randomly picking a following word.  For instance, if I feed "text" into the program, there is a 50% chance that it will be followed by "file," a 25% chance that it will be followed by "you," and a 25% chance that it will be followed by "that."  Whatever word is picked, the chance that it will be used again is temporarily reduced to produce more realistic variance; the mathematical details of this process are well explained in the Wikipedia article about the core math concepts.  The same process is repeated for each word generated until the program hits a specified word limit or reaches a word that has no entries in the dataset.  It is also possible to run this process in reverse, randomly adding words that come before the provided word, which allows a more varied result.  The "order" of the Markov chain can also be raised, which increases the number of preceding words examined when building the probability distributions.
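The counting and forward generation described above can be sketched in a few lines of Python.  This is only an illustration under some assumptions, not how node-markov or any other tool mentioned in this review is implemented: the names build_table and generate are made up for this example, words are split on whitespace only, and the temporary reduction in probability for recently used words is left out.

```python
import random
from collections import Counter, defaultdict


def build_table(text):
    """Map each word to a Counter of the words that directly follow it."""
    words = text.split()  # assumption: whitespace tokenization, punctuation kept
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table


def generate(table, seed, limit=20, rng=random):
    """Walk forward from seed, picking each next word in proportion to its count."""
    output = [seed]
    word = seed
    # Stop at the word limit or at a word with no recorded followers.
    while len(output) < limit and word in table:
        followers = table[word]
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        output.append(word)
    return " ".join(output)


# A shortened stand-in for the first paragraph of this review.
sample = ("analyze a text file and provide a string of text that sounds "
          "like the text you fed into it from the original text file")
table = build_table(sample)
# table["text"] holds the counts: file 2, that 1, you 1
print(generate(table, "text", limit=10))
```

Because the next word is drawn at random, each run prints a different string; passing a seeded random.Random instance as rng makes the output repeatable.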

 

Access

There are a number of code fragments and libraries that will allow you to perform Markov Chain Generation.  My favorite, and the simplest one I've found, is substack's node-markov.  It's available for free, with the source on GitHub.  This Python-based one, which generates scholarly article titles and also provides a coding interface, is easier to set up, however, since you only need Python installed.  You can also use a web-based one such as this one by Cheng Zhang, but you don't get fine-grained control over the generation methods or the results.

 

Sample Usage

The following is a Markov chain created by using the example from node-markov with the seed "author" from Barthes' "Death of the Author:"

 

empiricism, English with starting lost; being someone him-this of consciousness very criticism that fact, in us in and Bouvard to victory 'explained'- is it 'thing' same the not speaks, which origin-or without perfectly functioning empty 'subject', this be To referring linguists, what possible), becomes last at and enunciation (the of allegory transparent less or smothers, ignores, aside Leaving together. Writing of product a with ones the as of centres innumerable the of cost the with author and cultures many from derived fragment, secondary a by act an himself for work: a traces expression), of making consequently, that meanings double with hand (this be to end, the revealed Is text a not is writing that Greek of mastery the put, nobly more no has Vernant) (J.-P. research we recent know clear For is it: there Balzac but the Author-God) modern the text and to ideas furnish modern it the thinks In to read his and anecdotal, epitome historical the fact from that off'), each 'played character only understands can each writer character the understands for unilaterally lives this suffers, disconnection thinks occurs, he the Did innumerable them centres of of removal every the point of where biographies only literature a of text cost is the never it to before the exists sense he of is the text sense a of draws the he 'person', is and writer read the In with the author;

 

And again with the same text, but seeded with "modern:"

 

the innumerable empiricism, centres English of with men concerned of visibly the was author analyses, still his consists never in life is his more of extensive 'message' and The complex hears than addition, that in it surprise is no to fooled his be 'genius'. -may the what head liberates itself, text, Such a a of person view the a novel, like as uttered-something I is is text), tyrannically a centred romantically, on being, remaining there ignorant Hence of critic. kings the or memoirs. the and castrato made hidden is beneath author the the reader origin for of himself, centres he innumerable wish the to historically, write that is and in inscription ordinary of patience mastery of the capitalist Baudelaire ideology, so which model; defines the it, with it longer is no uttered-something narrated like is the that negative predecessors, where his this tastes, disconnection his occurs, original, the never instance shall writing, we that literature writing. of is, person it), the designates to exactly order what in the given enunciation (exclusively is form language verbal 'hold essentially together', the in found which been his has 'genius'. criticism the suits hero conception of a the up explanation make of that them way. Did that he saying never without to inscribed be are admired writing, but origin, there its is source, to its condemn writing the modern

 

This script is set up to end the generated text with the provided word.  If multiple words are provided, it selects one at random.  Here is a result for the seed "modern author:"

 

law science, hypostases-reason, his Gogh's Van man, young the of consciousness very the deciphered; nothing disentangled, (be possibly may code narrative the is, Linguistically, without interlocutors. origin-or the which for he need who any until being then them. allotting of itself space language the itself different. this is delay scriptor and modern who gave understands Proust unilaterally possible), 'This becomes hand last is at he and and delay to this his sentence. only Balzac, person it who rather is say in to the novelist), Middle realist Ages the with with the author

 

The resulting text, while nearly complete gibberish, loosely emulates Barthes' style.  The results are interesting as a demonstration of a probabilistic breakdown of Barthes, and amusing for phrases such as "Gogh's Van man."  They are also complicated by the fact that this particular example script generates words going both forwards and backwards, which produces some strange word orders.  Adjusting the script to only perform forward generation provides the following result for "author:"

 

the present tense) in literature We know for us too, slow for him passions, while criticism has no longer designate an author (one of the constitutively ambiguous nature of as the very identity is the principle and indefinitely something experienced in the author still consists for Which all the present tense) in suppressing the author. (in the writer and, to Let ourselves be seen, to go no surprise in Paradis Artificiels), 'created for which at the Author or its words, only a way. that can no longer by the text and not in the others, in literature it: goes without history, psyche, liberty) beneath: the writer can know no longer be followed, 'run' like a performative a castrato disguised as was so on any voice, borne by which has also been supposed to reach that of the person In the narrative is as it in the fact is 'explained'- victory to decipher a writing it is truly revolutionary since to us that someone who produced it:

 

and this result for "modern:"

 

societies the novel, as soon as will be at all identity of inscription and so on the lessons of kings (or

 

This script is also configured with an order of 1, which means each word is generated based on only the one preceding word.  If we raise the order to 2, we get the following chain for "author:"

 

the act by which it is uttered-something like the I declare of kings or the I declare of kings or the I sing of very ancient poets. Having buried the Author, the modern scriptor can thus no longer bears within him passions, humours, feelings, impressions, but rather this immense dictionary from which he draws a writing that can know no halt: life never does more than consolidate it), it goes without saying that certain writers have long since attempted to loosen it. In France, Mallarme was doubtless the first to see and to foresee in its full extent the necessity to substitute language itself for the good reason that writing is the destruction of every voice, of every point of origin. Writing is that neutral, composite, oblique space where our subject slips away, the negative where all identity is lost, starting with the very enunciation which defines it, suffices to make language 'hold together', suffices, that is to say, to exhaust it.

 

The result is far more coherent, and likely more useful as well.
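To make the effect of the order concrete, here is a minimal Python sketch of a chain where each lookup key is a run of several preceding words rather than a single word.  It is an assumption-laden illustration, not the node-markov code: the names build_ngram_table and generate_ngram are hypothetical, and the seed must be a tuple of exactly as many words as the order, which may differ from how the script used above handles a one-word seed.

```python
import random
from collections import Counter, defaultdict


def build_ngram_table(words, order=2):
    """Map each run of `order` consecutive words to a Counter of next words."""
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        table[key][words[i + order]] += 1
    return table


def generate_ngram(table, seed, limit=30, rng=random):
    """Extend seed (a tuple of `order` words) using the last `order` words as key."""
    output = list(seed)
    order = len(seed)
    while len(output) < limit:
        followers = table.get(tuple(output[-order:]))
        if not followers:
            break  # no data for this run of words, so the chain ends here
        choice = rng.choices(list(followers), weights=list(followers.values()))[0]
        output.append(choice)
    return " ".join(output)


# A short phrase taken from the order-2 result above.
words = "the I declare of kings or the I sing of very ancient poets".split()
table = build_ngram_table(words, order=2)
print(generate_ngram(table, ("the", "I")))
```

With longer keys, most runs of words have only one recorded follower, which is why higher-order output reads more coherently and also tends to reproduce stretches of the source text.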

 

What the Tool Does Well

 

What it Does Poorly

 

Additional Resources

 

 

https://github.com/vedant/markov-chain-generator