<!--
Copyright 2011 The Go Authors. All rights reserved.
Use of this source code is governed by a BSD-style
license that can be found in the LICENSE file.
-->

<codewalk title="Generating arbitrary text: a Markov chain algorithm">

<step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./">
	This codewalk describes a program that generates random text using
	a Markov chain algorithm. The package comment describes the algorithm
	and the operation of the program. Please read it before continuing.
</step>

<step title="Modeling Markov chains" src="doc/codewalk/markov.go:/	chain/">
	A chain consists of a prefix and a suffix. Each prefix is a set
	number of words, while a suffix is a single word.
	A prefix can have an arbitrary number of suffixes.
	To model this data, we use a <code>map[string][]string</code>.
	Each map key is a prefix (a <code>string</code>) and each value is a
	list of suffixes (a slice of strings, <code>[]string</code>).
	<br/><br/>
	Here is the example table from the package comment
	as modeled by this data structure:
	<pre>
map[string][]string{
	" ":          {"I"},
	" I":         {"am"},
	"I am":       {"a", "not"},
	"a free":     {"man!"},
	"am a":       {"free"},
	"am not":     {"a"},
	"a number!":  {"I"},
	"number! I":  {"am"},
	"not a":      {"number!"},
}</pre>
	While each prefix consists of multiple words, we
	store prefixes in the map as a single <code>string</code>.
	It would seem more natural to store the prefix as a
	<code>[]string</code>, but we can't do this with a map because the
	key type of a map must implement equality (and slices do not).
	<br/><br/>
	Therefore, in most of our code we will model prefixes as a
	<code>[]string</code> and join the strings together with a space
	to generate the map key:
	<pre>
Prefix               Map key

[]string{"", ""}     " "
[]string{"", "I"}    " I"
[]string{"I", "am"}  "I am"
</pre>
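	As a minimal sketch of the mapping above, the following program joins
	a prefix into its map key and uses that key to index a fragment of the
	table. The helper name <code>key</code> is an assumption made for
	illustration and is not taken from <code>markov.go</code>.

```go
package main

import (
	"fmt"
	"strings"
)

// key is a hypothetical helper: it joins the words of a prefix
// with single spaces to form a map key.
func key(prefix []string) string {
	return strings.Join(prefix, " ")
}

func main() {
	// A fragment of the example table, keyed by joined prefixes.
	table := map[string][]string{
		" ":    {"I"},
		" I":   {"am"},
		"I am": {"a", "not"},
	}
	for _, p := range [][]string{{"", ""}, {"", "I"}, {"I", "am"}} {
		fmt.Printf("%q: %q\n", key(p), table[key(p)])
	}
}
```

	Note that two empty words joined by a space produce the one-character
	key <code>" "</code>, which is why the table starts with that entry.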
</step>

<step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/">
	The complete state of the chain table consists of the table itself and
	the word length of the prefixes. The <code>Chain</code> struct stores
	this data.
</step>

<step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/">
	The <code>Chain</code> struct has two unexported fields (those that
	do not begin with an upper case character), and so we write a
	<code>NewChain</code> constructor function that initializes the
	<code>chain</code> map with <code>make</code> and sets the
	<code>prefixLen</code> field.
	<br/><br/>
	This constructor function is not strictly necessary, as this entire
	program is within a single package (<code>main</code>) and therefore
	there is little practical difference between exported and unexported
	fields. We could just as easily write out the contents of this function
	when we want to construct a new <code>Chain</code>.
	But using these unexported fields is good practice; it clearly denotes
	that only methods of <code>Chain</code> and its constructor function
	should access those fields. Also, structuring <code>Chain</code> like
	this means we could easily move it into its own package at some later date.
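	<br/><br/>
	The struct and constructor described in the last two steps might look
	roughly like the sketch below. The field names <code>chain</code> and
	<code>prefixLen</code> come from the text above; the exact body of
	<code>NewChain</code> in <code>markov.go</code> may be written
	differently.

```go
package main

import "fmt"

// Chain holds the table and the prefix length, as described above.
type Chain struct {
	chain     map[string][]string
	prefixLen int
}

// NewChain returns a Chain whose map is ready for use.
// Sketch only: the body in markov.go may differ.
func NewChain(prefixLen int) *Chain {
	return &Chain{
		chain:     make(map[string][]string), // writing to a nil map would panic
		prefixLen: prefixLen,
	}
}

func main() {
	c := NewChain(2)
	c.chain["I am"] = append(c.chain["I am"], "a")
	fmt.Println(c.prefixLen, c.chain)
}
```

	The call to <code>make</code> matters: the zero value of a map is
	<code>nil</code>, and assigning to a key of a <code>nil</code> map
	panics at run time.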
</step>

<step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/">
	Since we'll be working with prefixes often, we define a
	<code>Prefix</code> type with the underlying type <code>[]string</code>.
	Defining a named type makes it explicit when we are
	working with a prefix rather than just a <code>[]string</code>.
	Also, in Go we can define methods on any named type (not just structs),
	so we can add methods that operate on <code>Prefix</code> if we need to.
</step>

<step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/">
	The first method we define on <code>Prefix</code> is
	<code>String</code>. It returns a <code>string</code> representation
	of a <code>Prefix</code> by joining the slice elements together with
	spaces. We will use this method to generate keys when working with
	the chain map.
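	<br/><br/>
	A minimal sketch of the type and its method, assuming the join is done
	with <code>strings.Join</code> as the description suggests:

```go
package main

import (
	"fmt"
	"strings"
)

// Prefix is a sequence of words.
type Prefix []string

// String joins the words with spaces; the result serves as a map key.
func (p Prefix) String() string {
	return strings.Join(p, " ")
}

func main() {
	p := Prefix{"I", "am"}
	fmt.Printf("%q\n", p.String())

	// Because Prefix has a String method it satisfies fmt.Stringer,
	// so formatting verbs such as %v use the joined form as well.
	fmt.Printf("%v\n", p)
}
```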
</step>

<step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/">
	The <code>Build</code> method reads text from an <code>io.Reader</code>
	and parses it into prefixes and suffixes that are stored in the
	<code>Chain</code>.
	<br/><br/>
	The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an
	interface type that is widely used by the standard library and
	other Go code. Our code uses the
	<code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which
	reads space-separated values from an <code>io.Reader</code>.
	<br/><br/>
	The <code>Build</code> method returns once the <code>Reader</code>'s
	<code>Read</code> method returns <code>io.EOF</code> (end of file)
	or some other read error occurs.
</step>

<step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/">
	This function does many small reads, which can be inefficient for some
	<code>Readers</code>. For efficiency we wrap the provided
	<code>io.Reader</code> with
	<code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a
	new <code>io.Reader</code> that provides buffering.
</step>

<step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/">
	At the top of the function we make a <code>Prefix</code> slice
	<code>p</code> using the <code>Chain</code>'s <code>prefixLen</code>
	field as its length.
	We'll use this variable to hold the current prefix and mutate it with
	each new word we encounter.
</step>

<step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n		}/">
	In our loop we read words from the <code>Reader</code> into a
	<code>string</code> variable <code>s</code> using
	<code>fmt.Fscan</code>. Since <code>Fscan</code> uses whitespace
	(newlines included) to separate each input value, each call will yield
	just one word (including punctuation), which is exactly what we need.
	<br/><br/>
	<code>Fscan</code> returns an error if it encounters a read error
	(<code>io.EOF</code>, for example) or if it can't scan the requested
	value (in our case, a single string). In either case we just want to
	stop scanning, so we <code>break</code> out of the loop.
</step>

<step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/	key/,/key\], s\)">
	The word stored in <code>s</code> is a new suffix. We add the new
	prefix/suffix combination to the <code>chain</code> map by computing
	the map key with <code>p.String</code> and appending the suffix
	to the slice stored under that key.
	<br/><br/>
	The built-in <code>append</code> function appends elements to a slice
	and allocates new storage when necessary. When the provided slice is
	<code>nil</code>, <code>append</code> allocates a new slice.
	This behavior conveniently ties in with the semantics of our map:
	retrieving an unset key returns the zero value of the value type and
	the zero value of <code>[]string</code> is <code>nil</code>.
	When our program encounters a new prefix (yielding a <code>nil</code>
	value in the map) <code>append</code> will allocate a new slice.
	<br/><br/>
	For more information about the <code>append</code> function and slices
	in general see the
	<a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article.
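	<br/><br/>
	The interplay between unset map keys and <code>append</code> can be
	seen in a short sketch. The helper name <code>addSuffix</code> is an
	assumption for illustration; <code>Build</code> performs the same
	assignment inline.

```go
package main

import "fmt"

// addSuffix is a hypothetical helper that records suffix under key.
// No existence check is needed: a missing key yields a nil slice,
// and append allocates storage for it.
func addSuffix(chain map[string][]string, key, suffix string) {
	chain[key] = append(chain[key], suffix)
}

func main() {
	chain := make(map[string][]string)
	fmt.Println(chain["I am"] == nil) // true: the key is not set yet

	addSuffix(chain, "I am", "a")
	addSuffix(chain, "I am", "not")
	fmt.Println(chain["I am"]) // [a not]
}
```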
</step>

<step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/">
	Before reading the next word our algorithm requires us to drop the
	first word from the prefix and push the current suffix onto the prefix.
	<br/><br/>
	When in this state
	<pre>
p == Prefix{"I", "am"}
s == "not"</pre>
	the new value for <code>p</code> would be
	<pre>
p == Prefix{"am", "not"}</pre>
	This operation is also required during text generation so we put
	the code to perform this mutation of the slice inside a method on
	<code>Prefix</code> named <code>Shift</code>.
</step>

<step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/">
	The <code>Shift</code> method uses the built-in <code>copy</code>
	function to copy the last <code>len(p)-1</code> elements of <code>p</code> to
	the start of the slice, effectively moving the elements
	one index to the left (if you consider zero as the leftmost index).
	<pre>
p := Prefix{"I", "am"}
copy(p, p[1:])
// p == Prefix{"am", "am"}</pre>
	We then assign the provided <code>word</code> to the last index
	of the slice:
	<pre>
// word == "not"
p[len(p)-1] = word
// p == Prefix{"am", "not"}</pre>
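	Putting the two statements together gives a runnable sketch of the
	method. It assumes a non-empty <code>Prefix</code>; the parameter name
	<code>word</code> follows the text above.

```go
package main

import "fmt"

// Prefix is a sequence of words.
type Prefix []string

// Shift removes the first word from the Prefix and appends the given word.
// It assumes len(p) is at least 1.
func (p Prefix) Shift(word string) {
	copy(p, p[1:])     // move every element one index to the left
	p[len(p)-1] = word // overwrite the now-duplicated last element
}

func main() {
	p := Prefix{"I", "am"}
	p.Shift("not")
	fmt.Println(p) // [am not]
}
```

	<code>Shift</code> has a value receiver, yet the caller sees the
	change: a slice value refers to an underlying array, and the method
	writes into that shared array without changing the slice's length.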
</step>

<step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/">
	The <code>Generate</code> method is similar to <code>Build</code>
	except that instead of reading words from a <code>Reader</code>
	and storing them in a map, it reads words from the map and
	appends them to a slice (<code>words</code>).
	<br/><br/>
	<code>Generate</code> uses a conditional for loop to generate
	up to <code>n</code> words.
</step>

<step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/">
	At each iteration of the loop we retrieve a list of potential suffixes
	for the current prefix. We access the <code>chain</code> map at key
	<code>p.String()</code> and assign its contents to <code>choices</code>.
	<br/><br/>
	If <code>len(choices)</code> is zero we break out of the loop as there
	are no potential suffixes for that prefix.
	This test also works if the key isn't present in the map at all:
	in that case, <code>choices</code> will be <code>nil</code> and the
	length of a <code>nil</code> slice is zero.
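	<br/><br/>
	The sketch below illustrates why a single length test covers both
	cases. The helper <code>hasSuffixes</code> is hypothetical and exists
	only to make the test easy to call.

```go
package main

import "fmt"

// hasSuffixes is a hypothetical helper mirroring the loop's test:
// it reports whether key has at least one recorded suffix.
func hasSuffixes(chain map[string][]string, key string) bool {
	choices := chain[key] // nil if key is absent
	return len(choices) != 0
}

func main() {
	chain := map[string][]string{
		"I am":   {"a", "not"},
		"a free": {}, // present, but with no suffixes
	}
	fmt.Println(hasSuffixes(chain, "I am"))    // true
	fmt.Println(hasSuffixes(chain, "a free"))  // false
	fmt.Println(hasSuffixes(chain, "missing")) // false
}
```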
</step>

<step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/">
	To choose a suffix we use the
	<code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function.
	It returns a non-negative random integer less than the provided
	value. Passing in <code>len(choices)</code> gives us a random,
	valid index into the list.
	<br/><br/>
	We use that index to pick our new suffix, assign it to
	<code>next</code> and append it to the <code>words</code> slice.
	<br/><br/>
	Next, we <code>Shift</code> the new suffix onto the prefix just as
	we did in the <code>Build</code> method.
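	<br/><br/>
	The generation loop described in the last three steps can be sketched
	as a free-standing function. The name <code>generate</code> and its
	parameters are assumptions for illustration; in <code>markov.go</code>
	the logic is a method on <code>Chain</code>.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

type Prefix []string

func (p Prefix) String() string { return strings.Join(p, " ") }

func (p Prefix) Shift(word string) {
	copy(p, p[1:])
	p[len(p)-1] = word
}

// generate is a simplified stand-in for the Generate method: it walks
// chain from the empty prefix and returns at most n words.
func generate(chain map[string][]string, prefixLen, n int) string {
	p := make(Prefix, prefixLen)
	var words []string
	for i := 0; i < n; i++ {
		choices := chain[p.String()]
		if len(choices) == 0 {
			break // no known suffix for this prefix
		}
		next := choices[rand.Intn(len(choices))] // index in [0, len(choices))
		words = append(words, next)
		p.Shift(next)
	}
	return strings.Join(words, " ")
}

func main() {
	chain := map[string][]string{
		" ":         {"I"},
		" I":        {"am"},
		"I am":      {"a", "not"},
		"am a":      {"free"},
		"am not":    {"a"},
		"a free":    {"man!"},
		"not a":     {"number!"},
		"a number!": {"I"},
		"number! I": {"am"},
	}
	fmt.Println(generate(chain, 2, 10))
}
```

	The output varies between runs because the prefix
	<code>"I am"</code> has two possible suffixes.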
</step>

<step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/">
	Before returning the generated text as a string, we use the
	<code>strings.Join</code> function to join the elements of
	the <code>words</code> slice together, separated by spaces.
</step>

<step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/">
	To make it easy to tweak the prefix and generated text lengths we
	use the <code><a href="/pkg/flag/">flag</a></code> package to parse
	command-line flags.
	<br/><br/>
	These calls to <code>flag.Int</code> register new flags with the
	<code>flag</code> package. The arguments to <code>Int</code> are the
	flag name, its default value, and a description. The <code>Int</code>
	function returns a pointer to an integer that will contain the
	user-supplied value (or the default value if the flag was omitted on
	the command line).
</step>

<step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/">
	The <code>main</code> function begins by parsing the command-line
	flags with <code>flag.Parse</code> and seeding the <code>rand</code>
	package's random number generator with the current time.
	<br/><br/>
	If the command-line flags provided by the user are invalid, the
	<code>flag.Parse</code> function will print an informative usage
	message and terminate the program.
</step>

<step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/">
	To create the new <code>Chain</code> we call <code>NewChain</code>
	with the value of the <code>prefix</code> flag.
	<br/><br/>
	To build the chain we call <code>Build</code> with
	<code>os.Stdin</code> (which implements <code>io.Reader</code>) so
	that it will read its input from standard input.
</step>

<step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/">
	Finally, to generate text we call <code>Generate</code> with
	the value of the <code>words</code> flag and assign the result
	to the variable <code>text</code>.
	<br/><br/>
	Then we call <code>fmt.Println</code> to write the text to standard
	output, followed by a newline.
</step>

<step title="Using this program" src="doc/codewalk/markov.go">
	To use this program, first build it with the
	<a href="/cmd/go/">go</a> command:
	<pre>
$ go build markov.go</pre>
	And then execute it while piping in some input text:
	<pre>
$ echo "a man a plan a canal panama" \
	| ./markov -prefix=1
a plan a man a plan a canal panama</pre>
	Here's a transcript of generating some text using the Go distribution's
	README file as source material:
	<pre>
$ ./markov -words=10 &lt; $GOROOT/README
This is the source code repository for the Go source
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the go directory (the one containing this README).
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the variable if you have just untarred a</pre>
</step>

<step title="An exercise for the reader" src="doc/codewalk/markov.go">
	The <code>Generate</code> method performs many allocations as it
	builds the <code>words</code> slice. As an exercise, modify it to
	take an <code>io.Writer</code> to which it incrementally writes the
	generated text with <code>fmt.Fprint</code>.
	Aside from being more efficient, this makes <code>Generate</code>
	more symmetrical to <code>Build</code>.
</step>

</codewalk>