<!--
Copyright 2011 The Go Authors. All rights reserved.
Use of this source code is governed by a BSD-style
license that can be found in the LICENSE file.
-->

<codewalk title="Generating arbitrary text: a Markov chain algorithm">

<step title="Introduction" src="doc/codewalk/markov.go:/Generating/,/line\./">
	This codewalk describes a program that generates random text using
	a Markov chain algorithm. The package comment describes the algorithm
	and the operation of the program. Please read it before continuing.
</step>

<step title="Modeling Markov chains" src="doc/codewalk/markov.go:/ chain/">
	A chain consists of a prefix and a suffix. Each prefix is a set
	number of words, while a suffix is a single word.
	A prefix can have an arbitrary number of suffixes.
	To model this data, we use a <code>map[string][]string</code>.
	Each map key is a prefix (a <code>string</code>) and its values are
	lists of suffixes (a slice of strings, <code>[]string</code>).
	<br/><br/>
	Here is the example table from the package comment
	as modeled by this data structure:
	<pre>
map[string][]string{
	" ":          {"I"},
	" I":         {"am"},
	"I am":       {"a", "not"},
	"a free":     {"man!"},
	"am a":       {"free"},
	"am not":     {"a"},
	"a number!":  {"I"},
	"number! I":  {"am"},
	"not a":      {"number!"},
}</pre>
	While each prefix consists of multiple words, we
	store prefixes in the map as a single <code>string</code>.
	It would seem more natural to store the prefix as a
	<code>[]string</code>, but we can't do this with a map because the
	key type of a map must implement equality (and slices do not).
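	<br/><br/>
	As a minimal sketch of how a lookup against such a table might work,
	the snippet below uses a hypothetical helper, <code>suffixesFor</code>,
	that is not part of the program. It assumes the key is formed by joining
	the prefix words with a single space, which matches the keys in the
	table above.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixesFor is a hypothetical helper (not part of markov.go).
// It joins the prefix words with a space to form the map key and
// returns the suffixes stored under that key, or nil if there are none.
func suffixesFor(chain map[string][]string, prefix []string) []string {
	return chain[strings.Join(prefix, " ")]
}

func main() {
	// A few rows of the example table.
	chain := map[string][]string{
		" ":    {"I"},
		" I":   {"am"},
		"I am": {"a", "not"},
	}
	fmt.Println(suffixesFor(chain, []string{"I", "am"})) // [a not]
	fmt.Println(suffixesFor(chain, []string{"", ""}))    // [I]
}
```

	Because the key is an ordinary <code>string</code>, it is comparable and
	can be used with the map directly.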
	<br/><br/>
	Therefore, in most of our code we will model prefixes as a
	<code>[]string</code> and join the strings together with a space
	to generate the map key:
	<pre>
Prefix               Map key

[]string{"", ""}     " "
[]string{"", "I"}    " I"
[]string{"I", "am"}  "I am"
</pre>
</step>

<step title="The Chain struct" src="doc/codewalk/markov.go:/type Chain/,/}/">
	The complete state of the chain table consists of the table itself and
	the word length of the prefixes. The <code>Chain</code> struct stores
	this data.
</step>

<step title="The NewChain constructor function" src="doc/codewalk/markov.go:/func New/,/\n}/">
	The <code>Chain</code> struct has two unexported fields (those that
	do not begin with an uppercase letter), and so we write a
	<code>NewChain</code> constructor function that initializes the
	<code>chain</code> map with <code>make</code> and sets the
	<code>prefixLen</code> field.
	<br/><br/>
	This constructor function is not strictly necessary, as this entire
	program is within a single package (<code>main</code>) and therefore
	there is little practical difference between exported and unexported
	fields. We could just as easily write out the contents of this function
	when we want to construct a new <code>Chain</code>.
	But using these unexported fields is good practice; it clearly denotes
	that only methods of <code>Chain</code> and its constructor function
	should access those fields. Also, structuring <code>Chain</code> like
	this means we could easily move it into its own package at some later
	date.
</step>

<step title="The Prefix type" src="doc/codewalk/markov.go:/type Prefix/">
	Since we'll be working with prefixes often, we define a
	<code>Prefix</code> type with the concrete type <code>[]string</code>.
	Defining a named type lets us state explicitly when we are
	working with a prefix rather than just a <code>[]string</code>.
	Also, in Go we can define methods on any named type (not just structs),
	so we can add methods that operate on <code>Prefix</code> if we need to.
</step>

<step title="The String method" src="doc/codewalk/markov.go:/func[^\n]+String/,/}/">
	The first method we define on <code>Prefix</code> is
	<code>String</code>. It returns a <code>string</code> representation
	of a <code>Prefix</code> by joining the slice elements together with
	spaces. We will use this method to generate keys when working with
	the chain map.
</step>

<step title="Building the chain" src="doc/codewalk/markov.go:/func[^\n]+Build/,/\n}/">
	The <code>Build</code> method reads text from an <code>io.Reader</code>
	and parses it into prefixes and suffixes that are stored in the
	<code>Chain</code>.
	<br/><br/>
	The <code><a href="/pkg/io/#Reader">io.Reader</a></code> is an
	interface type that is widely used by the standard library and
	other Go code. Our code uses the
	<code><a href="/pkg/fmt/#Fscan">fmt.Fscan</a></code> function, which
	reads space-separated values from an <code>io.Reader</code>.
	<br/><br/>
	The <code>Build</code> method returns once the <code>Reader</code>'s
	<code>Read</code> method returns <code>io.EOF</code> (end of file)
	or some other read error occurs.
</step>

<step title="Buffering the input" src="doc/codewalk/markov.go:/bufio\.NewReader/">
	This function does many small reads, which can be inefficient for some
	<code>Readers</code>. For efficiency we wrap the provided
	<code>io.Reader</code> with
	<code><a href="/pkg/bufio/">bufio.NewReader</a></code> to create a
	new <code>io.Reader</code> that provides buffering.
</step>

<step title="The Prefix variable" src="doc/codewalk/markov.go:/make\(Prefix/">
	At the top of the function we make a <code>Prefix</code> slice
	<code>p</code> using the <code>Chain</code>'s <code>prefixLen</code>
	field as its length.
	We'll use this variable to hold the current prefix and mutate it with
	each new word we encounter.
</step>

<step title="Scanning words" src="doc/codewalk/markov.go:/var s string/,/\n }/">
	In our loop we read words from the <code>Reader</code> into a
	<code>string</code> variable <code>s</code> using
	<code>fmt.Fscan</code>. Since <code>Fscan</code> uses whitespace to
	separate each input value, each call will yield just one word
	(including punctuation), which is exactly what we need.
	<br/><br/>
	<code>Fscan</code> returns an error if it encounters a read error
	(<code>io.EOF</code>, for example) or if it can't scan the requested
	value (in our case, a single string). In either case we just want to
	stop scanning, so we <code>break</code> out of the loop.
</step>

<step title="Adding a prefix and suffix to the chain" src="doc/codewalk/markov.go:/ key/,/key\], s\)">
	The word stored in <code>s</code> is a new suffix. We add the new
	prefix/suffix combination to the <code>chain</code> map by computing
	the map key with <code>p.String</code> and appending the suffix
	to the slice stored under that key.
	<br/><br/>
	The built-in <code>append</code> function appends elements to a slice
	and allocates new storage when necessary. When the provided slice is
	<code>nil</code>, <code>append</code> allocates a new slice.
	This behavior conveniently ties in with the semantics of our map:
	retrieving an unset key returns the zero value of the value type and
	the zero value of <code>[]string</code> is <code>nil</code>.
	When our program encounters a new prefix (yielding a <code>nil</code>
	value in the map) <code>append</code> will allocate a new slice.
	<br/><br/>
	For more information about the <code>append</code> function and slices
	in general see the
	<a href="/doc/articles/slices_usage_and_internals.html">Slices: usage and internals</a> article.
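	<br/><br/>
	The interaction between <code>append</code> and a missing map key can
	be seen in isolation in the following sketch. The helper
	<code>addSuffix</code> is hypothetical and takes a ready-made key,
	whereas the program computes the key from the current prefix.

```go
package main

import "fmt"

// addSuffix is a hypothetical helper (not part of markov.go).
// It appends suffix to the slice stored under key. If the key is unset,
// chain[key] is a nil slice and append allocates a new one.
func addSuffix(chain map[string][]string, key, suffix string) {
	chain[key] = append(chain[key], suffix)
}

func main() {
	chain := make(map[string][]string)

	// An unset key yields the zero value of []string, which is nil.
	fmt.Println(chain["I am"] == nil) // true

	addSuffix(chain, "I am", "a")
	addSuffix(chain, "I am", "not")
	fmt.Println(chain["I am"]) // [a not]
}
```

	No explicit check for a missing key is needed before appending.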
</step>

<step title="Pushing the suffix onto the prefix" src="doc/codewalk/markov.go:/p\.Shift/">
	Before reading the next word our algorithm requires us to drop the
	first word from the prefix and push the current suffix onto the prefix.
	<br/><br/>
	When in this state
	<pre>
p == Prefix{"I", "am"}
s == "not" </pre>
	the new value for <code>p</code> would be
	<pre>
p == Prefix{"am", "not"}</pre>
	This operation is also required during text generation so we put
	the code to perform this mutation of the slice inside a method on
	<code>Prefix</code> named <code>Shift</code>.
</step>

<step title="The Shift method" src="doc/codewalk/markov.go:/func[^\n]+Shift/,/\n}/">
	The <code>Shift</code> method uses the built-in <code>copy</code>
	function to copy the last <code>len(p)-1</code> elements of
	<code>p</code> to the start of the slice, effectively moving the
	elements one index to the left (if you consider zero as the leftmost
	index).
	<pre>
p := Prefix{"I", "am"}
copy(p, p[1:])
// p == Prefix{"am", "am"}</pre>
	We then assign the provided <code>word</code> (the suffix, in this
	example) to the last index of the slice:
	<pre>
// suffix == "not"
p[len(p)-1] = suffix
// p == Prefix{"am", "not"}</pre>
</step>

<step title="Generating text" src="doc/codewalk/markov.go:/func[^\n]+Generate/,/\n}/">
	The <code>Generate</code> method is similar to <code>Build</code>
	except that instead of reading words from a <code>Reader</code>
	and storing them in a map, it reads words from the map and
	appends them to a slice (<code>words</code>).
	<br/><br/>
	<code>Generate</code> uses a conditional for loop to generate
	up to <code>n</code> words.
</step>

<step title="Getting potential suffixes" src="doc/codewalk/markov.go:/choices/,/}\n/">
	At each iteration of the loop we retrieve a list of potential suffixes
	for the current prefix.
	We access the <code>chain</code> map at key
	<code>p.String()</code> and assign its contents to <code>choices</code>.
	<br/><br/>
	If <code>len(choices)</code> is zero we break out of the loop as there
	are no potential suffixes for that prefix.
	This test also works if the key isn't present in the map at all:
	in that case, <code>choices</code> will be <code>nil</code> and the
	length of a <code>nil</code> slice is zero.
</step>

<step title="Choosing a suffix at random" src="doc/codewalk/markov.go:/next := choices/,/Shift/">
	To choose a suffix we use the
	<code><a href="/pkg/math/rand/#Intn">rand.Intn</a></code> function.
	It returns a non-negative random integer less than the provided
	value. Passing in <code>len(choices)</code> gives us a random index
	into the full length of the list.
	<br/><br/>
	We use that index to pick our new suffix, assign it to
	<code>next</code> and append it to the <code>words</code> slice.
	<br/><br/>
	Next, we <code>Shift</code> the new suffix onto the prefix just as
	we did in the <code>Build</code> method.
</step>

<step title="Returning the generated text" src="doc/codewalk/markov.go:/Join\(words/">
	Before returning the generated text as a string, we use the
	<code>strings.Join</code> function to join the elements of
	the <code>words</code> slice together, separated by spaces.
</step>

<step title="Command-line flags" src="doc/codewalk/markov.go:/Register command-line flags/,/prefixLen/">
	To make it easy to tweak the prefix and generated text lengths we
	use the <code><a href="/pkg/flag/">flag</a></code> package to parse
	command-line flags.
	<br/><br/>
	These calls to <code>flag.Int</code> register new flags with the
	<code>flag</code> package. The arguments to <code>Int</code> are the
	flag name, its default value, and a description.
	The <code>Int</code>
	function returns a pointer to an integer that will contain the
	user-supplied value (or the default value if the flag was omitted on
	the command line).
</step>

<step title="Program set up" src="doc/codewalk/markov.go:/flag.Parse/,/rand.Seed/">
	The <code>main</code> function begins by parsing the command-line
	flags with <code>flag.Parse</code> and seeding the <code>rand</code>
	package's random number generator with the current time.
	<br/><br/>
	If the command-line flags provided by the user are invalid the
	<code>flag.Parse</code> function will print an informative usage
	message and terminate the program.
</step>

<step title="Creating and building a new Chain" src="doc/codewalk/markov.go:/c := NewChain/,/c\.Build/">
	To create the new <code>Chain</code> we call <code>NewChain</code>
	with the value of the <code>prefix</code> flag.
	<br/><br/>
	To build the chain we call <code>Build</code> with
	<code>os.Stdin</code> (which implements <code>io.Reader</code>) so
	that it will read its input from standard input.
</step>

<step title="Generating and printing text" src="doc/codewalk/markov.go:/c\.Generate/,/fmt.Println/">
	Finally, to generate text we call <code>Generate</code> with
	the value of the <code>words</code> flag and assign the result
	to the variable <code>text</code>.
	<br/><br/>
	Then we call <code>fmt.Println</code> to write the text to standard
	output, followed by a newline.
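	<br/><br/>
	To see how the flag values reach the rest of <code>main</code> as
	dereferenced pointers, here is a rough sketch. It is not the program's
	code: <code>parseFlags</code> is a hypothetical helper, it uses
	<code>flag.NewFlagSet</code> instead of the package-level
	<code>flag.Int</code> calls so that it can be exercised with an explicit
	argument list, and the default values and descriptions are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseFlags is a hypothetical helper (not part of markov.go).
// It registers the "prefix" and "words" flags on a private FlagSet,
// parses args, and returns the prefix length and word count.
// The defaults (2 and 100) and the descriptions are assumptions.
func parseFlags(args []string) (int, int, error) {
	fs := flag.NewFlagSet("markov", flag.ContinueOnError)
	words := fs.Int("words", 100, "maximum number of words to print")
	prefix := fs.Int("prefix", 2, "prefix length in words")
	if err := fs.Parse(args); err != nil {
		return 0, 0, err
	}
	// Int returns pointers, so dereference them to read the values.
	return *prefix, *words, nil
}

func main() {
	prefixLen, numWords, err := parseFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	// Println writes its operands followed by a newline.
	fmt.Println("prefix length:", prefixLen, "words:", numWords)
}
```

	With <code>flag.ContinueOnError</code> the caller decides what to do when
	parsing fails, whereas the package-level <code>flag.Parse</code>
	terminates the program itself.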
</step>

<step title="Using this program" src="doc/codewalk/markov.go">
	To use this program, first build it with the
	<a href="/cmd/go/">go</a> command:
	<pre>
$ go build markov.go</pre>
	And then execute it while piping in some input text:
	<pre>
$ echo "a man a plan a canal panama" \
	| ./markov -prefix=1
a plan a man a plan a canal panama</pre>
	Here's a transcript of generating some text using the Go distribution's
	README file as source material:
	<pre>
$ ./markov -words=10 &lt; $GOROOT/README
This is the source code repository for the Go source
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the go directory (the one containing this README).
$ ./markov -prefix=1 -words=10 &lt; $GOROOT/README
This is the variable if you have just untarred a</pre>
</step>

<step title="An exercise for the reader" src="doc/codewalk/markov.go">
	The <code>Generate</code> function does a lot of allocations when it
	builds the <code>words</code> slice. As an exercise, modify it to
	take an <code>io.Writer</code> to which it incrementally writes the
	generated text with <code>Fprint</code>.
	Aside from being more efficient, this makes <code>Generate</code>
	more symmetrical with <code>Build</code>.
</step>

</codewalk>