
Wednesday, August 11, 2010

Using Python Generators

Python 2.2 saw the introduction of the yield statement. A function or method containing yield returns a generator when called. Returning a generator is like returning a list: you can iterate through the result in a loop. The difference between returning a generator and a list is the flow of control. When a method yields data, it also yields control flow temporarily; the method resumes once the generator's next() method is called. When a method returns data, control flow is returned permanently. Generators aren't a replacement for returning data. They are a tool for Python developers to return a stream of data.
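To make that control flow concrete, here is a minimal sketch of a generator in action; count_up is a hypothetical example, not part of the book catalog we're about to design.

#A function containing yield returns a generator when called.
def count_up(limit):
    n = 0
    while n < limit:
        yield n #control is handed back to the caller here
        n += 1

gen = count_up(3)
print gen.next() #prints 0; the body runs up to the first yield
print gen.next() #prints 1; the body resumes after the yield

#A for loop calls next() automatically and stops on StopIteration.
for n in count_up(3):
    print n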

Let's create a simple example system to get a better feel for what generators are all about. We'll design a book catalog that searches book files. The format of each book file is a serialized Python dictionary whose keys are the book title, synopsis, and cover image. Along the way we'll see where generators pay off and where a simple list is sufficient. During its lifetime, a generator method goes through several states. These states help us envision generator properties, such as control flow and the overall responsiveness of the program.

Before we embark on our book catalog design, an analogy is useful. A Unix pipe is a data channel between two processes: the first process writes data to the pipe while the second process reads data from it. Python generators are similar. Instead of processes, we have a method that writes data to the generator and an iterator, typically a for loop, that reads data from it. Each loop iteration executes when data is made available by the generator. Used as a loop operand, a generator behaves the same as a list or tuple; the only difference is the flow of control.
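As a small sketch of the analogy (the stage names here are illustrative only), two generators can be chained like a pipe, with a for loop as the reading process:

#The writing side of the "pipe": produces raw data.
def produce():
    for word in ["alpha", "beta", "gamma"]:
        yield word

#A filter stage: reads from one generator, writes to another.
def upper_case(words):
    for word in words:
        yield word.upper()

#The reading side: each iteration pulls one item through the chain.
for word in upper_case(produce()):
    print word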

Our book catalog application should have an optional sorting component to sort search results. For this, we will create a Sorter class with a sort() method. This method accepts a list of books and returns a sorted list. Python lists already have a sort() method that sorts the elements in place, so we'll use it instead of reinventing the wheel. Once the list has been sorted, should Sorter.sort() return a generator? Before you decide, ask yourself whether the return data is already available. Does the method use a pipe and filter approach? Are we iterating over some data set, manipulating each element? Does the method produce a stream of data, or a single, self-contained piece of data? Our Sorter.sort() method is none of these things, so it will simply return the sorted list.
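A minimal sketch of the Sorter, assuming results are sorted alphabetically by the title key (the sort key is an assumption, not something the design above prescribes):

class Sorter(object):
    def sort(self, books):
        #The data is fully available up front, so a plain
        #sorted list is returned rather than a generator.
        books.sort(key=lambda book: book["title"]) #sort key is assumed
        return books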

The next book catalog component is the most important piece of the puzzle: searching for books. A Filter class with a filter() method will handle this. The method accepts a book iterator and a filter string, and returns a book iterator as the filter result. Returning a generator from this method is a good idea because the return data isn't immediately available; we're iterating over a set of books, filtering each one by checking whether the supplied string exists in the title or synopsis. If so, the book is yielded. This produces streaming behavior: objects that invoke Filter.filter() can begin reading from the returned generator before all of the data is available.
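Here is a sketch of the Filter under those assumptions, using the title and synopsis keys from the book dictionary format described earlier:

class Filter(object):
    def filter(self, books, query):
        #books can be any iterable of book dictionaries. Because
        #this method yields, callers can read matches before the
        #whole set has been scanned.
        for book in books:
            if query in book["title"] or query in book["synopsis"]:
                yield book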

How does all this interleaving of data and control flow work? Let's look at the states a generator method goes through during its lifetime. The method starts in a running state, in which it computes data to send to the generator. Once data is ready to yield, the method enters a yielding state. This state isn't active for long, only while the data is being written to the generator. Finally, the method enters a frozen state, preserved so that it may resume its flow of control once data has been read from the generator. The method then enters the running state again.
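A tiny, hypothetical trace makes these states visible; the print statements mark when the method body is actually running:

def states_demo():
    print "running: computing the first value"
    yield 1 #yielding, then frozen until next() is called again
    print "running: computing the second value"
    yield 2

gen = states_demo() #no output yet; the body hasn't started running
print gen.next() #runs the body to the first yield, then prints 1
print gen.next() #resumes the frozen method, then prints 2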

The final component of our book catalog is a FileReader class. This class has a read() method that loads each book file, creates the corresponding book dictionary, and sends it to the generator returned by the method. Now that we have all our book catalog components, the search work flow is straightforward. FileReader.read() yields a book dictionary. Filter.filter() searches the dictionary for "MyBook". A match is found, and the book is sent to the filter's generator. The user interface, which invoked Filter.filter(), displays "MyBook" in the search results. FileReader.read() then resumes with the next file in the directory. This chain of generators produces a stream of data and a responsive search. Think about a catalog with 5000 books: if one of the first 100 books matches the criteria, the user sees that book before the search has completed.
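A sketch of FileReader and the assembled chain, assuming the book files are pickled dictionaries in a single directory (the pickle format and the "books" directory name are assumptions based on the serialized-dictionary description above; Filter is the class sketched earlier):

import os
import pickle

class FileReader(object):
    def read(self, directory):
        #Yield one book dictionary per file; control returns here
        #each time the consumer asks for the next book.
        for name in os.listdir(directory):
            handle = open(os.path.join(directory, name), "rb")
            try:
                yield pickle.load(handle)
            finally:
                handle.close()

#Wiring the chain: the for loop pulls books through both
#generators, so the first match appears before all files are read.
reader = FileReader()
book_filter = Filter()
for book in book_filter.filter(reader.read("books"), "MyBook"):
    print book["title"]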

Designing a simple book catalog program has shown us that generators add overhead to methods when they aren't used to create a data stream. If your method operates on individual elements of an input data set and produces another set, use a generator. This is a pipe and filter approach, where the generator is the pipe and the method logic is the filter. Our example illustrates the effect data streams can have on the responsiveness of some behaviors, like searching for books. Adding multiple threads of control to an application can also increase responsiveness, but generators are a more intuitive, data-centric approach.

Wednesday, August 19, 2009

Django Boundary Iterators

The Django Python web application framework provides tools to parse multi-part form data as any web framework should. In fact, the developer responsible for creating Django web applications need not concern themselves with the underlying parsing machinery. However, it is still there and very accessible.

The Django boundary iterators help parse multi-part form data, yet they are general enough to be used in other contexts. The BoundaryIter class collaborates with other iterator classes such as ChunkIter and LazyStream. An example of these classes collaborating is illustrated below.
#Example: Using the Django boundary iterators.

#Imports.
from StringIO import StringIO
from django.http.multipartparser import ChunkIter, BoundaryIter, LazyStream

if __name__ == "__main__":
    #Boundary data.
    b_boundary = "-B-"
    c_boundary = "-C-"

    #Example message with two boundaries.
    _message = "bdata%scdata%s" % (b_boundary, c_boundary)

    #Instantiate a chunking iterator for file-like objects.
    _chunk_iter = ChunkIter(StringIO(_message))

    #Instantiate a lazy data stream using the chunking iterator.
    _lazy_stream = LazyStream(_chunk_iter)

    #Instantiate two boundary iterators, one per boundary.
    _bboundary_iter = BoundaryIter(_lazy_stream, b_boundary)
    _cboundary_iter = BoundaryIter(_lazy_stream, c_boundary)

    #Display the parsed boundary data.
    for data in _bboundary_iter:
        print "%s: %s" % (b_boundary, data)

    for data in _cboundary_iter:
        print "%s: %s" % (c_boundary, data)
In this example, we have a sample message containing two boundaries that we want to parse. To achieve this, we create three iterators and a stream. The _chunk_iter iterator is a ChunkIter instance, a very general iterator that reads a single chunk of data at a time from a file-like object. The two boundary iterators, _bboundary_iter and _cboundary_iter, are instances of the BoundaryIter class. These iterators expect both a data stream and a boundary; in this case, the LazyStream instance is passed to both boundary iterators.

Finally, we display the results of iterating over the boundary iterators. By the time the second iterator runs, the first has already consumed everything up to the first boundary, so the data stream is shorter.