The new io module, designed for Python 3 but available since 2.6, gives file handling in Python a sensible layer of abstraction. The intent behind the new module is to replace the built-in open function and the methods of file-like objects. But why?
The file-like objects we've used in Python for over ten years already provide a nice abstraction layer between the programmer and the API. The nicest thing about file-like objects is that the API is small, and so easily emulated by other objects that we'd like to use in a file context. The facilities available for working with files in Python aren't perfect, nor is the new io module. But the ideas brought forth by io have had quite a while to coalesce. And since the module is available in Python 2, code written against io today shouldn't have to change when the transition to Python 3 comes.
Over the course of many years, the Python community has been busy writing IO code using the native file interface. Lots of code over significant amounts of time yields some common usage patterns that aren't optimal for the vast majority of users. That is to say, there is nothing wrong with the native file implementation — it works as expected. The improvements in the io module aim more toward writing less band-aid code surrounding the file interface. I'm certainly in favour of writing less code, especially for common problems that are encapsulated behind the interface. My question is this — can Python projects successfully transition to a new methodology of treating file-like objects as stream-like?
Good Old Days
Interestingly enough, the good old days of file-like objects are still with us. This is good, because had they suddenly disappeared, we'd be left with exactly zero functional Python packages. That isn't the intent of the Python development effort; the goal isn't to profess what a select few think is good for the language. Instead, the file-like interface will likely never go away. As with many other provisional decisions made by the core Python developers, the largest influence is maintaining a community of developers who can keep their code stable.
The Python 2 series has seen some changes, mostly additions, that are by and large aimed at Python 3 adoption. The io module is one of those additions. Needless to say, you can start making your code Python 3 friendly now. The trouble is, with the built-in file interface in particular, where is the motivation to start rewriting code? Because there is a new way of doing it? The simplicity of the Python 2 file interface doesn't exactly inspire a mass exodus from the norm. What is needed is a real kick, something to demonstrate to developers that writing new code for working with files is worthwhile.
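One small motivator: io.open can already stand in for the built-in open today, with Python 3 semantics. A minimal sketch (the file path is illustrative):

```python
import io
import os
import tempfile

# A scratch file in a temporary directory; the path is illustrative.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# io.open (available since Python 2.6) behaves like Python 3's built-in
# open: text mode plus an encoding reads and writes unicode text on both.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"caf\xe9\n")      # text in, encoded to UTF-8 on disk

with io.open(path, "r", encoding="utf-8") as f:
    content = f.read()         # decoded back to text on the way in
```

Code written this way runs unchanged on both major versions, which takes some of the sting out of the eventual transition.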
There is a problem there too. The file interface is so useful that we often emulate its API around other objects: file-like objects. So we often don't know that we're using a real file, and as long as the implementation is hidden from view, we probably don't care. The file-like concept is so pervasive that it's almost as though Python has a concept dependency on it. Python has no notion of a required interface, but is staying file-like mandatory? I whole-heartedly support the idea that, given an opportunity to fix some common use cases without introducing mammoth amounts of code, developers would jump on it.
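The standard library's io.StringIO shows how far this duck typing goes: it's a file-like object with no file behind it, and consumers such as the csv module can't tell the difference. A small sketch:

```python
import csv
import io

# io.StringIO is an in-memory file-like object; csv.reader only needs
# file-like iteration over lines, so it never knows there's no real file.
fake_file = io.StringIO("name,lang\nGuido,python\n")
rows = list(csv.reader(fake_file))
```

Anything that reads lines from a "file" will accept this object, which is exactly the pervasiveness the paragraph above describes.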
Brave New API
The key differentiator that the new io module brings to the table is a class hierarchy. The most general class, IOBase, is completely abstract and does nothing more than provide an API. From there, traversing downward through the hierarchy, we're presented with more specific IO capabilities. For example, RawIOBase defines the layer that deals with low-level system calls. But rather than interact with RawIOBase directly, we'd probably want to use one of its concrete descendants, like FileIO. And this is the beneficial design tactic brought forth by the new library: low-level operations that we would typically have to create our own abstractions around are taken care of for us.
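The layering is easy to see from Python 3 itself, where the built-in open composes the io hierarchy for us. A small sketch, using a scratch file in a temporary directory:

```python
import io
import os
import tempfile

# In Python 3, open() builds the io stack for us: a TextIOWrapper
# (encoding) over a BufferedWriter (buffering) over a raw FileIO
# (system calls).
path = os.path.join(tempfile.mkdtemp(), "layers.txt")  # illustrative path
f = open(path, "w", encoding="utf-8")

text_layer = isinstance(f, io.TextIOBase)                 # top: text
buffered_layer = isinstance(f.buffer, io.BufferedIOBase)  # middle: buffer
raw_layer = isinstance(f.buffer.raw, io.RawIOBase)        # bottom: syscalls

f.write("layered\n")
f.close()
```

Each layer is a real object you can reach and use on its own, which is precisely the abstraction work we used to do by hand.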
A prominent use case is dealing with text versus bytes. TextIOBase, for example, takes care of the Unicode issues that we're used to handling ourselves with the traditional file interface. The distinction between raw bytes and text is a key philosophy in Python 3, and how the io module handles these types by means of the class hierarchy offers a glimpse into that philosophy. If you're writing Python 2 code, which I think most of us still are if we write code that's used in production, this is a simple means to write less boilerplate code to deal with types. I think the io module works well in taking Python back to the concept of typeless languages, by hiding some of the type woes behind an API.
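The bytes/text split falls straight out of which layer you read through. A minimal sketch, reading the same two bytes on disk through the binary and text layers:

```python
import io
import os
import tempfile

# The same two bytes on disk, read through two layers of the hierarchy.
path = os.path.join(tempfile.mkdtemp(), "mixed.txt")  # illustrative path
with io.open(path, "wb") as f:
    f.write(u"\xe9".encode("utf-8"))  # 'é' stored as two UTF-8 bytes

with io.open(path, "rb") as f:                    # binary layer: raw bytes
    raw = f.read()
with io.open(path, "r", encoding="utf-8") as f:   # text layer: decoded str
    text = f.read()
```

The decode step lives in the TextIOWrapper layer, not in our application code, which is exactly the boilerplate reduction described above.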
The PEP for this module (PEP 3116) states outright that it took some influence from Java's IO libraries. That means we now have buffered IO classes to work with. Why is this important? Predictable performance. Python runs on a lot of disparate operating systems and devices, which means the same Python application can see very different system-call latencies depending on where it runs. Buffered IO provides the means to read or write as much as possible at once, while hiding the underlying system calls from the programmer. Any application that reads or writes from more than one place concurrently will likely run into issues of unpredictable responsiveness, at least from the user's perspective. There are definitely ways around this using the built-in file API, but pushing that code down into a core library seems like the better choice.
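You can watch the buffering happen by wrapping a raw FileIO yourself. A sketch (the path and sizes are illustrative):

```python
import io
import os
import tempfile

# A BufferedReader wraps a raw FileIO and serves many small reads from one
# larger system call; buffer_size controls how much is fetched at once.
path = os.path.join(tempfile.mkdtemp(), "buffered.bin")  # illustrative path
with io.open(path, "wb") as f:
    f.write(b"x" * 10000)

raw = io.FileIO(path, "r")
buffered = io.BufferedReader(raw, buffer_size=4096)

chunk = buffered.read(10)     # ten bytes handed to us...
prefetched = raw.tell()       # ...but the raw file has already read ahead
buffered.close()
```

Ten logical reads of ten bytes each cost one system call, not ten, and that amortization is what smooths out latency differences across platforms.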
The built-in file API isn't going away. Conceptually, the new io module isn't all that different. The core abstractions in the class hierarchy are virtually the same as the file-like objects we use today. The challenge is going to be figuring out where it is worthwhile to replace old code with this module. Or maybe Python IO concepts going forward will force us to rethink our application data in terms of streams, and to ask what the rewards of doing so are.
Thursday, June 28, 2012
Thursday, October 1, 2009
Diesel Web
Diesel Web is a Python web application framework built from very simple components. Rather than focusing on providing a rich set of application components, the focus is on raw performance. With this aspect of the web application taken care of, developers can focus on functionality, with more freedom than most other frameworks can offer.
The performance offered by Diesel Web is achieved through non-blocking, asynchronous IO. This differs from most other web application frameworks, which use the thread pool pattern. In the thread pool pattern, a main thread listens for incoming requests and passes each one to a thread in the pool. Once the request has been handed off, the main thread can go back to listening for requests and doesn't need to worry about blocking while individual requests are processed. This is the common way that concurrency is achieved in web application frameworks.
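The thread pool pattern described above can be sketched in a few lines of standard-library Python; handle and serve_on are illustrative names, not part of any framework:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Illustrative worker: read a request, send a minimal HTTP response.
def handle(conn):
    with conn:
        conn.recv(1024)                                  # read the request
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\n")  # minimal response

# The main thread only accepts; workers process, so accept() is never
# blocked by a slow request. max_requests keeps the sketch finite.
def serve_on(listener, max_requests):
    with ThreadPoolExecutor(max_workers=8) as pool:
        for _ in range(max_requests):
            conn, _addr = listener.accept()  # main thread: listen only
            pool.submit(handle, conn)        # pool thread: process request
```

The locking that makes the pool safe is hidden inside ThreadPoolExecutor, but it is still there, which is the overhead the next paragraph is about.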
The asynchronous IO approach taken by Diesel Web scales better than the thread pool approach because there aren't any locking and synchronization primitives that accumulate overhead. IO events dictate what happens at the OS level and can thus scale better. The real benefit to using this approach would become apparent with thousands of users.
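Diesel Web runs its own event loop, but the standard library's asyncio (which arrived later) illustrates the same single-threaded idea: one loop multiplexes every connection, with no thread, lock, or synchronization primitive per request. A hedged sketch, not Diesel's actual API:

```python
import asyncio

# One coroutine per connection, all multiplexed on a single event loop.
async def handle_request(reader, writer):
    await reader.read(1024)                          # read the request
    writer.write(b"HTTP/1.0 200 OK\r\n\r\nhello\n")  # minimal response
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_request, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main()) would start the loop; omitted so the sketch is inert.
```

Each await is a point where the loop parks this connection and services another, driven by IO readiness events from the OS rather than by a scheduler juggling threads.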
One other topic of interest with Diesel Web is the complete lack of dependencies, which is always a good thing. The framework appears to be striving for simplicity, another good thing, and doesn't really need much help from external packages. Basic HTTP protocol support is built in and that is really all that is needed as a starting point. It will be interesting to see if many other larger applications get built using this framework.