
Thursday, October 29, 2009

Gaphor Idle Threads

The Gaphor UML modeling tool, written in Python, uses GTK as its user interface library. GTK is a good choice because it is portable across many platforms and has a nice feature set. What is also nice about the GTK library, from a development perspective, is that it is fairly straightforward to handle events triggered from the user interface. Such events may be user-generated, such as mouse clicks on widgets. Other events may be generated by the widgets themselves. Either way, adding a handler for any such event is trivial. Using gobject, developers can also add handlers that are executed in between events, such as when the event loop is idle.
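As a quick illustration (a minimal PyGTK sketch, not Gaphor code), both kinds of event handlers, plus an idle handler, are registered in a line or two each:

import gtk
import gobject

def on_click(button):
    print "button clicked"

def when_idle():
    print "main loop is idle"
    return False  #run once, then remove the handler

window=gtk.Window()
window.connect("destroy", gtk.main_quit)  #widget-generated event
button=gtk.Button("Click me")
button.connect("clicked", on_click)       #user-generated event
window.add(button)
window.show_all()
gobject.idle_add(when_idle)               #runs between events, when idle
gtk.main()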

In any GTK application, there is a main event loop that must be instantiated within the application code. This is one of the very first actions executed because, without it, there will be no responses to GTK events. Every time an event is triggered inside a GTK main event loop, the event instance is placed in a pending event queue. Once the event has been processed, or handled, it is no longer considered pending and is removed from the queue. This means that the GTK main loop can be viewed, at a high level, as having two distinct states: pending and free. These states are illustrated below.



Here, the initial state represents the instantiation of the GTK main event loop while the final state represents the termination of the main loop. Termination usually means that the user has exited the application successfully, but it could also mean that the application has exited erroneously. Regardless, no GTK events will be processed once the main loop has exited, even if the containing application has not exited.

As illustrated, the GTK main loop has two states while running and two transitions between them. The main event loop transitions to the pending state when there are one or more pending events, and back to the free state when there are zero pending events.

Gaphor defines an idle thread class that makes good use of all this GTK event machinery. The GIdleThread class uses gobject.idle_add() to add a callback to the GTK main event loop. This callback is only executed when there are zero pending events. Strictly speaking, it will still execute if the pending events have a lower priority, but that doesn't change the concept here. The key idea is that the callbacks created by GIdleThread are only executed when the GTK main loop is idle. The GIdleThread class is illustrated below.



So the nagging question from developers is: why add this abstraction layer on top of the gobject.idle_add() function? Simply put, the GIdleThread class is used to assemble queues when the GTK main loop isn't busy. The obvious benefit is that queues of arbitrary size can be assembled without sacrificing responsiveness for the end user.

An example use of this class is reading and parsing data files. A generator function that yields data is passed to the GIdleThread constructor, along with the queue that will eventually contain all the parsed data. The abstraction also provides a thread-like feel for developers who use it. Although not a real thread, it looks and behaves like one and is ideal for constructing queues.
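The pattern is easy to approximate. Here is a minimal sketch of the idea (the class name and details are hypothetical, not Gaphor's actual GIdleThread implementation): a generator is consumed one item per idle callback, so the queue fills without ever blocking the main loop.

import gobject

class IdleQueueBuilder(object):
    def __init__(self, generator, queue):
        self.generator=generator
        self.queue=queue

    def start(self):
        #Consume one item every time the main loop is idle.
        self.source_id=gobject.idle_add(self._step)

    def _step(self):
        try:
            self.queue.append(self.generator.next())
            return True   #reschedule the idle callback
        except StopIteration:
            return False  #done; remove the idle callback

    def interrupt(self):
        gobject.source_remove(self.source_id)

Calling start() returns immediately; the actual reading and parsing happens in small slices whenever the main loop has nothing better to do.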

Monday, June 15, 2009

Combining Multiprocessing And Threading

In Python, there are two ways to achieve concurrency within a given application: multiprocessing and threading. Concurrency, whether in a Python application or an application written in another language, often coincides with events taking place. These events can be expressed much more effectively in code when using an event framework. The basic need of a developer using such a framework is the ability to publish events; in turn, things happen in response to those events. What the developer most likely isn't concerned with is the concurrency semantics involved in running those event handlers. The circuits Python event framework takes care of this for the developer. What is interesting is how the framework chooses the concurrency method used: multiprocessing or threading.

With the multiprocessing approach, a new system process is created for each logical thread of control. This is beneficial on systems with more than one processor because the Python global interpreter lock isn't a concern, giving the application the potential to achieve true concurrency. With the threading approach, a new system thread, otherwise known as a lightweight process, is created for each logical thread of control. In applications using this approach, the Python global interpreter lock is a factor: on systems with more than one processor, true concurrency is not possible within the application itself. The good news is that both approaches can potentially be used inside a given application. An independent Python module exists for each method, and the abstractions inside these modules share nearly identical interfaces.

The circuits Python event framework uses an approach that will work with either the multiprocessing module or the threading module. The framework attempts to use the multiprocessing module in preference to the threading module. The approach to importing the required modules and defining the concurrency abstraction is illustrated below.
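The pattern looks roughly like the following (a sketch of the technique; circuits' actual module layout and names may differ):

#Prefer process-based concurrency, falling back to threads when the
#multiprocessing module isn't available. The two classes share nearly
#identical interfaces, so the alias is transparent to calling code.
try:
    from multiprocessing import Process
    HAS_MULTIPROCESSING=True
except ImportError:
    from threading import Thread as Process
    HAS_MULTIPROCESSING=False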



As you can see, the core Process abstraction within circuits is declared based on what modules exist on the system. If multiprocessing is available, it is used; otherwise, the threading module is used. The one drawback to this approach is that as long as the multiprocessing module is available, threads cannot be used, and threads may be preferable to processes in certain situations.

Thursday, May 28, 2009

Stateful Python Lists

The Python programming language defines list objects as one of the primitive language types. Lists are similar to arrays in more traditional programming languages. Of course, Python lists are mutable types, as are arrays. The main difference between Python lists and traditional arrays mirrors most other comparisons of Python to other languages: Python lists are more flexible and elegant. However, Python lists are not magic. One situation where Python lists can easily fall apart is when they are used in a multi-threaded environment. This isn't exclusive to lists; the same risk involved in passing data across threads applies to all types. The good news is that for lists this can be easily remedied. Python list instances support the notion of "pushing" and "pulling" data. This is a useful thought construct because it transfers directly to the design: any given list instance can be in a "pushing" state when adding data to the list or a "pulling" state when retrieving data from it.

The Set class in the boduch Python library is essentially a Python list that publishes events. The events published by Set instances may be handled in a multi-threaded environment. If Set instances are being both "pushed" to and "pulled" from across threads, this could be very dangerous: data that is expected to exist in the set when "pulled" may not have arrived yet. In this scenario, it would be useful to know what state the Set instance is in. By combining the Set and StateMachine classes as demonstrated below, we end up with a "stateful" Python list. The implementation of StatefulSet shown here isn't fully thread-safe because it is incomplete; however, the code used in the main program is thread-safe.
#example; Creating a multi-threaded Python list with a boduch Set.

#Necessary imports.
from boduch.event import threaded, subscribe, EventSetPush
from boduch.handle import Handle, HandleSetPush
from boduch.data import Set
from boduch.state import StateMachine
from boduch.type import is_type
from boduch import PRIORITY_MINOR, PRIORITY_CRITICAL

#A handle for Set.push() method invocations.
class HandlePreStatefulSetPush(Handle):
    #This gets a higher-than-critical priority since it is
    #intended to run before the core behavior.
    priority=PRIORITY_CRITICAL+1

    def __init__(self, *args, **kw):
        Handle.__init__(self, *args, **kw)

    def run(self):
        #Check if we are handling a StatefulSet event. If
        #so, the set goes into a pushing state.
        set_obj=self.get_event_data("set")
        if is_type(set_obj, "StatefulSet"):
            set_obj.change_state("PUSHING")

#A handle for Set.push() method invocations.
class HandlePostStatefulSetPush(Handle):
    #This handle gets a minor priority since it is
    #intended to run after all the pushing is complete.
    priority=PRIORITY_MINOR

    def __init__(self, *args, **kw):
        Handle.__init__(self, *args, **kw)

    def run(self):
        #Check if we are handling a StatefulSet event. If
        #so, the set goes into a ready state.
        set_obj=self.get_event_data("set")
        if is_type(set_obj, "StatefulSet"):
            set_obj.change_state("READY")

#A stateful version of the Set class.
class StatefulSet(StateMachine, Set):
    def __init__(self, *args, **kw):
        #Initialize the base classes and populate the
        #potential set states.
        StateMachine.__init__(self, *args, **kw)
        Set.__init__(self, *args, **kw)
        self.add_state("READY")
        self.add_state("PUSHING")

    def _push(self, data):
        #Change to a pushing state before actually pushing data.
        self.change_state("PUSHING")
        self.push(data)

    def _get(self, index):
        #Don't attempt to return anything while in a pushing state.
        #StateMachine instances compare equal to their current state.
        while self=="PUSHING":
            print "Hold on, still pushing data."
        return self[index]

#Subscribe the custom stateful handles.
subscribe(EventSetPush, HandlePreStatefulSetPush)
subscribe(EventSetPush, HandlePostStatefulSetPush)

#Main program.
if __name__=="__main__":
    #Enable threaded event mode.
    threaded(True)
    #Instantiate a stateful set.
    set_obj=StatefulSet()
    #Populate the set while retrieving data in a thread-safe manner.
    for i in range(0,10):
        set_obj._push("SET DATA"+str(i))
        print set_obj._get(i)

Tuesday, March 24, 2009

Why we need a thread-safe publish/subscribe event system

Publish-subscribe event systems are a fairly common design pattern in modern computing. The concept becomes increasingly powerful in distributed systems, where many nodes can subscribe to an event or topic emitted from a single node. The name publish-subscribe, or pub-sub, is used because it has a tight analogue in the real world: people with magazine or newspaper subscriptions receive updates when something is published. Because of this analogue, developers can more easily reason about events and why they occurred in complex software systems.

In any given software system, some code will need to react to one or more events. These events can range from something as simple as a mouse click to a complete database failure. The publish-subscribe pattern is highly extensible because any number of observers may subscribe to a single event, and subscriptions can also be canceled, so the architecture scales in both directions.

One bottleneck in a publish-subscribe framework occurs when the publishing object needs to wait until all subscribers have finished reacting to the event. In some cases this is unavoidable, such as when the publisher expects a value to be returned from one of the subscribers. In other cases, however, the publisher doesn't care about the subscribers or how they react to a published event. In a localized publish-subscribe system, that is, one that is not distributed, we could use threads for subscribers. If we were to build and use a framework such as this, where subscriptions react to events in separate threads of control, we would also need the ability to turn threading off and use the framework in the same way and have it still be functional. This is because threading is simply not an option in every scenario.
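To make the idea concrete, here is a minimal, generic sketch of such a system (illustrative only, not the boduch implementation described below): handlers subscribe to an event type, and a single flag decides whether they run inline or in new threads of control.

import threading

THREADED=False
subscribers={}

def subscribe(event_type, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event):
    for handler in subscribers.get(type(event), []):
        if THREADED:
            #Fire and forget; the publisher doesn't wait on handlers.
            threading.Thread(target=handler, args=(event,)).start()
        else:
            #Non-threaded mode: run in the publisher's thread.
            handler(event)

Because publish() is the single dispatch point, the same application code works unchanged in either mode.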

The boduch Python library offers a publish-subscribe event system such as this. The library is still in its infancy but has the ability to run subscription event handles in new threads of control. The threading capability can be switched on and off, and the same code using the library can run in either mode. Events are declared by specializing a base "Event" class. Likewise, event handles, or subscriptions, are declared by specializing a base "Handle" class. Developers can then subscribe to an event by passing an event class and a handle class to the subscribe function. Multiple handles may be listening to a given event type, and when running in threaded mode, each handle will start in a new thread of control. There is a limit on the number of threads that may run at a given time, but this can be adjusted either manually or programmatically. Whether running in threaded mode or not, published events may be specified as atomic. This only has an effect when the event system is running in threaded mode, because it forces all handles for that particular event to run in the publisher's thread; in non-threaded mode, atomic publications behave no differently from ordinary ones.

As mentioned earlier, there are several limitations to the boduch library since it is still in its infancy as a project. For instance, there is no way to specify a filter for event subscriptions; subscribers may want to react to event types based on data contained within the event instance. In turn, there is no way to track the source object that emitted the event. Finally, there is no real guarantee that proper ordering will be preserved when running in threaded mode, although this can be worked around. I haven't actually encountered a scenario where the ordering of instructions has been defective when running in threaded mode. This doesn't mean it isn't possible. I actually hope I do some day so I can incorporate more built-in safety in the library.

Wednesday, March 11, 2009

Interesting bug found in the boduch Python library

In the latest release of the boduch Python library, Set instances can now be iterated over. This is done by defining a custom iterator class, SetIterator, that is returned by the Set.__iter__() method. I thought I would further test out this new functionality in the hope that I would discover some new unit tests to include with the library. But before I could even get to the Set iteration testing, I discovered an entirely new bug in the Set class.

Firstly, here is the code I used to find the bug.
#Example; boduch Set bug.

from boduch.data import Set
from boduch.handle import Handle
from boduch.event import subscribe, threaded, EventSetPush

class MyHandle(Handle):
    def __init__(self, *args, **kw):
        Handle.__init__(self, *args, **kw)

    def run(self):
        pass

if __name__=="__main__":
    threaded(True)
    subscribe(EventSetPush, MyHandle)

    set_obj1=Set()
    set_obj2=Set()

    set_obj1.push("data1")
    set_obj2.push("data2")

    print "SET1",set_obj1.data
    print "SET2",set_obj2.data
Here, we define a custom event handle called MyHandle. The run() method doesn't actually do anything because I discovered the bug before I wrote any handling code. In the main program, we set the event manager to threaded mode. Next, we subscribe our custom event handle to the EventSetPush event. This means that every time Set.push() is invoked, so is MyHandle.run() (in a new thread, since we are running in threaded mode). We then create two set instances and push some data onto each. Finally, we print the underlying Python list associated with each set instance.

Here is my initial output.
SET1 ['data1', 'data2']
SET2 ['data1', 'data2']
Slightly different from what was expected: each set instance should have had one element. Instead, the lists look identical. Naturally, I assumed that they were the same list. This led me to start examining the thread manager, thinking that since I was testing in threaded mode, there must be some sort of cross-thread data contamination. Thankfully, the problem got much simpler once I was able to eliminate this as a potential cause. Next in line: the event manager. I tried everything to prove that the Set instances were in fact the same instance. Not so. The instances had different memory addresses.

I then realized that Set inherits from Type but the Type constructor was not invoked. Odd. I tried to think of a reason why I would want to inherit from something with no static functionality and not initialize it. I may have done this because the underlying list instances of Set objects are stored in an attribute called data, and Type instances also define a data attribute. I must have thought, during the original implementation, that defining a data attribute for the Set class would have some adverse effect on the Type functionality. Not so. So now the Type constructor is invoked, but with no parameters. This means that the initial value of the Set.data attribute is actually an empty dictionary, since that is what the Type constructor initializes it as. The Set constructor then initializes the data attribute to a list accordingly.

This, however, wasn't the problem either. I was still getting the strange output pointing so convincingly at the fact that the Set.data attribute referred to the same list instance. So I took a look at the way the data attribute is initialized for Set instances. The Set constructor accepts a data keyword parameter whose default value is an empty list; this parameter then becomes the Set.data attribute. Just for fun, I decided to take away this parameter and have the data attribute be initialized as an empty list inside the constructor.

Sure enough, that did it. I got the correct output for my two set instances. The data attribute must have been pointing at the same keyword parameter variable. I have a feeling that this may be caused somewhere in the event manager. Or maybe not. I haven't tested this scenario outside the library yet.
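In hindsight, this matches Python's well-known mutable default argument behavior: a default value is evaluated once, at function definition time, and then shared by every call that omits the argument. A minimal reproduction, independent of boduch, looks like this:

#The default list is created once and shared across instances.
class Set(object):
    def __init__(self, data=[]):
        self.data=data

set_obj1=Set()
set_obj2=Set()
set_obj1.data.append("data1")
print set_obj2.data   #['data1'] -- both instances share one list

#The usual fix is a None sentinel:
class FixedSet(object):
    def __init__(self, data=None):
        self.data=data if data is not None else []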

I'm going to get this release out as soon as possible. The data keyword parameter for the Set constructor will be removed for now. As a side note, this will also affect the iteration functionality for Hash instances in the next release, since the Hash.__iter__() method returns a SetIterator instance containing the hash keys. Each key will simply need to be pushed onto the set instead.

Monday, January 5, 2009

Logical time

In any parallel computer system, it is important that causal precedence be maintained. This is true in both multi-threaded or multi-process local systems and distributed systems. When the events that accomplish a single task are executed in parallel, there needs to be some sort of synchronization mechanism in place.

Logical time is traditionally how this is accomplished in distributed systems. Logical time allows several disparate processes to reason about causal precedence. For example, assume we have three distributed processes: p1, p2, and p3. The three processes are responsible for executing a task, t1. For simplicity, let's also assume that we know exactly how many events are required to accomplish our task: 10. Finally, we'll assume the same code to execute any possible event in our task is available at each process.

The first process might start off by assigning events to the other two processes, although the processes will most likely be specialized in that they carry out specific functionality. This may be considered a pseudo-event; the real events do not take place until this initial event has completed. We need a logical way of specifying the causal order: the progress of each individual process should be piggybacked along with each inter-process message.
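The classic realization of this idea is a Lamport clock (the post doesn't name a specific algorithm, so take this as one illustrative scheme): each process keeps a counter, increments it on local events, stamps outgoing messages with it, and advances past any larger timestamp it receives.

#A minimal Lamport clock. Piggybacking send() timestamps on messages
#lets receivers preserve causal precedence between processes.
class LamportClock(object):
    def __init__(self):
        self.time=0

    def tick(self):
        #A local event occurred.
        self.time+=1
        return self.time

    def send(self):
        #Timestamp to piggyback on an outgoing message.
        return self.tick()

    def receive(self, msg_time):
        #Advance past the sender's timestamp before recording
        #the receive event.
        self.time=max(self.time, msg_time)+1
        return self.time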

The same holds true for multi-threaded applications. Although they run on a single machine, they are logically distributed and therefore need the same causal precedence property. Multi-threaded applications typically use locks that allow only a single thread to access a critical section while locked. The problem is that this only prevents multiple threads from accessing the critical section simultaneously; locks do not prevent causal ordering errors. The same discipline of logical time needs to be taken into account even on single machines.

Tuesday, December 23, 2008

Python processing

I've discovered an interesting Python package called processing. The name may be a little misleading, but it is nonetheless useful. The reason it is misleading is that, from the developer API perspective, it is a threading API that mimics the built-in Python threading module. The main difference being, of course, that it uses processes instead of threads.

This comes in handy when dealing with the controversial global interpreter lock (although I have no problem with the GIL). On multi-core and multi-processor systems, multi-threaded Python applications do not distribute threads amongst these processors. This is because of the Python GIL, which is used to ensure safety with mutable data types.

However, new processes are distributed amongst processors. This is the issue the processing module aims to solve, and I recommend trying it out. I also issue a warning: spawning a new process is more computationally expensive than spawning a new thread, no matter what platform is in use. Threads remain the better fit for increasing application responsiveness, whereas longer-running algorithms may be better suited to a new process.
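Since processing mirrors the threading API, switching between the two is nearly a one-line change. A small sketch of that relationship (the processing package later evolved into Python's standard multiprocessing module):

#Process mimics threading.Thread; swap the import to switch models.
from processing import Process   #or: from threading import Thread as Process

def work(n):
    print "crunching", n

if __name__=="__main__":
    workers=[Process(target=work, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()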