Wednesday, July 7, 2010

XPath By Attribute

In Python, the ElementTree module is quite handy for parsing XML documents and strings. The module also has limited support for XPath query strings. This is also handy because it allows us to retrieve elements from the parsed XML tree without the need to traverse it manually.

However, trying to query by attribute doesn't work as expected, at least with the ElementTree version bundled with Python 2.6 and earlier. This is too bad because it would be really handy if it did. The following sample follows the expected XPath syntax, but raises a SyntaxError exception.

from xml.etree import ElementTree as ET

XML = """
<xml>
    <person>
        <name first="FirstName1" last="LastName1"/>
    </person>
    <person>
        <name first="FirstName2" last="LastName2"/>
    </person>    
</xml>
"""

if __name__ == '__main__':
   tree = ET.fromstring(XML)
   print(tree.find('.//person/name[@first="FirstName2"]'))
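
One way around the limitation, sketched here on the assumption that only a single attribute needs to match, is to query by path alone and then filter the results in Python. The find_by_attribute helper is a hypothetical name used for illustration; it is not part of ElementTree.

```python
from xml.etree import ElementTree as ET

XML = """
<xml>
    <person>
        <name first="FirstName1" last="LastName1"/>
    </person>
    <person>
        <name first="FirstName2" last="LastName2"/>
    </person>
</xml>
"""

def find_by_attribute(root, path, attr, value):
    """Return the first element under path whose attribute equals value."""
    # Hypothetical helper: the path itself contains no attribute predicate.
    for element in root.findall(path):
        if element.get(attr) == value:
            return element
    return None

if __name__ == '__main__':
    tree = ET.fromstring(XML)
    match = find_by_attribute(tree, './/person/name', 'first', 'FirstName2')
    print(match.get('last'))
```

ElementTree 1.3, which ships with Python 2.7 and later, accepts the [@first="FirstName2"] predicate directly, so the filter is mainly useful on older versions.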

Wednesday, March 24, 2010

Less Is More

When it comes to representations returned from a RESTful API, is there a preferred format? Is JSON better than XML? More often than not, JSON is the preferred representation format because Ajax web applications are usually making the API requests. And when it is some other application making the request, it is usually trivial to make use of the returned data.

What about returning HTML as a RESTful representation? Isn't it as useful as JSON or XML? I would like to think so, just as long as it is consistent and simple. For instance, returning HTML lists is probably just as easy for clients to parse as returning some other format. The added benefit is that HTML can be inserted directly into a browser. It can also be styled in any way imaginable.
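
As a rough illustration of how little work the client has to do, here is a sketch that reads an HTML list with ElementTree. The response body and the list_items name are made up for the example, and the approach assumes the API returns well-formed markup.

```python
from xml.etree import ElementTree as ET

# Hypothetical response body; assumes the markup is well-formed.
HTML = '<ul class="people"><li>FirstName1</li><li>FirstName2</li></ul>'

def list_items(markup):
    """Return the text of each <li> directly under the root element."""
    root = ET.fromstring(markup)
    return [item.text for item in root.findall('li')]

if __name__ == '__main__':
    print(list_items(HTML))
```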

Using HTML is less because it involves less work on the client end in most cases. It is more because it offers more flexibility.

Wednesday, October 14, 2009

Working With Elements

There is virtually no escaping XML markup in one form or another when building modern applications. Whether HTML, SOAP, or some other dialect, XML has become an important standard to support in applications, even if the application isn't a web application.

Thankfully, most programming languages have built-in library support for reading and manipulating XML. Some are better than others. For instance, the ElementTree Python package is probably the easiest library for developers to work with. It doesn't add unnecessary complexity on top of a simple standard.

The ElementTree package is now part of the growing set of standard Python modules included in the distribution. Since the module can be used to both read and write XML data, it cuts down on dependencies. Most XML libraries support both reading and writing, but like any other kind of library, a given one may do one well and not the other.

The following is an example of how easy it is to not only use the ElementTree package to build an XML document, but also to add abstractions around the elements that are created.

#Example; Abstracting elements.

#Do element tree imports.
from xml.etree.ElementTree import Element, tostring

#The base DOM element.
class Dom(object):
   def __init__(self, name, **kw):
      
       #Create the element and set the attributes.
       self._element=Element(name)
       for i in kw.keys():
           self._element.attrib[i]=kw[i]

   #Set an element attribute.
   def __setitem__(self, name, value):
       self._element.attrib[name]=value
      
   #Get an attribute.
   def __getitem__(self, name):
       return self._element.attrib[name]
  
   #Append a sub-element.
   def append(self, value):
       self._element.append(value._element)

#A specialized Dom class for accepting raw text content.        
class DomContent(Dom):
  
   #Constructor.
   def __init__(self, name, content=None, **kw):
       Dom.__init__(self, name, **kw)
       self._element.text=content

#Common HTML elements.
class Head(Dom):
   def __init__(self):
       Dom.__init__(self, "head")
      
class Title(DomContent):
   def __init__(self, content, **kw):
       DomContent.__init__(self, "title", content, **kw)
      
class Body(Dom):
   def __init__(self):
       Dom.__init__(self, "body")
      
class Div(DomContent):
   def __init__(self, content, **kw):
       DomContent.__init__(self, "div", content, **kw)

#The root document.
class Document(Dom):
   def __init__(self, title):
       Dom.__init__(self, "html")

       #Initialize the head, title, and body elements.
       self.head=Head()
       self.title=Title(title)
       self.body=Body()
      
       #Add the title element to the head element.
       self.head.append(self.title)
      
       #Add the head and body elements to the document.
       self.append(self.head)       
       self.append(self.body)
      
   #Actual output.
   def __str__(self):
       #Ask for a text string; tostring() returns bytes by default in Python 3.
       return tostring(self._element, encoding="unicode")

#Main.
if __name__=="__main__":

   #Initialize the document with a title.
   my_doc=Document("My Document")

   #Create a div with content and an attribute.
   my_div=Div("My Div", style="float: left;")

   #Add the div to the body.
   my_doc.body.append(my_div)

   #Display.
   print(my_doc)

In this example, we construct a simple HTML page. The Dom class is the topmost abstraction that we create around ElementTree, and it is meant to represent any HTML tag placed in the page. The Dom._element attribute holds the actual ElementTree element. The Dom constructor sets the element's attributes from the keyword parameters passed to it. The Dom.__getitem__() and Dom.__setitem__() methods allow element attributes to be retrieved and set, respectively. The Dom.append() method allows other Dom instances to be attached to the current instance as sub-elements.

The DomContent class is a simple specialization of the Dom class. It accepts an additional content parameter; otherwise, the class isn't really any different from its base class.

The Head, Title, Body, and Div classes are all standard HTML specializations of the Dom class. The main difference is that Title and Div inherit from DomContent instead of Dom because they support raw text content.

The Document class is a helper type of abstraction. It assembles Dom elements common in all HTML pages we might want to build. It is the Document class that makes the main program trivial to read and understand.
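
The main program never exercises the attribute access methods, so here is a short sketch of how they might be used. The Dom class below is a condensed copy, kept only so the snippet runs on its own; the "sidebar" value is just a placeholder.

```python
from xml.etree.ElementTree import Element, tostring

class Dom(object):
    """Condensed copy of the Dom class, so this sketch runs by itself."""

    def __init__(self, name, **kw):
        self._element = Element(name)
        for key, value in kw.items():
            self._element.attrib[key] = value

    def __setitem__(self, name, value):
        self._element.attrib[name] = value

    def __getitem__(self, name):
        # attrib is a dictionary, so it is indexed rather than called.
        return self._element.attrib[name]

if __name__ == "__main__":
    div = Dom("div", style="float: left;")
    div["id"] = "sidebar"      # Goes through Dom.__setitem__().
    print(div["style"])        # Goes through Dom.__getitem__().
    print(tostring(div._element, encoding="unicode"))
```

Asking for an attribute that was never set raises a KeyError, the same as with any dictionary.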

Monday, February 23, 2009

An argument against XML

My argument against using XML as a data format in certain situations is that it is too verbose. In other situations, however, the verbosity provided by XML is needed, such as when the data is meant for human consumption. This is why XML exists: it is easy to use and read by both humans and computers.

The verbosity problem with XML stems from the use of tags. Every entity represented in XML needs to be enclosed in a tag. The opening tag indicates that a new entity definition has started, and the ending tag indicates the end of that definition. For example, consider the following XML.

<person>
<name>adam</name>
</person>

This is a trivial example of a person entity with a single name attribute. Notice the duplication of the text "person" and "name" in the metadata. With XML this is required. However, tags may also have attributes. Our person definition could be expressed as follows.

<person name="adam"/>

Here there is no metadata duplication. But I think the second example negates the readability philosophy behind XML. What exactly is the difference between attributes and child entities in XML? Semantically, there is none. A child entity is still an attribute of the parent entity.

With JSON, there is no duplication of metadata or any confusion of how an entity is defined. This is because the JSON format is focused on lightweight data, not readability. For instance, here is our person in JSON.

{"person":{"name":"adam"}}

Now, if a person were reading this, the chances of them getting the meaning right are greatly reduced when compared to the XML equivalent. However, it is much less verbose in most cases, and verbosity counts when data is being transferred over a network. Another plus is that the XML is not lost: JSON can easily be converted to XML and back. So if JSON-formatted data must be edited by humans as XML, this is not difficult to achieve.
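
To give a sense of what such a conversion might look like, here is a minimal sketch. It assumes a single top-level key, string values, no XML attributes, no JSON lists, and no repeated child tags. All four function names are made up for the example.

```python
import json
from xml.etree.ElementTree import Element, fromstring, tostring

def value_to_element(tag, value):
    """Build an element named tag from a string or a nested dictionary."""
    element = Element(tag)
    if isinstance(value, dict):
        for child_tag, child_value in value.items():
            element.append(value_to_element(child_tag, child_value))
    else:
        element.text = str(value)
    return element

def element_to_value(element):
    """Return the element's text, or a dictionary of its children."""
    children = list(element)
    if not children:
        return element.text
    return dict((child.tag, element_to_value(child)) for child in children)

def json_to_xml(json_string):
    # Assumes exactly one top-level key, which becomes the root tag.
    (root_tag, root_value), = json.loads(json_string).items()
    return tostring(value_to_element(root_tag, root_value), encoding="unicode")

def xml_to_json(xml_string):
    root = fromstring(xml_string)
    return json.dumps({root.tag: element_to_value(root)})

if __name__ == "__main__":
    xml_string = json_to_xml('{"person": {"name": "adam"}}')
    print(xml_string)
    print(xml_to_json(xml_string))
```

A real converter would also need a convention for attributes and lists, which is where most of the design decisions are.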

Here is a simple Python demonstration of reducing the size of XML data with JSON.

#Example; XML string and JSON string

xml_string="""
<entry><title>mytitle</title><body>mybody</body></entry>
"""

json_string="""
{"entry":{"title":"mytitle","body":"mybody"}}
"""

if __name__=="__main__":
   #Compare the lengths of the two strings.
   print('XML Length:',len(xml_string))
   print('JSON Length:',len(json_string))
   pcent=float(len(json_string))/len(xml_string)*100
   print('JSON size as a percentage of XML size:',pcent,'%')

Finally, since XML is based on tags, there is no opportunity for sets of primitive types. For example, some client says to the server "give me a list of names and nothing else". The client will likely receive something along the lines of the following.

<list>
<item name="name1"/>
<item name="name2"/>
<item name="name3"/>
</list>

Here is the JSON alternative.

["name1", "name2", "name3"]
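
The difference also shows up on the client side. Here is a sketch of pulling the names out of each representation with the standard library; the function names are hypothetical.

```python
import json
from xml.etree import ElementTree as ET

XML_LIST = """<list>
<item name="name1"/>
<item name="name2"/>
<item name="name3"/>
</list>"""

JSON_LIST = '["name1", "name2", "name3"]'

def names_from_xml(xml_string):
    """Collect the name attribute of every item element."""
    root = ET.fromstring(xml_string)
    return [item.get('name') for item in root.findall('item')]

def names_from_json(json_string):
    """The parsed JSON value is already a list of names."""
    return json.loads(json_string)

if __name__ == "__main__":
    print(names_from_xml(XML_LIST))
    print(names_from_json(JSON_LIST))
```

With XML, the client has to know about the item tag and the name attribute. With JSON, the parsed value is the list itself.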

Tuesday, July 22, 2008

JSON or XML?

Which is the better format for data representation, JSON or XML? That is a tough one because there are several dimensions here that constitute "better". Some easy differences:

XML is easier to understand, and therefore better for humans to read.
JSON is more lightweight, and therefore better for software to read.

So how do you go about deciding which format to go with in the context of your application? If the data is constantly manipulated by humans, the answer is easy: XML. In this scenario, the lightweight nature of JSON simply doesn't pay off. If the data is seldom interpreted by humans but is transferred across a network (which it most likely will be), JSON would be the way to go.

Can the two data formats exist within the same context? Yes. Would it make sense to use both formats in the same application? Not likely. One exception I can think of is when using third-party libraries or tools that require one format or the other but you have already invested heavily in the other format. The two formats are isomorphic enough that there is no magic needed to convert between the two. It is just unnecessary if it can be avoided.

If I were to start developing an application from scratch today, and the choice between the two formats needed to be made, I would most likely choose JSON. There are few reasons to not use it. There exist JSON libraries for all the major programming languages. The only question is human interpretation. It would be pretty easy to build a JSON, tree-style, viewer/editor (I've never seen anything out there that does this, and only this). This JSON editor/viewer will be a topic of further discussion in later posts.
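
As a very rough starting point for the viewer half of that idea, here is a sketch that renders parsed JSON as an indented text tree. The tree_lines function and the "root" label are assumptions made for the example; this is nowhere near a full editor.

```python
import json

def tree_lines(value, label="root", depth=0):
    """Return indented lines describing a parsed JSON value as a tree."""
    indent = "  " * depth
    if isinstance(value, dict):
        lines = [indent + label]
        for key, child in value.items():
            lines.extend(tree_lines(child, key, depth + 1))
    elif isinstance(value, list):
        lines = [indent + label]
        for index, child in enumerate(value):
            lines.extend(tree_lines(child, "[%d]" % index, depth + 1))
    else:
        # Leaf values are shown in their JSON form next to their label.
        lines = [indent + "%s: %s" % (label, json.dumps(value))]
    return lines

if __name__ == "__main__":
    data = json.loads('{"entry": {"title": "mytitle", "tags": ["a", "b"]}}')
    print("\n".join(tree_lines(data)))
```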