As I indirectly mentioned in my previous post, I will be giving a presentation on SQL Server 2005 at the March .NUFW meeting. I have been spending a great deal of time preparing for this presentation and have most recently been studying the new XML capabilities including extensions to the FOR XML clause, the inclusion of the XML data type, and the native support for XQuery.
All of these enhancements seem exciting but I am at a loss as to why much of this is being natively supported in an RDBMS, especially one that allows for embedded CLR assemblies. Don’t get me wrong, I think that XML is a great technology and have used it many times but like everything else, it is a tool and needs to be used appropriately.
I fully understand wanting to manipulate XML as close to the database as possible to reduce the amount of network traffic typically required to build a dynamic XML document. As I mentioned above, SQL Server 2005 supports embedded CLR assemblies. One of the new features includes in version 2.0 of the .NET Framework is the SqlServer namespace that provides classes for querying and manipulating data against the context connection. This provides an easy and familiar mechanism for building and parsing XML documents inside of the database because the classes defined in the System.Xml namespace are accessible!
I even understand the need for an XML data type so that XML documents may be passed into or returned by stored procedures or user defined functions for parsing or building, respectively. What I do not understand is storing XML in a relational database!
All of the arguments I have read or heard for storing XML in a relational database fall into a couple of categories:
- XML is universal because it uses UNICODE.
- XML allows for native mapping of properties to objects, thus, it provides a simple mechanism for persisting and recovering the data related to an object.
These arguments have not convinced me that storing XML in a relational database is a good practice. The UNICODE argument falls apart immediately in that if UNICODE is needed, values can [typically] be stored natively in the database through the use of an NCHAR or NVARCHAR data type (SQL Server and Oracle support this). If natively storing the value as UNICODE is not an option, it can be converted later.
As for the second argument, I will agree that XML does allow for an easy mechanism for persisting and recovering object related data. This is also known as laziness! I would even argue that this approach is lazy even when the data is “guaranteed” (is anything ever certain?) not to change.
I have never seen an XML document that wasn’t able to be easily represented in a relational format. It is true that many documents require many tables but the point is that it can still be done. In many aspects, it is also more efficient to store data values individually spread across multiple tables than it is to store a single XML document (or fragment) in a single column.
Parsing the data values into individual columns across an appropriate number of tables provides many advantages to storing an entire XML document in a single column.
- When storing raw XML, a great deal of space is wasted just to store tags and attribute names. Depending on the structure of the XML, this can quickly add up since it will be extended across all of the rows in the table storing the XML.
- What if a single value needs to be changed? How about adding a new Item? If the XML document is large, this can be an expensive operation since the entire column value will need to be replaced. (It should be noted here that SQL Server 2005 does support a proprietary XML Data Manipulation Language (DML) but the XQuery specification does not currently support this functionality so it is likely to change in future versions.)
- What if a different view of the data is needed? Suppose the XML represents an article and only the title is needed? The entire document would need to be read and parsed to extract a single value. The title itself could be broken out into its own column but then other problems may exist:
- Multiple copies of the same data if the title is left in the XML. This is wasting space and adding complexity because the values need to be kept in sync.
- The title could be removed from the XML and stored in its own column. This is eliminating the purpose of storing the XML in the database in the first place because the title will either need to be added back into the XML for delivery to the client or two queries will need to be run!
- Another gripe about storing raw XML in a database is that it is difficult to query against. An example is if the XML column stores documents that describe books and a user wants to find all of the books written by “Bob”, either every column needs to be returned and searched using XPath or XQuery (SQL Server 2005 provides native support for XQuery but the syntax is an awkward combination of T-SQL and FLWOR)
- The XML specification does not require values to be strongly typed. Strongly typed data ensures consistency. Storing the values in individual columns that are appropriately typed will add an extra layer of protection to the data.
- Finally, a “typical” XML document does not represent a scalar value but rather, a complex type.
It is for these reasons that I do not support storing XML in a relational database even for document-centric type applications. As I mentioned earlier, I have no issue with XML itself. I believe that it is a great tool and deserves a place in every developer’s toolkit. Like all tools though, it needs to be used appropriately in order to be effective. If my logic is flawed or I have missed something, please feel free to try to change my mind.
Since I worked at an XML DBMS vendor before recently joining Microsoft, I have a somewhat different perspective. I agree that if it is easy to “shred” XML into the relational model, that’s generally best. BUT …”I have never seen an XML document that wasn’t able to be easily represented in a relational format.”It gets dicey with lots of recursion, very open content models, mixed content, and other document-like features. “When storing raw XML, a great deal of space is wasted just to store tags and attribute names.” and “The entire document would need to be read and parsed to extract a single value.”Modern DBMS such as SQL Server 2005 use an optimized, indexed internal format, not raw XML text. “Finally, a “typical” XML document does not represent a scalar value but rather, a complex type.” That’s why SQL Server 2005 stores XML in a native XML type designed to support XQuery. I wouldnt’ argue that it is clear how well this will all work in practice or how easily users will adapt to using XQuery to query the native XML datatype. We shall see what the actual best practices turn out to be. I do think that more regular the XML and SQL-friendly the queries, the better your approach will be. I think, however, that for people that have to deal with relatively unstructured XML that they get from random places on the Web rather than a single trusted application, the more people will find the XML datatype and XQuery quite useful.