I just remembered a discussion I had a few months ago about the xml encoding attribute, which got me thinking about the subject again. My original question was: "What's the use of the encoding attribute in an xml declaration?"
Let me explain briefly... If you have an <?xml version="1.0" encoding="UTF-8"?> declaration at the top of an xml document, this basically tells your xml parser that the document it's reading is stored in UTF-8 encoding. The encoding is the way a document (not limited to xml documents, this goes for any text document) is actually stored or streamed, which means it defines how the ones and zeroes make up a readable text document. There are lots of possible encodings and you can read books the size of a small building about the subject, but that doesn't really matter now. UTF-8 and UTF-16 are the defaults for xml documents by the way, so if you omit the attribute, the parser should check for a Byte Order Mark (BOM) at the beginning of the file to determine which of the two the file is encoded in; without a BOM it should assume UTF-8. This implies that every xml parser on the planet (and beyond, if SETI@home finds anything my guess is that it'll be an alien xml document) should at least understand UTF-8 and UTF-16 encoding.
Anyway, my point was: by the time the parser encounters the encoding attribute in the xml declaration, it's already reading the file, so it must already be assuming some kind of encoding. To me this sounds like saying "This sentence is written in English" or shouting to people that they need ears to understand you (well, unless they can lip-read of course). Or to quote Blackadder III: "It's the most pointless book since 'How to Learn French' was translated into French." What I understood from my discussion on the subject is that this shouldn't be a problem, for two possible reasons:
- All the characters in an xml declaration like the one above are very basic (i.e. they're all plain ASCII) and will be the same bytes for all the different encodings out there. I find this hard to believe (what about little and big endian differences?) but it shouldn't be too hard to check.
- Special bytes (such as the BOM) at the beginning of a file already indicate the encoding used. But in this case the encoding attribute doesn't seem to have any value at all anymore.
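The first reason is indeed easy to check. Below is a minimal sketch that encodes a typical declaration in a handful of encodings and compares the bytes. The declaration string, the list of candidate encodings and the helper names are my own picks for illustration; "cp037" is just one EBCDIC flavor that happens to ship with Python.

```python
# A typical declaration, used here as a stand-in (assumption).
DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'

# A few ASCII-compatible encodings, plus some that are not.
# "cp037" is one EBCDIC flavor available in Python's codec registry.
CANDIDATES = ["ascii", "utf-8", "iso-8859-1", "cp1252",
              "utf-16-le", "utf-16-be", "cp037"]


def first_bytes(text, encoding, count=4):
    """Return the first `count` bytes of `text` in `encoding`, as hex."""
    return text.encode(encoding)[:count].hex(" ")


def same_bytes_as_ascii(text, encoding):
    """True if `text` encodes to exactly the same bytes as in plain ASCII."""
    return text.encode(encoding) == text.encode("ascii")


if __name__ == "__main__":
    for name in CANDIDATES:
        print(f"{name:12} {first_bytes(DECLARATION, name)}  "
              f"same as ASCII: {same_bytes_as_ascii(DECLARATION, name)}")
```

So the claim holds for encodings that are supersets of ASCII (UTF-8, ISO-8859-1 and friends), but not for UTF-16, where every character takes two bytes and the byte order matters, nor for EBCDIC.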
Another use could be that you could switch to a different encoding after the xml declaration, indicating in the encoding attribute how the rest of the document will be encoded. But why would you want to do that? I can understand that you would like to set a different encoding for some element in an xml document, but what's the point of having the entire document after the xml declaration in another encoding than the declaration itself?
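Even the xml 1.0 specification admits that, in the general case, autodetecting the encoding is a hopeless situation, although it then argues that things are not entirely hopeless for xml and provides some (non-normative) hints as to what a parser should do to determine the right encoding.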
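As far as I can tell, those hints boil down to a two-step trick: peek at the first few bytes to guess the encoding family (a BOM, or the way "<?xm" looks in that family), then use that guess to read just enough of the declaration to find the encoding name. Here is a rough sketch of the idea, not a conforming implementation. The function names are made up, the byte patterns are the ones I remember from the specification's appendix on autodetection, and mapping "EBCDIC in some flavor" to Python's "cp037" codec is an assumption on my part.

```python
import codecs
import re

# Step 1a: byte order marks. UTF-32 must come before UTF-16, because the
# UTF-32 little-endian BOM starts with the UTF-16 little-endian BOM.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Step 1b: no BOM, so look at how the start of "<?xml" is laid out.
PREFIX_TABLE = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii"),             # some ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "cp037"),  # EBCDIC; cp037 is an assumed flavor
]

ENCODING_DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_family(data):
    """Guess a provisional codec; return (codec name, bytes to skip)."""
    for bom, codec in BOM_TABLE:
        if data.startswith(bom):
            return codec, len(bom)
    for prefix, codec in PREFIX_TABLE:
        if data.startswith(prefix):
            return codec, 0
    return "utf-8", 0  # no BOM and no recognizable declaration


def declared_encoding(data):
    """Return (provisional codec, encoding named in the declaration or None)."""
    codec, skip = sniff_family(data)
    # Step 2: decode only the head of the file with the provisional codec.
    head = data[skip:skip + 200].decode(codec, errors="replace")
    match = ENCODING_DECL.match(head)
    return codec, (match.group(1) if match else None)
```

For a file that starts with a plain ASCII-looking declaration naming ISO-8859-1, declared_encoding() would report the provisional "ascii" family together with the declared name, and a real parser would then switch to the declared encoding for the rest of the document.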
So still, I don't see the real use of the encoding attribute directly embedded into the document. Especially since changing the encoding means changing the document content - which in my mind are two very unrelated things which shouldn't be affecting each other. The encoding says something about the document and therefore it is metadata. Some ("internal") metadata certainly belongs with the document itself (e.g. the date and time a picture was taken gets nicely baked into a jpeg file), while other ("external") metadata is very context-sensitive (e.g. a document's encoding). You can store an xml document on disk as Unicode (UTF-16, say), but choose to send it ASCII encoded over a wire to reduce bandwidth. That doesn't change the semantics of the xml document.
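To make the "changing the encoding means changing the content" complaint concrete, here is a small sketch using Python's standard xml.etree.ElementTree parser. The sample document and the transcode() helper are hypothetical, and the helper only handles a declaration written exactly as encoding="...".

```python
import xml.etree.ElementTree as ET

# Sample document (made up for this example); \u00e9 is "e" with an acute accent.
TEXT = '<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\u00e9</name>'

# Stored the way the declaration says: all is well.
stored = TEXT.encode("iso-8859-1")
original = ET.fromstring(stored).text

# Naive transcoding: the bytes change, the declaration does not,
# so the parser decodes UTF-8 bytes as if they were ISO-8859-1.
naive = stored.decode("iso-8859-1").encode("utf-8")
garbled = ET.fromstring(naive).text


def transcode(data, source, target):
    """Re-encode `data` and rewrite the declared encoding to match.

    Hypothetical helper: it assumes the declaration literally contains
    encoding="<source>" with double quotes.
    """
    text = data.decode(source)
    text = text.replace(f'encoding="{source}"', f'encoding="{target}"', 1)
    return text.encode(target)


# Careful transcoding: the document content has to be edited as well.
careful = transcode(stored, "ISO-8859-1", "UTF-8")
restored = ET.fromstring(careful).text
```

Here original and restored both hold the intended text, while garbled does not: the only way to re-encode the file safely is to touch the document itself, which is exactly the coupling I'm complaining about.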
I guess this is just another example of what Joel Spolsky means by "leaky abstractions". We should hide the way a file is physically stored from upstream layers but still the parser somehow needs to know how to read the bytes...
To a certain extent, filesystems already allow this type of metadata but it is very limited. For example, you can tag a file as readonly in most filesystems but you cannot define your own tags. Maybe this system should be extended to provide some generic options to associate metadata with files. If the rumors are true that the new WinFS filesystem in Windows "Longhorn" will be using SQL Server "Yukon" as a backend, it should be a breeze to allow more generic attributes at the file or directory level. At the wire level, the TCP/IP protocols define headers which provide out-of-band metadata, but these headers are also fixed. Layers that build upon this stack, such as HTTP, define their own headers, but the borderline between in-band and out-of-band metadata seems to become quite unclear.
Metadata is ubiquitous, but it has its own set of problems. Maybe it's time somebody pinned down the way we should be handling "external" metadata from now on - taking the context into account. For example, encoding makes sense when storing bytes to disk or transmitting them over the wire but generally not when manipulating the data semantically from within a programming language. A readonly flag has a use for a file on disk but not when it's attached to an email. Windows Access Control Lists (ACLs) defining which user has access to which files and directories make no sense when the file has left the supervision of the domain controller. If I look at Attributes in .NET I see this sort of metadata standardisation has been done beautifully within a programming environment - why not expand this to other computing domains?