Standard for exchanging file: URIs
==================================

Rationale
---------

URIs are now pervasive on the desktop. All the major desktops have file handling systems that use URIs instead of pathnames so that they can refer to files not accessible through the normal UNIX file system. The URIs used in these systems are mostly based on the RFCs specifying the core URI mechanism and its various protocol versions. However, there are sometimes extensions for new protocols that aren't standardized yet, and sometimes the standards aren't clear on some details.

URIs are passed between applications in various ways, such as drag and drop, cut and paste, and command line arguments. To be interoperable, such URIs need some standardization. Many hope that eventually we will have a common standard and perhaps even a common implementation. At the very least, however, we need a strict definition of how to specify URIs for absolute local filenames when exchanging them between applications. This document gives such a specification.

URI standards
-------------

The specifications for file: URIs, RFC2396[1] and RFC1738[2], say that file URIs are of the form:

  file://<hostname>/<path>

where the hostname and path parts can contain a limited subset of ASCII characters, representing their ASCII values, with any other bytes escaped using a % followed by a two-digit hex value. As a special case, the hostname part can be "localhost" or empty, meaning the machine the URI is being interpreted on.

Given a URI like this, we can unescape it into a hostname and a string of octets (of undefined encoding), which maps 1:1 to a UNIX filename.

UNIX filenames
--------------

An absolute filename in UNIX is a string of filenames separated by, and starting with, a '/'. The filenames can contain any byte values except 0 and '/'.
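
As a rough illustration of the unescaping step described under "URI standards", the following Python sketch splits a file URI into a hostname and the raw octets of the filename. The helper name split_file_uri is hypothetical and is not defined by this document; the sketch assumes the URI has both a hostname part (possibly empty) and a path part.

```python
from urllib.parse import unquote, unquote_to_bytes


def split_file_uri(uri: str) -> tuple[str, bytes]:
    """Split a file URI into (hostname, raw filename bytes).

    Hypothetical helper for illustration; not defined by this document.
    """
    prefix = "file://"
    if not uri.startswith(prefix):
        raise ValueError("expected a URI starting with 'file://'")
    # Everything up to the next '/' is the hostname (possibly empty).
    hostname, slash, rest = uri[len(prefix):].partition("/")
    if not slash:
        raise ValueError("file URI has no path part")
    # Each %XX escape becomes a single byte. The result has no defined
    # encoding, so it is kept as bytes rather than decoded to text.
    filename = unquote_to_bytes("/" + rest)
    if b"\0" in filename:
        raise ValueError("a UNIX filename cannot contain a zero byte")
    return unquote(hostname), filename


hostname, filename = split_file_uri("file://localhost/tmp/r%E9sum%E9.txt")
print(hostname, filename)
```

Keeping the filename as bytes means it can be handed unchanged to the usual file functions, which in Python also accept byte-string paths.
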

There is no specified encoding for filenames. Although we hope that eventually all filenames will be encoded in UTF-8, we can't rely on this, because then we would be unable to, for example, rename a file with a misencoded filename.

file: URIs on UNIX
------------------

Since each desktop has to have a way to generate displayable versions of filenames (this generally means somehow generating Unicode for them), we can rely on support for that in the platform. The internal form of the file reference (the URI) must always be convertible to the original UNIX byte string so that we can operate on the file, so the display form of the filename should be generated only at the last moment, when displaying it.

This gives us the following definition for file: URIs that are to be exchanged with other apps:

File URIs are of the form "file://<hostname>/<path>", where hostname can be empty, with all non-allowed bytes escaped, containing no escaped '/' or zero bytes. The unescaped byte string is not supposed to be interpreted in any way, and is not in a specified encoding. It corresponds exactly to the filename as used in UNIX system calls. If you need to display the unescaped filename, that should be handled the same way you display normal filenames.

Hostnames
---------

When generating a file: URI, the hostname part, if nonempty, should be whatever is returned from gethostname(). This means that the name is canonical for all users on the same machine, so that you can easily see if the referenced file is on the current machine. Note that "localhost" or an empty hostname needs to be handled specially, always meaning the host the URI is being interpreted on.

Backwards compatibility:
------------------------

Some current apps generate URIs of the form "file:/<path>". These are not correct according to RFC1738, so they should not be generated. However, for backwards compatibility, it is recommended that such URIs be interpreted as file URIs with an empty hostname.
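
The rules above for generating and interpreting file: URIs can be sketched in Python as follows. This is only an illustration under stated assumptions: the names filename_to_uri, uri_to_filename and is_local_host are hypothetical, and the set of bytes left unescaped is whatever urllib.parse.quote_from_bytes treats as safe, which is a conservative choice rather than something this document prescribes.

```python
import socket
from urllib.parse import quote_from_bytes, unquote_to_bytes


def filename_to_uri(filename: bytes, include_hostname: bool = False) -> str:
    """Build a file: URI from the exact bytes used in UNIX system calls."""
    if not filename.startswith(b"/") or b"\0" in filename:
        raise ValueError("need an absolute filename without zero bytes")
    # A nonempty hostname should be whatever gethostname() returns.
    hostname = socket.gethostname() if include_hostname else ""
    # Escape the raw bytes directly; no locale or UTF-8 conversion is done.
    # '/' is kept literal because it only ever separates path components.
    return "file://" + hostname + quote_from_bytes(filename, safe="/")


def uri_to_filename(uri: str) -> tuple[str, bytes]:
    """Return (hostname, filename bytes) for a file: URI."""
    if uri.startswith("file://"):
        hostname, slash, path = uri[len("file://"):].partition("/")
        if not slash:
            raise ValueError("file URI has no path part")
    elif uri.startswith("file:/"):
        # Legacy form with a single slash: treat as an empty hostname.
        hostname, path = "", uri[len("file:/"):]
    else:
        raise ValueError("not a file: URI")
    # Unescape one component at a time so that an escaped '/' or an
    # escaped zero byte can be detected and rejected.
    components = [unquote_to_bytes(part) for part in path.split("/")]
    if any(b"/" in c or b"\0" in c for c in components):
        raise ValueError("escaped '/' or zero byte in file URI")
    return hostname, b"/" + b"/".join(components)


def is_local_host(hostname: str) -> bool:
    """An empty hostname or "localhost" always means the interpreting host."""
    return hostname in ("", "localhost") or hostname == socket.gethostname()


uri = filename_to_uri(b"/tmp/caf\xe9 menu.txt")
print(uri)
print(uri_to_filename(uri))
print(uri_to_filename("file:/tmp/x"))
```

Because the conversion works on bytes in both directions, a filename that is not valid in any particular encoding still survives the round trip unchanged.
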

Some current apps generate file URIs by converting the filename from whatever locale the application runs in to UTF-8. This behavior means that URIs can't be converted back to filenames without knowing the locale of the application that produced them, and that not all valid filenames can be converted to URIs. Such behavior is not allowed and should be changed.

[1] http://www.ietf.org/rfc/rfc2396.txt
[2] http://www.ietf.org/rfc/rfc1738.txt