1563 | 1563 | </style> |
1564 | 1564 | <meta content="Bikeshed version 4416b18d5, updated Tue Jan 2 15:52:39 2024 -0800" name="generator"> |
1565 | 1565 | <link href="https://isocpp.org/favicon.ico" rel="icon"> |
1566 | | - <meta content="d43ff0c47551e157762ce58dcadb6301e7929605" name="revision"> |
| 1566 | + <meta content="13d7c997af9d4ef75ba4d5addd710fe9c58a4268" name="revision"> |
1567 | 1567 | <style>/* Boilerplate: style-autolinks */ |
1568 | 1568 | .css.css, .property.property, .descriptor.descriptor { |
1569 | 1569 | color: var(--a-normal-text); |
@@ -2138,17 +2138,54 @@ <h2 class="heading settled" data-level="2" id="motivation"><span class="secno">2 |
2138 | 2138 | </blockquote> |
2139 | 2139 | <p>Arbitrary paths are formatted on POSIX such that there is no data loss. |
2140 | 2140 | Unfortunately this is not the case on Windows, for example:</p> |
2141 | | -<pre class="language-c++ highlight"><c- k>auto</c-> <c- n>p1</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xd800</c-><c- s>"</c-><c- p>);</c-> <c- c1>// a lone surrogate</c-> |
2142 | | -<c- k>auto</c-> <c- n>p2</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xd801</c-><c- s>"</c-><c- p>);</c-> <c- c1>// another lone surrogate</c-> |
| 2141 | +<pre class="language-c++ highlight"><c- k>auto</c-> <c- n>p1</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD800</c-><c- s>"</c-><c- p>);</c-> <c- c1>// a lone surrogate</c-> |
| 2142 | +<c- k>auto</c-> <c- n>p2</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD801</c-><c- s>"</c-><c- p>);</c-> <c- c1>// another lone surrogate</c-> |
2143 | 2143 | <c- k>auto</c-> <c- n>s1</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>p1</c-><c- p>);</c-> <c- c1>// s1 == "�"</c-> |
2144 | 2144 | <c- k>auto</c-> <c- n>s2</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>p2</c-><c- p>);</c-> <c- c1>// s2 == "�"</c-> |
2145 | 2145 | </pre> |
2146 | 2146 | <p>Apart from being inconsistent between platforms, this makes it impossible to |
2147 | 2147 | reliably round trip paths. For example, <code class="highlight"><c- n>p1</c-></code> and <code class="highlight"><c- n>p2</c-></code> above are two distinct |
2148 | 2148 | paths that are formatted as the same string. This may result in silent data |
2149 | 2149 | loss and is markedly different from other standard formatters such as the ones |
2150 | | -for floating point numbers which are specifically designed to allow round trip.</p> |
| 2150 | +for floating point numbers which are specifically designed to round trip.</p> |
| 2151 | + <p>For comparison, on POSIX, formatting of arbitrary paths, including ones that |
| 2152 | +are not valid Unicode, works as expected and is lossless:</p> |
| 2153 | +<pre class="language-c++ highlight"><c- k>auto</c-> <c- n>p</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c-><c- s>"</c-><c- se>\x80</c-><c- s>"</c-><c- p>);</c-> |
| 2154 | +<c- k>auto</c-> <c- n>s</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>p</c-><c- p>);</c-> <c- c1>// s == "\x80"</c-> |
| 2155 | +</pre> |
2151 | 2156 | <h2 class="heading settled" data-level="3" id="proposal"><span class="secno">3. </span><span class="content">Proposal</span><a class="self-link" href="#proposal"></a></h2> |
| 2157 | + <p>This paper proposes preventing data loss by formatting ill-formed |
| 2158 | +UTF-16 paths using WTF-8 (Wobbly Transformation Format − 8-bit), which is |
| 2159 | +"a superset of UTF-8 that can losslessly represent arbitrary sequences of |
| 2160 | +16-bit code unit (even if ill-formed in UTF-16) but preserves the other |
| 2161 | +well-formedness constraints of UTF-8." (<a data-link-type="biblio" href="#biblio-wtf" title="The WTF-8 encoding">[WTF]</a>)</p> |
| 2162 | + <table> |
| 2163 | + <tbody> |
| 2164 | + <tr> |
| 2165 | + <th>Code |
| 2166 | + <th>Before |
| 2167 | + <th>After |
| 2168 | + <tr> |
| 2169 | + <td> |
| 2170 | +<pre class="language-c++ highlight"><c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD800</c-><c- s>"</c-><c- p>));</c-> |
| 2171 | +</pre> |
| 2172 | + <td> |
| 2173 | +<pre class="highlight"><c- s>"�"</c-> |
| 2174 | +</pre> |
| 2175 | + <td> |
| 2176 | +<pre class="highlight"><c- s>"</c-><c- se>\xED\xA0\x80</c-><c- s>"</c-> |
| 2177 | +</pre> |
| 2178 | + <tr> |
| 2179 | + <td> |
| 2180 | +<pre class="language-c++ highlight"><c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD801</c-><c- s>"</c-><c- p>));</c-> |
| 2181 | +</pre> |
| 2182 | + <td> |
| 2183 | +<pre class="highlight"><c- s>"�"</c-> |
| 2184 | +</pre> |
| 2185 | + <td> |
| 2186 | +<pre class="highlight"><c- s>"</c-><c- se>\xED\xA0\x81</c-><c- s>"</c-> |
| 2187 | +</pre> |
| 2188 | + </table> |
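The byte sequences in the "After" column can be reproduced with a small transcoder. The following is a minimal sketch, not the paper's wording or any library's implementation: `to_wtf8` is a hypothetical helper name, and it takes `char16_t` input (rather than `wchar_t`) so that it behaves the same on every platform. Well-formed surrogate pairs are combined as in ordinary UTF-8 transcoding, while a lone surrogate is written with the generalized three-byte form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (illustration only): converts potentially ill-formed
// UTF-16 to WTF-8. Lone surrogates are kept instead of becoming U+FFFD.
std::string to_wtf8(std::u16string_view in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = in[i];
    // A lead surrogate followed by a trail surrogate forms one
    // supplementary code point, exactly as in well-formed UTF-16.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
        in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
      ++i;  // the trail surrogate has been consumed
    }
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      // Unpaired surrogates (U+D800..U+DFFF) end up here and are encoded
      // like any other three-byte code point.
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Under these assumptions, the two lone surrogates from the table map to distinct byte sequences (`ED A0 80` and `ED A0 81`), so the two paths no longer collapse into the same formatted string.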
2152 | 2189 | <p>TODO</p> |
2153 | 2190 | </main> |
2154 | 2191 | <script> |
|