Commit 3625315

Update paper
1 parent 13d7c99 commit 3625315

2 files changed: +94 -7 lines changed

papers/p3904.bs

Lines changed: 53 additions & 3 deletions
@@ -49,8 +49,8 @@ Arbitrary paths are formatted on POSIX such that there is no data loss.
 Unfortunately this is not the case on Windows, for example:

 ```c++
-auto p1 = std::filesystem::path(L"\xd800"); // a lone surrogate
-auto p2 = std::filesystem::path(L"\xd801"); // another lone surrogate
+auto p1 = std::filesystem::path(L"\xD800"); // a lone surrogate
+auto p2 = std::filesystem::path(L"\xD801"); // another lone surrogate
 auto s1 = std::format("{}\n", p1); // s1 == "�"
 auto s2 = std::format("{}\n", p2); // s2 == "�"
 ```
@@ -59,10 +59,60 @@ Apart from being inconsistent between platforms, this makes it impossible to
 reliably round trip paths. For example, `p1` and `p2` above are two distinct
 paths that are formatted as the same string. This may result in a silent data
 loss and is remarkably different from other standard formatters such as the ones
-for floating point numbers which are specifically designed to allow round trip.
+for floating point numbers which are specifically designed to round trip.
+
+For comparison, on POSIX formatting of arbitrary paths including the ones that
+are not valid Unicode works as expected and is lossless:
+
+```c++
+auto p = std::filesystem::path("\x80");
+auto s = std::format("{}\n", p); // s == "\x80"
+```

 # Proposal # {#proposal}

+The current paper proposes preventing data loss and formatting ill-formed
+UTF-16 paths using WTF-8 (Wobbly Transformation Format − 8-bit) which is
+"a superset of UTF-8 that can losslessly represent arbitrary sequences of
+16-bit code unit (even if ill-formed in UTF-16) but preserves the other
+well-formedness constraints of UTF-8." ([[WTF]])
+
+<table>
+<tr>
+<th>Code
+<th>Before
+<th>After
+</tr>
+<tr>
+<td>
+```c++
+std::format("{}\n", std::filesystem::path(L"\xD800"));
+```
+<td>
+```
+"�"
+```
+<td>
+```
+"\xED\xA0\x80"
+```
+</tr>
+<tr>
+<td>
+```c++
+std::format("{}\n", std::filesystem::path(L"\xD801"));
+```
+<td>
+```
+"�"
+```
+<td>
+```
+"\xED\xA0\x81"
+```
+</tr>
+</table>
+
 TODO

 <pre class=biblio>
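
As a rough illustration of the encoding proposed in the `.bs` change above, the following sketch converts a sequence of UTF-16 code units to WTF-8. The names `append_wtf8`, `to_wtf8`, and `wtf8_examples` are hypothetical and do not come from the paper. The sketch takes `char16_t` input rather than `wchar_t` so that it behaves the same on every platform; that choice is an assumption of this example, not something the paper specifies.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (not from the paper): appends one code point using the
// UTF-8 byte layout. Unlike strict UTF-8, surrogate code points are accepted.
inline void append_wtf8(std::string& out, std::uint32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    // Lone surrogates (U+D800..U+DFFF) take this three-byte branch.
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper (not from the paper): converts possibly ill-formed
// UTF-16 to WTF-8. Valid surrogate pairs are combined into one code point;
// unpaired surrogates are encoded on their own instead of being replaced.
inline std::string to_wtf8(std::u16string_view in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    std::uint32_t cp = in[i];
    const bool is_lead = cp >= 0xD800 && cp <= 0xDBFF;
    if (is_lead && i + 1 < in.size()) {
      const std::uint32_t trail = in[i + 1];
      if (trail >= 0xDC00 && trail <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (trail - 0xDC00);
        ++i;  // the trail surrogate has been consumed
      }
    }
    append_wtf8(out, cp);
  }
  return out;
}

// The two rows of the "After" column in the table above.
inline void wtf8_examples() {
  assert(to_wtf8(u"\xD800") == "\xED\xA0\x80");
  assert(to_wtf8(u"\xD801") == "\xED\xA0\x81");
  // Distinct inputs now produce distinct outputs, so nothing is lost.
  assert(to_wtf8(u"\xD800") != to_wtf8(u"\xD801"));
}
```

In this sketch, well-formed UTF-16 input yields ordinary UTF-8, because paired surrogates are merged before encoding. Only unpaired surrogates produce the three-byte generalized form, which is what keeps `p1` and `p2` distinguishable after formatting.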

papers/p3904.html

Lines changed: 41 additions & 4 deletions
@@ -1563,7 +1563,7 @@
 </style>
 <meta content="Bikeshed version 4416b18d5, updated Tue Jan 2 15:52:39 2024 -0800" name="generator">
 <link href="https://isocpp.org/favicon.ico" rel="icon">
-<meta content="d43ff0c47551e157762ce58dcadb6301e7929605" name="revision">
+<meta content="13d7c997af9d4ef75ba4d5addd710fe9c58a4268" name="revision">
 <style>/* Boilerplate: style-autolinks */
 .css.css, .property.property, .descriptor.descriptor {
 color: var(--a-normal-text);
@@ -2138,17 +2138,54 @@ <h2 class="heading settled" data-level="2" id="motivation"><span class="secno">2
 </blockquote>
 <p>Arbitrary paths are formatted on POSIX such that there is no data loss.
 Unfortunately this is not the case on Windows, for example:</p>
-<pre class="language-c++ highlight"><c- k>auto</c-> <c- n>p1</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xd800</c-><c- s>"</c-><c- p>);</c-> <c- c1>// a lone surrogate</c->
-<c- k>auto</c-> <c- n>p2</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xd801</c-><c- s>"</c-><c- p>);</c-> <c- c1>// another lone surrogate</c->
+<pre class="language-c++ highlight"><c- k>auto</c-> <c- n>p1</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD800</c-><c- s>"</c-><c- p>);</c-> <c- c1>// a lone surrogate</c->
+<c- k>auto</c-> <c- n>p2</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD801</c-><c- s>"</c-><c- p>);</c-> <c- c1>// another lone surrogate</c->
 <c- k>auto</c-> <c- n>s1</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>p1</c-><c- p>);</c-> <c- c1>// s1 == "�"</c->
 <c- k>auto</c-> <c- n>s2</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>p2</c-><c- p>);</c-> <c- c1>// s2 == "�"</c->
 </pre>
 <p>Apart from being inconsistent between platforms, this makes it impossible to
 reliably round trip paths. For example, <code class="highlight"><c- n>p1</c-></code> and <code class="highlight"><c- n>p2</c-></code> above are two distinct
 paths that are formatted as the same string. This may result in a silent data
 loss and is remarkably different from other standard formatters such as the ones
-for floating point numbers which are specifically designed to allow round trip.</p>
+for floating point numbers which are specifically designed to round trip.</p>
+<p>For comparison, on POSIX formatting of arbitrary paths including the ones that
+are not valid Unicode works as expected and is lossless:</p>
+<pre class="language-c++ highlight"><c- k>auto</c-> <c- n>p</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c-><c- s>"</c-><c- se>\x80</c-><c- s>"</c-><c- p>);</c->
+<c- k>auto</c-> <c- n>s</c-> <c- o>=</c-> <c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>p</c-><c- p>);</c-> <c- c1>// s == "\x80"</c->
+</pre>
 <h2 class="heading settled" data-level="3" id="proposal"><span class="secno">3. </span><span class="content">Proposal</span><a class="self-link" href="#proposal"></a></h2>
+<p>The current paper proposes preventing data loss and formatting ill-formed
+UTF-16 paths using WTF-8 (Wobbly Transformation Format − 8-bit) which is
+"a superset of UTF-8 that can losslessly represent arbitrary sequences of
+16-bit code unit (even if ill-formed in UTF-16) but preserves the other
+well-formedness constraints of UTF-8." (<a data-link-type="biblio" href="#biblio-wtf" title="The WTF-8 encoding">[WTF]</a>)</p>
+<table>
+<tbody>
+<tr>
+<th>Code
+<th>Before
+<th>After
+<tr>
+<td>
+<pre class="language-c++ highlight"><c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD800</c-><c- s>"</c-><c- p>));</c->
+</pre>
+<td>
+<pre class="highlight"><c- s>"�"</c->
+</pre>
+<td>
+<pre class="highlight"><c- s>"</c-><c- se>\xED\xA0\x80</c-><c- s>"</c->
+</pre>
+<tr>
+<td>
+<pre class="language-c++ highlight"><c- n>std</c-><c- o>::</c-><c- n>format</c-><c- p>(</c-><c- s>"{}</c-><c- se>\n</c-><c- s>"</c-><c- p>,</c-> <c- n>std</c-><c- o>::</c-><c- n>filesystem</c-><c- o>::</c-><c- n>path</c-><c- p>(</c->L<c- s>"</c-><c- se>\xD801</c-><c- s>"</c-><c- p>));</c->
+</pre>
+<td>
+<pre class="highlight"><c- s>"�"</c->
+</pre>
+<td>
+<pre class="highlight"><c- s>"</c-><c- se>\xED\xA0\x81</c-><c- s>"</c->
+</pre>
+</table>
 <p>TODO</p>
 </main>
 <script>
