fix(lexer): Only allow horizontal whitespace in frontmatter
In writing up the reference for frontmatter, I realized that we probably
shouldn't be accepting Unicode Line Ending characters between the code
fence and infostring or trailing after the infostring or a code fence.
In digging into the unicode specification we use for Whitespace, it
divides it up into categories, so I'm deferring to what it says for
horizontal whitespace for what should be used within a line.
Note, I am leaving out support for Unicode Default Ignorable characters.
I figure that can be discussed outside of this change within the
reference and tracking issue.
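As a minimal sketch of the distinction described above, assuming the horizontal category consists of only tab and space (with line endings and Default Ignorable characters excluded), the check could look like the following. The helper `is_valid_fence_trailer` is hypothetical and only illustrates how such a predicate might be applied to the text after a code fence or infostring.

```rust
/// Returns true for characters treated here as horizontal whitespace.
/// Assumption: the horizontal category is just tab and space.
fn is_horizontal_whitespace(c: char) -> bool {
    matches!(c, '\u{0009}' | '\u{0020}')
}

/// Hypothetical helper: true if everything trailing a fence or infostring
/// on the same line is horizontal whitespace.
fn is_valid_fence_trailer(rest: &str) -> bool {
    rest.chars().all(is_horizontal_whitespace)
}

fn main() {
    // Spaces and tabs are fine within a line.
    assert!(is_valid_fence_trailer(" \t "));
    // U+2028 LINE SEPARATOR is a line ending, so it is rejected.
    assert!(!is_valid_fence_trailer(" \u{2028}"));
    // U+200E LEFT-TO-RIGHT MARK is Default Ignorable, also rejected here.
    assert!(!is_valid_fence_trailer("\u{200E}"));
}
```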
Fixes rust-lang/rust#145971
Frontmatter tracking issue: rust-lang/rust#136889
This was done in #145740 and #145947. It is causing problems for people
using r-a on anything that uses the rustc-dev rustup package, e.g. Miri,
clippy.
This repository has lots of submodules and subtrees and various
different projects are carved out of pieces of it. It seems like
`[workspace.dependencies]` will just be more trouble than it's worth.
The RFC only limits hyphens at the beginning of lines and not if they
are indented or embedded in other content.
Sticking to that approach was confirmed by the T-lang liaison at
https://github.com/rust-lang/rust/issues/141367#issuecomment-3202217544
There is a regression in error message quality which I'm leaving for
someone if they feel this needs improving.
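A rough sketch of the rule, not the lexer's actual code: hyphens only matter when they sit at the very start of a line. The function name and the three-hyphen minimum are assumptions made for illustration.

```rust
/// Hypothetical check: does this frontmatter content line begin with
/// fence-like hyphens at column zero?
/// Assumption: a fence is at least three hyphens.
fn starts_with_fence_hyphens(line: &str) -> bool {
    line.starts_with("---")
}

fn main() {
    // Hyphens at the start of a line are restricted.
    assert!(starts_with_fence_hyphens("---"));
    // Indented hyphens are ordinary content.
    assert!(!starts_with_fence_hyphens("  ---"));
    // Hyphens embedded in other content are ordinary content.
    assert!(!starts_with_fence_hyphens("name = \"a---b\""));
}
```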
Revert <https://github.com/rust-lang/rust/pull/138084> to buy time to
consider options that avoid breaking downstream usages of cargo on
distributed `rustc-src` artifacts, where such cargo invocations fail due
to inability to inherit `lints` from workspace root manifest's
`workspace.lints` (this is only valid for the source rust-lang/rust
workspace, but not really the distributed `rustc-src` artifacts).
This breakage was reported in
<https://github.com/rust-lang/rust/issues/138304>.
This reverts commit 48caf81484b50dca5a5cebb614899a3df81ca898, reversing
changes made to c6662879b27f5161e95f39395e3c9513a7b97028.
By naming them in `[workspace.lints.rust]` in the top-level
`Cargo.toml`, and then making all `compiler/` crates inherit them with
`[lints] workspace = true`. (I omitted `rustc_codegen_{cranelift,gcc}`,
because they're a bit different.)
The advantages of this over the current approach:
- It uses a standard Cargo feature, rather than special handling in
bootstrap. So, easier to understand, and less likely to get
accidentally broken in the future.
- It works for proc macro crates.
It's a shame it doesn't work for rustc-specific lints, as the comments
explain.
It was added in #123752 to handle some cases involving emoji, but it
isn't necessary because it's always treated the same as
`TokenKind::InvalidIdent`. This commit removes it, which makes things a
little simpler.
- Rename it as `invalid_ident_or_prefix`, which matches the possible
outputs (`InvalidIdent` or `InvalidPrefix`).
- Use the local wrapper for `is_xid_continue`, for consistency.
- Make it clear what `\u{200d}` means.
We already do this for a number of crates, e.g. `rustc_middle`,
`rustc_span`, `rustc_metadata`, `rustc_errors`.
For the ones we don't, in many cases the attributes are a mess.
- There is no consistency about order of attribute kinds (e.g.
`allow`/`deny`/`feature`).
- Within attribute kind groups (e.g. the `feature` attributes),
sometimes the order is alphabetical, and sometimes there is no
particular order.
- Sometimes the attributes of a particular kind aren't even grouped
all together, e.g. there might be a `feature`, then an `allow`, then
another `feature`.
This commit extends the existing sorting to all compiler crates,
increasing consistency. If any new attribute line is added there is now
only one place it can go -- no need for arbitrary decisions.
Exceptions:
- `rustc_log`, `rustc_next_trait_solver` and `rustc_type_ir_macros`,
because they have no crate attributes.
- `rustc_codegen_gcc`, because it's quasi-external to rustc (e.g. it's
ignored in `rustfmt.toml`).
Do not accept the following:
```rust
macro_rules! lexes {($($_:tt)*) => {}}
lexes!(🐛"foo");
```
Before, invalid emoji identifiers were gated during parsing instead of lexing in all cases, but this didn't account for macro expansion of literal prefixes.
Fix #123696.
Given `'hello world'` and `'1 str'`, provide a structured suggestion for a valid string literal:
```
error[E0762]: unterminated character literal
  --> $DIR/lex-bad-str-literal-as-char-3.rs:2:26
   |
LL |     println!('hello world');
   |                          ^^^^
   |
help: if you meant to write a `str` literal, use double quotes
   |
LL |     println!("hello world");
   |              ~           ~
```
```
error[E0762]: unterminated character literal
  --> $DIR/lex-bad-str-literal-as-char-1.rs:2:20
   |
LL |     println!('1 + 1');
   |                    ^^^^
   |
help: if you meant to write a `str` literal, use double quotes
   |
LL |     println!("1 + 1");
   |              ~     ~
```
Fix #119685.
That is, change `diagnostic_outside_of_impl` and
`untranslatable_diagnostic` from `allow` to `deny`, because more than
half of the compiler has been converted to use translated diagnostics.
This commit removes more `deny` attributes than it adds `allow`
attributes, which proves that this change is warranted.
They can't contain `\x` escapes, which means they can't contain high
bytes, which means we can use `unescape_unicode` instead of
`unescape_mixed` to unescape them. This avoids unnecessary use of
`MixedUnit`.
`unescape_literal` becomes `unescape_unicode`, and `unescape_c_string`
becomes `unescape_mixed`, because RFC 3349 will mean that C string
literals will no longer be the only mixed UTF-8 literals.