University of Cambridge researchers have detailed a new way to introduce targeted vulnerabilities into source code while keeping them invisible to human code reviewers, opening the door to extensive supply-chain attacks.


“We have discovered ways of manipulating the encoding of source code files so that human viewers and compilers see different logic. One particularly pernicious method uses Unicode directionality override characters to display code as an anagram of its true logic,” professor Ross Anderson explained.

The Trojan Source bugs and attack patterns

Anderson and fellow researcher (and PhD student) Nicholas Boucher have revealed two attack patterns, collectively dubbed Trojan Source attacks.

CVE-2021-42574 is a vulnerability in the bidirectional algorithm in the Unicode Specification. Attackers can use Unicode control characters to reorder tokens in source code at the encoding level, crafting code that is seen one way by compilers and another way by human reviewers. The researchers dubbed this the Bidi attack, and fear that it could lead to widespread supply-chain attacks on source code.
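To make the Bidi attack more concrete, here is a minimal Python sketch of how a reviewer or build step might flag the control characters involved. It is not the researchers' tooling: the function name `find_bidi_controls` and the sample snippet are hypothetical, and the character set is limited to the nine Unicode embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069).

```python
import unicodedata

# Unicode bidirectional embedding, override, and isolate controls.
# Assumption: these nine code points are the ones worth flagging.
BIDI_CONTROLS = {
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}


def find_bidi_controls(source: str):
    """Return (line, column, character name) for each Bidi control found.

    Hypothetical helper for illustration; line and column are 1-based.
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings


if __name__ == "__main__":
    # Illustrative snippet: escapes are used so the invisible characters
    # are visible in this listing.
    sample = 'access = "user"\nif access != "admin\u202e \u2066# check":\n    pass\n'
    for line_no, col, name in find_bidi_controls(sample):
        print(f"line {line_no}, column {col}: {name}")
```

A check like this only reports where the characters occur; deciding whether a given occurrence is legitimate (for example, in right-to-left text inside a string literal) is left to the reviewer.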

CVE-2021-42694 is an exploitable issue in the character definitions of the Unicode Specification. Attackers can use homoglyphs to produce source code identifiers (e.g., function names) that are visually identical to a target identifier.

“This adversarial function then performs some malicious action, then optionally calls the original function it is impersonating. When defined in upstream dependencies such as open source software, these adversarial functions can be imported into downstream software and invoked without visual indication of malicious code,” the researchers noted. They dubbed this the homoglyph attack.
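The homoglyph attack can be illustrated with a rough Python sketch that lists identifiers containing non-ASCII characters, which is one simple heuristic for spotting look-alike names. The helper `non_ascii_identifiers` and the `sayHello` example are hypothetical and not taken from the paper; legitimate internationalized identifiers would be reported too, so this is a prompt for review rather than a verdict.

```python
import io
import tokenize
import unicodedata


def non_ascii_identifiers(source: str):
    """Return (line, identifier, names of non-ASCII characters) for each
    identifier in Python source that contains a non-ASCII character.

    Hypothetical helper for illustration only.
    """
    hits = []
    reader = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(reader):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            odd = [
                unicodedata.name(ch, "UNKNOWN")
                for ch in tok.string
                if ord(ch) > 127
            ]
            hits.append((tok.start[0], tok.string, odd))
    return hits


if __name__ == "__main__":
    # The first function name uses U+041D (a Cyrillic letter that looks
    # like Latin "H"); the second is plain ASCII.
    sample = (
        "def say\u041dello():\n"
        "    pass\n"
        "\n"
        "def sayHello():\n"
        "    pass\n"
    )
    for line_no, ident, names in non_ascii_identifiers(sample):
        print(f"line {line_no}: {ident!r} contains {names}")
```

The two definitions render almost identically in many fonts, yet they are distinct identifiers, which is what lets an adversarial function sit alongside the one it impersonates.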

More technical details can be found in their paper.

Fixing the problem

The researchers created proof-of-concept exploits and verified that the Trojan Source attacks work in C, C++, C#, JavaScript, Java, Rust, Go, and Python. They believe other programming languages that support internationalized text in comments and string literals may be vulnerable as well.

They also proposed a number of defense measures that should be implemented in compilers, interpreters, and build pipelines supporting Unicode; language specifications; and code (text) editors and repository front-ends.

They disclosed their findings (under embargo) to a variety of organizations and companies that can work on building these defenses.

“We are of the view that the long-term solution to the problem will be deployed in compilers. We note that almost all compilers already defend against one related attack, which involves creating adversarial function names using zero-width space characters, while three generate errors in response to another, which exploits homoglyphs in function names,” they shared.

“About half of the compiler maintainers we contacted during the disclosure period are working on patches or have committed to do so. As the others are dragging their feet, it is prudent to deploy other controls in the meantime where this is quick and cheap, or relevant and needful. Three firms that maintain code repositories are also deploying defences. We recommend that governments and firms that rely on critical software should identify their suppliers’ posture, exert pressure on them to implement adequate defences, and ensure that any gaps are covered by controls elsewhere in their toolchain.”

Organizations that have already implemented fixes and are working on detections in their supply chain include the Rust team, GitHub, Red Hat, and Atlassian (multiple products are affected).

Anderson and Boucher said that after scanning as much of the open source ecosystem as they could for signs of Trojan Source attacks in the wild, they mostly found false positives.

“However, we did find some evidence of techniques similar to Trojan Source attacks being exploited. In one instance, a static code analysis tool for smart contracts, Slither, contained scanning for right-to-left override characters,” they noted.

“We also discovered multiple instances of JavaScript obfuscation that used Bidi characters to assist in obscuring code. This is not necessarily malicious, but is still an interesting use of directionality overrides. Finally, we found multiple implementations of exploit generators for directionality override in filename extensions.”

In the wake of the release of this research, Péter Szilágyi, team lead at Ethereum, pointed out that he had discovered the Bidi issue five years earlier and reported it to the team developing the Go programming language.