At Sky’s GST-Data Department we collect operational and business data from all our streaming platforms and supporting systems and transform it into a consumer-friendly format to support business decisions as well as the needs of our data science and product teams.
The scene
One of the last steps in our pipeline is to take the cleaned but still raw data in BigQuery and transform it into a relational format, join the tables, and perform final cleanups. To do that, for each target table we run a series of SQL transformation jobs using the dbt framework. Each job is triggered by a separate cronjob and ends up producing a table in a BigQuery dataset. Those jobs alone crunch petabytes of data daily.
We also keep some human-friendly text descriptions of the tables in the source code, close to the SQL definitions. These descriptions are saved both in BigQuery and in a centralised data catalogue. You can see the ecosystem of jobs and generated tables in the image below.

The crime: “One day a dead man appeared at the traffic lights”
It all started innocently: we received a change request with a significantly improved description of one of the tables. The relevant part of it looked like this:

All in all, it’s Markdown with some fancier symbols, but nothing dangerous, as we have used symbols such as ✅ before in table descriptions without any issue.
Shortly after merging the change, we received alerts informing us that the job whose description had just changed was failing. Taking a closer look at the logs, the execution of the dbt job failed because of a BigQuery syntax error:
Database Error:
-----------------------------------------------------------------------------
Syntax error: Illegal escape sequence: Unicode value \ud83d is invalid
On-Site Examination
Our first intuition, obviously, was that one of the special characters was unsupported for some reason. So we dug into the Markdown file's hexadecimal representation and found no bytes corresponding to \ud83d. From the other end, when searching for that error code on the internet, we found that such a sequence represents an unnamed symbol from the High Surrogates Unicode block, a strange beast from the depths of text encoding.
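The examination is easy to reproduce in a few lines of Python (the string literal here is a stand-in for the real description, not its actual content): well-formed UTF-8 simply never contains surrogate values such as 0xD83D.

```python
# A stand-in string; the real file content was of course different.
text = "🚥"

utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex(" "))  # f0 9f 9a a5

# No UTF-8 byte sequence encodes the value 0xD83D: surrogates are a
# UTF-16 artifact and do not exist in well-formed UTF-8 at all.
```

So the file itself was fine; the surrogate had to appear somewhere later in the pipeline.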
Background – Who Let the Surrogates Out?
To understand the issue, let’s get some background in text encoding. The Unicode standard assigns a numeric value to each character. By convention, this numeric value is written in hexadecimal and is prefixed with U+, but in fact it is nothing more than a number.
When we want to store a character in a file or send it to a friend (assuming we have any) over the internet, we need to encode the numeric codepoints into a sequence of bytes. There is a plethora of different encodings, but the most well-known and widely used today is the family of UTF encodings.

Of those, the most prevalent is UTF-8. It encodes characters to 1–4 bytes. Basic English letters such as x and punctuation are encoded to one byte. Characters from most world languages get encoded as two bytes, so words such as dřevotříska would use some one-byte and some two-byte sequences. Most Chinese, Japanese and some special characters get encoded to three bytes, and most modern colourful emojis utilise four bytes.
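These byte counts are easy to verify; a minimal Python sketch using characters from this article:

```python
# UTF-8 length of sample characters: ASCII, a Latin-extended letter,
# a BMP symbol, and an emoji above U+FFFF.
for ch in ("x", "ř", "✅", "🚥"):
    print(ch, len(ch.encode("utf-8")))
# x 1, ř 2, ✅ 3, 🚥 4
```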
On the other side of the spectrum, there’s UTF-32, which encodes any character to 4 bytes. That is usually wasteful since most texts use just a very small subset of characters, on the other hand the encoding is very simple – it’s very easy to work with it programmatically because it’s fixed-length, and the bytes directly represent the Unicode codepoint, as you can see in the table above.
In the middle, there’s UTF-16, sharing the advantages (but mostly disadvantages) of the other two encodings. For the most part, it behaves similarly to UTF-32, as the 2-byte sequences are numerically equivalent to the corresponding Unicode codepoints. That changes for codepoints above U+FFFF, though. Those codepoints are encoded as two 2-byte units, the so-called surrogate pair, which can be easily recognised because the first byte of each unit in the pair lies in the range D8 to DF (D8–DB for the high surrogate, DC–DF for the low one).
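The difference shows up directly when encoding our two symbols to UTF-16 in Python:

```python
# ✅ (U+2705) fits into a single 16-bit unit, while 🚥 (U+1F6A5) is above
# U+FFFF and therefore becomes a surrogate pair in UTF-16.
print("✅".encode("utf-16-be").hex())   # 2705
print("🚥".encode("utf-16-be").hex())  # d83ddea5 -- high surrogate d83d, low dea5
```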
The important piece of information for us is that the horizontal traffic light symbol 🚥 is encoded to four bytes in both UTF-8 and in UTF-16, despite the byte representations being different in both cases. Furthermore, you could have already noticed that the UTF-16 representation of the 🚥 symbol starts with D83D, which is the same sequence BigQuery was complaining about in the logs. On the contrary, the check mark symbol ✅ is encoded to just three bytes in UTF-8 and two bytes in UTF-16, which caused no issue.
Autopsy
First, we confirmed that our source Markdown file with 🚥 and ✅ was correctly encoded in UTF-8. Next, we looked deep into dbt’s source code, specifically into its BigQuery adapter responsible for rendering SQL and sending it to the database for execution.
We found that when table descriptions are configured to be persisted, dbt first escapes the description string. This makes sense as the description may contain absolutely anything, so to generate a valid SQL, the description needs to be sanitised.

The first clue appeared just when we looked at the definition of the sql_escape function. Instead of tailoring the escaping mechanism for BigQuery, it feeds the description to a JSON encoder. That is a neat shortcut, as converting to JSON does a lot of what we want from SQL escaping: it escapes special characters such as newlines to \n, double quotes to \" or non-ASCII characters such as € to \u20ac.
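The effect of that shortcut can be seen by calling Python’s json encoder directly (a sketch of the behaviour, not dbt’s actual sql_escape code):

```python
import json

# With the default ensure_ascii=True, json.dumps escapes newlines,
# double quotes and any non-ASCII character -- close to what SQL
# escaping needs.
print(json.dumps('line1\nline2 "quoted" €'))
# "line1\nline2 \"quoted\" \u20ac"
```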
But here’s where things get interesting: how does JSON handle characters outside of the U+0000 to U+FFFF range? Python’s json library claims to follow the RFC 7159 standard, which specifies the following:
To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a 12-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
To be clear, Basic Multilingual Plane is the name of the range of Unicode codepoints up to U+FFFF. Thus, symbols such as 🚥 (Unicode U+1F6A5) should be encoded as UTF-16 and each of the surrogates escaped separately. And indeed, the SQL generated by dbt contains a pair of two-byte values:

The Murderer Was JaSON
So, what is the issue with that? BigQuery simply has a different mechanism for escaping large Unicode codepoints than JSON does. BigQuery expects codepoints within the Basic Multilingual Plane to be escaped using \u followed by 4 hexadecimal digits, while for larger codepoints it expects an escape sequence \U followed by 8 hexadecimal digits. That is incompatible with JSON’s rule of using only \u escapes and encoding codepoints larger than U+FFFF to UTF-16 before escaping each resulting surrogate. In fact, the range used for UTF-16 surrogate pairs is explicitly forbidden in the BigQuery documentation.
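A BigQuery-compatible escaper would therefore have to special-case codepoints above U+FFFF. A minimal sketch of the idea (bq_escape_char is a hypothetical helper, not dbt’s or BigQuery’s code):

```python
def bq_escape_char(ch: str) -> str:
    """Escape a single character the way BigQuery string literals expect."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # Above the BMP: \U plus 8 hex digits, never a surrogate pair.
        return f"\\U{cp:08x}"
    if cp > 0x7F:
        # Non-ASCII BMP character: \u plus 4 hex digits.
        return f"\\u{cp:04x}"
    return ch

print(bq_escape_char("🚥"))  # \U0001f6a5
print(bq_escape_char("✅"))  # \u2705
```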

Symbols such as ✅, which fit into two bytes in UTF-16, are escaped by JSON as a single unit, e.g. \u2705, which is consistent with what BigQuery expects. Consequently, their usage did not cause any issue in the past, and BigQuery was happy to store and even display them within the table description.
In short:
✅ → \u2705 → OK for BigQuery
🚥 → \ud83d\udea5 → ❌ BigQuery throws a fit, because it expects \U0001f6a5
Resolution
What that means for our scenario is that we must limit ourselves to the Basic Multilingual Plane, which is the only range where all components are consistent with each other. Although the list is rich, including various alphabets, punctuation, phonetic extensions, currency or letterlike symbols, and more, we cannot use e.g. historic alphabets or most modern emojis.

As you can see in the above table, the set of allowed characters is not entirely boring. There is still plenty of potential to underline the meaning we want to communicate.
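If you want to enforce the restriction before a description ever reaches dbt, a simple pre-merge check for non-BMP characters could look like this (in_bmp is a hypothetical helper name):

```python
def in_bmp(text: str) -> bool:
    """Return True if every character fits in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)

print(in_bmp("status ✅"))  # True
print(in_bmp("status 🚥"))  # False -- would trip BigQuery via dbt's escaping
```

Such a check could run in CI on the Markdown files, turning the cryptic runtime database error into an immediate, readable review failure.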
While this fun issue did not cause any serious trouble, we found it very amusing that while all components of the pipeline support the whole Unicode table, together they break due to implementation details. Even in 2025 when most technologies support UTF-8, nobody is safe from historical artifacts of unusual text encodings in the most unexpected places.


