Testing Ruby's Unicode Support

To see how far Ruby's Unicode support has come, I tested every string method for violations of the principle of least surprise. The results are collected in a table that you can use to check which string manipulation methods are Unicode-unfriendly.

Among the new features shipped with Ruby 2.4 is improved Unicode support. Specifically, methods like upcase and downcase now work as expected, turning "ä" into "Ä" and back. This made me curious: what other Unicode improvements have been made since 2013, when I read André Arko's blog post "Strings in Ruby are UTF-8 now… right?"?
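As a quick illustration of that case-mapping change, here is a minimal sketch. It assumes Ruby 2.4 or later and uses explicit \u escapes so the code points are unambiguous; the helper name round_trips? is made up for this example.

```ruby
# Assumes Ruby 2.4 or later, where case mapping covers non-ASCII letters.
# round_trips? is a hypothetical helper, not part of Ruby.
def round_trips?(str)
  str.upcase.downcase == str
end

lower = "\u00E4"   # "ä" as the single code point U+00E4
upper = "\u00C4"   # "Ä" as the single code point U+00C4

puts lower.upcase == upper     # true on Ruby 2.4+
puts upper.downcase == lower   # true on Ruby 2.4+
puts round_trips?(lower)       # true on Ruby 2.4+
```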

I tested all of Ruby's string methods, not looking for technical errors but for violations of the "principle of least surprise." Specifically, my assumptions were that:

  1. Unique characters are unique: "e" and "ë" are different, just like "e" and "E" are.
  2. Single characters count as single characters, no matter how they're represented in Unicode. This means that "e" and "ë" are each a single character, even though the latter is represented by two code points.
  3. Characters are immutable. Reversing a string of characters shouldn't alter the individual characters.
  4. Whitespace is treated as whitespace. Even those tricky Unicode whitespace characters.
  5. Digits are treated as digits. The number 2 is always the number 2 no matter how it's written.
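To make the second assumption concrete, the sketch below compares two ways of writing "ë": as one precomposed code point and as "e" followed by a combining diaeresis. The describe helper is a hypothetical name used only for this illustration.

```ruby
# describe is a hypothetical helper that summarizes how Ruby sees a string.
def describe(str)
  { length: str.length, codepoints: str.codepoints, bytesize: str.bytesize }
end

composed   = "\u00EB"    # "ë" as one code point (U+00EB)
decomposed = "e\u0308"   # "e" followed by COMBINING DIAERESIS (U+0308)

p describe(composed)     # length 1, codepoints [235], bytesize 2
p describe(decomposed)   # length 2, codepoints [101, 776], bytesize 3

# Both render as the same character, yet Ruby does not treat them as equal.
puts composed == decomposed   # false
```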

Unfortunately, most of Ruby's string manipulation methods fail these tests. If you're working with Unicode strings, you therefore have to be extremely careful which ones you use.

NOTE: After publication, some readers pointed out that many of the failures I mentioned wouldn't have happened if I had normalized the Unicode test strings. This is true. However, strings aren't automatically normalized by Ruby or Rails (in any of the apps I tested). These tests were always meant to illustrate the worst case, and I think they're still useful in that regard.
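For reference, this is roughly what normalizing a test string looks like. It is a sketch that uses String#unicode_normalize (available since Ruby 2.2) with the NFC form; the nfc wrapper is a hypothetical convenience name, and NFC is only one of several normalization forms.

```ruby
# nfc is a hypothetical wrapper around String#unicode_normalize.
def nfc(str)
  str.unicode_normalize(:nfc)
end

decomposed = "ha\u0308nd"      # "händ" with "a" + U+0308
normalized = nfc(decomposed)   # "händ" with the single code point U+00E4

puts decomposed.length   # 5
puts normalized.length   # 4

# After normalization, reversing keeps the diaeresis on the "a".
puts normalized.reverse == "dn\u00E4h"   # true
```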

Unicode tests with Ruby 2.4.0

In these tests, "ä" is written in decomposed form: "a" followed by the combining diaeresis U+0308, as the #bytes and #codepoints rows show.

Method | Test | Expected | Result | Verdict
#% | "%s" % "noël" | "noël" | "noël" | OK
#* | "noël" * 2 | "noëlnoël" | "noëlnoël" | OK
#<< | "noël" << "ë" | "noëlë" | "noëlë" | OK
#<=> | "ä" <=> "z" | -1 | -1 | OK
#== | "ä" == "ä" | true | true | OK
#=~ | "ä" =~ /a./ | nil | 0 | Beware!
#[] | "ä"[0] | "ä" | "a" | Beware!
#[]= | "ä"[0] = "u" | "u" | "u" | OK
#b | "ä".b.encoding.to_s | "ASCII-8BIT" | "ASCII-8BIT" | OK
#bytes | "ä".bytes | [97, 204, 136] | [97, 204, 136] | OK
#bytesize | "ä".bytesize | 3 | 3 | OK
#byteslice | "ä".byteslice(1) | "\xCC" | "\xCC" | OK
#capitalize | "ä".capitalize | "Ä" | "Ä" | OK
#casecmp | "äa".casecmp("äz") | -1 | -1 | OK
#center | "ä".center(3) | " ä " | "ä " | Beware!
#chars | "ä".chars | ["ä"] | ["a", "̈"] | Beware!
#chomp | "ä ".chomp | "ä" | "ä" | OK
#chop | "ä".chop | "" | "a" | Beware!
#chr | "ä".chr | "ä" | "a" | Beware!
#clear | "ä".clear | "" | "" | OK
#codepoints | "ä".codepoints | [97, 776] | [97, 776] | OK
#concat | "ä".concat("x") | "äx" | "äx" | OK
#count | "ä".count("a") | 0 | 1 | Beware!
#crypt | "123".crypt("ää") == "123".crypt("aa") | false | false | OK
#delete | "ä".delete("a") | "ä" | "̈" | Beware!
#downcase | "Ä".downcase | "ä" | "ä" | OK
#dump | "ä".dump | "\"a\\u0308\"" | "\"a\\u0308\"" | OK
#each_byte | "ä".each_byte.to_a | [97, 204, 136] | [97, 204, 136] | OK
#each_char | "ä".each_char.to_a | ["ä"] | ["a", "̈"] | Beware!
#each_codepoint | "ä".each_codepoint.to_a | [97, 776] | [97, 776] | OK
#each_line | "ä".each_line.to_a | ["ä"] | ["ä"] | OK
#empty? | "ä".empty? | false | false | OK
#encode | "ä".encode("ASCII", undef: :replace) | "a?" | "a?" | OK
#encoding | "ä".encoding.to_s | "UTF-8" | "UTF-8" | OK
#end_with? | "ä".end_with?("ä") | true | true | OK
#eql? | "ä".eql?("a") | false | false | OK
#force_encoding | "ä".force_encoding("ASCII") | "a\xCC\x88" | "a\xCC\x88" | OK
#getbyte | "ä".getbyte(2) | 136 | 136 | OK
#gsub | "ä".gsub("a", "x") | "ä" | "ẍ" | Beware!
#hash | "ä".hash == "a".hash | false | false | OK
#include? | "ä".include?("a") | false | true | Beware!
#index | "ä".index("a") | nil | 0 | Beware!
#replace | "ä".replace("u") | "u" | "u" | OK
#insert | "ä".insert(1, "u") | "äu" | "aü" | Beware!
#inspect | "ä".inspect | "\"ä\"" | "\"ä\"" | OK
#intern | "ä".intern | :ä | :ä | OK
#length | "ä".length | 1 | 2 | Beware!
#ljust | "ä".ljust(3, "_") | "ä__" | "ä_" | Beware!
#lstrip | " ä".lstrip | "ä" | "ä" | OK
#match | "ä".match("a") | nil | #<MatchData "a"> | Beware!
#next | "ä".next | "ä" | "b̈" | Beware!
#ord | "ä".ord | 97 | 97 | OK
#partition | "händ".partition("a") | ["händ"] | ["h", "a", "̈nd"] | Beware!
#prepend | "ä".prepend("ä") | "ää" | "ää" | OK
#replace | "ä".replace("ẍ") | "ẍ" | "ẍ" | OK
#reverse | "händ".reverse | "dnäh" | "dn̈ah" | Beware!
#rpartition | "händ".rpartition("a") | ["händ"] | ["h", "a", "̈nd"] | Beware!
#rstrip | "line ".rstrip | "line" | "line " | Beware!
#scrub | "ä".scrub | "ä" | "ä" | OK
#setbyte | s = "ä"; s.setbyte(0, "x".ord); s | "ẍ" | "ẍ" | OK
#size | "ä".size | 1 | 2 | Beware!
#slice | "ä".slice(0) | "ä" | "a" | Beware!
#split | "ä".split("a") | ["ä"] | ["", "̈"] | Beware!
#squeeze | "ää".squeeze("ä") | "ä" | "ää" | Beware!
#start_with? | "ä".start_with?("a") | false | true | Beware!
#strip | " line ".strip | "line" | " line " | Beware!
#sub | "ä".sub("a", "x") | "ä" | "ẍ" | Beware!
#succ | "ä".succ | "b̈" | "b̈" | OK
#swapcase | "ä".swapcase | "Ä" | "Ä" | OK
#to_c | "١".to_c | (1+0i) | (0+0i) | Beware!
#to_f | "١".to_f | 1.0 | 0.0 | Beware!
#to_i | "١".to_i | 1 | 0 | Beware!
#to_r | "١".to_r | (1/1) | (0/1) | Beware!
#to_sym | "ä".to_sym | :ä | :ä | OK
#tr | "ä".tr("a", "b") | "ä" | "b̈" | Beware!
#unpack | "ä".unpack("CCC") | [97, 204, 136] | [97, 204, 136] | OK
#upto | "ä".upto("c̈").to_a | ["ä", "b̈", "c̈"] | ["ä", "b̈", "c̈"] | OK
#valid_encoding? | "ä".valid_encoding? | true | true | OK
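
A few of the "Beware!" rows can be reproduced with explicit escapes, alongside one possible way to get the expected behavior. This is a sketch, not a recommendation from the tests above: graphemes and unicode_strip are hypothetical helper names, \X is the regex escape for an extended grapheme cluster, and [[:space:]] is the POSIX bracket class that also covers Unicode whitespace such as the no-break space U+00A0.

```ruby
# graphemes and unicode_strip are hypothetical helpers, not part of Ruby.
def graphemes(str)
  str.scan(/\X/)   # \X matches one extended grapheme cluster
end

def unicode_strip(str)
  # Remove leading and trailing whitespace, including Unicode whitespace.
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

word = "ha\u0308nd"   # "händ" with a decomposed "ä"

puts word.length              # 5, counts code points
puts graphemes(word).length   # 4, counts user-perceived characters

# Reversing by grapheme keeps the diaeresis attached to the "a".
puts graphemes(word).reverse.join == "dna\u0308h"   # true

# A no-break space on each side is removed by the [[:space:]] class.
puts unicode_strip("\u00A0line\u00A0") == "line"    # true
```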

    Starr Horne

    Starr Horne is a Rubyist and Chief JavaScripter at Honeybadger.io. When she's not neck-deep in other people's bugs, she enjoys making furniture with traditional hand-tools, reading history and brewing beer in her garage in Seattle.
