Testing Ruby's Unicode Support

To find out how far Ruby's Unicode support has come, I tested every string method for violations of the principle of least surprise. The results are presented as a table you can use to check which string manipulation methods are Unicode-unfriendly.

Among the new features shipped with Ruby 2.4 is improved Unicode support. Specifically, methods like upcase and downcase now work as expected, turning "ä" into "Ä" and back. This made me curious: what other Unicode improvements have been made since 2013, when I read André Arko's blog post "Strings in Ruby are UTF-8 now… right??"
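As a quick illustration of the case-mapping change, here is a minimal sketch. It assumes Ruby 2.4 or later and writes "ä" with an explicit escape so there is no doubt about which code point is being tested; the variable names are only for illustration.

```ruby
# Ruby 2.4+ applies Unicode case mapping instead of ASCII-only mapping.
lower = "\u00E4"             # "ä" as a single precomposed code point
upper = lower.upcase         # "Ä" (U+00C4) on Ruby 2.4 and later
round_trip = upper.downcase  # back to "ä"

# The older ASCII-only behaviour is still available on request.
ascii_only = lower.upcase(:ascii) # leaves "ä" untouched

puts [upper, round_trip, ascii_only].inspect
```

On Ruby versions before 2.4, upcase would have left "ä" unchanged, which is what the :ascii option reproduces here.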

I tested all of Ruby's string methods, not looking for technical errors but for violations of the "principle of least surprise." Specifically, my assumptions were that:

  1. Unique characters are unique: "e" and "ë" are different, just like "e" and "E" are.
  2. Single characters count as single characters, no matter how they're represented in Unicode. This means that "e" and "ë" are each a single character, even when the latter is represented by two code points.
  3. Characters are immutable. Reversing a string of characters shouldn't alter the individual characters.
  4. Whitespace is treated as whitespace, including those tricky Unicode whitespace characters.
  5. Digits are treated as digits. The number 2 is always the number 2 no matter how it's written.

Unfortunately, most of Ruby's string manipulation methods fail these tests. If you're working with Unicode strings, you therefore have to be extremely careful which ones you use.
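To make the first three assumptions concrete, the sketch below builds "ë" in both of its Unicode representations using explicit escapes. This is an illustration under my own assumptions about the test strings, not the exact script used for the table.

```ruby
composed   = "\u00EB"  # "ë" as one code point (U+00EB)
decomposed = "e\u0308" # "ë" as "e" followed by COMBINING DIAERESIS (U+0308)

# Both render identically, but Ruby compares code points.
same_string = composed == decomposed              # false

# Length counts code points, not visible characters.
lengths = [composed.length, decomposed.length]    # [1, 2]

# Reversing moves the combining mark onto a different letter.
reversed = "no#{decomposed}l".reverse             # "l" + U+0308 + "eon"

puts same_string.inspect, lengths.inspect, reversed.inspect
```

The decomposed form is the one that trips up most methods, because the combining mark is treated as a character in its own right.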

NOTE: After publication, some readers pointed out that many of the failures I mentioned wouldn't have happened if I had normalized the Unicode test strings. This is true. However, strings aren't automatically normalized by Ruby or Rails (in any of the apps I tested). These tests were always meant to illustrate the worst case, and I think they're still useful in that regard.
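For reference, normalizing by hand might look like the following sketch. It uses String#unicode_normalize from Ruby's standard library (available since Ruby 2.2) and assumes the test string is the decomposed form of "ä"; the variable names are illustrative.

```ruby
decomposed = "a\u0308"                        # "a" + COMBINING DIAERESIS
composed   = decomposed.unicode_normalize(:nfc) # NFC composes to U+00E4

was_normalized = decomposed.unicode_normalized?(:nfc) # false
length_before  = decomposed.length                    # 2
length_after   = composed.length                      # 1
still_has_a    = composed.include?("a")               # false after composing

puts [was_normalized, length_before, length_after, still_has_a].inspect
```

Normalization has to be applied explicitly, for example at the point where user input enters the application, since nothing in the examples above happens automatically.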

Unicode tests with Ruby 2.4.0

Method | Test | Expected | Result | Verdict
#% | "%s" % "noël" | "noël" | "noël" | OK
#* | "noël" * 2 | "noëlnoël" | "noëlnoël" | OK
#<< | "noël" << "ë" | "noëlë" | "noëlë" | OK
#<=> | "ä" <=> "z" | -1 | -1 | OK
#== | "ä" == "ä" | true | true | OK
#=~ | "ä" =~ /a./ | nil | 0 | Beware!
#[] | "ä"[0] | "ä" | "a" | Beware!
#[]= | "ä"[0] = "u" | "u" | "u" | OK
#b | "ä".b.encoding.to_s | "ASCII-8BIT" | "ASCII-8BIT" | OK
#bytes | "ä".bytes | [97, 204, 136] | [97, 204, 136] | OK
#bytesize | "ä".bytesize | 3 | 3 | OK
#byteslice | "ä".byteslice(1) | "\xCC" | "\xCC" | OK
#capitalize | "ä".capitalize | "Ä" | "Ä" | OK
#casecmp | "äa".casecmp("äz") | -1 | -1 | OK
#center | "ä".center(3) | " ä " | "ä " | Beware!
#chars | "ä".chars | ["ä"] | ["a", "̈"] | Beware!
#chomp | "ä ".chomp | "ä" | "ä" | OK
#chop | "ä".chop | "" | "a" | Beware!
#chr | "ä".chr | "ä" | "a" | Beware!
#clear | "ä".clear | "" | "" | OK
#codepoints | "ä".codepoints | [97, 776] | [97, 776] | OK
#concat | "ä".concat("x") | "äx" | "äx" | OK
#count | "ä".count("a") | 0 | 1 | Beware!
#crypt | "123".crypt("ää") == "123".crypt("aa") | false | false | OK
#delete | "ä".delete("a") | "ä" | "̈" | Beware!
#downcase | "Ä".downcase | "ä" | "ä" | OK
#dump | "ä".dump | "\"a\\u0308\"" | "\"a\\u0308\"" | OK
#each_byte | "ä".each_byte.to_a | [97, 204, 136] | [97, 204, 136] | OK
#each_char | "ä".each_char.to_a | ["ä"] | ["a", "̈"] | Beware!
#each_codepoint | "ä".each_codepoint.to_a | [97, 776] | [97, 776] | OK
#each_line | "ä".each_line.to_a | ["ä"] | ["ä"] | OK
#empty? | "ä".empty? | false | false | OK
#encode | "ä".encode("ASCII", undef: :replace) | "a?" | "a?" | OK
#encoding | "ä".encoding.to_s | "UTF-8" | "UTF-8" | OK
#end_with? | "ä".end_with?("ä") | true | true | OK
#eql? | "ä".eql?("a") | false | false | OK
#force_encoding | "ä".force_encoding("ASCII") | "a\xCC\x88" | "a\xCC\x88" | OK
#getbyte | "ä".getbyte(2) | 136 | 136 | OK
#gsub | "ä".gsub("a", "x") | "ä" | "ẍ" | Beware!
#hash | "ä".hash == "a".hash | false | false | OK
#include? | "ä".include?("a") | false | true | Beware!
#index | "ä".index("a") | nil | 0 | Beware!
#replace | "ä".replace("u") | "u" | "u" | OK
#insert | "ä".insert(1, "u") | "äu" | "aü" | Beware!
#inspect | "ä".inspect | "\"ä\"" | "\"ä\"" | OK
#intern | "ä".intern | :ä | :ä | OK
#length | "ä".length | 1 | 2 | Beware!
#ljust | "ä".ljust(3, "_") | "ä__" | "ä_" | Beware!
#lstrip | " ä".lstrip | "ä" | "ä" | OK
#match | "ä".match("a") | nil | #<MatchData "a"> | Beware!
#next | "ä".next | "ä" | "b̈" | Beware!
#ord | "ä".ord | 97 | 97 | OK
#partition | "händ".partition("a") | ["händ"] | ["h", "a", "̈nd"] | Beware!
#prepend | "ä".prepend("ä") | "ää" | "ää" | OK
#replace | "ä".replace("ẍ") | "ẍ" | "ẍ" | OK
#reverse | "händ".reverse | "dnäh" | "dn̈ah" | Beware!
#rpartition | "händ".rpartition("a") | ["händ"] | ["h", "a", "̈nd"] | Beware!
#rstrip | "line ".rstrip | "line" | "line " | Beware!
#scrub | "ä".scrub | "ä" | "ä" | OK
#setbyte | s = "ä"; s.setbyte(0, "x".ord); s | "ẍ" | "ẍ" | OK
#size | "ä".size | 1 | 2 | Beware!
#slice | "ä".slice(0) | "ä" | "a" | Beware!
#split | "ä".split("a") | ["ä"] | ["", "̈"] | Beware!
#squeeze | "ää".squeeze("ä") | "ä" | "ää" | Beware!
#start_with? | "ä".start_with?("a") | false | true | Beware!
#strip | " line ".strip | "line" | " line " | Beware!
#sub | "ä".sub("a", "x") | "ä" | "ẍ" | Beware!
#succ | "ä".succ | "b̈" | "b̈" | OK
#swapcase | "ä".swapcase | "Ä" | "Ä" | OK
#to_c | "١".to_c | (1+0i) | (0+0i) | Beware!
#to_f | "١".to_f | 1.0 | 0.0 | Beware!
#to_i | "١".to_i | 1 | 0 | Beware!
#to_r | "١".to_r | (1/1) | (0/1) | Beware!
#to_sym | "ä".to_sym | :ä | :ä | OK
#tr | "ä".tr("a", "b") | "ä" | "b̈" | Beware!
#unpack | "ä".unpack("CCC") | [97, 204, 136] | [97, 204, 136] | OK
#upto | "ä".upto("c̈").to_a | ["ä", "b̈", "c̈"] | ["ä", "b̈", "c̈"] | OK
#valid_encoding? | "ä".valid_encoding? | true | true | OK
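If you want to rerun a few of these rows on your own Ruby version, a small harness along the following lines could work. This is a hypothetical sketch, not the script that produced the table: the helper names sample_rows and run_rows are my own, and the test strings are assumed to be the decomposed form of "ä" and the Arabic-Indic digit one.

```ruby
# Hypothetical rows: [method label, expected value, lambda producing the actual value].
def sample_rows
  a_umlaut = "a\u0308" # assumed test string: "a" + COMBINING DIAERESIS
  [
    ["#capitalize", "A\u0308",    -> { a_umlaut.capitalize }],
    ["#length",     1,            -> { a_umlaut.length }],
    ["#chars",      [a_umlaut],   -> { a_umlaut.chars }],
    ["#reverse",    "dna\u0308h", -> { "ha\u0308nd".reverse }],
    ["#to_i",       1,            -> { "\u0661".to_i }] # ARABIC-INDIC DIGIT ONE
  ]
end

# Returns [label, expected, actual, verdict] for every row.
def run_rows(rows)
  rows.map do |label, expected, test|
    actual = test.call
    [label, expected, actual, expected == actual ? "OK" : "Beware!"]
  end
end

# Print one pipe-separated line per row.
run_rows(sample_rows).each { |row| puts row.map(&:inspect).join(" | ") }
```

Keeping each test in a lambda means new rows can be added without touching the comparison logic, and the same rows can be rerun against normalized strings to see which verdicts change.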

Starr Horne

Starr Horne is a Rubyist and Chief Javascripter at Honeybadger.io. When he's not neck-deep in other people's bugs, he enjoys making furniture with traditional hand-tools, reading history and brewing beer in his garage in Seattle.