Strings and Yet Other Strings

You're going to be writing code with many strings, arrays, vectors, and similar structures. Let's explore the two things that make these tick!

Strings are hard. They seem like they should be simple, and that should scare you! Where do they live, how do they move around, and can they be reused? These are all questions that language designers have to think about. Rust seems to have thought about it a lot, judging by how many different structures it ended up with to handle them. What's the deal with &str and String, what is that weird Cow thing, and what's the best way to write functions so that they can accept any string type? Here are the two things at the core of these choices.

Slices

Let's start with a really simple example of a string being used:

let my_string = "hello world!";

The my_string variable is given the value "hello world!", which is a string literal. The type of this variable is &str, which can be thought of as a string reference, but it's better to get used to calling these shared string slices, to align with other sliceable types like arrays and vectors. Slices are really important in Rust: like arrays in C/C++, they point to a contiguous sequence of elements in memory, but they also carry their length. In more modern C++ the closest match is probably std::string_view. I don't know it well enough to draw a clear line, but the docs make it sound very similar:

The class template basic_string_view describes an object that can refer to a constant contiguous sequence of CharT with the first element of the sequence at position zero.

For a basic_string_view named str, pointers, iterators, and references to elements of str are invalidated when an operation invalidates a pointer in the range [str.data(), str.data() + str.size()).

A typical implementation holds only two members: a pointer to constant CharT and a size.

Slices in Rust are a generalization of this concept. For example, you can have a &[u32] or &[(f32, f32)] with the same properties but without being tied to a character type. I love how clear the docs are about slices and how you can use them. Take a look:

Slices are either mutable or shared. The shared slice type is &[T], while the mutable slice type is &mut [T], where T represents the element type.

I think it's really smart to not use a { mutable, immutable } x { unique, shared } product of options if you don't like two of the resulting four options. We either need a slice to be mutable (and uniquely borrowed) or shared (and immutable), and that's what we get here.
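As a small sketch of that split (the function names sum and double_all are made up for illustration), a shared slice lets us read from any contiguous run of elements, while a mutable slice lets us change elements in place:

```rust
// Reads through a shared slice; works for any contiguous run of u32.
fn sum(values: &[u32]) -> u32 {
    values.iter().sum()
}

// Changes elements in place through a mutable slice.
fn double_all(values: &mut [u32]) {
    for v in values.iter_mut() {
        *v *= 2;
    }
}

fn main() {
    let mut data = vec![1, 2, 3, 4];

    // Any number of shared slices may exist at the same time.
    let head = &data[..2];
    let tail = &data[2..];
    println!("{} {}", sum(head), sum(tail));

    // A mutable slice is exclusive: no other borrow may be live here.
    double_all(&mut data[1..3]);
    println!("{:?}", data);
}
```

Neither function cares whether the elements live in a Vec, an array, or part of either one.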

The type str is the string slice itself. It's a primitive type, but we rarely see it on its own, because it is unsized and has to sit behind a reference. String literals have the type &'static str: the lifetime annotation says the data stays valid for the entire run of the program. Rust requires string slices to be valid UTF-8, and creating one that isn't (only possible through unsafe code) can lead to undefined behavior. That's why there are so many functions for checking and converting bytes!
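To make the UTF-8 rule a bit more concrete, here is a minimal sketch using std::str::from_utf8 and String::from_utf8_lossy from the standard library. The helper name parse_utf8 is hypothetical and only there to keep the example short:

```rust
use std::str;

// Hypothetical helper: returns the bytes as a &str only if they are valid UTF-8.
fn parse_utf8(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    // Valid UTF-8 bytes are reinterpreted as a &str without copying.
    let good: &[u8] = &[0xE2, 0x9C, 0x93]; // the three bytes of "✓"
    println!("{:?}", parse_utf8(good));

    // 0xFF never appears in valid UTF-8, so the check fails at runtime.
    let bad: &[u8] = &[0x68, 0x69, 0xFF];
    println!("{:?}", parse_utf8(bad));

    // When replacing bad bytes is acceptable, the lossy variant
    // substitutes U+FFFD instead of failing.
    println!("{}", String::from_utf8_lossy(bad));
}
```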

String literals are shared, though, so we can conclude one thing about them right away: they'll be immutable!

let my_string = "hello" + " world!";
//              ------- ^ --------- &str
//              |       |
//              |       `+` cannot be used to concatenate
//              |       two `&str` strings
//              &str

This might seem very unfriendly, coming from many languages, but sugar is bad for the teeth, and ownership is at the core of the Rust language - it's more important to explicitly understand what's shared and thus immutable than to make it easier to write a simple expression in an even simpler way. What do you do when you want to concat two strings, then?

Well, a couple of things can work, but you ultimately always have to go into an owned version of the same data: in the case of strings, it's called String.

A String keeps its data on the heap, it can grow, it's mutable, and it owns its data explicitly. What's the easiest way to get one?

let s = "hello ".to_owned() + "world";

If you try this, it will work. You turn the left-hand side of the concat into something that can actually be appended to, and + is defined on String with a &str on the right. to_owned comes from the ToOwned trait in the standard library, which is defined like this:

pub trait ToOwned {
    type Owned: Borrow<Self>;

    // Required method
    fn to_owned(&self) -> Self::Owned;

    // Provided method
    fn clone_into(&self, target: &mut Self::Owned) { ... }
}

It's described as a generalization of Clone for borrowed data. The generalization is useful because Clone only covers the transition from &T to T, and that can't work when the borrowed type is unsized, like str or [T]. ToOwned lets the owned type be a different one (String, Vec<T>), so it works even then. Not surprisingly, slices of cloneable elements implement ToOwned, so once you learn to use it with strings, the same works with other slices as well:

let v: &[u32] = &[1, 2];
let o = v.to_owned(); // Vec<u32>
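Here is a slightly longer sketch of the same borrowed/owned pairing across a few standard library types. The generic helper owned_copy is made up for this example; it only forwards to to_owned:

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper: works for any borrowed type with an owned counterpart.
fn owned_copy<B: ToOwned + ?Sized>(borrowed: &B) -> B::Owned {
    borrowed.to_owned()
}

fn main() {
    // str -> String
    let s: String = owned_copy("hello");

    // [T] -> Vec<T>
    let nums: &[u32] = &[1, 2, 3];
    let v: Vec<u32> = owned_copy(nums);

    // Path -> PathBuf follows the same pattern.
    let p: PathBuf = owned_copy(Path::new("/tmp/demo"));

    println!("{s} {v:?} {}", p.display());
}
```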

It should be noted that there's another way to get a String from a &str, one that might be easier on the eyes when doing a lot of concat-ing: the format! macro.

let s = format!("{} {}!", "hello", "world");

This macro happily accepts any string, be it &str or String, or anything else that can be printed with either the {} pattern (meaning it implements Display) or the {:?} pattern (meaning it implements Debug, usually via derive). format!, like print! and the rest of the family, is a macro because Rust functions can't take a variable number of arguments, and a macro lets the compiler check the format string against its arguments.
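To compare the options side by side, here is a rough sketch of four ways to build the same String. The function names are invented for the example; the methods they call (push, push_str, join) are from the standard library:

```rust
// `+` consumes the String on its left and borrows the &str on its right.
fn with_plus(a: &str, b: &str) -> String {
    a.to_owned() + " " + b
}

// push and push_str append in place to a String we already own.
fn with_push_str(a: &str, b: &str) -> String {
    let mut out = String::from(a);
    out.push(' ');
    out.push_str(b);
    out
}

// format! builds a fresh String and leaves its arguments untouched.
fn with_format(a: &str, b: &str) -> String {
    format!("{a} {b}")
}

// join works on a slice of string-like values.
fn with_join(a: &str, b: &str) -> String {
    [a, b].join(" ")
}

fn main() {
    println!("{}", with_plus("hello", "world"));
    println!("{}", with_push_str("hello", "world"));
    println!("{}", with_format("hello", "world"));
    println!("{}", with_join("hello", "world"));
}
```

Which one reads best is mostly a matter of taste; push_str tends to fit when a String is built up over several steps.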

So how does the String type work internally, so that we can move freely between &str and String? Well, it's pretty much exactly what you'd expect:

#[derive(PartialEq, PartialOrd, Eq, Ord)]
#[stable(feature = "rust1", since = "1.0.0")]
#[cfg_attr(not(test), lang = "String")]
pub struct String {
    vec: Vec<u8>,
}

It's a vector of bytes that can be borrowed, and when we borrow from it we get a slice. For completeness, to get a &[u8] from a &str, we have the .as_bytes() method.
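A short sketch of moving between the string and byte views, using as_bytes, into_bytes and String::from_utf8 from the standard library. The two helper functions are hypothetical names for this example:

```rust
// Byte length and char count differ as soon as the text is not pure ASCII.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.as_bytes().len(), s.chars().count())
}

// String -> Vec<u8> is free; the way back is checked,
// because a Vec<u8> may hold anything.
fn round_trip(s: String) -> String {
    let raw: Vec<u8> = s.into_bytes();
    String::from_utf8(raw).expect("bytes came from a valid String")
}

fn main() {
    let owned = String::from("héllo");

    // Borrowing the String yields a &str, and from there a &[u8].
    let (bytes, chars) = byte_and_char_len(&owned);
    println!("{bytes} bytes, {chars} chars");

    println!("{}", round_trip(owned));
}
```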

So strings are either literals, or owned and borrowed from the heap via extracting a slice! But isn't this a mess to work with? As in - should I pass a String to my functions or a &str? Or neither? What should I return?

Passing ownership

Let's think about passing ownership for a moment. If I have a function and I want to pass it an argument that's going to be immutable, I might not want to copy it or clone it without good reason. If I just need to peek at it, I might just do that by borrowing.

So let's say I want a function that takes a vector and peeks at it. Let's start from the basic case and then make it better:

fn vec_op(my_vec: &Vec<&str>) {
  ...
}

fn main() {
  let v = vec!["hello", "world"];
  vec_op(&v);
}

This will work, but you're asking for a shared reference to something that gets turned into a shared slice whenever it's used. Practically everything you can do with a &Vec<&str> you can also do with a &[&str], which is the natural borrow of a vector and of any other array-like type. So it pays to let your functions accept data from any such type and not just Vec (unless you really want it to be that limited).

fn vec_op(my_vec: &[&str]) {
  // higher value stuff!
}

fn main() {
    let v1 = vec!["hello", "world"];
    vec_op(&v1);

    let v2 = ["hello", "world"];
    vec_op(&v2);
}

This time around, though, we started with Vec and forgot to talk about strings! Let's see how that fares by imagining a function that is passed an immutable String:

fn str_op(my_str: &String) { ... }

Well... the same thing as before happens - the only thing we can do with &String is whatever we can do with its slice - &str, so we might as well just have that signature instead! You can notice that this is the same for any type that has a borrow/owned thing going on. So... it's best we write:

fn str_op(my_str: &str) { ... }

fn main() {
  let s1 = "hello";
  str_op(s1);

  let s2 = format!("{} world!", s1);
  str_op(&s2);
}
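The same &str signature also covers other string-holding types through deref coercion. The sketch below shows that, plus a generic AsRef<str> variant as one possible alternative when callers should be able to pass owned values too. The names shout and shout_generic are made up, and the generic form is my assumption about what can be convenient, not a recommendation from the standard library:

```rust
use std::borrow::Cow;

// A plain &str parameter: literals pass directly, everything else is borrowed with `&`.
fn shout(text: &str) -> String {
    text.to_uppercase()
}

// A generic alternative: accepts anything that can be viewed as a str,
// by value or by reference.
fn shout_generic<S: AsRef<str>>(text: S) -> String {
    text.as_ref().to_uppercase()
}

fn main() {
    let literal = "hello";
    let owned = String::from("hello");
    let boxed: Box<str> = "hello".into();
    let cow: Cow<str> = Cow::Borrowed("hello");

    // Deref coercion turns &String, &Box<str> and &Cow<str> into &str.
    println!("{}", shout(literal));
    println!("{}", shout(&owned));
    println!("{}", shout(&boxed));
    println!("{}", shout(&cow));

    // The generic version also takes owned values.
    println!("{}", shout_generic(literal));
    println!("{}", shout_generic(owned));
}
```

The generic version costs a bit of readability and produces one compiled copy per argument type, so the plain &str signature is usually the simpler default.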

If we want to change the structure itself (grow it or shrink it), we pass a mutable reference to the owned type: &mut String in this case, and &mut Vec<&str> in the vector example above. This is still a borrow rather than a transfer of ownership, but it is an exclusive one.
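A minimal sketch of the mutable side, with invented function names. Appending needs the owned type behind the reference, while an edit that keeps the length can get by with a mutable slice:

```rust
// Appending changes the length, so the owned type is required.
fn add_suffix(s: &mut String) {
    s.push_str("!");
}

fn add_word<'a>(words: &mut Vec<&'a str>, word: &'a str) {
    words.push(word);
}

// In-place edits that keep the length only need a mutable slice.
fn upper_in_place(s: &mut str) {
    s.make_ascii_uppercase();
}

fn main() {
    let mut s = String::from("hello");
    add_suffix(&mut s);
    upper_in_place(&mut s); // &mut String coerces to &mut str

    let mut words = vec!["hello"];
    add_word(&mut words, "world");

    println!("{s} {words:?}");
}
```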

When we return values, we usually want to return some semblance of normalcy, which means owning the data we return. A slice can only be returned when it borrows from something that outlives the call, such as one of the arguments. Returning a slice of data created inside the function won't work at all (the compiler will be angry). That makes things simpler: data built inside the function goes out as an owned value.
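To illustrate where the compiler draws the line, here is a small sketch with made-up functions: one returns owned data, one returns a slice that borrows from its argument, and a commented-out one shows the shape that gets rejected:

```rust
// Builds new data inside the function, so it hands back ownership.
fn greeting(name: &str) -> String {
    format!("hello, {name}!")
}

// Returns a view into its argument; the result cannot outlive `text`.
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

// This would not compile: the slice would point at a local String
// that is dropped when the function returns.
// fn broken<'a>() -> &'a str {
//     let local = String::from("oops");
//     &local
// }

fn main() {
    let g = greeting("world");
    println!("{}", first_word(&g));
}
```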

What about the Cows?

"Clone On Write" is a nice enum to know about, although it might seem funny or just plain old weird at first. Here it is from the std:

pub enum Cow<'a, B>
where
    B: 'a + ToOwned + ?Sized,
{
    Borrowed(&'a B),
    Owned(<B as ToOwned>::Owned),
}

It takes the type B which is implementing ToOwned, meaning that it can be turned into an owned type like we've seen above with slices of all kinds. The ?Sized means that B objects are maybe sized at compile-time, but maybe not. str is an example of an unsized type, so it needs to be borrowed to be useful for us. Cow has two enum variants in it -- one for the borrowed data, and one for the owned version.

Cows are useful when you want to not really care about whether some parts of your code return &str while others return String:

use std::borrow::Cow;

fn my_str_func(i: u32) -> Cow<'static, str> {
  if i % 2 == 0 {
    "this is an even number!".into()
  } else {
    format!("{} is an odd number!", i).into()
  }
}

To work with a Cow you can treat it like any other enum and match on it directly to find out which variant it is, or you can turn it into the owned version (via into_owned) or get at the borrowed version (by dereferencing it or calling as_ref, since a Cow<str> derefs to str). It's useful, but might seem intimidating. What do you think?
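If it helps, here is a sketch of those options in one place. The helpers no_spaces and describe are hypothetical; match, deref, to_mut and into_owned are the standard ways to work with a Cow:

```rust
use std::borrow::Cow;

// Hypothetical helper: only allocates when the input needs changing.
fn no_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}

// Matching works like with any other enum.
fn describe(value: &Cow<'_, str>) -> &'static str {
    match value {
        Cow::Borrowed(_) => "borrowed",
        Cow::Owned(_) => "owned",
    }
}

fn main() {
    let a = no_spaces("hello");
    let b = no_spaces("hello world");
    println!("{} / {}", describe(&a), describe(&b));

    // Deref gives a &str view of either variant.
    let view: &str = &b;
    println!("{}", view.len());

    // to_mut clones borrowed data on first write, then mutates in place.
    let mut c = no_spaces("hello");
    c.to_mut().push('!');
    println!("{}", describe(&c));

    // into_owned always ends with a String, cloning only if needed.
    let s: String = a.into_owned();
    println!("{s}");
}
```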