C++ Succinctly: Strings

Introduction

Strings are one of those troublesome things in C and C++. In the early days of the languages, strings were all character arrays, typically 7-bit ASCII (though perhaps EBCDIC on IBM mainframes that C was ported to). Then came a mess of OS-specific workarounds, such as code pages, to allow for languages with characters that were not in the English alphabet. After a period of chaos, came Unicode. Then Unicode. And then Unicode again. And a few more Unicodes here and there as well, which is the root of the problem today.

Unicode is, in essence, two things. It’s a defined series of code points in which there is a one-to-one mapping of a particular code point to a particular value; some are graphic, while others control and manipulate formatting or provide other required information. Everyone who uses Unicode agrees on all of these, including the private-use code points, which all agree are reserved for Unicode-conforming applications. So far, so good.

Then there are the encoding schemes, which is where the divisions come from. There are 1,114,112 code points in Unicode. How do you represent them? The answer was the encoding schemes. A fixed-width 16-bit encoding (UCS-2, the forerunner of UTF-16) came first. It was later followed by UTF-8 and UTF-32. There are also endianness issues with some of these.

Other formats came and went, some of which were never even part of Unicode.

Windows ultimately adopted UTF-16, as did .NET and Java. Many GNU/Linux and other UNIX-like systems adopted UTF-8. Some UNIX-like systems use UTF-32. Some might use UTF-16. The web uses UTF-8 for the most part, due to that encoding’s intentional design to be mostly backward compatible with ASCII. As long as you are working on one system, all is well. When you try to become cross-platform, things can become more confusing.

Strings

char* Strings

The char* strings (pointers to arrays of char) originally meant ASCII strings. Now they sometimes mean ASCII, but more frequently, they mean UTF-8. This is especially true in the UNIX world.

When programming for Windows, generally, you should assume that a char* string is an ASCII string or a code-page string. Code pages use the extra bit left over from 7-bit ASCII to add another 128 characters, thus creating a lot of localized text still fitting within one byte per character.

wchar_t* Strings

wchar_t* strings (pointers to arrays of wchar_t, also called wide characters) use a different, implementation-dependent character set. On Windows, this means a 16-bit value, which is used for UTF-16. You should always work with wchar_t as your native character type for Windows unless you have to support really, really old OS versions (i.e., the old Windows 9X series).

When you write a wide character string constant in code, you prefix the opening double quotes with an L. For example: const wchar_t* s = L"Hello World";. If you only need a single character, you again use the L, but with single quotes: wchar_t ch = L'A';.

std::string and std::wstring Strings

The std::string and std::wstring classes are found in the <string> header file. As you might imagine, std::string corresponds to char* while std::wstring corresponds to wchar_t*.

These classes provide a convenient way to store variable length strings and should be used for class member variables in place of their corresponding raw pointers (char* and wchar_t*). You should only use the raw pointers to pass strings as arguments, and then only if the string will be used as-is or copied locally into one of these string types.

In either case, the function should take in the string pointer as a pointer to const (e.g., const wchar_t* someStr). After all, pointers do not incur the same construction and destruction expense that std::string and std::wstring do. Using a pointer to const ensures that the function will not accidentally modify the data or try to free the memory that is pointed to.

To get a pointer to const for the contents of one of these, call its c_str member function. Note that the returned pointer points to const since the data should not be modified, nor should delete be called on the pointer. The memory is still owned and managed by the underlying std::string or std::wstring instance. This also means that if the underlying instance is destroyed, the pointer that c_str gives you becomes invalid, which is why, if you need the string data beyond the scope of the function it is being passed to, you should always store the string data in one of these types rather than storing the pointer directly.

To add text, use the append member function.

To see if a particular sequence of characters occurs in a string, use the find member function. Related functions, such as find_first_of, instead search for any one character from a set. If nothing is found, the return value will equal npos (std::string::npos or std::wstring::npos). Otherwise, it will be the index of the relevant starting point for the sequence.

To get a sub-string, use the substr member function, passing it the starting zero-based index and the number of elements (i.e., the number of char or wchar_t characters) to copy. It will return a std::string or a std::wstring. A count that runs past the end of the string is clamped to the characters available, and a starting index greater than the string's length throws std::out_of_range, so you cannot overflow a buffer by passing an inaccurate count or an improper starting index.

There are other useful methods, all of which are documented as part of the basic_string class, which is a template class that std::string and std::wstring are predefined specializations of.

std::wstringstream Strings

The std::wstringstream class (there is a std::stringstream as well) is similar to the .NET StringBuilder class. It is usable in much the same way as any other C++ Standard Library stream. I find this type very useful for constructing a string within a member function that will then be stored in a std::wstring class member.

For an example of its usage, see the Toppings::GetString member function in the ConstructorsSample\Toppings.h file. Here is its code, just as a refresher:

Conclusion

As I mentioned in the introduction, the history of strings isn't pretty, but I hope that this article has given you a proper understanding of strings in C++. The next installment of this series covers common idioms in C++.

This lesson represents a chapter from C++ Succinctly, a free eBook from the team at Syncfusion.