For IBM i customers using UTF-8, we provide this advice, which I think applies to all platforms.
When storing indexes in a UTF-8 instance, keep in mind that some characters occupy more than one byte in a UTF-8 field. The unaccented Latin letters [a-z] and [A-Z] and the Arabic numerals [0-9] use one byte each. Accented characters might use two bytes. DBCS characters might use two or three bytes.
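To make the byte counts concrete, here is a minimal Python sketch that measures how many bytes a character needs in UTF-8. The helper name utf8_len is my own and is not part of OnDemand; the sample characters are only illustrations.

```python
def utf8_len(text):
    """Return the number of bytes needed to store text in UTF-8."""
    return len(text.encode("utf-8"))


# Sample characters, written as escapes so the file encoding does not matter.
samples = {
    "Latin letter A": "A",             # 1 byte
    "digit 7": "7",                    # 1 byte
    "e with acute accent": "\u00e9",   # 2 bytes
    "Greek alpha": "\u03b1",           # 2 bytes
    "Japanese kanji": "\u65e5",        # 3 bytes
}

for name, char in samples.items():
    print(name, utf8_len(char))
```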
When using the graphical indexer of the OnDemand Administrator client with a UTF-8 instance, the Indexer Properties dialog will be presented before your sample data is displayed. You must set the Code Page to the value that matches the data being indexed.
When indexes are stored, they are converted from the Code Page specified on the Indexer Properties dialog to UTF-8 (CCSID 1208). String conversion between code pages might result in an increase in the length of the string when data is loaded on the server. For example, the OnDemand Administrator client might require two bytes to display a double-byte character, yet the server might require three bytes to store the character in the database.
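The growth during conversion can be illustrated outside the product. The sketch below uses Python's own shift_jis codec as a stand-in for a double-byte code page chosen on the Indexer Properties dialog; that choice is an assumption for illustration, and OnDemand performs the real conversion itself. The function name converted_lengths is hypothetical.

```python
def converted_lengths(value, source_codec):
    """Return (bytes in the source code page, bytes after conversion to UTF-8)."""
    return len(value.encode(source_codec)), len(value.encode("utf-8"))


# A two-character Japanese string: two bytes per character in Shift JIS,
# three bytes per character in UTF-8.
before, after = converted_lengths("\u65e5\u672c", "shift_jis")
print(before, after)  # 4 6
```

A field sized from the client-side length (4 in this sketch) would be too short for the converted value (6).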
When storing data for languages such as Greek, Russian, and Arabic, create application group string fields that are double the length you would use if the instance did not support UTF-8. For other languages, if your index values contain accented characters, make the fields longer to allow for the extra byte that each accented character might need.
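One way to check a planned field length is to measure the longest value you expect once it is encoded as UTF-8. This is a rough sketch, assuming the field length is counted in bytes as the advice above implies; suggest_field_length is a hypothetical helper, not an OnDemand function, and the sample values are made up.

```python
def suggest_field_length(values):
    """Return the largest UTF-8 byte length among the sample index values."""
    return max(len(v.encode("utf-8")) for v in values)


sample_values = [
    "\u041c\u043e\u0441\u043a\u0432\u0430",  # "Moskva" in Cyrillic, 6 characters
    "Athens",                                # unaccented Latin, 6 characters
]

# The Cyrillic value needs twice as many bytes as it has characters.
print(suggest_field_length(sample_values))  # 12
```

Run it against a representative set of real index values rather than a single sample, since the longest value decides the length.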
When you select a string, the graphical indexer sets the Field Length to the length selected in the sample data.
On the Database Field Attributes tab, the graphical indexer increases the string length to a size that is sufficient to hold the data that you have selected. If you expect that other possible values for the field might require more space than Content Manager OnDemand calculated, you can override the length by typing a different number in the space provided.