Utf8 message decoding issue

There has been several discussions concerning corrupted international characters in decoded messages in the past, including

as well as a change log mentioning a fix against incorrect handling with double byte characters in XMLLightweightParser .

We have encountered similar problem recently, suggesting the fix is incomplete. After checking XMLLightweightParser, it seems to me that the following code

if (lastChar >= 0xfff0) {
            if (Log.isDebugEnabled()) {
                Log.debug("Waiting to get complete char: " + String.valueOf(buf));
            }
            // Rewind the position one place so the last byte stays in the buffer
            // The missing byte should arrive in the next iteration. Once we have both
            // of bytes we will have the correct character
            byteBuffer.position(byteBuffer.position()-1);
            // Decrease the number of bytes read by one
            readByte--;
            // Just return if nothing was read             if (readByte == 0) {
                return;
            }
        }

is not adequate to handle messge in utf8 encoding(it works only if the message is encoded in double byte charactersets, e.g. GBK for simplified chinese). The reason is:

  1. in utf8 encoding, single character can be upto 6 bytes long.

  2. after decoding, byteBuffer’s position is pointing beyond the end of the buffer. it does not stop at the first byte of the incomplete character.

Here’s the patch. It uses CharsetDecoder’s API to gain finer control over the decoding loop.

--- E:\download\openfire_src\src\java\org\jivesoftware\openfire\nio\XMLLightweightParser.java     2008-04-24 17:28:56.000000000 +-0800
+++ E:\download\openfire_src_3_5_1\openfire_src\src\java\org\jivesoftware\openfire\nio\XMLLightweightParser.java     2008-04-30 17:53:21.000000000 +-0800
@@ -13,12 +13,14 @@ import org.apache.mina.common.ByteBuffer; import org.jivesoftware.util.Log; import java.nio.CharBuffer; import java.nio.charset.Charset;
+import java.nio.charset.CoderResult;
+import java.nio.charset.CharsetDecoder; import java.util.ArrayList; import java.util.List; /**
  * This is a Light-Weight XML Parser.
  * It read data from a channel and collect data until data are available in
@@ -148,34 +150,55 @@
         invalidateBuffer();
         // Check that the buffer is not bigger than 1 Megabyte. For security reasons
         // we will abort parsing when 1 Mega of queued chars was found.
         if (buffer.length() > 1048576) {
             throw new Exception("Stopped parsing never ending stanza");
         }
-        CharBuffer charBuffer = encoder.decode(byteBuffer.buf());
-        char[] buf = charBuffer.array();
-        int readByte = charBuffer.remaining();
+
+        // allocate a new character buffer to store the result, if it's not big
+        // enough, we'll increase the buffer size until decoding's finished
+        // the charBuffer might need to be cached to improve performance
+        CharBuffer charBuffer = CharBuffer.allocate(256);
+        CharsetDecoder decoder = encoder.newDecoder(); -        // Verify if the last received byte is an incomplete double byte character
-        char lastChar = buf[readByte-1];
-        if (lastChar >= 0xfff0) {
-            if (Log.isDebugEnabled()) {
-                Log.debug("Waiting to get complete char: " + String.valueOf(buf));
+        // loop until we've emptied the incoming buffer
+        while (true) {
+            // decode until charBuffer is filled or error
+            CoderResult cr = decoder.decode(byteBuffer.buf(), charBuffer, true);
+
+            // in case of malformed or unmappable characters, we need to skip the
+            // problematic bytes left in the buffer. the length of these bytes
+            // can be obtained from CoderResult object
+            if (cr.isError()) {
+                byteBuffer.position(byteBuffer.position() + cr.length());
+                if (Log.isDebugEnabled()) {
+                    Log.debug("Skipping malformed or unmappable byte sequece");
+                }
             }
-            // Rewind the position one place so the last byte stays in the buffer
-            // The missing byte should arrive in the next iteration. Once we have both
-            // of bytes we will have the correct character
-            byteBuffer.position(byteBuffer.position()-1);
-            // Decrease the number of bytes read by one
-            readByte--;
-            // Just return if nothing was read -            if (readByte == 0) {
-                return;
+
+            // if we have an overflow situation, increase the charBuffer limit and
+            // decode more characters
+            if (byteBuffer.remaining() != 0 && (! cr.isUnderflow())) {
+                // double the charbuffer capacity to limit the amount of allocations
+                CharBuffer tmp = CharBuffer.allocate(charBuffer.capacity() * 2);
+
+                // put content to new buffer
+                tmp.put(charBuffer.array(), 0, charBuffer.position());
+                charBuffer = tmp;
             }
+            else
+                // otherwise we've done, let's quit decode loop
+                break;
         } +        // return immediately if no character had been decoded
+        int readByte = charBuffer.position();
+        if (readByte == 0)
+                return;
+
+        char[] buf = charBuffer.array();
         buffer.append(buf, 0, readByte);
         // Do nothing if the buffer only contains white spaces
         if (buffer.charAt(0) <= ' ' && buffer.charAt(buffer.length()-1) <= ' ') {
             if ("".equals(buffer.toString().trim())) {
                 // Empty the buffer so there is no memory leak
                 buffer.delete(0, buffer.length());

Hi,

I wonder how much impact a while loop has for the performance. I would think of a method to detect the last valid byte in the buffer, similar to

public void read(ByteBuffer byteBuffer) throws Exception {
int len = byteBuffer.remaining();
byte lastByte1 = byteBuffer.get(len - 1);
/* 0xxxxxxx */
/* 110xxxxx 10xxxxxx */
/* 1110xxxx 10xxxxxx 10xxxxxx */
/* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
if ( lastByte1 == 0xxxxxxx ) {
     /* last byte is complete
      * "len" bytes can pe parsed */
} else {
     if ( lastByte1 != 10xxxxxx ) {
          /* last byte is start of new UTF-8 char
           * "len-1" bytes can pe parsed */
          len=len-1;
     } else {
          /* last byte == 10xxxxxx */
          byte lastByte2 = byteBuffer.get(len - 2);
          if ( lastByte2 != 10xxxxxx ) {
               /* last byte2 is start of new UTF-8 char
                * "len-2" bytes can pe parsed */
               len=len-2;
          } else {
               ...
          }
     }
...

Anyway I’ll create an issue for your bug report. JM-1371

LG

Hi, I believe the while loop should have no serious impact on the performance, as it’ll only be executed a very small number of times(notice the buffer size is doubled for every loop). Also, it’s the same algorithm used by CharsetDecoder.decode(ByteBuffer in), which is in turned used by Charset.decode, which is the logic I replaced.

There’s a possibility for optimization: the CharBuffer instance can be cached if there no threading issues. I’m not very familiar with the usage pattern so I avoided putting the optimization in. The current implementation won’t be worse since that’s how the JDK does it.

Hi,

I did not “notice the buffer size is doubled for every loop” before but now that you mention it … quite another bad idea to allocate memory in a loop. While Java is fast one can slow it down by allocation memory. We have the ByteBuffer with 1 MB which contains UTF-8 characters. They are converted to a StringBuilder which will be equal or smaller then the ByteBuffer.

But I agree that it makes sense to use the CharsetDecoder. Using it one can convert the buffer much faster (I’m assuming that we know how much bytes we can convert):

CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.reset();
buffer.append(byteBuffer.getString(len, decoder));

http://forum.java.sun.com/thread.jspa?threadID=738698&messageID=4240826 may also be interesting.

LG

I agree it’s a bad idea to allocate the CharBuffer in this manner. From what you’ve said, it seems to me that the incoming ByteBuffer is cached and reused, so the CharBuffer can be pre-allocated as well.

Actually, here’s the source code from CharsetDecoder.decode(ByteBuffer in) from Sun’s JDK 6.

public final CharBuffer decode(ByteBuffer in)
     throws CharacterCodingException
    {
     int n = (int)(in.remaining() * averageCharsPerByte());
     CharBuffer out = CharBuffer.allocate(n);      if ((n == 0) && (in.remaining() == 0))
         return out;
     reset();
     for (;;) {
         CoderResult cr = in.hasRemaining() ?
          decode(in, out, true) : CoderResult.UNDERFLOW;
         if (cr.isUnderflow())
          cr = flush(out);          if (cr.isUnderflow())
          break;
         if (cr.isOverflow()) {
          n = 2*n + 1;     // Ensure progress; n might be 0!
          CharBuffer o = CharBuffer.allocate(n);
          out.flip();
          o.put(out);
          out = o;
          continue;
         }
         cr.throwException();
     }
     out.flip();
     return out;
    }

you can see that there’s a loop, and there’s allocation in the loop as well. As I mentioned previously, this method is called by Charset.decode(our previous magic hammer) to do the heavy duty. This is why I said performance wise, it should be “no worse” than before.

ilyh76’s code can’t solve this problem.

it2000’s idea hits the key of this problem.

The followingcode is my solution:

if (lastChar >= 0xfff0) {

if (Log.isDebugEnabled()) {

Log.debug("Waiting to get complete char: " + String.valueOf(buf));

}

// Rewind the position one place so the last byte stays in the buffer

// The missing byte should arrive in the next iteration. Once we have both

// of bytes we will have the correct character

int backspaceLength = 1;

if (“UTF-8”.equals(encoder.displayName())) {

// Get backspace length in UTF-8

int position = byteBuffer.position();

byte abyte;

for (int i=1; i<6; i++) {

abyte = byteBuffer.get(position-i);

if ((abyte & 0xC0) == 0xC0) {

// If the byte start with 11, then this byte is the first byte of a utf-8 char

backspaceLength = i;

break;

}

}

}

byteBuffer.position(byteBuffer.position()-backspaceLength);

// Decrease the number of bytes read by one

readByte–;

// Just return if nothing was read

if (readByte == 0) {

return;

}

}

that should work for incomplete utf8 sequence. but if the sequence is not valid utf8 sequence at all, it seems you can end up in infinite loop(if my understanding of utf8 is correct): say you have a 0xBF in the stream(everything before this byte is legal utf8), after decoding, you have 0xfff0 as the lastChar, and backspaceLength defaults to 1, it doesn’t mater how many bytes are backtraced(as long as it’s not zero), once it starts decoding again, it’ll hit 0xfff0 as lastChar and start backtracing again.

Message was edited by: ilyh76: just thought of another problem, there might be less than 6 bytes in the incoming buffer. so you need some protection.

Hi,

did you ever test the code? “byteBuffer.get(position-i);” returns a signed integer value (-127…127). One may want to use "byteBuffer.getUnsigned()2.

UTF-8 allows only 4 byte characters. So one does need to test only the last 3 characters and I’d do this without a loop.

You still use the CharBuffer - I did not yet get the idea why 3 buffers are used (ByteBuffer, CharBuffer, StringBuilder). Maybe to work aroung a Mina bug? One can easily convert the byteBuffer to a String with

StringBuilder buffer;
decoder = Charset.forName("UTF-8").newDecoder();
buffer.append(byteBuffer.getString(len, decoder)); // len is the length to be converted, without incomplete UTF-8 chars

PS:

It seems that Mina has a bug, byteBuffer.remaining(), and .hasRemaining() do dot work for me as expected.

ByteBuffer byteBuffer = ByteBuffer.allocate(13);
byteBuffer.clear();
int r1= byteBuffer.remaining(); // 16 - after clear() - quite interesting
boolean h1 = byteBuffer.hasRemaining(); // true byteBuffer.compact();
int r2= byteBuffer.remaining(); // 0 - this looks fine
boolean h2 = byteBuffer.hasRemaining(); // false byteBuffer.compact();
int r3= byteBuffer.remaining(); // 16 - again after a compact() => ????
boolean h3 = byteBuffer.hasRemaining(); // Should one add some logic here to get the right result for hasRemaining and remaining?

Bellow is my “read” function and I never use CharBuffer.

I simply tested this code and find it is ok (by sending a lot of utf8 data).

The infinite loop will not appear because the decode will not stop at the malformed utf-8 char (it will replace it with a default char.), the decode only stops at the end of decoding data.

Max UTF-8 char can have 6 bytes. This article talks about UTF-8: http://www.cl.cam.ac.uk/~mgk25/unicode.html

“byteBuffer.get(position-i);” returns a signed integer value (-127…127). One may want to use "byteBuffer.getUnsigned()2.

"

I just need to compare the binary bit, so that’s no matter about negative value.

“just thought of another problem, there might be less than 6 bytes in the incoming buffer. so you need some protection”

Yes, this is necessary.

/*

  • Main reading method

*/

public void read(ByteBuffer byteBuffer) throws Exception {

invalidateBuffer();

// Check that the buffer is not bigger than 1 Megabyte. For security reasons

// we will abort parsing when 1 Mega of queued chars was found.

if (buffer.length() > 1048576) {

throw new Exception(“Stopped parsing never ending stanza”);

}

CharBuffer charBuffer = encoder.decode(byteBuffer.buf());

char[] buf = charBuffer.array();

int readByte = charBuffer.remaining();

// Verify if the last received byte is an incomplete double byte character

char lastChar = buf[c-main];

if (lastChar >= 0xfff0) {

if (Log.isDebugEnabled()) {

Log.debug("Waiting to get complete char: " + String.valueOf(buf));

}

// Rewind the position one place so the last byte stays in the buffer

// The missing byte should arrive in the next iteration. Once we have both

// of bytes we will have the correct character

// Above is openfire 3.5.1’s code

int backspaceLength = 1;

if (“UTF-8”.equals(encoder.displayName())) {

// Get backspace length in UTF-8

int position = byteBuffer.position();

byte abyte;

try {

for (int i=1; i<6; i++) {

abyte = byteBuffer.get(position-i);

if ((abyte & 0xC0) == 0xC0) {

backspaceLength = i;

break;

}

}

}

catch (IndexOutOfBoundsException e) {

// ignore

}

}

byteBuffer.position(byteBuffer.position()-backspaceLength);

// bellow is openfire 3.5.1’s code

// Decrease the number of bytes read by one

readByte–;

// Just return if nothing was read

if (readByte == 0) {

return;

}

}

buffer.append(buf, 0, readByte);

// Do nothing if the buffer only contains white spaces

if (buffer.charAt(0) <= ’ ’ && buffer.charAt(buffer.length()-1) <= ’ ') {

if ("".equals(buffer.toString().trim())) {

// Empty the buffer so there is no memory leak

buffer.delete(0, buffer.length());

return;

}

}

.

.

.

the infinite loop can occur if the malformed sequence is exactly at the end of the buffer. in this case, you need to skip, instead of backtrace.

also, would you please give me some hints on where my patch is failing? i have that patch on our production server and the problem of missing chinese character went away. however one of our QA reports that friend name and status messages could still exhibit similar behavior, but i wasn’t able to reproduce.

Message was edited by: ilyh76. another issue: your logic assumes message is always encoded in utf8, is this safe? is encoding other than utf8 allowed in jabber protocol?

Your patch’s problem is in this code:

if (cr.isError()) {

  • byteBuffer.position(byteBuffer.position() + cr.length());

  • if (Log.isDebugEnabled()) {

  • Log.debug(“Skipping malformed or unmappable byte sequece”);

  • }

}

If byteBuffer is ended with incompleted UTF8 char, cr.isError will return true (underflow) and your code just skip this char but actually we should trace back the incompleted char and let the next call of this function to decode this char.

the infinite loop can occur if the malformed sequence is exactly at the end of the buffer. in this case, you need to skip, instead of backtrace.

This problem will not happen because if no new bytes are received this function will not be called.

I add “if (“UTF-8”.equals(encoder.displayName())) {” to ensure this is an UTF-8 decoder.

the line

CoderResult cr = decoder.decode(byteBuffer.buf(), charBuffer, true);

should be changed to

CoderResult cr = decoder.decode(byteBuffer.buf(), charBuffer, false);

then everything else should be working as expected. the last parameter let the decoder know if there’re more bytes coming in. when set to false, it’ll return CoderResult.UNDERFLOW whose isError() returns false. somehow i messed it up when i created the patch.

Hi,

the article UTF-8 and Unicode FAQ is just fine: “UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.”

The examples below seem to indicate that this is also a practical solution, but http://www.ietf.org/rfc/rfc3629.txt reads:

  1. UTF-8 definition

UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and

formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646]

In UTF-8, characters from the U0000…U10FFFF range (the UTF-16

accessible range) are encoded using sequences of 1 to 4 octets.

Also your main method still uses the CharBuffer:

...
CharBuffer charBuffer = encoder.decode(byteBuffer.buf());
char[] buf = charBuffer.array();
int readByte = charBuffer.remaining();
...

Currently I prefer Tuolins patch while also it uses the CharBuffer but it also uses a CharsetDecoder instead of an Encoder which looks to be much more right. And with isUnderflow() (see also Sun Forum) it should work fine.

Personally I’d like to get rid of the CharBuffer. My Code which is not properly working yet is attached. I did include a main() to test things.

LG

LG
buggy-XMLLightweightParser.txt (16454 Bytes)

Yes, Tuolins patch seems more reasonable. It fixed the problem without “UTF-8” Charset’s restriction.

If can change the decoder to class member, it will be better.

Hey guys,

I might be missing something here. The original code was detecting the case when only one byte of a double byte character was received and was leaving that byte in the buffer so that when the next byte arrives the character could be correctly read. I see that the new patch is skipping what the decoder considers an error. Since I was never able to reproduce the problem of the double byte character I cannot assert this but I guess that if the server got a single byte of a double byte char then the decoder will consider that an error and skip the byte. The same thing could happen when the second byte arrives and what we end up is with a string that is missing a character.

XMPP servers should not skip problematic characters. They should be either read somehow or the connection should be closed for sending invalid data.

Am I missing something here?

Thanks,

– Gato

Since I was never able to reproduce the problem of the double byte character I cannot assert this…

You can send a lot of these characters “??” to reproduce the problem.

Hi,

if the last parameter passed to the CharsetDecoder.decode(ByteBuffer in, CharBuffer out, boolean endOfInput) method is set to false, indicating more bytes are coming in, the decoder will not consider it an error if the byte sequence is legal, but incomplete. in this case, the patch leave the bytes in the buffer and decoding continues in the next invokation.

However, it will always produce an error if the byte sequence is not legal at all. in this case, i simply choose to skip the bytes, of course we can also replace it with a default character(which is the previous behavior), or somehow close the connection.

to reproduce the bug, you need to construct a scenario where an international character, encoded in utf8 with three or more bytes(e.g. a chinese character is most likely encoded in THREE bytes), is split into two package and send to XMLLightweightParser in turns. The first package should contain at least the first two bytes from that character and the second package has the rest. in this case, you can see that the original code is not backtracking enough bytes.

Hi Gato,

I hope that the first thread plus “CoderResult cr = decoder.decode(byteBuffer.buf(), charBuffer, false);” does fix this problem. Because of the loop it’s not a good solution.

But Mina has a lot of bugs as far as I can tell, especially compact() rewind() and getString(len, decoder) return random results so I have problems to provide a better solution. Maybe I should update the Mina libraries, as with this code I get funny results:

public void read(ByteBuffer byteBuffer) throws Exception {
byteBuffer.rewind();System.out.println("1"+ byteBuffer.getString(10, decoder)); byteBuffer.rewind();
//byteBuffer.rewind();System.out.println("2"+ byteBuffer.getString(10, decoder));byteBuffer.rewind();           /* make sure that an empty byteBuffer is detected */
          if (byteBuffer.hasRemaining() == false) { return; };
          int len = byteBuffer.remaining();
          byteBuffer.compact();
//byteBuffer.rewind();System.out.println("3"+ byteBuffer.getString(10, decoder));byteBuffer.rewind();
          if (byteBuffer.hasRemaining() == false) { return; };
          byteBuffer.compact();
          int len2 = byteBuffer.remaining();
byteBuffer.rewind();System.out.println("4"+ byteBuffer.getString(10, decoder));byteBuffer.rewind();
...

Prints out

1<iq id="Ou

4

when byteBuffer contains ‘<iq id="Ou’. When the comments are removed

1<iq id="Ou

2<iq id="Ou

3<iq id="Ou

4<iq id="Ou

is printed.

LG

Did anyone ever come up with a resolution for this issue?

We are also having issues with utf-8.

Please see this thread:

And I also posted a comment on this issue:

http://www.igniterealtime.org/issues/browse/OF-92

Any help would be appreciated.

Daniel