There have been several discussions in the past concerning corrupted international characters in decoded messages,
as well as a change log entry mentioning a fix for incorrect handling of double-byte characters in XMLLightweightParser.
We encountered a similar problem recently, which suggests the fix is incomplete. After checking XMLLightweightParser, it seems to me that the following code
    if (lastChar >= 0xfff0) {
        if (Log.isDebugEnabled()) {
            Log.debug("Waiting to get complete char: " + String.valueOf(buf));
        }
        // Rewind the position one place so the last byte stays in the buffer
        // The missing byte should arrive in the next iteration. Once we have both
        // of bytes we will have the correct character
        byteBuffer.position(byteBuffer.position()-1);
        // Decrease the number of bytes read by one
        readByte--;
        // Just return if nothing was read
        if (readByte == 0) {
            return;
        }
    }
is not adequate to handle messages in UTF-8 encoding (it works only if the message is encoded in a double-byte character set, e.g. GBK for Simplified Chinese). The reasons are:
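- In UTF-8, a single character can be more than two bytes long (up to 4 bytes in current UTF-8, up to 6 in the original definition), so rewinding a single byte is not always enough.
- After decoding, byteBuffer's position points past the last byte of the buffer. It does not stop at the first byte of the incomplete character.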
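To illustrate the two points above, here is a small stand-alone sketch. It is not Openfire code: it uses java.nio.ByteBuffer instead of MINA's ByteBuffer, and the class name SplitUtf8Demo and the helper decodeOneShot are made up for the example. It decodes the first two bytes of a three-byte UTF-8 character in one shot, which is roughly what encoder.decode(byteBuffer.buf()) does.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

final class SplitUtf8Demo {
    // One-shot decode, comparable to encoder.decode(byteBuffer.buf()) in the snippet above.
    static String decodeOneShot(ByteBuffer chunk) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(chunk);
        return chars.toString();
    }

    public static void main(String[] args) {
        // U+4E2D takes three bytes in UTF-8: E4 B8 AD.
        byte[] full = "\u4e2d".getBytes(StandardCharsets.UTF_8);

        // Simulate a network read that delivered only the first two bytes.
        ByteBuffer chunk = ByteBuffer.wrap(full, 0, 2);
        String decoded = decodeOneShot(chunk);

        // The decoder consumes the incomplete bytes instead of stopping before them.
        System.out.println("bytes left in buffer: " + chunk.remaining());
        for (char c : decoded.toCharArray()) {
            System.out.println("decoded char: U+" + Integer.toHexString(c).toUpperCase());
        }

        // Rewinding one byte, as the old check does, restarts at a continuation byte,
        // so the lead byte of the character is already lost.
        chunk.position(chunk.position() - 1);
        int next = chunk.get(chunk.position()) & 0xFF;
        System.out.println("byte after rewind: 0x" + Integer.toHexString(next));
    }
}
```

On the JDKs I know of, the partial character comes back as the replacement character and the buffer is fully consumed, so the one-byte rewind cannot recover the start of the character.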
Here’s the patch. It uses CharsetDecoder’s API to gain finer control over the decoding loop.
--- E:\download\openfire_src\src\java\org\jivesoftware\openfire\nio\XMLLightweightParser.java 2008-04-24 17:28:56.000000000 +-0800
+++ E:\download\openfire_src_3_5_1\openfire_src\src\java\org\jivesoftware\openfire\nio\XMLLightweightParser.java 2008-04-30 17:53:21.000000000 +-0800
@@ -13,12 +13,14 @@
 
 import org.apache.mina.common.ByteBuffer;
 import org.jivesoftware.util.Log;
 
 import java.nio.CharBuffer;
 import java.nio.charset.Charset;
+import java.nio.charset.CoderResult;
+import java.nio.charset.CharsetDecoder;
 import java.util.ArrayList;
 import java.util.List;
 
 /**
  * This is a Light-Weight XML Parser.
  * It read data from a channel and collect data until data are available in
@@ -148,34 +150,55 @@
         invalidateBuffer();
         // Check that the buffer is not bigger than 1 Megabyte. For security reasons
         // we will abort parsing when 1 Mega of queued chars was found.
         if (buffer.length() > 1048576) {
             throw new Exception("Stopped parsing never ending stanza");
         }
-        CharBuffer charBuffer = encoder.decode(byteBuffer.buf());
-        char[] buf = charBuffer.array();
-        int readByte = charBuffer.remaining();
+
+        // allocate a new character buffer to store the result, if it's not big
+        // enough, we'll increase the buffer size until decoding's finished
+        // the charBuffer might need to be cached to improve performance
+        CharBuffer charBuffer = CharBuffer.allocate(256);
+        CharsetDecoder decoder = encoder.newDecoder();
 
-        // Verify if the last received byte is an incomplete double byte character
-        char lastChar = buf[readByte-1];
-        if (lastChar >= 0xfff0) {
-            if (Log.isDebugEnabled()) {
-                Log.debug("Waiting to get complete char: " + String.valueOf(buf));
+        // loop until we've emptied the incoming buffer
+        while (true) {
+            // decode until charBuffer is filled or error
+            CoderResult cr = decoder.decode(byteBuffer.buf(), charBuffer, true);
+
+            // in case of malformed or unmappable characters, we need to skip the
+            // problematic bytes left in the buffer. the length of these bytes
+            // can be obtained from CoderResult object
+            if (cr.isError()) {
+                byteBuffer.position(byteBuffer.position() + cr.length());
+                if (Log.isDebugEnabled()) {
+                    Log.debug("Skipping malformed or unmappable byte sequece");
+                }
             }
-            // Rewind the position one place so the last byte stays in the buffer
-            // The missing byte should arrive in the next iteration. Once we have both
-            // of bytes we will have the correct character
-            byteBuffer.position(byteBuffer.position()-1);
-            // Decrease the number of bytes read by one
-            readByte--;
-            // Just return if nothing was read
-            if (readByte == 0) {
-                return;
+
+            // if we have an overflow situation, increase the charBuffer limit and
+            // decode more characters
+            if (byteBuffer.remaining() != 0 && (! cr.isUnderflow())) {
+                // double the charbuffer capacity to limit the amount of allocations
+                CharBuffer tmp = CharBuffer.allocate(charBuffer.capacity() * 2);
+
+                // put content to new buffer
+                tmp.put(charBuffer.array(), 0, charBuffer.position());
+                charBuffer = tmp;
             }
+            else
+                // otherwise we've done, let's quit decode loop
+                break;
         }
 
+        // return immediately if no character had been decoded
+        int readByte = charBuffer.position();
+        if (readByte == 0)
+            return;
+
+        char[] buf = charBuffer.array();
         buffer.append(buf, 0, readByte);
         // Do nothing if the buffer only contains white spaces
         if (buffer.charAt(0) <= ' ' && buffer.charAt(buffer.length()-1) <= ' ') {
             if ("".equals(buffer.toString().trim())) {
                 // Empty the buffer so there is no memory leak
                 buffer.delete(0, buffer.length());
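For anyone who wants to try the CharsetDecoder approach outside Openfire, below is a minimal stand-alone sketch of the same idea. It is not the patch itself and rests on several assumptions: it uses java.nio.ByteBuffer rather than MINA's ByteBuffer, the class IncrementalUtf8Decoder and its feed method are hypothetical names, and malformed bytes are replaced rather than skipped. It also passes endOfInput=false to decode, so a trailing incomplete sequence is reported as underflow and stays in the buffer. With true, CharsetDecoder treats leftover input as malformed.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class IncrementalUtf8Decoder {
    // One decoder per connection; malformed input becomes U+FFFD (an assumption of this sketch).
    private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);

    // Bytes of a character whose remaining bytes have not arrived yet.
    private ByteBuffer pending = ByteBuffer.allocate(0);

    // Decodes one network chunk and returns only the complete characters.
    String feed(byte[] chunk) {
        // Prepend the leftover bytes from the previous call.
        ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.length);
        in.put(pending).put(chunk).flip();

        // UTF-8 never produces more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(in.remaining());

        // endOfInput=false: an incomplete trailing sequence is left unread in 'in'.
        decoder.decode(in, out, false);

        // Keep whatever the decoder did not consume for the next call.
        pending = ByteBuffer.allocate(in.remaining());
        pending.put(in).flip();

        out.flip();
        return out.toString();
    }
}
```

Feeding the bytes E4, B8, AD one at a time should return two empty strings and then the single character U+4E2D, whatever the length of the encoded character is.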