Encoding/Decoding for HTML & URL

HTML Encoding/Decoding

A problem we often face is that characters copied from MS Word are not displayed correctly on a web page, especially when the page is XML. The reason is that those characters (such as the single quote, ASCII 39) may be encoded differently from what the page expects, so you have to convert them to HTML escape sequences.

 

You might find that your editor doesn’t support UTF-8 and is actually using plain US-ASCII. That’s OK, because the first 128 characters of UTF-8 are exactly the same as US-ASCII. However, it’s more likely that your editor is using Windows-1252 or some other proprietary encoding, which probably also starts with US-ASCII but whose characters from 128 to 255 do not match Unicode. This is why quotes so often cause problems: in Windows-1252 the curly quotes sit between 145 and 148, whereas in Unicode the single quotes are 8216 and 8217 and the double quotes are 8220 and 8221.
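To make the numbers above concrete, here is a minimal sketch that decodes the Windows-1252 byte values 145 to 148 and prints the Unicode code point each one maps to. The class and method names are made up for illustration, and it assumes the JDK in use ships the windows-1252 charset (standard JDK builds normally do).

```java
import java.nio.charset.Charset;

class Cp1252Quotes {
    // Decodes a single Windows-1252 byte value (0-255) and returns
    // the Unicode code point it represents.
    static int toCodePoint(int byteValue) {
        byte[] raw = { (byte) byteValue };
        String decoded = new String(raw, Charset.forName("windows-1252"));
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        // 145-148 are the curly quotes in Windows-1252.
        for (int b = 145; b <= 148; b++) {
            System.out.println(b + " -> " + toCodePoint(b));
        }
    }
}
```

Running it should list 8216, 8217, 8220 and 8221, which are the values to use in numeric HTML escapes.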

 

Escape      Description           Character
&amp;       ampersand             &
&lt;        less than sign        <
&gt;        greater than sign     >
&quot;      quote                 "
&#39;       single quote          '
&#96;       back apostrophe       `
&#8216;     left single quote     ‘
&#8217;     right single quote    ’
&#8220;     left double quote     “
&#8221;     right double quote    ”

Frequently used escape characters
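The conversion can be sketched in Java roughly as follows. This is only an illustration, not a complete HTML encoder: the class name HtmlEscaper and the method escapeHtml are hypothetical, and the choice to turn every non-ASCII character into a numeric escape is an assumption that suits text pasted from MS Word.

```java
class HtmlEscaper {
    // Replaces the characters with special meaning in HTML by named or
    // numeric escapes, and every non-ASCII character (such as curly
    // quotes) by a numeric escape like &#8220;.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:
                    if (cp > 127) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u201c and \u201d are the left and right double quotes.
        System.out.println(escapeHtml("\u201ccrunch time\u201d & <b>"));
    }
}
```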

 

URL Encoding/Decoding

A web server like Tomcat has its own URL encoder and decoder. All your parameters are encoded when you send a request and decoded before further processing. However, depending on the characters you pass, they can be mis-encoded or mis-decoded. For example, suppose you have a search form whose queryString parameter equals “crunch time”. When you submit the form to Tomcat, it is encoded as

queryString=%e2%80%9ccrunch+time%e2%80%9d

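and may be decoded by the Tomcat container as something like

â€œcrunch timeâ€

As you can see, the quotation marks have been mis-decoded by Tomcat, typically because the UTF-8 bytes were read with a single-byte charset such as ISO-8859-1, so you do not get the expected value (“crunch time”) for further processing.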
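The effect can be reproduced outside the container. The sketch below decodes the same encoded value twice with java.net.URLDecoder, once as ISO-8859-1 (assumed here to stand in for the container's charset) and once as UTF-8. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeComparison {
    // Decodes a URL-encoded value with the given charset name.
    static String decodeAs(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%e2%80%9ccrunch+time%e2%80%9d";
        String wrong = decodeAs(encoded, "ISO-8859-1");
        String right = decodeAs(encoded, "UTF-8");
        // With ISO-8859-1 each 3-byte quote becomes 3 separate characters,
        // so the wrong result is longer than the right one.
        System.out.println("ISO-8859-1 length: " + wrong.length());
        System.out.println("UTF-8 length:      " + right.length());
    }
}
```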

 

To solve this problem, you can extract the raw, still-encoded value from the query string

request.getQueryString();

and decode it

URLDecoder.decode(valueStrippedOut, "UTF-8");

manually, rather than relying on Tomcat’s own decoding. Here valueStrippedOut should hold %e2%80%9ccrunch+time%e2%80%9d
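Put together, the two steps might look like the sketch below. To keep it runnable without a servlet container, the query string is passed in as a plain String; in a servlet it would come from request.getQueryString(). The helper names rawValue and decodedValue are hypothetical, and the code assumes parameter names are plain ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryStringParam {
    // Returns the still-encoded value of the named parameter,
    // or null if the parameter is not present.
    static String rawValue(String queryString, String name) {
        if (queryString == null) {
            return null;
        }
        for (String pair : queryString.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                return eq < 0 ? "" : pair.substring(eq + 1);
            }
        }
        return null;
    }

    // Extracts the raw value and decodes it explicitly as UTF-8.
    static String decodedValue(String queryString, String name) {
        String valueStrippedOut = rawValue(queryString, name);
        if (valueStrippedOut == null) {
            return null;
        }
        try {
            return URLDecoder.decode(valueStrippedOut, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String qs = "queryString=%e2%80%9ccrunch+time%e2%80%9d&page=2";
        System.out.println(rawValue(qs, "queryString"));
        System.out.println(decodedValue(qs, "queryString"));
    }
}
```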

 

You should also be careful when you create a hyperlink from the search text, such as

queryString=“crunch time”

The problem is that Tomcat does not recognize these raw quotation characters and will mishandle them. In this case you should pass the following encoded query instead.

queryString=%e2%80%9ccrunch+time%e2%80%9d
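One way to produce the encoded form is java.net.URLEncoder, as in this sketch. The path "/search" and the names SearchLink and searchLink are placeholders for illustration. Note that URLEncoder writes the hex digits in upper case (%E2%80%9C), which is equivalent to the lower-case form shown above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchLink {
    // Builds a link whose queryString parameter is encoded as UTF-8.
    static String searchLink(String basePath, String query) {
        try {
            return basePath + "?queryString=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // \u201c and \u201d are the left and right double quotes.
        System.out.println(searchLink("/search", "\u201ccrunch time\u201d"));
    }
}
```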

 

Another thing about hyperlinks that you may already know, but have to be careful with: the value of a parameter you pass must not contain a raw “&” character, because “&” separates one parameter from the next. You have to replace it with %26.
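If every name and value is passed through URLEncoder before the pieces are joined, the %26 replacement happens automatically. The sketch below shows this for a small map of parameters. The class name and the sample values are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringBuilder {
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Joins name=value pairs with "&"; any "&" inside a name or value
    // is encoded as %26 so it cannot be mistaken for a separator.
    static String build(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(entry.getKey()))
              .append('=')
              .append(encode(entry.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("queryString", "AT&T");
        params.put("page", "2");
        System.out.println(build(params));
    }
}
```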

 


 

 

 

 
