-
Re: Invalid EUC_JP char seq bug?
> I am using PHP with postgreSQL and I have been getting a few rare errors
> while trying to do selects on a table containing EUC_JP text.
>
> I thought it was a bug with PHP not recognizing a string as invalid
> EUC_JP characters and wrote up a bug report but the PHP developers
> assure me that the string that is generating the error is a valid EUC_JP
> string (I don't know anything about character encodings so I am taking
> them at their word and the fact that the string displays fine in my
> browser as EUC_JP lends me to suspect they might be right).
>
> The offending string is url encoded as such:
>
> words=%8f%ac%90%ec%96%be%93%fa%8d%81
>
> When I try and do a SELECT I get the following error:
>
> select id from products where name like '??????'
> Query failed: ERROR: Invalid EUC_JP character sequence found (0x8100)
Since you did not show us exact query you send to PostgreSQL, I assume
the query passed to PostgreSQL is:
select id from products where name like 'string';
where string is "0x8fac90ec96be93fa8d81".
If the string is supposed to be an EUC_JP, it would be parsed as follows:
8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character)
ac90: a JIS 0212 character
ec96: a JIS 0208 character
be93: a JIS 0208 character
fa8d: a JIS 0208 character
81: ???
The last 0x81 is invalid if the string is assumed as EUC_JP.
> (Where did the 0x00 come from??)
trailing '\0'.
> Can someone let me know if this truly is a bug in postgres?
No.
> Thanks,
>
> Jean-Christian Imbeault
>
>
> PS I have also had the error pop up with this string:
>
> search_words=%B7%F6%BA%7E
> select id from products where name like '??~'
> Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)
This is definitly a bad EUC_JP.
--
Tatsuo Ishii
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
-
Re: Invalid EUC_JP char seq bug?
Tatsuo Ishii wrote:
>
> Since you did not show us exact query you send to PostgreSQL
I can't show the exact query because it is generated by PHP. I can
however show you the code that generates the query:
$words = $_GET["words"];
$sql = "select id from products where name like '$words'";
$conn = pg_connect("host=$DB_IP port=5432 dbname=$DB_NAME user=postgres");
$res = pg_query($conn, $sql);
The GET query string was:
words=%8f%ac%90%ec%96%be%93%fa%8d%81
I think that PHP does some internal translation of this before passing
it on though.
> I assume the query passed to PostgreSQL is:
>
> select id from products where name like 'string';
Yes.
> where string is "0x8fac90ec96be93fa8d81".
That I don't know.
> If the string is supposed to be an EUC_JP, it would be parsed as follows:
>
> 8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character
[snip ...]
Ah ... so it is not an EUC-JP string but an SJIS string. Postgres was
right. That answers my question. Thanks!
>>PS I have also had the error pop up with this string:
>>
>>search_words=%B7%F6%BA%7E
>>select id from products where name like '??~'
>>Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)
>
>
> This is definitly a bad EUC_JP.
According to a PHP developer in my bug report
(http://bugs.php.net/bug.php?id=24309&edit=2):
"URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is
B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
believe encoding detection of mbstring works fine in this case.
B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
correct EUC-JP byte sequence."
I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
bug report to PHP and pointed this out. Hopefully the developer will see
that this sequence is incorrect EUC-JP and that PHP failed to detect this :)
I *knew* there was nothing wrong with Postgres ;)
Thanks!
Jean-Christian Imbeault
PS I posted to HACKERS a few weeks ago about another bug (a real one :)
in the EUC-JP translation having to do with the WAVE DASH. I'll repost
here on the BUGS list, could you let me know the status of that BUG? Thanks!
---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
joining column's datatypes do not match
-
Re: Invalid EUC_JP char seq bug?
> >>search_words=%B7%F6%BA%7E
> >>select id from products where name like '??~'
> >>Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)
> >
> >
> > This is definitly a bad EUC_JP.
>
> According to a PHP developer in my bug report
> (http://bugs.php.net/bug.php?id=24309&edit=2):
>
> "URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is
> B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
> believe encoding detection of mbstring works fine in this case.
> B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
> correct EUC-JP byte sequence."
>
> I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
> bug report to PHP and pointed this out. Hopefully the developer will see
> that this sequence is incorrect EUC-JP and that PHP failed to detect this :)
In the EUC_JP encoding there are some rules:
1) if the first byte is 0x8e then second byte is a JIS 0201 character
and should be greater than 0x7f
2) else if the first byte is 0x8f then second and third byte is a JIS
0212 character and they should be greater than 0x7f
3) else if the first byte is greater than 0x7f then second and third
byte is a JIS 0208 character and they should be greater than 0x7f
4) else the byte is ASII and should be eqaul to or less than 0x7f
Apparently:
B7F6: this is ok. we can apply rule #3
BA7E: this is not good, since it satisfies non of rule #1 to #4
> Thanks!
>
> Jean-Christian Imbeault
>
> PS I posted to HACKERS a few weeks ago about another bug (a real one :)
> in the EUC-JP translation having to do with the WAVE DASH. I'll repost
> here on the BUGS list, could you let me know the status of that BUG? Thanks!
Sorry for the delay. In EUC-JP <--> Unicode translation, WAVE DASH is
always a problem since there are several different mappings among
different vendors/standards. I think I need more time to solve this.
--
Tatsuo Ishii
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly