+ Reply to Thread
Results 1 to 3 of 3

Re: Invalid EUC_JP char seq bug?

  1. Re: Invalid EUC_JP char seq bug?

    > I am using PHP with postgreSQL and I have been getting a few rare errors
    > while trying to do selects on a table containing EUC_JP text.
    >
    > I thought it was a bug with PHP not recognizing a string as invalid
    > EUC_JP characters and wrote up a bug report but the PHP developers
    > assure me that the string that is generating the error is a valid EUC_JP
    > string (I don't know anything about character encodings so I am taking
    > them at their word and the fact that the string displays fine in my
    > browser as EUC_JP lends me to suspect they might be right).
    >
    > The offending string is url encoded as such:
    >
    > words=%8f%ac%90%ec%96%be%93%fa%8d%81
    >
    > When I try and do a SELECT I get the following error:
    >
    > select id from products where name like '??????'
    > Query failed: ERROR: Invalid EUC_JP character sequence found (0x8100)


    Since you did not show us exact query you send to PostgreSQL, I assume
    the query passed to PostgreSQL is:

    select id from products where name like 'string';

    where string is "0x8fac90ec96be93fa8d81".

    If the string is supposed to be an EUC_JP, it would be parsed as follows:

    8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character)
    ac90: a JIS 0212 character
    ec96: a JIS 0208 character
    be93: a JIS 0208 character
    fa8d: a JIS 0208 character
    81: ???

    The last 0x81 is invalid if the string is assumed as EUC_JP.

    > (Where did the 0x00 come from??)


    trailing '\0'.

    > Can someone let me know if this truly is a bug in postgres?


    No.

    > Thanks,
    >
    > Jean-Christian Imbeault
    >
    >
    > PS I have also had the error pop up with this string:
    >
    > search_words=%B7%F6%BA%7E
    > select id from products where name like '??~'
    > Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)


    This is definitly a bad EUC_JP.
    --
    Tatsuo Ishii

    ---------------------------(end of broadcast)---------------------------
    TIP 3: if posting/reading through Usenet, please send an appropriate
    subscribe-nomail command to majordomo@postgresql.org so that your
    message can get through to the mailing list cleanly


  2. Re: Invalid EUC_JP char seq bug?

    Tatsuo Ishii wrote:
    >
    > Since you did not show us exact query you send to PostgreSQL


    I can't show the exact query because it is generated by PHP. I can
    however show you the code that generates the query:


    $words = $_GET["words"];
    $sql = "select id from products where name like '$words'";
    $conn = pg_connect("host=$DB_IP port=5432 dbname=$DB_NAME user=postgres");
    $res = pg_query($conn, $sql);

    The GET query string was:

    words=%8f%ac%90%ec%96%be%93%fa%8d%81

    I think that PHP does some internal translation of this before passing
    it on though.

    > I assume the query passed to PostgreSQL is:
    >
    > select id from products where name like 'string';


    Yes.

    > where string is "0x8fac90ec96be93fa8d81".


    That I don't know.

    > If the string is supposed to be an EUC_JP, it would be parsed as follows:
    >
    > 8f: single shift 3 (indicates that following 2 bytes are a JIS 0212 character


    [snip ...]

    Ah ... so it is not an EUC-JP string but an SJIS string. Postgres was
    right. That answers my question. Thanks!

    >>PS I have also had the error pop up with this string:
    >>
    >>search_words=%B7%F6%BA%7E
    >>select id from products where name like '??~'
    >>Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)

    >
    >
    > This is definitly a bad EUC_JP.


    According to a PHP developer in my bug report
    (http://bugs.php.net/bug.php?id=24309&edit=2):

    "URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is
    B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
    believe encoding detection of mbstring works fine in this case.
    B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
    correct EUC-JP byte sequence."

    I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
    bug report to PHP and pointed this out. Hopefully the developer will see
    that this sequence is incorrect EUC-JP and that PHP failed to detect this :)

    I *knew* there was nothing wrong with Postgres ;)

    Thanks!

    Jean-Christian Imbeault

    PS I posted to HACKERS a few weeks ago about another bug (a real one :)
    in the EUC-JP translation having to do with the WAVE DASH. I'll repost
    here on the BUGS list, could you let me know the status of that BUG? Thanks!


    ---------------------------(end of broadcast)---------------------------
    TIP 9: the planner will ignore your desire to choose an index scan if your
    joining column's datatypes do not match


  3. Re: Invalid EUC_JP char seq bug?

    > >>search_words=%B7%F6%BA%7E
    > >>select id from products where name like '??~'
    > >>Query failed: ERROR: Invalid EUC_JP character sequence found (0xba7e)

    > >
    > >
    > > This is definitly a bad EUC_JP.

    >
    > According to a PHP developer in my bug report
    > (http://bugs.php.net/bug.php?id=24309&edit=2):
    >
    > "URL decoded byte sequance of 'search_words=%B7%F6%BA%7E' is
    > B7E6+BA7E, which is correct EUC-JP character sequence. [snip] But, I
    > believe encoding detection of mbstring works fine in this case.
    > B7E6+BA7E is not correct byte sequence of SJIS, UTF-8, ISO2022-JP. It is
    > correct EUC-JP byte sequence."
    >
    > I see that he wrote B7E6 instead of the correct B7F6. I resubmitted my
    > bug report to PHP and pointed this out. Hopefully the developer will see
    > that this sequence is incorrect EUC-JP and that PHP failed to detect this :)


    In the EUC_JP encoding there are some rules:

    1) if the first byte is 0x8e then second byte is a JIS 0201 character
    and should be greater than 0x7f

    2) else if the first byte is 0x8f then second and third byte is a JIS
    0212 character and they should be greater than 0x7f

    3) else if the first byte is greater than 0x7f then second and third
    byte is a JIS 0208 character and they should be greater than 0x7f

    4) else the byte is ASII and should be eqaul to or less than 0x7f

    Apparently:

    B7F6: this is ok. we can apply rule #3
    BA7E: this is not good, since it satisfies non of rule #1 to #4

    > Thanks!
    >
    > Jean-Christian Imbeault
    >
    > PS I posted to HACKERS a few weeks ago about another bug (a real one :)
    > in the EUC-JP translation having to do with the WAVE DASH. I'll repost
    > here on the BUGS list, could you let me know the status of that BUG? Thanks!


    Sorry for the delay. In EUC-JP <--> Unicode translation, WAVE DASH is
    always a problem since there are several different mappings among
    different vendors/standards. I think I need more time to solve this.
    --
    Tatsuo Ishii

    ---------------------------(end of broadcast)---------------------------
    TIP 3: if posting/reading through Usenet, please send an appropriate
    subscribe-nomail command to majordomo@postgresql.org so that your
    message can get through to the mailing list cleanly


+ Reply to Thread