15 Jun 2005 tnt   » (Master)

PHP SimpleXML CDATA Problem... and My Solution
PHP5 has a new built-in way of handling XML. It's called SimpleXML.

Using this object for "working with" XML can make development a lot faster. SimpleXML parses an XML document and turns it into an object. So if we had a document like:

<?xml version="1.0"?>
<tvshows>
    <show>
        <name>The Simpsons</name>
    </show>
    <show>
        <name>That '70s Show</name>
    </show>
    <show>
        <name>Family Guy</name>
    </show>
    <show>
        <name>Lois &amp; Clark</name>
    </show>
</tvshows>

Then SimpleXML would give us a (PHP) object something like:
object(SimpleXMLElement)#1 (1) {
  ["show"]=>
  array(4) {
    [0]=>
    object(SimpleXMLElement)#2 (1) {
      ["name"]=>
      string(12) "The Simpsons"
    }
    [1]=>
    object(SimpleXMLElement)#3 (1) {
      ["name"]=>
      string(14) "That '70s Show"
    }
    [2]=>
    object(SimpleXMLElement)#4 (1) {
      ["name"]=>
      string(10) "Family Guy"
    }
    [3]=>
    object(SimpleXMLElement)#5 (1) {
      ["name"]=>
      string(12) "Lois & Clark"
    }
  }
}

(The output above is what you would get if you called var_dump() on the object. It probably looks more complex than it really is. Basically, to get at the "The Simpsons" part, we would write "$simplexml->show[0]->name", since the indexes start at 0.)

This is useful because: #1 we save a lot of time by not having to use the old XML parsing methods (... which aren't difficult, just time-consuming), #2 we can "use" this in a "foreach" structure, and #3 it's easier for newbies to learn.
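
As a rough sketch of the access pattern described above (the variable names are just for illustration), loading the sample document and walking it looks something like this. Each element is still a SimpleXMLElement, so it is cast to a string before use:

```php
<?php
// Illustrative sketch: $xml_data, $simplexml, $first and $names are example names.
$xml_data = <<<'XML'
<?xml version="1.0"?>
<tvshows>
    <show> <name>The Simpsons</name> </show>
    <show> <name>That '70s Show</name> </show>
    <show> <name>Family Guy</name> </show>
    <show> <name>Lois &amp; Clark</name> </show>
</tvshows>
XML;

$simplexml = simplexml_load_string($xml_data);

// Direct access by index (indexes start at 0).
$first = (string) $simplexml->show[0]->name;

// Iterate over every <show> element.
$names = array();
foreach ($simplexml->show as $show) {
    $names[] = (string) $show->name;
}

echo $first, "\n";
echo implode(', ', $names), "\n";
```

The nowdoc syntax (<<<'XML') needs PHP 5.3 or later; on older versions an ordinary quoted string would do the same job.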

The one big problem is, SimpleXML does not handle CDATA!

(If you don't know what an XML CDATA section is, look at: http://en.wikipedia.org/wiki/CDATA_section)

Look at the last entry:

        <name>Lois &amp; Clark</name>

What if we used CDATA instead to represent this, and had:

        <name><![CDATA[Lois & Clark]]></name>
Well then too bad! SimpleXML just skips all that. It just pretends that it wasn't even there! (Note that when we put the text in the CDATA block, we were able to change the "&amp;" to a "&".)

So in other words, if we had:

<?xml version="1.0"?>
<tvshows>
    <show>
        <name>The Simpsons</name>
    </show>
    <show>
        <name>That '70s Show</name>
    </show>
    <show>
        <name>Family Guy</name>
    </show>
    <show>
        <name><![CDATA[Lois & Clark]]></name>
    </show>
</tvshows>

Then we'd get:
object(SimpleXMLElement)#1 (1) {
  ["show"]=>
  array(4) {
    [0]=>
    object(SimpleXMLElement)#2 (1) {
      ["name"]=>
      string(12) "The Simpsons"
    }
    [1]=>
    object(SimpleXMLElement)#3 (1) {
      ["name"]=>
      string(14) "That '70s Show"
    }
    [2]=>
    object(SimpleXMLElement)#4 (1) {
      ["name"]=>
      string(10) "Family Guy"
    }
    [3]=>
    object(SimpleXMLElement)#5 (1) {
      ["name"]=>
      object(SimpleXMLElement)#6 (0) {
      }
    }
  }
}
Note that the "Lois & Clark" part isn't even there!

So, what's the solution? Well, we can turn the CDATA into XML "escaped" text before giving the "XML data" to SimpleXML. In other words, take the CDATA and do the following conversions...

&    becomes    &amp;
"    becomes    &quot;
<    becomes    &lt;
>    becomes    &gt;
(And of course, drop the "<![CDATA[" and "]]>" too.)
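
A minimal sketch of just that escaping step, assuming the text has already been pulled out of its CDATA section (the helper name escape_cdata_text is made up for illustration):

```php
<?php
// Hypothetical helper: applies the four conversions from the table above.
function escape_cdata_text($text)
{
    // strtr() with an array makes a single pass and never re-scans its
    // own output, so the '&' in "&lt;" is not escaped a second time.
    return strtr($text, array(
        '&' => '&amp;',
        '"' => '&quot;',
        '<' => '&lt;',
        '>' => '&gt;',
    ));
}

echo escape_cdata_text('Lois & Clark'), "\n";   // Lois &amp; Clark
echo escape_cdata_text('a < b > "c"'), "\n";    // a &lt; b &gt; &quot;c&quot;
```

The order matters if the replacements are done one at a time instead: "&" has to be converted first, or the "&" in the freshly written entities gets escaped again.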

I tried doing this with regular expressions but just couldn't figure out the proper way to represent "not a string". (I tried it with POSIX regular expressions and Perl-compatible regular expressions, but couldn't get anything to work.) So, eventually I just decided to write a function for it. (Which is tedious.) So, here it is. Hopefully it will help everyone else to not get frustrated with SimpleXML being too simple:

    function uncdata($xml)
    {
        // States:
        //
        //     'out'
        //     '<'
        //     '<!'
        //     '<!['
        //     '<![C'
        //     '<![CD'
        //     '<![CDA'
        //     '<![CDAT'
        //     '<![CDATA'
        //     'in'
        //     ']'
        //     ']]'
        //
        // (Yes, the states are represented by strings.)
        //

        $state = 'out';
                                                                                                                                              
        $a = str_split($xml);
                                                                                                                                              
        $new_xml = '';
                                                                                                                                              
        foreach ($a AS $k => $v) {
                                                                                                                                              
            // Deal with "state".
            switch ( $state ) {
                case 'out':
                    if ( '<' == $v ) {
                        $state = $v;
                    } else {
                        $new_xml .= $v;
                    }
                break;
                                                                                                                                              
                case '<':
                    if ( '!' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;
                                                                                                                                              
                 case '<!':
                    if ( '[' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;
                                                                                                                                              
                case '<![':
                    if ( 'C' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;
                                                                                                                                              
                case '<![C':
                    if ( 'D' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;
                                                                                                                                              
                case '<![CD':
                    if ( 'A' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;
                                                                                                                                              
                case '<![CDA':
                    if ( 'T' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;
                                                                                                                                              
                case '<![CDAT':
                    if ( 'A' == $v  ) {
                        $state = $state . $v;
                    } else {
                        $new_xml .= $state . $v;
                        $state = 'out';
                    }
                break;
                                                                                                                                              
                case '<![CDATA':
                    if ( '[' == $v  ) {

$cdata = ''; $state = 'in'; } else { $new_xml .= $state . $v; $state = 'out'; } break; case 'in': if ( ']' == $v ) { $state = $v; } else { $cdata .= $v; } break; case ']': if ( ']' == $v ) { $state = $state . $v; } else { $cdata .= $state . $v; $state = 'in'; } break; case ']]': if ( '>' == $v ) { $new_xml .= str_replace('>','&gt;', str_replace('>','&lt;', str_replace('"','&quot;', str_replace('&','&amp;', $cdata)))); $state = 'out'; } else { $cdata .= $state . $v; $state = 'in'; } break; } // switch } // // Return. // return $new_xml; }

So to use this, you'd do something like:

    // Get the XML data, with possible CDATA sections in it.
    $xml_data = file_get_contents('http://changelog.ca/feed/rss/');

    // Convert the CDATA sections using the un-cdata function.
    $xml_data = uncdata($xml_data);

    // Create the SimpleXML object (not having to worry about losing info due to CDATA)
    $simplexml = simplexml_load_string($xml_data);

Just an extra note. I'm not sure how efficient this is with the use of the str_split() function (to turn a string into an array of characters). But if you are using SimpleXML, you're probably not really worried about that. (Or at least not at that stage of development.)
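
For what it's worth, a Perl-compatible regular expression with a non-greedy match looks like it can do the same job, since a CDATA section ends at the first "]]>". The following is only a sketch of that alternative, not a replacement for the function above. The name uncdata_regex is made up for illustration, it needs PHP 5.3 or later for the anonymous function, and it has not been exercised on large or malformed feeds:

```php
<?php
// Hypothetical alternative to a hand-written state machine.
function uncdata_regex($xml)
{
    $result = preg_replace_callback(
        // (.*?) is non-greedy, so it stops at the first "]]>".
        // The /s modifier lets "." match newlines inside the section.
        '/<!\[CDATA\[(.*?)\]\]>/s',
        function ($matches) {
            // Escape the section's text with the same four conversions.
            return strtr($matches[1], array(
                '&' => '&amp;',
                '"' => '&quot;',
                '<' => '&lt;',
                '>' => '&gt;',
            ));
        },
        $xml
    );

    // preg_replace_callback() returns NULL on a PCRE error;
    // in that case hand back the input unchanged.
    return (null === $result) ? $xml : $result;
}

echo uncdata_regex('<name><![CDATA[Lois & Clark]]></name>'), "\n";
```

This avoids splitting the document into an array of single characters, though whether it is actually faster for a given feed would need measuring.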

Hopefully someone will find this useful. (If you find any errors or bugs with it, send me an e-mail and let me know.)

The Proper Way to Use PHP's eval()
Many people say it is bad practice to use the eval() procedure in any language, and that there is always a better way to do it. I disagree.

I think that some situations can warrant the use of the eval() procedure, especially if you use it properly and carefully.

The PHP eval() procedure is no different. There are two important rules to remember when using PHP's eval().

  1. Always check the return value from eval().
  2. Make it so your "code" returns an "OK" signal when it is done.

Let me explain this more. Since you are "executing" unknown code, you really don't know whether the code has any syntax errors or any other errors. You should be making an effort to check for this.

The PHP eval() procedure will return FALSE if there is an error. Therefore, you should check to see if it returns FALSE. So, you should be doing something like:

    if (  FALSE === eval($code) ) {
        // Error.
        // ... handle the error ...
    }
However, it is possible that eval() could return FALSE even if it did not have this kind of error. So, you must set up your code so that it returns something other than FALSE when there is no error. I suggest it returns TRUE. You can do this by doing something like:

    $code .= 'return TRUE;';

    if ( FALSE === eval($code) ) {
        // Error.
        // ... handle the error ...
    }

That way you know if everything went OK, then it will return TRUE.

So, for a fuller example we might have something like:

    // This procedure returns a string that is a legal PHP expression.
    $variable_code = get_variable_code();

    $code = '$a = ' . $variable_code . ';';
    // $code = "\$a = $variable_code;";

    $code .= 'return TRUE;';

    if ( FALSE === eval($code) ) {
        // Error.
        // ... handle the error ...
    }
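
One caveat for newer versions: on PHP 7 and later, a parse error inside eval() throws a ParseError instead of making eval() return FALSE, so the check is worth pairing with a try/catch. A minimal sketch, assuming the expression comes from a source you trust (the function name eval_expression is hypothetical):

```php
<?php
// Hypothetical wrapper: evaluates one PHP expression and reports success.
// The value of the expression is handed back through $result.
function eval_expression($expression, &$result)
{
    $code  = '$result = ' . $expression . ';';
    $code .= 'return TRUE;';

    try {
        // eval() runs in this function's scope, so the assignment
        // inside $code writes to the $result reference parameter.
        return TRUE === eval($code);
    } catch (Throwable $e) {
        // PHP 7+: parse errors and other Errors/Exceptions land here.
        return FALSE;
    }
}

if (eval_expression('1 + 2', $value)) {
    echo $value, "\n";   // 3
}
```

The "return TRUE" rule still does its job here: anything other than an explicit TRUE coming back is treated as a failure.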

You can do a lot more interesting things with eval() too.
