gsnedders

Only thirty minutes until I can have more paracetamol! Yay!

PHP Grievances

Tags: June 28, 2009 (2 comments)

Below is a list of things that have annoyed me with PHP while writing various pieces of code, especially SimplePie 2 and the PHP html5lib (this list will probably be added to over time):

  • No native Unicode support (how a web facing high-level language can still not support Unicode in 2009 is beyond me: HTML, XML, CSS, and ECMAScript are all layered on top of Unicode, so almost everything going over the web is Unicode). There's no real way to implement Unicode in an interpreted language without taking a large performance hit, and the two main extensions PHP has for dealing with Unicode don't help in a lot of situations (mbstring, for example, can return a string that isn't UTF-8 when you try to convert data to UTF-8; iconv (per spec) fails when it hits an invalid byte, which doesn't work for a lot of web stuff).
  • The library of classes and functions you can actually rely upon is very small, because almost everything can be turned off via --disable-all. This means you end up having to re-implement in conditionals stuff you really ought to be able to rely upon. Even the "PHP Standard Library" can be disabled, which makes it non-standard, and means you can't rely upon it. I'd much rather there was a sane set of extensions that were always enabled and could not be disabled: preferably, everything that ships as PHP should always be enabled and should not be able to be disabled.
  • No native queue structure. Sure, you can use array_shift and array_push, but array_shift is an O(n) operation, and n can get quite large in the real-world, to a large enough extreme that it becomes undesirable from a performance point of view. While PHP 5.3 adds SPLQueue, this suffers from the problem of it being something that can be disabled (see above). I'd expect something as basic as a queue structure to exist in a language in the 21st century.
  • Inability to override object comparison. I may want objects that represent the URIs http://example.com and http://example.com/ to be equivalent, while keeping their original form.
  • Inability to override how objects are type-cast to any type apart from string (bool would be nice, for a start…).
  • ((bool) '0' === false) is an endless source of bugs.
  • There never appears to be that much regard to backwards/forwards compatibility, as it seems there is willingness to break small things in each release causing problems for a lot of people, but never to have more major changes that would break everything, but fix a lot of the problems with the language.
  • When implementing things like the DOM API, there are subtle differences to the spec.
  • XMLWriter doesn't actually necessarily output XML. Bug reports saying so get marked as bogus, so are we meant to assume that XMLWriter isn't actually meant to be usable as an XML serializer that can be relied upon? If that is the case, what are we meant to use? Is it too much to ask for an XML serializer whose output always meets the "document" production in the XML specification (sorry, standard)?
  • There is an overhead of around 70 bytes per array entry, which makes using arrays of codepoints a not-entirely-satisfactory workaround of the language's lack of Unicode support.
  • Learning the argument order of functions takes years to do, if ever, a fact which is in part due to the internal variation, for example between array_search($needle, $haystack) and strpos($haystack, $needle).
  • PHP internally has a function that does type-hinting to some extent (zend_parse_parameters), yet proposals for type-hinting in userland of anything apart from arrays and objects are repeatedly rejected.
  • The filter extension (which apparently people need to use, according to the PHP developers) is buggy to the extreme of being useless. See this for some issues with its IPv6 support (I've also found issues, including regressions from 5.2.6 to 5.2.9… maybe I'll need to implement my own again). Equally, the URL filter uses the parse_url function internally, which the manual notes is not meant to validate the given URL, which hardly inspires confidence in the extension. Why should I use the filter extension when it can be disabled, and when it is buggy? I cannot use it in distributed code that must run consistently on PHP 5.2.0 and above, even if those bugs are fixed in future releases. Likewise, its having any bugs makes me wary of using it at all, as it makes me suspect it of having further bugs.
  • One of the most annoying version inconsistencies, and one that hit both SimplePie 1 and MagpieRSS badly, was data missing all pre-defined XML entities, a bug that was ultimately caused by PHP using an internal libxml2 API, which (unsurprisingly for an internal API) changed in libxml2 2.7. This means that with any version of PHP less than 5.2.9 and a version of libxml2 of 2.7 or above, the xml extension is more or less useless. This was, thankfully, fixed in PHP 5.2.9 by using a public API only added in libxml2 2.7.3, so with libxml2 2.7.0–2.7.2 the xml extension never works.
  • The wonderful zend.ze1_compatibility_mode, when turned on, causes $foo = new ReflectionClass('StdClass'); to throw E_ERROR (i.e., a fatal error).
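
A few of the points above are easy to demonstrate. What follows is only a minimal sketch, assuming PHP 5.3 or later with SPL enabled; the variable names are mine and purely illustrative:

```php
<?php
// The string '0' is falsy, so a truthiness check cannot tell
// "no value" apart from the perfectly legitimate value "0".
$title = '0';
var_dump((bool) $title);   // bool(false)
var_dump(empty($title));   // bool(true)
var_dump($title !== '');   // bool(true): an explicit comparison is needed

// Argument order differs between otherwise similar functions.
var_dump(array_search('b', array('a', 'b'))); // int(1): needle first
var_dump(strpos('ab', 'b'));                  // int(1): haystack first

// SplQueue (PHP 5.3+, and only if SPL has not been disabled) is a
// FIFO queue that avoids the re-indexing done by array_shift().
$queue = new SplQueue();
$queue->enqueue('first');
$queue->enqueue('second');
var_dump($queue->dequeue()); // string(5) "first"
```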

ByePie

Tags: , December 29, 2007 (0 comments)

Having put a lot of thought into the matter over the past several weeks, I've made my decision to leave development of SimplePie.

"Why!? Oh Why!?", you scream (well, maybe not, but I'm not a telepathic seer). For a start, I haven't actually really used SimplePie myself since early 2006 (now almost two years ago), and I now have less and less to do with PHP at all (and I totally hate it — a recent bug in SP was caused by the fact that "0" == false — and have therefore moved to (mainly) Python).

Furthermore, over the past year, since March/April, my time has become increasingly limited, and SP has de facto been one of the things that I cut a long time ago (hence the lack of commits from me). The majority of my time is now spent on schoolwork, with what is left over being spent working on various specs (predominantly HTML 5 and Tolerant HTTP Parsing).

However, what does the future of SP hold? Well, various decisions need to be made about the future direction — do you try and improve 1.x further (it was already stretched to breaking point at 1.0, mainly held back by PHP itself — a sad state to be in), or do you start on the vision of SP2? To take the former option, I doubt you could get much further than what is currently planned for 1.2 with the current 1.x base — any further development requires a large amount of reworking the internals of SP (to the extreme of being questionable about whether there is any point of not starting from scratch). The latter option is probably the best (though ideally get 1.1 out as soon as it can be).

One of the aims of SP2 is true modularity: it should be possible to use (and load) nothing more than the parser itself (i.e., give it raw XML data, and it gives you an API to access the title, description, etc. as they are in the feed without sanitising them at all). This has several advantages for deciding any successor to myself: get people to write various modules for it against pre-existing specs (most of which are only drafts and so will need further development over time). What exactly those modules will be I am mainly undecided (though it won't, I assure you, be the more complex parts of the API itself; the design of them is mainly unwritten and comes from knowledge of successes/failures from SP1's API). I will myself continue maintaining a couple of the modules (namely, the Unicode and IRI ones, both of which I use outwith SP, though more may be added to that list).

I'm more than willing to be around in a consulting role for a while — my contact details are in the footer here, and I'll stay around in the IRC channel for a while — as well as helping people around the SP1 codebase (though I'd like to see that totally feature frozen come the end of January, with a final non-bugfix release from it in February) — which is horrifically uncommented in parts, and uses stupidly complex algorithms in others that without prior knowledge of them make no sense (I've had issues with some myself when coming back to them having not touched them in a while :) ).

Alas, there's too much to write about the vision of SP2, so that will have to be done in another post; until then, g'nite.

Resolving Relative URLs in PHP

Tags: , December 28, 2006 (0 comments)

This is deprecated, and has known bugs. See here for a replacement.

There are plenty of cases for needing to resolve relative URLs: RFC 3986 (Generic URI Syntax) has a whole section on how to go about it. SimplePie has code for this, written by me in its entirety (although based on the pseudo-code in RFC 3986), used to deal with relative URLs in feeds (which happen to be possible pretty much everywhere). As I am the sole author of it, I've rearranged it slightly into a single function (in SimplePie it's in several methods within a larger class, as most of the methods are also called in other places), and re-licensed it under the 3 clause BSD license, LGPL, and zlib/libpng license (although of course if you redistribute it you must attach the appropriate notice as stated by one of the above licenses).

Without further ado, here's the code:

<?php
function absolutize_url($relative, $base)
{
    $relative = trim($relative);
    $base = trim($base);
    if (!empty($relative))
    {
        // Split the reference into its five components (the regex from RFC 3986, appendix B)
        preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$/i', $relative, $match);
        for ($i = count($match); $i <= 9; $i++)
        {
            if (!isset($match[$i]))
            {
                $match[$i] = '';
            }
        }
        $relative = array('scheme' => $match[2], 'authority' => $match[4], 'path' => $match[5], 'query' => $match[7], 'fragment' => $match[9]);
        if (!empty($relative['scheme']))
        {
            $target = $relative;
        }
        else if (!empty($base))
        {
            preg_match('/^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$/i', $base, $match);
            for ($i = count($match); $i <= 9; $i++)
            {
                if (!isset($match[$i]))
                {
                    $match[$i] = '';
                }
            }
            $base = array('scheme' => $match[2], 'authority' => $match[4], 'path' => $match[5], 'query' => $match[7], 'fragment' => $match[9]);
            $target = array('scheme' => '', 'authority' => '', 'path' => '', 'query' => '', 'fragment' => '');
            if (!empty($relative['authority']))
            {
                $target = $relative;
                $target['scheme'] = $base['scheme'];
            }
            else
            {
                $target['scheme'] = $base['scheme'];
                $target['authority'] = $base['authority'];
                if (!empty($relative['path']))
                {
                    if (strpos($relative['path'], '/') === 0)
                    {
                        $target['path'] = $relative['path'];
                    }
                    else
                    {
                        if ($base['path'] == '/' || empty($base['path']))
                        {
                            $target['path'] = '/' . $relative['path'];
                        }
                        else
                        {
                            // Drop the last segment of the base path, then append the relative path
                            $target['path'] = preg_replace('/^(.*)((\/)([^\/]*))?$/sU', '\\1', $base['path']) . '/' . $relative['path'];
                        }
                    }
                    if (!empty($relative['query']))
                    {
                        $target['query'] = $relative['query'];
                    }
                    $input = $target['path'];
                    // Start with an empty output buffer, as RFC 3986 section 5.2.4 does;
                    // otherwise step E appends segments to the unprocessed path.
                    $target['path'] = '';
                    while (!empty($input))
                    {
                        // A: If the input buffer begins with a prefix of "../" or "./", then remove that prefix from the input buffer; otherwise,
                        if (strpos($input, '../') === 0)
                        {
                            $input = substr($input, 3);
                        }
                        else if (strpos($input, './') === 0)
                        {
                            $input = substr($input, 2);
                        }
                        // B: if the input buffer begins with a prefix of "/./" or "/.", where "." is a complete path segment, then replace that prefix with "/" in the input buffer; otherwise,
                        else if (strpos($input, '/./') === 0)
                        {
                            $input = substr_replace($input, '/', 0, 3);
                        }
                        else if ($input == '/.')
                        {
                            $input = '/';
                        }
                        // C: if the input buffer begins with a prefix of "/../" or "/..", where ".." is a complete path segment, then replace that prefix with "/" in the input buffer and remove the last segment and its preceding "/" (if any) from the output buffer; otherwise,
                        else if (strpos($input, '/../') === 0)
                        {
                            $input = substr_replace($input, '/', 0, 4);
                            $target['path'] = preg_replace('/(\/)?([^\/]+)$/U', '', $target['path']);
                        }
                        else if ($input == '/..')
                        {
                            $input = '/';
                            $target['path'] = preg_replace('/(\/)?([^\/]+)$/U', '', $target['path']);
                        }
                        // D: if the input buffer consists only of "." or "..", then remove that from the input buffer; otherwise,
                        else if ($input == '.' || $input == '..')
                        {
                            $input = '';
                        }
                        // E: move the first path segment in the input buffer to the end of the output buffer, including the initial "/" character (if any) and any subsequent characters up to, but not including, the next "/" character or the end of the input buffer
                        else
                        {
                            if (preg_match('/^([^\/]+|(\/)[^\/]*)(\/|$)/', $input, $match))
                            {
                                $target['path'] .= $match[1];
                                $input = substr_replace($input, '', 0, strlen($match[1]));
                            }
                            else
                            {
                                // We've ended up in a recursive loop, so do what we otherwise never will: return false.
                                return false;
                            }
                        }
                    }
                }
                else
                {
                    if (!empty($base['path']))
                    {
                        $target['path'] = $base['path'];
                    }
                    else
                    {
                        $target['path'] = '/';
                    }
                    if (!empty($relative['query']))
                    {
                        $target['query'] = $relative['query'];
                    }
                    else if (!empty($base['query']))
                    {
                        $target['query'] = $base['query'];
                    }
                }
            }
            if (!empty($relative['fragment']))
            {
                $target['fragment'] = $relative['fragment'];
            }
        }
        else
        {
            // No base URL, just return the relative URL
            $target = $relative;
        }
        // Recompose the components into a single URL string
        $return = '';
        if (!empty($target['scheme']))
        {
            $return .= "$target[scheme]:";
        }
        if (!empty($target['authority']))
        {
            $return .= "//$target[authority]";
        }
        if (!empty($target['path']))
        {
            $return .= $target['path'];
        }
        if (!empty($target['query']))
        {
            $return .= "?$target[query]";
        }
        if (!empty($target['fragment']))
        {
            $return .= "#$target[fragment]";
        }
    }
    else
    {
        $return = $base;
    }
    return $return;
}
?>
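
The dot-segment loop (steps A to E in the comments) is the part that is easiest to get wrong, so here it is in isolation. This is only a rough sketch of RFC 3986 section 5.2.4 using plain string functions rather than the regular expressions above; remove_dot_segments() is a name borrowed from the RFC, not a function that exists in SimplePie or PHP.

```php
<?php
// Illustrative helper (not part of SimplePie): RFC 3986 section 5.2.4.
function remove_dot_segments($path)
{
    $output = '';
    while ($path !== '')
    {
        if (strpos($path, '../') === 0)            // A: drop a leading "../"
        {
            $path = substr($path, 3);
        }
        elseif (strpos($path, './') === 0)         // A: drop a leading "./"
        {
            $path = substr($path, 2);
        }
        elseif (strpos($path, '/./') === 0)        // B: "/./" becomes "/"
        {
            $path = substr($path, 2);
        }
        elseif ($path === '/.')                    // B: "/." becomes "/"
        {
            $path = '/';
        }
        elseif (strpos($path, '/../') === 0 || $path === '/..')
        {
            // C: replace the prefix with "/" and pop the last output segment
            $path = '/' . (string) substr($path, 4);
            $output = (string) substr($output, 0, (int) strrpos($output, '/'));
        }
        elseif ($path === '.' || $path === '..')   // D: a lone "." or ".."
        {
            $path = '';
        }
        else
        {
            // E: move the first segment (with its leading "/") to the output
            $next = strpos($path, '/', 1);
            $segment = ($next === false) ? $path : substr($path, 0, $next);
            $output .= $segment;
            $path = (string) substr($path, strlen($segment));
        }
    }
    return $output;
}

// The two examples given in the RFC itself:
echo remove_dot_segments('/a/b/c/./../../g'), "\n";   // /a/g
echo remove_dot_segments('mid/content=5/../6'), "\n"; // mid/6
```

The (string) casts are there because substr() returns false rather than an empty string in some cases on PHP versions before 8.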

RFC3339 in PHP

Tags: , March 9, 2006 (5 comments)

This is deprecated, and has known bugs. See here for a replacement.

Having searched around for any function to parse RFC3339 dates (used in Atom) in PHP, and failing to find any decent one, I wrote my own. In short, all it does is rearrange the date to a format strtotime() understands.

<?php
function parse_date($date)
{
    // UTC timestamps ending in "Z"
    if (preg_match('/([0-9]{2,4})-([0-9][0-9])-([0-9][0-9])T([0-9][0-9]):([0-9][0-9]):([0-9][0-9])(\.[0-9][0-9])?Z/i', $date, $matches))
    {
        // Round the seconds up when the fractional part is .50 or more
        if (isset($matches[7]) && substr($matches[7], 1) >= 50)
            $matches[6]++;
        return strtotime("$matches[1]-$matches[2]-$matches[3] $matches[4]:$matches[5]:$matches[6] -0000");
    }
    // Timestamps with a numeric offset such as +01:00
    else if (preg_match('/([0-9]{2,4})-([0-9][0-9])-([0-9][0-9])T([0-9][0-9]):([0-9][0-9]):([0-9][0-9])(\.[0-9][0-9])?(\+|-)([0-9][0-9]):([0-9][0-9])/i', $date, $matches))
    {
        if (isset($matches[7]) && substr($matches[7], 1) >= 50)
            $matches[6]++;
        return strtotime("$matches[1]-$matches[2]-$matches[3] $matches[4]:$matches[5]:$matches[6] $matches[8]$matches[9]$matches[10]");
    }
    else
    {
        return strtotime($date);
    }
}
?>
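
To make the "rearranging" concrete, here is a small standalone sketch of the same idea applied to a single timestamp. The sample date and the simplified pattern (no fractional seconds, numeric offset only) are assumptions for illustration, not part of the function above.

```php
<?php
// An RFC 3339 timestamp with a numeric offset (made-up sample value).
$rfc3339 = '2006-03-09T12:34:56+01:00';

// Replace the "T" with a space and drop the colon from the offset,
// giving a form that strtotime() parses.
$rearranged = preg_replace(
    '/^(\d{4}-\d\d-\d\d)T(\d\d:\d\d:\d\d)([+-])(\d\d):(\d\d)$/',
    '$1 $2 $3$4$5',
    $rfc3339
);
echo $rearranged, "\n"; // 2006-03-09 12:34:56 +0100

// Print the parsed timestamp in UTC to see the offset being applied.
$timestamp = strtotime($rearranged);
echo gmdate('Y-m-d H:i:s', $timestamp), "\n"; // 2006-03-09 11:34:56
```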

I actually wrote this for SimplePie, and it is released under the zlib/libpng license.

ROT47 with PHP

Tags: October 11, 2005 (4 comments)

Some of you may have heard of ROT13, a simple form of encryption. It takes the 26 letters of the alphabet and replaces each letter with the letter thirteen places further down the alphabet. Because thirteen is exactly half of 26, you can simply use ROT13 on the encrypted string to get it back to normal. In PHP, the function str_rot13() does this.

ROT47 takes ROT13 one step further: instead of using the 26 letters of the alphabet, it uses ASCII codes 33 through 126, making the outputted string far less easy to decrypt in your head. There are a total of 94 characters, and to encrypt a string it replaces each character with whatever character is 47 characters further on down the list, wrapping around at the end. Like ROT13, ROT47 can just be run on an encrypted string to get it back to normal. Unlike ROT13, PHP does not support it. Here's my basic script:

<?php
if (!function_exists('str_rot47'))
{
    function str_rot47($str)
    {
        // Map each of the 94 printable ASCII characters to the one 47 places along
        return strtr($str, '!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~', 'PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNO');
    }
}
?>
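
The same mapping can also be done arithmetically instead of with a lookup string. The following is only an alternative sketch to show the wrap-around; rot47_arithmetic() is a made-up name, and the strtr() version above is the one I actually use.

```php
<?php
// Hypothetical alternative: rotate within ASCII 33..126 (94 characters).
function rot47_arithmetic($str)
{
    $out = '';
    for ($i = 0, $len = strlen($str); $i < $len; $i++)
    {
        $code = ord($str[$i]);
        if ($code >= 33 && $code <= 126)
        {
            // Shift by 47 and wrap around within the 94-character range
            $code = 33 + (($code - 33 + 47) % 94);
        }
        // Anything outside the range (spaces, newlines, ...) is left alone
        $out .= chr($code);
    }
    return $out;
}

$encoded = rot47_arithmetic('Hello, World!');
echo $encoded, "\n";                   // w6==@[ (@C=5P
echo rot47_arithmetic($encoded), "\n"; // Hello, World!
```

Because 47 is exactly half of 94, applying the function twice brings every character back to where it started.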
