Older blog entries for bagder (starting at number 725)

The updated web scraping howto


Web scraping is a practice that is basically as old as the web. The desire to extract content, or to machine-generate things from what was perhaps primarily intended to be presented to a browser and to humans, pops up all the time.

When I created the first tool that would later turn into curl back in 1997, it was for the purpose of scraping. When I added more protocols beyond the initial HTTP support, that too was to extend its ability to “scrape” content for me.

I’ve not (yet!) met Michael Schrenk in person, although I’ve communicated with him back and forth over the years, and back in 2007 I got a copy of his book Webbots, Spiders and Screen Scrapers in its 1st edition. Already then I liked it enough that I posted this positive little review on the curl-and-php mailing list, saying:

this book is a rare exception and previously unmatched to my knowledge in how it covers PHP/CURL. It explains to great details on how to write web clients using PHP/CURL, what pitfalls there are, how to make your code behave well and much more.

Fast-forward to the year 2011. I was contacted by Mike and his publisher at Nostarch, and I was asked to review the book with special regard to protocol facts and curl usage. I didn’t hesitate but gladly accepted, as I already liked the first edition and believed an updated version could be useful to people.

Now, in early 2012, Mike’s efforts have turned into a finished second edition of his book. With updated contents and a couple of new chapters, it is refreshed and extended. The web has changed since 2007 and so has this book! I hope that my contributions didn’t only annoy Mike but possibly helped a little bit to make it even more accurate than the original version. If you find technical or factual errors in this edition, don’t be shy about telling me (and Mike of course)!

Syndicated 2012-03-06 18:41:57 from daniel.haxx.se

The first month of Spindly

Let me entertain you with some info and updates from the Spindly project. (Unfortunately we don’t have any logo yet so I don’t get to show it off here.)

Since I announced my intention to proceed and write the SPDY library on my own instead of waiting for libspdy to get back to life, I have worked on a number of infrastructure details.

I converted the build to use autotools and libtool to help us make it a truly portable library. I made all test cases run without memory leaks, and this required a fair number of changes to libspdy since it was clearly not written with careful memory checking in mind; there were also a lot of unnecessarily small mallocs(). Anyone who does a malloc() of 8 bytes should reconsider what they’re doing.

Since I’ve had to bugfix libspdy so much, change structs and APIs and add new functions that were missing, I decided there’s no point in trying to keep the original libspdy code or code style intact anymore, so I’ve re-indented the whole code base to a style I like better than the original.

I’ve started to write the fundamentals of a client and server demo application that is meant to use the Spindly API to implement both sides. They don’t really do much yet but the basics are in place. I’ve worked more on my idea of what the spindly API should look like. I’ve written the code for a few functions from that API and I’ve also added a few tests for them.

Most of this work has been done by me and me alone, with no particular feedback or help from others. I continue to push my changes to github without delay and I occasionally announce things on the mailing list to keep interested people up to date. Hopefully this will lead to someone else joining in sooner or later.

Progress has not been very fast, not only because I’ve had to do a lot of thinking about how the API should ideally work to be really useful, but also because I have quite a lot of commitments in other open source projects (primarily curl and libssh2) that require their share of time, not to mention that my day job of course needs proper attention.

We offer a daily snapshot of the code if you can’t use or don’t want to use git.

Upcoming

I intend to add more functions from the API document one by one, with test cases for each as I go along. In parallel I hope to get the demo client and server to run, so that the API proves to actually work properly.

I also want the demo client and server to be able to run interop tests against other implementations, and I want them to be able to speak SPDY with SSL switched off – for debugging reasons. Later on, I hope to be able to use the demo server in the curl test suite so that I can verify that the curl SPDY integration works correctly.

We need to either fix “check” (the unit test suite) to be C89 compatible or replace it with something else.

Want to help?

If you want to help, please subscribe to the mailing list, get familiar with the code base, study the API doc and see if it makes sense to you and then help me get that API turned into code…

Syndicated 2012-02-11 18:48:24 from daniel.haxx.se

Sloppily using SSL_OP_ALL

This story begins with a security flaw in OpenSSL. OpenSSL is truly a fundamental piece of software these days and I would go so far as to say that lots of our critical infrastructure today uses it and needs it. Flaws in OpenSSL literally affect entire societies, or at least risk doing so if the flaws can be exploited.

SSL/TLS is a rather old and widely used protocol with many different implementations, both client and server side. In order to enhance how OpenSSL works with older SSL implementations, or just those that have different views on how to implement things, OpenSSL provides an API call to tweak behaviors: the SSL_CTX_set_options function. In the curl project we’ve found good use of it for this purpose, and we use the generic define SSL_OP_ALL to switch on all “rather harmless” workarounds that OpenSSL offers. Rather harmless, that’s what the comment in the header file says.

Ok, enough background and dancing around the issue. The flaw that ignited my idea to write this blog post was a particular weakness in the SSL 3.0 and TLS 1.0 protocols, exploitable when speaking the protocol with a peer that can select the plain-text (see this explanation) – the problem is a generic one in the protocol, so different SSL libraries would approach it differently. Ok, so OpenSSL fixed the flaw back in the days of 0.9.6d (we’re talking May 9th 2002). As a user of a library such as OpenSSL it always feels good to see them being on top of security problems and releasing fixes. It makes you feel that you’re being looked after to some extent.

Shortly thereafter, the OpenSSL developers discovered that some broken server implementations didn’t work with the work-around they had done…

Thus, on July 30th 2002 the OpenSSL team released version 0.9.6e, which offered a way for programs to disable this particular work-around. Switching it off would of course make the protocol less secure again, but it would inter-operate better with faulty servers. How do you switch off this security measure? By calling the SSL_CTX_set_options function with the bit SSL_OP_DONT_INSERT_EMPTY_FRAGMENTS set.

Ok, so far so good. But the next step is what changed everything from fine to not so fine anymore: they then added that new bit to the SSL_OP_ALL define.

Yes. In one blow, every single application out there that uses SSL_OP_ALL suddenly started switching off this security measure as soon as it was recompiled against this version of OpenSSL. This change was made in 2002 and it is still like this today. It fixed the security problem from OpenSSL’s perspective, but by later adding the bit to the SSL_OP_ALL define, the flaw was instead transferred to affect many programs.
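
To illustrate, here is a minimal sketch (not curl’s actual code) of the kind of fix this calls for: keep the harmless workarounds but clear the dangerous bit before handing the options to OpenSSL.

#include <openssl/ssl.h>

/* Keep OpenSSL's "rather harmless" workarounds but clear the bit that
   disables the empty-fragment countermeasure, so the SSL 3.0/TLS 1.0
   CBC protection stays in effect. */
SSL_CTX *make_ctx(void)
{
  SSL_CTX *ctx = SSL_CTX_new(SSLv23_client_method());
  if(ctx) {
    long opts = SSL_OP_ALL;
#ifdef SSL_OP_DONT_INSERT_EMPTY_FRAGMENTS
    opts &= ~SSL_OP_DONT_INSERT_EMPTY_FRAGMENTS;
#endif
    SSL_CTX_set_options(ctx, opts);
  }
  return ctx;
}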

In curl’s case, we were alerted about this flaw on January 19th 2012 and it resulted in a security advisory. I did a quick search for SSL_OP_ALL on koders.com and it is obvious that there are hundreds of programs out there still using this bitmask as-is. In the curl project we enabled the SSL_OP_ALL approach for the first time in the 7.10.6 release we did in July 2003. It was wrong already at the time we started using it. It turns out we’ve been enabling this flaw for almost nine years.

In the GnuTLS camp, however, they simply stopped doing this work-around as soon as they started supporting TLS 1.1, due to the problems the work-around caused for some servers; TLS 1.1 isn’t vulnerable to the problem. OpenSSL 1.0.1 beta was released on January 3rd 2012 and is the first OpenSSL version ever released to support TLS newer than 1.0… The browsers/NSS seem to have mitigated this problem in a different way, and there’s a patch available for OpenSSL to implement the same work-around, but there’s been no feedback on how or if it will be used.

Syndicated 2012-01-27 22:10:31 from daniel.haxx.se

News in curl 7.24.0

We continue doing curl releases roughly bi-monthly. This time we strike back with a release holding a few interesting new things that I think are worth highlighting a little extra!

The most important and most depressing news about this release is the two security problems that were fixed. Never before have we released two security advisories for the same release.

Security fixes

The “curl URL sanitization vulnerability” is about how curl trusts user-provided URL strings a little too much. By providing sneakily crafted URLs with embedded url-encoded carriage returns and line feeds, users could trick curl into performing unintended actions when the POP3, SMTP or IMAP protocols were used.

The “curl SSL CBC IV vulnerability” is about how curl inadvertently disables a security measure in OpenSSL and thus weakens the security of some aspects of SSL 3.0 and TLS 1.0 connections.

Changes

We have a bunch of new changes added to curl and libcurl that some users might like:

  • curl has the ability to run a set of “extra commands” for a couple of protocols when doing a transfer – we call them “quote” operations. A while ago we introduced a way to mark commands within a series of quote commands as not being important if they fail, so that the rest of the commands get sent anyway. We mark such commands with a ‘*’-prefix. Starting now, we support that ‘*’-prefix for SFTP operations as well!
  • CURLOPT_DNS_SERVERS is a brand new option that allows programs to set which DNS server(s) libcurl should use to resolve host names (see the sketch after this list). This option only works if libcurl was built to use a resolver backend that allows it to change DNS servers. That currently means nothing else but c-ares.
  • libcurl now supports nettle for crypto functions. libcurl has long supported both OpenSSL and gcrypt backends for some of the crypto functions it offers. The gcrypt backend made perfect sense when libcurl was built to use a GnuTLS that was itself built to use gcrypt, but since GnuTLS recently changed to using nettle by default, the newly added nettle support removes the need for an extra crypto library to be linked in for some users.
  • CURLOPT_INTERFACE was modified to allow “magic prefixes” for the application to tell that the given name is an interface and not a host name, and vice versa. The previous behavior was to always test for both, which could lead to accidental (and slow) name resolves when the interface name isn’t currently present, etc.
  • Active FTP sessions with the multi interface are now done in a much more non-blocking fashion than before. Previously the multi interface would block while waiting for the server to connect back, but it no longer does. A new option called CURLOPT_ACCEPTTIMEOUT_MS was added to let programs set how long libcurl should wait for the server to connect back before giving up.
  • Coming in from the Debian packaging guys, the configure script now features a new option called --enable-versioned-symbols that does exactly what its name says: it enables versioned symbols in the output libcurl.
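
As a little illustration of the CURLOPT_DNS_SERVERS option mentioned above, here is a minimal sketch (with placeholder addresses) of how an application built against a c-ares enabled libcurl could use it:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    /* comma-separated list of DNS servers to use instead of the
       system default - requires the c-ares resolver backend */
    curl_easy_setopt(curl, CURLOPT_DNS_SERVERS, "192.168.0.1,10.0.0.1");
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}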

Syndicated 2012-01-24 21:36:22 from daniel.haxx.se

Join the SPDY library development

Back in October I posted about my intentions to work on getting curl support for SPDY to be based on libspdy. I also got in touch with Thomas, the primary author of libspdy and owner of libspdy.org.

Unfortunately, he was already ill then, and he was still ill when I communicated with him about what I wanted to see happen and posted a patch etc to him. He mentioned to me (in a private email) a lot of work they’d done on the code in a private branch, and he invited me to get access to that code to speed up development and allow me to use their code.

I never got any response to my eager “yes, please let me in!” mail, and I’ve since mailed him twice over the past months. As there have been no responses, I’ve decided to slowly ramp up my activities on my side while hoping he will soon get back.

I’ve started today by setting up the spdy-library mailing list. I hope to attract fellow interested hackers to join me on this. The goal is quite simply to make a libspdy that works for us. It is to be portable C89 code with an API that “makes sense”. I don’t know yet if we will work on libspdy as it currently looks, if Thomas’ team will push their updated work soon, or if going with my current spindly fork on github is the way forward. I hope to get help to decide this!

Join the effort by simply joining the mailing list and participating in the discussions: http://cool.haxx.se/cgi-bin/mailman/listinfo/spdy-library.

Welcome!

Syndicated 2012-01-05 23:03:20 from daniel.haxx.se

Rosetta stone

How do you figure out if a program uses curl? I get mail from its users, since the curl license is included somewhere and it contains my email address, and very often that is the only address available…

To: Daniel Stenberg <daniel@haxx...>
Subject: Rosetta Stone Question
I am trying to install Rosetta Stone on my Mac but I am having
trouble. The ReadMe says to contact the author, and this email
was in the license info. Am I to understand that you are
the author?

I don’t know exactly what Rosetta Stone is, but I guess it is the language learning software at www.rosettastone.com

Syndicated 2012-01-05 19:20:42 from daniel.haxx.se

getaddrinfo with round robin DNS and happy eyeballs

This is not news. These are only facts that seem to still be unknown to many people, so I just want to help document this and educate the world. I’ll dance around the subject first a bit by providing the full background info…

round robin basics

Round robin DNS has long been a way to get rough and cheap load-balancing, spreading visitors over multiple hosts when they try to use a single host/service with static content. By setting up an A entry in a DNS zone to resolve to multiple IP addresses, clients get different results in a semi-random manner and thus hit different servers at different times:

server     A   192.168.0.1
server     A   10.0.0.1
server     A   127.0.0.1

For example, if you’re a small open source project it is a perfect way to offer a distributed service that appears under a single name but is hosted by multiple independent servers across the Internet. It is also used by high profile web servers, like for example www.google.com and www.yahoo.com.

host name resolving

If you’re an old-school hacker, if you learned to do socket and TCP/IP programming from the original Stevens books and if you were brought up on BSD unix, you learned to resolve host names with gethostbyname() and friends. This is a POSIX and Single Unix Specification function that’s been around since basically forever. When calling gethostbyname() on a given round robin host name, the function returns an array of addresses, and that list of addresses will be in a seemingly random order. If an application just iterates over the list and connects to them in the order received, the round robin concept works perfectly well.
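
In code, the old-school pattern looks roughly like this minimal sketch (the host name is just a placeholder):

#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(void)
{
  /* resolve, then walk the semi-randomly ordered IPv4 address list */
  struct hostent *he = gethostbyname("server.example.com");
  if(he && he->h_addrtype == AF_INET) {
    char buf[INET_ADDRSTRLEN];
    char **p;
    for(p = he->h_addr_list; *p; p++)
      if(inet_ntop(AF_INET, *p, buf, sizeof(buf)))
        puts(buf); /* connect to these in list order */
  }
  return 0;
}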

but gethostbyname wasn’t good enough

gethostbyname() is really IPv4-focused. The mere whisper of IPv6 makes it break down and cry. It had to be replaced by something better, so enter getaddrinfo(), also POSIX (defined in RFC 3493, with extensions in RFC 5014). This is the modern function that supports IPv6 and more. It is the shiny thing the world needed!
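
The replacement pattern looks roughly like this minimal sketch (again with a placeholder host name), walking the returned list in whatever order the resolver hands it out:

#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>

int main(void)
{
  struct addrinfo hints, *res, *ai;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_UNSPEC;     /* both IPv4 and IPv6 */
  hints.ai_socktype = SOCK_STREAM;
  if(getaddrinfo("server.example.com", "80", &hints, &res) == 0) {
    for(ai = res; ai; ai = ai->ai_next) {
      /* try connecting to ai->ai_addr here, in list order */
    }
    freeaddrinfo(res);
  }
  return 0;
}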

not a drop-in replacement

So the (good parts of the) world replaced all calls to gethostbyname() with calls to getaddrinfo(), and everything now supported IPv6 and things were all dandy and fine? Not exactly, because there were subtleties involved. Like the order in which these functions return addresses. In 2003 the IETF had shipped RFC 3484, detailing Default Address Selection for Internet Protocol version 6, and using that as a guideline most (all?) implementations were changed to return the list of addresses in that order. It would then become a list of hosts in “preferred” order. Suddenly applications would iterate over both IPv4 and IPv6 addresses and do it in an order that would be clever from an IPv6 upgrade-path perspective.

no round robin with getaddrinfo

So, back to the good old way to do round robin DNS: multiple addresses (be it IPv4 or IPv6 or both). With the new ideas of how to return addresses, this load balancing trick no longer works. Now getaddrinfo() returns basically the same order on every invocation. I noticed this back in 2005 and posted a question on the glibc hackers mailing list: http://www.cygwin.com/ml/libc-alpha/2005-11/msg00028.html As you can see, my question was delightfully ignored and nobody ever responded. The order seems to be dictated mostly by the above mentioned RFCs and the local /etc/gai.conf file, but neither is helpful if decent round robin is your aim.

Others have noticed this flaw as well, and some have argued passionately that it is a bad thing, while of course there’s an opposite side with people claiming it is the right behavior and that doing round robin DNS like this was a bad idea to start with anyway. The impact on a large number of common utilities is simply that when they go IPv6-enabled, they at the same time go round-robin-DNS disabled.

no decent fix

Since getaddrinfo() has now worked like this for almost a decade, we can forget about “fixing” it. Since gai.conf needs local edits to provide a different function response, it is not an answer either. But perhaps worse: since getaddrinfo() is now made to return the addresses in a sort of order of preference, it is hard to “glue on” a layer on top that simply shuffles the returned results. Such a shuffle would need to take IP versions and more into account, and it would be application-specific and thus would have to be applied to one program at a time.

The popular browsers seem less affected by this getaddrinfo drawback. My guess is that because they’ve already worked on making name resolves asynchronous so that resolving doesn’t lock up their processes, they have taken different approaches and thus have their own code for this. In curl’s case, it can be built with c-ares as a resolver backend even when supporting IPv6, and c-ares does not offer the sorting feature of getaddrinfo, so in these cases curl works with round robin DNSes much more like it did when it used gethostbyname.

alternatives

The downside with all the alternatives I’m aware of is that they aren’t just taking advantage of plain DNS. In order to duck the problems I’ve mentioned, you can instead tweak your DNS server to respond differently to different users. That way you can either just respond with different addresses in a round robin fashion, or you can try to make it more clever with things such as PowerDNS’s geobackend feature. Of course we all know that A) geoip is crude and often wrong and B) your real-world geography does not match your network topology.

happy eyeballs

During this period, another connection-related issue has surfaced: IPv6 connections are often handled as a second option on dual-stacked machines, and IPv6 is mostly present in dual stacks these days. This sadly punishes early adopters of IPv6 (yes, unfortunately they must still be considered early), since those services will then be slower than the older IPv4-only ones.

There seems to be a general consensus on the way to overcome this problem: the Happy Eyeballs approach. In short (and simplified), it recommends that we try both (or all) options at once, and the fastest to respond wins and gets used. This requires that we resolve A and AAAA names at once, and if we get responses to both, we connect() to both the IPv4 and IPv6 addresses and see which one is the fastest to complete.

This of course is no longer just a matter of replacing a function or two. To implement this approach you need to do something completely new; just doing getaddrinfo() plus looping over the addresses with connect() won’t work at all. You would basically either start two threads and do the IPv4-only route in one and the IPv6 route in the other, or you would have to issue non-blocking resolver calls to do the A and AAAA resolves in parallel in the same thread, and when the first response arrives you fire off a non-blocking connect() …
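
To make the “race the connects” half of this concrete, here is a minimal sketch, assuming you already hold one resolved IPv6 and one resolved IPv4 address, with error handling trimmed to the bare minimum:

#include <sys/select.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>

/* start two non-blocking connects and keep whichever socket becomes
   writable (i.e. connected) first; close the loser */
int race_connect(const struct sockaddr *a6, socklen_t l6,
                 const struct sockaddr *a4, socklen_t l4)
{
  int s6 = socket(AF_INET6, SOCK_STREAM, 0);
  int s4 = socket(AF_INET, SOCK_STREAM, 0);
  int maxfd = (s6 > s4) ? s6 : s4;
  int err = 0;
  socklen_t elen = sizeof(err);
  fd_set wfds;

  /* simplified; real code should preserve the existing fd flags */
  fcntl(s6, F_SETFL, O_NONBLOCK);
  fcntl(s4, F_SETFL, O_NONBLOCK);
  connect(s6, a6, l6);   /* both typically return -1/EINPROGRESS */
  connect(s4, a4, l4);

  FD_ZERO(&wfds);
  FD_SET(s6, &wfds);
  FD_SET(s4, &wfds);
  if(select(maxfd + 1, NULL, &wfds, NULL, NULL) > 0) {
    int winner = FD_ISSET(s6, &wfds) ? s6 : s4;
    int loser = (winner == s6) ? s4 : s6;
    getsockopt(winner, SOL_SOCKET, SO_ERROR, &err, &elen);
    if(!err) {
      close(loser);
      return winner;   /* first successfully connected socket */
    }
  }
  close(s6);
  close(s4);
  return -1;
}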

My point being that introducing Happy Eyeballs in your good old socket app will require some rather major remodeling no matter what. Doing this will most likely also affect how your application handles round robin DNS, so now you have a chance to reconsider your choices and code!

Syndicated 2012-01-03 22:15:21 from daniel.haxx.se

Top-3 curl bugs in 2011

This is a continuation of my little top-3 things in curl during 2011 which started with the top-3 changes 2011.

The changelog on the curl site lists 150 bugs fixed in the seven releases of the year. The most important fixes in my view were…

Bug-fix 1: handle HTTP redirects to //hostname/path

Following redirects is one of the fundamentals of HTTP user agents, and one of the primary things people use curl and libcurl for is to mimic browsers and do automatic stuff on the web. Therefore it was even more embarrassing to realize that libcurl didn’t properly support the kind of relative redirect where the Location: header excludes the protocol but includes the host name. It means that the protocol shall remain (in reality that means HTTP or HTTPS) but the client should move over to the new host and path. All browsers have supported this since ages ago. Since November 15th 2011, libcurl does too!
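
As a hypothetical illustration, a response like this to a request made over HTTPS should send the client on to https://other.example.com/new/path, keeping the https scheme:

HTTP/1.1 302 Found
Location: //other.example.com/new/path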

Bug-fix 2: inappropriate GSSAPI delegation

We had one security vulnerability announced in 2011 and this was it. I won’t try to blame someone else for this mistake, but there are some corners of curl and libcurl I’m not personally very familiar with and I would say the GSS stuff is one of those. In fact, even the actual GSS and GSSAPI technologies are murky areas as far as my knowledge reaches, so I was not at all aware of this feature or that we even made use of it… Of course it also turns out that there’s a certain amount of existing applications that need it, so we now have that ability in the library again, if enabled by an option.

Bug-fix 3: multi interface, connect fail continue to next IP

One of those silly bugs nobody would expect us to have at this point. It turned out the code for the multi interface didn’t properly move on to try the next IP in case a connect() failed and the host name had resolved to a number of addresses to try. A long term goal of mine is to remodel the internals of libcurl to always use the multi interface code and just wrap that interface with some glue logic to offer the easy interface. For that to work (and for lots of other reasons of course), the multi interface simply must handle all of these things.

Additionally, this is another of those things that are hard to test for in the test suite, as it would involve trickery at the IP or TCP level and that’s not easy to accomplish in a portable manner.

Syndicated 2011-12-31 20:51:14 from daniel.haxx.se

Top-3 curl changes in 2011

At the end of the year I thought it would be interesting to look back over the past twelve months and see what the biggest changes in the curl project were, what the most important bug-fixes were, etc. I’ve turned it into a little mini-series of blog posts: the top-3s. First out, the top-3 changes.

Top-3 changes

In total I counted 29 notable changes brought during 2011. The most significant ones in my view are:

Change 1 – fancier protocol support in proxy strings

This may sound trivial, and the code certainly is, but with this change suddenly a lot of applications that use libcurl got better proxy support without having to do anything at all. Previously the application would have to tell libcurl what protocol type the proxy uses, even though libcurl has supported having the proxy specified as an environment variable for ages.

Having only the proxy name was useful but limiting. With this new change you can specify the proxy type, or rather proxy protocol, by prefixing the proxy name, like “socks4://magic-proxy.example.com” if you want a SOCKS4 proxy. libcurl now supports socks4, socks4a, socks5 and socks5h as prefixes, as well as http.
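
In application code this boils down to nothing more than the prefix on the proxy string – a minimal sketch with a placeholder proxy name:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    /* the socks4:// prefix selects the proxy protocol */
    curl_easy_setopt(curl, CURLOPT_PROXY, "socks4://magic-proxy.example.com");
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
  }
  return 0;
}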

Change 2 – allow sending “empty” HTTP headers

Another minor change in code but with possibly larger impact in usefulness for applications. We introduced a way for applications to change internal headers and to add new ones ages ago. We (or rather I) then made the choice that if you provide a header with only the name and a colon, that is with no contents on the right side, it deletes the internal header. That was not a clever move, as later on people have wanted to add “blank” or “empty” headers that look exactly like that, but libcurl has refused to send them.

There have been some more or less hackish work-arounds to trick libcurl into allowing an empty header, but now we finally introduce a nice and clean way for applications to pass in these kinds of empty headers:

Pass in an empty header that has a semicolon instead of a colon! This is an otherwise illegal header that wouldn’t make sense, but libcurl uses it as a trigger that an empty header should be sent; it then replaces the semicolon with a colon and things will be fine.
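
For example, a minimal sketch with a made-up header name – “X-Custom;” in the list makes libcurl send “X-Custom:” with an empty value:

#include <curl/curl.h>

int main(void)
{
  CURL *curl = curl_easy_init();
  if(curl) {
    /* the trailing semicolon triggers the empty-header behavior */
    struct curl_slist *headers = curl_slist_append(NULL, "X-Custom;");
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
    curl_easy_perform(curl);
    curl_slist_free_all(headers);
    curl_easy_cleanup(curl);
  }
  return 0;
}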

Change 3 – Added support for cyassl and axTLS

Proving that libcurl moves forward into more and more markets, the number of supported SSL libraries grew to 7 this year. The all-new cyassl backend replaced the previous yassl backend, which was using cyassl’s former OpenSSL emulation layer; now we’re using the native and pure API and things are much cleaner. The possibly smallest available TLS library, axTLS, also got support.

Not all backends and not all SSL libraries are the same or support the same set of features, but then libcurl is used in many different scenarios and use cases, and this way we offer more options to more users to craft libcurl for their particular needs. Our internal SSL backend API has held up quite well and has proved to be a worthy change. Adding support for yet another SSL library within libcurl is actually not a lot of work.

Change 4 – TLS-SRP

You may think number 4 on a top-3 list is weird, but I couldn’t cut it off here! =) TLS-SRP has been waiting in the shadows for so long, and all of a sudden two of the major SSL libraries have support for it in released versions, and libcurl gained support for using these features in both libraries during 2011.


Syndicated 2011-12-30 20:27:34 from daniel.haxx.se

What I got from Google

Recall my present from Google. Here’s what I spent the gift codes on (they were only valid in google-store.com):

A red fleece for me, and then “softshell” jackets for my whole family.

GO11018B-2, GO22003, GO21003, GO20003, GO20003

I got the package today. The stuff matched my expectations in appearance and feel and my kids just loved these “extra Christmas gifts”.

Syndicated 2011-12-27 18:31:33 from daniel.haxx.se
