STL strings vs C strings for parsing
I'm working on a project where I need to build custom high performance HTTP server. One piece of this server is a parser for URLs in incoming requests. It is very simple and on the first glance it shouldn't be that slow compared with other parts of the server. Yet it was taking quite a lot of CPU according to the profiler. The parser is using STL and basically does several string::find() calls to find parts of URL. So I thought maybe string::find() is too slow and decided to benchmark it against strchr(). This is my benchmark code:
#include <string.h>
#include <string>
#include <time.h>
#include <iostream>
using std::string;
using std::cout;
int main() {
const char* str1 = " a ";
const string& str2 = str1;
const unsigned long iterations = 500000000l;
{
clock_t start = clock();
for (unsigned long i = 0; i < iterations; ++i) {
char* pos = strchr(str1, 'a');
}
clock_t end = clock();
double totalTime = ((double) (end - start)) / CLOCKS_PER_SEC;
double iterTime = totalTime / iterations;
double rate = 1 / iterTime;
cout << "Total time: " << totalTime << " sec\n";
cout << "Iterations: " << iterations << " it\n";
cout << "Time per iteration: " << iterTime * 1000 << " msec\n";
cout << "Rate: " << rate << " it/sec\n";
}
{
clock_t start = clock();
for (unsigned long i = 0; i < iterations; ++i) {
string::size_type pos = str2.find('a');
}
clock_t end = clock();
double totalTime = ((double) (end - start)) / CLOCKS_PER_SEC;
double iterTime = totalTime / iterations;
double rate = 1 / iterTime;
cout << "Total time: " << totalTime << " sec\n";
cout << "Iterations: " << iterations << " it\n";
cout << "Time per iteration: " << iterTime * 1000 << " msec\n";
cout << "Rate: " << rate << " it/sec\n";
}
}
ilya@denmark:~$ g++ -O3 test.cc && ./a.out
Total time: 0 sec
Iterations: 500000000 it
Time per iteration: 0 msec
Rate: inf it/sec
Total time: 15.5 sec
Iterations: 500000000 it
Time per iteration: 3.1e-05 msec
Rate: 3.22581e+07 it/sec
ilya@denmark:~$ g++ -O2 test.cc && ./a.out
Total time: 0 sec
Iterations: 500000000 it
Time per iteration: 0 msec
Rate: inf it/sec
Total time: 15.76 sec
Iterations: 500000000 it
Time per iteration: 3.152e-05 msec
Rate: 3.17259e+07 it/sec
ilya@denmark:~$ g++ -O1 test.cc && ./a.out
Total time: 0 sec
Iterations: 500000000 it
Time per iteration: 0 msec
Rate: inf it/sec
Total time: 19.23 sec
Iterations: 500000000 it
Time per iteration: 3.846e-05 msec
Rate: 2.6001e+07 it/sec
ilya@denmark:~$ g++ -O0 test.cc && ./a.out
Total time: 18.64 sec
Iterations: 500000000 it
Time per iteration: 3.728e-05 msec
Rate: 2.6824e+07 it/sec
Total time: 16.89 sec
Iterations: 500000000 it
Time per iteration: 3.378e-05 msec
Rate: 2.96033e+07 it/sec
Beyound XSS and SQL injections
What is common about HTML, XML and CSV files, SQL and LDAP queries, filenames and shell commands? All these things are based on text which is often generated by programs. And one commonly observed flaw in such programs is encoding rules are not being followed. These days many developers are aware about SQL injection and XSS problems as many books, online tutorials, blogs, coding standards, etc speak about them. Yet I'm not sure there is enough education so that developers use correct methods to protect their code from these problems. And besides this there is a lack of awareness that it is not just SQL and HTML. Definitely developers should think more broadly: if you generate programmatically any kind of text you must think about proper encoding of all data used in the generated text.The most common mistake committed by developers (and many security experts, I might add) is to treat XSS as an input validation problem. Therefore, I frequently come across situations where developers fix XSS problems by attempting to filter out meta-characters (<, >, /, “, ‘, etc). At times, if an exhaustive list of meta-characters is used, it does solve the problem, but it makes the application less friendly to the end user – a large set of characters are deemed forbidden. The correct approach to solving XSS problems is to ensure that every user supplied parameter is HTML Output EncodedA good example of wrong approach is PHP's invention called magic quotes. I have mixed feelings about this thing. On one hand it was probably a good thing because so many web based software is developed by dilettantes so overall we are living in a slightly better world as magic quotes do somewhat limit damage from bad code. On the other hand it teaches bad habits while not fixing all problems in bad code. Also it causes everybody else to suffer. Good news is that they are getting rid of this abomination in PHP6.
print join ",", @columnsWhat if one of columns contains say "," (comma)?
sprintf(cmd, "cp %s %s", orig_filename, dest_filename);Guess what happens if any of these filenames were not escaped for characters which are special for shell?
system(cmd);
Perl as replacement for shell scripting (Part I)
By shell scripting I mean bash as it is what most (all?) Linux distributions use. Bash can be used as a quite capable programming language. Bash allows programmer to build rather complex scripts by using other programs as building blocks. System comes with a number of such building blocks: find, grep, sed, awk and many others and unsurprisingly there is a lot you can do with them. But it is often a challenge to write robust shell scripts which work or at least fail gracefully for any kind of input. The main reason is that historically shell scripts could use one only data type - string*. Those building blocks, external programs you use in shell scripts have very restricted interface: there are program arguments which are strings, stream of strings as input, stream of strings as output and exit code.rm `ls`This would delete all files in the current directory .. unless they have whitespace characters in their names. There are many similar cases where an unwary programmer can make a mistake in his(her) shell script. Passing data from one process to another often requires a lot of care and the simplest code is often wrong. Another problem is that you are very limited in how you can handle errors in shell scripts - you only have process's exit code to tell you if it finished successfully. And usually it is just a boolean value saying you if there was any error or not. Quote from the linked document:
However, many scripts use an exit 1 as a general bailout upon error. Since exit code 1 signifies so many possible errors, this probably would not be helpful in debugging.If say mkdir fails your script cannot easily tell if it is because another directory with the same name already exists or you just don't have permissions for this operation.
23 Aug 2007 (updated 24 Aug 2007 at 10:21 UTC) »
libxml++ vs xerces C++

Xerces-C++ makes it easy to give your application the ability to read and write XML data.Sounds good, just what I need. So I dig documentation and find it to be completely unhelpful as it is just Doxygen autogenerated undocumentation. Fine, I can read code, let's check sample code then. I open sample code and I find that the shortest example how to parse XML into DOM tree and how to access data in the tree (DOMCount) consists of two files which are more then 600 lines long in total. Huh? I don't want to read 15 pages of code just to learn how to do two simple actions: parse XML into DOM and get data from DOM. Other examples are even more bad. Several files, several classes just to read and print freaking XML (DOMPrint). You've got to be kidding me. It cannot be that hard.
Syndicated 2007-08-23 20:54:00 (Updated 2007-08-24 10:00:03) from Ilya Martynov
23 Aug 2007 (updated 23 Aug 2007 at 21:10 UTC) »
4 silly mistakes in use of MySQL indexes

CREATE TABLE table1 (Index on col1 is redundant as any search on col1 can use primary index. This just wastes disk space and might make some queries which change this table a bit slower.
col1 INT,
col2 INT,
PRIMARY (col1, col2),
KEY (col1)
);
MySQL cannot use an index if the columns do not form a leftmost prefix of the index.Example:
CREATE TABLE table2 (MySQL wont use any indexes for query like
id INT PRIMARY,
col1 INT,
col2 INT,
col3 INT,
KEY (col1, col2)
);
SELECT * FROM table2 WHERE col2=123EXPLAIN SELECT shows this instantly. If you want to run this query faster either change order of columns in the index or add another one.
CREATE TABLE table3 (Query like
id INT PRIMARY,
col1 INT,
col2 INT,
col3 INT,
KEY (col1)
);
SELECT * FROM table2 WHERE col1=123 AND col2=456would use the index on col1 to reduce number of rows to check but MySQL can do much better if you add multicolumn index which covers both col1 and col2. The effect of adding such index is very easy to see with EXPLAIN SELECT.
Syndicated 2007-08-16 12:22:00 (Updated 2007-08-23 21:02:00) from Ilya Martynov
volatile and threading

bool gEvent = false;Let's assume that this is a single threaded program and the external condition we are waiting for is a Unix signal. The signal handler is very simple - it simply sets gEvent to true:
void waitLoop() {
while (!gEvent) {
sleep(1);
}
...
}
void wakeUp() {
gEvent = true;
}The problem with the code above is that compiler would optimize out check of the condition inside waitLoop() incorrectly assuming from local analysis of the code that gEvent never changes. The fix is to declare gEvent with volatile modifier which basically tells compiler that the variable can be changed at any time and that is unsafe to perform any optimization based on the analysis of local code:volatile bool gEvent = false;Let's take another example. The code is same but this time it is a mutli-threaded program where one thread waits for another. So waitLoop() runs inside one thread and wakeUp() eventually called from another. Is the code still correct? Probably yes if we keep volatile flag and if operations which read or write gEvent variable can be considered as atomic. The later assumptions seems to be correct for most (all?) platforms.
struct EventInfo {
EventInfo(bool happened = false, const string& source = "")
: fHappened(happened), fSource(source)
{}
bool fHappened;
string fSource;
}
volatile EventInfo gEventInfo;
void waitLoop() {
while (!fEventInfo.fHappened) {
sleep(1);
}
const string& eventSource = fEventInfo.fSource;
...
}
void wakeUp() {
gEventInfo = EventInfo(true, "wakeUp");
}This code is still ok for single-threaded program where wakeUp() is a signal handler but is unsafe for multi-threaded program where wakeUp() runs in a separate thread as operations on gEventInfo cannot be treated as atomic anymore.boost::mutex gMutex;Comparing this code with earlier examples it looks like we still need to declare gEventInfo variable as volatile but it turns out we don't really need to. Quote from Thread Cannot be Implemented as a Library [PDF]:
void waitLoop() {
string eventSource;
for (bool eventHappened = false; !eventHappened; ) {
{
boost::mutex::scoped_lock lock(gMutex);
eventHappened = fEventInfo.fHappened;
eventSource = fEventInfo.fSource;
}
sleep(1);
}
...
}
void wakeUp() {
boost::mutex::scoped_lock lock(gMutex);
gEventInfo = EventInfo(true, "wakeUp");
}
In practice, C and C++ implementations that supportSo at least if you using POSIX threads (boost::threads under Linux uses them) your code is probably safe without use of volatile as long as you use locks around global variables shared between several threads. Good question whenever this example code is portable to other platforms; after all boost::threads supports threading libraries other then POSIX which may have other rules for mutexes and locks. I haven't researched this yet as for now I don't really care about other platforms.
Pthreads generally proceed as follows:
- Functions such as pthread_mutex_lock() that are guaranteed by the standard to “synchronize memory” include hardware instructions (“memory barriers”) that prevent hardware reordering of memory operations around the call.
- To prevent the compiler from moving memory operations around calls to functions such as pthread_mutex_lock(), they are essentially treated as calls to opaque functions, about which the compiler has no information. The compiler effectively assumes that pthread_mutex_lock() may read or write any global variable. Thus a memory reference cannot simply be moved across the call. This approach also ensures that transitive calls, e.g. a call to a function f() which then calls pthread_mutex_lock(), are handled in the same way more or less appropriately, i.e. memory operations are not moved across the call to f() either, whether or not the entire user program is being analyzed at once.
Syndicated 2007-07-31 14:58:00 (Updated 2007-08-12 00:44:39) from Ilya Martynov
boost::thread and boost::mutex tutorial
For most Boost's libraries its documentation requires you to read everything from the start to the end before you can write any code. Compare that with most of CPAN modules where usually you can start using CPAN module after quickly scanning synopsis and maybe description parts of it's POD documentation. POD documentation as a rule has good examples right on top of the page. Boost's documentation usually doesn't.Syndicated 2007-07-25 16:35:00 (Updated 2007-08-02 20:49:34) from Ilya Martynov
Starting a new blog
I used to have a blog at use.perl.org but I just was too lazy to write often. One problem it seems you need a certain discipline to keep doing that. Also I just didn't like this site blogging engine. It just looked too simple and offered little control.Syndicated 2007-07-18 11:46:00 (Updated 2007-07-18 12:19:25) from Ilya Martynov
FOAF updates: Trust rankings are now exported, making the data available to other users and websites. An external FOAF URI has been added, allowing users to link to an additional FOAF file.
Keep up with the latest Advogato features by reading the Advogato status blog.
If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!