IlyaM is currently certified at Journeyer level.

Name: Ilya Martynov
Member since: 2003-02-27 11:31:21
Last Login: 2007-10-31 13:09:36

Homepage: http://martynov.org/

Recent blog entries by IlyaM

6 Dec 2007 »

STL strings vs C strings for parsing

I'm working on a project where I need to build a custom high-performance HTTP server. One piece of this server is a parser for URLs in incoming requests. It is very simple, and at first glance it shouldn't be slow compared with other parts of the server. Yet according to the profiler it was taking quite a lot of CPU. The parser uses the STL and basically makes several string::find() calls to find the parts of a URL. So I thought maybe string::find() is too slow and decided to benchmark it against strchr(). This is my benchmark code:

#include <string.h>
#include <time.h>
#include <iostream>
#include <string>

using std::string;
using std::cout;

int main() {
    const char* str1 = " a ";
    const string str2 = str1;

    const unsigned long iterations = 500000000UL;

    // Benchmark 1: strchr() on a C string.
    {
        clock_t start = clock();

        for (unsigned long i = 0; i < iterations; ++i) {
            // The result is never used, so an optimizing compiler is free
            // to remove this call.
            const char* pos = strchr(str1, 'a');
            (void) pos;
        }

        clock_t end = clock();
        double totalTime = ((double) (end - start)) / CLOCKS_PER_SEC;
        double iterTime = totalTime / iterations;
        double rate = 1 / iterTime;

        cout << "Total time: " << totalTime << " sec\n";
        cout << "Iterations: " << iterations << " it\n";
        cout << "Time per iteration: " << iterTime * 1000 << " msec\n";
        cout << "Rate: " << rate << " it/sec\n";
    }

    // Benchmark 2: string::find() on an STL string with the same contents.
    {
        clock_t start = clock();

        for (unsigned long i = 0; i < iterations; ++i) {
            string::size_type pos = str2.find('a');
            (void) pos;
        }

        clock_t end = clock();
        double totalTime = ((double) (end - start)) / CLOCKS_PER_SEC;
        double iterTime = totalTime / iterations;
        double rate = 1 / iterTime;

        cout << "Total time: " << totalTime << " sec\n";
        cout << "Iterations: " << iterations << " it\n";
        cout << "Time per iteration: " << iterTime * 1000 << " msec\n";
        cout << "Rate: " << rate << " it/sec\n";
    }
}

Turns out strchr is much faster as long as the benchmark code is compiled with optimizations on:

ilya@denmark:~$ g++ -O3 test.cc && ./a.out
Total time: 0 sec
Iterations: 500000000 it
Time per iteration: 0 msec
Rate: inf it/sec
Total time: 15.5 sec
Iterations: 500000000 it
Time per iteration: 3.1e-05 msec
Rate: 3.22581e+07 it/sec

ilya@denmark:~$ g++ -O2 test.cc && ./a.out
Total time: 0 sec
Iterations: 500000000 it
Time per iteration: 0 msec
Rate: inf it/sec
Total time: 15.76 sec
Iterations: 500000000 it
Time per iteration: 3.152e-05 msec
Rate: 3.17259e+07 it/sec

ilya@denmark:~$ g++ -O1 test.cc && ./a.out
Total time: 0 sec
Iterations: 500000000 it
Time per iteration: 0 msec
Rate: inf it/sec
Total time: 19.23 sec
Iterations: 500000000 it
Time per iteration: 3.846e-05 msec
Rate: 2.6001e+07 it/sec

ilya@denmark:~$ g++ -O0 test.cc && ./a.out
Total time: 18.64 sec
Iterations: 500000000 it
Time per iteration: 3.728e-05 msec
Rate: 2.6824e+07 it/sec
Total time: 16.89 sec
Iterations: 500000000 it
Time per iteration: 3.378e-05 msec
Rate: 2.96033e+07 it/sec

I checked the same code with callgrind, and from the call graph it looks like the strchr() call was inlined while string::find() wasn't. That could be the reason for the difference in performance. Maybe the compiler is even smarter and optimized the whole strchr() loop away: the result of the call is never used, and a total time of exactly 0 sec at -O1 and above points that way. So I'm not sure the benchmark is completely fair. Anyway, one thing is certain: I should try to rewrite my URL parser using strchr() and see if the real code is faster.
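
One way to make the comparison fairer is to force the compiler to keep both loops. The following is only a sketch under my own assumptions, not the code that produced the numbers above: the helper names (findWithStrchr, findWithString, timeStrchr, timeStringFind) and the volatile sink are made up for illustration, and it measures wall-clock time with std::chrono instead of CPU time with clock().

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>
#include <string>

// Results end up here. A volatile write is an observable side effect, so the
// searches that feed it cannot be dropped as dead code.
static volatile std::size_t g_sink = 0;

// Offset of the first `c` in `s`, or the length of `s` if `c` is absent.
inline std::size_t findWithStrchr(const char* s, char c) {
    const char* pos = std::strchr(s, c);
    return pos ? static_cast<std::size_t>(pos - s) : std::strlen(s);
}

// Same contract, implemented with std::string::find().
inline std::size_t findWithString(const std::string& s, char c) {
    const std::string::size_type pos = s.find(c);
    return pos == std::string::npos ? s.size() : pos;
}

// Times `iterations` strchr() searches and returns the elapsed seconds.
double timeStrchr(const char* text, char c, unsigned long iterations) {
    // The pointer is re-read on every iteration, so the search cannot be
    // hoisted out of the loop and computed only once.
    const char* volatile input = text;
    std::size_t total = 0;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned long i = 0; i < iterations; ++i) {
        total += findWithStrchr(input, c);
    }
    const auto end = std::chrono::steady_clock::now();
    g_sink = total;
    return std::chrono::duration<double>(end - start).count();
}

// Times `iterations` string::find() searches and returns the elapsed seconds.
double timeStringFind(const std::string& text, char c, unsigned long iterations) {
    const std::string* volatile input = &text;
    std::size_t total = 0;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned long i = 0; i < iterations; ++i) {
        total += findWithString(*input, c);
    }
    const auto end = std::chrono::steady_clock::now();
    g_sink = total;
    return std::chrono::duration<double>(end - start).count();
}
```

Calling timeStrchr(" a ", 'a', 500000000UL) and timeStringFind(" a ", 'a', 500000000UL) from main() and printing the two results gives numbers that at least come from loops which were really executed.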

Syndicated 2007-12-06 13:06:00 from Ilya Martynov's blog

21 Sep 2007 »

Beyond XSS and SQL injections

What do HTML, XML and CSV files, SQL and LDAP queries, filenames and shell commands have in common? All these things are based on text which is often generated by programs. And one commonly observed flaw in such programs is that encoding rules are not followed. These days many developers are aware of SQL injection and XSS problems, as many books, online tutorials, blogs, coding standards, etc. speak about them. Yet I'm not sure there is enough education for developers to use the correct methods to protect their code from these problems. And besides this, there is a lack of awareness that it is not just SQL and HTML. Developers should definitely think more broadly: if you programmatically generate any kind of text, you must think about proper encoding of all data used in the generated text.

Speaking of correct methods to secure code against text-encoding problems, one of my pet peeves is when people try to strip input data when they really should be thinking about protecting output. Nitesh Dhanjani covers this really well in his blog post "Repeat After Me: Lack of Output Encoding Causes XSS Vulnerabilities". Quote:
The most common mistake committed by developers (and many security experts, I might add) is to treat XSS as an input validation problem. Therefore, I frequently come across situations where developers fix XSS problems by attempting to filter out meta-characters (<, >, /, “, ‘, etc). At times, if an exhaustive list of meta-characters is used, it does solve the problem, but it makes the application less friendly to the end user – a large set of characters are deemed forbidden. The correct approach to solving XSS problems is to ensure that every user supplied parameter is HTML Output Encoded
A good example of the wrong approach is PHP's invention called magic quotes. I have mixed feelings about this thing. On one hand it was probably a good thing: so much web-based software is developed by dilettantes that overall we are living in a slightly better world, as magic quotes do somewhat limit the damage from bad code. On the other hand it teaches bad habits while not fixing all the problems in bad code. It also makes everybody else suffer. The good news is that they are getting rid of this abomination in PHP6.

Now let's take a look at some examples of how not to generate text which I have seen in real life. I'll skip HTML and SQL as they are well covered elsewhere and look at the other things I mentioned at the beginning of this article.

XML files: bad code which generates XML often shares the same problems as bad code which generates HTML - after all, the two are closely related. But as XML is a more generic tool it is used in many domains other than web development, where developers are not "blessed" with knowledge of XSS-like problems. Moreover, I've noticed that even web developers for some reason often consider XML to be something very different from HTML and suddenly forget they have to escape data. I'm especially amused that many people are not aware that you cannot put arbitrary binary data in XML. You have to either encode it into text (base64 encoding is quite popular for this) or put it outside of the XML document.
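
As an illustration of what "escape data" means for XML, here is a minimal sketch in C++. The function name escapeXml is my own, not from any particular library. It only replaces the five characters with predefined XML entities; it does not deal with control characters that XML 1.0 forbids outright, which is exactly why binary data needs base64 or similar.

```cpp
#include <string>

// Replaces the five characters that have predefined entities in XML so the
// result is safe to place in element text or in a quoted attribute value.
std::string escapeXml(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char ch : raw) {
        switch (ch) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += ch;       break;
        }
    }
    return out;
}
```

In real code a proper XML writer library is usually the better choice, since it applies this kind of encoding automatically.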

CSV files: this format is still quite popular for exchanging tabular data between programs. Guess what? I've seen so many naive CSV producers and parsers which ignore reserved characters and then break as soon as they get real data. No, to write a CSV file you cannot just do
print join ",", @columns
What if one of the columns contains, say, "," (comma)?
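
A sketch of what a less naive producer has to do, written in C++ for consistency with the other examples here. It follows the common convention (the one described in RFC 4180): a field containing a comma, a double quote or a line break is wrapped in double quotes, and embedded double quotes are doubled. The names quoteCsvField and joinCsvRow are made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Quotes a single field if it contains a reserved character.
std::string quoteCsvField(const std::string& field) {
    if (field.find_first_of(",\"\r\n") == std::string::npos) {
        return field;  // nothing special inside, emit as-is
    }
    std::string out = "\"";
    for (char ch : field) {
        if (ch == '"') out += '"';  // a quote inside a field is doubled
        out += ch;
    }
    out += '"';
    return out;
}

// Builds one CSV row (without the line terminator) from its columns.
std::string joinCsvRow(const std::vector<std::string>& columns) {
    std::string row;
    for (std::size_t i = 0; i < columns.size(); ++i) {
        if (i > 0) row += ',';
        row += quoteCsvField(columns[i]);
    }
    return row;
}
```

With this, the row made of a, b,c and say "hi" comes out as a,"b,c","say ""hi""" instead of silently gaining an extra column.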

LDAP queries: being a text-based query language, it is a target of problems very similar to SQL's. But while many developers are aware of the SQL injection problem, not many are aware that you have exactly the same problem with LDAP queries. It also doesn't help that while nearly all SQL libraries provide tools to escape data in SQL queries, this doesn't always seem to be the case for LDAP libraries. For example, PHP's LDAP extension has no API to escape data at all.
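
When a library offers nothing, the escaping for values inside LDAP search filters is small enough to write by hand. The sketch below follows the rules from RFC 4515 (the characters *, (, ), \ and NUL become a backslash followed by two hex digits). It is written in C++ and the function name escapeLdapFilterValue is hypothetical. Note that it covers search filter values only; distinguished names have their own, different escaping rules.

```cpp
#include <string>

// Escapes a value for use inside an LDAP search filter, e.g. the part
// after "uid=" in "(uid=...)".
std::string escapeLdapFilterValue(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char ch : raw) {
        switch (ch) {
            case '*':  out += "\\2a"; break;
            case '(':  out += "\\28"; break;
            case ')':  out += "\\29"; break;
            case '\\': out += "\\5c"; break;
            case '\0': out += "\\00"; break;
            default:   out += ch;     break;
        }
    }
    return out;
}
```

A filter would then be built as "(uid=" + escapeLdapFilterValue(userInput) + ")", so that input like *)(uid=* can no longer change the structure of the query.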

Using the shell to execute commands: if you are running a command using system() in C, Perl, PHP or any other language and you are constructing the command from your data, you again should treat this as a problem of proper encoding. The example below is from Mozilla's source code:
sprintf(cmd, "cp %s %s", orig_filename, dest_filename);
system(cmd);
Guess what happens if any of these filenames contains characters which are special to the shell and they were not escaped?
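
If the shell cannot be avoided, every piece of data has to be quoted before it goes into the command string. Below is a minimal sketch for POSIX shells, in C++. The function name quoteForPosixShell is my own; it relies on the fact that nothing is special inside single quotes except the single quote itself.

```cpp
#include <string>

// Wraps `arg` in single quotes for a POSIX shell. An embedded single quote
// is written as '\'' (close the quotes, add an escaped quote, reopen).
std::string quoteForPosixShell(const std::string& arg) {
    std::string out = "'";
    for (char ch : arg) {
        if (ch == '\'') {
            out += "'\\''";
        } else {
            out += ch;
        }
    }
    out += '\'';
    return out;
}
```

The command from the example would then be assembled as "cp -- " + quoteForPosixShell(orig_filename) + " " + quoteForPosixShell(dest_filename). The "--" is an extra precaution I added so that a filename starting with a dash is not taken for an option.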

While I'm at it, I'd mention that it is probably a good idea to avoid APIs which use the shell to execute commands at all, simply because shell programming is too hard to get right.
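
On POSIX systems the usual shell-free alternative is fork() plus one of the exec functions, which take the arguments as an array so nothing is ever parsed by a shell. A rough sketch in C++ follows. The wrapper name runWithoutShell and its return convention are my own assumptions, and error handling is kept to the bare minimum.

```cpp
#include <cerrno>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs args[0] with the given arguments, looked up via PATH, without a shell.
// Returns the program's exit status, 127 if it could not be executed, or -1
// if the process could not be started or was killed by a signal.
int runWithoutShell(const std::vector<std::string>& args) {
    if (args.empty()) return -1;

    // execvp() wants a NULL-terminated array of char* pointers.
    std::vector<char*> argv;
    for (const std::string& arg : args) {
        argv.push_back(const_cast<char*>(arg.c_str()));
    }
    argv.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);  // only reached when exec failed
    }

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) return -1;  // retry only when interrupted
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The copy from the example above becomes runWithoutShell({"cp", "--", orig_filename, dest_filename}), and spaces, quotes or semicolons in the filenames are passed through to cp untouched.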

What would help a lot is if tools supported developers better in writing correct code which deals with text-based APIs. Sometimes it is just a lack of documentation on encoding rules. For example, a month ago I was learning the Facebook APIs. One of the provided APIs is an API to execute so-called FQL queries. This is an SQL-like query language and naturally I'd expect FQL injections to be covered in the documentation. They aren't; it is not even documented how to escape string data in FQL queries! I played with different queries in the FQL console and it seems that the standard SQL-like method (i.e. using "\" (backslash) as an escape character in strings) does work, but why do I have to find this out on my own? It is also a shame when libraries built around text APIs do not provide the means to properly encode data for the text formats they use. I mentioned one such example above: PHP's LDAP extension provides no functions to escape data for LDAP queries. How hard is it to add this? If you are creating text-based APIs or libraries around such APIs, it is your duty to help the developers who will be using them. So do document encoding rules and do provide tools to automatically encode data!

Syndicated 2007-09-21 22:55:00 from Ilya Martynov's blog

6 Sep 2007 »

Perl as replacement for shell scripting (Part I)

By shell scripting I mean bash, as it is what most (all?) Linux distributions use. Bash can be used as a quite capable programming language. It lets a programmer build rather complex scripts by using other programs as building blocks. The system comes with a number of such building blocks (find, grep, sed, awk and many others), and unsurprisingly there is a lot you can do with them. But it is often a challenge to write robust shell scripts which work, or at least fail gracefully, for any kind of input. The main reason is that historically shell scripts could use only one data type: the string*. Those building blocks, the external programs you use in shell scripts, have a very restricted interface: program arguments which are strings, a stream of strings as input, a stream of strings as output, and an exit code.

Even a simple concept like a list has to be emulated. For example, a list of file names is often passed as a string which contains these file names separated by whitespace. But what if one of these file names contains whitespace? You get a problem. To fix it you need to escape whitespace characters in the file name. And it is rather easy to miss places where you have to do the escaping. A slightly contrived example:
rm `ls`
This would delete all files in the current directory .. unless they have whitespace characters in their names. There are many similar cases where an unwary programmer can make a mistake in a shell script. Passing data from one process to another often requires a lot of care, and the simplest code is often wrong. Another problem is that you are very limited in how you can handle errors in shell scripts - you only have the process's exit code to tell you if it finished successfully. And usually it is just a boolean value telling you whether there was any error or not. Quote from the linked document:
However, many scripts use an exit 1 as a general bailout upon error. Since exit code 1 signifies so many possible errors, this probably would not be helpful in debugging.
If, say, mkdir fails, your script cannot easily tell whether it is because a directory with the same name already exists or because you just don't have permissions for this operation.
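
For contrast, this is roughly what the same operation looks like through a native API instead of an external program. The sketch is in C++ on top of the POSIX mkdir() call (Perl exposes the same information through $!). The enum and the function name makeDirectory are invented for this example.

```cpp
#include <cerrno>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>

enum class MkdirResult { Created, AlreadyExists, PermissionDenied, OtherError };

// Creates a directory and reports why it failed, using errno rather than a
// single yes/no exit code.
MkdirResult makeDirectory(const std::string& path) {
    if (mkdir(path.c_str(), 0755) == 0) {
        return MkdirResult::Created;
    }
    switch (errno) {
        case EEXIST: return MkdirResult::AlreadyExists;
        case EACCES: return MkdirResult::PermissionDenied;
        default:     return MkdirResult::OtherError;
    }
}
```

The caller can now decide that AlreadyExists is fine and carry on, while PermissionDenied is a reason to stop, which is the distinction a bare exit code of 1 does not give you.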

So are there any solutions to this problem? As for myself, the moment I see my shell script getting longer than three lines of code I rewrite the whole thing in Perl. In Perl you don't need external programs nearly as often as you do in bash. Therefore you are not limited by their restrictive interfaces (remember, only strings and exit codes for input and output); native Perl APIs can be much more expressive when they need to be.

There is a price though. Perl code is not always as compact as similar shell code for some scripting tasks. This is because shell scripting is optimized very well for handling the interaction of processes and Perl is not, at least not as much. It is worth mentioning that many things which are taken for granted in shell scripting often require Perl modules, including non-standard CPAN modules. This is not a problem as such, except that not all Perl programmers know where to look for things if they are not covered by perlfunc. This is mainly a concern for newbie Perl programmers but it is still a real problem. Also, using CPAN modules is not always an option.

Of course in your Perl program you can fall back to using the same external programs you would use in a shell script, but then you lose the advantages of Perl over shell scripting. So .. don't do this if possible. An interesting example of this principle: Perl before version 5.6.0 would fall back to the shell to execute the glob operation. That caused various problems for Perl developers: for example, I saw Perl programs using glob fail when run on one tightly secured web hosting server, because the binary Perl was calling had simply been removed from the server for security reasons. In later versions of Perl the implementation of glob was changed: it is now implemented inside Perl itself (through the File::Glob module) and doesn't use external programs.

To be continued in Part II: mapping between common shell operations and corresponding Perl modules.


[*] Newer versions of bash support arrays. I'd argue that the usefulness of arrays in bash is limited, as programs you call from shell scripts cannot use them to pass output data. You are still limited to string streams and exit codes. Not to mention that arrays are not very portable across different systems.

Syndicated 2007-09-06 14:33:00 from Ilya Martynov's blog

23 Aug 2007 (updated 24 Aug 2007 at 10:21 UTC) »

libxml++ vs xerces C++


When I was reading "API: Design Matters" I recalled one example of a good API vs a bad API. Actually my example is more about good API documentation vs bad API documentation, but I suspect there is a correlation between the two. It is definitely hard to write good documentation if your API sucks.

So my story is that I had a task to read XML data in a C++ application. The XML data was small and the performance of this part of the application was not critical, so it looked like the simplest way to read this data was to load a DOM tree for the XML document and just use the DOM API and maybe a couple of simple XPath queries. It was the first time I needed to do this in C++; I had no previous experience with any C++ XML libraries. So I do a Google search (or maybe it was apt-cache search - I don't remember) and the first thing I find is Xerces-C++. Quote from the project's website:
Xerces-C++ makes it easy to give your application the ability to read and write XML data.
Sounds good, just what I need. So I dig into the documentation and find it completely unhelpful, as it is just Doxygen autogenerated undocumentation. Fine, I can read code, so let's check the sample code then. I open the sample code and find that the shortest example of how to parse XML into a DOM tree and access data in the tree (DOMCount) consists of two files which are more than 600 lines long in total. Huh? I don't want to read 15 pages of code just to learn how to do two simple actions: parse XML into DOM and get data from DOM. The other examples are even worse. Several files, several classes just to read and print freaking XML (DOMPrint). You've got to be kidding me. It cannot be that hard.

I don't really want to waste hours learning an API I'm unlikely to ever use again. After all, I don't write much C++ code and I definitely don't write much C++ code that needs XML. So it's time to search further. The next hit is libxml++. It is a C++ wrapper over the popular C XML library libxml. This time there is actually some documentation that does try to explain how to use the library. And this documentation contains an example which, while being just about 150 lines long, manages to demonstrate most of the library's DOM API.

End result: I finish my code to read my XML data in the next 30 minutes using libxml++. It is simple, short and it works.

So what's wrong with Xerces-C++? There is no introduction-level documentation at all. The examples look too complex for the problems they are supposed to show solutions for. And the reason for this is that the API is just bad: it requires writing unnecessarily complex client code.

Update: boris corrected me about the lack of introduction-level documentation in a comment on this blog post. It turned out I had missed it. As a weak excuse I'll blame bad navigation on the project's site :)

Syndicated 2007-08-23 20:54:00 (Updated 2007-08-24 10:00:03) from Ilya Martynov

23 Aug 2007 (updated 23 Aug 2007 at 21:10 UTC) »

4 silly mistakes in use of MySQL indexes


1. Not learning how to use EXPLAIN SELECT

I'm really surprised how many developers use MySQL all the time and yet do not know or understand how to use EXPLAIN SELECT. Several times I've seen developers propose serious architectural changes to their code to minimize, partition or cache data in their database when the actual solution was to spend 30 minutes thinking over the result of EXPLAIN SELECT and adding or changing a couple of indexes.

2. Wasting space with redundant indexes

If you have a multicolumn index, you don't need a separate index on a leftmost subset of its columns. It is easier to explain with an example:
CREATE TABLE table1 (
    col1 INT,
    col2 INT,
    PRIMARY KEY (col1, col2),
    KEY (col1)
);
The index on col1 is redundant, as any search on col1 can use the primary key. It just wastes disk space and might make some queries which change this table a bit slower.
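
The check itself is mechanical, so it can be expressed in a few lines. This is a toy model in C++ with made-up names (IndexColumns, isLeftmostPrefix), not anything MySQL provides: an index is described simply as the ordered list of its column names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

using IndexColumns = std::vector<std::string>;

// True when every column of `candidate` appears, in the same order, at the
// start of `wider`. Such a candidate index adds nothing that `wider` cannot
// already do for lookups.
bool isLeftmostPrefix(const IndexColumns& candidate, const IndexColumns& wider) {
    if (candidate.empty() || candidate.size() > wider.size()) {
        return false;
    }
    return std::equal(candidate.begin(), candidate.end(), wider.begin());
}
```

For table1, isLeftmostPrefix({"col1"}, {"col1", "col2"}) is true, so KEY (col1) is redundant, while an index on col2 alone would not be.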

There is one but! See below..

3. Incorrect order of columns in index

The order of columns in a multicolumn index is important. From the MySQL documentation:
MySQL cannot use an index if the columns do not form a leftmost prefix of the index.
Example:
CREATE TABLE table2 (
    id INT PRIMARY KEY,
    col1 INT,
    col2 INT,
    col3 INT,
    KEY (col1, col2)
);
MySQL won't use any index for a query like
SELECT * FROM table2 WHERE col2=123
EXPLAIN SELECT shows this instantly. If you want this query to run faster, either change the order of columns in the index or add another index.
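
The leftmost prefix rule can be modelled in a few lines of C++. This is a deliberately simplified sketch with hypothetical names: it only looks at equality conditions and ignores everything else a real optimizer considers, so EXPLAIN SELECT remains the authority.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns how many leading columns of an index can be used, given the set of
// columns that have an equality condition in the WHERE clause. The count
// stops at the first index column that the query does not constrain.
std::size_t usablePrefixLength(const std::vector<std::string>& indexColumns,
                               const std::set<std::string>& equalityColumns) {
    std::size_t usable = 0;
    for (const std::string& column : indexColumns) {
        if (equalityColumns.count(column) == 0) break;
        ++usable;
    }
    return usable;
}
```

For KEY (col1, col2) a query on col2 alone gives 0 (the index is useless), a query on col1 alone gives 1, and a query on both columns gives 2.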

4. Not using multicolumn indexes when you need to

MySQL usually uses only one index per table at a time, so if you query by several columns in the table you may need to add a multicolumn index. Example:
CREATE TABLE table3 (
    id INT PRIMARY KEY,
    col1 INT,
    col2 INT,
    col3 INT,
    KEY (col1)
);
A query like
SELECT * FROM table3 WHERE col1=123 AND col2=456
would use the index on col1 to reduce the number of rows to check, but MySQL can do much better if you add a multicolumn index which covers both col1 and col2. The effect of adding such an index is very easy to see with EXPLAIN SELECT.

Syndicated 2007-08-16 12:22:00 (Updated 2007-08-23 21:02:00) from Ilya Martynov

IlyaM certified others as follows:

  • IlyaM certified cwinters as Journeyer
  • IlyaM certified ask as Journeyer
  • IlyaM certified chromatic as Journeyer
  • IlyaM certified jmcnamara as Journeyer
  • IlyaM certified IlyaM as Journeyer
  • IlyaM certified pudge as Journeyer
  • IlyaM certified merlyn as Master
  • IlyaM certified japhy as Journeyer
  • IlyaM certified Simon as Master
  • IlyaM certified Spoon as Journeyer
  • IlyaM certified autarch as Journeyer
  • IlyaM certified petdance as Journeyer
  • IlyaM certified markjugg as Apprentice
  • IlyaM certified chaoticset as Apprentice
  • IlyaM certified adrianh as Journeyer

Others have certified IlyaM as follows:

  • dtucker certified IlyaM as Apprentice
  • IlyaM certified IlyaM as Journeyer
  • Spoon certified IlyaM as Journeyer
  • markjugg certified IlyaM as Journeyer
  • adrianh certified IlyaM as Master
  • chaoticset certified IlyaM as Journeyer
  • autarch certified IlyaM as Journeyer
