...a decent MP3 player with stream support. Finally.
Hello, MOC[1]! XMMS, go eff yourself.
Of course I'm sorry for not using hmp3[2], but it doesn't do streams, appearently. At least not that I can tell from the website. Also missing Ogg support is a bit of a let-down.
Soo, everything went well different than planned. What was supposed to be a holidy clean-up rewrite of a fun weekend project has turned into a half-year side project running next to university.
To recap, I initially set out to implement a Markdown[1] parser in Haskell so I could post formatted text to my Advogato blog. An email-to-Advogato gateway was quickly whipped up[2]. The first prototype version of a Markdown parser was also finished within reasonable time[3]. Unfortunately, the code was a mess, so I set out for the rewrite[4]. Much progress was made but it kept screwing up in certain minor but annoying cases and the code still looked convoluted. Basically, Parsec just didn't want to bend in the right direction...
So I replaced Parsec. The module is called Text.ParserCombinators.Nonlinear[5] because it allows one to slurp in parts of the document in one part of the parser and reparse them again later. This allowed me to split up the document according to its block-level structure and re-assemble, for instance, the text pieces of quoted or indented lines (without the leading quote marks/indetation) and run the corresponding parser over the thus extracted subdocument. Such embedded parses can also work with a completely different token type than the enclosing parser, a capability which also came in handy.
I recently came across "Frisby"[6], a Haskell implementation of PEG grammars, which I had never heard of before. The description sounds cool. I wonder if my Markdown variant could be represented by one? My parser library is neither optimized for space nor speed, and PEGs sound compelling in that regard...
Anyway, the implementation based on my nonlinear parsers worked out really nice wrt. the code structure and doesn't show any of the kinks that plagued the Parsec version. Since I've deviated somewhat from Markdown syntax in the places I didn't like, I've dubbed the package k-tex. I've still got to update the documentation but if anyone is interested in looking at or even improving the code, you can find it at http://www.khjk.org/~sm/code/k-tex/.
Best regards, Sven Moritz
PS. Yep, the Advogato gateway[7] already uses k-tex, and if this post appears on my blog[8], it's working. ;)
So, the first working version of my "structural plain-text" parser is just working and I'm already dissatisfied with it. Well, the reasons are simple:
It doesn't work correctly in all cases.
The code is a mess, even though it was already the third or so iteration internally ("plan to throw...", you know).
PS. Oh yeah, after all, I've included "strong emphasis" in this version, because there are uses for it, specifically in technical documentation (which I intend to write). It carries a Fat Warning (TM) in the docs, so I'm happy and people can do whatever they want. ;-)
A few weeks ago I posted my initial thoughts about how to represent a structured text document in Haskell. As often with initial thoughts, they were flawed. ;-) In particular, I wrote:
I've tried to design the above types in such a way as to minimize the possibility of forming non-sensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc.
Without going into too much detail, I stumbled over the problem in the representation of block quotes: The corresponding constructor expected a full Doc and I couldn't break sufficiently small fragments out of those for quoting. For example, had I wanted to quote one item out of a numbered list, it would have gotten the number "1" again in the quote, because it would have been the first item within that document (fragment)...
So, I thought, should I make one big data type for the whole document after all? But still, I didn't want to include too much possibilities for such senseless combinations as bulleted lists and blockquotes within section headings. After some pondering I've decided on using a two-level structure, with one data type representing documents (or document fragments) at the block level (paragraphs, sections, etc.) and a second one for in-line structures only. In particular, a document is taken to be a list of "block-level tokens" ('Btok's):
type Doc = [Btok]Analogously, (logical) lines are represented by line-level tokens:
type Line = [Ltok]Apart from plain strings, these consist of things like emphasis, code spans, etc.:
data Ltok = St String -- character string
| Em Line -- emphasis
| Co String -- inline code
...
Note that under their intended interpretation, these constructors form
homomorphisms into the monoid of Ltok-lists (wrapping their
result in a single-element list): e.g. [St a] ++ [St b] is expected
to be equal to [St (a++b)], [Em a] ++ [Em b] == [Em (a++b)], etc.
This enables one to split a line in two at an arbitrary point without
"losing any information". :-)
I noticed that if something like the above could hold for the block-level Doc type as well, my block quote problem would be solved pretty elegantly. And indeed, I found the following representation:
data Btok = TEXT [Line] -- text block (logical lines)
| PARA Line -- new paragraph (w. opt. title)
| SECT Int Line -- new section (of given level)
| QUOT Doc -- blockquote
| BULL [Doc] -- bulleted list
...
Notice that the PARA and SECT constructors do not wrap the
corresponding paragraph or section. Their meaning is analogous to
that of a new-line character in plain-text: they start a new structure
which spans everything up to the next break. This is exactly what makes
QUOT work: We can break the document apart in the middle of a
section or paragraph and rightfully consider the resulting parts
fragments in the common sense that they can be glued back together to
form the original. Or to put it differently, a quoted fragment
carries all of its "own" structural information (where the meaning of
"own" also results from the data type's structure).
Remember that the whole point of my foray into structured documents originated in my little project of posting to my Advogato diary via email. So, say I had a document to be posted in memory as a value of type Doc as above. I'd want to convert it to HTML (or Advogato's subset thereof). While the representation of section breaks as stand-alone SECTs within a stream of block tokens closely fits with the <h1-6> tags to appear in the HTML, paragraphs are supposed to be wrapped in <p> tags. One obvious (but annoying) solution to this discrepancy would be to build a second data type, probably a little more like my original idea again, and convert to HTML via the detour of this intermediate type.
Luckily I remembered that Oleg[1] wrote something about folds being superior to cursors[2] as interfaces to data collections, so I implemented a function of the following humiliating type signature. ;)
-- Fold over a Doc's block-level structure.
folddoctree
:: (Biblio -> Line -> line) -- logical line
-> ([line] -> tok) -- text block
-> (String -> tok) -- code block
-> (doc -> tok) -- blockquote
-> ([doc] -> tok) -- bulleted list
-> (Int -> [doc] -> tok) -- numbered list
-> ([(String,line)] -> tok) -- bibliography
-> ( line -- para. title
-> [tok] -- para. content
-> para ) -- paragraph
-> ( Int -- sect. level
-> line -- sect. title
-> doc -- sect. content
-> sect ) -- section
-> ( [para] -- initial paragraphs
-> [sect] -- following sections
-> doc ) -- document
-> Doc -> doc
As you can see, it takes ten functions as parameters, each telling it
what to do with a certain part of the document, which it then
transforms accordingly. Given this function, an HTML export for
Docs is straight-forward to define.
NB. Notice how the different type variables are threaded through the type signature. It feels almost like LISP, in the sense that the functions can return any type they want, as long as everything fits. But unlike in a dynamically-typed language, the typechecker will tell you up-front if you make a mistake. :-P
Ah right, most important! The full definitions of the Doc type, the folddoctree function, and some further utilities can be seen at the following address:
http://www.khjk.org/~sm/code/advopost/Doc.lhs
Enjoy! :-)
I'd like to share the results of my latest "freshmeat" spree. For context, I keep all my email on a University server, so I can access it from anywhere (No, there are no webmailers!). I have access through IMAP which I used with Thunderbird, until yesterday -- when I finally got fed up with it. Thunderbird regularly mixes up its idea of which mail folders there are on the IMAP server, and where. So there I was, looking for a new MUA. Lot's of dissatisfaction ensued. Until I found two wonderful programs:
The first does something very simple (but in a smart way): It synchronizes IMAP mailboxes with a local repository. Both ways. So I set it up with all my IMAP settings (through a nice, and "unuserfriendly" config file -- sic!) and let it sync every five minutes. New mails get pulled into local Maildir folders where any old MUA can conveniently handle them. And whatever changes I make to them are propagated back to the server. And no, it doesn't lose any mails; so the author assures. The author, btw., turns out to be a fellow Haskell programmer -- John Goerzen, thank you: You just made my day!
Thanks to OfflineIMAP I can finally use my old-time favourite MUA again: Mutt. But wait, as I discovered, there is even Mutt-NG now, which finally adds the one thing I had been missing all along: A sidebar that shows an overview of (new/old/etc.) messages in my incoming mail folders! Oh, and for extra pleasure, it also speaks SMTP, so I don't even have to set up a local delivery agent.
And here's some more general Mutt praise: With just a few pretty simple config file (sic!) settings, mutt pretty much takes care of moving read mails away into yearly archive folders all by itself:
save-hook ~A =sa-spam
This means that pressing 's' to move a message to another folder presents sa-spam as default, which happens to be the folder my spamassasin learns from.
set record="=Archive/`date +%Y`/Sent"
This is where mutt saves Fcc's, i.e. copies of sent messages. Notice the automatic year selection.
set mbox="=Archive/`date +%Y`/Inbox" mbox-hook =Somelist =Archive/`date +%Y`/Somelist mbox-hook =Otherlist =Archive/`date +%Y`/Otherlist
Finally, these are the real fun: Remember how default Mutt setups usually ask on quit whether they should save read messages to "mbox" in your homedir? A completely stupid question, nobody wants that. But with the above, this suddenly changes to something sensible: The correct archive folder for the current year, sorting different input folders into a corresponding archive subfolder. Wonderful.
Maybe you cannot really appreciate how good this is. But that's usually the case with improved usability, you don't notice until afterwards how much better things feel with just that little tweak. Compare my previous post about bitlbee. Life just started being fun again! Time to throw out all those black clothes and start listening to eurodance.
Enjoy.
PS. Friendly apologies for praising config files go to lkcl who recently bitched about "this crap obscure text file editing shit". ;-P
I've just returned from a two-semester exchange at the Charles University[1] of Prague. There I had the chance to take a course about classical cryptography, for which I wrote an essay that might be interesting to others:
Everybody knows about the Enigma machine[2]. It was a cryptographic device developed by the Germans, used extensively by Hitler's troops, and famously broken by the allies at Bletchley Park. It was an "offline" device in that the plaintext was keyed in letter by letter, indicating the corresponding ciphertext character with each keystroke. Now, what many are not aware of is that the Germans also used a series of online crypto devices during WWII. One of these was the Siemens T52[3], a.k.a. the "Geheimschreiber". This machine was an actual teleprinter with built-in automatic encryption and decryption, i.e. it encrypted and transmitted the characters as they were typed (or fed via punched tape), and vice-versa. It's cryptographic algorithm is considered significantly more sophisticated than that of the Enigma.
When Germany leased several telegraph lines from Sweden they were immediately tapped but the T52 was soon employed on them. My essay[4] describes how the Swedish cryptanalyst Arne Beurling broke the encryption without ever having seen one of the machines. It's mainly a condensed form of the description from "Codebreakers"[5], with less history and more mathematics.
Enjoy.
References:
Update: Typesetting a reference list in Advogato
For those of you who, like me, are not completely fluent in CSS (and maybe a little stuck with HTML wisdom of the 90's), here's how you produce hanging indents as in the reference list above:
<div style="float: left">[1]:</div> <div style="margin-left: 5ex"> "Univerzita Karlova v Praze"<br> <a href="http://www.cuni.cz">http://www.cuni.cz/</a> </div>
For a reference list, it also looks good to indent the whole thing a little by putting it in another <div style="padding: 1ex">.
Enjoy
I wrote previously about my plan to use Markdown as the input format for advopost. I decided against re-using the existing Markdown-to-HTML converter, because I would have to strip the resulting output down to the Advogato subset of HTML in postprocessing; feels too clutchy. So I'm going to implement a parser for (a variant of) Markdown that reads the input into a structured Haskell data type 'Doc'. Here is my current design for that type:
module Doc where
data Doc = Doc String -- title
[Para] -- body
[Doc] -- subsections
data Para = Paragraph String -- paragraph title
[Block] -- paragraph body
data Block = Blockquote Doc
| Bulleted [[Para]] -- unordered list
| Numbered [[Para]] -- numbered list
| Codeblock [[Inline]] -- list of lines, ignore Codespans
| Line [Inline]
data Inline = Str String -- ignore linebreaks
| Codespan [Inline]
| Emph [Inline]
| Link [Inline] -- link text
String -- link target
String -- link title
| Image [Inline] -- fallback alternative for this image
String -- image location
String -- image title
I want both the input format and the Haskell data structure to be independent of the output format being HTML. Therefore I'm not going to support inline-HTML in the input. I also want structural markup (as opposed to presentational), so I left out horizontal rules and forced linebreaks. Lastly, I've never heard of using a "strong emphasis" (as opposed to normal emphasis) in typesetting, so I dropped that as well.
I've tried to design the above types in such a way as to minimize the possibility of forming non-sensical or ambiguous documents. That's why there is such deep nesting of different types instead of just one big algebraic data type with constructors for concatenation, paragraph and section breaks, etc.. Comments welcome.
I hope that the 'Doc' type will be useful in further coding. For example, it would be really cool to have a fancy combinator library for 'Doc's along with a pretty-printer to turn them back into plaintext: Then we could use them for general pretty output from Haskell programs. While there are several existing pretty-printing libraries, to my knowledge none of them use structural markup and they are all targeted at console output only.
Posting to Advogato via email
If this messages reaches my Advo diary, my new mail-to-advo gateway program is not only working but also correctly embedded into my mail setup via procmail. Thus, albeit being undocumented, underfeatured, and unpolished, the program is quite functional. Lest I forget it myself, here are the quick set-up instructions:
Step 1: Dependencies.
The program is written in Haskell and has been tested with GHC 6.4.1. It requires a number of Haskell packages, namely: MissingH (0.13.1), HaXml (1.13), NewBinary (2005-05-30), Crypto (3.0.3), HaXml (1.13), http (2005-04-23), haxr (2006-05-30, custom-patched)
I've already submitted my patches for haxr (the XML-RPC library) to its author, Björn Bringert. Hopefully, he'll accept them soon. In short, the particular problem was that Advogato returns <value> elements with whitespace (i.e. character data) around the actual data element. I had to modify a function called fromXRValue to filter them out -- /unless/ the whole <value> consists of /only/ character data, which is supposed to be interpreted equivalently to a <string> data element. Duh. (Rant: This format is supposed to be processed by /machines/ for fsck's sake!)
Step 2: Fetch and build.
There's just a single source file, http://www.khjk.org/~sm/code/advopost/advopost.lhs. Fetch it and build with the straight-forward command 'ghc -o advopost --make advopost.lhs'. Make sure you've got all the above dependencies visible (I had to manually 'ghc-pkg expose' HaXml, I think).
Step 3: Procmailrc.
Choose an email account to direct your postings to. I'm posting to a mailing list and the posts loop back to my own account. There, the following procmail recipe passes any posts from me to that list through advopost:
:0 c * ^TO_the@mailing.list * ^From:.*my@email.addr * ! ^In-Reply-To: | /path/to/advopost my-advo-name my-advo-pass
Put this recipe at the beginning of your respective .procmailrc; the 'c' flag in the first line means "copy", i.e. a copy of the mail is passed to advopost and the original continues through the rest of .procmailrc. The third condition line makes this rule apply only to messages that don't carry an "In-Reply-To" header, i.e. only fresh posts are passed to Advogato, not my replies to other people's messages.
Enjoy.
PS. Er, naturally, advopost is open-source software. You may peruse the published version under the terms of the common 3-clause BSD license.
PPS. I know, this post is missing lots of markup. While advopost would pass it all through untouched, I don't want to burden the mailing list I'm cross-posting to with lots of HTML tags. My plan is to finally interpret the input to advopost as Markdown, cf. http://daringfireball.net/projects/markdown/.
Update. Aha, of course, non-ASCII characters come through wrong. I'm pretty sure I'm transmitting them as UTF-8, or at least so I had assumed. Will have to investigate...
Update 2. Yay, Björn has already accepted my patches! Advopost will now work with the latest haxr from darcs (2006-05-31).
Posting to Advogato via email
Sooo, I want to post to my diary by email. If you read this, the program is working. :)
Update. Oh, of course you want to see the code! There it is, in all its unpolished glory. Note that about 90% of it is concerned with working around bugs in the crappy email parser. ;-/
New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.
Keep up with the latest Advogato features by reading the Advogato status blog.
If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!