The Chicago Project: Making access to Excel data easy
Posted 27 Feb 2002 at 17:10 UTC by jackshck
As it currently stands, if you need to access Excel files under UNIX
you have to write your own code to do so. There is no standard
open-source library to access Excel data. This article will discuss
the Chicago Project, and how it attempts to fix this problem.
The Chicago Project will
create an open source library to read and write Excel files. You
may wonder why this is needed. After all, Gnumeric and StarCalc
deal with these files, right?
Well, yes, they do, but each of them is at a different stage of
support.
If you want to read or write Excel data from your application,
how would you do it? You could look at the Gnumeric and KSpread
code to figure out how they do it, or you could write your own code.
It would be a lot simpler if there were a standard API for
accessing Excel files under UNIX. The Chicago Project will attempt
to provide that API. To do so, however, we need your help. The goals
of the Chicago Project are:
1. An open source C library to read and write OLE files.
A good amount of code exists to deal with these files. It needs to be
combined into one project. I propose the name libOLE for this project.
2. An open source C library to read and write Excel files.
A good amount of code exists to deal with these files. It needs to be
combined into one project. I propose the name libXLS for this project.
3. Complete documentation of the OLE file format.
Documentation on this file format is scattered all over the net. It
needs to be put in one place. As I find documentation on the file
format, I am linking to it at my web site. If you know of any
documentation on the file format, please let me know.
4. Complete documentation of the Excel file format.
The most complete documentation of this format is located at
http://sc.openoffice.org in PDF and XML format. However, several sections
are marked 2do. They need to be filled in.
5. Documentation of the Excel Encryption Algorithm.
I have spent a good deal of time searching for information on this
topic. Everything I have found so far is collected at:
http://chicago.sourceforge.net/devel/docs/excel/encrypt.html
Currently no open source spreadsheet can open password-protected
Excel workbooks. This is NOT ACCEPTABLE! The wvWare project
(http://wvware.sf.net) can open password-protected Word documents if
the correct password is supplied. I have begun to modify this code to
deal with Excel files.
However, as no standard library is available to deal with Excel files, I
am running into a brick wall while modifying the code.
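As a rough illustration of what goal 1 involves, here is a minimal sketch in Python (not the project's planned ANSI C) that checks whether a buffer starts with an OLE2 compound document header and pulls out a few fields. The function names and the synthetic sample header are made up for this example; the field offsets follow the publicly documented compound file header layout as I understand it, and a real library would go on to walk the sector allocation tables and the directory.

```python
import struct

# Signature that begins every OLE2 compound document (including .xls files).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_ole2_header(data: bytes) -> dict:
    """Return a few header fields, or raise ValueError if this is not OLE2."""
    if len(data) < 512:
        raise ValueError("an OLE2 header is 512 bytes long")
    if data[:8] != OLE2_MAGIC:
        raise ValueError("missing OLE2 signature")
    # Offsets 24..31 hold four little-endian 16-bit fields:
    # minor version, major version, byte-order mark, sector shift.
    minor, major, byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte-order mark")
    return {
        "major_version": major,
        "minor_version": minor,
        "sector_size": 1 << sector_shift,
    }


def make_sample_header() -> bytes:
    """Build a synthetic 512-byte header (hypothetical, for demonstration only)."""
    header = bytearray(512)
    header[:8] = OLE2_MAGIC
    struct.pack_into("<4H", header, 24, 0x003E, 3, 0xFFFE, 9)
    return bytes(header)


info = parse_ole2_header(make_sample_header())
```

With a real file you would pass the first 512 bytes read in binary mode instead of the synthetic header.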
If you would like to help with the Chicago Project, please e-mail me
at jackshck@yahoo.com
OLE : the GNOME project's libole is certainly not the most beautiful
code and could use some cleaning. However, a complete rewrite seems a
bit over the top, but that is a matter of taste.
MS Excel : A library to fully manipulate XL files is going to require
support for storing all of the different objects in a spreadsheet.
Doing that basically implements a spreadsheet at some level. IMO the
best bet will be to help the work on Gnumeric to split
out 'libgnumeric'. Anything less than a full spreadsheet + extensions
to handle all the extras in MS Excel is going to be lacking. Basic
capabilities such as those in the read/write Excel Perl modules should
be sufficient for most tasks; to go further than that is going to
require infrastructure.
I applaud your efforts to get docs on the encryption, that would
certainly be a boon. As would any additional docs you can dig up on
undocumented or murky corners. However, your desire to rewrite large
blocks of complex code from scratch seems misplaced. There is not
enough gain to warrant it.
Jody wrote
OLE : the GNOME project's libole is certainly not the most beautiful
code and could use some cleaning. However, a complete rewrite seems a bit
over the top, but that is a matter of taste.
IMO libole2 is very complex. To see
how simple it is to handle OLE files, please download libOle from the
Chicago project.
MS Excel : A library to fully manipulate XL files is going to require
support for storing all of the different objects in a spreadsheet. Doing
that basically implements a spreadsheet at some level.
If you are writing a spreadsheet anyway, then that point isn't
valid.
I applaud your efforts to get docs on the encryption, that would
certainly be a boon. As would any additional docs you can dig up on
undocumented or murky corners.
Thank you. What should I look for docs on?
However, your desire to rewrite large blocks of complex code from
scratch seems misplaced. There is not enough gain to warrant it.
I think there is the potential for a lot of gain. The reason is this:
every person who writes a spreadsheet has to write the code to access
Excel files from scratch. If there were a library that provided access to
Excel files, it would save a lot of time and effort.
Looking at the website:
> The API will be written in ANSI C, and will be licensed under the GNU
General Public License.
I believe the GPL will not serve the community's best interest in this
case. Since this is an interoperability project, it should also be
possible to use this piece in non-GPL projects. The LGPL (among other
licenses) would allow that. The LGPL also permits forking back to the GPL.
Hello
I have considered placing the Chicago project code under the LGPL.
I am not dead set on the GPL. The final decision will be based on
input from the user community.
xlhtml, posted 28 Feb 2002 at 00:16 UTC by grant »
(Journeyer)
I was about to recommend xlhtml, at least for reading Excel files,
since I've used it successfully for that purpose.
The xlhtml.org site was nowhere to
be found, but it appears that the Chicago Project builds from xlhtml
and uses its libraries.
Re: xlhtml, posted 28 Feb 2002 at 00:53 UTC by jackshck »
(Journeyer)
grant wrote:
I was about to recommend xlhtml, at least for reading Excel files, since
I've used it successfully for that purpose.
It is a rather good reader, so it would be a good recommendation.
The xlhtml.org site was nowhere to be found
I noticed that. It was up yesterday, but is not today.
but it appears that the Chicago Project builds from xlhtml and uses its
libraries.
You're right and wrong on this point. I have contributed code to the
xlhtml project, but am not using any of their code at the moment. One
goal of the xlhtml project is to turn xlhtml into a library and an
application that uses that library. I am going to help them with this
goal. When they have completed the library, I will integrate it into
the Chicago project.
why?, posted 1 Mar 2002 at 23:38 UTC by ishamael »
(Journeyer)
I'm confused. Why exactly do you feel we need a standard open source
library to access Excel data? Perhaps forgive my narrowmindedness, but
why would you need to read Excel data unless you were writing some sort
of spreadsheet application? Then, if you were writing such an
application, I would ask, why? Gnumeric, KSpread, etc.: there are already
plenty, and they're already pretty damn slick and complete. So... why?
Hello
Thank you for mentioning this. I have already talked with the
author about the Excel file format.
Re Why, posted 2 Mar 2002 at 17:23 UTC by jackshck »
(Journeyer)
I'm confused. why exactly do you feel we need a standard open source
library to access Excel data? perhaps forgive my narrowmindedness, but
why would you need to read Excel data unless you were writing some sort
of spreadsheet application.
Hello
You may want to write Excel files from a database application,
or something like that. If Excel can't read your application's own
format, then you have to export a CSV file. If a standard library exists
to read and write Excel data, you don't have to write a CSV file.
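The CSV fallback described here can be sketched in a few lines of Python using the standard csv module. The function name, column names, and sample rows are invented for illustration. The result opens in Excel, but it carries only plain values: formatting, formulas, and multiple sheets are lost, which is the gap a native Excel writer would close.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize a header and data rows to CSV text that Excel can import."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows, standing in for the result of a database query.
report = rows_to_csv(
    ["host", "cpu_load"],
    [("web1", 0.42), ("db, primary", 0.87)],
)
```

Note that the writer quotes the field containing a comma, so the column layout survives the round trip.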
Hi, we've corresponded in the past and I've emailed you before to see if
you were interested in collaborating. There *IS* a standard project
that is far along and can read/write Excel files from UNIX, and it's even
linked (to the old site) on your page: Jakarta POI. We've also
recently received some new donations that allow us to read Word (in the
process of integrating) and Document Summary information. Next, we have
the cleanest, most complete port of the OLE 2 Compound Document format
that I know of. The project is of course in Java as opposed to ANSI C,
but that pretty much means you can run it darn near anywhere (we
actively test it on Linux and Windoze). And obviously it's APL and not
GPL. So I just wanted to register that I take objection to the
statement "if you need to access Excel files under UNIX you have to
write your own code to do so". This response has been typed from a
Linux box where all POI code is tested darn near daily. To clarify for
the deceived: there IS an open source API for reading, creating and
writing Excel files that runs on just about ANY platform. It's
implemented using pure Java. Furthermore, there is even an XLS
serializer for Cocoon for
those who prefer to write XML rather than Java. (And it is compatible
with the Gnumeric tag library). - Thanks.
OLE trademark?, posted 3 Mar 2002 at 22:42 UTC by acoliver »
(Journeyer)
BTW, I'm curious. LibOLE2 uses "OLE" in its name. I have a feeling OLE
is a trademark. Does anyone have some legal knowledge on this?
Jody wrote: MS Excel : A library to fully manipulate XL files is
going to require support for storing all of the different objects in a
spreadsheet. Doing that basically implements a spreadsheet at some level.
jackshck wrote:
If you are writing a spreadsheet anyway, then that point isn't valid.
I think there is the potential for a lot of gain. The
reason is this: every person who writes a spreadsheet has to write
the code to access Excel files from scratch. If there were a library that
provided access to Excel files, it would save a lot of time and
effort.
This is really funny. For some reason I have a hard time seeing the
target audience of people writing their own spreadsheet as being that
big. People wanting to hack on a spreadsheet should do so by joining
Gnumeric or StarCalc IMHO.
Hi, We've corresponded in the past and I've emailed you before to see if
you were interested in collaborating.
Hello
I am interested in collaborating. I guess it is a bit silly
to start another project, when yours is rather mature. The only problem
I have is with the APL license. But then I can be convinced to overlook
that :-}.
I am interested in collaborating. I guess it is a bit silly to start
another project, when yours is rather mature. The only problem I have is
with the APL license. But then I can be convinced to overlook that :-}.
Welcome: send mail to poi-dev-subscribe@jakarta.apache.org. If you
would prefer to write a C library as opposed to Java, then we can all still
collaborate on fully documenting these formats (actually we donate all
of our Excel documentation to the OpenOffice.org document that you
mentioned). We've fully documented the OLE 2 Compound Document format
(but corrections/clarifications/etc. are always welcome).
The areas we most need help on are: Word format, Excel formulas, pivot
tables. Dunno if Glen needs help on graphing or not.
As for APL, I'm just a programmer. I get paid to write software (not
POI, but other software). I strongly believe in *free* software but I
really couldn't care less about *which* kind of *free* software. All of
the politics of GNU etc. bore the living crap out of me to be honest. I
want to write code, not argue pedantic issues and philosophical issues
about the *meaning of free* and yada yada.. snore. Those kinds of
discussions usually get boring so quickly for me.
POI is mostly about the intellectual challenge of cracking those suckers
wide open. Secondly, it's a way for me to never have to use Windows
again in a server situation!
As for Jody's confusion about *use cases* for a generic library: I
really couldn't care less about *writing* a spreadsheet from a client
perspective. I barely even know how to use Excel (shocking, huh). But I
often am required to interoperate with such software. Try developing a
reporting system in Java without Excel interoperability coming up in
discussion from the users. For the work I do professionally I plan to
use the POI serializers for Cocoon. I'll create reports and publish in
XML. Through configuration and maybe a stylesheet I'll answer the
*business requirements* and get paid. Others will use the API. Next, I
work on Lucene as well (Java search engine) -- try deploying a search
engine in a Fortune 500 w/o Word and Excel search capability. Then try
and get paid. There are some use cases.
Look forward to seeing you on the poi-dev list and working with you!
OLE 2 CDF docs, posted 5 Mar 2002 at 12:38 UTC by acoliver »
(Journeyer)
http://jakarta.apache.org/poi/poifs/fileformat.html
IANAL, but I would say that the name of OLE (which is an acronym for
Object Linking and Embedding) is a descriptive, generic term, and as
such could not be adequately defended as a trademark. Of course, that
is unlikely to stop Microsoft ("where do you want to go today (tm)?")
from attempting it.
There is certainly a place for code to read/generate xls and other MS
formats outside spreadsheets. The Perl modules and the POI project
seem to handle that quite nicely for their respective development
environments. I take issue with the creation of _another_ such
library. There are few enough of us working on these things that
duplication seems foolish.
Hi Jody, posted 6 Mar 2002 at 13:02 UTC by acoliver »
(Journeyer)
I took no offense to your questions. I was just offended by the idea
that there was NO way to do Excel on UNIX (when the author surely knew
there is). I just thought you were confused about what the use case was
for an Excel-based library outside of a spreadsheet GUI. I agree with you
that a project for documenting the formats might be a bit out of line.
I think it would be nice if the OpenOffice.org and Gnome Office projects
could collaborate on a plugin for these formats. Of course the reasons
that may never happen are more political and legalistic than anything
else, but *shrugs*.
As for POI we already collaborate with OpenOffice.org on their
documentation of the Excel format (Daniel Rentz is a super nice guy) and
furthermore we will certainly collaborate with whomever is willing on
documenting the Word file format. And as you are aware, the serializer
(now part of Cocoon) we've developed reads the Gnumeric tag language
(and soon the generator will output in it). As a result, POI developer
Marc Johnson developed the Gnumeric XML schema and donated it to
Gnumeric. *shrugs* In my view this lazy method of collaboration is the
BEST way to develop opensource^H^H^H^H^H^H^H^H^H^H software.
Anyone who is interested in collaborating on a method of reading/writing
any Office file format in Java is certainly welcome to join us!
(http://jakarta.apache.org/poi)
I do agree that libole2 needs to be rewritten. It hurts my head :-).
Corollary #2 to Rule #3
'Never attribute to malice what can be explained by laziness'
I don't think the main impediment to increased reuse between the
communities is political. As you mention, Daniel Rentz is quite
amicable, and I'd hope the Gnumeric folk have also been friendly. The
trouble is that these really are different systems. The same problem
arises when attempting to select a common XML format. File formats are
at some level always going to be tied to their parent application's
data model. Even when attempting to model similar behaviors,
differences are inevitable. E.g., Gnumeric uses MERGE records from XL;
OpenCalc uses the merged flag in the XF record. Gnumeric's XML uses
shared expression tags; OpenCalc does not. Borders are handled very
differently... the list goes on. Collaboration between projects is
difficult; a master-slave relationship is more tenable.
I'll do what I can to work towards increasing the amount of overlap with
OpenCalc (e.g. using their structured file format) but it is a slow
process.
Clarification, posted 8 Mar 2002 at 00:08 UTC by acoliver »
(Journeyer)
Hi Jody,
I don't think the main impediment to increased reuse between the
communities is political. As you mention, Daniel Rentz is quite amicable,
What I was actually referring to was the GNU politics: GNU's stance
that licenses cannot be mixed freely without the so-called viral clause
coming into play, etc. It's often somewhat of a hindrance to
collaboration. Anyhow, I absolutely do not want to debate that with
anyone, just mentioning that it is often an issue.
As you mention, Daniel Rentz is quite amicable, and I'd hope the Gnumeric
folk have also been friendly.
Oh yes, you all have been just fine, I didn't mean to imply otherwise.
(BTW Marc Johnson would still like
to be listed as having contributed the XML Schema on whatever is the
relevant place).
Gnumeric uses MERGE records from XL; OpenCalc uses the merged flag in the
XF record.
BTW AFAIK using the merge record is correct as of BIFF8. Excel ignores
the XF record's merged setting. I believe recent versions of OpenCalc
fix this.
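For readers unfamiliar with the record being discussed, here is a small Python sketch of how merged ranges could be pulled out of a BIFF8 record stream. It assumes the layout described in the public Excel format documentation as I understand it: each record has a 2-byte identifier and a 2-byte length, and the merge record (identifier 0x00E5, called MERGEDCELLS in that documentation) holds a count followed by first row, last row, first column, last column for each range. The function names and the sample bytes are made up; this is not code from Gnumeric, OpenCalc, or POI.

```python
import struct

MERGEDCELLS = 0x00E5  # assumed BIFF8 record identifier for merged cell ranges


def iter_records(stream: bytes):
    """Yield (record_id, payload) pairs from a run of BIFF records."""
    pos = 0
    while pos + 4 <= len(stream):
        rec_id, length = struct.unpack_from("<HH", stream, pos)
        yield rec_id, stream[pos + 4 : pos + 4 + length]
        pos += 4 + length


def merged_ranges(stream: bytes):
    """Collect (first_row, last_row, first_col, last_col) tuples."""
    ranges = []
    for rec_id, payload in iter_records(stream):
        if rec_id != MERGEDCELLS:
            continue
        (count,) = struct.unpack_from("<H", payload, 0)
        for i in range(count):
            ranges.append(struct.unpack_from("<4H", payload, 2 + 8 * i))
    return ranges


# Synthetic record: one merged range covering rows 0-1, columns 0-2 (A1:C2).
sample = struct.pack("<HHH4H", MERGEDCELLS, 10, 1, 0, 1, 0, 2)
ranges = merged_ranges(sample)
```

A reader that relied only on the XF flag would never see these ranges, which is the interoperability difference mentioned above.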
I'll do what I can to work towards increasing the amount of overlap with
OpenCalc (e.g. using their structured file format) but it is a slow process.
Great, I think this is a better approach (aside from the obvious need to
rewrite libole2 due to its love of hundreds of layers of #define
expansions) for sure than creating another C project. I'd like to see
those things componentized but that's a long view.
As an aside:
The POI Serializer for Cocoon made its way into Cocoon. It reads the
Gnumeric XML format (generated via Gnumeric or a stylesheet) and outputs
it in XLS (purely via Java). While I realize you'd have to rewrite it
in C, you might want to take a look at the SAX-based approach we and
Cocoon have used. In my opinion this event-based *generator* /
*serializer* model would simplify the effort you have to go
through. (I do quietly monitor the Gnumeric list but I've not seen much
discussion about these things).
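To make the event-based idea concrete, here is a minimal Python sketch using the standard xml.sax module. It is not the POI or Cocoon code, and the `<sheet>`/`<cell>` tags are a made-up simplification, not the real Gnumeric schema. The handler receives start, text, and end events as the parser streams through the document and acts on each completed cell without building a tree; a real serializer would emit an output record at that point instead of appending to a list.

```python
import xml.sax


class CellCollector(xml.sax.ContentHandler):
    """Collect (row, col, text) tuples from hypothetical <cell> elements."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None  # (row, col) of the cell being read, if any
        self._text = []

    def startElement(self, name, attrs):
        if name == "cell":
            self._current = (int(attrs["row"]), int(attrs["col"]))
            self._text = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "cell" and self._current is not None:
            row, col = self._current
            self.cells.append((row, col, "".join(self._text)))
            self._current = None


def collect_cells(xml_text: str):
    """Parse the XML text and return the cells seen, in document order."""
    handler = CellCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.cells


SAMPLE = (
    '<sheet><cell row="0" col="0">Name</cell>'
    '<cell row="0" col="1">Total</cell></sheet>'
)
cells = collect_cells(SAMPLE)
```

Because the handler keeps only the cell currently being read, memory use stays flat however large the input is, which is the main appeal of this model for a converter.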
In my opinion, both projects have their advantages and disadvantages.
From a third-party point of view, I use Gnumeric most of the time for
my testing because it is lighter weight and more stable. OpenOffice
saves BIFF8 files that are closer to Excel (as of StarOffice 5.2). The
Gnumeric XML format is superior and more developed than OpenOffice's
(the only issue I've really had is all of the style regions that are
created for blank cells rather than just having a *default* style cover
blank cells or normalizing these somehow).
Anyhow keep up the great work on Gnumeric and I hope to collaborate with
you further in the future.
For those who wonder why you'd want to manipulate Excel data on
UNIX: we have an internal
site where customers upload data that was originally in Excel but is
converted by a VBScript into plain text so that our UNIX box can read
through it and do the appropriate things with it. The data is
performance metric data. So, sure, the customer edits the Excel file,
but we want to take that file and automatically read and possibly
manipulate it and get rid of the VBScript at the same time. We could
probably use some of the existing code that's out there, but I
really haven't seen anything like a simple library API to do this
(though I haven't looked all that hard either :)
I'd prefer a C API over a Java API any day to access and
manipulate Excel files in an automated fashion.
/s.
http://scottg.net
C API, posted 24 Mar 2002 at 16:11 UTC by acoliver »
(Journeyer)
I'd prefer a C API over a Java API any day to access and manipulate
excel files in an automated fashion.
Great, come join us over at the POI project in producing good
documentation for these file formats so that other APIs can grow from
them. Or why not help out in librarizing the Gnumeric or OpenOffice.org
filters?