On Thu, 2009-10-29, Gerhard Fiedler wrote:
> James Kanze wrote:
>>> Re the precision issue: When writing out text, there isn't really a
>>> need to go decimal, too. Hex or octal numbers are also text. Speeds
>>> up the conversion (probably not by much, but still) and provides a
>>> way to write out the exact value that is in memory (and recreate
>>> that exact value -- no matter the involved precisions).
>> But it defeats one of the major reasons for using text: human
>> readability.
> Not that much. For (casual, not precision) reading, a few digits are
> usually enough, and most people who read this type of output (meant to
> be communication between programs) are programmers, hence typically
> reasonably fluent in octal and hex.
I disagree there, in two ways:
- I belong to the school that claims protocols should be human-readable, because, well, it opens them up. They get so much easier to manipulate, and even talk about. Take HTTP as an example, or SMTP.
- I doubt that programmers are that good with hex. Even if I limit myself to unsigned int, I can't tell what 0xbabe is. Probably 40000 or so. Or 30000? Who knows? There is a reason decimal is the default base in pretty much every language I know of ... including assembly languages.
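Both sides of this exchange can be illustrated with a small sketch. It assumes the C99 "%a" hexadecimal floating-point format as the "hex text" being discussed (the thread does not name a specific format), and the helper names to_hex_text, from_hex_text and check_round_trip are hypothetical. It shows that such text reproduces the exact value in memory, and also what 0xbabe actually is in decimal.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Format a double as C99 hexadecimal floating-point text ("%a").
// The text encodes the exact binary value, so no precision is lost.
std::string to_hex_text(double value) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%a", value);
    return buffer;
}

// Parse text produced by to_hex_text back into a double.
// std::strtod accepts hexadecimal floating-point input since C++11.
double from_hex_text(const std::string& text) {
    return std::strtod(text.c_str(), nullptr);
}

// Verify that writing and re-reading reproduces the value bit for bit.
void check_round_trip(double value) {
    assert(from_hex_text(to_hex_text(value)) == value);
}

// The readability side of the argument: 0xbabe is 47806 in decimal.
constexpr unsigned babe = 0xbabe;
static_assert(babe == 47806, "0xbabe is 47806");
```

Printing to_hex_text(0.1) gives a string starting with "0x1." and ending in a binary exponent, which is exact but hardly something most readers can evaluate at a glance.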
...
> Since what we're talking about is only relevant for huge amounts of
> data, doing anything more with that data than just a cursory look at
> some numbers (which IMO is fine in octal or hex) generally needs a
> program anyway.
But for the text version of the data, that "program" is often a Unix pipeline involving tools like grep, sort and uniq, or a Perl one-liner you make up as you go. Or it can be fed directly into gnuplot or Excel. If the data is binary, you probably simply won't bother.
I think we have been misled a bit here, too. I haven't read the whole thread, but it started with something like "dump a huge array of floats to disk, collect it later". If you take the more common case "take this huge complex data structure and dump it to disk in a portable format", you have a completely different situation, where the non-text format isn't that much smaller or faster.
/Jorgen
-- 
  // Jorgen Grahn <grahn@  Oo  o.   .  .
\X/     snipabacken.se>   O  o   .
> On Thu, 2009-10-29, Gerhard Fiedler wrote: > > James Kanze wrote:
> >>> Re the precision issue: When writing out text, there isn't really a > >>> need to go decimal, too. Hex or octal numbers are also text. Speeds > >>> up the conversion (probably not by much, but still) and provides a > >>> way to write out the exact value that is in memory (and recreate > >>> that exact value -- no matter the involved precisions).
> >> But it defeats one of the major reasons for using text: human > >> readability.
> > Not that much. For (casual, not precision) reading, a few digits are > > usually enough, and most people who read this type of output (meant to > > be communication between programs) are programmers, hence typically > > reasonably fluent in octal and hex.
> I disagree there, in two ways:
> - I belong to the school that claims protocols should be human-readable, > because, well, it opens them up. They get so much easier to > manipulate, and even talk about. Take HTTP as an example, or SMTP.
> - I doubt that programmers are that good with hex. Even if I limit > myself to unsigned int, I can't tell what 0xbabe is. Probably 40000 > or so. Or 30000? Who knows? There is a reason decimal is the default > base in pretty much every language I know of ... including assembly > languages.
> ...
> > Since what we're talking about is only relevant for huge amounts of > > data, doing anything more with that data than just a cursory look at > > some numbers (which IMO is fine in octal or hex) generally needs a > > program anyway.
> But for the text version of the data, that "program" is often a Unix > pipeline involving tools like grep, sort and uniq, or a Perl one-liner > you make up as you go. Or it can be fed directly into gnuplot or > Excel. If the data is binary, you probably simply won't bother.
> I think we have been misled a bit here, too. I haven't read the whole > thread, but it started with something like "dump a huge array of > floats to disk, collect it later". If you take the more common case > "take this huge complex data structure and dump it to disk in a > portable format", you have a completely different situation, where the > non-text format isn't that much smaller or faster.
I guess you're saying that the results are closer in some cases because there's a lot of non-numeric data involved in those complex data structures. But aren't you ignoring scientific applications where the majority of the data is numeric?
Much earlier in the thread, Allnor wrote, "Binary files are usually about 20%-70% of the size of the text file, depending on numbers of significant digits and other formatting text glyphs." I don't think anyone has directly disagreed with that statement yet.
On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
[...]
> > I think we have been misled a bit here, too. I haven't read > > the whole thread, but it started with something like "dump a > > huge array of floats to disk, collect it later". If you > > take the more common case "take this huge complex data > > structure and dump it to disk in a portable format", you > > have a completely different situation, where the non-text > > format isn't that much smaller or faster. > I guess you're saying that the results are closer in some > cases because there's a lot of non-numeric data involved in > those complex data structures. But aren't you ignoring > scientific applications where the majority of the data is > numeric?
He spoke of the "more common case". Certainly, most common cases do include a lot of text data. On the other hand, the origin of this thread was dumping doubles: purely numeric data. And while perhaps less common, they do exist, and aren't really rare either. (I've encountered them once or twice in my career, and I'm not a numerics specialist.)
> Much earlier in the thread, Allnor wrote, "Binary files > are usually about 20%-70% of the size of the text file, > depending on numbers of significant digits and other > formatting text glyphs." I don't think anyone has > directly disagreed with that statement yet.
The original requirement, if I remember correctly, included rereading the data with no loss of precision. This means 17 digits precision for an IEEE double, with an added sign, decimal point and four or five characters for the exponent (using scientific notation). Add a separator, and that's 24 or 25 bytes, rather than 8. So the 20% is off; 33% seems to be the lower limit. But in a lot of cases, that's a lot; it's certainly something that has to be considered in some applications.
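The byte count above can be checked with a short sketch. It assumes an IEEE 754 double and scientific notation with 17 significant digits (std::numeric_limits<double>::max_digits10); the function names to_decimal_text and from_decimal_text are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Write a double in scientific notation with max_digits10 significant
// digits (17 for an IEEE 754 double), enough to survive a round trip.
std::string to_decimal_text(double value) {
    std::ostringstream out;
    out << std::scientific
        << std::setprecision(std::numeric_limits<double>::max_digits10 - 1)
        << value;
    return out.str();
}

// Read the text back; std::strtod performs the decimal-to-binary conversion.
double from_decimal_text(const std::string& text) {
    return std::strtod(text.c_str(), nullptr);
}

// Verify that the decimal text reproduces the original value exactly.
void check_decimal_round_trip(double value) {
    assert(from_decimal_text(to_decimal_text(value)) == value);
}
```

With a sign and a three-digit exponent the text is 24 characters long, so 25 bytes with a separator, against sizeof(double), which is 8 on the usual platforms. That is where the roughly one-third ratio comes from.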
> On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> [...]
> > > I think we have been misled a bit here, too. I haven't read > > > the whole thread, but it started with something like "dump a > > > huge array of floats to disk, collect it later". If you > > > take the more common case "take this huge complex data > > > structure and dump it to disk in a portable format", you > > > have a completely different situation, where the non-text > > > format isn't that much smaller or faster. > > I guess you're saying that the results are closer in some > > cases because there's a lot of non-numeric data involved in > > those complex data structures. But aren't you ignoring > > scientific applications where the majority of the data is > > numeric?
> He spoke of the "more common case".
As I recall, I started with a purely technical question about binary typecasts. Others started bringing in text formats. I have only attempted to explain - in vain, it seems - why text-based numerical formats are a no-go in technical applications.
> Certainly, most common > cases do include a lot of text data.
I am not talking about 'common' cases. I am talking about heavy-duty work. Once you are talking about numeric data in the hundreds of MBytes (regardless of the storage format), any amount of accompanying text is irrelevant. One page of plain text takes about 2 kbytes.
There was, in fact, an 'improvement' to the ancient SEG-Y seismic data format, where a lot of the auxiliary (numeric) information was specified to be stored in text format. I first saw the SEG-2 spec about ten years ago, but I have never heard that it has actually been used. The speed losses involved with converting data back and forth between text and binary would fully explain why SEG-2 has not gained widespread acceptance among the heavy-duty users.
> On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote:
> > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> [...]
> > > I think we have been misled a bit here, too. I haven't read > > > the whole thread, but it started with something like "dump a > > > huge array of floats to disk, collect it later". If you > > > take the more common case "take this huge complex data > > > structure and dump it to disk in a portable format", you > > > have a completely different situation, where the non-text > > > format isn't that much smaller or faster. > > I guess you're saying that the results are closer in some > > cases because there's a lot of non-numeric data involved in > > those complex data structures. But aren't you ignoring > > scientific applications where the majority of the data is > > numeric?
> He spoke of the "more common case". Certainly, most common > cases do include a lot of text data. On the other hand, the > origin of this thread was dumping doubles: purely numeric data. > And while perhaps less common, they do exist, and aren't really > rare either. (I've encountered them once or twice in my career, > and I'm not a numerics specialist.)
I've worked on one scientific application for a little over six months. I hope to work with/on more scientific projects in the future.
> > Much earlier in the thread, Allnor wrote, "Binary files > > are usually about 20%-70% of the size of the text file, > > depending on numbers of significant digits and other > > formatting text glyphs." I don't think anyone has > > directly disagreed with that statement yet.
> The original requirement, if I remember correctly, included > rereading the data with no loss of precision. This means 17 > digits precision for an IEEE double, with an added sign, decimal > point and four or five characters for the exponent (using > scientific notation). Add a separator, and that's 24 or 25 > bytes, rather than 8. So the 20% is off; 33% seems to be the > lower limit. But in a lot of cases, that's a lot; it's > certainly something that has to be considered in some > applications.
Yes. I brought it up because I wasn't sure if Grahn was agreeing with something Fiedler said about it being just a few more bytes. Even if it were 70% I wouldn't describe that as a minor difference.
> On 6 Nov, 10:03, James Kanze <james.ka...@gmail.com> wrote: > > On Nov 5, 11:36 pm, Brian <c...@mailvault.com> wrote: > > > On Nov 4, 3:47 pm, Jorgen Grahn <grahn+n...@snipabacken.se> wrote: > > [...] > > > > I think we have been misled a bit here, too. I haven't read > > > > the whole thread, but it started with something like "dump a > > > > huge array of floats to disk, collect it later". If you > > > > take the more common case "take this huge complex data > > > > structure and dump it to disk in a portable format", you > > > > have a completely different situation, where the non-text > > > > format isn't that much smaller or faster. > > > I guess you're saying that the results are closer in some > > > cases because there's a lot of non-numeric data involved in > > > those complex data structures. But aren't you ignoring > > > scientific applications where the majority of the data is > > > numeric? > > He spoke of the "more common case". > As I recall, I started by a purely technical question about > binary typecasts.
Which, of course, raises the question as to why. They're not very useful unless you're doing exceptionally low level work.
> Others started bringing in text formats.
The original comment was just that---a parenthetical comment. Text formats have many advantages, WHEN you can use them. It's also obvious that they have additional overhead---not nearly as much as you claimed in terms of CPU, but they aren't free either, neither in CPU time nor in data size.
> I have only attempted to explain - in vain, it seems - why > text-based numerical formats is a no-go in technical > applications.
And you blew it by giving exaggerated figures:-). Other than that: they're not a no-go in technical applications. They do have too much overhead for some applications (not all), and in such cases, you have to use a binary format. Depending on other requirements (portability, external requirements, etc.), you may need a more or less complicated binary format.
> > Certainly, most common cases do include a lot of text data. > I am not talking about 'common' cases. I am talking about > heavy-duty work. Once you are talking about numeric data in > the hundreds of MBytes (regardless of the storage format), any > amount of accompanying text is irrelevant. One page of plain > text takes about 2 kbytes.
Yes. I understand that.
In fact, now that you've mentioned seismic data, I agree that a text format is probably not going to cut it. I've actually worked on one project in the field, and I know just how much floating point data they can generate.
On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
I'm getting tired of reiterating this for people who are not interested in actually evaluating the numbers.
Look for an upcoming post on comp.lang.c++.moderated, where I distill the problem statement a bit, as well as present a C++ test to see what kind of timing ratios I am talking about.
On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
> I'm getting tired with re-iterating this for people who > are not interested in actually evaluating the numbers.
> Look for an upcoming post on comp.lang.c++.moderated, where > I distill the problem statement a bit, as well as present > a C++ test to see what kind of timing ratios I am talking about.
> Rune
I took the liberty of copying your post from clc++m to here as this newsgroup is faster as far as getting the posts out there.
Hi all.
A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the file size of the text files, depending on the number of significant digits stored in the text files and other formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and write than binary formats.
Timings are difficult to compare, since the exact numbers depend on buffering strategies, buffer sizes, disk speeds, network bandwidths and so on.
I have therefore sketched a 'distilled' test (code below) to test what overheads are involved with formatting numerical data back and forth between text and binary formats. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data. The binary buffers are represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)
The output on my computer is (do note the _different_ numbers of IO cycles in the two cases!):
Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed
A little bit of math produces *average*, *crude* numbers for IO cycles:
Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle
which in turn means there is an overhead on the order of 160e-9/6e-9 = 27x associated with the text formats.
Add a little bit of other overheads, e.g. caused by the significantly larger text file sizes in combination with suboptimal buffering strategies, and the relative numbers easily hit the triple digits. Not at all insignificant when one works with large amounts of data under tight deadlines.
So please: Shoot this demo down! Give it your best, and prove me and my numbers wrong.
And to the textbook authors who might be lurking: Please include a chapter on relative binary and text-based IO speeds in your upcoming editions. Binary file formats might not fit into your overall philosophies about human readability and universal portability of C++ code, but some of your readers might appreciate being made aware of such practical details.
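The test program referred to above is not reproduced here. As a rough sketch of that kind of test, and explicitly not the original code, the following assumes a std::stringstream for the text case and a plain memcpy through a byte buffer for the binary case, as the description says. All names (make_test_data, binary_cycle, text_cycle, seconds_per_element) and the 17-digit scientific format are assumptions made for the illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iomanip>
#include <limits>
#include <sstream>
#include <vector>

// Hypothetical test data: `count` ordinary, finite doubles.
std::vector<double> make_test_data(std::size_t count) {
    std::vector<double> data(count);
    for (std::size_t i = 0; i < count; ++i)
        data[i] = 0.5 + 0.001 * static_cast<double>(i);
    return data;
}

// One binary "IO cycle": memcpy into a byte buffer and back out again.
std::vector<double> binary_cycle(const std::vector<double>& data) {
    std::vector<char> buffer(data.size() * sizeof(double));
    std::vector<double> result(data.size());
    if (!data.empty()) {
        std::memcpy(buffer.data(), data.data(), buffer.size());
        std::memcpy(result.data(), buffer.data(), buffer.size());
    }
    return result;
}

// One text "IO cycle": format into a std::stringstream, then parse it back.
std::vector<double> text_cycle(const std::vector<double>& data) {
    std::stringstream stream;
    stream << std::scientific
           << std::setprecision(std::numeric_limits<double>::max_digits10 - 1);
    for (double value : data) stream << value << '\n';

    std::vector<double> result;
    result.reserve(data.size());
    double value = 0.0;
    while (stream >> value) result.push_back(value);
    return result;
}

// Average wall-clock seconds per element over `cycles` repetitions.
template <typename Cycle>
double seconds_per_element(Cycle cycle, const std::vector<double>& data, int cycles) {
    assert(!data.empty() && cycles > 0);
    const auto start = std::chrono::steady_clock::now();
    std::size_t produced = 0;
    for (int i = 0; i < cycles; ++i) produced += cycle(data).size();
    const auto stop = std::chrono::steady_clock::now();
    assert(produced == data.size() * static_cast<std::size_t>(cycles));
    const double elapsed = std::chrono::duration<double>(stop - start).count();
    return elapsed / (static_cast<double>(cycles) * static_cast<double>(data.size()));
}
```

Calling seconds_per_element(text_cycle, make_test_data(1000000), 10) and the same with binary_cycle gives two per-element times whose ratio can be compared with the figures above. The ratio depends heavily on compiler, standard library and optimization level, and an optimizing compiler may remove part of the memcpy work, so any single measurement should be read as indicative only.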
> On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> > On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
> > I'm getting tired with re-iterating this for people who > > are not interested in actually evaluating the numbers.
> > Look for an upcoming post on comp.lang.c++.moderated, where > > I distill the problem statement a bit, as well as present > > a C++ test to see what kind of timing ratios I am talking about.
> > Rune
> I took the liberty of copying your post from clc++m to here > as this newsgroup is faster as far as getting the posts out > there.
> Hi all.
> A couple of weeks ago I posted a question on comp.lang.c++ about some > technicality > about binary file IO. Over the course of the discussion, I discovered > to my > amazement - and, quite frankly, horror - that there seems to be a > school of > thought that text-based storage formats are universally preferable to > binary text > formats for reasons of portability and human readability.
That seems to me an inaccurate description of this thread. Kanze has pointed out the strengths of text formats, but has also noted that there are times when binary formats are needed. Who has been saying that text formats are "universally preferable" to binary formats?
On Nov 8, 6:11 pm, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote: > I'm getting tired with re-iterating this for people who > are not interested in actually evaluating the numbers.
I actually did some measurements, to check the numbers. Your numbers were wrong. More to the point, actual numbers will vary enormously from one implementation to the next.
> Look for an upcoming post on comp.lang.c++.moderated,
Not every one reads that group. Not everyone agrees with its moderation policy (as currently practiced).
On Nov 8, 11:44 pm, Brian Wood <woodbria...@gmail.com> wrote:
> On Nov 8, 4:15 pm, Brian Wood <woodbria...@gmail.com> wrote: > > On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
[...]
> > A couple of weeks ago I posted a question on comp.lang.c++ > > about some technicality about binary file IO. Over the > > course of the discussion, I discovered to my amazement - > > and, quite frankly, horror - that there seems to be a school > > of thought that text-based storage formats are universally > > preferable to binary text formats for reasons of portability > > and human readability. > That seems to me an inaccurate description of this thread. > Kanze has pointed out the strengths of text formats, but > has also noted that there are times when binary formats > are needed. Who has been saying that text formats are > "universally preferable" to binary formats?
I think he missed a "when possible", or something similar. Binary formats are an optimization: you sometimes need this optimization (and you certainly should be aware of the possibility of using it), but you don't use them unless timing or data size constraints make it necessary.
> On Nov 8, 6:11 pm, Rune Allnor <all...@tele.ntnu.no> wrote: >> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote:
>> I'm getting tired with re-iterating this for people who >> are not interested in actually evaluating the numbers.
> I actually did some measures, to check the numbers. Your > numbers were wrong. More to the point, actual numbers will vary > enormously from one implementation to the next.
>> Look for an upcoming post on comp.lang.c++.moderated,
> Not every one reads that group. Not everyone agrees with its > moderation policy (as currently practiced).
Would you care to elaborate on that hint, please?
> On Nov 8, 11:44 pm, Brian Wood <woodbria...@gmail.com> wrote:
> > On Nov 8, 4:15 pm, Brian Wood <woodbria...@gmail.com> wrote: > > > On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> [...]
> > > A couple of weeks ago I posted a question on comp.lang.c++ > > > about some technicality about binary file IO. Over the > > > course of the discussion, I discovered to my amazement - > > > and, quite frankly, horror - that there seems to be a school > > > of thought that text-based storage formats are universally > > > preferable to binary text formats for reasons of portability > > > and human readability. > > That seems to me an inaccurate description of this thread. > > Kanze has pointed out the strengths of text formats, but > > has also noted that there are times when binary formats > > are needed. Who has been saying that text formats are > > "universally preferable" to binary formats?
> I think he missed a "when possible", or something similar.
*You* are accusing *me* of missing the fine print??!!
[RA] > > File I/O operations with text-formatted floating-point data > > take time. A *lot* of time.
[JK] > A lot of time compared to what?
[RA] Wall clock time. Relative time, compared to dumping binary data to disk. Any way you want.
... [RA] > > The rule-of-thumb is 30-60 seconds per 100 MBytes of > > text-formatted FP numeric data, compared to fractions of a > > second for the same data (natively) binary encoded (just try > > it).
[JK] > Try it on what machine:-).
[RA] Any machine. The problem is to decode text-formatted numbers to binary.
... Here is a test I wrote in matlab a few years ago, to demonstrate the problem (WinXP, 2.4GHz, no idea about disk):
[matlab code snipped]
Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.1 = 240x faster than text write. Binary reads are 42.2/0.32 = 130x faster than text read.
... The timing numbers (both absolute and relative) would be of similar orders of magnitude if you repeated the test with C++. ... The application I'm working with would need to crunch through some 10 GBytes of numerical data per hour.
I think these excerpts should be sufficient to sketch what kind of world I am living and working in.
Do note that I never - unlike some other participants in this thread - claimed my numbers to be exact. I am fairly certain my English is good enough that the above would reasonably be expected to be interpreted by a reader as *representative* numbers. If you look closely, I also commented that coding up a program in C++ instead of matlab as I had done, would result in *different* numbers, but not solve the fundamental problem.
So I can't see any reason why you attack me for my numbers being "wrong"; I never stated they were exact.
[RA] So what does text-based formats actually buy you? [JK] Shorter development times, less expensive development, greater reliability...
In sum, lower cost.
[RA] As long as you keep two factors in mind:
1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network bandwidth etc) are not yours (the programmer) to waste.
[JK] The user pays for your time. Spending it to do something which results in a less reliable program, and that he doesn't need, is irresponsible, and borders on fraud.
This one really pissed me off. Here I had explained to you what application I am working with, made you aware of the users' requirements in the operational situation, and you explicitly state that paying attention to such concerns is 'borderline fraud'!
So I can not interpret this in any other way than that you will use text-based formats, come hell or high water. Which essentially invalidates any otherwise relevant arguments you might have presented throughout the thread.
> Binary formats are an optimization:
No, it's not. The selection of file formats is a strategic design decision on a par with using binary O(lgN) or linear O(N) search engines; like choosing between an O(NlgN) quick sort or an O(N^2) bubble sort algorithm.
Such factors govern what problems can be handled by the software with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since text-based numerical IO is orders of magnitude slower, the strategic impact on design decisions is the same.
> you sometimes need this > optimization (and you certainly should be aware of the > possibility of using it), but you don't use them unless timing > or data size constraints make it necessary.
Hypocrite!
This is exactly what I have been arguing for days and weeks already. What changed?
> On 9 Nov, 02:14, James Kanze <james.ka...@gmail.com> wrote: > > On Nov 8, 11:44 pm, Brian Wood <woodbria...@gmail.com> wrote: > > > On Nov 8, 4:15 pm, Brian Wood <woodbria...@gmail.com> wrote: > > > > On Nov 8, 11:11 am, Rune Allnor <all...@tele.ntnu.no> wrote: > > [...] > > > > A couple of weeks ago I posted a question on comp.lang.c++ > > > > about some technicality about binary file IO. Over the > > > > course of the discussion, I discovered to my amazement - > > > > and, quite frankly, horror - that there seems to be a school > > > > of thought that text-based storage formats are universally > > > > preferable to binary text formats for reasons of portability > > > > and human readability. > > > That seems to me an inaccurate description of this thread. > > > Kanze has pointed out the strengths of text formats, but > > > has also noted that there are times when binary formats > > > are needed. Who has been saying that text formats are > > > "universally preferable" to binary formats? > > I think he missed a "when possible", or something similar. > *You* are accusing *me* of missing the fine print??!!
[...]
> I think these excerpts should be sufficient to sketch what > kind of world I am living and working in.
I fully understand what kind of world you're working in. As a consultant, I've worked on seismic applications too, albeit not recently.
> Do note that I never - unlike some other participants in this > thread - claimed my numbers to be exact.
Off by more than an order of magnitude is not just a question of "exact".
> I am fairly certain my English is good enough that the above > would reasonably be expected to be interpreted by a reader as > *representative* numbers. If you look closely, I also > commented that coding up a program in C++ instead of matlab as > I had done, would result in *different* numbers, but not solve > the fundamental problem. > So I can't see any reason why you attack me for my numbers > being "wrong"; I never stated they were exact.
First, I didn't "attack" you. On the whole, I understand your problem. Stating that the difference is some 100 times is misleading, however.
> A few posts further out: > http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6 > [RA] So what does text-based formats actually buy you? > [JK] Shorter development times, less expensive development, greater > reliability... > In sum, lower cost. > [RA] As long as you keep two factors in mind: > 1) The user's time is not yours (the programmer) to waste. > 2) The users's storage facilities (disk space, network > bandwidth etc) are not yours (the programmer) to waste. > [JK] The user pays for your time. Spending it to do something > which > results in a less reliable program, and that he doesn't need, > is > irresponsible, and borders on fraud. > This one really pissed me off. Here I had explained to you > what application I am working with, made you aware of the > users requirements in the operational situation, and you > explicitly state that paying attention to such concerns is > 'borderline fraud'!
I didn't say that. I said that ignoring issues of development time and reliability is fraud. You have to make a trade-off; if text based IO isn't sufficiently fast for the user's needs, or requires too much additional space, then you use binary. But you consider the cost of doing so, and weigh it against the other costs.
> So I can not interpret this in any other way than that you > will use text-based formats, come hell or high water.
How do you read that into anything I've said? I've simply pointed out that using text does buy you something, or in other words, using binary has a cost. There's no doubt that using text has other costs. Engineering is about weighing the different costs; if you don't know what text based formats buy you, then you can't weigh the costs accurately.
> Which essentially invalidate any otherwise relevant arguments > you might have presented throughout thread. > > Binary formats are an optimization: > No, it's not. The selection of file formats is a strategic design > decision on a par with using binary O(lgN) or linear O(N) search > engines; like choosing between a O(NlgN) quick sort or a O(N^2) > bubble sort algorithm.
Which are also optimizations:-).
There are optimizations and optimizations. Sometimes you do know up front that you'll need the optimization; if you know that you'll have to deal with millions of elements, you know up front that a quadratic algorithm won't do the trick.
In the case of choosing binary, the motivation for doing so up front is a bit different---after all, the difference will never be other than linear. Partially, the motivation can be calculated: if you know the number of elements, you can calculate the disk space needed up front. In many cases, however, you know that you'll be locked into the format you choose, so you have to consider performance issues earlier. Once you start considering performance issues, however, you're talking about optimization.
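The up-front calculation mentioned here is simple arithmetic. A minimal sketch, assuming 8 bytes per value for a raw IEEE double and about 25 bytes per value for a 17-digit text form with separator (the figures quoted earlier in the thread); the element count of one billion is purely hypothetical.

```cpp
#include <cstdint>

// Assumed bytes per stored value: raw IEEE double vs. 17-digit text
// with sign, exponent and one separator character.
constexpr std::uint64_t binary_bytes_per_value = 8;
constexpr std::uint64_t text_bytes_per_value = 25;

constexpr std::uint64_t binary_size(std::uint64_t count) {
    return count * binary_bytes_per_value;
}

constexpr std::uint64_t text_size(std::uint64_t count) {
    return count * text_bytes_per_value;
}

// Hypothetical data set of one billion values.
static_assert(binary_size(1000000000ULL) == 8000000000ULL, "8e9 bytes in binary");
static_assert(text_size(1000000000ULL) == 25000000000ULL, "25e9 bytes as text");
```

The difference stays linear in the number of elements, which is the point being made: it is a constant factor of roughly three in size, to be weighed against the other costs.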
> Such factors govern what problems can be handled by the software > with reasonable effort and within reasonable time. > True, both binary and text-based numerical IO are O(N), but since > text-based numerical IO is orders of magnitude slower, the strategic > impact on design decisions is the same.
There you go exaggerating again. It's not orders of magnitude slower. At the most, it's around 10 times slower, and often the difference is even less. That doesn't mean that it's irrelevant, and sometimes you will have to use a binary format (and sometimes, you'll have to adapt the binary format, to make it quicker).
On Nov 9, 5:06 am, "Alf P. Steinbach" <al...@start.no> wrote:
> * James Kanze: > > On Nov 8, 6:11 pm, Rune Allnor <all...@tele.ntnu.no> wrote: > >> On 8 Nov, 15:27, James Kanze <james.ka...@gmail.com> wrote: > >> I'm getting tired with re-iterating this for people who > >> are not interested in actually evaluating the numbers. > > I actually did some measures, to check the numbers. Your > > numbers were wrong. More to the point, actual numbers will > > vary enormously from one implementation to the next. > >> Look for an upcoming post on comp.lang.c++.moderated, > > Not every one reads that group. Not everyone agrees with > > its moderation policy (as currently practiced). > Would you care to elaborate on that hinting, please.
"Not everyone" means "at least me". I stopped participating in the group because I found the moderation was becoming too heavy in some cases. Others, I know, aren't bothered with it. To each his own.
> On Nov 9, 10:57 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> > Such factors govern what problems can be handled by the software
> > with reasonable effort and within reasonable time.
> > True, both binary and text-based numerical IO are O(N), but since
> > text-based numerical IO is orders of magnitude slower, the strategic
> > impact on design decisions is the same.
> There you go exaggerating again. It's not orders of magnitude
> slower. At the most, it's around 10 times slower, and often the
> difference is even less. That doesn't mean that it's irrelevant,
> and sometimes you will have to use a binary format (and
> sometimes, you'll have to adapt the binary format, to make it
> quicker).
This Gianni Mariani quote indicates he saw some differences of more than 10x.
"However, reading and writing binary files can have HUGE performance gains. I once came across some numerical code where it would read and write large datasets. These datasets were 40-100MB. The performance was horrendous. Using mapped files and binary data made the reading and writing virtually zero cost and it improved the performance of the product by nearly 10x and in some tests over 1000x. Be careful - this is one application and the bottleneck was clearly identified. This may not be where your application spends its time."
I hope to beef up the C++ Middleware Writer's support for writing and reading data more generally. To begin with I'm going to focus on integral types and assume 8-bit bytes. Currently we don't have support for uint8_t, uint16_t, etc. I guess those are the types I'll start with. I'm going through the newsgroup archives to find snippets that are helpful in this area. If anyone has a link wrt this, I'm interested.
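For the fixed-width unsigned types, one common approach is to write the bytes out one at a time with shifts, so that the file format does not depend on the byte order of the host. The following is only a sketch of that idea under the same 8-bit byte assumption. It is not the C++ Middleware Writer's API: `put_u16`, `put_u32`, `get_u16` and `get_u32` are hypothetical names, and big-endian order is an arbitrary choice made here for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: append a 16-bit value, most significant byte first.
inline void put_u16(std::vector<unsigned char>& buf, std::uint16_t v)
{
    buf.push_back(static_cast<unsigned char>((v >> 8) & 0xFF));
    buf.push_back(static_cast<unsigned char>(v & 0xFF));
}

// Hypothetical helper: append a 32-bit value, most significant byte first.
inline void put_u32(std::vector<unsigned char>& buf, std::uint32_t v)
{
    for (int shift = 24; shift >= 0; shift -= 8)
        buf.push_back(static_cast<unsigned char>((v >> shift) & 0xFF));
}

// Hypothetical helper: read back a 16-bit value from two bytes at p.
inline std::uint16_t get_u16(const unsigned char* p)
{
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Hypothetical helper: read back a 32-bit value from four bytes at p.
// Each byte is widened before shifting so the shift never overflows int.
inline std::uint32_t get_u32(const unsigned char* p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24)
         | (static_cast<std::uint32_t>(p[1]) << 16)
         | (static_cast<std::uint32_t>(p[2]) << 8)
         |  static_cast<std::uint32_t>(p[3]);
}
```

Because only shifts and masks on values are used, the same code produces the same bytes on little-endian and big-endian machines. Signed types and floating point need more care and are not covered by this sketch.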
> On Nov 9, 7:37 am, James Kanze <james.ka...@gmail.com> wrote:
> > On Nov 9, 10:57 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> > > Such factors govern what problems can be handled by the software
> > > with reasonable effort and within reasonable time.
> > > True, both binary and text-based numerical IO are O(N), but since
> > > text-based numerical IO is orders of magnitude slower, the strategic
> > > impact on design decisions is the same.
> > There you go exaggerating again. It's not orders of magnitude
> > slower. At the most, it's around 10 times slower, and often the
> > difference is even less. That doesn't mean that it's irrelevant,
> > and sometimes you will have to use a binary format (and
> > sometimes, you'll have to adapt the binary format, to make it
> > quicker).
> This Gianni Mariani quote indicates he saw some
> differences of more than 10x.
> "However, reading and writing binary files can have HUGE
> performance gains. I once came across some numerical code
> where it would read and write large datasets. These datasets
> were 40-100MB. The performance was horrendous. Using mapped
> files and binary data made the reading and writing virtually
> zero cost and it improved the performance of the product by
> nearly 10x and in some tests over 1000x. Be careful -
> this is one application and the bottleneck was clearly
> identified. This may not be where your application spends
> its time."
> I hope to beef up the C++ Middleware Writer's support
> for writing and reading data more generally. To begin
> with I'm going to focus on integral types and assume
> 8-bit bytes. Currently we don't have support for uint8_t,
> uint16_t, etc. I guess those are the types I'll start with.
> I'm going through the newsgroup archives to find snippets
> that are helpful in this area. If anyone has a link wrt
> this, I'm interested.