Fast plotting

Post by **nikster** » Mon May 03, 2004 3:14 pm

i was looking at FastScatterPlot and it sure is fast, but it also removes a lot of the functionality of other plots.

i think we could unify "fast" plotting with nice plotting by just adding getX() and getY() methods to the Dataset interface classes.

the main problem with fast plotting is that these interfaces require the data be returned as Number objects and therefore force object creation. not good.

i think that all of the functionality gained from access through Number could be done with
double getX()
and
double getY()

as well (this is for XYDataset, but it's the same thing for all other kinds and varieties).

namely:
- use as integer / long: (int) getX(); / (long) getX()
- check for x == null: Double.isNaN(x)

as far as i can see, those are the only things the Number object (as opposed to a double value) is used for.
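a minimal sketch of the two uses listed above, assuming the value arrives as a plain double. the class and method names (PrimitiveAccessDemo, asInt, asLong, isAbsent) are made up for illustration and are not part of JFreeChart:

```java
// Hypothetical helper, not part of JFreeChart: shows that the uses of
// Number listed above have primitive equivalents.
final class PrimitiveAccessDemo {

    /** Integer view of a value; same result as Double.intValue(). */
    static int asInt(double x) {
        return (int) x; // narrowing cast truncates toward zero
    }

    /** Long view of a value; same result as Double.longValue(). */
    static long asLong(double x) {
        return (long) x;
    }

    /** Stand-in for the "x == null" check when NaN marks an absent value. */
    static boolean isAbsent(double x) {
        return Double.isNaN(x); // x == Double.NaN would always be false
    }
}
```

note that the NaN check has to go through Double.isNaN(), since comparing with == never matches NaN.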

i also had some ideas on how to make a faster number object:

1) make a mutable subclass of Number and return that from your data. if you use final modifiers for doubleValue, this would be almost as fast as direct array access. since Numbers are supposed to be immutable in Java it would be a dangerous hack - but it might work in the context of JFreeChart drawing...

2) make a Number subclass with final methods for doubleValue etc. that would make it faster (by about a factor of 2), but it would still not do anything for memory performance.
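roughly what idea 1) could look like, as a sketch only. MutableDouble is a hypothetical name, nothing like it exists in JFreeChart or the JDK, and the class is declared final so that its accessor methods are effectively final too:

```java
// Hypothetical class for illustration; not part of JFreeChart or the JDK.
final class MutableDouble extends Number {
    private static final long serialVersionUID = 1L;

    private double value;

    MutableDouble(double value) {
        this.value = value;
    }

    /** Overwrites the held value so one instance can be reused for every item. */
    void set(double value) {
        this.value = value;
    }

    @Override public double doubleValue() { return value; }
    @Override public float floatValue() { return (float) value; }
    @Override public int intValue() { return (int) value; }
    @Override public long longValue() { return (long) value; }
}
```

a dataset could hand out the same instance for every item and call set() before each return, which avoids the allocation but is only safe while nobody keeps the reference.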

what do ppl think of this and are there other ideas? i ran some tests and found that direct array access is largely useless. it's not any faster than a final method call (the JVM does indeed optimize the method call away), but it's a lot less flexible. IMHO there is never a need to directly access arrays. it's always better to wrap it in an interface.

Post by **david.gilbert** » Tue May 04, 2004 9:33 am

A lot of the (relative) speed in the FastScatterPlot class comes from the fact that the data is rendered as a simple dot using:

g2.fillRect(transX, transY, 1, 1);

There is no shape lookup, translation, or outlining to worry about, so it is faster than the more general renderer mechanism. Some gain (of course) comes from the use of double primitives and the direct array access, but I'm fairly certain the data access time is usually small relative to the data drawing time (although I have no reproducible benchmarking code to back this up, yet).
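To illustrate the point about dot rendering, here is a minimal sketch that draws each item as a 1x1 rectangle into an off-screen image. It is not the FastScatterPlot source; the class name DotRenderSketch, the colours, and the assumption that coordinates are already in pixel space are all mine:

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration; not JFreeChart's FastScatterPlot code.
final class DotRenderSketch {

    /**
     * Draws each (x, y) pair as a 1x1 filled rectangle on a white background.
     * The coordinates are assumed to be already translated to pixel space.
     */
    static BufferedImage render(int[] px, int[] py, int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g2 = image.createGraphics();
        try {
            g2.setColor(Color.WHITE);
            g2.fillRect(0, 0, width, height);
            g2.setColor(Color.RED);
            for (int i = 0; i < px.length; i++) {
                g2.fillRect(px[i], py[i], 1, 1); // one dot per data item
            }
        } finally {
            g2.dispose();
        }
        return image;
    }
}
```

The loop body is a single fillRect call per item, with no shape lookup or outline pass.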

My preference for using Number objects in the dataset interfaces is mainly for the following reasons:

(1) It is more convenient for displaying datasets using Swing's JTable / TableModel (important to me);

(2) It allows the use of 'null' to represent missing values.

(3) It works fast enough for a large fraction of common requirements.

I'm turning my focus now towards getting a stable version 1.0 completed, and therefore I think it is unlikely that the dataset interfaces will be changed. I'd change my mind if (a) there was some solid benchmark code that shows a dramatic improvement in performance (reproducible across platforms) resulting from dataset changes, and (b) support for the change among a large group of JFreeChart users.

Post by **nikster** » Tue May 04, 2004 10:42 am

david,

i think you are spot on with the notion that fast drawing means drawing as little as possible. that most likely has the largest effect on drawing speed. i agree that _accessing_ the Number object is fast enough...

david.gilbert wrote: I'm turning my focus now towards getting a stable version 1.0 completed, and therefore I think it is unlikely that the dataset interfaces will be changed. I'd change my mind if (a) there was some solid benchmark code that shows a dramatic improvement in performance (reproducible across platforms) resulting from dataset changes, and (b) support for the change among a large group of JFreeChart users.

i love JFreeChart, so i want to make sure it will work well for large datasets. basically all scientific applications deal with large data sets at one point or another.
i was not concerned with drawing speed as much as with creating the Number objects.

here is my use case, a typical scientific application: i have 30,000 data points, and display them on screen. then, i apply some sort of calculation to the data points. imagine i do this interactively by dragging around a slider.

if the API enforces the use of a Number object, this will be too slow, solely because i need to create 30,000 objects each time the data changes.

what i am afraid of is that enforcement of Number makes JFreeChart unusable for apps with a lot of data. the solution would be to offer Number as part of the interface for things that need it (like JTable) but to make it optional.

test results show that the overhead for object creation is significant. in the worst case (direct set of a value vs. creating a Number object containing that value) the difference is 50:1.
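the kind of comparison i mean can be sketched like this. it is not the test i ran and not a rigorous benchmark (no warm-up control, no harness), the class name BoxingCostSketch is made up, and it uses Double.valueOf() where older code would write new Double():

```java
// Rough timing sketch, not a rigorous benchmark: compares writing values
// into a primitive array with creating one Number object per value.
final class BoxingCostSketch {

    /** Writes values into a primitive array; no objects are created per item. */
    static double fillPrimitive(double[] target) {
        double sum = 0.0;
        for (int i = 0; i < target.length; i++) {
            target[i] = i * 0.5;
            sum += target[i];
        }
        return sum;
    }

    /** Writes the same values as Number objects; one allocation per item. */
    static double fillBoxed(Number[] target) {
        double sum = 0.0;
        for (int i = 0; i < target.length; i++) {
            target[i] = Double.valueOf(i * 0.5);
            sum += target[i].doubleValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 30_000; // dataset size from the use case above
        double[] primitive = new double[n];
        Number[] boxed = new Number[n];
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            fillPrimitive(primitive);
            long t1 = System.nanoTime();
            fillBoxed(boxed);
            long t2 = System.nanoTime();
            System.out.printf("round %d: primitive %d ns, boxed %d ns%n",
                    round, t1 - t0, t2 - t1);
        }
    }
}
```

the numbers it prints depend heavily on the JVM and machine, so treat them as an indication only.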

i also assume there will be a pretty big difference in memory use, both in static allocation (Number using a lot more memory than double) and in dynamic performance (creating lots of Objects causes lots of garbage collects/generally bad memory performance).
some of these issues _could_ be addressed by caching mutable Number objects (and returning them to the cache when not used anymore). but not all.

i agree that Number is generally better than primitive types for all the reasons you mentioned. but it will also make writing scientific apps in JFreeChart a lot harder.

of course, we could just duplicate all the code and make fast versions of it, but in the interest of the architecture it would really be a lot better to find a common solution.

i think one approach would be to have both methods, e.g.
public Number getXValue(int series, int item);
and
public double getX(int series, int item);
in the interface.

if we then assume that Double.NaN is a valid replacement for a null value, we could gradually rewrite the code so that it uses getX() whenever possible, and getXValue() only when necessary.

this would cover JTable (where you have to convert your data to objects anyway even in scientific apps)... i think we can have the cake, and eat it, too.

my suggestion is therefore to add getX() and getY() - the primitive versions - to the XYDataset interface before 1.0 ships. then, we can still gradually make everything "fast" later... in effect that means gradually moving away from using Number objects everywhere where they are not needed.

Post by **david.gilbert** » Tue May 04, 2004 11:18 am

OK, you are beginning to convince me. Regarding the Double.NaN representing missing values - is this a good assumption to make? It doesn't *feel* right to me, but I can't think of a valid reason why...are there any downsides to it?

Post by **nikster** » Tue May 04, 2004 12:08 pm

david.gilbert wrote:OK, you are beginning to convince me. Regarding the Double.NaN representing missing values - is this a good assumption to make? It doesn't *feel* right to me, but I can't think of a valid reason why...are there any downsides to it?

Good point. I had the same feeling, but, for lack of other options, suggested it anyway. Now i did some research...
to quote the VM spec at http://java.sun.com/docs/books/vmspec/2 ... s.doc.html

The NaN value is used to represent the result of certain invalid operations such as dividing zero by zero

for the purpose of plotting, the difference is not relevant. it just marks a point that can't be plotted.

i think that concludes the argument: missing and NaN are two different things. NaN has very specific semantics (such as that arithmetic operations involving NaN return NaN as well, and all comparisons involving NaN return false, except != which returns true).

the question remains on whether or not we can live with that in JFreeChart.
actually... i can think of a problem: an algorithm could ignore missing data points but not NaN data points. i.e. maybe you would want to do some extrapolation if the data points are missing, but, at the same time, pass on invalid results (as they should)... great, i just shot down my own suggestion.
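to make the difference concrete, here is a small sketch of the NaN behaviour and of how a Number-based dataset can tell the two cases apart. the class name and the labels "missing" / "invalid" / "value" are just for illustration:

```java
// Illustration only: NaN semantics versus a null (missing) value.
final class NaNSemanticsDemo {

    /** True when <, >, <=, >= and == between NaN and v all evaluate to false. */
    static boolean comparisonsWithNaNAllFalse(double v) {
        double nan = Double.NaN;
        return !(nan < v) && !(nan > v) && !(nan <= v) && !(nan >= v) && !(nan == v);
    }

    /** With Number, null means "missing" and a NaN value means "invalid result". */
    static String classify(Number y) {
        if (y == null) {
            return "missing";
        }
        return Double.isNaN(y.doubleValue()) ? "invalid" : "value";
    }
}
```

with a bare double there is only one marker left (NaN), so an algorithm that wants to treat the two cases differently cannot.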

too bad... if we can't use NaN, we would have to use an explicit
isMissing()
method in Dataset... maybe that is actually the cleanest option.

Post by **david.gilbert** » Tue May 04, 2004 9:59 pm

nikster wrote:too bad... if we can't use NaN, we would have to use an explicit isMissing() method in Dataset... maybe that is actually the cleanest option.

Yeah, I think that is the best approach. If I can't think of any other obstacles, I'll try to get these changes made for the 0.9.19 release.

Post by **nikster** » Wed May 05, 2004 8:59 am

great!

we will have something like (names are just examples...)
getX()
getY()
isMissing(series, item)

& eventually use them everywhere we use getXValue() now, except places where we must have a Number object.

=> voila, scientific apps (and other apps with lots of data) just work in JFreeChart.

e.g. i can implement a super-fast XYDataset type with direct array access encapsulated in final versions of the "fast" methods above and it will work with all the rest of JFreeChart. which doesn't prevent us from doing things like FastScatterPlot for things that need even more speed. it's up to me how i implement isMissing() - from always returning false to checking for NaN to keeping a separate list.

impact on existing Number-based applications is minimal since the three methods are trivial to implement if you have the Number object.

very cool

Post by **david.gilbert** » Wed May 05, 2004 9:14 am

Yes, thanks for suggesting the approach. Most others---unless I misunderstood them---have advocated *replacing* the Number objects with double primitives throughout the dataset interfaces. But your approach allows both to co-exist, and I think it is an excellent compromise (it allows a sort of "lazy" Number creation for those that want/need it). Of course, if someone else can see a big gotcha that we're missing, please speak up!

On the method names, do you prefer isMissing() or isUnknown()? I'm leaning towards the latter because a missing value is unknown, but an unknown value isn't necessarily missing. It is a small point though, I'm not that bothered by it...just thought I'd gather some opinions.

Post by **nikster** » Wed May 05, 2004 9:32 am

i am pretty happy with our co-developed solution too

it's going to work.

david.gilbert wrote: On the method names, do you prefer isMissing() or isUnknown()? I'm leaning towards the latter because a missing value is unknown, but an unknown value isn't necessarily missing. It is a small point though, I'm not that bothered by it...just thought I'd gather some opinions.

no particular opinion.... i am just as happy to take isUnknown()

in the apps i was writing so far, we always had lots of "missing" data (because the data comes from sensors and the sensors sometimes don't work or measure something out of bounds etc). but you might as well call that unknown data. in fact, to say there is some data, but we don't know it, is probably more accurate even in my case.

i don't know when or how the missing data condition occurs in other applications...

Post by **david.gilbert** » Wed May 05, 2004 8:35 pm

I just had a thought - the isMissing() method is redundant. Its purpose is only to determine the meaning of getY() when it returns Double.NaN (which is a double primitive)...is it really 'not a number' or is it equivalent to 'null' (a missing or unknown value)?

My first thought was to change isMissing() to isNaN() because that is what it really tells us. But then why not get the same information from the getYValue() method - by checking whether it returns 'null' or a Number object (which is, presuming the dataset is behaving consistently, a Double or Float object where the isNaN() method returns true).

Post by **paganene** » Wed May 05, 2004 10:53 pm

I'm currently developing an application that displays data sampled at 64 Hz over a duration of 60 s (3840 points). I ran into a CPU performance problem because of the recreation of 3840 Number objects at a rate of 64 Hz. One of my solutions was to subclass Number and allow the application to modify the value it holds. In other words, I made the value mutable, and that seems to have solved my problems.

Am I missing something?

Post by **nikster** » Thu May 06, 2004 8:19 am

i was considering the same thing.

the only problem is that it breaks an assumption everybody makes - Number objects are treated as immutable. here is an old article about this: http://www.artima.com/intv/gosling313.html

what you did is create a mutable subclass of Number.

this is not a problem as long as you have full control over the use of your number objects. but because Java defines Number as immutable (at least the included implementations are) it remains a hack. you could run into trouble when classes from the JDK or other classes use the number object because everybody assumes that Number is immutable.
so they will probably, sometime, rely on the immutability of the number object. at best, this creates ambiguities, at worst, errors.
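a sketch of the kind of trouble i mean, under the assumption that the dataset reuses one mutable instance and some other code keeps the references it was given. ReusedNumberHazard and Cell are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the aliasing hazard with a reused mutable Number.
final class ReusedNumberHazard {

    /** Minimal mutable Number, reused for every item (hypothetical). */
    static final class Cell extends Number {
        private static final long serialVersionUID = 1L;
        double value;
        @Override public double doubleValue() { return value; }
        @Override public float floatValue() { return (float) value; }
        @Override public int intValue() { return (int) value; }
        @Override public long longValue() { return (long) value; }
    }

    /** Collects values the way code expecting immutable Numbers might. */
    static List<Number> collect(double[] data) {
        Cell shared = new Cell();
        List<Number> kept = new ArrayList<>();
        for (double d : data) {
            shared.value = d; // the "dataset" overwrites its single instance
            kept.add(shared); // the caller keeps the reference for later
        }
        return kept;
    }
}
```

every entry in the returned list is the same object, so they all report the last value written.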

i think our solution will be better because we can avoid the topic altogether.

Post by **nikster** » Thu May 06, 2004 10:28 am

david.gilbert wrote:I just had a thought - the isMissing() method is redundant. Its purpose is only to determine the meaning of getY() when it returns Double.NaN (which is a double primitive)...is it really 'not a number' or is it equivalent to 'null' (a missing or unknown value)?

well, i am not sure of that. i would think that if isMissing() returns true, the getY() value returned is irrelevant (i.e. undefined).
isNaN() is not really the same in this respect. we want to find out when the value is missing (in order to be compatible with the meaning of null in getYValue())... isMissing therefore must have the same semantics.

david.gilbert wrote: My first thought was to change isMissing() to isNaN() because that is what it really tells us. But then why not get the same information from the getYValue() method - by checking whether it returns 'null' or a Number object (which is, presuming the dataset is behaving consistently, a Double or Float object where the isNaN() method returns true).

uh, the whole point was to not have to call getYValue() wherever you don't need a number - in order to prevent unnecessary object creation.

since you don't know whether it's null in advance, you would end up creating all those numbers and the whole idea of doing lazy number initialization goes out the window.

to summarize the solution:

1 - in any kind of XYDataset, you add the following methods:

double getX(..) {
    // throws NullPointerException if the value is missing
    return getXValue(..).doubleValue();
}
double getY(..) {
    return getYValue(..).doubleValue();
}
boolean isMissing(..) {
    return getXValue(..) == null;
}

2 - in all the drawing code and wherever you can substitute the Number object with the above calls, you do so.

3 - i can then implement the fast dataset like this

double[] x, y;
boolean[] missingValueMap;

double getX(..) {
    return x[..];
}
double getY(..) {
    return y[..];
}
boolean isMissing(..) {
    return missingValueMap[..];
}
Number getXValue(..) {
    if (missingValueMap[..])
        return null;
    else
        return new Double(x[..]); // optionally cache the number object here...
}
Number getYValue(..) {
    if (missingValueMap[..])
        return null;
    else
        return new Double(y[..]);
}

the advantage is that i then only create number objects when i must. and we can make the code work so that is very rarely, e.g. definitely not at drawing time.

the other advantage is that existing code does not change at all and people who don't want to deal with primitives don't have to.
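a compilable version of the fast dataset idea, as a sketch under a few assumptions of mine: a single series (so only an item index), a made-up class name FastXYSeries, no relation to the real XYDataset interface, and Double.valueOf() in place of new Double():

```java
// Hypothetical single-series dataset; the class and its constructor are
// assumptions for illustration and do not implement JFreeChart's XYDataset.
final class FastXYSeries {
    private final double[] x;
    private final double[] y;
    private final boolean[] missing; // true where the item has no value

    FastXYSeries(double[] x, double[] y, boolean[] missing) {
        if (x.length != y.length || x.length != missing.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.x = x;
        this.y = y;
        this.missing = missing;
    }

    int getItemCount() { return x.length; }

    // Primitive accessors: no allocation, suitable for drawing code.
    double getX(int item) { return x[item]; }
    double getY(int item) { return y[item]; }
    boolean isMissing(int item) { return missing[item]; }

    // Number accessors: objects are created only when somebody asks for them.
    Number getXValue(int item) {
        return missing[item] ? null : Double.valueOf(x[item]);
    }

    Number getYValue(int item) {
        return missing[item] ? null : Double.valueOf(y[item]);
    }
}
```

drawing code would loop over getItemCount(), skip items where isMissing() is true and read getX() / getY() directly, while something like a table model would go through getXValue() / getYValue().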

one thing i had not thought of before: can it be that getXValue() returns null and getYValue returns != null? does that have a meaning? if so, we need to parameterize the isMissing method or have two of them...

agreed?

Post by **david.gilbert** » Thu May 06, 2004 11:21 am

nikster wrote:well, i am not sure of that. i would think that if isMissing() returns true, the getY() value returned is irrelevant (i.e. undefined).
isNaN() is not really the same in this respect. we want to find out when the value is missing (in order to be compatible with the meaning of null in getYValue())... isMissing therefore must have the same semantics.

If isMissing() returns true, getY() should always be Double.NaN. That way, you can call getY() and if it is a number, just use it. But if it is Double.NaN, then you can refer to isMissing() to resolve whether it is *really* not-a-number, or actually a missing or unknown value.

My point, though, is that you don't need isMissing() to do the resolution.

nikster wrote:
david.gilbert wrote: My first thought was to change isMissing() to isNaN() because that is what it really tells us. But then why not get the same information from the getYValue() method - by checking whether it returns 'null' or a Number object (which is, presuming the dataset is behaving consistently, a Double or Float object where the isNaN() method returns true).
uh, the whole point was to not have to call getYValue() wherever you don't need a number - in order to prevent unnecessary object creation.

since you don't know whether it's null in advance, you would end up creating all those numbers and the whole idea of doing lazy number initialization goes out the window.

the advantage is that i then only create number objects when i must. and we can make the code work so that is very rarely, e.g. definitely not at drawing time.

But notice that you only need to call getYValue() in one special circumstance where the y-value is either 'null' or 'new Double(Double.NaN)'. The former involves no object creation, and the latter could be a static instance shared by all datasets...no object creation required.

nikster wrote:1 - in any kind of XYDataset, you add the following methods:
double getX(..) {
    // throws NullPointerException if the value is missing
    return getXValue(..).doubleValue();
}
double getY(..) {
    return getYValue(..).doubleValue();
}
boolean isMissing(..) {
    return getXValue(..) == null;
}

I'm using this as the default for getX() and getY() (it went into CVS this morning):

    /**
     * Returns the x-value (as a double primitive) for an item within a series.
     * 
     * @param series  the series (zero-based index).
     * @param item  the item (zero-based index).
     * 
     * @return The x-value.
     */
    public double getX(int series, int item) {
        double result = Double.NaN;
        Number x = getXValue(series, item);
        if (x != null) {
            result = x.doubleValue();   
        }
        return result;   
    }

    /**
     * Returns the y-value (as a double primitive) for an item within a series.
     * 
     * @param series  the series (zero-based index).
     * @param item  the item (zero-based index).
     * 
     * @return The y-value.
     */
    public double getY(int series, int item) {
        double result = Double.NaN;
        Number y = getYValue(series, item);
        if (y != null) {
            result = y.doubleValue();   
        }
        return result;   
    }

Naturally, a dataset implementation that is backed by double primitives would override these methods and return the values directly from the data structure being used.

nikster wrote: 2 - in all the drawing code and wherever you can substitute the Number object with the above calls, you do so.

Agreed.

nikster wrote:3 - i can then implement the fast dataset like this
double[] x, y;
boolean[] missingValueMap;

double getX(..) {
    return x[..];
}
double getY(..) {
    return y[..];
}
boolean isMissing(..) {
    return missingValueMap[..];
}
Number getXValue(..) {
    if (missingValueMap[..])
        return null;
    else
        return new Double(x[..]); // optionally cache the number object here...
}
Number getYValue(..) {
    if (missingValueMap[..])
        return null;
    else
        return new Double(y[..]);
}
the advantage is that i then only create number objects when i must. and we can make the code work so that is very rarely, e.g. definitely not at drawing time.

If you ensure that y[..] is 'Double.NaN' (a double primitive) when missingValueMap[..] is 'true', then getYValue(..) will return 'null' or 'new Double(Double.NaN)' which tells you the same information as calling isMissing(). Now just modify the getYValue() code slightly to check for Double.NaN and return a static instance to prevent creating a 'new Double(Double.NaN)' every time. Something like this:

Number getYValue(..) {
    if (missingValueMap[..])
        return null;
    else if (Double.isNaN(y[..]))
        return XYDataset.DOUBLE_NAN;
    else
        return new Double(y[..]);
}

nikster wrote:one thing i had not thought of before: can it be that getXValue() returns null and getYValue returns != null? does that have a meaning? if so, we need to parameterize the isMissing method or have two of them...

Life is simpler if we require non-null x values, which is what I've always assumed (but the code may not enforce that everywhere). I can't think of an application that requires null x-values and defined y-values.

Post by **nikster** » Thu May 06, 2004 11:42 am

david.gilbert wrote:
nikster wrote:well, i am not sure of that. i would think that if isMissing() returns true, the getY() value returned is irrelevant (i.e. undefined).
isNaN() is not really the same in this respect. we want to find out when the value is missing (in order to be compatible with the meaning of null in getYValue())... isMissing therefore must have the same semantics.
If isMissing() returns true, getY() should always be Double.NaN. That way, you can call getY() and if it is a number, just use it. But if it is Double.NaN, then you can refer to isMissing() to resolve whether it is *really* not-a-number, or actually a missing or unknown value.

My point, though, is that you don't need isMissing() to do the resolution.

i see.

let's assume there is no isMissing method and that getX() returns Double.NaN for missing values.

=> system finds Double.NaN
=> checks getXValue() to see if it's null or NaN

my getXValue code then does this:
1) check if it's missing and return null in that case
2) check if it's NaN and return a static NaN number in that case
3) create object Number and return it

only 3) would be expensive.

you are correct that that would work and would not trigger the object creation...
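spelled out as a sketch, for the y side and a single series. the class name NaNResolutionSketch and the describe() helper are made up, DOUBLE_NAN mirrors the constant name suggested earlier in the thread, and none of this is actual JFreeChart code:

```java
// Illustration of resolving "missing" versus "not-a-number" without an
// isMissing() method: the Number accessor is consulted only for NaN.
final class NaNResolutionSketch {

    /** Shared instance so that reporting NaN never allocates. */
    static final Double DOUBLE_NAN = Double.valueOf(Double.NaN);

    private final double[] y;
    private final boolean[] missing;

    NaNResolutionSketch(double[] y, boolean[] missing) {
        this.y = y;
        this.missing = missing;
    }

    /** Primitive access; missing items are reported as NaN. */
    double getY(int item) {
        return missing[item] ? Double.NaN : y[item];
    }

    /** Number access: null if missing, shared instance if NaN, new object otherwise. */
    Number getYValue(int item) {
        if (missing[item]) {
            return null;
        } else if (Double.isNaN(y[item])) {
            return DOUBLE_NAN;
        } else {
            return Double.valueOf(y[item]);
        }
    }

    /** Caller side: falls back to getYValue() only when getY() returned NaN. */
    String describe(int item) {
        double v = getY(item);
        if (!Double.isNaN(v)) {
            return "value";
        }
        return getYValue(item) == null ? "missing" : "not-a-number";
    }
}
```

for ordinary values the caller never touches getYValue(), and in the two NaN cases it gets back either null or the shared constant, so nothing is allocated on that path.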

the decision of whether to do it this way or that, in my opinion, depends entirely on which way is more elegant in the calling code / the grand scheme of things.

in general, i am not one to hesitate in adding a new method to an interface if it makes things clearer or makes code dealing with the class less arcane.

i don't know JFreeChart well enough to see all the nitty-gritty details so i would not be able to make that call...