Reducing huge dataset

A discussion forum for JFreeChart (a 2D chart library for the Java platform).
Locked
hjf
Posts: 5
Joined: Wed Jul 26, 2006 7:27 pm

Reducing huge dataset

Post by hjf » Sat Aug 05, 2006 10:43 pm

Hello, I'm having a problem. I have a sensor that reports 2 times every second. That's almost 173000 times a day.
I want to show the readings on a chart. The problem is, how can I reduce the dataset size to make it more manageable?

If I feed JFreeChart with a TimeSeries of 173000 values, I have to go out for lunch while it prepares the graph.

How can I reduce the dataset size to something like 1000 points (to have nice resolution on 1024x768 for example).

I tried averaging, but the averages hide some of the "peaks" that JFreeChart displays (correctly) when I feed it the whole dataset.

To illustrate this: the sensor is a wind sensor, and wind is bursty. I take a 2-minute average to calculate the "instant" wind speed, and a 10-minute average to calculate the "current" wind speed. Overlaid on a graph, this looks something like a sine wave with a line through the middle. The middle line is the "current" wind speed, and the sine wave is the gusts and lulls. I need the graph to show the gusts and lulls, but if I average the data, these are lost (the line flattens).

kelly
Posts: 7
Joined: Tue Aug 01, 2006 4:58 pm

Post by kelly » Mon Aug 07, 2006 2:35 pm

Before you plot it, remove some data points, for example, so that you only have samples every second or every other second. When removing data points you might want to check them and plot outliers anyway.
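A minimal sketch of kelly's idea in plain Java (arrays rather than JFreeChart types; the stride and outlier threshold are illustrative parameters, not anything from the posts): keep every n-th sample, plus any sample that deviates from the overall mean by more than a threshold.

```java
import java.util.ArrayList;
import java.util.List;

public class Subsample {
    /**
     * Keeps every 'stride'-th sample, plus any sample that deviates
     * from the overall mean by more than 'threshold' (an outlier).
     */
    static List<Double> subsample(double[] samples, int stride, double threshold) {
        double sum = 0;
        for (double s : samples) sum += s;
        double mean = sum / samples.length;

        List<Double> kept = new ArrayList<>();
        for (int i = 0; i < samples.length; i++) {
            boolean onGrid = (i % stride == 0);          // regular subsampling
            boolean outlier = Math.abs(samples[i] - mean) > threshold;
            if (onGrid || outlier) kept.add(samples[i]); // keep grid points and outliers
        }
        return kept;
    }
}
```

The kept values (or their indices) would then be fed into the `TimeSeries` instead of the raw stream.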

SeanMcCrohan
Posts: 18
Joined: Thu May 11, 2006 11:22 pm

Post by SeanMcCrohan » Mon Aug 07, 2006 3:16 pm

Take an average of the whole dataset, to get a baseline. For each minute, create a datapoint that is either the maximum value for that minute or the minimum value, depending on which is further away from the day's average. That is, show the MOST EXTREME value for each time period, rather than averaging (which would do the reverse).

That should give you a better picture of the variation over time, assuming your sensor is reliable. The weakness of the approach shows up if the sensor sometimes gives bad data: there's a very good chance you'd be selecting the spurious datapoints disproportionately, because they'd be far from the mean.
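Sketched in plain Java (the bucket size and method names are illustrative, not JFreeChart API): for each bucket, keep whichever of the bucket's min or max lies further from the global mean.

```java
public class ExtremeDecimator {
    /** For each bucket of 'bucketSize' samples, keeps the value
     *  furthest from the overall mean (the bucket's min or max). */
    static double[] decimate(double[] samples, int bucketSize) {
        double sum = 0;
        for (double s : samples) sum += s;
        double mean = sum / samples.length;

        int nBuckets = (samples.length + bucketSize - 1) / bucketSize;
        double[] out = new double[nBuckets];
        for (int b = 0; b < nBuckets; b++) {
            int start = b * bucketSize;
            int end = Math.min(start + bucketSize, samples.length);
            double min = samples[start], max = samples[start];
            for (int i = start + 1; i < end; i++) {
                min = Math.min(min, samples[i]);
                max = Math.max(max, samples[i]);
            }
            // keep the more extreme of the two, relative to the mean
            out[b] = (mean - min > max - mean) ? min : max;
        }
        return out;
    }
}
```

With 2 readings per second, a bucket size of 120 gives one point per minute, i.e. 1440 points per day.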

cringe
Posts: 54
Joined: Wed May 10, 2006 10:39 am
Location: Germany

Post by cringe » Wed Aug 09, 2006 12:25 pm

You could create your own renderer that checks whether the next point would be drawn at the same position, or within a specific area. This has been done before; search for "performance" in the forum.

Mercer
Posts: 11
Joined: Mon Jun 05, 2006 6:55 pm

Post by Mercer » Mon Aug 14, 2006 3:08 pm

The number of points you're talking about is no problem. Just don't add them all with notification turned on (a true/false flag in the add method). THAT'S what takes all the time with large series. Add with notification = false, and if a point comes in that's out of range, add that point WITH notification.

If you're working with averages, it will be rare that a point truly breaks the upper and lower ranges, so the current ranges of the graph will be fine. So why check them? The chart will take 5-10 seconds (depending on your machine) to redraw, but that's a lot better than going out for lunch...
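In JFreeChart this is the `add(...)` overload that takes a boolean notify argument, followed by a single `fireSeriesChanged()` once the batch is loaded. The pattern is sketched below with a minimal stand-in series (the class and listener names here are made up so the example runs standalone; they are not JFreeChart's):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSeries {
    interface Listener { void seriesChanged(); }

    private final List<Double> data = new ArrayList<>();
    private final List<Listener> listeners = new ArrayList<>();

    void addListener(Listener l) { listeners.add(l); }

    /** Like TimeSeries.add(period, value, notify) in JFreeChart:
     *  only fire a change event when 'notify' is true. */
    void add(double value, boolean notify) {
        data.add(value);
        if (notify) fireSeriesChanged();
    }

    /** Fire one change event after a batch of silent adds. */
    void fireSeriesChanged() {
        for (Listener l : listeners) l.seriesChanged();
    }

    int size() { return data.size(); }
}
```

With a real `TimeSeries` you would call `series.add(period, value, false)` inside the loading loop and `series.fireSeriesChanged()` once at the end; with notification on every add, each insert can trigger a full chart redraw.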

hjf
Posts: 5
Joined: Wed Jul 26, 2006 7:27 pm

Post by hjf » Mon Aug 14, 2006 3:41 pm

Thanks for all the answers. Unfortunately I haven't had time to try them, but I'll do that as soon as I can and I'll let you know the results.

Mercer: No, what takes a long time is the creation of the graph and the MovingAverage I added to it. That is, I create and populate the dataset first, and then I create the graph. It takes a long time (more than 5 minutes) on my computer (Sempron 2200+, 1GB RAM, Windows XP SP2, Oracle JVM).

Mercer
Posts: 11
Joined: Mon Jun 05, 2006 6:55 pm

Post by Mercer » Mon Aug 14, 2006 7:18 pm

Well, if you're truly talking about only 100,000 or so points, 5 minutes seems particularly long.
Two possible solutions:

1) The element list (may not be QUITE the right name) created in the actual renderer is a huge array of Rectangle2Ds that pop up tooltips when your mouse hovers over a particular pixel location. I know this takes a prodigious amount of memory for that many points, but I DON'T know how long the process takes; we never tested it for speed because we cut this functionality very early on in our process.

2) Search these forums for 'dynamic' or 'efficiency' or some such. You're looking for code that checks whether a given point is within 2 pixels of the previous one. If so, don't bother drawing the line. Why bother? The user can't see definition that close anyway.
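That pixel filter can be sketched independently of the renderer (plain Java; the plot dimensions and the 2-pixel gap are illustrative): map each sample to a pixel position and drop points that land too close to the previously kept one.

```java
import java.util.ArrayList;
import java.util.List;

public class PixelFilter {
    /** Keeps only points whose pixel position differs from the last
     *  kept point by at least 'minGap' pixels in x or y.
     *  xs/ys are data coordinates mapped onto a plotWidth x plotHeight area. */
    static List<Integer> visibleIndices(double[] xs, double[] ys,
            double xMin, double xMax, double yMin, double yMax,
            int plotWidth, int plotHeight, int minGap) {
        List<Integer> kept = new ArrayList<>();
        long lastPx = 0, lastPy = 0;
        boolean first = true;
        for (int i = 0; i < xs.length; i++) {
            // map data coordinates to pixel coordinates
            long px = Math.round((xs[i] - xMin) / (xMax - xMin) * (plotWidth - 1));
            long py = Math.round((ys[i] - yMin) / (yMax - yMin) * (plotHeight - 1));
            if (first || Math.abs(px - lastPx) >= minGap || Math.abs(py - lastPy) >= minGap) {
                kept.add(i);
                lastPx = px;
                lastPy = py;
                first = false;
            }
        }
        return kept;
    }
}
```

On a 1024-pixel-wide plot this typically cuts 172,800 samples down to a few thousand drawable points, since consecutive readings land on the same pixel column.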

If neither of these works, double-check that this 5-minute process is in fact taking place AFTER your dataset is populated. In my experience the real time is spent firing notifications back and forth.

Further Questions:

Are you dynamically graphing? Or is this a once a day kind of thing?

How often are you updating the graph?

g'luck.

hjf
Posts: 5
Joined: Wed Jul 26, 2006 7:27 pm

Post by hjf » Mon Aug 14, 2006 7:54 pm

Thanks for your answers, Mercer.

Theoretically, the sensor sends its output 2 times every second, that is, 172800 times a day. But that output needs to be averaged to get the actual reading. It's reduced to a 2-minute average, which would shrink the dataset to 1440 points. I tried that, but I didn't like the results: taking chunks of 120 samples and averaging them doesn't give the smooth line I need. It's less spiky than the sensor's output, but still too "random" to be useful.

(Again, it's a wind sensor, an anemometer. It spins for a while and then it stops. I first thought it wasn't working correctly, but after doing some research I was told that this is the expected behavior in light winds, especially in a city.)

So I tried the MovingAverage; that option gives what I need, but then of course the dataset is as large as the original input.

I'll search for performance and see what I can find.

I'd like to update the graph a few times an hour, every 5 or 10 minutes. I'd like to keep the last 24 hours or so in the graph, as in a FIFO queue, where the old data is discarded (actually logged to a text file, but no longer shown in the graph).
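One compromise between the two options hjf describes, sketched in plain Java (the window and stride values are illustrative): compute the moving average over every sample, but only emit every k-th averaged point. The line stays as smooth as the full moving average while the plotted dataset shrinks by a factor of k.

```java
import java.util.ArrayList;
import java.util.List;

public class SmoothedDecimator {
    /** Trailing moving average over 'window' samples,
     *  emitting only every 'stride'-th averaged value. */
    static List<Double> smoothAndThin(double[] samples, int window, int stride) {
        List<Double> out = new ArrayList<>();
        double sum = 0;
        for (int i = 0; i < samples.length; i++) {
            sum += samples[i];
            if (i >= window) sum -= samples[i - window];   // slide the window
            int count = Math.min(i + 1, window);           // partial window at the start
            if (i % stride == 0) out.add(sum / count);
        }
        return out;
    }
}
```

With 2 samples per second, a 240-sample window is a 2-minute moving average, and a stride of around 170 would bring a day down to roughly 1000 plotted points.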

dan1son
Posts: 12
Joined: Thu Jun 29, 2006 7:06 pm
Location: Austin, TX

Post by dan1son » Mon Aug 21, 2006 8:45 pm

I too am in the environmental business of sorts. We have wind, temperature, humidity, etc. sensors that record a reading whenever the value changes. This leads to about 1000-200000 readings per day, depending on the environment and the sensor.

I don't really have a problem with readings jumping from 0 to whatever value they are all the time, but I do have varying readings throughout the day. What I currently do to shrink the amount of data is decide how often I need a reading so that I end up with 1500 values per timespan. We have the ability to graph weeks, months, even years at a time, so I absolutely had to shrink the number down.

So essentially, take a day, divide it by 1500, and take one reading per segment. Another thing you can do is make sure you don't grab a zero value; just grab the next one.

One problem with this is that you may miss a spike here or there, but for the most part you should get a graph displaying what the wind did throughout the day...
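dan1son's scheme in plain Java (the target count of 1500 and the zero-skip come from his description; the class and method names are made up): divide the span into equal segments and take the first non-zero reading in each.

```java
public class FixedCountSampler {
    /** Picks ~targetCount readings: one per equal-length segment,
     *  preferring the first non-zero value in the segment. */
    static double[] pick(double[] samples, int targetCount) {
        int n = Math.min(targetCount, samples.length);
        double[] out = new double[n];
        for (int seg = 0; seg < n; seg++) {
            int start = (int) ((long) seg * samples.length / n);
            int end = (int) ((long) (seg + 1) * samples.length / n);
            out[seg] = samples[start];                 // fallback: segment may be all zeros
            for (int i = start; i < end; i++) {
                if (samples[i] != 0) {                 // skip zero readings
                    out[seg] = samples[i];
                    break;
                }
            }
        }
        return out;
    }
}
```

As the caveat above notes, this can miss a spike that falls between the sampled readings; combining it with the extreme-value selection idea would trade smoothness for spike retention.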

The problem with having 172800 points in a graph on a 1280x1024 screen is that it can only display so many of those points anyway, so shrinking the dataset doesn't really make the graph look much different... if at all.

jwenting
Posts: 157
Joined: Sat Jul 15, 2006 7:46 am

Post by jwenting » Tue Aug 22, 2006 7:11 am

Much the same for the financial data we plot. It's stored at roughly one-minute intervals (depending on the exact reporting times), but often charted for a year or even longer (our biggest charts cover 10-year intervals).

Simply determine the most logical interval to chart for each duration and skip the rest of the data.
So for a 1-year chart we may plot only the close price for each day, for a one-week chart the close price for each hour, and for an intraday chart the actual price at 1-minute intervals.

That effectively means plotting 1000-5000 points per plot, with often several plots per chart (value, moving average, trade volume, etc.).
Chart generation takes less than a second; in fact, retrieving and analysing the data to be plotted is more expensive.

I think you should take a hard look at your storage system. You may well find that your main performance bottleneck isn't chart generation at all, but the retrieval of the data from your datastore, and maybe memory pressure caused by the volume.

Locked