stats (Statistical Summary)


     stats 'filename' [using N[:M]] [name 'prefix'] [[no]output]

This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See ‘plot‘ for details on the index, every, and using directives. Data points are filtered against both xrange and yrange before analysis. See xrange. The summary is printed to the screen by default. Output can be redirected to a file by prior use of the command ‘set print‘, or suppressed altogether using the ‘nooutput‘ option.
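
The range filtering and output options described above can be sketched as follows. This is only an illustration: the file names data.dat and summary.txt and the range limits are assumptions, not part of the command itself.

```gnuplot
# data.dat and summary.txt are hypothetical file names
set xrange [0:10]        # records with column 1 outside [0:10] are filtered out
set yrange [*:*]         # y axis autoscaled, so column 2 is not range-limited

stats 'data.dat' using 1:2             # summary printed to the screen

set print 'summary.txt'                # redirect subsequent printed output
stats 'data.dat' using 1:2             # summary should now go to summary.txt
set print                              # restore the default output destination

stats 'data.dat' using 1:2 nooutput    # no summary; the variables are still set
```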

In addition to printed output, the program stores the individual statistics into three sets of variables. The first set of variables reports how the data is laid out in the file:

     STATS_records           # total number of in-range data records
     STATS_outofrange        # number of records filtered out by range limits
     STATS_invalid           # number of invalid/incomplete/missing records
     STATS_blank             # number of blank lines in the file
     STATS_blocks            # number of indexable data blocks in the file
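
A short sketch of how these layout variables might be inspected before trusting the rest of the summary. The file name data.dat is hypothetical.

```gnuplot
# data.dat is a hypothetical file name
stats 'data.dat' using 1 nooutput

print "records in range:     ", STATS_records
print "records out of range: ", STATS_outofrange
print "blank lines:          ", STATS_blank
print "data blocks:          ", STATS_blocks

# Flag files that contain unusable records
if (STATS_invalid > 0) {
    print "warning: invalid or incomplete records: ", STATS_invalid
}
```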

The second set reports properties of the in-range data from a single column. If the corresponding axis is autoscaled (x-axis for the 1st column, y-axis for the optional second column) then no range limits are applied. If two columns are being analysed in a single ‘stats‘ command, the suffix "_x" or "_y" is appended to each variable name. For example, STATS_min_x is the minimum value found in the first column, while STATS_min_y is the minimum value found in the second column.

     STATS_min               # minimum value of in-range data points
     STATS_max               # maximum value of in-range data points
     STATS_index_min         # index i for which data[i] == STATS_min
     STATS_index_max         # index i for which data[i] == STATS_max
     STATS_lo_quartile       # value of the lower (1st) quartile boundary
     STATS_median            # median value
     STATS_up_quartile       # value of the upper (3rd) quartile boundary
     STATS_mean              # mean value of in-range data points
     STATS_stddev            # standard deviation of the in-range data points
     STATS_sum               # sum
     STATS_sumsq             # sum of squares
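
As one possible use of the single-column variables, the sketch below analyses column 2 of a hypothetical file data.dat and then draws the mean and a one-standard-deviation band over the data. Because only one column is analysed, the variable names carry no "_x" or "_y" suffix. The variable iqr is introduced here purely for illustration.

```gnuplot
# data.dat is a hypothetical file name
stats 'data.dat' using 2 nooutput

# Interquartile range derived from the stored quartile boundaries
iqr = STATS_up_quartile - STATS_lo_quartile
print "interquartile range: ", iqr

# Constant expressions plot as horizontal lines
plot 'data.dat' using 1:2 with points title 'data', \
     STATS_mean title 'mean', \
     STATS_mean + STATS_stddev title 'mean + stddev', \
     STATS_mean - STATS_stddev title 'mean - stddev'
```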

The third set of variables is only relevant to analysis of two data columns.

     STATS_correlation       # correlation coefficient between x and y values
     STATS_slope             # A corresponding to a linear fit y = Ax + B
     STATS_intercept         # B corresponding to a linear fit y = Ax + B
     STATS_sumxy             # sum of x*y
     STATS_pos_min_y         # x coordinate of a point with minimum y value
     STATS_pos_max_y         # x coordinate of a point with maximum y value
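
The two-column variables lend themselves to annotating a plot. The following is a minimal sketch, assuming a hypothetical file data.dat with x in column 1 and y in column 2. The function name f and the label positions are arbitrary choices.

```gnuplot
# data.dat is a hypothetical file name
stats 'data.dat' using 1:2 nooutput

# Straight line y = Ax + B from the stored slope and intercept
f(x) = STATS_slope * x + STATS_intercept

# Report the correlation coefficient and mark the point of maximum y
set label 1 sprintf("r = %.3f", STATS_correlation) at graph 0.05, graph 0.95
set label 2 "max" at STATS_pos_max_y, STATS_max_y offset 1,1

plot 'data.dat' using 1:2 with points title 'data', \
     f(x) with lines title 'linear fit'
```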

It may be convenient to track the statistics from more than one file at the same time. The ‘name‘ option causes the default prefix "STATS" to be replaced by a user-specified string. For example, the mean value of column 2 data from two different files could be compared by

     stats "file1.dat" using 2 name "A"
     stats "file2.dat" using 2 name "B"
     if (A_mean < B_mean) {...}

The index reported in STATS_index_xxx corresponds to the value of pseudo-column 0 ($0) in plot commands. That is, the first point has index 0 and the last point has index N-1.
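
Because the index matches pseudo-column 0, it can be used to pick out a single point in a later plot command. The sketch below assumes a hypothetical file data.dat containing a single data block with no invalid records, so that $0 in the plot counts the same points that ‘stats‘ counted.

```gnuplot
# data.dat is a hypothetical file name
stats 'data.dat' using 2 nooutput

# Second plot element keeps only the point whose $0 equals STATS_index_max;
# every other point evaluates to 1/0 (undefined) and is skipped
plot 'data.dat' using 1:2 with lines title 'data', \
     '' using 1:($0 == STATS_index_max ? $2 : 1/0) with points pt 7 title 'maximum'
```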

Data values are sorted to find the median and quartile boundaries. If the total number of points N is odd, then the median value is taken as the value of data point (N+1)/2. If N is even, then the median is reported as the mean value of points N/2 and (N+2)/2. Equivalent treatment is used for the quartile boundaries.
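
The median rule can be checked by hand on small inputs. The sketch below assumes two hypothetical one-column files, odd.dat and even.dat, whose contents are given in the comments. The prefixes ODD and EVEN are arbitrary.

```gnuplot
# odd.dat (hypothetical) holds the five values 3 1 4 1 5, one per line.
# Sorted: 1 1 3 4 5. N = 5 is odd, so the median is sorted point
# (5+1)/2 = 3, which has the value 3.
stats 'odd.dat' using 1 name 'ODD' nooutput
print ODD_median

# even.dat (hypothetical) holds the six values 3 1 4 1 5 9, one per line.
# Sorted: 1 1 3 4 5 9. N = 6 is even, so the median is the mean of sorted
# points 6/2 = 3 and (6+2)/2 = 4, that is (3 + 4)/2 = 3.5.
stats 'even.dat' using 1 name 'EVEN' nooutput
print EVEN_median
```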

For an example of using the ‘stats‘ command to help annotate a subsequent plot, see stats.dem.

The current implementation does not allow analysis if either the X or Y axis is set to log-scaling. This restriction may be removed in a later version.
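
One way to work within this restriction is to switch log-scaling off for the analysis and back on for the plot. This is a sketch under the assumption that only the y axis was log-scaled. The file name data.dat is hypothetical. The statistics then describe the raw data values, not their logarithms.

```gnuplot
# data.dat is a hypothetical file name
unset logscale                         # linear axes for the analysis
stats 'data.dat' using 1:2 nooutput    # statistics of the raw values

set logscale y                         # restore log scaling for the plot
plot 'data.dat' using 1:2 with lines title 'data'
```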