T - type of itempublic final class ItemsSketch<T> extends Object
The documentation for DoublesSketch applies here except that the size of an ItemsSketch
is very dependent on the Items input into the sketch, so there is no comparable size table as
for the DoublesSketch.
There is more documentation available on DataSketches.GitHub.io.
| Modifier and Type | Field and Description |
|---|---|
static Random |
rand
Setting the seed makes the results of the sketch deterministic if the input values are
received in exactly the same order.
|
| Modifier and Type | Method and Description |
|---|---|
ItemsSketch<T> |
downSample(int newK)
From an existing sketch, this creates a new sketch that can have a smaller value of K.
|
double[] |
getCDF(T[] splitPoints)
Returns an approximation to the Cumulative Distribution Function (CDF), which is the
cumulative analog of the PMF, of the input stream given a set of splitPoints (values).
|
static <T> ItemsSketch<T> |
getInstance(Comparator<? super T> comparator)
Obtains a new instance of an ItemsSketch using the DEFAULT_K.
|
static <T> ItemsSketch<T> |
getInstance(int k,
Comparator<? super T> comparator)
Obtains a new instance of an ItemsSketch.
|
static <T> ItemsSketch<T> |
getInstance(Memory srcMem,
Comparator<? super T> comparator,
ArrayOfItemsSerDe<T> serDe)
Heapifies the given srcMem, which must be a Memory image of a ItemsSketch
|
int |
getK()
Returns the configured value of K
|
T |
getMaxValue()
Returns the max value of the stream
|
T |
getMinValue()
Returns the min value of the stream
|
long |
getN()
Returns the length of the input stream so far.
|
double |
getNormalizedRankError()
Get the rank error normalized as a fraction between zero and one.
|
static double |
getNormalizedRankError(int k)
Static method version of
getNormalizedRankError() |
double[] |
getPMF(T[] splitPoints)
Returns an approximation to the Probability Mass Function (PMF) of the input stream
given a set of splitPoints (values).
|
T |
getQuantile(double fraction)
This returns an approximation to the value of the data item
that would be preceded by the given fraction of a hypothetical sorted
version of the input stream so far.
|
T[] |
getQuantiles(double[] fractions)
This is a more efficient multiple-query version of getQuantile().
|
T[] |
getQuantiles(int evenlySpaced)
This is also a more efficient multiple-query version of getQuantile() and allows the caller to
specify the number of evenly spaced fractional ranks.
|
int |
getRetainedItems()
Computes the number of retained entries (samples) in the sketch
|
boolean |
isDirect() |
boolean |
isEmpty()
Returns true if this sketch is empty
|
boolean |
isEstimationMode() |
void |
putMemory(Memory dstMem,
ArrayOfItemsSerDe<T> serDe)
Puts the current sketch into the given Memory if there is sufficient space.
|
void |
reset()
Resets this sketch to a virgin state, but retains the original value of k.
|
byte[] |
toByteArray(ArrayOfItemsSerDe<T> serDe)
Serialize this sketch to a byte array form.
|
byte[] |
toByteArray(boolean ordered,
ArrayOfItemsSerDe<T> serDe)
Serialize this sketch to a byte array form.
|
String |
toString()
Returns summary information about this sketch.
|
String |
toString(boolean sketchSummary,
boolean dataDetail)
Returns summary information about this sketch.
|
void |
update(T dataItem)
Updates this sketch with the given double data item
|
public static final Random rand
public static <T> ItemsSketch<T> getInstance(Comparator<? super T> comparator)
T - type of itemcomparator - to compare itemspublic static <T> ItemsSketch<T> getInstance(int k, Comparator<? super T> comparator)
T - type of itemk - Parameter that controls space usage of sketch and accuracy of estimates.
Must be greater than 2 and less than 65536 and a power of 2.comparator - to compare itemspublic static <T> ItemsSketch<T> getInstance(Memory srcMem, Comparator<? super T> comparator, ArrayOfItemsSerDe<T> serDe)
T - type of itemsrcMem - a Memory image of a sketch.
See Memorycomparator - to compare itemsserDe - an instance of ArrayOfItemsSerDepublic void update(T dataItem)
dataItem - an item from a stream of items. NaNs are ignored.public T getQuantile(double fraction)
We note that this method has a fairly large overhead (microseconds instead of nanoseconds) so it should not be called multiple times to get different quantiles from the same sketch. Instead use getQuantiles(). which pays the overhead only once.
fraction - the specified fractional position in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
If fraction = 0.0, the true minimum value of the stream is returned.
If fraction = 1.0, the true maximum value of the stream is returned.public T[] getQuantiles(double[] fractions)
This returns an array that could have been generated by using getQuantile() with many different fractional ranks, but would be very inefficient. This method incurs the internal set-up overhead once and obtains multiple quantile values in a single query. It is strongly recommend that this method be used instead of multiple calls to getQuantile().
fractions - given array of fractional positions in the hypothetical sorted stream.
These are also called normalized ranks or fractional ranks.
These fractions must be monotonic, in increasing order and in the interval
[0.0, 1.0] inclusive.public T[] getQuantiles(int evenlySpaced)
evenlySpaced - an integer that specifies the number of evenly spaced fractional ranks.
This must be a positive integer greater than 0. A value of 1 will return the min value.
A value of 2 will return the min and the max value. A value of 3 will return the min,
the median and the max value, etc.public double[] getPMF(T[] splitPoints)
The resulting approximations have a probabilistic guarantee that be obtained from the getNormalizedRankError() function.
splitPoints - an array of m unique, monotonically increasing values
that divide the domain into m+1 consecutive disjoint intervals.public double[] getCDF(T[] splitPoints)
More specifically, the value at array position j of the CDF is the sum of the values in positions 0 through j of the PMF.
splitPoints - an array of m unique, monotonically increasing values
that divide the domain into m+1 consecutive disjoint intervals.public int getK()
public T getMinValue()
public T getMaxValue()
public long getN()
public double getNormalizedRankError()
Suppose the sketch is presented with N values. The raw rank (0 to N-1) of an item would be its index position in the sorted version of the input stream. If we divide the raw rank by N, it becomes the normalized rank, which is between 0 and 1.0.
For example, choosing a K of 227 yields a normalized rank error of about 1%. The upper bound on the median value obtained by getQuantile(0.5) would be the value in the hypothetical ordered stream of values at the normalized rank of 0.51. The lower bound would be the value in the hypothetical ordered stream of values at the normalized rank of 0.49.
The error of this sketch cannot be translated into an error (relative or absolute) of the returned quantile values.
public static double getNormalizedRankError(int k)
getNormalizedRankError()k - the configuration parameter of a ItemsSketchpublic boolean isEmpty()
public boolean isDirect()
public boolean isEstimationMode()
public void reset()
public byte[] toByteArray(ArrayOfItemsSerDe<T> serDe)
serDe - an instance of ArrayOfItemsSerDepublic byte[] toByteArray(boolean ordered,
ArrayOfItemsSerDe<T> serDe)
ordered - if true the base buffer will be ordered (default == false).serDe - an instance of ArrayOfItemsSerDepublic String toString()
public String toString(boolean sketchSummary, boolean dataDetail)
sketchSummary - if true includes sketch summarydataDetail - if true includes data detailpublic ItemsSketch<T> downSample(int newK)
newK - the new value of K that must be smaller than current value of K.
It is required that this.getK() = newK * 2^(nonnegative integer).public int getRetainedItems()
public void putMemory(Memory dstMem, ArrayOfItemsSerDe<T> serDe)
dstMem - the given memory.serDe - an instance of ArrayOfItemsSerDeCopyright © 2015–2017 Yahoo! Inc.. All rights reserved.