sample示例代码如下:
val a = sc.parallelize(1 to 20)
val b = a.sample(true, 0.5, 0)
b.collect.mkString(", ")
val c = a.sample(false, 0.5, 0)
c.collect.mkString(", ")输出如下:
String = 3, 6, 7, 9, 9, 10, 15, 16, 16, 17, 17, 18, 20 String = 1, 3, 4, 6, 8, 10, 11, 18, 19
参数说明:
第一个参数表示是否放回,
第二个参数设置抽样比例,
第三个参数设置seed。
sample函数定义如下:
/**
* Return a sampled subset of this RDD.
*
* @param withReplacement can elements be sampled multiple times (replaced when sampled out)
* @param fraction expected size of the sample as a fraction of this RDD's size
* without replacement: probability that each element is chosen; fraction must be [0, 1]
* with replacement: expected number of times each element is chosen; fraction must be >= 0
* @param seed seed for the random number generator
*/
def sample(
withReplacement: Boolean,
fraction: Double,
seed: Long = Utils.random.nextLong): RDD[T] = withScope {
require(fraction >= 0.0, "Negative fraction value: " + fraction)
if (withReplacement) {
new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
} else {
new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
}
}