## Simple Random Sampling vs. Stratified Sampling

### Simple Random Sampling vs. Stratified Sampling

Hi all,

I want to simulate the difference in sampling variance between simple random sampling (SRS) and stratified sampling (SS). In particular, I want to test the case where stratification does not produce homogeneous groups.

I use a simple universe of 50 units. The values VV are drawn from a normal distribution with mean=100000 and variance=1000.
Then I divided the observations into two groups. The within-group variances of the two groups are very similar, so I expect no gain from using SS instead of SRS.

I run the following code to create every possible sample of 5 elements. The first part simulates stratified sampling (2 elements from the first group of 15 units and 3 elements from the second group of 35 units), while the second part simulates SRS with a sample of 5 elements.
Using these samples, I want to estimate the universe sum.

--------------------------------------------------------------------------------

data univ;
s1=98732.4827088742;
s2=107216.090125439;
s3=89101.1839281418;
s4=93957.4081508908;
s5=119316.121324664;
s6=98747.8986440692;
s7=87339.9701755261;
s8=115679.779608035;
s9=97374.1523717763;
s10=100898.148755368;
s11=110504.163583391;
s12=98789.5193953591;
s13=92648.2587397913;
s14=121355.481294449;
s15=100180.92123355;
s16=115262.139640981;
s17=88184.2086324468;
s18=93293.704392272;
s19=83388.4430699982;
s20=111601.719052123;
s21=109095.720088226;
s22=116420.062820544;
s23=99392.6167007885;
s24=94120.9580356372;
s25=108706.911103218;
s26=104347.884896561;
s27=98898.8747645635;
s28=95823.0205265863;
s29=97495.1151671121;
s30=93106.7009129038;
s31=103422.81509802;
s32=89250.9777055238;
s33=108031.497600314;
s34=93863.4800855652;
s35=93438.609635632;
s36=97742.2362462676;
s37=113571.025192505;
s38=86556.5996515215;
s39=93765.3001325089;
s40=110225.357982563;
s41=101765.147317201;
s42=87894.8528930778;
s43=100758.450369176;
s44=86273.6331124324;
s45=104331.900527177;
s46=104758.589966514;
s47=95794.6215545235;
s48=108587.949196226;
s49=79214.5899846219;
s50=94804.1249808739;
/* Expansion factors: N_h/n_h for each stratum (zf1, zf2) and N/n for SRS (zf). */
zf1=15/2; zf2=35/3; zf=50/5;
run;

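/* Stratified design: every sample of 2 units from group 1 (s1-s15)
   combined with every sample of 3 units from group 2 (s16-s50). */
data univ1;
  set univ;
  array gp1 (*) s1--s15;
  array gp2 (*) s16--s50;
  do i=1 to (dim(gp1)-1);
    do j=i+1 to dim(gp1);
      do k=1 to (dim(gp2)-2);
        do w=k+1 to (dim(gp2)-1);
          do x=w+1 to (dim(gp2));
            /* Expand each stratum's sample sum by its factor N_h/n_h. */
            vvgp1s=gp1(i)*zf1 + gp1(j)*zf1;
            vvgp2s=gp2(k)*zf2 + gp2(w)*zf2 + gp2(x)*zf2;
            vvstot=vvgp2s+vvgp1s;  /* estimated universe sum */
            output;
          end;
        end;
      end;
    end;
  end;
run;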

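/* SRS design: every sample of 5 units from all 50 (i < j < k < w < x). */
data univ2;
  set univ;
  array gp1 (*) s1--s50;
  do i=1 to (dim(gp1)-4);
    do j=i+1 to (dim(gp1)-3);
      do k=j+1 to (dim(gp1)-2);
        do w=k+1 to (dim(gp1)-1);
          do x=w+1 to (dim(gp1));
            /* Expand the sample sum by N/n to estimate the universe sum. */
            vvtots=(gp1(i)+gp1(j)+gp1(k)+gp1(w)+gp1(x))*zf;
            output;
          end;
        end;
      end;
    end;
  end;
run;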

proc means data=univ1;
var vvstot;
run;

proc means data=univ2;
var vvtots;
run;

--------------------------------------------------------------------------------

According to statistical theory, the expected value of the estimated sum equals the universe sum for both techniques, while the sampling variance should not be greater for SS.

What I get is that the variances are very similar, but the SRS variance is slightly smaller than the SS variance!

--------------------------------------------------------------------------------

SS
N        = 687225
Mean     = 4995031.42
Std.Dev. = 210736.37

SRS
N        = 2118760
Mean     = 4995031.42
Std.Dev. = 208152.14

--------------------------------------------------------------------------------

Could you help me understand where the mistake is?

Thanks for any suggestion.

Gianluca

### Simple Random Sampling vs. Stratified Sampling

Quote:> This is a multi-part message in MIME format.

Oh.  Please don't do that.  It mucks up some people's
news readers, and can make a mess in the SAS-L digest
version.  (Just thought you'd want to know.)

Quote:> I want to simulate the difference between sampling variance using
> simple random sampling (SRS) rather than Stratified Sampling (SS). In
> particular I want to test the case when stratification is not useful to
> define homogeneous groups.

That's *all* cases.  If the groups are homogeneous with
respect to the character to be estimated, then don't
do stratified sampling.  Find another approach.

Quote:> I use a simple universe of 50 units. The values VV are drawn from a
> normal distribution of mean=100000 and variance=1000.
> Then I divided the observation in two groups. The inner variance of both
> groups is very similar so I expect to get no gain using SS instead of
> SRS.

Correct.  But you're forgetting that stratified sampling
loses on variances by losing all those joint inclusion
probabilities across the strata.  So...
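
To make that concrete, here is a sketch of the textbook variance formulas for the estimated total, applied to the design in the question (N = 50, n = 5, strata of 15 and 35 units with 2 and 3 draws). The symbols S², S_h², N_h and n_h are my own notation, not the original poster's.

$$
\begin{aligned}
\operatorname{Var}_{\mathrm{SRS}}(\hat T) &= N^2\left(1-\frac{n}{N}\right)\frac{S^2}{n}
  = 50^2\cdot\frac{45}{50}\cdot\frac{S^2}{5} = 450\,S^2,\\
\operatorname{Var}_{\mathrm{SS}}(\hat T) &= \sum_{h} N_h^2\left(1-\frac{n_h}{N_h}\right)\frac{S_h^2}{n_h}
  = 15^2\cdot\frac{13}{15}\cdot\frac{S_1^2}{2} + 35^2\cdot\frac{32}{35}\cdot\frac{S_2^2}{3}
  = 97.5\,S_1^2 + \frac{1120}{3}\,S_2^2 .
\end{aligned}
$$

Here S² and S_h² are the population variances (divisors N-1 and N_h-1) of the whole universe and of stratum h. If the strata are no more homogeneous than the universe, so that S_1² ≈ S_2² ≈ S², the stratified variance is about (97.5 + 373.3) S² ≈ 470.8 S², which is larger than 450 S². In that equal-variance case the whole excess comes from the allocation: 2 and 3 draws are not proportional to 15 and 35 (proportional allocation would need 1.5 and 3.5).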

Quote:> I run the following code to create any possible sample of 5 elements.
> The first part simulate the case of stratified sampling (2 elements from
> the first group of 15 units and 3 elements from the second of 35 units),
> while the second part simulate the case SRS with a sample of 5 elements.

Quote:> . . . .
> What I get is that variances are very similar, but SRS variance is
> slightly smaller than SS variance!!!

I didn't study your code that closely, but you can
check it with PROC SURVEYMEANS.  (I did not.)
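
For what it's worth, a PROC SURVEYMEANS run on one stratified sample might look like the sketch below. It assumes the `univ` data set from the original post. The data set names (`frame`, `strata_totals`, `samp`), the `stratum` variable and the seed are made up for illustration. Note that this estimates the variance from a single sample, rather than enumerating all samples as the posted code does.

```sas
/* Hypothetical helper: one row per unit, units 1-15 in stratum 1. */
data frame;
  set univ;
  array s (*) s1--s50;
  do unit = 1 to dim(s);
    stratum = 1 + (unit > 15);
    vv = s(unit);
    output;
  end;
  keep unit stratum vv;
run;

/* Stratum population sizes in the layout TOTAL= expects (_TOTAL_). */
data strata_totals;
  stratum = 1; _total_ = 15; output;
  stratum = 2; _total_ = 35; output;
run;

/* Draw 2 units from stratum 1 and 3 from stratum 2; STATS keeps the
   sampling weights in the output. The seed is arbitrary. */
proc surveyselect data=frame out=samp method=srs
                  sampsize=(2 3) seed=12345 stats;
  strata stratum;
run;

/* Estimated total and its standard deviation under the stratified design. */
proc surveymeans data=samp total=strata_totals sum std;
  strata stratum;
  var vv;
  weight SamplingWeight;
run;
```

Changing the seed gives a different sample and a different estimate. The spread of those estimates over all possible samples is what the enumeration in the question measures.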

Still, I don't see anything wrong with the size of your
results.  People assume stratified sampling has magical
properties, but it only works when used *right*.  In
fact, even if you have clear strata with known homogeneity,
so that stratified sampling seems preferable to SRS,
a mere 20% misclassification rate can completely obliterate
any performance gains due to stratified sampling.  (I believe
the best reference on this would be Olsen and Stevens, or
Olsen and Urquhart.  The correct ref would be in the Proceedings
of the American Statistical Association, perhaps in the Section
on Statistics and the Environment, about 10 years ago, if I
recall correctly...)
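
The size of the results can also be cross-checked against the closed-form variances of the estimated total, without enumerating any samples. This is a rough sketch, assuming the `univ` data set from the original post and the usual finite-population formulas. The names `univ_long`, `popstats`, `theory`, `bigN` and `S2` are mine.

```sas
/* Reshape the one-row universe: one row per unit, units 1-15 in stratum 1. */
data univ_long;
  set univ;
  array s (*) s1--s50;
  do unit = 1 to dim(s);
    stratum = 1 + (unit > 15);
    vv = s(unit);
    output;
  end;
  keep unit stratum vv;
run;

/* Size and variance (divisor N-1) overall (_TYPE_=0) and per stratum. */
proc means data=univ_long noprint;
  class stratum;
  var vv;
  output out=popstats n=bigN var=S2;
run;

/* Apply Var = N**2 * (1 - n/N) * S2 / n overall (SRS, n=5)
   and summed over strata (SS, n_h = 2 and 3). */
data theory;
  set popstats end=last;
  retain var_srs 0;
  if _type_ = 0 then do;
    n = 5;
    var_srs = bigN**2 * (1 - n/bigN) * S2 / n;
  end;
  else do;
    n = ifn(stratum = 1, 2, 3);
    var_ss + bigN**2 * (1 - n/bigN) * S2 / n;
  end;
  if last then do;
    sd_srs = sqrt(var_srs);
    sd_ss  = sqrt(var_ss);
    output;
  end;
  keep sd_srs sd_ss;
run;

proc print data=theory;
run;
```

If the enumeration is right, `sd_srs` and `sd_ss` should land very close to the two Std.Dev. values that PROC MEANS reported for `vvtots` and `vvstot`.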

HTH,
David
--
David Cassell, CSC

Senior computing specialist
mathematical statistician