Hi all,

I want to compare, by simulation, the sampling variance under simple random sampling (SRS) with the sampling variance under stratified sampling (SS). In particular, I want to test the case where stratification does not produce homogeneous groups.

I use a simple universe of 50 units. The values VV are drawn from a normal distribution with mean=100000 and variance=1000.

Then I divided the observations into two groups. The within-group variances of the two groups are very similar, so I expect no gain from using SS instead of SRS.

I run the following code to enumerate every possible sample of 5 elements. The first part simulates stratified sampling (2 elements from the first group of 15 units and 3 elements from the second group of 35 units), while the second part simulates SRS with a sample of 5 elements.

Using these samples I want to estimate the universe sum.

---------------------------------------------------------------------------

/* Universe of 50 values stored as one observation: s1-s15 form group 1, s16-s50 form group 2 */
data univ;
  s1=98732.4827088742;
  s2=107216.090125439;
  s3=89101.1839281418;
  s4=93957.4081508908;
  s5=119316.121324664;
  s6=98747.8986440692;
  s7=87339.9701755261;
  s8=115679.779608035;
  s9=97374.1523717763;
  s10=100898.148755368;
  s11=110504.163583391;
  s12=98789.5193953591;
  s13=92648.2587397913;
  s14=121355.481294449;
  s15=100180.92123355;
  s16=115262.139640981;
  s17=88184.2086324468;
  s18=93293.704392272;
  s19=83388.4430699982;
  s20=111601.719052123;
  s21=109095.720088226;
  s22=116420.062820544;
  s23=99392.6167007885;
  s24=94120.9580356372;
  s25=108706.911103218;
  s26=104347.884896561;
  s27=98898.8747645635;
  s28=95823.0205265863;
  s29=97495.1151671121;
  s30=93106.7009129038;
  s31=103422.81509802;
  s32=89250.9777055238;
  s33=108031.497600314;
  s34=93863.4800855652;
  s35=93438.609635632;
  s36=97742.2362462676;
  s37=113571.025192505;
  s38=86556.5996515215;
  s39=93765.3001325089;
  s40=110225.357982563;
  s41=101765.147317201;
  s42=87894.8528930778;
  s43=100758.450369176;
  s44=86273.6331124324;
  s45=104331.900527177;
  s46=104758.589966514;
  s47=95794.6215545235;
  s48=108587.949196226;
  s49=79214.5899846219;
  s50=94804.1249808739;
  /* expansion factors N/n: group 1, group 2, and SRS over the whole universe */
  zf1=15/2; zf2=35/3; zf=50/5;
run;

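/* Stratified sampling: enumerate every sample of 2 units from group 1 and 3 units from group 2 */
data univ1;
  set univ;
  array gp1 (*) s1--s15;    /* group 1: 15 units */
  array gp2 (*) s16--s50;   /* group 2: 35 units */
  /* every pair i<j from group 1 */
  do i=1 to (dim(gp1)-1);
    do j=i+1 to dim(gp1);
      /* every triple k<w<x from group 2 */
      do k=1 to (dim(gp2)-2);
        do w=k+1 to (dim(gp2)-1);
          do x=w+1 to (dim(gp2));
            vvgp1s=gp1(i)*zf1 + gp1(j)*zf1;                /* estimated total, group 1 */
            vvgp2s=gp2(k)*zf2 + gp2(w)*zf2 + gp2(x)*zf2;   /* estimated total, group 2 */
            vvstot=vvgp2s+vvgp1s;                          /* estimated universe sum   */
            output;
          end;
        end;
      end;
    end;
  end;
run;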

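/* SRS: enumerate every sample of 5 units from the whole universe */
data univ2;
  set univ;
  array gp1 (*) s1--s50;    /* all 50 units */
  /* every combination i<j<k<w<x */
  do i=1 to (dim(gp1)-4);
    do j=i+1 to (dim(gp1)-3);
      do k=j+1 to (dim(gp1)-2);
        do w=k+1 to (dim(gp1)-1);
          do x=w+1 to (dim(gp1));
            vvtots=(gp1(i)+gp1(j)+gp1(k)+gp1(w)+gp1(x))*zf;   /* estimated universe sum */
            output;
          end;
        end;
      end;
    end;
  end;
run;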

proc means data=univ1;
  var vvstot;
run;

proc means data=univ2;
  var vvtots;
run;

---------------------------------------------------------------------------

According to statistical theory, the expected value of the estimated sum equals the universe sum under both techniques, while the sampling variance under SS should be no greater than under SRS.
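
For reference, the textbook design variances of the expansion estimator of the total, which the enumeration above should reproduce, can be written as below. The symbols N, n, N_h, n_h, S^2 and S_h^2 are notation introduced here for illustration and do not appear in the code; S^2 and S_h^2 are taken to be the universe and within-group variances with an N-1 (respectively N_h-1) denominator.

```latex
\begin{aligned}
V_{\mathrm{SRS}}(\hat{T}) &= N^{2}\left(1-\frac{n}{N}\right)\frac{S^{2}}{n},
  & N &= 50,\; n = 5, \\
V_{\mathrm{SS}}(\hat{T})  &= \sum_{h=1}^{2} N_h^{2}\left(1-\frac{n_h}{N_h}\right)\frac{S_h^{2}}{n_h},
  & (N_1, n_1) &= (15, 2),\; (N_2, n_2) = (35, 3).
\end{aligned}
```

In this notation the expansion factors in the code are zf = N/n, zf1 = N_1/n_1 and zf2 = N_2/n_2.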

What I get is that the variances are very similar, but the SRS variance is slightly smaller than the SS variance!

---------------------------------------------------------------------------

SS

N = 687225

Mean = 4995031.42

Std.Dev. = 210736.37

SRS

N = 2118760

Mean = 4995031.42

Std.Dev. = 208152.14

---------------------------------------------------------------------------
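
One way to cross-check the two standard deviations above is to evaluate the closed-form design variances directly on the univ data set. The step below is a minimal sketch, assuming the usual expansion-estimator formulas: N^2 (1 - n/N) S^2 / n for SRS, and the sum over groups of N_h^2 (1 - n_h/N_h) S_h^2 / n_h for SS. The variable names (s2_all, s2_g1, s2_g2, v_srs, v_ss, sd_srs, sd_ss) are illustrative and not part of the original program.

```sas
/* Closed-form standard deviations of the estimated universe sum.   */
/* VAR() uses an n-1 denominator, which is what the formulas need.  */
data _null_;
  set univ;
  s2_all = var(of s1-s50);     /* S^2 over all 50 units          */
  s2_g1  = var(of s1-s15);     /* S_1^2 within group 1 (N1 = 15) */
  s2_g2  = var(of s16-s50);    /* S_2^2 within group 2 (N2 = 35) */

  /* SRS: n = 5 out of N = 50 */
  v_srs = 50**2 * (1 - 5/50) * s2_all / 5;

  /* SS: 2 out of 15 plus 3 out of 35 */
  v_ss  = 15**2 * (1 - 2/15) * s2_g1 / 2
        + 35**2 * (1 - 3/35) * s2_g2 / 3;

  sd_srs = sqrt(v_srs);
  sd_ss  = sqrt(v_ss);
  put sd_srs= sd_ss=;          /* written to the SAS log */
run;
```

PROC MEANS divides by the number of enumerated samples minus one, so its figures should match the values in the log only up to a negligible factor.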

Could you help me understand where the mistake is?

Thanks for any suggestion.

Gianluca