Sample Allocation to Increase the Expected Number of
Publishable Cells in the Survey of Occupational
Injuries and Illnesses
October 2, 2015
Diem-Tran Kratzke∗
Daniell Toth∗
Abstract
The Survey of Occupational Injuries and Illnesses (SOII) is an establishment survey that provides
annual estimates for the incidence count and rate of employer-reported work-related injuries and
illnesses. Results of the survey are published by industry for the nation and participating states.
Low response rates for some industries within a state result in many of the state industry-level
estimates not being published because of quality and/or confidentiality concerns. The SOII sample
is stratified by state, ownership, industry, and size. The number of sample units from each sampling
stratum is currently determined by the Neyman allocation, which is intended to minimize the
expected sample variance of the estimator for total recordable cases given the fixed sample size.
Our goal is to develop a new sample allocation that increases the publishability of
estimates at the state industry level for the fixed sample size. In this paper, we explore
an allocation method that aims to maximize the expected number of publishable cells
while constraining the variance of the estimator for total recordable cases.
Key Words: Constrained optimization, establishment survey, optimum allocation, generalized
variance function (GVF), stratified sample, gradient descent.
1. Survey Background
The Survey of Occupational Injuries and Illnesses (SOII) is an annual Federal/State program that collects reports of employee injury and illness along with total employment and
total hours worked by all workers from about 240,000 business establishments. The SOII
summary program produces estimates of incidence count and rate of nonfatal injuries and
illnesses in the workplace by geographical area, ownership, and industry for workers in establishments that are in the scope of the survey in the fifty states, the District of Columbia,
Puerto Rico, the Virgin Islands, and Guam. The survey excludes self-employed workers, workers
in agricultural production businesses with fewer than eleven employees, private households,
the U.S. Postal Service, and the federal government. Workers in railroad and mining are in scope but
are not sampled; their data are obtained from other sources. National estimates exclude
Guam, Puerto Rico, and the Virgin Islands.
The SOII uses a stratified simple random sample design. The sampling units are business establishments. We will use the terms “unit” and “establishment” interchangeably in
this paper. Within each survey year, the in-scope establishments are stratified by state,
ownership (private, state government, or local government), and industry. SOII
data are published at this level of stratification. In addition, each estimation stratum is divided
into sampling strata by grouping establishments into five size classes (a size class is defined
by the establishment’s annual average employment, which is the average of employment over
a twelve-month period). A fixed sample size is assigned to each state and ownership code.
Within each sampling stratum, the non-certainty units are sorted by the annual average
employment and units are then selected systematically with a single random start. The
current SOII uses the Neyman allocation method, which minimizes the variance of the estimator for the number (incidence count) of total recordable cases (TRC), because TRC is
highly correlated with the other characteristics being measured.
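As a rough illustration of the Neyman allocation mentioned above, the sketch below splits a fixed sample size across strata in proportion to N_h · S_h, the textbook form of the rule. The function name, the inputs, and the largest-remainder rounding are assumptions made for illustration; the production SOII routine (certainty units, caps at the frame size) is not described here and is not reproduced.

```python
def neyman_allocation(total_n, pop_sizes, std_devs):
    """Split total_n across strata in proportion to N_h * S_h."""
    weights = [N * S for N, S in zip(pop_sizes, std_devs)]
    total_weight = sum(weights)
    # Real-valued shares; this sketch ignores caps at N_h and certainty units.
    shares = [total_n * w / total_weight for w in weights]
    # Round down, then hand out leftover units by largest remainder
    # so that the allocation still sums to total_n.
    alloc = [int(s) for s in shares]
    leftover = total_n - sum(alloc)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - alloc[i],
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Hypothetical strata: frame sizes and standard deviations of TRC
pop_sizes = [1000, 400, 100]
std_devs = [2.0, 5.0, 20.0]
print(neyman_allocation(60, pop_sizes, std_devs))
```

In this made-up example the three products N_h · S_h are equal, so each stratum receives the same share even though the frame sizes differ by a factor of ten.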
∗ U.S. Bureau of Labor Statistics, 2 Massachusetts Ave, N.E., Washington, DC 20212
2. Motivation
At the 2012 Occupational Safety and Health Statistics National Conference, questions of
publishability arose. Some states wanted to know the reason why only a few of their
Target Estimation Industries (TEIs) were published. Target Estimation Industries are
specific industries that participating states request to sample for publication. The Statistical
Methods Group in the Office of Compensation and Working Conditions of the Bureau of
Labor Statistics conducted a review and found that some industries have unusually low
response rates compared to others. In 2013, some states requested to increase the sample
sizes for certain TEIs to account for low response rates. This request raised the question
of how we could optimize the sample allocation by maximizing publishability while keeping
the variance at or below an acceptable level.
The SOII sample is stratified by ownership and industry for participating states into
$H$ strata for estimation purposes. For sampling purposes, each estimation stratum $h$ is
stratified into five size classes $hj$, for $j = 1, \ldots, 5$. Estimates at the state/ownership/industry
level are provided for as many of the $H$ strata as are deemed publishable by program office
economists.
The goal of this research is to find a sample allocation method that will maximize the
number of strata that are predicted to be published based on our model, while keeping the
variance below an acceptable level.
The decision on whether a given stratum-level estimate is publishable takes into account
the perceived quality of the estimate as well as privacy concerns of respondents. However,
exact criteria for making this decision can be complex and are often not available. Therefore,
in order to define an allocation procedure that increases the number of TEIs that get
published, we have to rely on a model of the propensity that a TEI, given an allocation,
will be published. We use historical SOII data to model the probability that a TEI gets
published.
3. Assumptions and Definitions
In order to obtain an allocation to maximize the number of publishable estimates, we need
a model for the probability of publishing estimates for an estimation stratum $h$ given its
sample size, denoted by $n_h = (n_{h1}, \ldots, n_{h5})$, where $n_{hj}$ is the sample size of size class $j$ in
stratum $h$. We denote the probability of publishing stratum $h$ by $p_h(n_h)$.
Based on results of exploratory analyses on the historical SOII data, we determined that
the probability that a given TEI $h$ is published is closely associated with (1) the relative
variance of its TRC estimate, called $RV_h$, and (2) the ratio of the number of usable units to
the number of units in the population, called $u_h$. Usable units are units that provide data
that are used in estimation. The relative variance $RV_h$ and the usable ratio $u_h$ are both
random variables that depend on the sample allocated to TEI $h$. In other words, the model
of the probability that TEI $h$ gets published is a function of the form
$$p_h(RV_h, u_h) = p_h\big(RV_h(n_h), u_h(n_h)\big) = p_h(n_h).$$
To be useful in practice, the model must be based on variables with known values at
the time of allocation, before data are collected. Since the values of $RV_h$ and $u_h$ will not be
known until after the sample is collected, we need to estimate them from historical data.
In estimating $RV_h$ and $u_h$, we make a number of simplifying assumptions.
First, we assume that the relative variance for a state/ownership/industry/size stratum
$hj$ is given by
$$RV_{hj} = \sigma_{hj}^2 / n_{hj},$$
where $\sigma_{hj}$ is a fixed constant over time, so it can be estimated using previous survey results.
Since the relative variance is defined by
$$RV_{hj} = RSE_{hj}^2 = V_{hj} / TRC_{hj}^2,$$
where
• $V_{hj}$ is the variance of the TRC estimate for stratum $h$, size class $j$
• $RSE_{hj}$ is the relative standard error of the TRC estimate for stratum $h$, size class $j$
• $TRC_{hj}$ is the total recordable cases for stratum $h$, size class $j$
• $n_{hj}$ is the sample size for stratum $h$, size class $j$,
we can then write:
$$V_{hj} = \sigma_{hj}^2 \cdot \frac{1}{n_{hj}} \cdot TRC_{hj}^2.$$
The variance for the state/ownership/industry stratum $h$ is the sum of the variances for
each size class over all five size classes,
$$V_h = \sum_{j=1}^{5} V_{hj} = \sum_{j=1}^{5} \sigma_{hj}^2 \cdot \frac{1}{n_{hj}} \cdot TRC_{hj}^2. \tag{1}$$
We can compute $RV_h$, the relative variance for stratum $h$, by
$$RV_h = V_h / TRC_h^2, \tag{2}$$
where $TRC_h = \sum_{j=1}^{5} TRC_{hj}$.
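Equations (1) and (2) can be evaluated directly once values for the constants are available. The following is a minimal sketch with made-up inputs; the function and variable names are illustrative and the numbers are not SOII data.

```python
def stratum_variance(sigma2, n, trc):
    """Equation (1): V_h = sum_j sigma2_hj * (1 / n_hj) * TRC_hj**2."""
    return sum(s2 / nj * t ** 2 for s2, nj, t in zip(sigma2, n, trc))


def relative_variance(sigma2, n, trc):
    """Equation (2): RV_h = V_h / TRC_h**2, with TRC_h = sum_j TRC_hj."""
    trc_h = sum(trc)
    return stratum_variance(sigma2, n, trc) / trc_h ** 2


# Hypothetical values for the five size classes of one stratum
sigma2 = [1.0, 1.0, 1.0, 1.0, 1.0]   # sigma_hj squared
n = [4, 4, 4, 4, 4]                  # sample sizes n_hj
trc = [2.0, 2.0, 2.0, 2.0, 2.0]      # TRC_hj

v_h = stratum_variance(sigma2, n, trc)      # 5 * (1/4 * 4) = 5.0
rv_h = relative_variance(sigma2, n, trc)    # 5.0 / 10.0**2 = 0.05
print(v_h, rv_h)
```

Doubling every n_hj halves both V_h and RV_h, which is the lever the allocation has over the relative variance.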
Secondly, we assume that the probability that a unit responds to the survey given that
it was selected in the sample is the same for all units in a given size class $j$ in TEI $h$ and
that this response rate is fixed over time. The response rate for each sampling stratum, $r_{hj}$, can then
be estimated from previous survey results.
The sample response rate for size class $j$ in TEI $h$, $r_{hj}$, is defined as the ratio of the
number of usable units to the number of sample units in that sampling stratum. Usable
units are sample units that responded and whose data were used in estimation. Given the
fixed response rate $r_{hj}$, the ratio of the number of usable units to the number of frame units
at the TEI level $h$ is computed by
$$u_h = \frac{\sum_{j=1}^{5} r_{hj} \cdot n_{hj}}{\sum_{j=1}^{5} N_{hj}}, \tag{3}$$
where $N_{hj}$ is the number of frame units in size class $j$ of TEI $h$.
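A small numeric sketch of equation (3) follows. The response rates, sample sizes, and frame counts are invented for illustration only.

```python
def usable_ratio(resp_rates, n, frame_sizes):
    """Equation (3): expected usable units divided by frame units for one TEI."""
    expected_usable = sum(r * nj for r, nj in zip(resp_rates, n))
    return expected_usable / sum(frame_sizes)


# Hypothetical values for the five size classes of one TEI
resp_rates = [0.5, 0.6, 0.7, 0.8, 0.9]    # r_hj
n = [10, 10, 10, 10, 10]                  # n_hj
frame_sizes = [100, 80, 60, 40, 20]       # N_hj

u_h = usable_ratio(resp_rates, n, frame_sizes)   # 35 expected usable / 300 frame units
print(u_h)
```

Because the denominator is fixed by the frame, moving a sample unit into a size class with a higher response rate raises u_h.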
We use the logistic regression model
$$p_h(n_h) = \big[1 + \exp\{-X_h(n_h)\beta\}\big]^{-1}, \tag{4}$$
where $\beta$ is the column vector of parameters $(\beta_1, \beta_2, \beta_3)'$, estimated from the historical SOII
data, and
$$X_h(n_h) = \big(RV_h(n_h),\; u_h(n_h),\; RV_h(n_h)\,u_h(n_h)\big),$$
to model the probability of TEI $h$ being published.
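To make the dependence of $p_h$ on the allocation concrete, the sketch below chains equations (1) through (4) for a single TEI. All inputs, including the coefficient vector, are hypothetical. The optional `alpha` argument is an assumption added for convenience: equation (4) as written has no intercept, so it defaults to zero.

```python
import math


def publish_probability(n, sigma2, trc, resp, frame, beta, alpha=0.0):
    """Evaluate p_h(n_h) for one TEI from a five-element allocation n.

    beta = (beta1, beta2, beta3) multiplies (RV_h, u_h, RV_h * u_h).
    alpha is an optional intercept (an assumption; not part of equation (4)).
    """
    trc_h = sum(trc)
    # Relative variance, equations (1) and (2)
    rv = sum(s2 / nj * t ** 2 for s2, nj, t in zip(sigma2, n, trc)) / trc_h ** 2
    # Usable ratio, equation (3)
    u = sum(r * nj for r, nj in zip(resp, n)) / sum(frame)
    xb = alpha + beta[0] * rv + beta[1] * u + beta[2] * rv * u
    return 1.0 / (1.0 + math.exp(-xb))


# Illustrative inputs only (not SOII data or fitted values)
sigma2 = [4.0, 3.0, 2.0, 1.0, 1.0]
trc = [10.0, 20.0, 30.0, 20.0, 20.0]
resp = [0.6, 0.7, 0.8, 0.9, 0.9]
frame = [500, 200, 100, 50, 20]
beta = (-10.0, -2.0, 12.0)

p_small = publish_probability([5, 5, 5, 5, 5], sigma2, trc, resp, frame, beta)
p_large = publish_probability([20, 15, 10, 8, 5], sigma2, trc, resp, frame, beta)
print(p_small, p_large)
```

With these made-up coefficients the larger allocation lowers the relative variance enough to raise the predicted probability.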
In order to obtain a sample allocation that will result in collecting data that allows more
of the $H$ stratum-level estimates to be published, we allocate sample units in a way that
maximizes
$$\sum_{h=1}^{H} \hat{p}_h(n_h), \tag{5}$$
where $\hat{p}_h(n_h)$ is the estimated probability that TEI $h$ will be published given a sample
allocation of $n_h$, assuming the logistic model (4).
The expected number of published stratum-level estimates is $\sum_h p_h(n_h)$, where $p_h(n_h)$ is
the true probability that the estimate for TEI $h$ will be published, given that the sample size
for each size class $j$ is $n_{hj}$. If $\hat{p}_h(n_{h1}, \ldots, n_{h5})$ does a good job of estimating $p_h(n_{h1}, \ldots, n_{h5})$,
then the resulting sample allocation will come close to maximizing the expected number of
published stratum-level estimates.
The maximization is done under the constraints that the total sample size $n$ is fixed,
$$n = \sum_{h=1}^{H} \sum_{j=1}^{5} n_{hj},$$
and the relative standard error is at or below an acceptable level,
$$\sigma_{hj}\, n_{hj}^{-1/2} \le \sigma_M,$$
where $\sigma_M$ is a fixed constant set before the allocation procedure.
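The objective (5) and the two constraints translate into a short feasibility check. The sketch below assumes that allocations and the constants $\sigma_{hj}$ are stored in dictionaries keyed by (h, j); that data layout and the function names are illustrative choices only.

```python
import math


def expected_published(probs):
    """Objective (5): sum of predicted publication probabilities over TEIs."""
    return sum(probs)


def is_feasible(alloc, sigma, total_n, sigma_max):
    """Check the fixed-total and relative-standard-error constraints.

    alloc and sigma are dicts keyed by (h, j); sigma holds sigma_hj.
    """
    if sum(alloc.values()) != total_n:
        return False
    # sigma_hj * n_hj**(-1/2) must not exceed sigma_M in any sampling cell
    return all(sigma[k] / math.sqrt(alloc[k]) <= sigma_max for k in alloc)


# Hypothetical two-cell example
alloc = {(1, 1): 4, (1, 2): 9}
sigma = {(1, 1): 1.0, (1, 2): 1.5}

print(is_feasible(alloc, sigma, total_n=13, sigma_max=0.5))   # both cells sit exactly at 0.5
print(is_feasible(alloc, sigma, total_n=13, sigma_max=0.4))   # bound violated
print(expected_published([0.9, 0.4, 0.7]))
```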
4. Modeling Probability of Publishability
Since the criteria for deciding whether to publish a given stratum-level estimate are left to
judgment and are not exact, we model the probability $p_h(n_h)$ that an estimate for a given TEI
$h$ is published given its sample size $n_h = \sum_{j=1}^{5} n_{hj}$, using logistic regression.
Our research data include past information for all strata and size classes for participating
states for the four years 2009-2012. We use the glm function in R to fit logistic
regressions and determine the variables that are important in predicting the probability of
publishing. We evaluate different models by using the first three years of data for modeling
and the last year of data for testing our model fit. We identify two stratum-level variables
that seem to drive the decision on publishability: the relative variance of the TRC estimate
($RV_h$) and the usable ratio ($u_h$), which is the ratio of the number of usable units to the number of
frame units.
Recall that formulas (1) and (3) in section 3 show that the variance and usable ratio
variables are functions of quantities that, except for $n_{hj}$ and $N_{hj}$, must be estimated at the time of allocation. We estimate $TRC_{hj}$ and $r_{hj}$ by taking the averages of their
estimates over the four years. We estimate $\sigma_{hj}^2$ in various ways, including by obtaining
the coefficients from regressions of $1/n_{hj}$ on $RV_{hj}$ at different levels (ownership/industry group/size, industry group/size, and size levels). In the end, we find that using
the average of the estimates of $\sigma_{hj}^2 = RSE_{hj}^2 \cdot n_{hj}$ over the four years of data works best.
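The averaging estimator for $\sigma_{hj}^2$ described above is simple enough to sketch. The yearly relative standard errors and sample sizes below are invented, and the helper names are illustrative.

```python
def estimate_sigma2(rse_by_year, n_by_year):
    """Average of RSE_hj**2 * n_hj over the available years for one sampling cell."""
    products = [rse ** 2 * n for rse, n in zip(rse_by_year, n_by_year)]
    return sum(products) / len(products)


def predicted_relative_variance(sigma2_hat, n_new):
    """RV_hj = sigma_hj**2 / n_hj under a candidate sample size n_new."""
    return sigma2_hat / n_new


# Hypothetical four years of history for one sampling cell
rse_by_year = [0.5, 0.4, 0.5, 0.25]
n_by_year = [4, 5, 4, 16]

sigma2_hat = estimate_sigma2(rse_by_year, n_by_year)   # (1.0 + 0.8 + 1.0 + 1.0) / 4
print(sigma2_hat, predicted_relative_variance(sigma2_hat, 19))
```

Once the constant is estimated, the relative variance under any candidate sample size follows by a single division, which is what makes the model usable before data collection.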
We fit the following logistic regression of publishability on relative variance and usable
ratio:
$$\ln\frac{p_h}{1 - p_h} = \alpha + \beta_1 RV_h + \beta_2 u_h + \beta_3 RV_h \cdot u_h.$$
Relative variance and usable ratio, which are not available at the time of allocation, are
estimated using historical data by
$$\widehat{RV}_h = \frac{\sum_{j=1}^{5} \hat{\sigma}_{hj}^2 \cdot \frac{1}{n_{hj}} \cdot \widehat{TRC}_{hj}^2}{\big(\sum_{j=1}^{5} \widehat{TRC}_{hj}\big)^2}$$
and
$$\hat{u}_h = \frac{\sum_{j=1}^{5} \hat{r}_{hj} \cdot n_{hj}}{\sum_{j=1}^{5} N_{hj}},$$
where
• $p_h$ is the probability of being published for stratum $h$
• $\widehat{RV}_h$ is the estimate for $RV_h$
• $\hat{\sigma}_{hj}^2$ is the estimate for $\sigma_{hj}^2$ (average of $n_{hj} \cdot \widehat{RSE}_{hj}^2$ over four years)
• $n_{hj}$ is the sample size in stratum $h$ size class $j$
• $\widehat{TRC}_{hj}$ is the estimate for $TRC_{hj}$ (average of $\widehat{TRC}_{hj}$ over four years)
• $\hat{u}_h$ is the estimate for $u_h$
• $\hat{r}_{hj}$ is the estimate for $r_{hj}$ (average of $\hat{r}_{hj}$ over four years)
• $N_{hj}$ is the population size in stratum $h$ size class $j$
5. Optimum Allocation
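To find the allocation $n = (n_{11}, \ldots, n_{15}, \ldots, n_{H1}, \ldots, n_{H5})$ that maximizes equation (5), we
explore the partial derivative of this sum with respect to $n_{hj}$. This is equal to
$$\frac{\partial p_h}{\partial n_{hj}} = (1 - p_h)\, p_h\, \nabla_j(X_h)\, \beta, \tag{6}$$
where $p_h$ is the function (4) evaluated at $n_h$ and
$$\nabla_j(X_h) = \left( \frac{\partial RV_h}{\partial n_{hj}},\; \frac{\partial u_h}{\partial n_{hj}},\; \frac{\partial (RV_h u_h)}{\partial n_{hj}} \right).$$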
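The components of $\nabla_j(X_h)$ are not written out above. Treating $n_{hj}$ as continuous and differentiating equations (1) through (3) gives $\partial RV_h / \partial n_{hj} = -\sigma_{hj}^2 TRC_{hj}^2 / (n_{hj}^2 TRC_h^2)$ and $\partial u_h / \partial n_{hj} = r_{hj} / \sum_k N_{hk}$, with the interaction term following from the product rule. The sketch below implements equation (6) under those derived expressions and compares it with a central finite difference. The expressions are a derivation from the stated assumptions rather than formulas quoted from the paper, and all inputs are hypothetical.

```python
import math


def rv_and_u(n, sigma2, trc, resp, frame):
    """Relative variance (1)-(2) and usable ratio (3) for one TEI."""
    trc_h = sum(trc)
    rv = sum(s2 / nj * t ** 2 for s2, nj, t in zip(sigma2, n, trc)) / trc_h ** 2
    u = sum(r * nj for r, nj in zip(resp, n)) / sum(frame)
    return rv, u


def prob(n, sigma2, trc, resp, frame, beta):
    """Logistic model (4) with X_h = (RV_h, u_h, RV_h * u_h)."""
    rv, u = rv_and_u(n, sigma2, trc, resp, frame)
    xb = beta[0] * rv + beta[1] * u + beta[2] * rv * u
    return 1.0 / (1.0 + math.exp(-xb))


def grad_prob(n, sigma2, trc, resp, frame, beta):
    """Equation (6): partial derivatives of p_h with respect to each n_hj."""
    rv, u = rv_and_u(n, sigma2, trc, resp, frame)
    p = prob(n, sigma2, trc, resp, frame, beta)
    trc_h, frame_h = sum(trc), sum(frame)
    grads = []
    for j in range(len(n)):
        d_rv = -sigma2[j] * trc[j] ** 2 / (n[j] ** 2 * trc_h ** 2)
        d_u = resp[j] / frame_h
        d_rvu = u * d_rv + rv * d_u          # product rule for RV_h * u_h
        grads.append((1.0 - p) * p * (beta[0] * d_rv + beta[1] * d_u + beta[2] * d_rvu))
    return grads


def numeric_grad(n, sigma2, trc, resp, frame, beta, eps=1e-4):
    """Central finite differences, used only to check grad_prob."""
    out = []
    for j in range(len(n)):
        up, dn = list(n), list(n)
        up[j] += eps
        dn[j] -= eps
        diff = prob(up, sigma2, trc, resp, frame, beta) - prob(dn, sigma2, trc, resp, frame, beta)
        out.append(diff / (2 * eps))
    return out


# Illustrative inputs only (not SOII data)
n = [20.0, 15.0, 10.0, 8.0, 5.0]
sigma2 = [4.0, 3.0, 2.0, 1.0, 1.0]
trc = [10.0, 20.0, 30.0, 20.0, 20.0]
resp = [0.6, 0.7, 0.8, 0.9, 0.9]
frame = [500, 200, 100, 50, 20]
beta = (-12.0, -2.5, 13.5)

analytic = grad_prob(n, sigma2, trc, resp, frame, beta)
numeric = numeric_grad(n, sigma2, trc, resp, frame, beta)
print(max(abs(a - b) for a, b in zip(analytic, numeric)))
```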
We find that under the specified constraints that $\widehat{RV}_h \le M$ and $n$ is fixed, optimizing
equation (5) is not possible with numeric optimizers or by solving directly with the Lagrange method. We implement a numerical routine to find an allocation that satisfies our
constraints and increases the expected number of published cells. The routine we adopt
reduces $n_{hj}$ for sampling cells with relatively small values of the partial derivative $\partial p_h / \partial n_{hj}$ and
increases $n_{hj}$ for sampling cells with relatively large partial derivatives. This is done using
the following algorithm:
1. For a given allocation $n$, compute $\partial p_h / \partial n_{hj}$ for all $n_{hj}$.
2. Find the vector $U$ of $n_{hj}$ that have values of $\partial p_h / \partial n_{hj}$ in the top $\alpha\%$ and $n_{hj} < \min(N_{hj}, 800)$.
3. Find the vector $B$ of $n_{hj}$ that have values of $\partial p_h / \partial n_{hj}$ in the bottom $\alpha\%$ and $n_{hj} > \min(N_{hj}, 2)$.
4. Randomly choose a value in $B$ from which to subtract one and a value in $U$ to which
to add one.
5. If the new allocation $n'$ satisfies the conditions, repeat the above steps 1-4 with $n'$.
Otherwise, repeat the above steps 1-4 with $n$.
This algorithm is repeated until the value of equation (5), given $n'$, begins to stabilize.
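The five steps above can be sketched as a short routine. This is a minimal illustration, not the production program: the gradient and feasibility functions are passed in as arguments, the loop runs for a fixed number of iterations instead of monitoring equation (5) for stabilization, and the toy objective at the bottom is a hypothetical stand-in for the sum of predicted probabilities.

```python
import math
import random


def swap_candidate(alloc, grads, frame, pct, rng, n_cap=800, n_floor=2):
    """Steps 2-4: move one unit from a low-gradient cell to a high-gradient cell.

    alloc, grads, frame are dicts keyed by sampling cell.
    Returns a new allocation dict, or None when no eligible pair exists.
    """
    ranked = sorted(alloc, key=lambda k: grads[k])
    m = max(1, int(len(ranked) * pct / 100))
    # B: bottom pct% of partial derivatives that can still give up a unit
    bottom = [k for k in ranked[:m] if alloc[k] > min(frame[k], n_floor)]
    # U: top pct% of partial derivatives that can still take a unit
    top = [k for k in ranked[-m:] if alloc[k] < min(frame[k], n_cap)]
    if not bottom or not top:
        return None
    src, dst = rng.choice(bottom), rng.choice(top)
    if src == dst:
        return None
    cand = dict(alloc)
    cand[src] -= 1
    cand[dst] += 1
    return cand


def reallocate(alloc, grad_fn, feasible, frame, pct=25, iters=200, seed=0):
    """Repeat steps 1-5 for a fixed number of iterations (an assumption)."""
    rng = random.Random(seed)
    current = dict(alloc)
    for _ in range(iters):
        cand = swap_candidate(current, grad_fn(current), frame, pct, rng)
        # Step 5: keep the move only if the constraints still hold
        if cand is not None and feasible(cand):
            current = cand
    return current


# Toy problem (hypothetical): maximize sum of w_k * sqrt(n_k) with n_k >= 3
weights = {"a": 1.0, "b": 1.0, "c": 4.0, "d": 4.0}
frame = {k: 100 for k in weights}
start = {k: 10 for k in weights}


def objective(alloc):
    return sum(weights[k] * math.sqrt(alloc[k]) for k in alloc)


def grad_fn(alloc):
    return {k: weights[k] / (2.0 * math.sqrt(alloc[k])) for k in alloc}


def feasible(alloc):
    return all(v >= 3 for v in alloc.values())


result = reallocate(start, grad_fn, feasible, frame)
print(result, objective(start), objective(result))
```

Each accepted move shifts exactly one unit between cells, so the total sample size is preserved by construction and only the variance-type constraint needs to be rechecked.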
6. Empirical Results
6.1 Logistic Regression
To determine the best logistic regression model for our purposes, we use the first three years
of data (2009-2011) to model, pooling data from all states and territories. We determine
that the relative variance of the TRC estimate, the ratio of usable units to frame units, and
their interaction are significant factors in predicting the probabilities of being published.
The coefficients for our logistic regression model fitting the log odds of an industry being
published on these three variables are as follows:
$$\ln\frac{p_h}{1 - p_h} = 2.83 - 12.29\,RV_h - 2.46\,u_h + 13.53\,(RV_h \cdot u_h).$$
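For a sense of scale, the fitted equation can be evaluated at a few points. The coefficients below are those reported above; the input values of $RV_h$ and $u_h$ are hypothetical.

```python
import math


def fitted_probability(rv, u):
    """Predicted probability of publication from the fitted log-odds equation."""
    logit = 2.83 - 12.29 * rv - 2.46 * u + 13.53 * rv * u
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical inputs: relative variance 0.10 or 0.05, usable ratio 0.20
p_base = fitted_probability(0.10, 0.20)      # roughly 0.80
p_lower_rv = fitted_probability(0.05, 0.20)  # smaller RV_h, same u_h
print(round(p_base, 3), round(p_lower_rv, 3))
```

The marginal effect of $RV_h$ on the log odds is $-12.29 + 13.53\,u_h$, so under this fit a lower relative variance raises the predicted probability whenever $u_h$ is below about 0.91.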
We apply the above fitted model to the 2012 data of all states and territories to measure
how well the model predicts publishability. We compute the predicted probability of each
TEI being published by using the estimated coefficients of our fitted model and compare
the sum of probabilities of being published over all TEIs to the actual number of TEIs being
published in the 2012 sample data. The evaluative statistics are the mean squared error
(MSE) computed at the macro and micro levels as follows:
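$$\text{Macro MSE} = \sqrt{\frac{\big(\sum_h \hat{p}_h - \sum_h \mathrm{publ}_h\big)^2}{\big(\sum_h \mathrm{publ}_h\big)^2}} \tag{7}$$
and
$$\text{Micro MSE} = \frac{\sum_h (\hat{p}_h - \mathrm{publ}_h)^2}{h}, \tag{8}$$
where $\mathrm{publ}_h$ equals 1 if stratum $h$ was published, 0 otherwise. Our model yields a macro
MSE of about 1.3% and a micro MSE of about 14%.
6.2 Optimization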
We test our optimization scheme on private ownership in five states that have expressed
a desire to increase their publishability rates in the past. We use the Neyman allocation
sample sizes as the initial values for our optimization program. We compute the partial
derivatives of the probability of a TEI being published with respect to the sample size at
each size class level on 2012 data for each state. At each iteration step of the optimization
program, we change the sample sizes according to the routine described in section 5. We
run the iterations until the increase in the sum of predicted probabilities is negligible. In
all states except one, the change in the sum of probabilities of being published between
1,000 and 3,000 iterations is less than 0.5%. For the remaining state, we run into a
convergence problem after 500 iterations. However, the change in the sum of probabilities
of being published between 400 and 500 iterations for this state is also about 0.5%.
Figure 1 shows the distribution of the predicted probabilities $\hat{p}_h$ over the non-published
TEIs (the left-hand column labeled 0 on the horizontal axis) and over the published TEIs
(the right-hand column labeled 1 on the horizontal axis) for each state when we fit the
model on 2012 data. The distribution of $\hat{p}_h$ over the published TEIs concentrates more on
the upper portion of the probability scale from 0 to 1 (on the vertical axis). This indicates
that our model gives higher probabilities of being published to the TEIs that were actually
published, which is reasonable.
Figure 1: Distribution of $\hat{p}_h$ for unpublished and published TEIs
Figure 2 compares the sum and the mean of the predicted probabilities of being published
when we apply the fitted model to the 2012 data over all TEIs under the current allocation
method (the Neyman method, which minimizes variance) and the proposed optimal
allocation method (our method, which aims to maximize publishability). The means are shown
for published TEIs vs. non-published TEIs.
We see that $\sum_h \hat{p}_h$ is higher for the proposed optimal method in all five states, which
confirms that our method generally increases the chances of TEIs being published. We
also note that under the current allocation, the mean of $\hat{p}_h$ in published TEIs is higher
than the mean of $\hat{p}_h$ in non-published TEIs, with the difference ranging from 22% to 31%,
which indicates that our model works reasonably well in predicting probabilities of being
published. When we compare $\hat{p}_h$ under the two methods, we see that we are successful in
increasing the mean of $\hat{p}_h$ for both published and non-published TEIs. The mean of $\hat{p}_h$
increases from about 0.80 to about 0.85 for published TEIs, and from about
0.55 to about 0.80 for non-published TEIs.
Figure 2: Comparing $\hat{p}_h$ of Current and Optimal Allocation
Next, we use the predicted probabilities of being published obtained from the model to
make predictions on publishability with 0.5 as the cut-off point. That is, we consider any
TEI having a predicted probability of being published greater than 0.5 as publishable. Thus
we can compare the number of TEIs deemed publishable by the model to the number of
TEIs that were actually published. The results are shown in Figure 3 under the columns
“Actual Current Allocation” and “Predicted Current Allocation.”
The predicted numbers of publishable TEIs are very close to the actual numbers of TEIs
being published in the sample year 2012 for the first three states. In fact, the prediction is exact for
state #2. For the last two states, the predicted numbers of publishable TEIs are much
higher than the actual numbers of TEIs being published, indicating that the cut-off point
of 0.5 tends to overestimate the number of publishable TEIs.
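The cut-off comparison, together with the evaluation statistics of section 6.1, can be sketched in a few lines. The probabilities and publication indicators below are invented. The macro statistic follows equation (7), which reduces to the absolute relative difference between the two sums; the micro statistic is taken here as the mean squared difference over strata, which is an assumption about the denominator of equation (8).

```python
def predicted_publishable(probs, cutoff=0.5):
    """Number of TEIs whose predicted probability exceeds the cut-off."""
    return sum(1 for p in probs if p > cutoff)


def macro_mse(probs, published):
    """Equation (7): relative gap between the expected and actual published counts."""
    return abs(sum(probs) - sum(published)) / sum(published)


def micro_mse(probs, published):
    """Mean squared gap per stratum (assumed denominator: number of strata)."""
    return sum((p - y) ** 2 for p, y in zip(probs, published)) / len(probs)


# Hypothetical predictions and outcomes for five TEIs
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
published = [1, 1, 0, 1, 0]

print(predicted_publishable(probs), sum(published))
print(macro_mse(probs, published), micro_mse(probs, published))
```

In this made-up example the thresholded count matches the actual count even though two individual TEIs are misclassified, which is why the macro and micro statistics are reported separately.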
Figure 3: Actual vs Predicted publishability and Current vs. Optimal predictions
We can also compare the predicted numbers of publishable TEIs of the current allocation
to those of the proposed allocation. These results are shown in Figure 3 under the columns
“Predicted Current Allocation” and “Predicted Optimal Allocation.” Here we see that the
predicted number of publishable TEIs in the proposed allocation is higher than the predicted
number of publishable TEIs in the current allocation for all five states. The increase in the
number of publishable TEIs ranges from 5 to 11 and the percentage increase ranges from
8% to 17%.
7. Conclusions
In conclusion, the preliminary results for publishability in the five states are very promising.
The predictions for publishability based on our logistic regression model seem to work
reasonably well. Although we tend to overestimate the number of TEIs being published,
the prediction and the actual number are positively correlated. That is, a higher predicted
probability tends to yield a higher percentage of cells being published. We are able to increase
the sum of predicted probabilities for a number of states while keeping the variance below
an acceptable level. We expect that this increase will translate to increasing the number of
publishable TEIs if we were to implement our optimal allocation method in production.
We need to do further research to make sure our optimal allocation method works well
in practice. That is, we need to obtain results for all states and ownership types to make
sure the program provides expected results under different scenarios and that the positive
results are not limited to only a handful of states. We also need to test our optimal allocation
in production to address unseen problems that may arise in real situations.
Disclaimer
Any opinions expressed in this paper are those of the authors and do not constitute policy
of the Bureau of Labor Statistics.