Statistical terms and concepts

A
absolute risk

The probability of an outcome occurring in a specific group. It is calculated by dividing the number of events that occur in a group by the number of people in that group.

For example, women in Australia have an absolute risk of about 14% of developing breast cancer in their lifetime. That means out of every 100 women, about 14 will develop the disease at some point in their life.

Absolute risk is different from relative risk, where 2 groups of people are compared. For example, women who smoke might be 20% more likely to get breast cancer than women who do not smoke, meaning their risk is increased 20% relative to the risk of people who do not smoke. This can also be expressed as a relative risk of 1.2 or 120%.
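
As a rough illustration, the two measures could be computed as in the Python sketch below. The function names are hypothetical, and the risks of 12% and 10% for the two comparison groups are made-up values chosen only to reproduce a relative risk of 1.2; they are not figures from any data source.

```python
def absolute_risk(events: int, group_size: int) -> float:
    """Number of events in a group divided by the number of people in that group."""
    return events / group_size


def relative_risk(risk_group: float, risk_comparison: float) -> float:
    """Risk in one group divided by the risk in the comparison group."""
    return risk_group / risk_comparison


# About 14 in every 100 women develop breast cancer in their lifetime.
lifetime_risk = absolute_risk(14, 100)

# Hypothetical risks, chosen so that one group is 20% more likely to be affected.
rr = relative_risk(absolute_risk(12, 100), absolute_risk(10, 100))

print(f"absolute risk: {lifetime_risk:.0%}")
print(f"relative risk: {rr:.1f}")
```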

See also relative risk and incidence.

administrative data

Data collected by government departments, businesses and other organisations as part of their delivery of services or conducting their business. It can include operational, financial, clinical, and patient/client management data. While the data are primarily collected for non-statistical purposes, the ‘secondary use’ of the data can provide a rich source of information about the health and wellbeing of Australians and information about how Australians interact with health and other community services.

age-specific rate

The rate calculated for a specific age group. The numerator and denominator relate to the same age group. For example, the age-specific rate of smoking among people aged 25–29 is 160 per 1,000 (which means that 160 people in every 1,000 in this age group smoke).

See also rate, crude rate.

age-standardisation

A method of removing the influence of age when comparing populations with different age structures. This can be useful in understanding differences between populations as people of different ages often have different health or welfare outcomes.

For example, a community with more older people will likely have more hospitalisations than one with more younger people. The age structures of these different communities are adjusted for, and then the hospitalisations that would have occurred within that structure are calculated and compared.
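
One common way to do this is direct standardisation, sketched below in Python. This is an assumption about method, since the entry does not name one, and the two communities, their counts and the standard population are all hypothetical. Both communities have the same age-specific rates, so their crude rates differ only because of their age structures.

```python
# Hypothetical data: age group -> (hospitalisations, population).
community_a = {"under 65": (80, 8000), "65 and over": (200, 2000)}
community_b = {"under 65": (20, 2000), "65 and over": (800, 8000)}

# Hypothetical standard population, used as a common age structure.
standard_population = {"under 65": 5000, "65 and over": 5000}


def crude_rate(groups, per=1000):
    """Total events divided by total population, with no age adjustment."""
    events = sum(e for e, _ in groups.values())
    population = sum(p for _, p in groups.values())
    return events * per / population


def age_standardised_rate(groups, standard, per=1000):
    """Apply each age-specific rate to the standard population, then sum."""
    expected = sum(e / p * standard[age] for age, (e, p) in groups.items())
    return expected * per / sum(standard.values())


for name, groups in (("A", community_a), ("B", community_b)):
    print(name, crude_rate(groups), round(age_standardised_rate(groups, standard_population), 1))
```

In this sketch the older community has a much higher crude rate, but the age-standardised rates are the same, which is what the adjustment is meant to reveal.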

See also age-standardised rate, crude rate and age-specific rate.

age-standardised death rate

Number of deaths in a specific time period, adjusted to take account of the population age structure. Usually expressed per 1,000 population or 100,000 population.

May also be called an ‘age-standardised mortality rate’.

See also age-standardisation, age-specific rate and death rate.

age-standardised rate

A rate that is adjusted to account for differences in the age structure of the populations being compared.

See also age-standardisation, crude rate and age-specific rate.

aggregated data

Data that have been collected and combined from multiple individuals/units.

These data can then be used to report at a population or group level. For example, to report on Medicare rebates by state/territory, the data on rebates claimed by individuals are combined. The individual data are not presented, just the rebate totals.
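
A minimal Python sketch of this kind of aggregation follows. The claim records, field names and amounts are hypothetical and only illustrate the step from unit records to group totals.

```python
from collections import defaultdict

# Hypothetical unit records: one row per individual claim.
claims = [
    {"person_id": 1, "state": "NSW", "rebate": 40.0},
    {"person_id": 2, "state": "NSW", "rebate": 25.5},
    {"person_id": 3, "state": "Vic", "rebate": 60.0},
    {"person_id": 4, "state": "Qld", "rebate": 10.0},
    {"person_id": 5, "state": "Vic", "rebate": 15.0},
]

# Combine the individual rebates into one total per state/territory.
totals_by_state = defaultdict(float)
for claim in claims:
    totals_by_state[claim["state"]] += claim["rebate"]

# Only the totals are reported; no individual record appears in the output.
for state, total in sorted(totals_by_state.items()):
    print(state, total)
```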

See also unit record data.

associated cause(s) of death

Cause(s) listed on the Medical Certificate of Cause of Death, other than the underlying cause of death. They include the immediate cause, any intervening causes, and conditions that contributed to the death but were not related to the disease or condition causing death.

See also cause(s) of death, underlying cause of death.
Australian Statistical Geography Standard (ASGS)
A social geography, developed to reflect the location of people and communities. It is used for the publication and analysis of official statistics and other data. For example:
- Statistical Areas Level 3 (SA3s) are designed for the output of regional data and most have populations between 30,000 and 130,000 people.
- Statistical Areas Level 4 (SA4s) are designed for the output of a variety of regional data and represent labour markets and the functional area of Australian capital cities. Most SA4s have a population of over 100,000 people.
The ASGS is updated every 5 years to account for growth and change in Australia’s population, economy and infrastructure (ABS 2021).
B

benchmark

A standard or point of reference for measuring quality or performance. For example, the agreed benchmark for Staphylococcus aureus bloodstream infection is no more than 1 case per 10,000 days of hospital patient care.

burden of disease (and injury)

The quantified impact of a disease or injury on a population using the disability-adjusted life years (DALYs) measure, which measures how much healthy life has been lost through premature death or living with illness or injury.

See also disability-adjusted life years (DALYs), years lived with disability (YLDs), years of life lost (YLLs).
C
causation

Causation indicates that an outcome is the result of an event or treatment. This is known as a causal relationship and may also be referred to as cause and effect. Establishing causation in statistics is important but difficult, as many events have other factors driving them. For example, smoking is a major cause of lung cancer. However, data also show that lung cancer death rates are higher in regional and remote areas. Other factors (aside from smoking rates) may help to account for this, such as the age of the population and reduced access to timely and specialised health care in regional and remote areas.

See also correlation and confounding variable.

cause(s) of death

All diseases, morbid conditions or injuries that either resulted in or contributed to death – and the circumstances of the accident or violence that produced any such injuries – that are entered on the Medical Certificate of Cause of Death. Causes of death are commonly reported by the underlying cause of death. 

See also associated cause(s) of death and underlying cause of death.

cohort

A group of data units sharing a common experience or characteristic. For example, the cohort for a study may be people aged 25–34, defence force veterans, or people born in Australia.

confidence interval

A way of presenting the level of certainty in a statistic. It involves calculating a range that the true value is expected to fall within, given a specified level of confidence. For example, a 95% confidence interval of 0.3–0.8% indicates that we can be 95% confident that the true value is between 0.3% and 0.8%.

See also confidence level.

confidence level

The specified degree of confidence attached to a confidence interval. A confidence interval provides a range of values within which the true population parameter is likely to lie, based on sample data. For example, a 95% confidence level means that if you were to take many random samples and compute a 95% confidence interval for each, 95% of those intervals would contain the true parameter value.
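
The repeated-sampling idea can be checked with a small simulation. The Python sketch below makes several assumptions that are not part of the definition: a normally distributed population with a known standard deviation, hypothetical values for the true mean and sample size, and a fixed random seed so the run is repeatable.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population and sampling settings.
TRUE_MEAN, SIGMA, SAMPLE_SIZE, REPEATS = 50.0, 10.0, 100, 2000

z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% confidence level
half_width = z * SIGMA / SAMPLE_SIZE ** 0.5

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(SAMPLE_SIZE)]
    centre = sum(sample) / SAMPLE_SIZE
    # Count the intervals that contain the true population mean.
    if centre - half_width <= TRUE_MEAN <= centre + half_width:
        covered += 1

coverage = covered / REPEATS
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should be close to 0.95, although it will vary slightly with the seed and the number of repeats.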

See also confidence interval.
confidentialisation
Confidentialisation means:
- Removing details that directly identify someone (like name or address), an organisation or a business.
- Checking and reducing the chance that someone, an organisation or a business could still be identified indirectly (for example, by combining age and location).
confounding variable

A factor other than the one being studied that is associated with what you are measuring (dependent variable), and the factor being studied (independent variable), potentially leading to a misunderstanding of the influence of the independent variable. So, the confounding variable acts to confuse our understanding of the effect of the factor being studied.

For example, a hypothesis that people who drink coffee (independent variable) are more likely to develop heart disease (dependent variable) than people who do not drink coffee may be influenced by another factor – such as smoking. People who drink coffee may be more likely to smoke or smoke more cigarettes than people who do not drink coffee. In this example, smoking is the confounding variable.

correlation

The values of variables that are correlated fluctuate in relation to each other. Correlation refers to the relationship independent of causation. Things with a causal relationship are correlated, but things that are correlated are not necessarily causally related. Correlation does not prove that one thing causes another. For example, ice cream sales and violent crime rates both increase in summer. These 2 events are correlated but one does not cause the other. Instead, they are affected by a third variable – hot weather.

See also causation and confounding variable.

crude rate

A crude rate is one that is calculated using unadjusted data. For example, consider the crude incidence rate for heart disease in 2 populations. Population A has 50,000 people and 1,000 new cases of heart disease, so the crude rate is 20 per 1,000 (1,000/50,000 = 0.02). Population B has 60,000 people and 1,600 new cases of heart disease, so the crude rate is 27 per 1,000 (1,600/60,000 = 0.027).

Population B has a higher crude rate (27 per 1,000 compared with 20 per 1,000 for Population A). While this rate tells us the overall average rate of disease, it does not consider factors that could affect the rate of disease in a population. For example, age is a risk factor for heart disease – that is, the likelihood of having heart disease increases as you age. So, in this example, Population B might comprise more elderly (or older) people (LaMorte 2016).
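
The arithmetic in this example can be reproduced with a short Python sketch. The function name and the `per` parameter are hypothetical.

```python
def crude_rate(new_cases: int, population: int, per: int = 1000) -> float:
    """New cases divided by population, scaled to a rate per `per` people."""
    return new_cases * per / population


rate_a = crude_rate(1_000, 50_000)  # Population A
rate_b = crude_rate(1_600, 60_000)  # Population B

print(round(rate_a), round(rate_b))
```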

See also age-specific rate, age-standardisation, age-standardised rate.
D

data

Measurements or observations that are collected as a source of information (ABS n.d.a).

data custodian

A person or position/organisation with delegation to exercise overall responsibility for a specific data collection, in accordance with legislation, policies, guidelines and any specific conditions for use applicable to that data collection. Subject to these requirements, a data custodian has the power to release data to other bodies or people.

data item

A data item is a characteristic (or attribute) of a data unit that is measured or counted, such as a person’s height or country of birth. A data item is also called a variable or data element because the characteristic may vary between data units and over time (ABS n.d.a).

data linkage/linked data

Data linkage connects data from multiple sources to better understand and compare complex interactions Australians have with health and welfare services.

Data are collected whenever a person engages with a service, from seeing their doctor and collecting a script, to presenting to hospital or disability support service. These are stored in data collections based on service type, such as hospital or disability data collections.

Linking data assets is one way to bring together pieces of data that are collected in different places to enable us to look at them all together. This makes it easier to conduct complex cross-sector and cross-jurisdictional analysis. This can then be used to improve policies, for example to provide better access to services and inform treatment pathways and care for chronic disease management, for the benefit of all Australians.

For example, if you have data about people’s exercise routines in one data set and data about their heart health in another, data linkage could help to investigate possible patterns between exercise habits and heart health.

It is important to note that the privacy of individual people is protected through de-identification – a process where a code or key is used so that while a person’s use of services can be followed through multiple data sets, their identity is never revealed.
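
As a simplified sketch of the idea, the Python below links two hypothetical de-identified data sets on a shared linkage code. The codes, field names and values are invented for illustration, and real linkage involves far more than an exact key match.

```python
# Hypothetical de-identified data sets, each keyed by the same linkage code.
exercise = {
    "K01": {"sessions_per_week": 4},
    "K02": {"sessions_per_week": 0},
    "K03": {"sessions_per_week": 2},
}
heart_health = {
    "K01": {"resting_heart_rate": 58},
    "K03": {"resting_heart_rate": 71},
    "K04": {"resting_heart_rate": 80},
}

# Keep only the codes present in both data sets and merge their fields.
linked = {
    key: {**exercise[key], **heart_health[key]}
    for key in sorted(exercise.keys() & heart_health.keys())
}

for key, record in linked.items():
    print(key, record)
```

The linked records carry only the code, so a person's use of services can be followed across both data sets without their name or address appearing.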

Data linkage can also include data collected through other means, such as surveys/questionnaires.

data set

A collection of data. One or several related data sets may be called data collections. For example, the National Prisoner Health Data Collection includes various data items about people who enter and leave prison over a 2-week period, including their clinic death rate.

Also called data collection or data asset.

death rate

The rate of people who die in a population during a given time period. Usually expressed as deaths per 1,000 or 100,000.

Also called mortality rate.

decile

A decile is one of 10 equal groups into which a population can be divided according to the distribution of values of a particular variable. The first decile contains the population members with values in the bottom 10%. The second decile contains the population members with values in the next 10% and so on.

See also quantile, quartile, quintile and percentile. Quartile includes a detailed example.

de-identified

‘De-identification involves removing or altering information that identifies an individual or is reasonably likely to enable their identification’ (OAIC 2024).

dependent variable

What you are measuring in a study. The dependent variable is assumed to be influenced by the value of at least one other variable within the scope of a specified research study question. For example, if looking at the effect of tobacco smoking on the incidence of asthma, the number of new asthma cases is the dependent variable.

See also independent variable.

disability-adjusted life years (DALYs)

A measure of healthy life lost in a population. Presented as years of life lost through premature death or living with disability due to illness or injury. It is the basic unit used in burden of disease and injury estimates. Often used synonymously with health loss.

See also burden of disease (and injury), years of life lost (YLLs), and years lived with disability (YLDs).
E

estimated resident population (ERP)

Official Australian Bureau of Statistics estimate of the Australian population. The ERP is based on Census counts and updated 4 times a year, using births, deaths, and migration data.
G

Gini coefficient

Statistical measure of economic distribution, ranging from 0 to 1. Zero represents perfectly equal income distribution (all people have the same amount of money), while 1 represents perfectly unequal income distribution (one person has all available money).
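
One way to compute the coefficient is from the average absolute difference between every pair of incomes, as in the Python sketch below. The function name and the example incomes are hypothetical, and other equivalent formulas exist.

```python
def gini(incomes):
    """Gini coefficient from the sum of absolute differences between all pairs."""
    n = len(incomes)
    total = sum(incomes)
    if n == 0 or total <= 0:
        raise ValueError("need at least one income and a positive total")
    abs_diffs = sum(abs(x - y) for x in incomes for y in incomes)
    return abs_diffs / (2 * n * total)


print(gini([5, 5, 5, 5]))   # everyone has the same amount
print(gini([0, 0, 0, 10]))  # one person has all available money
```

With this formula, a group of n people in which one person holds everything scores (n - 1)/n, so the value approaches 1 as the group gets larger.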
I
identifier

An identifier is a piece of data (like a name, number, letter or symbol) that is sufficient to identify a person, organisation or business. Examples include names, passport or Medicare numbers, and general practice provider numbers.

incidence

The occurrence of a specified event. For example, if 100 people are newly diagnosed with diabetes in a town of 10,000 people during one year, the incidence for that year is 1 per 100 population.

See also prevalence.

independent variable

A variable that is hypothesised to influence an event or state (the dependent variable). For example, if looking at the effect of tobacco smoking on the incidence of asthma, smoking is the independent variable.

See also dependent variable.

Index of Relative Socio-economic Advantage and Disadvantage (IRSAD)

Summarises information about the economic and social conditions of people and households within an area. This index includes both relative advantage and disadvantage measures.

A low score indicates relatively greater disadvantage and a lack of advantage in general. For example, an area could have a low score if there are: many households with low incomes, or many people in unskilled occupations, and few households with high incomes, or few people in skilled occupations.

A high score indicates a relative lack of disadvantage and greater advantage in general. For example, an area may have a high score if there are: many households with high incomes, or many people in skilled occupations, and few households with low incomes, or few people in unskilled occupations.

A number of variables are considered in determining whether an area is advantaged/disadvantaged, including household income, education level, employment status, occupation, home mortgage payments, and whether a person is living with disability (ABS 2023a).

See also Index of Relative Socio-economic Disadvantage (IRSD) and Socio-Economic Indexes for Areas (SEIFA).
Index of Relative Socio-economic Disadvantage (IRSD)
A way of calculating socioeconomic disadvantage of people living in an area using attributes such as income level, educational attainment and employment status. This index can also be used to rank areas based on socioeconomic disadvantage.

A low score indicates relatively greater disadvantage in general. For example, an area could have a low score if there are:
- many households with low income
- many people with no qualifications
- many people in low skill occupations (ABS 2023a).
A high score indicates a relative lack of disadvantage in general. For example, an area may have a high score if there are:
- few households with low incomes
- few people with no qualifications
- few people in low skilled occupations (ABS 2023a).
Other variables in the index include the proportion of people in an area who do not speak English well, the proportion of one-parent families with dependent children, and the proportion of people aged under 70 who have a long-term health condition or disability.

See also Index of Relative Socio-economic Advantage and Disadvantage and Socio-Economic Indexes for Areas (SEIFA).
indicator

A key statistical measure selected to help describe (indicate) a situation concisely to track change, progress and performance; and to act as a guide for decision making. For example, people who do not have adequate, stable housing can find it difficult to participate in society. One indicator of access to stable housing might be the number of people who are experiencing homelessness, and how this has changed over time.

interquartile range

The interquartile range is a measure of the dispersion or spread of the data item. The range contains the middle 50% of the values of a data item. It is equal to the difference between the 3rd and 1st quartiles.

See also quartile for an example.
L

life expectancy

Life expectancy is one of the most common measures of overall health of a population. It is the expected average lifespan for members of a population group given our current understanding of death rates. It is expressed as either the number of years a newborn baby is expected to live, or the expected years of life remaining for a person at a given age. For example, a boy born in 2020–2022 is expected to live to the age of 81.2 years and a girl to 85.3 years.

likelihood

When calculating likelihood, you are trying to determine if you can trust the parameters in a model based on the sample data you have observed (Gupta 2023).

For example, if you toss a coin once, there is equal probability that it will land on heads or tails. There is an expectation that, if the coin is ‘fair’, no matter how many times you toss it, it will land on heads and tails a similar number of times.

You then toss the coin 100 times, and it lands on heads 25 times. You would now say that the likelihood that the coin is fair is quite low. If the coin were fair, you would expect it to land on heads around 50 times.

However, if you toss a coin 100 times and it comes up heads 48 times, there is a high likelihood that the coin is fair (Gupta 2023).
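
The coin example can be made concrete by computing how probable each observed result is if the coin is fair. The Python sketch below assumes independent tosses (a binomial model), and the function name is hypothetical.

```python
from math import comb


def likelihood(p_heads: float, heads: int, tosses: int) -> float:
    """Probability of seeing exactly `heads` heads if each toss lands heads with p_heads."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads) ** (tosses - heads)


fair_given_25 = likelihood(0.5, 25, 100)
fair_given_48 = likelihood(0.5, 48, 100)

print(f"25 heads in 100 tosses, fair coin: {fair_given_25:.2e}")
print(f"48 heads in 100 tosses, fair coin: {fair_given_48:.4f}")
```

Under these assumptions the result with 48 heads is far more compatible with a fair coin than the result with 25 heads.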

See also probability.

logistic regression

Logistic regression is a common method used to analyse the association between a binary outcome (the dependent variable) and a number of exposure variables (the independent variables). It is used in statistics to estimate the probability of an event occurring (such as a long-term health condition) based on the underlying data used to create a model.

It is important to note the modelling shows association between variables but does not explain what is causing the association.
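
To show how a fitted model turns exposure values into an estimated probability, the Python sketch below applies the logistic function to a linear combination of exposures. The intercept, coefficients and exposure variables are hypothetical and are not taken from any real model.

```python
from math import exp


def predicted_probability(intercept, coefficients, exposures):
    """Logistic function applied to intercept + sum of coefficient * exposure."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, exposures))
    return 1 / (1 + exp(-linear))


# Hypothetical fitted model: exposures are age in years and current smoker (0 or 1).
INTERCEPT = -2.0
COEFFICIENTS = [0.03, 0.9]

p_non_smoker = predicted_probability(INTERCEPT, COEFFICIENTS, [50, 0])
p_smoker = predicted_probability(INTERCEPT, COEFFICIENTS, [50, 1])

print(f"{p_non_smoker:.3f} {p_smoker:.3f}")
```

In a model of this form, the exponential of a coefficient is the odds ratio associated with a one-unit increase in that exposure.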

longitudinal data

Data collected from the same people multiple times over a specific period. For example, a longitudinal study could be used to examine the educational progress of the same children every year from Year 1 to Year 6.
M

margin of error

The margin of error (MoE) describes the distance from the population value that the sample estimate is likely to be within for a specified level of confidence. Confidence levels typically used are 90%, 95% and 99%. For example, at the 95% confidence level the MoE indicates that there are about 19 chances in 20 that the estimate will differ by less than the specified MoE from the population value.
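
For a proportion estimated from a simple random sample, a common approximation of the MoE uses the normal distribution, as sketched below in Python. The simple random sample and normal approximation are assumptions, and the estimate of 18% from a sample of 1,000 is a hypothetical input.

```python
from statistics import NormalDist


def margin_of_error(proportion: float, sample_size: int, confidence: float = 0.95) -> float:
    """Normal-approximation margin of error for a sample proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = (proportion * (1 - proportion) / sample_size) ** 0.5
    return z * standard_error


# Hypothetical survey estimate: 18% from a sample of 1,000 people.
for level in (0.90, 0.95, 0.99):
    moe = margin_of_error(0.18, 1000, level)
    print(f"{level:.0%} confidence: 18% plus or minus {moe:.1%}")
```

A higher confidence level gives a wider margin for the same sample.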

mean

A measure of the central tendency (average) of the values of a specified variable. This is obtained by adding together all the observations in a data set and dividing by the number of observations.

For example, the mean of a data set containing numbers 1 to 10 is:

Mean = (1+9+3+6+4+5+7+8+2+10)/10

Mean = 55/10 = 5.5

See also median and mode.

measures of frequencies

Measures of frequency in epidemiology include absolute frequency (number of occurrences of a specified event or value) and relative frequency (proportions, ratios and rates).

Proportion, ratio and rate each compare a numerator (for example, ‘cases’) and denominator (for example, ‘population’). The appropriate use of proportion, ratio or rate depends on the units of measure, and whether the denominator represents the whole group (ABS n.d.b).

See also proportion, rate, rate ratio and ratio.

median

Middle value of a specified variable in a data set (after the numbers have been arranged from least to greatest). If there is an even number of values, the median is the mean of the middle 2 numbers.

For example, the median of a data set containing numbers 11 to 20 is:

11, 12, 13, 14, 15, 16, 17, 18, 19, 20

The median is the average of 15 and 16. That is, (15 + 16)/2 = 31/2 = 15.5.

See also mean and mode.

metadata

Information about how data are defined, collected and structured. It provides meaning and context and helps with interpreting the data.

For example, the data value 17 is meaningless by itself. Is it a street number, a clinical measurement, a test result, the number of services provided, or something else?

When a data value such as 17 is associated with a specific metadata item (e.g. Person—tobacco smoking start age, total years N[NN]), its meaning becomes clear. Now we know it is the age that someone started smoking, in years – and that this particular person started smoking when they were 17.

Once we add information (metadata) about the unit of measurement (for example, years), what the data relates to (for example, a person), and what it is measuring (for example, their age when they started smoking), we get a more useful data value. All these things – the units of measurement, what the data relates to, what it is measuring – are metadata. For more information, see the AIHW’s central metadata repository, METEOR.

mode

Most frequent value of a specified variable in a data set.

For example, the mode of a data set containing:

1, 2, 4, 4, 6, 7, 9, 9, 9, 11, 12 is 9 because 9 occurs 3 times and all the other values occur fewer than 3 times.
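
The worked examples in the mean, median and mode entries can be checked with Python's standard `statistics` module. This is only a convenience sketch; the variable names are hypothetical.

```python
from statistics import mean, median, mode

mean_value = mean([1, 9, 3, 6, 4, 5, 7, 8, 2, 10])          # numbers 1 to 10, in any order
median_value = median(range(11, 21))                        # numbers 11 to 20
mode_value = mode([1, 2, 4, 4, 6, 7, 9, 9, 9, 11, 12])      # 9 occurs 3 times

print(mean_value, median_value, mode_value)
```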

See also mean and median.

mortality rate

See death rate.
N
non-sampling error
Error caused by factors other than those related to sample selection. It refers to the presence of any factor, whether systematic or random, that results in the data values not accurately reflecting the 'true' value for the population.

Non-sampling errors can occur at any stage of a census or sample study and are not easily identified or quantified.

Non-sampling error can include:
- Coverage error: this occurs when a unit in the sample is incorrectly excluded or included, or is duplicated in the sample (for example, a field interviewer fails to interview a selected household or some people in a household).
- Non-response error: this refers to the failure to obtain a response from some unit because of absence, non-contact, refusal, or some other reason.
- Response error: this refers to a type of error caused by respondents intentionally or accidentally providing inaccurate responses.
- Interviewer error: this occurs when interviewers incorrectly record information; are not neutral or objective; influence the respondent to answer in a particular way; or assume responses based on appearance or other characteristics.
- Processing error: this refers to errors that occur in the process of data collection, data entry, coding, editing and output (ABS n.d.e).
See also sample/sampling.
O

odds

The ratio of the probability of occurrence of an event to that of non-occurrence. For example, if, out of 100 patients, 80 had a favourable outcome and 20 had an unfavourable outcome, the odds of a favourable outcome are 80/20 = 4.

odds ratio

An odds ratio is a measure of association between an exposure and an outcome. It represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. For example, odds ratio may be used to show how the odds of developing lung cancer change with exposure to smoking. A value of one means that the exposure does not affect the odds of an outcome, a value of less than one means that the exposure is associated with lower odds of the outcome and a value of greater than one means the exposure is associated with higher odds of the outcome.
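
The Python sketch below reproduces the odds calculation from the odds entry (80 favourable and 20 unfavourable outcomes) and then computes an odds ratio from a hypothetical 2 by 2 table. The table counts and function names are made up for illustration.

```python
def odds(events: int, non_events: int) -> float:
    """Number of times the event occurred divided by the number of times it did not."""
    return events / non_events


def odds_ratio(exposed_cases, exposed_non_cases, unexposed_cases, unexposed_non_cases):
    """Odds of the outcome with the exposure divided by the odds without it."""
    return odds(exposed_cases, exposed_non_cases) / odds(unexposed_cases, unexposed_non_cases)


favourable_odds = odds(80, 20)

# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed people had the outcome.
example_ratio = odds_ratio(30, 70, 10, 90)

print(favourable_odds, round(example_ratio, 2))
```

A ratio above one in this sketch means the exposure is associated with higher odds of the outcome.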
P

percentage

Percentage is a way of expressing a proportion as the number of cases per 100. It is commonly used to represent a portion of a whole. A percentage is calculated by dividing the number of times a particular value for a variable has been observed, by the total number of observations in the population, then multiplying this number by 100.

For example, in a survey of 1,000 adults about their smoking habits, if 180 respondents reported being current smokers, the percentage of smokers in the sample would be 18% (180 divided by 1,000, multiplied by 100). Alternatively, the percentage of non-smokers would be 82% (820 divided by 1,000, multiplied by 100).

See also rate, ratio and proportion.

percentile

A percentile expresses where an observation falls in a range of other observations. For example, take the 30th percentile: 30 per cent of all the numbers recorded are equal to or lower than its value.

Percentiles are found by arranging numbers in a data set in ascending order and dividing the set into 100 groups with an equal number of data points. Each of the 99 dividing points is called a percentile of the data set.

See also decile, quantile, quartile and quintile.

person-year

A person-year is a unit used in the measurement of the time a group of people spends at risk of an event occurring. The time at risk of a group measured in person-years is calculated by summing the amount of time (in years) that each person in the group spends at risk of the event during the study.

It follows that, if the incidence of heart disease in Australia is 30 cases per 1,000 person-years, this means there will be, on average, 30 cases of heart disease diagnosed if 1,000 people are observed for one year.
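
A small Python sketch of the calculation follows. The follow-up times and event flags are hypothetical and much smaller than any real study.

```python
# Hypothetical follow-up: (years spent at risk, whether the event occurred).
follow_up = [(2.0, False), (3.5, True), (1.0, False), (3.5, False)]

# Sum each person's time at risk to get the group's person-years.
person_years = sum(years for years, _ in follow_up)
events = sum(1 for _, had_event in follow_up if had_event)

rate_per_1000_person_years = events * 1000 / person_years

print(person_years, rate_per_1000_person_years)
```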

prevalence

The number of new and pre-existing cases (of an illness or injury) in a population at a specific time or period of time.
For example, if at the end of January there are 200 people in a town of 10,000 people with diabetes (some people were diagnosed in January and some people were diagnosed earlier), the prevalence at the end of January is 2.0%.
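
The difference from incidence comes down to which cases are counted. The Python sketch below uses the figures from this entry and from the incidence entry; the function name is hypothetical.

```python
def per_100(cases: int, population: int) -> float:
    """Cases expressed per 100 people in the population."""
    return cases * 100 / population


# Incidence counts only the people newly diagnosed during the year.
incidence_per_100 = per_100(100, 10_000)

# Prevalence counts everyone with the condition at the end of January,
# whether they were diagnosed in January or earlier.
prevalence_percent = per_100(200, 10_000)

print(incidence_per_100, prevalence_percent)
```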

See also incidence.

probabilistic linkage

A data linkage method that combines information from different data sets based on the likelihood (probability) they are about the same people or groups. Probabilistic and other forms of data linkage make separate pieces of data more useful by bringing them together to form new insights.

Where one or more key variables defining an individual or group (across two or more data sets) have accurate, complete, consistent and unique values, then linkage can be deterministic. However, in the absence of such unique keys, probabilistic linkage may be able to assign a likelihood that potentially matched data refer to the same individual or group.

probability

Refers to the possibility of something happening (that is, the event has yet to occur). The probability of a specified event lies between 0 and 1 inclusive – 0 meaning the probability of something happening is 0% and 1 meaning the probability is 100%.

For example, if you toss a coin once, there is equal probability that it will land on heads or tails. There is an expectation that, if the coin is ‘fair’, no matter how many times you toss it, it will land on heads and tails a similar number of times.

projection

A statistic indicating what a value would be if the assumptions about future trends used in its calculation hold true. These assumptions are often based on patterns of change that have previously occurred.

For example, data collected about the total number of general practice providers in a regional area show an increase from 8 in the first year, to 12 in the second year, to 18 in the third year. It could therefore be projected that if the number of providers continues to expand following the same pattern, there will be 27 after the fourth year.
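
The pattern in this example is a constant growth ratio of 1.5 from one year to the next. The Python sketch below makes that assumption explicit; the variable names are hypothetical.

```python
observed = [8, 12, 18]  # providers in years 1, 2 and 3

# Growth ratio between each pair of consecutive years.
ratios = [later / earlier for earlier, later in zip(observed, observed[1:])]

# The projection only applies if the same pattern holds in every year.
if any(abs(r - ratios[0]) > 1e-9 for r in ratios):
    raise ValueError("growth ratio is not constant")

projected_year_4 = observed[-1] * ratios[-1]

print(ratios, projected_year_4)
```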

A projection is not a prediction or forecast about what is going to happen; it indicates what would happen if the assumptions that underpin the projection actually occur (ABS n.d.c).

proportion

A part, share, or number considered in relation to a whole. In this ratio, the numerator is included in the denominator. It is calculated by dividing the number of times a particular value for a variable has been observed, by the total number of values in the population.

For example, a preschool has 60 children, 50 of whom have a sibling. The proportion of children in the preschool with a sibling is 50/60 or 5 in 6.

See also rate, ratio and percentage.

p-value

The p-value is an estimate of how likely the observed results would be if there were no real differences between the things being compared or no real trends or relationships among the analysed data. For example, a p-value of 0.05 (5%) suggests that you would observe a difference, trend or relationship that was at least as big or strong as the one that has been observed one time out of 20 just by chance. A p-value less than 0.05 is typically taken to represent a statistically significant result.

It is important to note that significance tests are based on a set of assumptions. For example, assumptions might include that a data set has a normal distribution of values, and that the groups that are being compared have similar variance. If any assumptions are violated, then statistical significance could be incorrectly indicated by the test.

See also statistically significant.
Q
quantile

Quantiles are used to divide a group of observations into equal ordered subgroups based on the value of a specified variable. Quartiles, quintiles, deciles and percentiles are all types of quantiles. A quartile divides a data set in 4 (that is, a quartile is one-fourth), a quintile divides a data set in 5 (one-fifth), and a decile divides a data set in 10 (one-tenth). A percentile divides a data set in 100 (one-hundredth).
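
The Python sketch below splits an ordered set of values into equal groups. The function name and the data (the numbers 1 to 20) are hypothetical, and the groups are exactly equal only when the number of values divides evenly by the number of groups.

```python
def quantile_groups(values, n_groups):
    """Sort the values and split them into n_groups ordered groups of near-equal size."""
    ordered = sorted(values)
    size = len(ordered) / n_groups
    return [ordered[round(i * size):round((i + 1) * size)] for i in range(n_groups)]


data = list(range(1, 21))  # hypothetical values 1 to 20

quartiles = quantile_groups(data, 4)   # 4 groups of 5 values
quintiles = quantile_groups(data, 5)   # 5 groups of 4 values
deciles = quantile_groups(data, 10)    # 10 groups of 2 values

print(quartiles[0], quintiles[0], deciles[0])
```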

See also decile, percentile, quartile and quintile. Quartile includes a detailed example.
quartile
One of 4 equal groups into which a population can be divided according to the distribution of values of a particular variable. The 3 threshold values used to divide a population into 4 equal groups are known as Q1, Q2 and Q3, and each of the 4 groups comprises one quarter (25%) of the observations.

For example:
- First quartile, or Q1, is the lower quartile. 25% of observations have a value below that point, and 75% of observations have a value above that point.
- Second quartile, or Q2, is also known as the median (the middle value in an ordered list). It is the value separating the lower 50% of observations from the upper 50%, forming 2 equal groups.
- Third quartile, or Q3, is the upper quartile. 75% of observations have a value below that point and 25% have a value higher than that point.
For example, in a set of 11 numbers, sorted in ascending order, Q2 (the median) is the 6th observation (with the value of 13). Q2 equally divides this set into 5 observations below and 5 observations above.

2 4 5 8 10 13 15 19 20 22 30

In this same set of 11 numbers, the lower 5 observations are equally divided by Q1, the third observation (with the value of 5). The upper 5 observations are equally divided by Q3, the ninth observation (with the value of 20).

The interquartile range is defined as:
- the difference in values between Q3 and Q1, or
- the middle 50% of values, which lie in the range between Q1 and Q3:
2 4 5 8 10 13 15 19 20 22 30

In this case, the interquartile range is 20 – 5 = 15.
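
The worked example above can be reproduced with a short Python sketch. It assumes the convention used in this entry, where Q1 and Q3 are the medians of the observations below and above Q2; statistical software may use other interpolation rules and return slightly different quartiles. The function name `quartiles` is illustrative only.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of observations, the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [2, 4, 5, 8, 10, 13, 15, 19, 20, 22, 30]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 5 13 20 15
```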

See also decile, percentile and quantile.

quintile

One of 5 equal groups into which a population can be divided according to the distribution of values of a particular variable.

See also decile, percentile, quantile and quartile. Quartile includes a detailed example.
R
range

Difference between the highest and the lowest score in a data item. For example, if the data set is 2, 6, 8, 10, 3, then the range will be 10 – 2 = 8.

rate

The frequency with which a disease, illness or other event occurs in a defined population during a specified period of time. The numerator is the number of cases or events (for example, the number of deaths, cases of disease or injuries). The denominator in AIHW reports is usually a population unit or a time-at-risk unit.

For example:
- 134 hospitalisations per 100,000 people
- 1.4 falls per 1,000 patient days
- 48 injury deaths per 100,000 person-years
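
As a rough illustration of the arithmetic behind figures like these, the sketch below divides a count of events by a denominator and scales the result to a chosen population unit. The counts and the function name `rate_per` are hypothetical and are not taken from any AIHW report.

```python
def rate_per(events, denominator, per=100_000):
    """Return the number of events per `per` units of the denominator."""
    return events * per / denominator

# Hypothetical counts chosen to reproduce rates of the kind listed above.
hospitalisation_rate = rate_per(67, 50_000)   # 67 hospitalisations among 50,000 people
falls_rate = rate_per(7, 5_000, per=1_000)    # 7 falls over 5,000 patient days

print(hospitalisation_rate)  # 134.0
print(falls_rate)            # 1.4
```
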
See also percentage, person-year, proportion, rate difference, rate ratio and ratio.

rate difference

A rate difference compares the rates of 2 groups, presenting the absolute difference between the rate of the study group and the comparison group. It is calculated by subtracting the incidence rate in the comparison group from the incidence rate in the study group. For example, if 45 cases of influenza per 1,000 person-years were recorded among people exposed to chemical fumes in their workplace and 10 cases per 1,000 person-years were recorded among people not exposed, the rate difference would be 35 cases per 1,000 person-years.
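
The subtraction in this example can be written as a minimal Python sketch. It assumes both rates are expressed in the same unit (cases per 1,000 person-years); the variable names are illustrative.

```python
# Both rates must use the same denominator unit before subtracting.
rate_exposed = 45     # influenza cases per 1,000 person-years, exposed to chemical fumes
rate_unexposed = 10   # influenza cases per 1,000 person-years, not exposed

rate_difference = rate_exposed - rate_unexposed
print(rate_difference)  # 35
```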

See also percentage, person-year, proportion, rate, rate ratio and ratio.

rate ratio

This measure is derived by comparing 2 groups for their likelihood of an event occurring at any given point in time. It is also called the risk ratio because it is the ratio of the risk in the ‘exposed’ population divided by the risk in the ‘unexposed’ population.

For example, research is undertaken to explore whether heart disease rates are higher for people who smoke. Results indicate that the rate of heart disease among smokers is 30 in 1,000 person-years and for non-smokers it is 10 in 1,000 person-years. The rate ratio is therefore 30/10, or 3.

This means that smokers had 3 times the rate of heart disease as non-smokers.
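
The same calculation as a minimal Python sketch, assuming both rates are expressed per 1,000 person-years. The function name `rate_ratio` is illustrative.

```python
def rate_ratio(rate_exposed, rate_unexposed):
    """Ratio of two rates expressed in the same unit."""
    return rate_exposed / rate_unexposed

# Heart disease cases per 1,000 person-years, from the example above.
smoking_rate_ratio = rate_ratio(30, 10)
print(smoking_rate_ratio)  # 3.0
```

A rate ratio of 1 would indicate the same rate in both groups.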

See also percentage, person-year, proportion, rate, rate difference and ratio.

ratio

The value obtained by dividing one quantity by another. Rates and proportions are 2 types of ratios.

The numerator is divided by the denominator. For example, the ratio of women to men in a population group with 30 women and 20 men is 30 / 20, or 1.5.

See also percentage, person-year, proportion, rate, rate difference and rate ratio.

regression analysis

Regression analysis consists of a set of statistical methods used to assess the relationships between a dependent variable and one or more independent variables. When the effects of several categorical or quantitative explanatory variables are analysed simultaneously, inferences about the model parameters evaluate which explanatory variables are truly associated with the response variable, while adjusting for the effects of possible confounding variables. For example, a study looking at whether presentations to hospital emergency departments increase over summer might also consider the impact of reduced opening hours of general practitioners over the holiday period.

See also dependent and independent variable.

relative risk

Relative risk is a ratio of the probability of an event occurring, in a specified time period, in an exposed group versus a non-exposed group.

For example, the relative risk of developing breast cancer (event) in women who smoke (exposed group) versus women who do not smoke (non-exposed group) would be the probability of developing breast cancer for women who smoke divided by the probability of developing breast cancer for women who do not smoke.
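
The division described above can be sketched in Python. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; `relative_risk` is an illustrative name, and `Fraction` is used so the result is exact.

```python
from fractions import Fraction

def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Probability of the event in the exposed group divided by that in the non-exposed group."""
    risk_exposed = Fraction(events_exposed, n_exposed)
    risk_unexposed = Fraction(events_unexposed, n_unexposed)
    return risk_exposed / risk_unexposed

# Hypothetical counts: 12 of 100 exposed people and 10 of 100 non-exposed people have the event.
rr = relative_risk(12, 100, 10, 100)
print(float(rr))  # 1.2
```

A relative risk of 1.2 corresponds to a risk that is 20% higher in the exposed group than in the non-exposed group.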

See also absolute risk.

relative standard error (RSE)

The standard error expressed as a proportion of the associated estimated value. It is usually displayed as a percentage. RSEs are a useful measure as they provide an indication of the relative size of the error likely to have occurred due to sampling. A high RSE indicates less confidence that an estimated value is close to the true population value.
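
A minimal Python sketch of this calculation follows. The estimates and standard errors are hypothetical, and `relative_standard_error` is an illustrative name.

```python
def relative_standard_error(standard_error, estimate):
    """Return the RSE as a percentage of the estimate."""
    return 100 * standard_error / estimate

# Hypothetical survey estimates that share the same standard error of 10.
rse_large_estimate = relative_standard_error(10, 200)
rse_small_estimate = relative_standard_error(10, 40)

print(rse_large_estimate)  # 5.0
print(rse_small_estimate)  # 25.0
```

The same standard error gives a much higher RSE for the smaller estimate, which is why RSEs are useful for judging the relative size of the error.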

Where published statistics contain an indication of the RSEs they can be used to compare statistics from different studies of the same population (ABS n.d.d).

remoteness areas

Remoteness areas divide Australia into 5 classes of remoteness on the basis of a measure of relative access to services. These regions are based on the Accessibility/Remoteness Index of Australia and defined as remoteness areas by the Australian Statistical Geography Standard (ASGS) in each Census year. The 5 remoteness areas are: Major cities, Inner regional, Outer regional, Remote, and Very remote (ABS 2021).

risk

Risk in statistical terms refers simply to the probability that an event will occur.

For example, the risk of being diagnosed with cancer increases with age.
S
sample estimate

A value that is inferred for a population based on data collected from a sample of units from that population.

For example, if the data from a simple random sample shows that 51% of the sample are female, then the population value will be estimated to be 51%.

An estimate is not a guess; it is a value based on sampled data which has been adjusted using statistical estimation procedures (ABS n.d.c).

See also sample/sampling and sampling error.

sample/sampling

The selection of a subset or a statistical sample of individuals from within a statistical population to estimate characteristics of the whole population.

For example, to research access to health services in regional areas compared with urban areas, researchers survey a sample of 1,000 people living in regional areas and 1,000 living in urban areas from a total population of 50,000 and 1,000,000 respectively.

See also sample estimate and sampling error.

sampling error

The difference between an estimate of a population parameter based on data from a sample and the 'true' value of that parameter that would result if a census (complete count) were taken.

Sampling error can occur when:
- the proportions of different characteristics within the sample are not similar to the proportions of the characteristics for the whole population
- the sample is too small to accurately represent the population
- the sampling method is not random.

Sampling error can be measured and controlled in random samples where each unit has a chance of selection, and that chance can be calculated. In general, increasing the sample size will reduce the sampling error (ABS n.d.e).

socioeconomic group

Indication of the socioeconomic status of a person or group. Socioeconomic groups are mostly reported using the Socio-Economic Indexes for Areas, typically for 5 groups (quintiles), from the most disadvantaged areas (worst off or lowest socioeconomic group) to the least disadvantaged areas (best off or highest socioeconomic group).

See also Socio-Economic Indexes for Areas (SEIFA).

Socio-Economic Indexes for Areas (SEIFA)

SEIFA combines Australian Bureau of Statistics Census data such as income, education, employment, occupation, housing and family structure to summarise the socioeconomic characteristics of an area.

Each area receives a SEIFA score indicating how relatively advantaged or disadvantaged that area is compared with other areas.

SEIFA does not show how individuals living in the same area differ, socioeconomically, from each other. It is a collection of 4 indexes, each summarising a different aspect of the socioeconomic conditions in an area using different Census data:
- the Index of Relative Socio-economic Advantage and Disadvantage (IRSAD) focuses on both advantage and disadvantage
- the Index of Relative Socio-economic Disadvantage (IRSD) focuses on relative socio-economic disadvantage
- the Index of Education and Occupation (IEO) focuses on relative education and occupation advantage and disadvantage
- the Index of Economic Resources (IER) focuses on economic advantage and disadvantage.

The same area may score differently for each index (ABS 2023a).

See also socioeconomic group, Index of Relative Socio-economic Advantage and Disadvantage (IRSAD) and Index of Relative Socio-economic Disadvantage (IRSD).

standard error

A measure of the variation between an estimated population value that is based on a sample and the true value for the population (ABS n.d.d). The standard error is the simplest measure of how precise a survey-based estimate is. It can be used as a guide to help interpret the possible sampling error. In general, the closer the standard error is to zero, the more precise the estimate (ONS n.d.).
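
As one common illustration, the standard error of a mean from a simple random sample is usually estimated as the sample standard deviation divided by the square root of the sample size. The sketch below assumes that simple case; estimates from complex surveys generally need other methods. The data and the name `standard_error_of_mean` are hypothetical.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical measurements from a simple random sample.
sample = [1, 2, 3, 4, 5]
se = standard_error_of_mean(sample)
print(round(se, 3))  # 0.707

# A larger sample with the same spread of values gives a smaller standard error.
se_larger_sample = standard_error_of_mean(sample * 4)
print(se_larger_sample < se)  # True
```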

standardised mortality ratios (SMRs)

Standardised mortality ratio (SMR) is a widely recognised measure used to account for differences in age structures (and other relevant variables, such as sex) when comparing death rates between populations. For example, the SMR is used in an analysis of Australian Defence Force (ADF) deaths by suicide. It is used to control for the fact that the 3 ADF service status groups have a younger age profile than the Australian population, and rates of suicide vary by age in both the study populations and the Australian population. The SMRs control for these differences, enabling comparisons of suicide counts between the 3 service status groups and Australia without the confounding effect of differences in age.

The SMR is calculated as the observed number of events (deaths by suicide) in the study population divided by the number of events that would be expected if the study population had the same age- and sex-specific rates as the comparison population.
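
The calculation can be sketched in Python as observed deaths divided by expected deaths, where the expected deaths are obtained by applying the comparison population's rates to the study population in each age group. All numbers below are hypothetical and are not drawn from the ADF analysis; the function and variable names are illustrative.

```python
def standardised_mortality_ratio(observed_deaths, study_population, reference_rates_per_100k):
    """Observed deaths divided by the deaths expected under the reference rates."""
    expected_deaths = sum(
        study_population[group] * reference_rates_per_100k[group] / 100_000
        for group in study_population
    )
    return observed_deaths / expected_deaths

# Hypothetical study population counts by age group.
study_population = {"18-29": 1_000, "30-49": 2_000}
# Hypothetical death rates per 100,000 people in the comparison population.
reference_rates_per_100k = {"18-29": 100, "30-49": 200}

# Expected deaths are 1 + 4 = 5, so 10 observed deaths give an SMR of 2.
smr = standardised_mortality_ratio(10, study_population, reference_rates_per_100k)
print(smr)  # 2.0
```

An SMR above 1 means more deaths were observed than would be expected under the comparison population's rates; an SMR below 1 means fewer.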

See also relative standard error (RSE).

statistically significant

The validity or importance of research findings is generally assessed by statistical significance tests (Knief and Forstmeier 2021). Statistically significant differences, trends or relationships are unlikely to be observed by chance alone. Tests of statistical significance estimate how likely the observed results would be if there were no real differences, trends or relationships. These tests assume that there is no real difference, trend or relationship (the ‘null hypothesis’) and then estimate the probability of finding a difference that is at least as big or strong as the one that has been observed. The lower the probability, the stronger the evidence that the observed result reflects a real pattern.

The probability estimate produced by a test of statistical significance is usually called a p-value. A p-value of less than 0.05 (1 in 20) is typically taken to represent a statistically significant result, but other thresholds (for example, 0.01) are also used.

It is important to note that significance tests are based on a set of assumptions. For example, assumptions might include that a data set has a normal distribution of values, and that the groups being compared have similar variance. If any assumptions are violated, then statistical significance could be incorrectly indicated by the test.
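
To make the logic of a significance test concrete, the sketch below uses an exact permutation test, which is only one of many kinds of test and is chosen here because it does not rely on a normal distribution. Under the null hypothesis the group labels are interchangeable, so the p-value is the share of all possible regroupings that give a difference in means at least as big as the one observed. The data and the name `permutation_p_value` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = group_a + group_b
    size_a = len(group_a)

    def mean_difference(indices_a):
        a = [pooled[i] for i in indices_a]
        b = [pooled[i] for i in range(len(pooled)) if i not in indices_a]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_difference(tuple(range(size_a)))
    # Every possible way of assigning the pooled values to a group of this size.
    splits = list(combinations(range(len(pooled)), size_a))
    at_least_as_big = sum(1 for s in splits if mean_difference(s) >= observed)
    return Fraction(at_least_as_big, len(splits))

# Hypothetical measurements for 2 groups of 4 units each.
p = permutation_p_value([1, 2, 3, 4], [5, 6, 7, 8])
print(p, float(p) < 0.05)  # 1/35 True
```

Only 2 of the 70 possible regroupings are as extreme as the observed one, so the p-value is about 0.029, below the commonly used 0.05 threshold.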

See also p-value.

suppression

Data (cells) in tables may be hidden:
- to maintain the privacy or confidentiality of a person or organisation
- due to concerns that the data are unreliable for the intended purpose.

For example, when reporting on a rare disease by age, sex, and geography, some totals may include data from only 1 or 2 individuals. These data may be removed before publishing (that is, they are suppressed) to protect privacy or due to concerns about volatility.

survey

A survey involves collecting information from every unit in the population (a census), or from a subset of units (a sample) from the population (ABS 2024b). A unit could be a household, a person or the like. The information is collected for analysis to identify patterns and trends.

survival analysis

A statistical analysis examining the distribution of the time taken for an event to occur in a study population (for example, the time from treatment until death or the time for a leg fracture to heal).
T

time series analysis

A statistical method that examines data points collected at regular intervals to uncover underlying patterns and trends, often adjusted for other sources of variation. For example, spending on Australia’s hospitals has changed since 2016–17, rising from $77 billion to $90 billion in 2020–21.
U

underlying cause of death

The disease or injury that initiated the train of events leading directly to death, or the circumstances of the accident or violence that produced the fatal injury.

See also cause(s) of death and associated cause(s) of death.

unit record data

Information related to an individual person (known as person-level data), an organisation (for example, a general practice), or encounters (such as visits to a doctor). For example, it may include name, sex, date of birth, an episode of care or service event, date of a diagnosis, date of treatment and treatment type. It can either be collected at the person/organisational level or created through data linkage. Services can be followed through multiple data sets. A person’s identity is never revealed.
V

variable

Any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item. Age, sex and country of birth are examples of variables.
W

weighting

Weighting is the process of adjusting results from a sample survey to infer results for the total in-scope population. To do this, a 'weight' is allocated to each sample unit corresponding to the level at which population statistics are produced (ABS 2023b).

For example, a questionnaire is sent to a random sample of 100 individuals in a study population of 1,000 that is known to comprise 50% females and 50% males. Of the 100 respondents, however, only 25 are female and 75 are male. To adjust for this, the female responses are weighted (multiplied) by 20 (500 ÷ 25) and the male responses by 6 ⅔ (500 ÷ 75).
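
The weights in this example, and their effect on a survey result, can be sketched in Python. The population and respondent counts come from the example above; the "yes" answers are hypothetical and are added only to show how weighting changes an estimate. `Fraction` is used so that 6 ⅔ is represented exactly as 20/3.

```python
from fractions import Fraction

population = {"female": 500, "male": 500}   # 50% each of a population of 1,000
respondents = {"female": 25, "male": 75}

# Weight = number of people in the population represented by each respondent.
weights = {sex: Fraction(population[sex], respondents[sex]) for sex in population}
print(weights["female"], weights["male"])  # 20 20/3

# Hypothetical answers: respondents in each group who answered "yes" to a question.
yes_answers = {"female": 10, "male": 15}

unweighted = Fraction(sum(yes_answers.values()), sum(respondents.values()))
weighted = sum(yes_answers[sex] * weights[sex] for sex in population) / sum(population.values())
print(float(unweighted), float(weighted))  # 0.25 0.3
```

Because females are under-represented among respondents and, in this hypothetical, answer "yes" more often, the weighted estimate (30%) is higher than the unweighted one (25%).
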
Y

years lived with disability (YLDs)

The effect on a population of ill health from disease or injury. Presented as the number of years of what could have been a healthy life that were instead spent living with disease or injury. Also known as non-fatal burden.

See also burden of disease (and injury), disability-adjusted life years (DALYs), and years of life lost (YLLs).

years of life lost (YLLs)

The effect on a population of premature death due to disease or injury. Presented as the number of years of life lost due to premature death, defined as dying before the ideal life span. Also known as fatal burden.

See also burden of disease (and injury), disability-adjusted life years (DALYs), and years lived with disability (YLDs).

References

ABS (Australian Bureau of Statistics) (n.d.a) Data, ABS, Australian Government, accessed 5 December 2024.

ABS (n.d.b) Describing frequencies, ABS, Australian Government, accessed 9 January 2025.

ABS (n.d.c) Estimate and projection, ABS, Australian Government, accessed 9 December 2024.

ABS (n.d.d) Measures of error, ABS, Australian Government, accessed 9 December 2024.

ABS (n.d.e) Types of error, ABS, Australian Government, accessed 10 December 2024.

ABS (2021) Australian Statistical Geography Standard (ASGS) Edition 3, ABS, Australian Government, accessed 9 December 2024.

ABS (2023a) Socio-Economic Indexes for Areas (SEIFA), Australia, 2021, ABS, Australian Government, accessed 10 December 2024.

ABS (2023b) Weighting, benchmarking and estimation, Personal safety survey: user guide, Reference period 2021-22, ABS, Australian Government, accessed 10 December 2024.

Gupta A (2023) Likelihood v.s probability: what’s the difference? Simplilearn, Melbourne, accessed 11 December 2023.

IBM (n.d.) What is logistic regression?, IBM, accessed 10 December 2024.

Knief A and Forstmeier W (2021) Violating the normality assumption may be the lesser of two evils, Behavior Research Methods, Volume 53, pages 2576–2590, doi:10.3758/s13428-021-01587-5, accessed 13 January 2025.

OAIC (Office of the Information Commissioner Queensland) (2024) Privacy and de-identified data, Office of the Information Commissioner Queensland, Queensland Government, accessed 9 January 2025.

Tulchinsky TH and Varavikova EA (2015) ‘Measuring, monitoring, and evaluating the health of a population’, in The new public health, 3rd edn, Elsevier Inc, accessed 13 December 2023.

University of South Australia (2024) ‘Measures of disease frequency: basic measures’ in Measuring health and illness, Online modules in research methods and data analysis, University of South Australia, accessed 3 December 2024.

Zaiontz C (2025) Statistical test assumptions, Real statistics using Excel, accessed 13 January 2025.
