Large population-based studies provide information on prevalence, incidence, disease risk factors, treatments used in clinical practice, and prognosis. As the number of participants increases, selection and participation bias tends to decrease; therefore, researchers prefer large-scale studies. Consequently, studies in which a cohort is relatively easy to establish, such as those of specific occupational groups or regions, have been actively conducted. However, a cohort restricted to specific occupations or regions cannot readily represent the entire population. Population-based data reduce selection bias to a greater extent than data that are merely large in scale, but such data are difficult to obtain. In the 2000s, interest in big data increased rapidly, and the advantages of reusing existing data for other research purposes became apparent. Concurrently, interest in research using healthcare claims databases increased.
A healthcare claims database comprises secondary data derived from the claims accumulated while operating a government health insurance system. Countries that have built such databases include South Korea, Japan, Taiwan, and Scandinavian countries (e.g., Sweden and Denmark). Among them, Taiwan established its National Health Insurance in 1995, and the National Health Research Institute began to build the National Health Insurance Research Database in 1997. Many papers have been published since the late 2000s. In Korea, as in Taiwan, research based on databases provided by the National Health Insurance Service (NHIS) and the Health Insurance Review Agency (HIRA) is rapidly increasing. In this article, we therefore review the characteristics of the claims database, the current status of research in rheumatology, and the precautions to take when conducting such research.
Under the health insurance system, citizens pay insurance premiums to the NHIS, a single insurer, which manages the funds and provides insurance benefits when necessary. The purpose is to prevent an excessive burden on households from high medical expenses caused by illness or injury. The NHIS provides benefits for health promotion, prevention, diagnosis, and treatment of disease and injury, as well as for rehabilitation, birth, and death. For example, for health promotion and prevention, it provides free health checkups every two years to all citizens over the age of 40 years, and it also guarantees the cost of diagnosis and treatment. However, medical services that do not meet the aforementioned purpose are not covered by insurance. These non-reimbursement items usually include oral nutritional supplements taken for prophylactic purposes, as well as surgery, procedures, and some new treatments performed for cosmetic purposes in the absence of an associated disease.
Efforts to establish a health insurance system first began in 1963, and in 1977, an occupational medical insurance system was created for industrial workers in workplaces with 500 or more workers. This system was gradually expanded to public officials, faculty, and medical insurance in rural areas, and universal coverage was achieved in 1989. In 2000, the organization was established as a single insurer by combining the various local and social insurance programs. It was originally called the National Health Insurance Corporation; its name was changed to the NHIS in 2013.
Of the country’s total population, 97% belong to the NHIS, and about 3% are registered in the Medical Aid Program (MAP), a program for low-income families. When a person enrolled in the MAP uses medical services, little or no medical expense is incurred apart from non-reimbursement items and a certain amount of copayment. As of April 2020, 1.5 million people qualified for medical benefits based on a population of 52.38 million, accounting for 2.8% of the total population. Of the 51.34 million people enrolled in health insurance, 37.24 million (70.5%) are employees, and 14.1 million (26.7%) are self-employed local subscribers.
The NHIS has a unique system referred to as the Individual Copayment Beneficiaries Program (ICBP). Since the early 2000s, ICBP has reduced the burden of medical expenses for patients with cancer or rare and intractable diseases. The attending physician checks whether the patient meets the diagnostic criteria for diseases included in the ICBP through physical, laboratory, imaging, or pathological examinations. If applicable, registration is made through the NHIS, and patients registered in this system pay 5%∼10% of reimbursement items as copayment to medical institutions, while other expenses are paid for by the NHIS.
Korean citizens usually pay an insurance contribution to the NHIS, which in 2021 was equivalent to 6.86% of the monthly average wage for employees. When citizens receive medical services corresponding to reimbursement items, the NHIS pays the medical institution the total medical expenses minus the copayment. In addition to the NHIS, one more organization is involved in the health insurance system: the HIRA reviews the medical bills and claims submitted by medical institutions to assess the adequacy of the quality and quantity of care. The NHIS pays the medical institution based on the HIRA’s review and decisions. The HIRA and NHIS are both under the Ministry of Health and Welfare (MOHW) and are influenced by the MOHW in the formulation and implementation of policies (Figure 1).
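The payment flow described above can be illustrated with a small arithmetic sketch. It assumes that non-reimbursement items are paid entirely by the patient and uses the 5% and 10% ICBP copayment rates mentioned earlier; the function name and the bill amounts are hypothetical and are not taken from any NHIS specification.

```python
def split_payment(reimbursable, non_reimbursable, copay_percent):
    """Split a bill (in KRW) between the patient and the NHIS.

    copay_percent is the patient's share of reimbursement items, given as a
    whole-number percentage. Non-reimbursement items are ASSUMED to be paid
    entirely by the patient.
    """
    if not 0 <= copay_percent <= 100:
        raise ValueError("copay_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on currency amounts.
    copayment = reimbursable * copay_percent // 100
    return {
        "patient_pays": copayment + non_reimbursable,
        "nhis_pays": reimbursable - copayment,
    }


# Hypothetical bill: 1,000,000 KRW of reimbursement items
# and 200,000 KRW of non-reimbursement items.
icbp_low = split_payment(1_000_000, 200_000, copay_percent=5)
icbp_high = split_payment(1_000_000, 200_000, copay_percent=10)
```

In this hypothetical case, a patient registered in the ICBP would pay between 250,000 and 300,000 KRW, and the NHIS would pay the remaining 900,000 to 950,000 KRW of the reimbursement items to the medical institution.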
“Healthcare claims data” refers to data based on the statement of medical care benefits that a medical institution submits to the NHIS to be paid for medical expenses. This statement contains information on the medical institution; the patient’s personal information; diagnoses coded according to the International Classification of Diseases, 10th revision (ICD-10); medical history (tests, procedures, and surgery); prescriptions; and costs. Both the NHIS and HIRA hold data based on this statement. These data include not only claims data but also the results of the health checkups for citizens over 40 years old and the infant checkups conducted by the NHIS. So that the data can support policy and academic research, individuals are protected by the use of a personal identification code. The NHIS created its database in 2006 and provided it to researchers in 2010; the HIRA began to provide its data to researchers in 2013 [9,10].
The NHIS operates the National Health Insurance Sharing Service (NHISS) to support policy and academic research using public health information. The NHISS provides a sample cohort database covering 2% of all citizens (about 1 million people), which includes de-identified claims, health screening data, and mortality data. Because these sample data are relatively accessible to researchers, many papers using them have been published. In addition, there are four more cohort databases: the national health screening cohort, the senior cohort, the working women cohort, and the infant medical checkup cohort [11-13]. It is now also possible to conduct research covering the whole country through a customized database, and family history research is becoming possible through the establishment of a family tree database. The HIRA provides four samples: the national inpatient sample (HIRA-NIS), the national patient sample (HIRA-NPS), the aged population sample (HIRA-APS), and the pediatric patient sample (HIRA-PPS).
Taking the NHIS customized database as an example, the data essentially consist of five tables: qualification, statement, details of treatment, type of disease, and details of prescription. The statement, details of treatment, type of disease, and details of prescription tables are referred to as T20, T30, T40, and T60, respectively. The contents of each table are shown in Table 1.
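As a minimal sketch of how these tables might relate to one another, the example below models each table as a small record type and collects the claims of one person. All field names (`person_id`, `claim_id`, and so on), codes such as `PROC_X` and `DRUG_A`, and the assumption that the treatment, disease, and prescription tables link to the statement table through a shared claim key are hypothetical; the actual variables are those listed in Table 1.

```python
from dataclasses import dataclass


# All field names are hypothetical placeholders, not actual NHIS variable names.
@dataclass
class Qualification:      # qualification table: who is insured
    person_id: str
    sex: str
    birth_year: int


@dataclass
class Statement:          # T20: one row per claim
    claim_id: str
    person_id: str
    visit_date: str       # ISO date string


@dataclass
class Treatment:          # T30: details of treatment
    claim_id: str
    procedure_code: str


@dataclass
class Diagnosis:          # T40: type of disease
    claim_id: str
    icd10: str


@dataclass
class Prescription:       # T60: details of prescription
    claim_id: str
    drug_code: str


def assemble_claims(person_id, statements, treatments, diagnoses, prescriptions):
    """Return each claim of one person with its procedures, diagnoses, and drugs."""
    assembled = []
    for st in statements:
        if st.person_id != person_id:
            continue
        assembled.append({
            "claim_id": st.claim_id,
            "visit_date": st.visit_date,
            "procedures": [t.procedure_code for t in treatments if t.claim_id == st.claim_id],
            "icd10": [d.icd10 for d in diagnoses if d.claim_id == st.claim_id],
            "drugs": [p.drug_code for p in prescriptions if p.claim_id == st.claim_id],
        })
    return assembled


# Hypothetical toy data
people = [Qualification("P1", "F", 1970), Qualification("P2", "M", 1985)]
statements = [
    Statement("C1", "P1", "2019-03-02"),
    Statement("C2", "P2", "2019-03-05"),
    Statement("C3", "P1", "2019-06-10"),
]
treatments = [Treatment("C1", "PROC_X")]
diagnoses = [Diagnosis("C1", "M05.9"), Diagnosis("C2", "J00"),
             Diagnosis("C3", "M05.9"), Diagnosis("C3", "I10")]
prescriptions = [Prescription("C3", "DRUG_A")]

p1_claims = assemble_claims("P1", statements, treatments, diagnoses, prescriptions)
```

Real analyses would typically perform this linkage with database joins rather than Python loops; the loop form is used here only to make the relationship between the tables explicit.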
The NHIS and HIRA have established dedicated organizations to actively provide data. As they provide several sample datasets that are easy to access, the number of research papers using health claims data has increased rapidly. In PubMed, we searched for papers with Korea as the affiliation and “national health insurance” or “health insurance review” in the title or abstract (Figure 2A). We confirmed that, toward the second half of the 2010s, papers using NHIS data were frequently published and that their number was increasing rapidly. When we added representative rheumatic diseases to the above search as keywords (rheumatoid arthritis, lupus, ankylosing spondylitis, Behçet, osteoarthritis, gout, Sjögren, myositis, vasculitis, fibromyalgia, systemic sclerosis, antiphospholipid, and adult-onset Still disease), we again found more studies using NHIS data. Among studies using NHIS data, rheumatoid arthritis was the most studied topic (54 studies), followed by lupus (30 studies), ankylosing spondylitis (23 studies), Behçet’s disease (17 studies), and osteoarthritis (15 studies). For each of the other diseases, there were fewer than 10 studies or none. Among studies using HIRA data, rheumatoid arthritis was also the most studied disease (14 studies), followed by ankylosing spondylitis (10 studies). Studies of other topics were limited or had not been conducted (Figure 2B).
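The search described above could be expressed along the following lines. This is only a sketch: the exact query string we describe is not reproduced here, the helper name is hypothetical, and restricting the disease keywords to the title or abstract is an assumption. `[Affiliation]` and `[Title/Abstract]` are standard PubMed search field tags.

```python
def build_pubmed_query(disease_terms=None):
    """Build a PubMed query string for Korean claims-database papers.

    disease_terms: optional list of disease keywords. Searching them in the
    title/abstract is an ASSUMPTION made for this sketch.
    """
    base = (
        'Korea[Affiliation] AND ("national health insurance"[Title/Abstract] '
        'OR "health insurance review"[Title/Abstract])'
    )
    if not disease_terms:
        return base
    diseases = " OR ".join(f'"{term}"[Title/Abstract]' for term in disease_terms)
    return f"({base}) AND ({diseases})"


all_claims_query = build_pubmed_query()
rheumatology_query = build_pubmed_query(["rheumatoid arthritis", "lupus", "gout"])
```

The resulting strings can be pasted into the PubMed search box; counting the hits per publication year would give a trend of the kind summarized in Figure 2A.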
When reviewing the published studies, we confirmed that the following types of studies had been conducted along the disease course:
If several types of data (particularly national health checkup data) are added to the existing data, people's lifestyle behaviors and laboratory results can be examined and used for research. When the temporal relationship of these results can be interpreted, the risk factors for disease occurrence can be estimated.
Prevalence and incidence estimated from claims data are considered close to the true values; thus, such surveys are the most frequently conducted type of study and are particularly useful for rare diseases [16,17]. In addition, comorbid diseases can be investigated, although care should be taken when interpreting causal relationships.
The treatment modalities and drugs used in practice can be investigated. In addition, treatment effects and adverse reactions can be indirectly confirmed, and the cost of treatment can be estimated [20,21].
As the data for each claim accumulate, a long-term cohort is created. Thus, it is possible to study the long-term effects, prognosis, and complications of treatment. In addition, disability codes and death information are available; thus, these outcomes can be studied as well.
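To illustrate how accumulated claims can yield the prevalence and incidence estimates mentioned above, the following sketch counts patients per year from a list of disease-coded claims. It makes several simplifying assumptions that are ours, not part of any NHIS procedure: a patient is prevalent in a year if they have at least one claim with the disease code in that year, a patient is incident in the year of their first such claim, and the denominator is the whole population rather than the population at risk. The function and variable names are hypothetical.

```python
from collections import defaultdict


def yearly_rates_per_100k(disease_claims, population_by_year):
    """Estimate yearly prevalence and incidence per 100,000 persons.

    disease_claims: iterable of (person_id, year) pairs, one per claim that
    carries the disease code of interest.
    population_by_year: dict mapping year -> population size.
    """
    first_year = {}                      # person -> year of first claim
    patients_by_year = defaultdict(set)  # year -> persons with a claim
    for person_id, year in disease_claims:
        patients_by_year[year].add(person_id)
        if person_id not in first_year or year < first_year[person_id]:
            first_year[person_id] = year

    rates = {}
    for year, population in sorted(population_by_year.items()):
        prevalent = len(patients_by_year.get(year, set()))
        incident = sum(1 for y in first_year.values() if y == year)
        rates[year] = {
            "prevalence": prevalent * 100_000 / population,
            "incidence": incident * 100_000 / population,
        }
    return rates


# Hypothetical toy data: three patients followed over three years.
claims = [("A", 2015), ("A", 2016), ("B", 2016), ("C", 2016), ("C", 2017)]
population = {2015: 100_000, 2016: 100_000, 2017: 100_000}
rates = yearly_rates_per_100k(claims, population)
```

Note that every patient seen in the first year of the data is counted as incident in this sketch; in practice, an initial disease-free look-back period would be needed to separate new cases from existing ones.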
Claims data provide information on a large number of patients and events across the entire nation. An extremely large sample size makes it possible to perform studies that cannot be performed in conventional small clinical studies. Conversely, because the data were not originally constructed for research, they are prone to various errors in interpretation. These limitations have often been mentioned by previous authors, but it is worth reorganizing them and noting additional restrictions.
Claims data follow the format in which medical institutions submit items to receive payment from the NHIS. Therefore, clinical information may be omitted or inaccurate if it is not important to the claim. In addition, test results cannot be confirmed because the data were not designed for research: physical examination findings (including blood pressure), laboratory tests, imaging tests, and pathology results are not available. Adding the results of the national health checkup to the existing data provides some of these results, but only those related to cardiovascular diseases are included. Furthermore, the checkup items were changed in 2009, as was the way in which the results of the lifestyle questionnaire are presented.
Because test results cannot be viewed, it is difficult to apply the diagnostic/classification criteria that are used to select participants in conventional studies. Therefore, researchers must define the diagnosis themselves; this is the operational definition. The easiest method is to define the diagnosis through disease codes; however, these codes do not correspond perfectly to patients’ actual diseases. Approximately 70% of the disease codes in claims are consistent with the diagnosis in the medical record. This concordance is lower for outpatients than for inpatients, for mild diseases than for severe diseases, and for primary care than for general hospitals [3,25]. In addition, the number of participants may vary depending on the scope of the disease codes examined, as milder diseases are more often recorded as an additional diagnosis than as the principal diagnosis [26-28]. The low degree of agreement can be improved by adding criteria such as codes for drugs used specifically for the disease and the number of outpatient visits. However, adding such criteria lowers the sensitivity for the disease and may therefore be inappropriate for prevalence and incidence research, so the researcher must develop an operational definition suited to the research topic. It is therefore recommended to validate the definition oneself or to refer to the algorithms of previously published papers.
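The elements of an operational definition discussed above (disease codes, principal versus additional diagnosis, disease-specific drugs, and the number of visits) can be combined as in the following sketch. The record layout, the function name, and the drug code `DRUG_A` are hypothetical; M05 is used as an example because it is the ICD-10 category for seropositive rheumatoid arthritis. The thresholds are illustrative and do not represent a validated algorithm.

```python
def meets_definition(claims, code_prefix, disease_drugs=None,
                     min_visits=1, principal_only=False):
    """Check whether one patient's claims satisfy an operational definition.

    claims: list of dicts with keys "principal" (str), "additional" (list of
    str), and "drugs" (list of str). This layout is hypothetical.
    code_prefix: ICD-10 prefix identifying the disease, e.g. "M05".
    disease_drugs: optional set of drug codes specific to the disease; if
    given, at least one must be prescribed on a matching claim.
    min_visits: minimum number of claims carrying the disease code.
    principal_only: if True, ignore additional diagnoses.
    """
    matching = []
    for claim in claims:
        codes = [claim["principal"]]
        if not principal_only:
            codes.extend(claim["additional"])
        if any(code.startswith(code_prefix) for code in codes):
            matching.append(claim)

    if len(matching) < min_visits:
        return False
    if disease_drugs is not None:
        return any(drug in disease_drugs
                   for claim in matching for drug in claim["drugs"])
    return True


# Hypothetical patient: the disease code appears once as an additional
# diagnosis and once as the principal diagnosis with a disease-specific drug.
patient_claims = [
    {"principal": "I10", "additional": ["M05.9"], "drugs": []},
    {"principal": "M05.9", "additional": [], "drugs": ["DRUG_A"]},
]

loose = meets_definition(patient_claims, "M05")
moderate = meets_definition(patient_claims, "M05",
                            disease_drugs={"DRUG_A"}, min_visits=2)
strict = meets_definition(patient_claims, "M05", disease_drugs={"DRUG_A"},
                          min_visits=2, principal_only=True)
```

The same patient is included under the loose and moderate definitions but excluded under the strict one, which shows how tightening the definition trades sensitivity for agreement with the true diagnosis.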
When a medical institution claims expenses from the NHIS, it is paid only after the HIRA has reviewed whether the claim meets the criteria for reimbursed services. Because the HIRA often determines whether a benefit is provided by the presence or absence of a specific disease code, a disease code that is not directly related to the underlying major disease may be added to a claim. When the formulary changes, the use of drugs may increase or decrease. Moreover, medications and procedures may vary depending on whether the diagnosis-related group (DRG) or the fee-for-service payment policy applies. Therefore, research conducted using claims data should be interpreted with a detailed, long-term understanding of insurance coverage items.
If the copayment is reduced owing to enrollment in the ICBP, MAP, or a similar program, registered persons’ access to medical services increases. As their use of medical services increases, their prevalence/incidence rates and drug use for diseases may differ from those of other income and disease groups. Therefore, care should be taken when interpreting the results for these groups.
Claims data contain a very large number of observations; thus, the statistical power is high. However, these data describe observed phenomena, and it is difficult to determine a causal relationship from them.
Health claims data have the above limitations, but, as real-world data, they also have many advantages that other data do not. Health claims data are representative of the nation and are easy to generalize. Because they are population-based, selection bias can be reduced to a greater degree than with other large-scale data. They reflect the actual healthcare environment rather than a limited experimental environment, and because they are long-term follow-up data, they show the current status and trends. When health claims data are used for a study, already established data are used, which reduces the time and cost of data construction, usually the most time-consuming part of the process. In addition, it is easy to obtain detailed information on medication use, to access actual treatment costs, and to study rare events.
To overcome the above limitations, other data (Statistics Administration, National Institute of Environmental Sciences, Meteorological Administration, etc.) can be linked, or validation of the “operational definitions” can be added. Adding these processes will lead to more accurate results. In addition, more studies should be conducted as additional databases based on NHIS claims data continue to be built.
Health claims data in Korea contain ample medical treatment data and allow various studies to be conducted. Provided that their limitations and the need for further validation are taken into account, these data are critical for medical research.
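Validation of an operational definition, as suggested above, usually means comparing the patients selected by the claims-based algorithm with a reference standard such as medical record review. The sketch below computes the usual agreement measures from sets of patient identifiers; the function name and the toy identifiers are hypothetical, and the choice of chart review as the reference standard is an assumption.

```python
def validation_metrics(algorithm_positive, chart_positive, all_patients):
    """Compare a claims-based definition with a chart-review reference standard.

    Each argument is a set of patient identifiers. A metric is returned as
    None when its denominator is zero.
    """
    tp = len(algorithm_positive & chart_positive)
    fp = len(algorithm_positive - chart_positive)
    fn = len(chart_positive - algorithm_positive)
    tn = len(all_patients - algorithm_positive - chart_positive)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # true cases found by the algorithm
        "specificity": ratio(tn, tn + fp),  # non-cases correctly excluded
        "ppv": ratio(tp, tp + fp),          # algorithm positives that are true cases
    }


# Hypothetical validation sample of ten patients.
sample = {f"p{i}" for i in range(1, 11)}
confirmed_by_chart = {"p1", "p2", "p3", "p4"}
selected_by_algorithm = {"p1", "p2", "p3", "p5"}

metrics = validation_metrics(selected_by_algorithm, confirmed_by_chart, sample)
```

In this toy sample, the algorithm identifies three of the four confirmed patients (sensitivity 0.75), and three of its four selections are confirmed (positive predictive value 0.75).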
No potential conflict of interest relevant to this article was reported.
J.S.P. was involved in the conception and design of the study. J.S.P. and C.H.L. were involved in drafting the manuscript, critically revising it for important intellectual content, and giving final approval of the version to be published.