Data Cleaning: From KOSIS Raw Data to Tidy Format

Introduction

This vignette demonstrates how to clean and transform raw Korean census data from the Korean Statistical Information Service (KOSIS) API into a tidy, analysis-ready format. Existing R packages, such as the kosis package, provide general access to KOSIS data. However, no readily available tool is designed specifically to retrieve and process Korean census data in a tidy format suited to demographic and socioeconomic research.

The Challenge

Korean census and administrative data present several challenges for researchers:

Complex API URLs: KOSIS APIs require detailed parameter specifications with cryptic codes
Multiple data sources: Different datasets (population, tax, mortality) come from different API endpoints
Inconsistent formatting: Raw data often comes in wide format with Korean-language column names
Administrative code mapping: Different datasets use different administrative code systems
Data integration: Combining multiple datasets requires careful matching of administrative units

This vignette shows how the tidycensuskr package addresses these challenges by providing pre-processed, tidy datasets and demonstrating the underlying data cleaning workflow.
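
To make the first challenge concrete: each request URL listed under Raw Data Sources is a fixed endpoint followed by query parameters, with code lists joined by `+` and ending in a trailing `+`. The following is a minimal sketch, assuming that pattern, of how such a URL could be assembled from vectors of codes. The helper `build_kosis_url()` and its argument names are hypothetical and are not part of kosis or tidycensuskr. The parameter names and example codes are copied from the population URL in this vignette. The sketch omits `objL3` to `objL8` and `outputFields`, so it is not a drop-in replacement for the full URLs.

```r
# Hypothetical helper: assemble a KOSIS request URL from its parts.
build_kosis_url <- function(org_id, tbl_id, itm_id, obj_l1, obj_l2 = "",
                            api_key = "") {
  # Code lists are joined with "+" and end with a trailing "+".
  # Empty values and the keyword "ALL" are passed through unchanged.
  join_codes <- function(x) {
    if (length(x) == 1 && (x == "" || x == "ALL")) return(x)
    paste0(paste(x, collapse = "+"), "+")
  }
  params <- c(
    method = "getList",
    apiKey = api_key,
    itmId = join_codes(itm_id),
    objL1 = join_codes(obj_l1),
    objL2 = join_codes(obj_l2),
    format = "json",
    jsonVD = "Y",
    prdSe = "Y",
    newEstPrdCnt = "1",
    orgId = org_id,
    tblId = tbl_id
  )
  paste0(
    "https://kosis.kr/openapi/Param/statisticsParameterData.do?",
    paste(names(params), params, sep = "=", collapse = "&")
  )
}

# Two district codes and two item codes taken from the population URL.
url_pop_small <- build_kosis_url(
  org_id = "101", tbl_id = "DT_1IN1509",
  itm_id = c("T00", "T60"),
  obj_l1 = c("11010", "11020"),
  obj_l2 = "ALL"
)
url_pop_small
```

Keeping the codes in vectors makes it easier to see which districts a request covers than reading them out of a single long string.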

Raw Data Sources

The data cleaning process involves retrieving data from multiple KOSIS API endpoints:

1. Tax Data (General Tax Revenue)

url_tax_general <- 
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T001+&objL1=A0201+A0202+A0203+A0204+A0205+A0206+A0207+A0208+A0209+A0210+A0211+A0212+A0213+A0214+A0215+A0216+A0217+A0218+A0219+A0220+A0221+A0222+A0223+A0224+A0225+A0301+A0302+A0303+A0304+A0305+A0306+A0307+A0308+A0309+A0310+A0401+A0402+A0403+A0404+A0405+A0406+A0407+A0408+A0409+A0410+A0411+A0412+A0413+A0414+A0415+A0416+A0417+A0418+A0419+A0420+A0421+A0422+A0423+A0424+A0425+A0426+A0427+A0428+A0429+A0430+A0431+A0501+A0502+A0503+A0504+A0505+A0506+A0507+A0508+A0509+A0510+A0511+A0512+A0513+A0514+A0515+A0516+A0517+A0518+A0601+A0602+A0603+A0604+A0605+A0701+A0702+A0703+A0704+A0705+A0706+A0707+A0708+A0709+A0710+A0711+A0801+A0802+A0803+A0804+A0805+A0806+A0807+A0808+A0809+A0810+A0811+A0812+A0813+A0814+A0815+A09+A1001+A1002+A1003+A1004+A1005+A1101+A1102+A1103+A1104+A1105+A1106+A1107+A1108+A1109+A1110+A1111+A1112+A1113+A1114+A1201+A1202+A1203+A1204+A1205+A1206+A1207+A1208+A1209+A1210+A1211+A1212+A1213+A1214+A1215+A1216+A1217+A1218+A1219+A1220+A1221+A1222+A1309+A1301+A1302+A1303+A1304+A1305+A1306+A1307+A1308+A1401+A1402+A1403+A1404+A1405+A1406+A1407+A1408+A1409+A1410+A1411+A1412+A1413+A1414+A1415+A1416+A1417+A1418+A1419+A1420+A1421+A1422+A1423+A1501+A1502+A1503+A1504+A1505+A1506+A1507+A1508+A1509+A1510+A1511+A1512+A1513+A1514+A1515+A1516+A1601+A1602+A1603+A1604+A1605+A1701+A1702+A1703+A1704+A1705+A1706+A1707+A1708+A1709+A1710+A1711+A1712+A1713+A1714+A1715+A1716+A1717+A1718+A1802+A1801+&objL2=15133SGH0M+&objL3=&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=133&tblId=DT_133N_A3212"

2. Income Tax Data

url_tax_income <- 
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=&itmId=T001+T002+&objL1=A0101+A0102+A0103+A0104+A0105+A0106+A0107+A0108+A0109+A0110+A0111+A0112+A0113+A0114+A0115+A0116+A0117+A0118+A0119+A0120+A0121+A0122+A0123+A0124+A0125+A0201+A0202+A0204+A0205+A0203+A0206+A0207+A0208+A0209+A0210+A0301+A0302+A0303+A0304+A0305+A0306+A0307+A0308+A0309+A0310+A0311+A0312+A0313+A0314+A0315+A0316+A0317+A0318+A0319+A0320+A0321+A0322+A0323+A0324+A0325+A0326+A0327+A0328+A0329+A0330+A0331+A0401+A0402+A0403+A0404+A0405+A0406+A0407+A0408+A0409+A0410+A0411+A0412+A0413+A0414+A0415+A0416+A0417+A0418+A0501+A0502+A0503+A0504+A0505+A0601+A0602+A0603+A0604+A0605+A0606+A0607+A0608+A0609+A0610+A0611+A0701+A0702+A0703+A0704+A0705+A0706+A0707+A0708+A0709+A0710+A0711+A0712+A0713+A0714+A0715+A08+A0901+A0902+A0903+A0904+A0905+A1001+A1002+A1003+A1004+A1005+A1006+A1007+A1008+A1009+A1010+A1011+A1012+A1013+A1014+A1101+A1102+A1103+A1104+A1105+A1106+A1107+A1108+A1109+A1110+A1111+A1112+A1113+A1114+A1115+A1116+A1117+A1118+A1119+A1120+A1121+A1122+A1201+A1202+A1203+A1204+A1205+A1206+A1207+A1208+A1301+A1302+A1303+A1304+A1305+A1306+A1307+A1308+A1309+A1310+A1311+A1312+A1313+A1314+A1315+A1316+A1317+A1318+A1319+A1320+A1321+A1322+A1323+A1401+A1402+A1403+A1404+A1405+A1406+A1407+A1408+A1409+A1410+A1411+A1412+A1413+A1414+A1415+A1416+A1501+A1502+A1503+A1504+A1505+A1601+A1602+A1603+A1604+A1605+A1606+A1607+A1608+A1609+A1610+A1611+A1612+A1613+A1614+A1615+A1616+A1617+A1618+A1701+A1702+&objL2=ALL&objL3=&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=133&tblId=DT_133001N_4214"

3. Mortality Data

url_mortality <- 
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=&itmId=T7+&objL1=0+&objL2=11010+11020+11030+11040+11050+11060+11070+11080+11090+11100+11110+11120+11130+11140+11150+11160+11170+11180+11190+11200+11210+11220+11230+11240+11250+21010+21020+21030+21040+21050+21060+21070+21080+21090+21100+21110+21120+21130+21140+21150+21310+21510+22010+22020+22030+22040+22050+22060+22070+22310+22510+22520+23010+23020+23030+23040+23050+23060+23070+23080+23090+23310+23320+23510+23520+24010+24020+24030+24040+24050+25010+25020+25030+25040+25050+26010+26020+26030+26040+26310+26510+29010+31010+31011+31012+31013+31014+31020+31021+31022+31023+31030+31040+31041+31042+31050+31051+31052+31053+31060+31070+31080+31090+31091+31092+31100+31101+31102+31103+31104+31110+31120+31130+31140+31150+31160+31170+31180+31190+31191+31192+31193+31200+31210+31220+31230+31240+31250+31260+31270+31280+31310+31320+31330+31340+31350+31360+31370+31380+31550+31570+31580+32010+32020+32030+32040+32050+32060+32070+32310+32320+32330+32340+32350+32360+32370+32380+32390+32400+32410+32510+32520+32530+32540+32550+32560+32570+32580+32590+32600+32610+33010+33011+33012+33020+33030+33040+33041+33042+33043+33044+33310+33320+33330+33340+33350+33360+33370+33380+33390+33520+33530+33540+33550+33560+33570+33580+33590+34010+34011+34012+34020+34030+34040+34050+34060+34070+34080+34310+34320+34330+34340+34350+34360+34370+34380+34390+34400+34510+34530+34540+34550+34560+34570+34580+35010+35011+35012+35020+35030+35040+35050+35060+35310+35320+35330+35340+35350+35360+35370+35380+35510+35520+35530+35540+35550+35560+35570+35580+36010+36020+36030+36040+36060+36310+36320+36330+36350+36360+36370+36380+36390+36400+36410+36420+36430+36440+36450+36460+36470+36480+36510+36520+36530+36550+36560+36570+36580+36590+36600+36610+36620+36630+36640+36650+36660+36670+36680+37010+37011+37012+37020+37030+37040+37050+37060+37070+37080+37090+37100+37310+37320+37330+37340+37350+37360+37370+37380+37390+37400+37410+37420+37430+37520+37530+37540+37550+37560+37570+37580+37590+37600+37610+37620+37630+38010+38020+38021+38022+38030+38040+38050+38060+38110+38111+38112+38113+38114+38115+38070+38080+38090+38100+38310+38320+38330+38340+38350+38360+38370+38380+38390+38400+38510+38520+38530+38540+38550+38560+38570+38580+38590+38600+39010+39020+39310+39320+&objL3=ALL&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=101&tblId=DT_1B34E13"

4. Population Data

url_pop <- 
  "https://kosis.kr/openapi/Param/statisticsParameterData.do?method=getList&apiKey=인증키없음&itmId=T00+T60+&objL1=11010+11020+11030+11040+11050+11060+11070+11080+11090+11100+11110+11120+11130+11140+11150+11160+11170+11180+11190+11200+11210+11220+11230+11240+11250+21010+21020+21030+21040+21050+21060+21070+21080+21090+21100+21110+21120+21130+21140+21150+21510+22010+22020+22030+22040+22050+22060+22070+22510+22520+23010+23020+23030+23040+23050+23060+23070+23080+23090+23510+23520+24010+24020+24030+24040+24050+25010+25020+25030+25040+25050+26010+26020+26030+26040+26510+29010+31010+31011+31012+31013+31014+31020+31021+31022+31023+31030+31040+31041+31042+31050+31051+31052+31053+31060+31070+31080+31090+31091+31092+31100+31101+31103+31104+31110+31120+31130+31140+31150+31160+31170+31180+31190+31191+31192+31193+31200+31210+31220+31230+31240+31250+31260+31270+31280+31550+31570+31580+32010+32020+32030+32040+32050+32060+32070+32510+32520+32530+32540+32550+32560+32570+32580+32590+32600+32610+33020+33030+33040+33041+33042+33043+33044+33520+33530+33540+33550+33560+33570+33580+33590+34010+34011+34012+34020+34030+34040+34050+34060+34070+34080+34510+34530+34540+34550+34560+34570+34580+35010+35011+35012+35020+35030+35040+35050+35060+35510+35520+35530+35540+35550+35560+35570+35580+36010+36020+36030+36040+36060+36510+36520+36530+36550+36560+36570+36580+36590+36600+36610+36620+36630+36640+36650+36660+36670+36680+37010+37011+37012+37020+37030+37040+37050+37060+37070+37080+37090+37100+37510+37520+37530+37540+37550+37560+37570+37580+37590+37600+37610+37620+37630+38030+38050+38060+38070+38080+38090+38100+38110+38111+38112+38113+38114+38115+38510+38520+38530+38540+38550+38560+38570+38580+38590+38600+39010+39020+&objL2=ALL&objL3=000+&objL4=&objL5=&objL6=&objL7=&objL8=&format=json&jsonVD=Y&prdSe=Y&newEstPrdCnt=1&outputFields=TBL_ID+TBL_NM+OBJ_ID+OBJ_NM+OBJ_NM_ENG+NM+NM_ENG+ITM_ID+ITM_NM+ITM_NM_ENG+UNIT_NM+UNIT_NM_ENG+&orgId=101&tblId=DT_1IN1509"

Data Retrieval

Setting up API Access

library(tidycensuskr)
library(dplyr)
library(tidyr)
library(kosis)

# Set KOSIS API key (stored in a secure file)
kosiskey <- readLines("~/.kosiskey")[1]
tidycensuskr::set_kosis_key("~/.kosiskey")

Downloading Raw Data

# Download raw datasets from KOSIS API
df_tax <- kosis::getStatDataFromURL(url_tax_general)
df_tax_income <- kosis::getStatDataFromURL(url_tax_income)
df_mortality <- kosis::getStatDataFromURL(url_mortality)
df_pop <- kosis::getStatDataFromURL(url_pop)

# Load administrative district lookup table
sgg_lookup <- read.csv("inst/extdata/lookup_district_code.csv", fileEncoding = "EUC-KR")
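
The cleaning steps in this vignette rely on a handful of columns in the downloaded data frames: `C1`, `C1_NM`, `C2`, `C3`, `ITM_ID`, and `DT`. As a small sketch, assuming a download might occasionally lack one of them, a check like the following stops early with a readable message. The helper `check_columns()` is hypothetical, and the toy data frame stands in for a real download with invented values.

```r
# Hypothetical helper: stop early if a table lacks columns used later.
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(df)
}

# Toy stand-in for a downloaded table; the values are invented.
toy_raw <- data.frame(
  C1 = c("11010", "11020"),
  C1_NM = c("District A", "District B"),
  C2 = c("0", "0"),
  ITM_ID = c("T00", "T00"),
  DT = c("100", "200"),
  stringsAsFactors = FALSE
)

# Passes silently because every required column is present.
check_columns(toy_raw, c("C1", "C1_NM", "C2", "ITM_ID", "DT"))

# Fails because the toy table has no C3 column; capture the message.
res <- tryCatch(
  check_columns(toy_raw, c("C1", "C3")),
  error = function(e) conditionMessage(e)
)
res
```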

Data Cleaning Workflow

1. Administrative Code Mapping

The first challenge is creating a consistent mapping between different administrative code systems used across datasets:

# Create administrative code mapping for provinces (sido)
sidocd_range <- tibble::tribble(
  ~sido_kr, ~sido_cd, ~sido_txcd,
  "서울특별시", "11", "02",
  "부산광역시", "21", "15", 
  "대구광역시", "22", "13",
  "인천광역시", "23", "03",
  "광주광역시", "24", "10",
  "대전광역시", "25", "06",
  "울산광역시", "26", "16",
  "세종특별자치시", "29", "09",
  "경기도", "31", "04",
  "강원특별자치도", "32", "05",
  "충청북도", "33", "07",
  "충청남도", "34", "08",
  "전라북도", "35", "11",
  "전라남도", "36", "12",
  "경상북도", "37", "14",
  "경상남도", "38", "17",
  "제주특별자치도", "39", "18"
)
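
As an illustration of how such a table can be used, the sketch below looks up the province for a few district codes. It assumes that the first two digits of a five-digit district code identify the province, which is consistent with the codes in the request URLs above but is not stated by the source. The `sido_txcd` column appears to be the province prefix used in the tax tables (for example `A0201` for a Seoul district); that reading is an interpretation of the URLs, not a documented rule. Only three rows of the table are reproduced here.

```r
# Three rows copied from the province table above.
sido_small <- data.frame(
  sido_kr = c("서울특별시", "부산광역시", "경기도"),
  sido_cd = c("11", "21", "31"),
  sido_txcd = c("02", "15", "04"),
  stringsAsFactors = FALSE
)

# Assumption: the first two digits of a district code identify the province.
district_codes <- c("11010", "21510", "31011")
prefix <- substr(district_codes, 1, 2)
idx <- match(prefix, sido_small$sido_cd)

lookup_result <- data.frame(
  sigungu_cd = district_codes,
  sido_kr = sido_small$sido_kr[idx],
  sido_txcd = sido_small$sido_txcd[idx],
  stringsAsFactors = FALSE
)
lookup_result
```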

2. Tax Data Processing

Transform raw tax data into a standardized format:

df_tax_compact <- df_tax %>%
  dplyr::transmute(
    sgg_tax_global = C1,           # Administrative code
    tax_global_total_milkrw = DT   # Tax value in million KRW
  ) %>%
  dplyr::inner_join(
    sgg_lookup[, c("sgg_tax_global", "sido_en", "sigun_en", "sigungu_cd")],
    by = "sgg_tax_global",
    multiple = "first"
  )

df_tax_income_compact <- df_tax_income %>%
  dplyr::transmute(
    sgg_tax_income = C1,
    tax_income_total_milkrw = DT
  ) %>%
  dplyr::inner_join(
    sgg_lookup[, c("sgg_tax_income", "sido_en", "sigun_en", "sigungu_cd")],
    by = "sgg_tax_income",
    multiple = "first"
  )
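
The joins above use `multiple = "first"`, which keeps only the first lookup row when a code matches more than one, and `inner_join()` silently drops codes with no match. The toy sketch below shows both effects, using invented codes and values, and uses `anti_join()` to list what was dropped. It assumes dplyr 1.1.0 or later, where the `multiple` argument is available.

```r
library(dplyr)

# Toy tax table and lookup; codes and values are invented for illustration.
toy_tax <- data.frame(
  sgg_tax_global = c("A0201", "A0202", "A0999"),
  tax_global_total_milkrw = c(10, 20, 30),
  stringsAsFactors = FALSE
)
toy_lookup <- data.frame(
  sgg_tax_global = c("A0201", "A0201", "A0202"),
  sigungu_cd = c(11010L, 11011L, 11020L),
  stringsAsFactors = FALSE
)

# "A0201" matches two lookup rows; only the first is kept.
joined <- inner_join(toy_tax, toy_lookup,
                     by = "sgg_tax_global", multiple = "first")

# "A0999" has no match in the lookup and is absent from `joined`.
dropped <- anti_join(toy_tax, toy_lookup, by = "sgg_tax_global")

joined
dropped
```

Inspecting the `anti_join()` result after each join is one way to confirm that no district is lost unintentionally.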

3. Population Data Processing

Clean and reshape population data with gender disaggregation:

df_pop2 <- df_pop %>%
  dplyr::mutate(
    sex = plyr::mapvalues(C2, c(0, 1, 2), c("total", "male", "female")),
    type = plyr::mapvalues(ITM_ID, c("T00", "T60"), c("population", "population_nonrelative"))
  ) %>%
  dplyr::select(C1, C1_NM, sex, type, DT) %>%
  tidyr::pivot_wider(
    names_from = c(type, sex),
    values_from = DT
  ) %>%
  dplyr::rename(
    sigungu_cd = C1,
    sigungu_kr = C1_NM
  ) %>%
  dplyr::mutate(
    sigungu_cd = as.integer(sigungu_cd)
  ) %>%
  dplyr::inner_join(
    sgg_lookup[, c("sigungu_cd", "sido_en", "sigungu_1_en")],
    by = "sigungu_cd",
    multiple = "first"
  )
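
When `names_from` receives two columns, `pivot_wider()` joins their values with an underscore to form the new column names. The toy sketch below, with invented counts, shows the names that result from the `type` and `sex` labels used above.

```r
library(tidyr)

# Toy long table for one district; the counts are invented.
toy_pop <- data.frame(
  C1 = rep("11010", 4),
  type = c("population", "population",
           "population_nonrelative", "population_nonrelative"),
  sex = c("total", "male", "total", "male"),
  DT = c("100", "48", "5", "2"),
  stringsAsFactors = FALSE
)

# One row per district, one column per type/sex combination.
toy_wide <- pivot_wider(toy_pop, names_from = c(type, sex), values_from = DT)
names(toy_wide)
```

The resulting names, such as `population_male` and `population_nonrelative_total`, contain a different number of underscore-separated parts, which matters when the names are split again later.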

4. Mortality Data Processing

Process mortality data with gender-specific rates:

df_mort_clean <- df_mortality %>%
  dplyr::transmute(
    sigungu_cd = as.integer(C2), 
    sex_cd = C3,
    category = "All causes",
    r_mortality_100k = DT
  ) %>%
  tidyr::pivot_wider(
    names_from = sex_cd,
    values_from = r_mortality_100k
  ) %>%
  dplyr::inner_join(
    sgg_lookup[, c("sigungu_cd", "sido_en", "sigungu_1_en")],
    by = "sigungu_cd",
    multiple = "first"
  )

Transforming to Tidy Format

The key feature of tidycensuskr is that it converts a wide-format, multi-dataset structure into a single, long-format tidy dataset:

1. Convert Each Dataset to Long Format

# Tax data to long format
df_tax_long <- df_tax_compact %>%
  dplyr::select(2:5) %>%
  tidyr::pivot_longer(
    cols = "tax_global_total_milkrw"
  ) %>%
  dplyr::mutate(
    type = "tax",
    class1 = "global",
    class2 = "total",
    unit = "million KRW"
  ) %>%
  dplyr::select(-name)

# Income tax data to long format  
df_tax_income_long <- df_tax_income_compact %>%
  dplyr::select(2:5) %>%
  tidyr::pivot_longer(
    cols = "tax_income_total_milkrw"
  ) %>%
  dplyr::mutate(
    type = "tax",
    class1 = "income", 
    class2 = "total",
    unit = "million KRW"
  ) %>%
  dplyr::select(-name)

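# Population data to long format
# Note: two-part names such as "population_total" leave class2 as NA
# (with a warning), because separate() fills missing pieces on the right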
df_pop_long <- df_pop2 %>%
  dplyr::select(-2) %>%
  tidyr::pivot_longer(
    cols = 2:7
  ) %>%
  tidyr::separate(col = "name", into = c("type", "class1", "class2"), sep = "_")

# Mortality data to long format
df_mort_long <- df_mort_clean %>%
  tidyr::pivot_longer(
    cols = 3:5
  ) %>%
  dplyr::mutate(
    class2 = dplyr::case_when(
      name == "0" ~ "total",
      name == "1" ~ "male", 
      name == "2" ~ "female",
      TRUE ~ NA_character_
    ),
    unit = "per 100k population",
    type = "mortality"
  ) %>%
  dplyr::rename(class1 = category) %>%
  dplyr::select(-name)
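
The population step splits column names on underscores into three parts. As a hedged illustration of how `tidyr::separate()` treats names with fewer parts than requested, the toy sketch below uses `fill = "right"`, which pads on the right with `NA` and suppresses the warning that the default setting would emit. The values are invented.

```r
library(tidyr)

# Two names with a different number of underscore-separated parts.
toy_names <- data.frame(
  name = c("population_total", "population_nonrelative_total"),
  value = c(100, 5),
  stringsAsFactors = FALSE
)

# fill = "right" pads short names with NA on the right, without a warning.
toy_split <- separate(
  toy_names,
  col = "name",
  into = c("type", "class1", "class2"),
  sep = "_",
  fill = "right"
)
toy_split
```

For the two-part name, the sex label lands in `class1` and `class2` is `NA`. If an analysis expects the sex label in `class2`, the two-part names would need an explicit middle part before splitting.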

2. Combine into Single Tidy Dataset

# Bind all datasets into one comprehensive long-format dataset
censuskor <- dplyr::bind_rows(
    df_tax_long,
    df_tax_income_long,
    df_pop_long,
    df_mort_long
  ) %>%
  dplyr::rename(
    adm1 = sido_en,
    adm2 = sigungu_1_en,
    adm2_other = sigun_en,
    adm2_code = sigungu_cd
  ) %>%
  dplyr::mutate(
    year = 2020
  ) %>%
  dplyr::select(
    year, adm1, adm2, adm2_other, adm2_code,
    type, class1, class2, unit, value
  )
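
`bind_rows()` matches columns by name and fills columns that a table lacks with `NA`, which is why the tax tables (with `sigun_en`) and the population and mortality tables (with `sigungu_1_en`) can be stacked directly. The toy sketch below shows this with invented names and values. Columns that share a name must also share a compatible type, so a `value` column stored as character in one table and numeric in another would need converting first.

```r
library(dplyr)

# Toy long tables with partly different columns; all values are invented.
toy_tax_long <- data.frame(
  sigun_en = "District A",
  sigungu_cd = 11010L,
  type = "tax",
  value = 10,
  unit = "million KRW",
  stringsAsFactors = FALSE
)
toy_pop_long <- data.frame(
  sigungu_1_en = "District A",
  sigungu_cd = 11010L,
  type = "population",
  value = 100,
  stringsAsFactors = FALSE
)

# Columns missing from one input are filled with NA in the result.
toy_all <- bind_rows(toy_tax_long, toy_pop_long)
toy_all
```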

The Result: A Tidy Dataset

The final censuskor dataset has a consistent structure:

year: Census year (2020)
adm1: Province-level administrative name (in English)
adm2: District-level administrative name (in English)
adm2_other: Alternative district name
adm2_code: Numeric administrative code
type: Data type (“population”, “tax”, “mortality”)
class1: First-level classification (e.g., “global”, “income”, “All causes”)
class2: Second-level classification (e.g., “total”, “male”, “female”)
unit: Unit of measurement
value: Numeric value

Example Usage

library(tidycensuskr)
library(dplyr)
library(tidyr)  # provides pivot_wider()

# Load the cleaned dataset
data(censuskor)

# View the structure
head(censuskor)

# Filter for Seoul population data
seoul_pop <- censuskor %>%
  filter(adm1 == "Seoul", type == "population", class1 == "population") %>%
  select(adm2, class2, value) %>%
  pivot_wider(names_from = class2, values_from = value)

head(seoul_pop)

Benefits of This Approach

Consistent Structure: All datasets follow the same tidy format principles
Easy Filtering: Users can easily filter by geography, data type, or demographic group
Analysis-Ready: No additional cleaning required for most analyses
Flexible: Can be easily reshaped for specific visualization or modeling needs
Extensible: New data types can be added following the same pattern
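
As a brief sketch of the reshaping flexibility mentioned above, the example below builds a toy table with the same column names as `censuskor` and derives a share from it. The district names and counts are invented, and only the columns needed for the example are included, so this is not the packaged dataset.

```r
library(dplyr)
library(tidyr)

# Toy table shaped like a slice of censuskor; all values are invented.
toy_census <- data.frame(
  adm2 = rep(c("District A", "District B"), each = 3),
  type = "population",
  class2 = rep(c("total", "male", "female"), times = 2),
  value = c(100, 48, 52, 200, 110, 90),
  stringsAsFactors = FALSE
)

# Spread the sex categories into columns, then compute the female share.
sex_share <- toy_census %>%
  filter(type == "population") %>%
  pivot_wider(names_from = class2, values_from = value) %>%
  mutate(share_female = female / total)

sex_share
```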

Conclusion

While the kosis package provides general access to Korean statistical data, tidycensuskr fills the specific need for cleaned, tidy, analysis-ready Korean census and administrative data. The extensive data cleaning workflow demonstrated in this vignette transforms complex, multilingual, multi-format raw data into a consistent, research-friendly format that follows tidy data principles.

This preprocessing saves researchers significant time and reduces the potential for errors in data handling, allowing them to focus on analysis rather than data wrangling.

Future Directions

Integration of additional years of census data (2015, 2010)
Addition of economic indicators and environmental data
Real-time data updates as new census releases become available
Expanded geographic coverage including sub-district (eup/myeon/dong) level data

Note: This vignette shows the data cleaning process for educational purposes. The actual censuskor dataset in the package is the result of this workflow and is ready for immediate use.

2025-08-06