Methodological Approaches for Multidimensional Personal Data Creation

  • Vasil Marchev
  • Angel Marchev JR
  • Kaloyan Haralampiev
  • Alexander Efremov
  • Boyan Markov
  • Dimitar Lyubchev
  • Milena Piryankova
  • Bogomil Filipov
  • Daniel Masarliev
  • Valentin Mitkov
Keywords: Synthetic data, data generation, statistical distributions, business logic, correlations, simulation

Abstract

This paper describes the metadata of a multidimensional synthetic dataset generated by an algorithm, and addresses the challenges of collecting and using extensive datasets for scientific research, particularly when the data contain sensitive information governed by legal frameworks such as the GDPR and the Bank Secrecy Act. The methodology under consideration employs simulation techniques to create a dataset comprising 36 distinct variables categorized into demographic, personal, and banking characteristics. Such a synthetic dataset is essential for empirical studies where data availability is restricted by legal constraints. The research draws on diverse data sources, including the Bulgarian Census 2021, the National Statistical Institute, and the Bulgarian National Bank, to ensure comprehensive coverage when deriving the distributions. We emphasize the importance of validating the generated data so that it meets quality standards and supports effective modeling. This study contributes to the ongoing discourse on data synthesis in data science by highlighting innovative strategies for addressing data shortages. It also follows Eurostat's best practices for describing metadata: it gives a detailed breakdown of all variables and analyzes the need to include each of them in the summarized set of information, in view of the objectives of the study.
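The distribution-based idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the variable, the category shares, and the function names are hypothetical, and the validation step is a simple comparison of shares standing in for whatever tests the paper actually applies.

```python
import random
from collections import Counter

# Hypothetical marginal shares for one categorical variable.
# These numbers are illustrative assumptions, NOT values from the paper.
TARGET_SHARES = {"primary": 0.20, "secondary": 0.50, "tertiary": 0.30}


def synthesize(shares, n, seed=0):
    """Draw n synthetic values whose expected frequencies follow `shares`."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    categories = list(shares)
    weights = [shares[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


def max_share_deviation(sample, shares):
    """Largest absolute gap between observed and target category shares."""
    counts = Counter(sample)
    n = len(sample)
    return max(abs(counts[c] / n - p) for c, p in shares.items())


sample = synthesize(TARGET_SHARES, n=10_000)
deviation = max_share_deviation(sample, TARGET_SHARES)
print(f"max deviation from target shares: {deviation:.4f}")
```

A small deviation suggests the synthetic marginal reproduces the target distribution. A full pipeline would repeat this for every variable and would also have to handle dependencies between variables, which this sketch ignores.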

References

1. NSI, 2021, Census 2021, Sofia, https://census2021.bg
2. Tipatov, N., 2009, "Statistika po kommentariyam k testu Ayzenka", Biznes Trener, http://b-t.com.ua/test_ayzenk_opis.html (in Russian)
3. Marchev, V., Marchev, A., Jr. (2024). Anonymizing Personal Information Using Distribution-based Data Synthesis, XXII INTERNATIONAL SCIENTIFIC CONFERENCE “MANAGEMENT AND ENGINEERING’24”, Sozopol, Bulgaria (in press)
4. Nikhil, 2024. Google AI Introduces CodecLM: A Machine Learning Framework for Generating High-Quality Synthetic Data for LLM Alignment; How to Use Synthetic and Simulated Data Effectively, https://towardsdatascience.com/how-to-use-synthetic-and-simulated-data-effectively-04d8582b6f88
5. Pearson, K., 1936. Method of Moments and Method of Maximum Likelihood, Biometrika 28(1/2), 35–59.
6. Hansen, L. P., 1982. Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, Vol. 50, No. 4 (July 1982)
7. Marchev, V., Marchev, A., 2021. “Methods for Simulating Multi-dimensional Data for Financial Services Recommendation”, Bulgarian economic paper, ISSN: 2367-7082
8. Marchev, A., Marchev, V., 2022. Synthesizing multi-dimensional personal data sets, AIP Conference Proceedings, 2505 (1): 020012. https://doi.org/10.1063/5.0100615
9. NSI, Demographic statistics, 2024, https://www.nsi.bg/en/content/21307/%D0%BF%D1%80%D0%B5%D1%81%D1%81%D1%8A%D0%BE%D0%B1%D1%89%D0%B5%D0%BD%D0%B8%D0%B5/population-and-demographic-processes-2023
10. BNB, 2024, Statistical Database, Selection of statistics, https://www.bnb.bg/statistics/index.htm?toLang=_EN
11. FSC, 2024, Insurance Activity, Statistics, https://www.fsc.bg/en/insurance-activity/statistics/
12. Ministry of Finance, 2024, Economic Policy, Analyses and Research, https://www.minfin.bg/en/865
13. Ministry of Agriculture and Food, 2024, Statistics and Analyses, https://www.mzh.government.bg/en/statistics-and-analyses/
14. Simard, R., L’Ecuyer, P., 2010, “Computing the Two-Sided Kolmogorov-Smirnov Distribution”, Journal of Statistical Software.
15. Brown, J., Harvey, M., 2008, “Rational Arithmetic Mathematica Functions to Evaluate the Two-Sided One Sample K-S Cumulative Sampling Distribution”, Journal of Statistical Software, Volume 26, Issue 2.
16. Marchev, V., Marchev, A., Piryankova, M., Masarliev, D., & Mitkov, V., 2023. Synthesizing an anonymized multidimensional dataset featuring financial, economic, demographic, and personal traits data. Vanguard Scientific Instruments in Management, vol. 19, no. 1, 2023, ISSN 1314-0582, 79-99.
17. The final validated dataset is publicly available in a repository at https://doi.org/10.57967/hf/3701
18. Infostat, TERTIARY EDUCATION GRADUATES BY EDUCATIONAL-QUALIFICATION DEGREE, SEX AND NARROW FIELD OF EDUCATION (FOET), 2001 – 2016, Report result
Published
2025-02-25
How to Cite
Marchev, V., Marchev JR, A., Haralampiev, K., Efremov, A., Markov, B., Lyubchev, D., Piryankova, M., Filipov, B., Masarliev, D., & Mitkov, V. (2025). Methodological Approaches for Multidimensional Personal Data Creation. Vanguard Scientific Instruments in Management, 20, 108-131. Retrieved from https://www.vsim-journal.info/index.php?journal=vsim&page=article&op=view&path[]=544