Methodological Approaches for Multidimensional Personal Data Creation
Abstract
This paper describes the metadata of a multidimensional synthetic dataset generated by an algorithm, and addresses the challenges of collecting and using extensive datasets for scientific research, particularly when the information is sensitive and governed by legal frameworks such as the GDPR and the Bank Secrecy Act. The methodology employs simulation techniques to create a dataset of 36 distinct variables, categorized into demographic, personal, and banking characteristics. Such a synthetic dataset is essential for empirical studies in which data availability is restricted by legal constraints. The research draws on diverse data sources, including the Bulgarian Census 2021, the National Statistical Institute, and the Bulgarian National Bank, to derive the distributions. We emphasize the importance of validating the generated data so that it meets quality standards and supports effective modeling. The study contributes to the ongoing discourse on data synthesis in data science by highlighting strategies for addressing data shortages. It also follows Eurostat's best practices for describing metadata: it gives a detailed breakdown of all variables and analyzes the need to include each of them in the summarized set of information, in view of the objectives of the study.
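As an illustration of the general idea only, the following minimal Python sketch draws synthetic records from marginal distributions and then checks one variable with a two-sample Kolmogorov-Smirnov statistic. It is not the authors' algorithm: the two variables, the category weights, the age parameters, the clipping bounds, and the function names `synthesize` and `ks_statistic` are all hypothetical placeholders, and the K-S check is an assumed validation step suggested by the reference list rather than one stated in the abstract.

```python
import bisect
import random

# Hypothetical marginals: placeholder values, NOT figures taken from the
# census, the National Statistical Institute, or the Bulgarian National Bank.
SEX_WEIGHTS = {"female": 0.52, "male": 0.48}
AGE_MEAN, AGE_STD = 44.0, 18.0   # assumed parameters of the age distribution
AGE_MIN, AGE_MAX = 18.0, 100.0   # assumed clipping bounds


def synthesize(n, seed=0):
    """Draw n synthetic records, sampling each variable from its assumed marginal."""
    rng = random.Random(seed)
    categories = list(SEX_WEIGHTS)
    weights = list(SEX_WEIGHTS.values())
    records = []
    for _ in range(n):
        # Continuous variable: normal draw clipped to the assumed bounds.
        age = min(max(rng.gauss(AGE_MEAN, AGE_STD), AGE_MIN), AGE_MAX)
        # Categorical variable: weighted draw from the assumed shares.
        sex = rng.choices(categories, weights=weights)[0]
        records.append({"age": age, "sex": sex})
    return records


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


if __name__ == "__main__":
    synthetic = synthesize(2000, seed=1)
    reference = synthesize(2000, seed=2)  # stand-in for a real reference sample
    d = ks_statistic([r["age"] for r in synthetic],
                     [r["age"] for r in reference])
    print(f"K-S statistic for age: {d:.4f}")
```

In practice the reference sample would come from the published source distribution rather than from a second run of the generator, and a small statistic would suggest that the synthetic variable follows the intended distribution.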
References
2. Tipatov, N., 2009, "Statistika po kommentariyam k testu Ayzenka", Biznes Trener, http://b-t.com.ua/test_ayzenk_opis.html (in Russian)
3. Marchev, V., Marchev, A., Jr., 2024. Anonymizing Personal Information Using Distribution-based Data Synthesis, XXII International Scientific Conference “Management and Engineering ’24”, Sozopol, Bulgaria (in press)
4. Nikhil, 2024. Google AI Introduces CodecLM: A Machine Learning Framework for Generating High-Quality Synthetic Data for LLM Alignment; How to Use Synthetic and Simulated Data Effectively, https://towardsdatascience.com/how-to-use-synthetic-and-simulated-data-effectively-04d8582b6f88
5. Pearson, K., 1936. Method of Moments and Method of Maximum Likelihood, Biometrika 28(1/2), 35–59.
6. Hansen, L. P., 1982. Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, Vol. 50, No. 4 (July 1982)
7. Marchev, V., Marchev, A., 2021. “Methods for Simulating Multi-dimensional Data for Financial Services Recommendation”, Bulgarian economic paper, ISSN: 2367-7082
8. Marchev, A., Marchev, V., 2022. Synthesizing multi-dimensional personal data sets, AIP Conference Proceedings, 2505 (1): 020012. https://doi.org/10.1063/5.0100615
9. NSI, Demographic statistics, 2024, https://www.nsi.bg/en/content/21307/%D0%BF%D1%80%D0%B5%D1%81%D1%81%D1%8A%D0%BE%D0%B1%D1%89%D0%B5%D0%BD%D0%B8%D0%B5/population-and-demographic-processes-2023
10. BNB, 2024, Statistical Database, Selection of statistics, https://www.bnb.bg/statistics/index.htm?toLang=_EN
11. FSC, 2024, Insurance Activity, Statistics, https://www.fsc.bg/en/insurance-activity/statistics/
12. Ministry of Finance, 2024, Economic Policy, Analyses and Research, https://www.minfin.bg/en/865
13. Ministry of Agriculture and Food, 2024, Statistics and Analyses, https://www.mzh.government.bg/en/statistics-and-analyses/
14. Simard, R., L’Ecuyer, P., 2010, “Computing the Two-Sided Kolmogorov-Smirnov Distribution”, Journal of Statistical Software.
15. Brown, J., Harvey, M., 2008, “Rational Arithmetic Mathematica Functions to Evaluate the Two-Sided One Sample K-S Cumulative Sampling Distribution”, Journal of Statistical Software, Volume 26, Issue 2.
16. Marchev, V., Marchev, A., Piryankova, M., Masarliev, D., & Mitkov, V., 2023. Synthesizing an anonymized multidimensional dataset featuring financial, economic, demographic, and personal traits data. Vanguard Scientific Instruments in Management, vol. 19, no. 1, 2023, ISSN 1314-0582, 79-99.
17. The final validated dataset is publicly available in a repository at https://doi.org/10.57967/hf/3701
18. Infostat, Tertiary Education Graduates by Educational-Qualification Degree, Sex and Narrow Field of Education (FOET), 2001–2016, Report result

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.