Synthesizing an anonymized multidimensional dataset featuring financial, economic, demographic, and personal traits data

  • Vasil Marchev
  • Angel Marchev
  • Milena Piryankova
  • Daniel Masarliev
  • Valentin Mitkov
Keywords: Synthetic Data Generation, Cholesky Decomposition, Kolmogorov-Smirnov Test

Abstract

This paper presents a novel approach to generating synthetic data arrays that addresses the scarcity of datasets containing sensitive information, a scarcity caused by restrictions imposed by legislation such as the GDPR and the Bank Secrecy Act. By integrating statistical methods, including Monte Carlo simulation and Cholesky decomposition, with business logic, the study outlines a comprehensive methodology for creating multidimensional synthetic datasets. These datasets incorporate demographic, personality, financial, and banking variables to simulate the profiles of financially active individuals. This alternative to traditional data collection offers a way to work around restricted access to sensitive data while remaining compliant with legal frameworks. Synthetic data preserve the interrelationships between variables and provide a secure testing environment, although generating high-quality synthetic databases is inherently complex. The synthesized data are validated with the Kolmogorov-Smirnov test to confirm their accuracy and relevance. This approach not only facilitates the advancement of data-driven models in fields where access to sensitive data is limited but also promotes the ethical use of data by adhering to privacy regulations. The paper demonstrates the potential of synthetic data to serve as a viable resource for scientific research, offering a detailed exploration of the generation process and its implications for future applications in sensitive areas of study.
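
The combination of techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-variable correlation matrix, the sample size, the seed, and all function names below are hypothetical, and the marginals are assumed to be standard normal purely to keep the example short. Independent Monte Carlo draws are multiplied by the Cholesky factor of a target correlation matrix so that the generated columns inherit the desired interrelationships, and a one-sample Kolmogorov-Smirnov statistic then measures how far each column is from its target distribution.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L with L * L^T == matrix.

    The input must be symmetric and positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def correlated_normals(corr, n_rows, rng):
    """Monte Carlo draws whose columns follow the correlation matrix `corr`."""
    lower = cholesky(corr)
    dim = len(corr)
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # independent draws
        rows.append([sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(dim)])
    return rows


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup |ECDF(x) - cdf(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical target correlations between three illustrative variables.
TARGET_CORR = [
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.2],
    [0.3, 0.2, 1.0],
]

rng = random.Random(42)  # fixed seed so the sketch is reproducible
rows = correlated_normals(TARGET_CORR, 5000, rng)
columns = list(zip(*rows))

sample_corr_01 = pearson(columns[0], columns[1])
ks_values = [ks_statistic(col, normal_cdf) for col in columns]
# Asymptotic 5% critical value of the one-sample KS test for large n.
ks_critical = 1.36 / math.sqrt(len(rows))

print("sample correlation (var 0, var 1):", round(sample_corr_01, 3))
print("KS statistics:", [round(d, 4) for d in ks_values])
print("5% critical value:", round(ks_critical, 4))
```

A column whose KS statistic stays below the critical value is consistent with its target distribution at the 5% level. In a realistic setting the normal columns would still have to be transformed to the marginal distributions of the actual variables and adjusted by business rules, steps this sketch leaves out.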

Published
2023-12-22
How to Cite
Marchev, V., Marchev, A., Piryankova, M., Masarliev, D., & Mitkov, V. (2023). Synthesizing an anonymized multidimensional dataset featuring financial, economic, demographic, and personal traits data. Vanguard Scientific Instruments in Management, 19, 79-99. Retrieved from https://www.vsim-journal.info/index.php?journal=vsim&page=article&op=view&path[]=515