Data Transfer

There are many ways to load a text data file into a database. In general, we would use the utilities that database vendors provide: for example, sqlldr in Oracle, or sqlcmd in MSSQL. This practice is based on a case of loading a 30GB text data file into a database; the details are in the links in the reference section.

Today, we would like to explore a Pro*C/C++ option to import data into the database. The application was built against an Oracle database. Below are the elements used in the application.

1. A multi-threaded design

2. A stack check to ensure there is enough memory to run the application

3. A connection pool

4. File reads and database inserts


The text files have the following format.

AA||BB||CC||DD

601330||8.43||0.08||2.45

600751||2.64||0.05||16.81

603939||95.34||2.31||3.41

601702||8.42||0.19||44.35

There are 4 columns, separated by “||”. This time we have 4 text files named “sameplefile0001.txt”, “sameplefile0002.txt”, “sameplefile0003.txt”, and “sameplefile0004.txt”. Each file is about 15MB in size, which is 584,778 lines of data.


We use 4 threads to load the data, so each thread processes one text file. The connection pool parameter is set so that there are 2 connections when the application starts; in total, we have 4 connections. Oracle offers connection pooling to improve client-side performance. In this case the technology does not provide much of a performance gain because of the nature of the application, which is mainly I/O: reading from files and writing to the database. On the contrary, the pool acts more like a resource controller.


We load data in batches. The application can process both the full batches and the leftover rows at the end of each file. Data conversion happens during the load, but it can also be done in other ways, such as converting data types with functions written in C/C++, Java, or PL/SQL after the rows are in the database.


Three tests were completed.

Test #1: 25 lines/batch, 4 files, each with about 50 lines of data. Total execution time: 1.097034 seconds.

Test #2: 50,000 lines/batch, 4 text files, each with 584,778 lines (about 15MB per file). Total time: 35.124050 seconds.

Test #3: 50,000 lines/batch, 10 text files, each with 584,778 lines. Total time: 102.414746 seconds.


We can use this program to load data on different workstations or in different time frames. Running it on the database server gives the best performance because it is not affected by network traffic. We could also increase the number of threads or connections, which would shorten the loading time but require more resources: CPU, memory, and faster disks.



Reference:

https://www.cnblogs.com/goodAndyxublog/p/14182457.html

https://studyidea.cn/import_large_file


posted @ 2020-12-31 19:40  blogger2020