DATA MIGRATION FROM LEGACY SYSTEMS – Part 2

In this blog, I will walk you through an example of how to migrate data from a CSV file into MySQL. This is not the only way to do it, but I found it simple and easy for the case I was handling.

Data migration/conversion goes through the following steps:

1. Data preparation: load the data from the files into temp tables.

2. Data validation: identify the good records from the bad ones before adding the data to the actual tables; you can also write some code here to make the load script easier, for example by resolving the actual table IDs.

3. Exceptions: dump all bad records into a file.

4. Data load: the actual load of the data from the temp table into the target table.

Let's assume we have three tables in our database: 'users', 'courses', and 'transcript'. Let's also assume the following structure for the users and courses tables:

create table users (
`id` int(10) unsigned NOT NULL auto_increment,
`login_id` varchar(64)  NULL,         
PRIMARY KEY  (`id`)
);

create table courses (
`id` int(10) unsigned NOT NULL auto_increment,
`course_id` varchar(64)  NULL,         
PRIMARY KEY  (`id`)
);

Both the users and courses tables are populated with some sample data:

insert into users (login_id) values ('jhondoe'), ('k.mkenz'),('l.w.weter');
insert into courses (course_id) values ('cs-001'), ('cs-002'),('cs-003');

The 'transcript' table has existing data that we don't care about at this point; our goal is to load new transcript data from a CSV file into it.
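The transcript table itself is not shown in this post; for the load in step 8 to work, assume it already exists with at least the following columns (this is just a sketch, your actual definition may differ):

create table transcript (
`id` int(10) unsigned NOT NULL auto_increment,
`user_id` int(10) unsigned NOT NULL,       # assumed to point to users.id
`course_id` int(10) unsigned NOT NULL,     # assumed to point to courses.id
`registration_date` datetime default NULL,
`completion_date` datetime default NULL,
`score` int NULL,
PRIMARY KEY  (`id`)
);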

1. Create a tmp table to hold the data loaded from the CSV file.

drop table if exists tmp_transcript;

create table tmp_transcript (
`id` int(10) unsigned NOT NULL auto_increment,
`login_id` varchar(64)  NULL,         
`course_id` varchar(255) collate utf8_general_ci  NOT NULL,
`registration_date` datetime default NULL,
`completion_date` datetime default NULL,
`score` int null,
`sys_user_id` int,           # this is used to pull the actual system id
`sys_course_id` int,         # this is used to pull the actual system id
`error_msg` text default null,  # this is used to store error messages about bad data in columns
`error_flag` tinyint default 0,  # this is used to flag the records that will not be loaded
PRIMARY KEY  (`id`),  
KEY `dm_ix_course_code` (`course_id`),
KEY `dm_ix_username` (`login_id`));

2. Load the data from the CSV file into the tmp table.

LOAD DATA LOCAL INFILE 'C:/data_load/tmp_transcript.csv'
   INTO TABLE tmp_transcript
   FIELDS TERMINATED BY ','
   OPTIONALLY ENCLOSED BY '"'
   LINES TERMINATED BY '\r\n'
 IGNORE 1 LINES
   (login_id, course_id, registration_date, completion_date, score);
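For illustration only, here is roughly what a matching tmp_transcript.csv could look like. These rows are made up: the first line is the header skipped by IGNORE 1 LINES, the empty date fields will show up as zero dates (handled in step 4), and the last row intentionally references a non-existent user so the validation step has something to catch.

login_id,course_id,registration_date,completion_date,score
jhondoe,cs-001,2021-01-15 09:00:00,2021-03-20 17:30:00,88
k.mkenz,cs-002,2021-02-01 10:00:00,,75
no.such.user,cs-009,,2021-04-01 12:00:00,60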

3. Make sure the data was loaded.

select * from tmp_transcript;

4. Let's do some data cleanup. MySQL (when not in strict mode) loads invalid dates as 0000-00-00 00:00:00, so we need to replace such values with NULL.

update tmp_transcript
set registration_date = null
where registration_date like '0000%';

update tmp_transcript
set completion_date = null
where completion_date like '0000%';
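As a side note, this cleanup can also be folded into the load itself by reading the date columns into user variables and converting empty or zero values with a SET clause. This is just a sketch of the same LOAD DATA statement rewritten that way, not something you need in addition to the steps above:

LOAD DATA LOCAL INFILE 'C:/data_load/tmp_transcript.csv'
   INTO TABLE tmp_transcript
   FIELDS TERMINATED BY ','
   OPTIONALLY ENCLOSED BY '"'
   LINES TERMINATED BY '\r\n'
 IGNORE 1 LINES
   (login_id, course_id, @reg_date, @comp_date, score)
 SET registration_date = if(@reg_date = '' or @reg_date like '0000%', null, @reg_date),
     completion_date   = if(@comp_date = '' or @comp_date like '0000%', null, @comp_date);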

5. Now, let's get the system IDs for each key entity: users and courses.

update tmp_transcript tmp
inner join users u
 on u.login_id = tmp.login_id
set tmp.sys_user_id = u.id;

update tmp_transcript tmp
inner join courses c
 on c.course_id = tmp.course_id
set tmp.sys_course_id = c.id;

and then see what the data looks like:

select * from tmp_transcript;

6. Now we need to validate the data in the tmp table, flagging the bad records so they can be excluded from the load. Rows whose login_id or course_id did not match anything in step 5 still have NULL in sys_user_id / sys_course_id, which is exactly what we check for here.

update tmp_transcript
set error_flag = 1,
 error_msg = concat(ifnull(error_msg, ""), "invalid login_id; ")
where sys_user_id is null;

update tmp_transcript
set error_flag = 1,
 error_msg = concat(ifnull(error_msg, ""), "invalid course_id; ")
where sys_course_id is null;

update tmp_transcript
set error_flag = 1,
 error_msg = concat(ifnull(error_msg, ""), "registration_date must be provided; ")
where registration_date is null;
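The same pattern extends to whatever other rules your data has. As a hypothetical example, flagging rows where the completion date comes before the registration date:

update tmp_transcript
set error_flag = 1,
 error_msg = concat(ifnull(error_msg, ""), "completion_date before registration_date; ")
where completion_date is not null
 and registration_date is not null
 and completion_date < registration_date;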

7. You can dump all invalid records into a file and rework the data there, OR you can fix the data directly in the tmp table and start again from step 4.

select *
from tmp_transcript
where error_flag = 1;
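If you want the server to write that dump straight to a file, here is a sketch using SELECT ... INTO OUTFILE. Note that, unlike LOAD DATA LOCAL, the file is written on the MySQL server host and the server's secure_file_priv setting must allow it; the path below is just an example:

select *
into outfile 'C:/data_load/tmp_transcript_errors.csv'
fields terminated by ','
optionally enclosed by '"'
lines terminated by '\r\n'
from tmp_transcript
where error_flag = 1;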

7.1. This is optional (it can help when dealing with huge files): after dumping the invalid records to a file, you can remove them from the tmp table so you work with less data.

delete from tmp_transcript
where error_flag = 1;

8. Here we do the actual load of the data into the transcript table. Notice the WHERE error_flag = 0 condition, which means we only pick up the good records; if you did step 7.1, there is no need for this condition, since everything left in the tmp table is valid.

insert into transcript (
 user_id
 ,course_id
 ,registration_date
 ,completion_date
 ,score)
select 
 sys_user_id
 ,sys_course_id
 ,registration_date
 ,completion_date
 ,score
from tmp_transcript
where error_flag = 0;
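As an optional sanity check right after the insert (run in the same session), you can compare the number of rows the insert reported with the number of good rows in the tmp table:

# rows added by the insert above (run immediately after it)
select row_count() as rows_inserted;

# good rows we intended to load
select count(*) as rows_expected
from tmp_transcript
where error_flag = 0;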
