In most businesses, data migration/conversion becomes a necessary routine and is something that we deal with on a regular basis. To add to the problem, most business systems now hold far more data than in previous years and so there is far more data that need to be moved. Furthermore, many databases and information stores are expanding exponentially resulting in businesses having to expand their storage capacity with increasing frequency, and therefore migrating their data with increasing frequency.
When it comes to data migration/conversion, the adage "There's a right way, and there's a wrong way..." doesn't always apply. Given the environment, there could be many right ways, and many wrong ways. And sometimes what starts out as the right way can end up wrong as the process moves forward.
Problems and Questions
In our case, we had multi flavor systems with mixed front-end and back-end structure; we had more than 400 legacy systems that we needed to migrate, those legacy systems use different technologies and database systems; the target was .net-based system with MySQL as the back-end database.
Like any data migration, there are always problems (technical and business) where technical people and business people have to work on to resolve; some of the technical problems were:
* Do we have the necessary documentations
* Which technologies to use
* What approaches to use
* How quick and fast those chosen technologies and approaches
* How those approaches and technologies will perform
* Could the data migration be done during the on hours (while the system is up and running)
* Different schema design in the legacy side that needs to be mapped to one schema design in the new system
* Orphan data and/or corrupted data in the legacy system where we ended up doing data cleansing for some or in some cases we ignored
* Which migrations can be done during the on hours and which ones can be done during the off hours … etc
On the other hand, some of the business problems that we faced were:
* Do we have the necessary business documents
* How many clients are we migrating within a given period of time
* How many migrations we’ve done preparation for but those migrations did not happen
* What resources do we need and how many
* Do we have the client facing technical documents, how accurate are those documents
* How often should we update and change those documents so that they are in sync with the changes that are made to the data migration process
* Are all assigned project managers well trained, comfortable and satisfied with the process flow and their clients
* Are there other areas for the migration that does not include data migration, if so, what to do with those
* Is the client well trained on the new system
* When the client is going to start using their new system and when the data migration should happen
* Failure to validate data
* And more…etc
Our solution was simple and easy to use, we did not purchase any ETL tools but we implements our own, and in a high level here is what we did
1. Collected csv files with pre-defined format
2. Validated the data in those csv files to ensure data validity
3. Loaded the data into staging environment for the client to validate and verify
4. Loaded the data into production environment
In the case of the migrations I worked on, we had two legacy data that we needed to migrate, legacy data from our previous system that is using SQL Server Database, and legacy data that came to us in the form of CSV files.
Legacy data from our previous system or can be called internal legacy migrations
In order to migrate data from our legacy systems we had to
1. create a technical gap document, this document is a client facing document that contains all features, modules and customizations that each client has in their legacy system; it also contains all data areas that can be migrated and that ones that cannot; also we highlight all necessary areas that we need client’s decision on.
2. we created DTS for each client that we are migrating their data, the DTS dumps data into a pre-defined CSV files format
Legacy data from other systems, or can be called external legacy migrations: for these types of migrations, we created a document (client specs) that contains all necessary files’ structure and their format; clients populate these files and send them over to us; for ease of use, we decided to make the files’ format for the two types of migrations to be the same so that they can be managed easily with less troublesome.
Once we have the CSV files available, we run all files against validation process to ensure data validity, validation process was set of MySQL scripts that visit all available CSV files and validate data in each file as well as ensuring that the relations between the files are in place, if any record in any file has any issue and does not satisfy with the rules mentioned in the client specs then that record will be captured as an exception and will be reported back to the client.
After several iterations and validations and depending on how clean the files are, we load the CSV files into staging environment for the client to verify their data, at this phase we initiate an issue log doc where we capture all issues the client might find as they review their data. If the client fails to validate their data in staging environment, then we will end up migrating wrong data or even corrupted data, which in turns
1. it may hurt the performance of the system
2. result in unexpected results and invalid reports
3. disk space is taken by the wrong data that was migrated, which it was not supposed to
4. it requires more time and resources on the business side as well as the technical side to figure out what needs to be done; should we reload from scratch, should we fix the bad data, when can we do that and how soon does the client need that fix.
After we figure out and fix all reported issues –if there were any-, we load those CSV files into production environment, which is basically the last step of the data migration.
As I mentioned in the beginning, different methods can be used with data migration, but probably the most frustrating thing about data migration is that there is no simple template to follow that circumvents all problems. Each environment is unique and will have its own set of challenges. But bringing as many heads together as possible will at least help to keep the major problems under control.
In order to develop a migration strategy, a business must evaluate not only the physical infrastructure, but also the criticality of the data. While all data is valuable, it’s not all equal in value. By establishing criticality, a business can determine which of its data requires the highest levels of protection and plan and purchase accordingly.