Date: 26 July 2022
Data migration projects are a virtual inevitability for organisations looking to move into the cloud, accelerate their cloud operations, or modernise legacy systems and take advantage of the more efficient, lower-cost, centralised, and more agile solutions available today. However, as with anything, there is a right way to do it and plenty of wrong ways. Drawing on recent experience, I’m going to share some insights into how you can make the most of your data project – while avoiding some all-too-common pitfalls.
Why catalogue data? The business case
As outlined above, there are many reasons for data migration projects. Typically, a big data project begins with data collected for a specific initiative; the notion that the resulting asset might be useful for other projects – either with something specific in mind, or for unknown future initiatives – arises later.
That’s what makes cataloguing so important. This is the process of discovering and documenting the details about the data itself – the metadata. With an organised inventory of data assets, you know what you hold and have a far better idea of how it can be used.
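As a concrete illustration, a catalogue entry can be as simple as a structured record of a dataset’s metadata. The sketch below uses hypothetical field names and storage paths of my own invention; real catalogues (and catalogue tooling) carry far richer schemas.

```python
from dataclasses import dataclass, field

# A minimal sketch of a catalogue entry. The fields here are
# illustrative assumptions, not a standard metadata schema.
@dataclass
class CatalogEntry:
    name: str       # dataset name
    location: str   # where it lives, e.g. an object-store path
    owner: str      # team responsible for the data
    fmt: str        # file format, e.g. "parquet" or "json"
    tags: list = field(default_factory=list)  # free-form discovery tags

def find_by_tag(catalog, tag):
    """The simplest possible 'discovery' query over the inventory."""
    return [e for e in catalog if tag in e.tags]

# Hypothetical inventory of two data assets.
catalog = [
    CatalogEntry("web-clickstream", "s3://datalake/raw/clickstream/",
                 "analytics", "json", tags=["raw", "web"]),
    CatalogEntry("customer-orders", "s3://datalake/clean/orders/",
                 "sales", "parquet", tags=["clean", "finance"]),
]

print([e.name for e in find_by_tag(catalog, "raw")])
```

Even an inventory this crude answers the two questions that matter: what do we hold, and where is it?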
However, the funny thing is that some big data projects don’t have a specific use case in mind at the time the data is stored. By contrast, when moving data from these (or other) projects, there should be a clear purpose for what you intend to do with it. In other words, make a business case.
Find a story for how the data can help, with a clear view of what you intend to use it for. This aids the migration process because you’re more likely to move the data to a platform suited to the use case.
With that established, the crucial steps include:
- Migrating raw data to the cloud
- Data cleansing
- Cataloguing the data
- Keeping it organised.
Importantly, catalogue and store both raw and clean data, so it can be accessed easily, discovered, and reused in any future projects.
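The steps above can be sketched in miniature: store the untouched raw data and the cleansed copy side by side, partitioned into logical groups so both remain easy to find. The layout and names below are illustrative assumptions, with a local directory standing in for cloud object storage.

```python
import tempfile
from pathlib import Path

# A local directory stands in for the data lake in this sketch.
lake = Path(tempfile.mkdtemp())

def store(data: str, zone: str, source: str, date: str) -> Path:
    """Write a dataset to <lake>/<zone>/<source>/<date>/data.csv,
    where zone is 'raw' or 'clean'. The partition scheme (by zone,
    source, and date) is an assumed convention for illustration."""
    target = lake / zone / source / date
    target.mkdir(parents=True, exist_ok=True)
    path = target / "data.csv"
    path.write_text(data)
    return path

raw = "id,amount\n1, 100 \n2,\n3,250\n"    # exactly as it arrived
clean = "id,amount\n1,100\n3,250\n"        # blanks dropped, values trimmed

store(raw, "raw", "orders", "2022-07-26")      # untouched original
store(clean, "clean", "orders", "2022-07-26")  # cleansed copy

# Both copies are discoverable under predictable, partitioned paths.
print(sorted(p.relative_to(lake).as_posix()
             for p in lake.rglob("data.csv")))
```

Keeping the zones parallel means a future project can reach for either the cleansed copy or the full original without hunting.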
Do the ETL…but retain the original
The extract, transform, and load (ETL) process is common to most data migration projects, particularly if operationalising the data is the underlying intention. ETL gets raw data into a more usable form in your data lake, data warehouse, or other suitable structure.
Here’s the thing with ETL, though. It can scrub some of the valuable stuff from the raw data, whether by accident or by design (because some data might not be useful for the immediate intended purpose). Cleanse the data, tag it clearly for easy identification, and partition the stored data into logical groups for easy access.
Raw or ‘dark’ data contains ‘everything’. While ‘everything’ isn’t needed for specific use cases, it often becomes relevant again for future as-yet-undefined use cases. That’s why retaining a copy of the original data is such a good idea. It’s also a perfect hedge against making mistakes; if the ETL fails for any reason, isn’t fit for purpose, or even mysteriously disappears, you can always have another go.
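A minimal sketch of that hedge, with made-up cleansing rules: the transform works on copies, so the raw records survive intact, and any failed or unfit run can simply start again from the original.

```python
# Raw records exactly as collected. The field names and cleansing
# rules below are illustrative assumptions, not a real pipeline.
raw_records = [
    {"name": " Alice ", "amount": "100"},
    {"name": "Bob",     "amount": None},   # incomplete - scrubbed by ETL
    {"name": "carol",   "amount": "250"},
]

def transform(records):
    """Return cleansed copies; the input records are never mutated."""
    out = []
    for r in records:
        if r["amount"] is None:
            continue  # not useful for the immediate purpose...
        out.append({"name": r["name"].strip().title(),
                    "amount": int(r["amount"])})
    return out

clean = transform(raw_records)
print(clean)

# ...but raw_records still holds 'everything', so a future use case
# (or a botched transform) can always start again from the original.
print(len(raw_records))
```

The design choice is simply that the transform produces new records rather than editing in place; the raw copy is the safety net, not the working surface.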
Retaining the raw data leaves room for stuffing up – and you should expect to stuff up. After all, a lot of data-related work is exploratory in nature, and that means a dead end or failed hypothesis is never far away. Fail fast, learn, do it all again. With cloud infrastructure setup taking little more than a few clicks these days, failing isn’t a big deal. With a streamlined framework around how your raw data is stored and a proper data roadmap, you’ll avoid many mistakes – and be able to set the rest right with low consequences when they do happen.
Don’t leave out the Architects!
Data-centric projects often revolve around a Data Architect but don’t always include an Enterprise Architect. That’s a mistake, because you need a generalist on the team who understands where and how the data project fits into the broader scheme of things. Where the Data Architect has a narrow focus (more or less on bits and bytes and their movement from A to B), the Enterprise Architect understands the wider ecosystem and end-to-end domain, while Solution Architects understand where data comes from and who consumes it, along with metrics around storage, retention, structure, and more. Together, they bring a crucial governance lens to the project and support the creation of lasting value for the organisation.
Successful data projects are built on ingenuity (and budget)
One of the amazing things about data is that it often contains the solutions to problems we don’t yet know we have. That’s exciting for a data scientist, but perhaps less so for a CFO, or whoever oversees paying for the project.
That said, there’s no getting away from it: the organisations getting great results from data projects have at least two things in common. The first is that they recognise the value of raw data, manifested in a commitment to collecting and storing it. The second is that the companies retaining (and often migrating) data tend to also retain data scientists.
Data scientists cost money, but that’s not where the investment ends. Tools, platforms, and facilities are part of the bill; the lesson here is that recognising data value is just the first step. Backing that with investment in people and infrastructure is how latent data potential is turned into actual, competitive advantage. Yes, justifying the investment can be difficult, because it requires acknowledging ‘known unknowns and unknown unknowns’.
But making the investment is quite literally putting your money where your mouth is when you say ‘data has value’. And any data migration project has that as an underlying principle, with the goal of exposing value for your organisation.