This paper was accepted at the 2023 IEEE BigData Conference in Sorrento, Italy.
Authors: Ehab Abdelhamid‡, Amirhossein Aleyasen‡, Michael Duller‡, Eric Foratier†, Vincent Fruleux†, Mirella Katch‡, Gourab Mitra‡, Rima Mutreja‡, Jozsef Patvarczki‡, Matthew Pope‡, Nikos Tsikoudis‡, F. Michael Waas‡
Abstract—We describe how multi-national retailer ADEO successfully employed database virtualization to migrate all workloads of a complex Enterprise Data Warehouse (EDW) from a legacy Teradata system to Google BigQuery. We demonstrate the generality of the technology and the approach.
Index Terms—database virtualization, data warehousing, database migrations, Teradata, Google BigQuery
I. Introduction
Conventional migrations are the bane of the database industry. According to the analyst firms Bloor Research and Gartner, over 80% of database migrations run late and exceed their budgets—often significantly. Worse, most migrations fail to achieve their objectives [1]. And yet, they are ubiquitous. Most major enterprises are currently considering moving their on-premises data warehouse to an equivalent cloud technology.
In our experience, the motivation for the migration of data warehouses is twofold. First, there is the constant mandate to make new technology available to the business. The underlying expectation is that it will enable new revenue streams. Second, there is an equally constant mandate to cut costs. Especially in times of economic downturns, reductions in Operational Expenditure (OpEx) are a priority for most enterprises. This translates into a demand for replacing legacy systems with modern technology across all industry sectors.
A database migration is a significant logistical undertaking. In particular, applications written for a given database are rarely portable. So, when replacing the underlying database, the dependent applications need to be adjusted. The SQL embedded in the applications uses a dialect that is specific to the original source database. To work on the modern cloud database, the SQL terms must be replaced with equivalent terms in the dialect of the new destination. For example, Teradata SQL needs to be converted to the SQL dialect of BigQuery.
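To make the difference concrete, consider a hypothetical report query (table and column names are purely illustrative). In Teradata SQL, which permits the SEL shorthand and direct integer arithmetic on dates, one might write:

  SEL store_id, SUM(amount) AS total
  FROM sales
  WHERE sale_date > CURRENT_DATE - 30
  GROUP BY 1;

An equivalent formulation in BigQuery requires, among other adjustments, the full SELECT keyword and an explicit DATE_SUB call:

  SELECT store_id, SUM(amount) AS total
  FROM sales
  WHERE sale_date > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  GROUP BY 1;

Even this trivial example needs careful rewriting; production queries typically combine many such dialect-specific constructs.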
The rewriting of applications is a rather involved process. First, the SQL needs to be identified and extracted from the application. Then, the SQL code is converted from the source to the destination dialect and inserted into the application. Afterward, a rigorous test program needs to establish the correctness of the new SQL and ensure the integrity of the application. The process is highly manual and error-prone [2].
The database industry has a decades-long tradition of developing tools that aid with the conversion of SQL. However, the results are lacking in practice [1]. The process into which they are integrated remains manual and inherently not scalable. Add to that the functional limitations of the static translation of individual terms, and it becomes clear why the problem has remained unresolved for all practical purposes.
Modern AI tools such as ChatGPT are arguably an improvement on existing code converters. However, they depend equally on the manual process that extracts the code from the applications first and applies the results afterward [2]. Moreover, in their current state, they are far from being reliable enough for the purpose of database migrations.
As a simple rule of thumb, the cost of a database migration will total about ten times the annual run rate of the database one wants to replace. To understand the astronomical cost of migrations, one needs to analyze the dynamics of these projects. They are best described by the 80-20 Rule [3]: the hardest 20% of a migration will take up 80% of the time and resources. Unfortunately, too many organizations ignore the practical implications of this rule. As a result, they underestimate the effort and the cost of a migration by a significant factor.
In this paper, we describe how database virtualization (DBV) enabled multi-national retailer ADEO to sidestep much of the manual process of a conventional migration and to transition successfully from a legacy data warehouse to Google BigQuery.
II. ADEO’S Database Migration Challenge
ADEO is a multi-national retailer with global operations. It was the first retailer to establish the home improvement and DIY industry vertical. Today, ADEO is the European leader and the third largest player globally. Through its network of over 1,000 stores, marketplaces, and platform partnerships, ADEO distributes its home improvement products and services in 15 countries.
ADEO achieved a turnover of €37.8bn in 2022. Its platform of companies includes household names like Leroy Merlin, Weldom, Bricoman, Obramax, and Zôdio, dedicated to home improvement. In total, ADEO comprises 24 affiliated and independent global companies.
ADEO’s subsidiaries and branches worldwide rely on a central, unified data warehouse. Access to the data and the processing capabilities around it are maintained by the Central Data Group. Departments and business users run over 4,000 daily reports, resulting in 3,000,000 queries per day. The central data warehouse serves over 150 applications across ADEO’s businesses, delivering more than 100,000 reports in total.
This workload was initially developed for a Teradata system and represents a long-standing investment of over 20 years. However, the rigid structure of the database became increasingly limiting. Because it lacks deep integration with native cloud services, it cannot provide the flexibility the business needs to grow [4].
Being a highly data-driven organization, ADEO sought to replace the aging Teradata system with an agile data platform that would support its business objectives. Specifically, ADEO was looking to manage all data centrally in a way that would support not only the current use cases but also position the company well for the evolution and growth of the business for years to come.
ADEO determined that Google BigQuery would best enable it to reach these goals. In addition to superior functionality, BigQuery also offered considerable savings in Total Cost of Ownership (TCO) and an attractive price-performance ratio. To leverage BigQuery, the existing workloads would need to be migrated. Because the legacy appliance and BigQuery use different dialects of SQL, queries and reports would require rewriting. Based on industry comparables, a rewrite was expected to take upward of five years. Unsurprisingly, ADEO considered a migration based on rewrites cost-prohibitive. Besides, the high failure rate of conventional migrations represented a risk ADEO was not willing to take. Moreover, the entire transition needed to be completed within 24 months to meet ADEO’s internal deadlines.
ADEO’s queries and their business logic are long-standing investments. Technical challenges aside, rewriting applications solely for the sake of moving would have meant writing off these investments. Re-inventing perfectly good business processes and applications at an enormous cost made no sense to the business. As ADEO looked for other options, Google introduced ADEO to Datometry as one of its partners in this space.
III. Database Virtualization
In this section, we outline the concepts of database virtualization (DBV) as far as they are required for this case study. We refer interested readers to the literature for specific technical details [5]–[7].
A. Hyper-Q virtualizer
Database virtualization is a unique approach to database interoperability. A virtualizer is a software component situated between the applications and the database; see Figure 1. The virtualizer translates incoming queries from the existing application in real time. The translated queries are then submitted to the new destination database for execution. The database executes them no differently than any other client query. Once the virtualizer obtains the results in the format of the new system, it converts them to the format of the legacy system. Finally, the results are returned to the application. As a result, the existing application works as-is via the virtualizer on the new destination.
Datometry Hyper-Q is a commercial product that implements a complete virtualizer. For the remainder of this paper, we will use the product name Hyper-Q and the concept DBV interchangeably.
Fig. 1. Sample architecture using Hyper-Q serving existing applications; newly developed applications connect directly to BigQuery.
B. Workload analysis
For the purpose of planning a full implementation using DBV, the analysis of historic workloads is instrumental. Understanding the frequency, complexity, and composition of workloads informs sizing and capacity planning. Understanding the dependencies between workloads is critical for determining the order in which workloads will be transferred.
Hyper-Q’s capabilities for parsing and analyzing SQL commands apply to static workloads in the same way as they do to live workloads [8]. Therefore, Hyper-Q can process any given historic workload in the form of query logs. To do so, Hyper-Q ingests the schema that underlies the workload, together with the workload itself. In addition to providing essential information for planning purposes, a full analysis of the future production workload tests and demonstrates the coverage of DBV. For a further discussion of these capabilities, see also [8].
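To give a flavor of the raw material such an analysis works from, the following query (a minimal sketch only; the exact DBQL schema varies by Teradata version and site configuration) profiles the composition and CPU consumption of a year of logged statements:

  SELECT StatementType,
         COUNT(*) AS query_count,
         AVG(AMPCPUTime) AS avg_cpu_seconds
  FROM DBC.DBQLogTbl
  WHERE StartTime >= TIMESTAMP '2022-01-01 00:00:00'
  GROUP BY StatementType
  ORDER BY query_count DESC;

Hyper-Q’s analysis goes well beyond such aggregates, as it parses and analyzes every statement in full [8].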
C. Schema conversion
Another immediate application scenario for Hyper-Q is the conversion of the original database schema into one for the new destination. All modern databases export existing schemas in the form of DDL commands, i.e., individual SQL statements that, when executed via Hyper-Q, will recreate the schema.
In contrast to other schema converters, Hyper-Q can represent objects, data types, and properties that do not natively exist in the destination system. In addition to emitting DDL for the destination, Hyper-Q generates metadata that describes any structure the destination does not support natively, including information regarding its translation. Subsequently, Hyper-Q uses this enriched metadata when emulating the desired functionality. Using this technique, Hyper-Q can emulate even advanced features such as stored procedures, macros, updatable views, and global temporary tables.
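To illustrate the principle, consider a hypothetical Teradata table (the names are ours, and the output shown is indicative of the kind of mapping involved rather than Hyper-Q’s actual output):

  CREATE SET TABLE retail.daily_sales
    (sale_id INTEGER NOT NULL,
     store_id INTEGER,
     amount DECIMAL(18,2),
     sale_date DATE FORMAT 'YYYY-MM-DD')
  PRIMARY INDEX (sale_id);

A corresponding BigQuery definition might read:

  CREATE TABLE retail.daily_sales
    (sale_id INT64 NOT NULL,
     store_id INT64,
     amount NUMERIC,
     sale_date DATE)
  CLUSTER BY sale_id;

The SET property (row uniqueness), the FORMAT clause, and the PRIMARY INDEX have no direct BigQuery equivalents. This is exactly the kind of information captured as enriched metadata, so that the original semantics can be emulated at query time.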
D. Comparison with other approaches
Unlike database migrations, DBV is a relatively new discipline. For the past half-century, various tools that facilitate schema or query rewrites have entered the market. Given an SQL term, these tools translate from the SQL dialect of the source system to the SQL dialect of the destination. These systems leave it up to the user to (1) extract the input from the application and (2) re-insert the converted term afterward. This translation procedure is entirely static, i.e., it happens in isolation from the state of the system at run time. Thus, we refer to the underlying principle as static conversion. In contrast, Hyper-Q operates in real time. Translations and emulations can take into account the state of the system and thus support complex constructs, including dynamic SQL, macros, or complex stored procedures. Due to this dynamic nature, Hyper-Q can even emulate operational concepts like global temporary tables, which depend on information only available at run time.
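Teradata’s global temporary tables illustrate the difference well. A single DDL definition such as

  CREATE GLOBAL TEMPORARY TABLE tmp_orders
    (order_id INTEGER, qty INTEGER)
  ON COMMIT PRESERVE ROWS;

is instantiated anew for every session that references it. A static converter has nothing equivalent to emit, since instantiation depends on session state. A run-time virtualizer, in contrast, can issue a session-scoped counterpart, e.g., in BigQuery

  CREATE TEMP TABLE tmp_orders (order_id INT64, qty INT64);

at the moment a session first references the table. (This is a simplified sketch of one plausible emulation strategy, not a description of Hyper-Q’s internal mechanism.)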
IV. Implementation
Hyper-Q is a stand-alone software component. Like other transformation projects, implementing a migration with Hyper-Q requires planning, logistics, staffing, and the support of stakeholders, users, and executives. For reasons unrelated to the project at hand, ADEO decided to rearchitect their ETL infrastructure and replace their existing components for data ingest. However, to preserve the investment in BI, reporting, and analytics, they decided to virtualize those parts using Hyper-Q. This approach illustrates the flexibility of Hyper-Q: (1) preserve existing applications where meaningful and (2) build and integrate new logic where suitable. Given the central role of the data warehouse system at ADEO, the Datometry implementation spanned over 20 business units within the company. To grant each unit a maximum degree of autonomy, the data of each business unit would reside in its own BigQuery project; see Figure 2.
Fig. 2. Target architecture, with multi-project setup for multiple BigQuery projects (BU = Business Unit).
A. Process and chronology
In the following, we describe the major elements of the implementation in chronological order so that it can serve as a guideline for those who want to replicate ADEO’s success.
1) Workload Analysis: At the start of the project, Datometry provided ADEO with an in-depth analysis of their workloads, database objects, and their utilization using Hyper-Q’s analysis mode. Equipped with a detailed report, ADEO had a thorough understanding of the workload and its potential challenges up-front [4].
2) Schema conversion: Based on the workload analysis, Hyper-Q generated the schema for BigQuery, comprising over 200,000 objects. Because of the enriched metadata (see Section III-C), the conversion produced a complete schema without any manual modifications. The schema was automatically segmented by business units based on the lineage derived from the workload.
3) Data Migration: The transfer of the content of the database, once the schema is installed, is outside the scope of DBV. We mention it here for completeness only, as it is an important step during a Hyper-Q implementation.
4) Application testing: Once a first snapshot of the content of the database was live on BigQuery, testing of the full reporting and BI workloads commenced. Hyper-Q gives users the unique ability to test the old and the new stack side by side. Because both stacks deliver bit-identical results, validating functionality and data correctness is highly accelerated; one common validation pattern is sketched below.
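As an illustration (in BigQuery syntax; the table name is hypothetical, and an analogous hash aggregate would run on the Teradata side), the following query computes a row count together with an order-independent content checksum for a table:

  SELECT COUNT(*) AS row_count,
         BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS content_checksum
  FROM retail.daily_sales AS t;

Matching counts and checksums on both stacks provide strong evidence that a table, and by extension the workload that produced it, behaves identically.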
5) Transition: The final moment of truth came with the cutover from the old to the new system. Because of the rigorous testing leading up to this moment, ADEO was confident they could make the switch without disrupting the business. IT was able to execute the final transition over a weekend as no significant changes were required to the existing reporting infrastructure. The cutover was completely seamless to the business units and not a single support ticket was filed because of the transition.
Fig. 3. Realized savings as per TCO analysis: 81% savings in migration cost, 50% savings in project time.
B. Multi-project Setup
The mapping of users and queries to multiple backend databases is another powerful example of the benefits of a run-time emulator. Recall that ADEO wanted to break up the monolithic legacy data warehouse and assign each business unit its own BigQuery instance.
By using Hyper-Q’s ability to remap object names on the fly, ADEO was able to preserve the original query text. Hyper-Q maps object names transparently from the 2-part name convention ⟨schema.object⟩, as used in Teradata, to a 3-part name of the form ⟨project.dataset.object⟩ in BigQuery.
To this end, Hyper-Q was configured to map the original names to the new names in BigQuery and also assign each user to a default project based on their business unit. The queries would then be routed to the respective BigQuery instance according to the newly introduced object names.
Because Hyper-Q performs the mapping, users could keep using the central Hyper-Q port and have their queries modified and routed fully transparently.
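For illustration (the project name is hypothetical), an application keeps submitting its original Teradata query unchanged:

  SELECT * FROM sales.orders WHERE order_date = CURRENT_DATE;

For a user assigned to, say, the project bu_france, Hyper-Q rewrites and routes the query as:

  SELECT * FROM bu_france.sales.orders WHERE order_date = CURRENT_DATE();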
C. Optimizations
Hyper-Q employs various optimization techniques when translating Teradata queries into BigQuery’s SQL dialect. One such optimization is materializing independent portions of a complex Teradata query in parallel. Hyper-Q implements a mechanism that derives smaller query chunks from the original query, chunks that have no dependencies on each other; BigQuery then allows pushing their results into tables concurrently. These tables are subsequently substituted into the original query, producing a simplified translation. Materializing portions of the original query in parallel reduces the amount of cloud resources consumed by the translated query and improves the overall query execution process.
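The following sketch illustrates the idea on a hypothetical query (all names, including the staging dataset, are assumptions of this sketch). Given a join of two independent aggregations,

  SELECT s.store_id, s.total, r.return_count
  FROM (SELECT store_id, SUM(amount) AS total
        FROM sales GROUP BY 1) s
  JOIN (SELECT store_id, COUNT(*) AS return_count
        FROM store_returns GROUP BY 1) r
    ON s.store_id = r.store_id;

the two subqueries can be materialized by concurrent BigQuery jobs:

  CREATE TABLE staging.chunk1 AS
    SELECT store_id, SUM(amount) AS total FROM sales GROUP BY 1;
  CREATE TABLE staging.chunk2 AS
    SELECT store_id, COUNT(*) AS return_count FROM store_returns GROUP BY 1;

The simplified final query then merely joins staging.chunk1 and staging.chunk2.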
V. Outcomes
ADEO successfully replaced its aging Teradata system with BigQuery. Given the high failure rates of conventional database migrations that use static conversion, this is a significant accomplishment. Better still, ADEO moved within its schedule, within budget, and with full fidelity. When it came time to switch over from the legacy appliance to BigQuery, ADEO moved all business users without any downtime, as IT shielded the business from any disruption. To the best of our knowledge, no other organization has migrated a Teradata system of this size and complexity faster or more cost-effectively than ADEO.
In a departure from the original monolithic architecture, ADEO uses multiple BigQuery projects, one for each business unit. The new design provides autonomy for the different departments, as highlighted in Section IV-B.
In terms of the annual run rate, ADEO realized considerable cost savings after replacing its legacy system with BigQuery. The cost savings gained from using DBV instead of migrating conventionally are equally impressive.
Above, we mentioned the rule of thumb that the switching cost is typically about ten times the annual database run rate. Assessing the total cost in the case of ADEO is more intricate than in other accounts, as ADEO did replace a considerable part of their ETL infrastructure in the process. However, based on a detailed joint analysis with the ADEO team, we determined that a full migration without Hyper-Q would have cost at least $7.5m.
With DBV, an entire component of the migration project, the consumption side, was eliminated. In its current configuration, we estimate ADEO was able to save approximately $5.5m in the process due to DBV. We also estimate that the use of DBV shortened the entire project by at least two years. Furthermore, considerable opportunity cost was saved by finishing the project in less than 25% of the time of a conventional migration.
In addition to the immediate monetary advantages of an accelerated migration, DBV also holds significant potential for future developments. For one, ADEO can automatically adjust their workloads to use new features and developments in BigQuery as they become available by simply upgrading to a new version of Hyper-Q.
ADEO is now also in a position to consolidate other systems onto BigQuery as well—either onto the same or new instances of BigQuery. This consolidation will allow them to break down silos and make data available to any application across the organization.
VI. Summary
In this paper, we presented the case study of ADEO, a multi-national retailer with global operations. We showed how DBV can mitigate the otherwise arduous task of migrating a sophisticated enterprise data warehouse. The benefits of DBV were a drastic reduction in both the time to value and the cost required to adopt a modern cloud data warehouse.
REFERENCES
[1] LumenData Inc., 2014. Data Migration: Put Your Data First or Your Migration Will Come Last. Whitepaper.
[2] F. M. Waas, 2023. Does AI Solve The Database Migration Problem? Forbes.
[3] Wikipedia, 2023. Pareto Principle.
[4] Datometry Inc., 2023. DTMY-1303.6: Global DIY Leader ADEO Transitions to Google BigQuery Using Datometry Hyper-Q.
[5] L. Antova, D. Bryant, T. Cao, M. Duller, M. A. Soliman, and F. M. Waas, 2018. Rapid Adoption of Cloud Data Warehouse Technology Using Datometry Hyper-Q. In Proc. of the 2018 ACM SIGMOD Int’l Conf. on Management of Data. ACM, New York, NY, USA, 825–839.
[6] M. A. Soliman, L. Antova, M. Sugiyama, M. Duller, A. Aleyasen, G. Mitra, E. Abdelhamid, M. Morcos, M. Gage, D. Korablev, and F. M. Waas, 2020. A Framework for Emulating Database Operations in Cloud Data Warehouses. In Proc. of the 2020 ACM SIGMOD Int’l Conf. on Management of Data. ACM, New York, NY, USA, 1447–1461.
[7] E. Abdelhamid, N. Tsikoudis, M. Duller, M. Sugiyama, N. Marino, and F. M. Waas, 2023. Adaptive Real-time Virtualization of Legacy ETL Pipelines in Cloud Data Warehouses. In Proc. of the 26th Int’l Conf. on Extending Database Technology, EDBT, Ioannina, Greece, 765–772.
[8] A. Aleyasen, M. Morcos, L. Antova, M. Sugiyama, D. Korablev, J. Patvarczki, R. Mutreja, M. Duller, F. M. Waas, and M. Winslett, 2022. Intelligent Automated Workload Analysis for Database Replatforming. In Proc. of the 2022 ACM SIGMOD Int’l Conf. on Management of Data. ACM, New York, NY, USA, 2273–2285.