
Improving the Scalability, Flexibility, and Performance of an ETL Framework for Pharmaceutical Sales and Marketing

When a leading pharmaceutical sales and marketing company needed to replace its costly extract, transform, and load (ETL) data processing, InfoCepts created a metadata-driven framework with configurable metadata schema and shell scripts. The new ETL solution is cost effective, scalable, easy to use, flexible, and high performing.


How to Ensure a Successful Data Migration: 6 Critical Validation Tests

Rupesh Katkamwar, April 15, 2016

In our last blog post, we explored the planning and strategy steps necessary to prepare for a large data migration. In this post, we continue the discussion by looking at another important part of the migration process—validation testing.

One of the biggest challenges in a data migration project is validating thousands of tables containing billions of records within defined timelines while also achieving the desired test coverage. In fact, Gartner has reported that more than 50 percent of data migration projects will exceed budget and/or result in some form of business disruption due to flawed execution.[1]

During data migration there is always a possibility of missing data or data corruption. Therefore, it is necessary to test whether the whole data set migrated successfully, looking at both historical and incremental data migration. Incremental loading, in particular, can be challenging because each database has its own structure. This makes it important to test that the fields and jobs are loaded correctly and that files are not corrupted.

Ensuring that both historical and incremental data is successfully migrated requires running several validation tests, including:

Schema Validation
During data migration it is necessary to validate the schema, including objects such as stored procedures, views, and user preferences. However, as previously discussed, with thousands of tables, manually validating the schema is too time-consuming. Therefore, it is necessary to automate the process by writing a script that can reveal any mismatches between the source and the target. By automating this process, you can export the target metadata into a final table where it can be compared with the source metadata to identify any mismatches prior to run time.
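
Below is a minimal sketch of such a comparison, assuming the source and target metadata have already been exported to CSV files; the file names and column headers are illustrative assumptions, not a prescribed format.

```python
# Minimal sketch: compare exported source and target metadata to find schema mismatches.
# Assumes metadata has been exported to CSV as (table_name, column_name, data_type)
# from each database's catalog views; file names and headers are illustrative.
import csv

def load_metadata(path):
    """Load exported metadata into a dict keyed by (table, column)."""
    with open(path, newline="") as f:
        return {(r["table_name"], r["column_name"]): r["data_type"] for r in csv.DictReader(f)}

def compare_schemas(source_path, target_path):
    source, target = load_metadata(source_path), load_metadata(target_path)
    missing = sorted(set(source) - set(target))            # columns absent in the target
    extra = sorted(set(target) - set(source))               # columns only in the target
    type_mismatches = [(k, source[k], target[k])             # same column, different type
                       for k in set(source) & set(target) if source[k] != target[k]]
    return missing, extra, type_mismatches

if __name__ == "__main__":
    missing, extra, mismatched = compare_schemas("source_metadata.csv", "target_metadata.csv")
    print(f"Missing in target: {len(missing)}, extra in target: {len(extra)}, "
          f"type mismatches: {len(mismatched)}")
```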

Cell-by-Cell Comparison using QuerySurge
Testing at this level of detail guarantees full validation of the data, helping to avoid the time-consuming and expensive data quality issues often discovered after a data migration. Automation is also essential for this type of testing and can be done using QuerySurge (see How to Ensure a Successful Data Migration: Assessment and Automation Strategies).

First, QuerySurge has enough database capacity to store intermediate results, allowing it to join and compare data from the source and target databases throughout the testing process. Second, QuerySurge provides detailed comparison results, highlighting data mismatches.

The one limitation of QuerySurge is that it does not pinpoint exactly where a data mismatch is located, such as which column or cell it is in. This issue is easily overcome by writing a utility on top of the QuerySurge tool, which allows you to automatically and quickly detect the specific location of data mismatches.
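
As an illustration of the idea (this is not part of the QuerySurge API), the sketch below post-processes result sets that have been exported to CSV and reports the row key, column, and differing values; the file names and key column are assumptions.

```python
# Minimal sketch of a post-processing utility that pinpoints mismatches down to the
# row key, column, and differing values. It works on result sets exported to CSV;
# file names and the key column are illustrative, and this is not QuerySurge code.
import csv

def load_rows(path, key_column):
    with open(path, newline="") as f:
        return {row[key_column]: row for row in csv.DictReader(f)}

def locate_mismatches(source_path, target_path, key_column):
    source, target = load_rows(source_path, key_column), load_rows(target_path, key_column)
    for key in sorted(set(source) & set(target)):      # rows present on both sides
        for column in source[key]:
            if source[key][column] != target[key].get(column):
                yield key, column, source[key][column], target[key].get(column)

if __name__ == "__main__":
    for key, column, src_val, tgt_val in locate_mismatches("source_rows.csv",
                                                           "target_rows.csv", "order_id"):
        print(f"Row {key}, column {column}: source={src_val!r}, target={tgt_val!r}")
```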

Reconciliation Checks
One of the most important aspects of validation is performing reconciliation checks on the source and target databases for all columns. This ensures that the data is not corrupted, date formats are maintained, and that the data is completely loaded.

As previously stated, given the size of the data being migrated, a manual process is not feasible. Additionally, manual reconciliation of missing columns can result in columns being removed, which changes the structure of the data. For these reasons, automation is once again required.

To automate the process, use a Unix shell script: create a list of the tables to be tested in a text file and pass that file to the script as a parameter. As the script scans each table, it picks up the numeric columns and generates SQL queries from them. It then executes those queries and stores the results in a file. Perform these steps on both the source database and the target database. Once finished, compare the two result files to identify any mismatches.
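
The post describes this as a shell script; the sketch below shows the same query-generation logic in Python, with the file names and the list of numeric types as illustrative assumptions.

```python
# Minimal Python sketch of the reconciliation logic the shell script performs:
# read a list of tables, look up their numeric columns, and emit aggregate SQL
# to run on both the source and target databases. File names, the numeric-type
# list, and the output file are illustrative assumptions.
import csv

NUMERIC_TYPES = {"int", "integer", "bigint", "decimal", "numeric", "float", "double"}

def numeric_columns(metadata_path):
    """Map each table to its numeric columns, based on exported metadata."""
    columns = {}
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["data_type"].lower() in NUMERIC_TYPES:
                columns.setdefault(row["table_name"], []).append(row["column_name"])
    return columns

def build_reconciliation_sql(table_list_path, metadata_path):
    columns = numeric_columns(metadata_path)
    with open(table_list_path) as f:
        tables = [line.strip() for line in f if line.strip()]
    for table in tables:
        aggregates = ", ".join(f"SUM({c}) AS sum_{c}" for c in columns.get(table, []))
        select_list = "COUNT(*) AS row_count" + (f", {aggregates}" if aggregates else "")
        yield f"SELECT '{table}' AS table_name, {select_list} FROM {table};"

if __name__ == "__main__":
    # Run the generated statements on both databases, then diff the two result files.
    with open("reconciliation_queries.sql", "w") as out:
        for statement in build_reconciliation_sql("tables.txt", "source_metadata.csv"):
            out.write(statement + "\n")
```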

NULL Validation
NULL validation is necessary to ensure that NULL is not accidentally inserted in place of valid data. Automating this process is similar to other validation testing and requires writing an automation script to validate the count of NULL and non-NULL values present in the same columns between the source and target databases.
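
A minimal sketch of such a script, assuming column metadata has been exported to a CSV file (the file name and headers are illustrative), generates the per-column count queries to run against each database so the counts can be compared.

```python
# Minimal sketch: generate per-column NULL / non-NULL count queries to run against
# both databases, so the counts can be compared column by column. The metadata file
# name and headers are illustrative assumptions.
import csv

def null_count_sql(metadata_path):
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            table, column = row["table_name"], row["column_name"]
            yield (f"SELECT '{table}' AS table_name, '{column}' AS column_name, "
                   f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_count, "
                   f"SUM(CASE WHEN {column} IS NOT NULL THEN 1 ELSE 0 END) AS not_null_count "
                   f"FROM {table};")

if __name__ == "__main__":
    for statement in null_count_sql("source_metadata.csv"):
        print(statement)
```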

Ad Hoc Testing
The final step in the testing process, ad hoc testing ensures that the target business application can successfully process the newly migrated data. Ad hoc testing may, therefore, include running a business query supplied by the customer or examining important data or frequently used tables to ensure that the source and target data match and the data migrated successfully.

Non-Functional Testing
One should also include non-functional testing of architectural components at different testing levels to validate system operations, including:

  • Security Validation – Security testing helps determine whether all the security user roles have successfully migrated to the new database to prevent unauthorized access to the system data. It includes the validation of tables, views, stored procedures, and functions.
  • Performance or Benchmark Testing – Benchmark testing helps identify whether new or modified architectural components have any impact on fetching data on the new system. It compares the workflow/job execution time in parallel loading, the complex query execution time, and the execution time of large table queries on the source and target databases. If possible, the best way to make comparisons is to run a report data query on both databases and then compare the query execution times, as in the sketch after this list.
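
The benchmark sketch below times the same report query on the source and target systems; the query runners and the report SQL are placeholders for whatever client call executes a query on each database.

```python
# Minimal benchmark sketch: time the same report query on the source and target
# databases and compare the results. run_on_source / run_on_target are placeholders
# for whatever client call executes a query on each system (illustrative assumption).
import time

def timed(run_query, sql):
    """Execute a query via the supplied callable and return elapsed seconds."""
    start = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - start

def compare_benchmark(run_on_source, run_on_target, sql, runs=3):
    source_times = [timed(run_on_source, sql) for _ in range(runs)]
    target_times = [timed(run_on_target, sql) for _ in range(runs)]
    return min(source_times), min(target_times)   # best-of-N reduces noise

if __name__ == "__main__":
    report_sql = "SELECT region, SUM(sales) FROM fact_sales GROUP BY region"   # illustrative
    src, tgt = compare_benchmark(lambda q: None, lambda q: None, report_sql)   # stub runners
    print(f"source: {src:.3f}s, target: {tgt:.3f}s")
```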

Following these strategies and testing procedures during your data migration project will help ensure that the migration goes smoothly. It not only helps eliminate business disruptions due to flawed execution, but also saves time over manual processes and keeps the entire project on its timeline, thereby avoiding expensive cost overruns.

However, even with these guidelines, external expert advice is recommended. Large data migrations are infrequent events and complex in nature, making solid expertise and guidance a critical component of success. If you would like to speak further with our InfoCepts team about your data migration project and how we can help guide the process, please get in touch.

 


[1] “Legacy Data Migration is a High-Risk Project – Be Prepared!,” Mei Selvage, Gartner, Dec 2014.

How to Ensure a Successful Data Migration: Assessment and Automation Strategies

Rupesh Katkamwar, April 5, 2016

With a number of emerging big data platforms now out on the market, many companies are looking to migrate their data from their current relational database management platform to a big data storage structure. However, large data migration projects are inherently complex.

While there are many reasons data migration projects run into trouble, our own experience has taught us that defining an effective and efficient testing strategy is critical to a project’s overall success. In this two-part post, we examine what a solid planning and testing strategy should include to ensure success and to avoid the most frequent failures that occur during data migration.

In this post, we focus on preparing for data migration through proper planning:

Conducting a Platform Difference Assessment
The core aspect of a data migration project is the accurate transfer of a source database to a target database. To ensure that migrated data is accurately represented in the target system, it is necessary to first assess where there may be discrepancies in the values/records of each column/table in the source and target databases. This analysis should include identifying any differences in the data types used, in time and date formats, and in how control characters are displayed. It should also assess whether there are issues with NULL, BLANK, and SPACE usage between the source and target databases.
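
A minimal sketch of such a check is shown below, assuming a documented type mapping agreed during the assessment; the mapping and column names are illustrative examples, not a universal standard.

```python
# Minimal sketch of a platform-difference check: verify that each source data type
# maps to the agreed target type, and flag anything that falls outside the documented
# mapping. The mapping and sample columns are illustrative assumptions.
EXPECTED_TYPE_MAPPING = {
    "number": "decimal",
    "varchar2": "string",
    "date": "timestamp",
}

def assess_type_differences(source_columns, target_columns):
    """source_columns / target_columns map (table, column) -> data type."""
    for key, source_type in source_columns.items():
        target_type = target_columns.get(key)
        expected = EXPECTED_TYPE_MAPPING.get(source_type.lower())
        if target_type is None:
            yield key, source_type, None, "missing in target"
        elif expected and target_type.lower() != expected:
            yield key, source_type, target_type, f"expected {expected}"

if __name__ == "__main__":
    source = {("sales", "amount"): "NUMBER", ("sales", "sold_on"): "DATE"}
    target = {("sales", "amount"): "decimal", ("sales", "sold_on"): "string"}
    for key, src, tgt, issue in assess_type_differences(source, target):
        print(f"{key}: source={src}, target={tgt} ({issue})")
```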

Identified differences should be documented along with the appropriate procedures for addressing these inconsistencies and approved by all key stakeholders before proceeding. Additionally, conducting an assessment is important for identifying the exact scope of the migration. The assessment should cover both historical data migration and incremental data migration handled by ETL jobs.

The assessment phase should also include an evaluation of what testing tool is most appropriate, including assessing a tool’s compatibility with both the source and target database and how the tool will be integrated. In a large database migration, you need a tool that is robust enough and has a large enough database capacity to handle the database comparisons. Many tools may work fine for small tables, but when comparing extremely large tables in one go, less robust tools can take a tremendous amount of time to analyze the data or can get hung up and stop working.

Use Automation Utilities
One of the biggest challenges in any data migration process, and particularly in a large data migration project, is validating the source data against the target data. Because a data migration will involve thousands of tables and hundreds of gigabytes (GB) of data, manual validation is highly inefficient and in many cases not realistic.

While sample testing and manual verification are possible, they still leave large amounts of the new database untested. Therefore, automating the testing process is the best solution. To automate the process, test scripts are written that compare and validate the data present in both the legacy and new databases. While there are several tools on the market that can help with the automation process, QuerySurge is one tool we have found to be highly successful in automating the validation process.

QuerySurge is a collaborative data testing solution built specifically to automate the testing of big data and data warehouses while providing a holistic view of the data's health. QuerySurge not only ensures that the data extracted from sources remains intact in the target system by analyzing and pinpointing up to 100 percent of all data differences quickly, but also provides real-time and historical views of the data's health over time.

Whatever method you determine is best, automation is critical for running validation tests on large swaths of data. Ultimately, automation results in fewer errors throughout the validation testing process than manual small-batch data testing.

To learn more about the validation types that are necessary and how to perform these tests through automated processes, stay tuned for our next post on data migration.

Achieve Real-Time Data Analysis with a Virtual Data Warehouse

Chetan Dhapudkar, January 19, 2016

In today’s digital environment, the success of big data initiatives hinges not on simply collecting large volumes of data, but on a business’ ability to analyze the data and act on its insights. Companies with near real-time analytic capabilities can decipher information faster, find actionable results, and increase their overall competitiveness.

 

Yet, despite the critical influence real-time data analysis can have on a company’s competitive advantage, many businesses still struggle to increase the velocity of their data. According to a PwC survey highlighted in CIO.com, 62 percent of respondents believe big data can give them a competitive advantage; however, 58 percent agreed that moving from data to insight is a major challenge.

 

The Challenge of Real-Time Analytics

Why is it so hard for businesses to gain near real-time analysis? In the traditional database model, data is stored and indexed and then subsequently processed by queries, or batch processes. These queries are typically processed on a schedule—one to two times a day. Even when done frequently, there is significant lag time between the start and completion of the process.

For most companies, the only way to get near real-time data is to develop a custom solution. However, given the expense involved in maintaining custom code over time, as well as limitations on the functionality of the solution, most companies will find this approach not viable in the long term.

Still, sticking with the status quo and doing nothing is not a good alternative—companies that do so risk losing their competitive edge.

 

Near Real-Time Analytics Achievable with a Virtual Database

To achieve near real-time analysis of data without the need for expensive custom coding, organizations should consider developing a virtual data warehouse using a tool like Informatica. This solution is compatible with any database or BI reporting tool such as MicroStrategy and is simple for anyone with database or ETL experience to use.

Here’s a look at how it works:

  • A virtual data warehouse allows the real-time processing of queries to distributed data sources.
  • Data is then made available in the form of virtual tables to facilitate real-time reporting.
  • Data checks are temporary and done on the fly.

Because this solution doesn’t require replication of data to a physical database, the processing time that is required in traditional batch processing or ETL is eliminated and businesses can achieve near real-time data and analysis. Additionally, a virtual data warehouse can work even in complicated data environments. In one use case, a virtual data warehouse is providing near real-time analysis with data in a multi-tenant environment involving 140 different time zones.
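
The sketch below is a conceptual illustration of this behavior, not Informatica code: a virtual table queries its distributed sources at request time and returns a combined result without materializing anything. The source callables are placeholders for live connections.

```python
# Conceptual sketch of what a virtual table does: federate queries across distributed
# sources at request time and return a combined result without materializing the data.
# The source callables are placeholders; a real implementation would be configured in
# a data virtualization tool such as Informatica rather than hand-coded.
from typing import Callable, Iterable

class VirtualTable:
    """Presents several physical sources as one logical table, queried on demand."""
    def __init__(self, sources: list[Callable[[str], Iterable[dict]]]):
        self.sources = sources

    def query(self, predicate: str) -> list[dict]:
        rows = []
        for fetch in self.sources:           # each source is queried at read time
            rows.extend(fetch(predicate))    # no copy of the data is stored anywhere
        return rows

if __name__ == "__main__":
    east = lambda where: [{"region": "east", "sales": 120}]   # stand-ins for live sources
    west = lambda where: [{"region": "west", "sales": 95}]
    sales = VirtualTable([east, west])
    print(sales.query("sale_date = CURRENT_DATE"))
```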

 

Benefits Beyond Real-Time Analysis

As already touched on, near real-time analysis has strong business benefits such as speeding decision-making and improving competitiveness. And for those reasons alone, a strong business case can be made that it’s worth investing in a solution that can provide near real-time analysis of data as needed.

 

But, there are other benefits that add additional value. These include:

  • Eliminates batch processing time – Because all real-time data is made available via data services, there is no additional time spent in ETL or batch processing execution, which typically has a long lag time and requires maintenance and follow-up.
  • Reusability – The virtual tables designed in data services are used to load data for historical reports, so there is no need to build additional ETL functionality for historical data.
  • Ease of coding – With Informatica, all the data can be easily read, transformed, and made available for reports without the need to write complex MapReduce code. The Developer tool also has an easy-to-use user interface (UI).
  • Maintenance and monitoring – Using a web-based Admin console, monitoring and maintenance is simple.
  • Scalability – Using an SOA-based architecture in Informatica, it’s easy to scale the infrastructure by adding nodes to the existing domain when needed.

From better understanding the customer experience to detecting fraud or security threats to tapping into the IoT to gain insights into how your organization is performing, near real-time analytics is applicable to a host of use cases that improve competitiveness and operational efficiency and drive new revenue streams.

If your organization doesn’t have near real-time analytics capability and you are wondering how it might apply to your business, let’s schedule time to talk.

Automate Your Dynamic Mappings Rule Management System

Manish Sengar, November 17, 2015

Information is a critical business asset, but only when it’s extracted and utilized effectively. With the advent of the digital revolution and adoption of the Internet of Things (IoT), businesses face increasing challenges with information. By far, the biggest hurdles are:

  • Volume – The amount of data that can be effectively collected and analyzed.
  • Velocity – The rate at which data should be ingested and processed.
  • Variety – Types of data that should be collected.
  • Veracity – Trustworthiness of the data.

In one way or another, all of these challenges relate to data quality, a significant concern right now that can potentially affect the success (or failure) of your company. As Ted Friedman and Michael Smith wrote in a Gartner report, “Poor data quality is a primary reason for 40 percent of all business initiatives failing to achieve their targeted benefits.” If your data quality suffers, you can expect even bigger challenges that affect the operational side of your business (meaning a significant correction effort is needed), as well as your company’s overall viability (if not detected and corrected in time, incorrect knowledge and potentially bad business decisions can ensue).

Where do data quality problems come from?

Data quality issues can arise at any stage or part of your business intelligence system. These include:

  • Data modelling — Incorrect representation of the business model.
  • Data sources — Issues with data provided by upstream source systems.
  • Data integration and profiling — Problems with data scenarios analyzed and the design of processes.

In addition, one of the most frequently encountered data quality issues arises at the ETL stage — specifically, when ETL implementation is not in sync with the current business definition of data mappings. This issue typically occurs when a business changes a large number of data mappings often, as its requirements dictate (e.g., an inventory group mapping for digital advertisers or an item group mapping for retailers). It can occur alongside other data quality issues or in isolation, even when no problems exist with the source data or ETL processes.

Most often, data mappings are defined and maintained manually by businesses, and then passed on to IT to be implemented in the business intelligence system, using ad-hoc ETL processes. In many cases, organizations do not prioritize this process as much as needed, relying instead on manual maintenance. This, in turn, leads to these challenges:

  • A higher risk of human errors due to manual efforts.
  • A longer turnaround time.
  • The potential for inter-team conflict (e.g., between business and IT).
  • Lower overall confidence in the final BI content.

What can help?
At InfoCepts, our team came up with a solution known as Dynamic Mappings Rule Management (DMRM). Essentially, DMRM is a modular, customizable system that can be easily used by business users to create and maintain data mappings for data transformations, calculations, and clean up — with minimal IT intervention. Components include:

  • A user interface that allows users to:
    •  Create new mapping rules.
    •  Modify existing mapping rules.
    •  Delete existing mapping rules.
    •  Restore previous versions of mapping rules.

This can be implemented in the technology of choice, whether a Java-based web UI or a MicroStrategy transaction report.

  • A DMRM schema that involves a set of backend tables to store user changes and help generate dynamic queries.
  • A DMRM engine, which serves as the core component of the solution. Currently, the engine is implemented in Informatica as an automated workflow that analyzes the DMRM schema, generates dynamic SQL queries, and applies changes to target tables by executing those queries against the data warehouse. It can also be implemented in an alternate ETL tool of choice; a sketch of the core idea follows this list.
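
The sketch below illustrates in miniature what the engine does, turning pending mapping rules into SQL applied to the warehouse; the table and column names are illustrative, not the actual DMRM schema.

```python
# Minimal sketch of what the DMRM engine does: read pending mapping rules from the
# rule table and turn them into SQL that applies the new mappings to the warehouse.
# Table and column names here are illustrative assumptions, not the actual DMRM schema.
def generate_mapping_sql(rules):
    """rules: iterable of dicts with target_table, target_column, new_value, key_column, key_value."""
    for rule in rules:
        yield (f"UPDATE {rule['target_table']} "
               f"SET {rule['target_column']} = '{rule['new_value']}' "
               f"WHERE {rule['key_column']} = '{rule['key_value']}';")

if __name__ == "__main__":
    pending_rules = [  # in practice these rows would be read from the DMRM schema tables
        {"target_table": "dim_item", "target_column": "item_group",
         "new_value": "Outdoor", "key_column": "item_code", "key_value": "SKU-1042"},
    ]
    for statement in generate_mapping_sql(pending_rules):
        print(statement)   # the engine would execute these against the data warehouse
```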

Key highlights of DMRM
Our team integrated a number of features to make DMRM highly effective, such as:

  • A customizable design. Since the solution uses an implementation-agnostic schema design and dynamic SQL queries, it can be customized to meet the needs of a variety of scenarios. To date, we have implemented it successfully for media and retail clients, with minimal changes from one implementation to another.
  • A technology-independent modular design. Due to the solution’s modular design, minimal changes are needed for new implementations. For instance, the existing UI (or existing MicroStrategy set up) can be used, while tables in the schema design can be implemented in any database, using any modelling. In addition, the DMRM engine can be implemented in shell scripting or any other ETL tool.

How it’s used, what benefits it brings
DMRM is a particularly effective solution for organizations with a business model that requires a large number of frequently changing data mappings. Likewise, it can help companies that require faster access to data with modified mappings, and that currently use ad-hoc manual processes to handle data mapping changes.

With DMRM in place, companies can gain:

  • Significantly improved data quality due to zero manual intervention.
  • Faster turnaround time due to automated processing of requested mapping changes.
  • Better coordination between the business and IT teams, given that the ownership of data quality (accuracy and timeliness) moves to the business.

DMRM enables users to define their own rules and experiment with the resulting mapping changes without going back to IT. It also eliminates the need to write complicated Excel formulas and transformations to derive and merge data from multiple sources. Ultimately, this leads to increased business confidence in the information and analysis delivered in BI reports, along with increased end user satisfaction — a primary goal of any BI team.

Limitations
Due to its customizable and modular design, DMRM can be modified and implemented with minimal effort. There are, however, a few considerations to keep in mind. These include:

  • A UI design that is specific to an organization’s requirements, which means the actual effort to implement can’t be estimated before requirements are available.
  • A shift in the ownership of data quality (accuracy and timeliness) to the business side of your company, which may require process changes to ensure a smooth adoption.
  • Additional awareness or training, given that your data quality is only as good as the changes made by business users.

Get in touch to learn more about DMRM — and to find out if it’s the right solution for you.

Why Automating Pass-Through Mapping Cuts Costs and Time

Chetan Dhapudkar, October 29, 2015

Historically, pass-through mapping has been a manual process. But for companies with a large volume of data in diverse formats, manual pass-through mapping is a time-consuming, costly process that is also prone to errors. Consider, for example, an international retail chain with hundreds of stores worldwide and point of sale data saved in multiple servers. The company needs to integrate the data and create a central repository, but with close to 200 tables of data in various formats, it would take a small army of developers to map the data manually.

Issues like these make the manual approach to mapping a challenge for many enterprise-level businesses. Yet the large majority (by my own estimate, 90 percent) rely on manual mapping, regardless of the volume of data and timeline of the project. By automating the process, however, your company can achieve higher quality data in significantly less time and for less cost.

How Pass-Through Mapping Automation Works
While data integration tools do not offer a simple method to automate pass-through mapping, automation is possible. It requires a deep understanding of the operating principles of the data integration tool, along with the ability to understand and replicate the XML code that the tool creates. Through code replication, one can develop a code-generating utility within the tool.

How does this work? The code generation utility created through this process facilitates the synchronization of XML code and can generate the XML code for each table quickly — typically in one-tenth of the time it takes to do manually, or less. Additionally, the code enforces a uniform style throughout the mapping process, eliminating human errors such as forgetting to map a certain row or column.
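
As a schematic illustration, the sketch below builds a pass-through mapping definition as XML from table metadata. The XML layout is an assumption for readability; a real utility would replicate the exact export format of the data integration tool (Informatica, DataStage, or Talend).

```python
# Minimal sketch of a code-generating utility: build a pass-through mapping definition
# as XML from table metadata. The XML layout here is schematic and illustrative; a real
# utility would replicate the exact export format of the data integration tool.
import xml.etree.ElementTree as ET

def build_passthrough_mapping(table_name, columns):
    mapping = ET.Element("mapping", name=f"m_passthrough_{table_name}")
    source = ET.SubElement(mapping, "source", table=table_name)
    target = ET.SubElement(mapping, "target", table=table_name)
    for column in columns:                       # one straight source-to-target link per column
        ET.SubElement(source, "field", name=column)
        ET.SubElement(target, "field", name=column)
        ET.SubElement(mapping, "connector", fromfield=column, tofield=column)
    return ET.tostring(mapping, encoding="unicode")

if __name__ == "__main__":
    print(build_passthrough_mapping("customer", ["customer_id", "name", "created_on"]))
```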

Finally, because most of the leading data integration tools, such as Informatica, DataStage, and Talend, operate using metadata, the same automation process can be adopted and deployed across a large number of data integration tools and for a variety of data mapping processes (not just pass-through mapping).

Benefits of Automation
Automating the pass-through mapping process offers a number of benefits and eliminates many of the challenges of manual mapping. These include:

  • Time savings. Manual efforts, particularly for large volume or time-sensitive pass-through mapping projects, can require significant resources and manpower to complete in a timely manner. It simply may not be possible to “scale up” temporary resources for these types of projects. And even if it is, the management challenges of overseeing a large team of temporary developers may prove unruly.
    Done manually, a single mapping takes roughly four hours to create. With automation, however, the process takes around 5 minutes, meaning you could end up creating 100 mappings in a day, once the automation principle is in place.
  • Cost savings. The costs of large-scale mapping projects run high and may even be prohibitive. Say, for instance, you need to create 200 mappings in a short time period. Since each mapping requires four hours of a developer’s time, you would need to pay a team of developers for 800 hours to complete the project. On top of developer compensation, you would need to factor in recruiting, hiring, and management costs — all of which add up.
    Pre-made automation tools do exist, but they’re also expensive and still require a developer to run the process. Automation within the data integration tool itself, as described above, can therefore provide significant cost savings.
  • Better data. With manual mapping, if someone misses a column or imports a table and accidentally gives it the wrong name (e.g., T3495 instead of T394), significant consequences will ensue. Automation ensures uniform processes, which eliminates human error and produces higher quality data. Additionally, companies can avoid spending the time or resources required to find and fix human errors.

Automating pass-through mapping within the data integration tool isn’t easy and comes with plenty of challenges. It’s also a new endeavor that most developers are still struggling to figure out and successfully achieve. But for companies with large volume or time-sensitive pass-through mapping projects, it’s by far your best, most cost-efficient bet.

At InfoCepts, we’ve developed our own code to automate pass-through mapping — and could help you, too. Reach out to learn how.

Creating an Efficient, Error-Free MicroStrategy Environment by Scrubbing Old Metadata

A major specialty retailer needed a faster-performing, less error-prone MicroStrategy environment to run complex BI reports. To help, we scrubbed the old metadata and created a system that delivers critical, reliable business information in a timely manner.
