If a data warehouse were a mechanical machine, the ETL process would provide the belts, pulleys, and gears that make it work, turning the machine's inputs into outputs. ETL stands for Extraction, Transformation, and Loading. Simply put, ETL gets transactional business data from the OLTP data stores, transforms the data as required to meet the needs of its new business purpose, and loads it into the data warehouse in a new form optimized for business analytical processing.
Extraction is the process of connecting to remote data sources and, in some cases, establishing live data connections for immediate update of the data warehouse; more commonly, though, extraction is planned as batch processing of periodic updates. Remote data sources come in many shapes and sizes: your own business transactional data from customers, order processing, material requirements planning, manufacturing, and more; collaborative supply partners up and down your supply chain; and external services such as social media platforms, business web pages, and Google Analytics. Batch processing of the data extraction, instead of collecting via a live stream connection, minimizes the load on both the sending and receiving computers and allows the automated extraction to run during server off-peak hours. Software engineers often write scripts against Application Programming Interfaces (APIs) in common programming languages like Python, Java, C++, or PHP, executed as scheduled tasks or cron jobs on local server platforms. A solid understanding of APIs is critical to the successful collection of data from remote servers.
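As a minimal sketch of such an extraction script, consider the following Python example. The endpoint, field names, and staging format are all hypothetical (in production the JSON payload would come from an HTTP call to the source system's API, and the script would run as a scheduled off-peak cron job); here the response is simulated so the logic is self-contained.

```python
import io
import json

# Hypothetical API response: in production this string would come from the
# source system's REST API (e.g. via urllib.request.urlopen), fetched by a
# cron job during server off-peak hours.
SAMPLE_RESPONSE = json.dumps({
    "orders": [
        {"order_id": 1001, "customer": "Acme", "total": 250.0},
        {"order_id": 1002, "customer": "Globex", "total": 99.5},
    ]
})

def extract_orders(raw_response: str) -> list:
    """Parse an API response and return the order records to be staged."""
    payload = json.loads(raw_response)
    return payload["orders"]

def stage_records(records: list, out) -> int:
    """Write records as newline-delimited JSON, a common staging format
    for later transformation; returns the number of records staged."""
    for rec in records:
        out.write(json.dumps(rec) + "\n")
    return len(records)

if __name__ == "__main__":
    buf = io.StringIO()  # stands in for a staging file on disk
    n = stage_records(extract_orders(SAMPLE_RESPONSE), buf)
    print(f"staged {n} records")
```

Separating the parse step from the staging step keeps the script easy to test and to re-point at a different source system.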
Generally, the data collected via simple extraction is not in a format optimized for analytical purposes. Some collected textual data must be transformed for loading into a SQL or NoSQL server; likewise, some collected SQL data must be transformed into textual data. Data collected during the extraction process may also be too extensive, and can be "filtered" to save critical processing time and storage. It is also common that SQL tables which have been "normalized" for use in an OLTP environment must now be "denormalized" for optimal use in an OLAP environment. These transformation processes are often referred to as "data wrangling" or "data munging". Again, software engineers are commonly asked to write server-based scripts that execute this transformation automatically. As with the other ETL steps, data transformation is often planned and scheduled in order to balance the working load on the data warehouse.
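The denormalization step can be illustrated with a small sketch using Python's built-in sqlite3 module. The table and column names are invented for the example: two normalized OLTP-style tables are joined once into a flat table so that later analytical queries need no joins.

```python
import sqlite3

# Illustrative schema: normalized OLTP tables (customers, orders) joined
# into one denormalized fact table for OLAP-style queries.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            name TEXT, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'West'), (2, 'Globex', 'East');
    INSERT INTO orders VALUES (1001, 1, 250.0), (1002, 2, 99.5),
                              (1003, 1, 40.0);

    -- Denormalized table: customer attributes are repeated on every order
    -- row, trading storage for analytical query speed.
    CREATE TABLE order_facts AS
        SELECT o.order_id, c.name AS customer_name, c.region, o.total
        FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
""")

# An analytical query against the flat table needs no join at all.
rows = conn.execute(
    "SELECT region, SUM(total) FROM order_facts "
    "GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 99.5), ('West', 290.0)]
```

The same pattern scales up: in a real warehouse the `CREATE TABLE ... AS SELECT` would run as part of the scheduled transformation job, not interactively.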
Finally, the data must be loaded. Once data has been transformed to meet the needs of its new analytical purpose, whether it ends up in a database, a text file, a BLOB, or otherwise, it must be loaded via an automated task. Loading data into the data warehouse requires careful planning for file size, data transfer time, query speed for subsequent analysis, and more. A data warehouse architect must plan for scalability: incoming data can surpass the capacity of the data warehouse quickly if not managed carefully, and data with only a short life-cycle inside the data warehouse must be scheduled for elimination or archival. Data integrity and security are also of vital importance. At the heart of the process of building a data warehouse is the ETL process. A great deal of data is available to all of us as business managers, and it must be strategically transformed to meet the unique needs of each of our businesses.
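Two of the loading concerns above, batching the inserts and retiring short-lived data, can be sketched as follows. This is an assumption-laden illustration, not a specific warehouse's API: the tables, the chunk size, and the retention cutoff are all invented, and sqlite3 stands in for the warehouse database.

```python
import sqlite3
from itertools import islice

def chunked(iterable, size):
    """Yield successive batches; chunking bounds memory use and
    transaction size during the load."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

conn = sqlite3.connect(":memory:")  # stands in for the warehouse database
conn.execute("CREATE TABLE sales_fact (day INTEGER, amount REAL)")
conn.execute("CREATE TABLE sales_archive (day INTEGER, amount REAL)")

# Hypothetical transformed records ready for loading: (day, amount).
rows = [(d, float(d * 10)) for d in range(1, 11)]
for batch in chunked(rows, 4):
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", batch)
conn.commit()

# Retention policy: rows older than the cutoff are moved to the archive
# table rather than left to grow the fact table without bound.
CUTOFF = 5
conn.execute(
    "INSERT INTO sales_archive SELECT * FROM sales_fact WHERE day < ?",
    (CUTOFF,))
conn.execute("DELETE FROM sales_fact WHERE day < ?", (CUTOFF,))
conn.commit()

live = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM sales_archive").fetchone()[0]
print(live, archived)  # 6 4
```

In practice the archive-then-delete pair would run inside one transaction on a schedule, so the fact table stays sized to the queries it must serve.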